Instruction Following Evaluator

The InstructionFollowingEvaluator assesses whether an agent’s response follows all explicit instructions provided in the user’s prompt. It focuses strictly on instruction compliance — whether specific constraints, requirements, and directives were satisfied — regardless of response quality or factual accuracy.

  • Trace-Level Evaluation: Evaluates the most recent turn in the conversation
  • Binary Scoring: Clear Yes (instructions followed) / No (instructions not followed) classification
  • Async Support: Supports both synchronous and asynchronous evaluation
  • Constraint-Focused: Evaluates compliance with explicit directives, not overall quality

Use the InstructionFollowingEvaluator when you need to:

  • Verify that agents respect format, length, or style constraints
  • Check compliance with specific answer options or output types
  • Assess whether agents follow multi-part instructions completely
  • Evaluate instruction adherence independently from correctness

This evaluator operates at the TRACE_LEVEL, evaluating the most recent turn in the conversation.

Configuration parameters:

  • Judge model (Type: Model | str | None; Default: None, uses the default Bedrock model): the model to use as the judge.
  • System prompt (Type: str | None; Default: None, uses the built-in template): custom system prompt for the judge model.
  • Prompt version (Type: str; Default: "v0"): prompt template version.
| Rating | Score | Description |
|--------|-------|-------------|
| Yes    | 1.0   | All explicit instructions in the prompt are satisfied |
| No     | 0.0   | One or more explicit instructions are not satisfied |

A response passes the evaluation if all explicit instructions are followed (score = 1.0).
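As a minimal sketch of the binary scheme above (the helper functions here are illustrative only, not part of the strands_evals API):

```python
# Map the judge's Yes/No rating to the evaluator's numeric score.
RATING_SCORES = {"Yes": 1.0, "No": 0.0}

def score_from_rating(rating: str) -> float:
    """Return the numeric score for a Yes/No rating."""
    return RATING_SCORES[rating]

def passes(score: float) -> bool:
    """A response passes only when every explicit instruction was followed."""
    return score == 1.0
```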

The evaluator checks for compliance with specific directives such as:

  • Information constraints: “Based on this text passage, give an overview about […]”
  • Length requirements: “Summarize this text in one sentence”
  • Answer options: “Which of the following is the tallest mountain…”
  • Target audience: “Write an explanation for middle schoolers”
  • Genre: “Write an ad for a laundry service”
  • Style: “Write an ad for a sports car like it’s an obituary”
  • Content type: “Write a body for this email” vs “Write a subject line”
The evaluator also applies several edge-case rules:

  • If a response includes more information than requested, it still passes as long as all requested elements are present
  • If a response is purely evasive without any partial or related answer, it defaults to Yes
  • If a response is partially evasive but provides a partial answer, the partial answer is judged
  • If there are no explicit instructions in the input (casual or open-ended requests), defaults to Yes
  • The evaluator does not assess factual accuracy, writing quality, or response effectiveness
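The edge-case rules above can be summarized as a decision sketch. The real judge is an LLM, not a rule engine, so this function is only an illustration of the rules, not the actual implementation:

```python
def judge_rating(has_instructions: bool,
                 purely_evasive: bool,
                 requested_elements_present: bool) -> str:
    """Illustrative decision logic for the edge-case rules."""
    if not has_instructions:
        return "Yes"  # no explicit instructions: defaults to Yes
    if purely_evasive:
        return "Yes"  # purely evasive, no partial answer: defaults to Yes
    # Extra information is fine as long as every requested element is present.
    return "Yes" if requested_elements_present else "No"
```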
```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import InstructionFollowingEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def task_function(case: Case) -> dict:
    # Clear spans from previous cases so the trace covers only this run.
    telemetry.in_memory_exporter.clear()

    agent = Agent(
        trace_attributes={"session.id": case.session_id},
        callback_handler=None,
    )
    response = agent(case.input)

    # Map the captured spans into a session for trace-level evaluation.
    spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(spans, session_id=case.session_id)

    return {"output": str(response), "trajectory": session}

cases = [
    Case(
        name="format-constraint",
        input="List the top 3 programming languages in bullet points.",
    ),
    Case(
        name="length-constraint",
        input="Explain quantum computing in exactly one sentence.",
    ),
]

experiment = Experiment(cases=cases, evaluators=[InstructionFollowingEvaluator()])
reports = experiment.run_evaluations(task_function)
reports[0].run_display()
```

Pair with quality evaluators to assess both compliance and correctness:

```python
evaluators = [
    InstructionFollowingEvaluator(),  # Did it follow the format/constraints?
    CorrectnessEvaluator(),           # Is the content factually correct?
    ConcisenessEvaluator(),           # Is it appropriately concise?
]
```