# Instruction Following Evaluator
## Overview

The `InstructionFollowingEvaluator` assesses whether an agent's response follows all explicit instructions provided in the user's prompt. It focuses strictly on instruction compliance: whether specific constraints, requirements, and directives were satisfied. It does not consider response quality or factual accuracy.
## Key Features

- **Trace-Level Evaluation**: Evaluates the most recent turn in the conversation
- **Binary Scoring**: Clear Yes (instructions followed) / No (instructions not followed) classification
- **Async Support**: Supports both synchronous and asynchronous evaluation
- **Constraint-Focused**: Evaluates compliance with explicit directives, not overall quality
## When to Use

Use the `InstructionFollowingEvaluator` when you need to:
- Verify that agents respect format, length, or style constraints
- Check compliance with specific answer options or output types
- Assess whether agents follow multi-part instructions completely
- Evaluate instruction adherence independently from correctness
## Evaluation Level

This evaluator operates at the `TRACE_LEVEL`, evaluating the most recent turn in the conversation.
## Parameters

### model (optional)

- **Type**: `Model | str | None`
- **Default**: `None` (uses the default Bedrock model)
- **Description**: The model to use as the judge.

### system_prompt (optional)

- **Type**: `str | None`
- **Default**: `None` (uses the built-in template)
- **Description**: Custom system prompt for the judge model.

### version (optional)

- **Type**: `str`
- **Default**: `"v0"`
- **Description**: Prompt template version.
## Scoring System

| Rating | Score | Description |
|---|---|---|
| Yes | 1.0 | All explicit instructions in the prompt are satisfied |
| No | 0.0 | One or more explicit instructions are not satisfied |

A response passes the evaluation if all explicit instructions are followed (score = 1.0).
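Because scoring is binary, a downstream pass/fail check reduces to a single comparison. A minimal sketch in plain Python, independent of any particular report structure:

```python
PASS_SCORE = 1.0

def instructions_followed(score: float) -> bool:
    """Map the evaluator's binary score to a pass/fail decision.

    1.0 means every explicit instruction was satisfied; 0.0 means at
    least one was missed. There are no intermediate values.
    """
    return score == PASS_SCORE
```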
## What Counts as Explicit Instructions

The evaluator checks for compliance with specific directives such as:
- **Information constraints**: “Based on this text passage, give an overview about […]”
- **Length requirements**: “Summarize this text in one sentence”
- **Answer options**: “Which of the following is the tallest mountain…”
- **Target audience**: “Write an explanation for middle schoolers”
- **Genre**: “Write an ad for a laundry service”
- **Style**: “Write an ad for a sports car like it’s an obituary”
- **Content type**: “Write a body for this email” vs. “Write a subject line”
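Each of these instruction types can be exercised directly as a test case. A short sketch using the `Case` API from the usage example below (case names and inputs are illustrative):

```python
from strands_evals import Case

# One case per instruction type; the judge checks only the stated constraint.
instruction_cases = [
    Case(name="length", input="Summarize this text in one sentence: ..."),
    Case(name="audience", input="Write an explanation of DNS for middle schoolers."),
    Case(name="style", input="Write an ad for a sports car like it's an obituary."),
]
```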
## Evaluation Rules

- If a response includes more information than requested, it still passes as long as all requested elements are present
- If a response is purely evasive without any partial or related answer, it defaults to Yes
- If a response is partially evasive but provides a partial answer, the partial answer is judged
- If the input contains no explicit instructions (casual or open-ended requests), the evaluation defaults to Yes
- The evaluator does not assess factual accuracy, writing quality, or response effectiveness
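Two of these rules are straightforward to probe with dedicated cases: multi-part requests where extra detail is acceptable, and inputs with no explicit instructions at all. A sketch, again using the `Case` API from the example below:

```python
from strands_evals import Case

rule_probe_cases = [
    # A multi-part request: the response may add extra detail and still
    # pass, as long as both requested elements are present.
    Case(name="multi-part", input="Give the capital of France and its population."),
    # An open-ended prompt carries no explicit instructions, so the
    # evaluator defaults to Yes regardless of what the agent says.
    Case(name="open-ended", input="Tell me something interesting."),
]
```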
## Basic Usage

```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import InstructionFollowingEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def task_function(case: Case) -> dict:
    telemetry.in_memory_exporter.clear()

    agent = Agent(
        trace_attributes={"session.id": case.session_id},
        callback_handler=None,
    )
    response = agent(case.input)

    # Map the captured spans to a session for trace-level evaluation.
    spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(spans, session_id=case.session_id)

    return {"output": str(response), "trajectory": session}

cases = [
    Case(
        name="format-constraint",
        input="List the top 3 programming languages in bullet points.",
    ),
    Case(
        name="length-constraint",
        input="Explain quantum computing in exactly one sentence.",
    ),
]

experiment = Experiment(cases=cases, evaluators=[InstructionFollowingEvaluator()])
reports = experiment.run_evaluations(task_function)
reports[0].run_display()
```
## Combining with Other Evaluators

Pair with quality evaluators to assess both compliance and correctness:
```python
from strands_evals.evaluators import (
    ConcisenessEvaluator,  # assumed to live alongside InstructionFollowingEvaluator
    CorrectnessEvaluator,
    InstructionFollowingEvaluator,
)

evaluators = [
    InstructionFollowingEvaluator(),  # Did it follow the format/constraints?
    CorrectnessEvaluator(),           # Is the content factually correct?
    ConcisenessEvaluator(),           # Is it appropriately concise?
]
```
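The combined list plugs into the same `Experiment` flow shown in the basic usage example:

```python
experiment = Experiment(cases=cases, evaluators=evaluators)
reports = experiment.run_evaluations(task_function)
```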
## Related Evaluators

- `CorrectnessEvaluator`: Evaluates factual accuracy (complementary to instruction following)
- `OutputEvaluator`: Provides flexible, custom rubric-based evaluation
- `RefusalEvaluator`: Detects when agents refuse to address prompts
- `CoherenceEvaluator`: Assesses logical consistency and reasoning