# Instruction Following Evaluator
## Overview

The `InstructionFollowingEvaluator` assesses whether an agent's response follows all explicit instructions provided in the user's prompt. It focuses strictly on instruction compliance: whether specific constraints, requirements, and directives were satisfied. It does not consider response quality or factual accuracy.
## Key Features

- **Trace-Level Evaluation**: Evaluates the most recent turn in the conversation
- **Binary Scoring**: Clear Yes (instructions followed) / No (instructions not followed) classification
- **Async Support**: Supports both synchronous and asynchronous evaluation
- **Constraint-Focused**: Evaluates compliance with explicit directives, not overall quality
## When to Use

Use the `InstructionFollowingEvaluator` when you need to:
- Verify that agents respect format, length, or style constraints
- Check compliance with specific answer options or output types
- Assess whether agents follow multi-part instructions completely
- Evaluate instruction adherence independently from correctness
## Evaluation Level

This evaluator operates at the `TRACE_LEVEL`, evaluating the most recent turn in the conversation.
## Parameters

### model (optional)

- **Type**: `Model | str | None`
- **Default**: `None` (uses the default Bedrock model)
- **Description**: The model to use as the judge.

### system_prompt (optional)

- **Type**: `str | None`
- **Default**: `None` (uses the built-in template)
- **Description**: Custom system prompt for the judge model.

### version (optional)

- **Type**: `str`
- **Default**: `"v0"`
- **Description**: Prompt template version.
## Scoring System

| Rating | Score | Description |
|---|---|---|
| Yes | 1.0 | All explicit instructions in the prompt are satisfied |
| No | 0.0 | One or more explicit instructions are not satisfied |

A response passes the evaluation if all explicit instructions are followed (score = 1.0).
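Because scoring is binary, a downstream pass/fail check reduces to a single comparison. A minimal sketch in plain Python, independent of any particular report structure:

```python
PASS_SCORE = 1.0

def instructions_followed(score: float) -> bool:
    """Map the evaluator's binary score to a pass/fail decision.

    1.0 means every explicit instruction was satisfied; 0.0 means at
    least one was missed. There are no intermediate values.
    """
    return score == PASS_SCORE
```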
## What Counts as Explicit Instructions

The evaluator checks for compliance with specific directives such as:
- **Information constraints**: “Based on this text passage, give an overview about […]”
- **Length requirements**: “Summarize this text in one sentence”
- **Answer options**: “Which of the following is the tallest mountain…”
- **Target audience**: “Write an explanation for middle schoolers”
- **Genre**: “Write an ad for a laundry service”
- **Style**: “Write an ad for a sports car like it’s an obituary”
- **Content type**: “Write a body for this email” vs. “Write a subject line”
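Each of these instruction types can be exercised directly as a test case. A short sketch using the `Case` API from the usage example below (case names and inputs are illustrative):

```python
from strands_evals import Case

# One case per instruction type; the judge checks only the stated constraint.
instruction_cases = [
    Case(name="length", input="Summarize this text in one sentence: ..."),
    Case(name="audience", input="Write an explanation of DNS for middle schoolers."),
    Case(name="style", input="Write an ad for a sports car like it's an obituary."),
]
```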
## Evaluation Rules

- If a response includes more information than requested, it still passes as long as all requested elements are present
- If a response is purely evasive without any partial or related answer, it defaults to Yes
- If a response is partially evasive but provides a partial answer, the partial answer is judged
- If the input contains no explicit instructions (casual or open-ended requests), the evaluation defaults to Yes
- The evaluator does not assess factual accuracy, writing quality, or response effectiveness
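Two of these rules are straightforward to probe with dedicated cases: multi-part requests where extra detail is acceptable, and inputs with no explicit instructions at all. A sketch, again using the `Case` API from the example below:

```python
from strands_evals import Case

rule_probe_cases = [
    # A multi-part request: the response may add extra detail and still
    # pass, as long as both requested elements are present.
    Case(name="multi-part", input="Give the capital of France and its population."),
    # An open-ended prompt carries no explicit instructions, so the
    # evaluator defaults to Yes regardless of what the agent says.
    Case(name="open-ended", input="Tell me something interesting."),
]
```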
## Basic Usage

```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import InstructionFollowingEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def task_function(case: Case) -> dict:
    telemetry.in_memory_exporter.clear()

    agent = Agent(
        trace_attributes={"session.id": case.session_id},
        callback_handler=None,
    )
    response = agent(case.input)

    # Map the captured spans to a session for trace-level evaluation.
    spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(spans, session_id=case.session_id)

    return {"output": str(response), "trajectory": session}

cases = [
    Case(
        name="format-constraint",
        input="List the top 3 programming languages in bullet points.",
    ),
    Case(
        name="length-constraint",
        input="Explain quantum computing in exactly one sentence.",
    ),
]

experiment = Experiment(cases=cases, evaluators=[InstructionFollowingEvaluator()])
reports = experiment.run_evaluations(task_function)
reports[0].run_display()
```
## Combining with Other Evaluators

Pair with quality evaluators to assess both compliance and correctness:
```python
from strands_evals.evaluators import (
    ConcisenessEvaluator,  # assumed to live alongside InstructionFollowingEvaluator
    CorrectnessEvaluator,
    InstructionFollowingEvaluator,
)

evaluators = [
    InstructionFollowingEvaluator(),  # Did it follow the format/constraints?
    CorrectnessEvaluator(),           # Is the content factually correct?
    ConcisenessEvaluator(),           # Is it appropriately concise?
]
```
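The combined list plugs into the same `Experiment` flow shown in the basic usage example:

```python
experiment = Experiment(cases=cases, evaluators=evaluators)
reports = experiment.run_evaluations(task_function)
```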
## Related Evaluators

- `CorrectnessEvaluator`: Evaluates factual accuracy (complementary to instruction following)
- `OutputEvaluator`: Provides flexible, custom rubric-based evaluation
- `RefusalEvaluator`: Detects when agents refuse to address prompts
- `CoherenceEvaluator`: Assesses logical consistency and reasoning