Correctness Evaluator
Overview
The CorrectnessEvaluator assesses whether an agent's response is factually correct. It supports two modes: a basic mode that evaluates correctness from conversation context alone, and a reference mode that compares the response against an expected answer.
Key Features
- Dual Mode: Basic (3-level) and reference-based (binary) evaluation
- Trace-Level Evaluation: Evaluates the most recent turn in the conversation
- Automatic Mode Selection: Switches to reference mode when `expected_assertion` is provided on the case
- Structured Reasoning: Provides step-by-step reasoning for each evaluation
When to Use
Use the CorrectnessEvaluator when you need to:
- Verify factual accuracy of agent responses
- Compare agent output against known correct answers
- Assess correctness in knowledge-based Q&A scenarios
- Validate that agents provide accurate information
Evaluation Level
This evaluator operates at the TRACE_LEVEL, evaluating the most recent turn in the conversation.
Parameters
model (optional)
- Type: `Union[Model, str, None]`
- Default: `None` (uses default Bedrock model)
- Description: The model to use as the judge.
system_prompt (optional)
- Type: `str | None`
- Default: `None` (uses built-in template for basic mode)
- Description: Custom system prompt for basic mode evaluation.
reference_system_prompt (optional)
- Type: `str | None`
- Default: `None` (uses built-in template for reference mode)
- Description: Custom system prompt for reference-based evaluation.
version (optional)
- Type: `str`
- Default: `"v0"`
- Description: Prompt template version.
Scoring System
Basic Mode (no reference)
- Perfectly Correct (1.0): Response is fully accurate
- Partially Correct (0.5): Response contains some correct and some incorrect information
- Incorrect (0.0): Response is factually wrong
A response passes if the score is 1.0.
Reference Mode (with expected_assertion)
- CORRECT (1.0): Response matches the expected answer
- INCORRECT (0.0): Response does not match the expected answer
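The two rubrics reduce to a simple score-to-label mapping. The helper below is not part of strands_evals; it is only a sketch of the thresholds described above:

```python
def score_label(score: float, has_reference: bool) -> str:
    """Illustrative helper (not part of strands_evals): map a
    CorrectnessEvaluator score to its rubric label."""
    if has_reference:
        # Reference mode is binary.
        return "CORRECT" if score == 1.0 else "INCORRECT"
    # Basic mode is 3-level; only a score of 1.0 counts as passing.
    labels = {1.0: "Perfectly Correct", 0.5: "Partially Correct", 0.0: "Incorrect"}
    return labels[score]
```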
Basic Usage
Without Reference (3-level scoring)
```python
from strands import Agent

from strands_evals import Case, Experiment
from strands_evals.evaluators import CorrectnessEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Route agent traces to an in-memory exporter so they can be mapped to a session.
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def task_function(case: Case) -> dict:
    telemetry.in_memory_exporter.clear()
    agent = Agent(
        trace_attributes={"session.id": case.session_id},
        callback_handler=None,
    )
    response = agent(case.input)
    # Convert the finished spans into a session trajectory for the evaluator.
    spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(spans, session_id=case.session_id)
    return {"output": str(response), "trajectory": session}

cases = [Case(name="capital", input="What is the capital of France?")]

experiment = Experiment(cases=cases, evaluators=[CorrectnessEvaluator()])
reports = experiment.run_evaluations(task_function)
reports[0].run_display()
```

With Reference (binary scoring)
```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import CorrectnessEvaluator

# Reuses the task_function defined in the basic mode example above.
cases = [
    Case(
        name="capital",
        input="What is the capital of France?",
        expected_assertion="The capital of France is Paris.",
    )
]

experiment = Experiment(cases=cases, evaluators=[CorrectnessEvaluator()])
reports = experiment.run_evaluations(task_function)
```

When `expected_assertion` is set on the case, the evaluator automatically switches to reference mode and uses binary CORRECT/INCORRECT scoring.
Related Evaluators
- OutputEvaluator: Flexible custom rubric evaluation
- FaithfulnessEvaluator: Checks if responses are grounded in conversation history
- ResponseRelevanceEvaluator: Evaluates relevance to user questions
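Because all evaluators share the Experiment interface, they can run over the same cases in a single pass. A sketch, assuming the related evaluators are importable from strands_evals.evaluators like CorrectnessEvaluator, and reusing the cases and task_function from the examples above:

```python
from strands_evals import Experiment
from strands_evals.evaluators import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,       # assumed import path, mirroring CorrectnessEvaluator
    ResponseRelevanceEvaluator,  # assumed import path
)

# Run several judges over the same cases and task function in one pass.
experiment = Experiment(
    cases=cases,  # reuses `cases` and `task_function` from the examples above
    evaluators=[
        CorrectnessEvaluator(),
        FaithfulnessEvaluator(),
        ResponseRelevanceEvaluator(),
    ],
)
reports = experiment.run_evaluations(task_function)
```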