Correctness Evaluator
Overview
The CorrectnessEvaluator assesses whether an agent's response is factually correct. It supports two modes: a basic mode that evaluates correctness from conversation context alone, and a reference mode that compares the response against an expected answer.
Key Features
- Dual Mode: Basic (3-level) and reference-based (binary) evaluation
- Trace-Level Evaluation: Evaluates the most recent turn in the conversation
- Automatic Mode Selection: Switches to reference mode when `expected_assertion` is provided on the case
- Structured Reasoning: Provides step-by-step reasoning for each evaluation
When to Use
Use the CorrectnessEvaluator when you need to:
- Verify factual accuracy of agent responses
- Compare agent output against known correct answers
- Assess correctness in knowledge-based Q&A scenarios
- Validate that agents provide accurate information
Evaluation Level
This evaluator operates at the TRACE_LEVEL, evaluating the most recent turn in the conversation.
Parameters
model (optional)
- Type: `Union[Model, str, None]`
- Default: `None` (uses default Bedrock model)
- Description: The model to use as the judge.
system_prompt (optional)
- Type: `str | None`
- Default: `None` (uses built-in template for basic mode)
- Description: Custom system prompt for basic mode evaluation.
reference_system_prompt (optional)
- Type: `str | None`
- Default: `None` (uses built-in template for reference mode)
- Description: Custom system prompt for reference-based evaluation.
version (optional)
- Type: `str`
- Default: `"v0"`
- Description: Prompt template version.
Scoring System
Basic Mode (no reference)
- Perfectly Correct (1.0): Response is fully accurate
- Partially Correct (0.5): Response contains some correct and some incorrect information
- Incorrect (0.0): Response is factually wrong
A response passes if the score is 1.0.
Reference Mode (with expected_assertion)
- CORRECT (1.0): Response matches the expected answer
- INCORRECT (0.0): Response does not match the expected answer
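The two rubrics reduce to a simple score-to-label mapping. The helper below is not part of strands_evals; it is only a sketch of the thresholds described above:

```python
def score_label(score: float, has_reference: bool) -> str:
    """Illustrative helper (not part of strands_evals): map a
    CorrectnessEvaluator score to its rubric label."""
    if has_reference:
        # Reference mode is binary.
        return "CORRECT" if score == 1.0 else "INCORRECT"
    # Basic mode is 3-level; only a score of 1.0 counts as passing.
    labels = {1.0: "Perfectly Correct", 0.5: "Partially Correct", 0.0: "Incorrect"}
    return labels[score]
```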
Basic Usage
Without Reference (3-level scoring)
```python
from strands import Agent

from strands_evals import Case, Experiment
from strands_evals.evaluators import CorrectnessEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Route agent traces to an in-memory exporter so they can be mapped to a session.
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def task_function(case: Case) -> dict:
    telemetry.in_memory_exporter.clear()
    agent = Agent(
        trace_attributes={"session.id": case.session_id},
        callback_handler=None,
    )
    response = agent(case.input)
    # Convert the finished spans into a session trajectory for the evaluator.
    spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(spans, session_id=case.session_id)
    return {"output": str(response), "trajectory": session}

cases = [Case(name="capital", input="What is the capital of France?")]

experiment = Experiment(cases=cases, evaluators=[CorrectnessEvaluator()])
reports = experiment.run_evaluations(task_function)
reports[0].run_display()
```

With Reference (binary scoring)
```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import CorrectnessEvaluator

# Reuses the task_function defined in the basic mode example above.
cases = [
    Case(
        name="capital",
        input="What is the capital of France?",
        expected_assertion="The capital of France is Paris.",
    )
]

experiment = Experiment(cases=cases, evaluators=[CorrectnessEvaluator()])
reports = experiment.run_evaluations(task_function)
```

When `expected_assertion` is set on the case, the evaluator automatically switches to reference mode and uses binary CORRECT/INCORRECT scoring.
Related Evaluators
- OutputEvaluator: Flexible custom rubric evaluation
- FaithfulnessEvaluator: Checks if responses are grounded in conversation history
- ResponseRelevanceEvaluator: Evaluates relevance to user questions
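Because all evaluators share the Experiment interface, they can run over the same cases in a single pass. A sketch, assuming the related evaluators are importable from strands_evals.evaluators like CorrectnessEvaluator, and reusing the cases and task_function from the examples above:

```python
from strands_evals import Experiment
from strands_evals.evaluators import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,       # assumed import path, mirroring CorrectnessEvaluator
    ResponseRelevanceEvaluator,  # assumed import path
)

# Run several judges over the same cases and task function in one pass.
experiment = Experiment(
    cases=cases,  # reuses `cases` and `task_function` from the examples above
    evaluators=[
        CorrectnessEvaluator(),
        FaithfulnessEvaluator(),
        ResponseRelevanceEvaluator(),
    ],
)
reports = experiment.run_evaluations(task_function)
```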