# Multimodal Correctness Evaluator

## Overview

The `MultimodalCorrectnessEvaluator` assesses whether an agent response is factually correct given the image content. It catches errors in objects, counts, colors, positions, readable text, and described actions.
## Key Features

- **Output-Level Evaluation**: Scores a single agent response per case
- **Binary Scoring**: `1.0` for correct, `0.0` if any factual error is found
- **Automatic Reference Comparison**: Appends a reference suffix to the rubric when `expected_output` is provided on the case
- **Fact-Checking Focus**: Designed to catch factual errors in image descriptions and VQA answers
## When to Use

Use the `MultimodalCorrectnessEvaluator` when you need to:
- Verify factual accuracy of image captions, VQA answers, or chart/document QA responses
- Measure exact correctness on benchmark tasks with a known answer
- Catch small but important errors (off-by-one counts, wrong colors, misread text)
## Evaluation Level

This evaluator operates at the `OUTPUT_LEVEL`, scoring a single agent response per case.
## Parameters

### `rubric` (optional)

- Type: `str | None`
- Default: `CORRECTNESS_RUBRIC_V0`
- Description: Custom rubric. Leave unset to use the default rubric.

### `model` (optional)

- Type: `Model | str | None`
- Default: `None` (uses the default Bedrock model)
- Description: Multimodal judge model.

### `include_inputs` (optional)

- Type: `bool`
- Default: `True`

### `system_prompt` (optional)

- Type: `str | None`
- Default: `None` (uses the built-in `MLLM_JUDGE_SYSTEM_PROMPT`)

### `reference_suffix` (optional)

- Type: `str | None`
- Default: `None` (uses the built-in default suffix)
- Description: Override to customize reference-based grading.
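Putting the parameters together, here is a minimal configuration sketch. The rubric text and Bedrock model identifier are illustrative placeholders, not values shipped with the library:

```python
from strands_evals.evaluators import MultimodalCorrectnessEvaluator

# Every argument is optional; omitted arguments fall back to the defaults
# documented above. The rubric text and model id are placeholders.
evaluator = MultimodalCorrectnessEvaluator(
    rubric=(
        "Score 1.0 only if every object, count, color, position, and "
        "text claim in the response is correct; otherwise score 0.0."
    ),
    model="us.anthropic.claude-3-5-sonnet-20241022-v2:0",  # placeholder Bedrock model id
    include_inputs=True,
)
```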
## Scoring System

| Score | Label | Meaning |
|---|---|---|
| 1.0 | Correct | No factual errors found |
| 0.0 | Incorrect | One or more factual errors found |
A response passes only if the score is 1.0.
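Programmatically, failing cases can be found by filtering for non-perfect scores. The sketch below assumes an experiment has already been run (see Basic Usage) and that the flattened report is iterable over per-case records with a `score` field; those record details are assumptions, not confirmed `strands_evals` API:

```python
# Hypothetical sketch: iterating the flattened report and reading a `score`
# field are assumptions about the report shape, not confirmed API.
flattened = EvaluationReport.flatten(reports)
failures = [r for r in flattened if getattr(r, "score", 0.0) != 1.0]
print(f"{len(failures)} case(s) flagged with at least one factual error")
```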
## Basic Usage

### Reference-Free (fact-check against the image)

```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalCorrectnessEvaluator
from strands_evals.types import MultimodalInput
from strands_evals.types.evaluation_report import EvaluationReport


def task_function(case: Case) -> str:
    # Replace with your multimodal agent invocation.
    return "There are 3 people sitting on the red couch."


cases = [
    Case(
        name="scene-count",
        input=MultimodalInput(
            media="/path/to/living_room.jpg",
            instruction="How many people are on the couch and what color is it?",
        ),
    ),
]

experiment = Experiment(cases=cases, evaluators=[MultimodalCorrectnessEvaluator()])
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()
```

### Reference-Based (compare against a known answer)
```python
cases = [
    Case(
        name="chart-value",
        input=MultimodalInput(
            media="/path/to/revenue_chart.png",
            instruction="What is the Q3 revenue for Product A?",
        ),
        expected_output="$4.2M",
    ),
]

experiment = Experiment(cases=cases, evaluators=[MultimodalCorrectnessEvaluator()])
reports = experiment.run_evaluations(task_function)
```

When `expected_output` is set, the evaluator automatically appends the reference suffix so the judge compares the response to the reference answer.
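The suffix itself is also configurable via `reference_suffix`. Below is a hedged sketch of a stricter override; the suffix wording is an illustrative assumption, not the built-in default:

```python
# The suffix text is illustrative; leaving reference_suffix as None keeps
# the built-in default that the evaluator appends automatically.
strict_evaluator = MultimodalCorrectnessEvaluator(
    reference_suffix=(
        "\nCompare the response against the reference answer. "
        "Score 1.0 only if it matches the reference in substance."
    )
)

experiment = Experiment(cases=cases, evaluators=[strict_evaluator])
reports = experiment.run_evaluations(task_function)
```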
## Related Evaluators

- `MultimodalOutputEvaluator`: Parent class with the full parameter reference
- `MultimodalFaithfulnessEvaluator`: Catches hallucinations (claims not verifiable from the image)
- `CorrectnessEvaluator`: Text-only counterpart