Multimodal Correctness Evaluator

The MultimodalCorrectnessEvaluator assesses whether an agent response is factually correct given the image content. It catches errors in objects, counts, colors, positions, readable text, and described actions.

  • Output-Level Evaluation: Scores a single agent response per case
  • Binary Scoring: 1.0 for correct, 0.0 if any factual error is found
  • Automatic Reference Comparison: Appends a reference suffix to the rubric when expected_output is provided on the case
  • Fact-Checking Focus: Designed to catch factual errors in image descriptions and VQA answers

Use the MultimodalCorrectnessEvaluator when you need to:

  • Verify factual accuracy of image captions, VQA answers, or chart/document QA responses
  • Measure exact correctness on benchmark tasks with a known answer
  • Catch small but important errors (off-by-one counts, wrong colors, misread text)

This evaluator operates at the OUTPUT_LEVEL, scoring a single agent response per case. Its constructor accepts the following options:

  • Type: str | None
    Default: CORRECTNESS_RUBRIC_V0
    Description: Custom rubric. Leave unset to use the default rubric.
  • Type: Model | str | None
    Default: None (uses the default Bedrock model)
    Description: Multimodal judge model.
  • Type: bool
    Default: True
  • Type: str | None
    Default: None (uses the built-in MLLM_JUDGE_SYSTEM_PROMPT)
    Description: Custom judge system prompt. Leave unset to use the built-in prompt.
  • Type: str | None
    Default: None (uses the built-in default suffix)
    Description: Reference suffix appended to the rubric. Override to customize reference-based grading.
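
The parameter names themselves are not shown above, so any construction example has to guess at them. A minimal sketch, assuming the constructor exposes keywords along the lines of rubric, model, and system_prompt (hypothetical names; check the actual MultimodalCorrectnessEvaluator signature in your strands_evals version):

from strands_evals.evaluators import MultimodalCorrectnessEvaluator

evaluator = MultimodalCorrectnessEvaluator(
    # All keyword names below are assumptions for illustration.
    rubric="Score 1.0 only if every stated fact matches the image; otherwise 0.0.",
    model="your-bedrock-model-id",  # placeholder; any supported judge model
    system_prompt="You are a meticulous fact-checker for image descriptions.",
)
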
Score  Label      Meaning
1.0    Correct    No factual errors found
0.0    Incorrect  One or more factual errors found

A response passes only if the score is 1.0.
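
Because scoring is binary, aggregate quality over a suite of cases reduces to a pass rate. A minimal, self-contained sketch; the pass_rate helper is illustrative and not part of strands_evals:

def pass_rate(scores: list[float]) -> float:
    # Binary rubric: a case passes only on an exact 1.0.
    if not scores:
        return 0.0
    return sum(1 for s in scores if s == 1.0) / len(scores)

print(pass_rate([1.0, 0.0, 1.0]))  # 0.666... (2 of 3 cases passed)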

Reference-Free (fact-check against the image)

from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalCorrectnessEvaluator
from strands_evals.types import MultimodalInput
from strands_evals.types.evaluation_report import EvaluationReport


def task_function(case: Case) -> str:
    # Replace with your multimodal agent invocation.
    return "There are 3 people sitting on the red couch."


cases = [
    Case(
        name="scene-count",
        input=MultimodalInput(
            media="/path/to/living_room.jpg",
            instruction="How many people are on the couch and what color is it?",
        ),
    ),
]

experiment = Experiment(cases=cases, evaluators=[MultimodalCorrectnessEvaluator()])
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()

Reference-Based (compare against a known answer)

cases = [
    Case(
        name="chart-value",
        input=MultimodalInput(
            media="/path/to/revenue_chart.png",
            instruction="What is the Q3 revenue for Product A?",
        ),
        expected_output="$4.2M",
    ),
]

experiment = Experiment(cases=cases, evaluators=[MultimodalCorrectnessEvaluator()])
reports = experiment.run_evaluations(task_function)

When expected_output is set, the evaluator automatically appends the reference suffix so the judge compares the response to the reference answer.
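
The suffix itself can be overridden through the constructor option described above. A minimal sketch; the keyword name reference_suffix is an assumption for illustration, not a confirmed parameter name:

evaluator = MultimodalCorrectnessEvaluator(
    # Hypothetical keyword; check the actual constructor signature.
    reference_suffix=(
        "Compare the response to the reference answer. "
        "Score 1.0 only on an exact factual match."
    ),
)
experiment = Experiment(cases=cases, evaluators=[evaluator])
reports = experiment.run_evaluations(task_function)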