Multimodal Correctness Evaluator

The MultimodalCorrectnessEvaluator assesses whether an agent response is factually correct given the image content. It catches errors in objects, counts, colors, positions, readable text, and described actions.

  • Output-Level Evaluation: Scores a single agent response per case
  • Binary Scoring: 1.0 for correct, 0.0 if any factual error is found
  • Automatic Reference Comparison: Appends a reference suffix to the rubric when expected_output is provided on the case
  • Fact-Checking Focus: Designed to catch factual errors in image descriptions and VQA answers

Use the MultimodalCorrectnessEvaluator when you need to:

  • Verify factual accuracy of image captions, VQA answers, or chart/document QA responses
  • Measure exact correctness on benchmark tasks with a known answer
  • Catch small but important errors (off-by-one counts, wrong colors, misread text)

This evaluator operates at the OUTPUT_LEVEL, scoring a single agent response per case. Its constructor accepts the following options:

  • Type: str | None
    Default: CORRECTNESS_RUBRIC_V0
    Description: Custom rubric. Leave unset to use the default rubric.
  • Type: Model | str | None
    Default: None (uses the default Bedrock model)
    Description: Multimodal judge model.
  • Type: bool
    Default: True
  • Type: str | None
    Default: None (uses the built-in MLLM_JUDGE_SYSTEM_PROMPT)
    Description: Custom judge system prompt. Leave unset to use the built-in prompt.
  • Type: str | None
    Default: None (uses the built-in default suffix)
    Description: Reference suffix appended to the rubric. Override to customize reference-based grading.
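
The parameter names themselves are not shown above, so any construction example has to guess at them. A minimal sketch, assuming the constructor exposes keywords along the lines of rubric, model, and system_prompt (hypothetical names; check the actual MultimodalCorrectnessEvaluator signature in your strands_evals version):

from strands_evals.evaluators import MultimodalCorrectnessEvaluator

evaluator = MultimodalCorrectnessEvaluator(
    # All keyword names below are assumptions for illustration.
    rubric="Score 1.0 only if every stated fact matches the image; otherwise 0.0.",
    model="your-bedrock-model-id",  # placeholder; any supported judge model
    system_prompt="You are a meticulous fact-checker for image descriptions.",
)
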
Score  Label      Meaning
1.0    Correct    No factual errors found
0.0    Incorrect  One or more factual errors found

A response passes only if the score is 1.0.
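
Because scoring is binary, aggregate quality over a suite of cases reduces to a pass rate. A minimal, self-contained sketch; the pass_rate helper is illustrative and not part of strands_evals:

def pass_rate(scores: list[float]) -> float:
    # Binary rubric: a case passes only on an exact 1.0.
    if not scores:
        return 0.0
    return sum(1 for s in scores if s == 1.0) / len(scores)

print(pass_rate([1.0, 0.0, 1.0]))  # 0.666... (2 of 3 cases passed)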

Reference-Free (fact-check against the image)

from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalCorrectnessEvaluator
from strands_evals.types import MultimodalInput
from strands_evals.types.evaluation_report import EvaluationReport


def task_function(case: Case) -> str:
    # Replace with your multimodal agent invocation.
    return "There are 3 people sitting on the red couch."


cases = [
    Case(
        name="scene-count",
        input=MultimodalInput(
            media="/path/to/living_room.jpg",
            instruction="How many people are on the couch and what color is it?",
        ),
    ),
]

experiment = Experiment(cases=cases, evaluators=[MultimodalCorrectnessEvaluator()])
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()

Reference-Based (compare against a known answer)

cases = [
    Case(
        name="chart-value",
        input=MultimodalInput(
            media="/path/to/revenue_chart.png",
            instruction="What is the Q3 revenue for Product A?",
        ),
        expected_output="$4.2M",
    ),
]

experiment = Experiment(cases=cases, evaluators=[MultimodalCorrectnessEvaluator()])
reports = experiment.run_evaluations(task_function)

When expected_output is set, the evaluator automatically appends the reference suffix so the judge compares the response to the reference answer.
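
The suffix itself can be overridden through the constructor option described above. A minimal sketch; the keyword name reference_suffix is an assumption for illustration, not a confirmed parameter name:

evaluator = MultimodalCorrectnessEvaluator(
    # Hypothetical keyword; check the actual constructor signature.
    reference_suffix=(
        "Compare the response to the reference answer. "
        "Score 1.0 only on an exact factual match."
    ),
)
experiment = Experiment(cases=cases, evaluators=[evaluator])
reports = experiment.run_evaluations(task_function)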