
Multimodal Output Evaluator

The MultimodalOutputEvaluator assesses the quality of agent outputs for tasks that involve images or documents alongside text. It uses a multimodal LLM (MLLM) as the judge, evaluating responses against a user-defined rubric, and extends OutputEvaluator with multimodal input support. A complete example can be found here.

  • MLLM-as-a-Judge: Uses a multimodal judge model to evaluate responses against the rubric and the media
  • Data-Driven Dispatch: Emits Strands SDK content blocks when the input carries media; falls back to a plain text prompt otherwise
  • Automatic Reference Comparison: Appends a reference suffix to the rubric when expected_output is provided on the case
  • Multiple Media Sources: Accepts file paths, base64 strings, data URLs, HTTP(S) URLs (auto-fetched via the stdlib), raw bytes, and PIL Images
  • Built-in Subclasses: MultimodalOverallQualityEvaluator, MultimodalCorrectnessEvaluator, MultimodalFaithfulnessEvaluator, and MultimodalInstructionFollowingEvaluator each default to a built-in rubric
  • Async Support: Supports both synchronous and asynchronous evaluation

Use the MultimodalOutputEvaluator when you need to:

  • Evaluate image or document understanding (VQA, chart QA, document QA, image captioning, OCR-style tasks)
  • Assess whether a text response is grounded in the visual content it describes
  • Score multimodal responses against a custom rubric that cannot be expressed as exact-match assertions
  • Compare multimodal agent configurations or prompts
  • Benchmark MLLM judges against text-only LLM judges on the same tasks

This evaluator operates at the OUTPUT_LEVEL, scoring a single agent response per case.

The evaluator accepts the following configuration options:

  • rubric (str): The evaluation criteria that define what constitutes a good response. Should include scoring guidelines (e.g., “Score 1.0 if …, 0.0 if …”). You can author your own or reuse a built-in (see Built-in Rubrics).
  • model (Model | str | None, default None): The multimodal judge model, either a model ID string or a Model instance. When unset, the default Bedrock model is used; whichever model you choose must support image input.
  • Judge system prompt (str | None, default None): Custom system prompt to guide the judge model’s behavior. When unset, the built-in MLLM_JUDGE_SYSTEM_PROMPT is used.
  • Include inputs (bool, default True): Whether to include the user instruction in the evaluation prompt. Set to False to score the response in isolation.
  • reference_suffix (str | None, default None): Text appended to the rubric when expected_output is present on the case. When unset, a built-in default suffix is used. Lets you customize how the judge should use the reference (e.g., strict vs. lenient grading).
  • Include environment state (bool, default False): Whether to include environment state in the evaluation prompt, enabling assessment of agent side effects alongside the output.
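
A minimal sketch of a customized evaluator, assuming model and reference_suffix are accepted as constructor keywords matching the options above (verify against your installed version; the rubric text and model ID are placeholders):

from strands_evals.evaluators import MultimodalOutputEvaluator

# Assumed keyword names based on the option descriptions above; illustrative values only.
evaluator = MultimodalOutputEvaluator(
    rubric="Score 1.0 if the answer is fully supported by the image; otherwise score 0.0.",
    model="<your-multimodal-model-id>",  # any judge that accepts image content blocks
    reference_suffix="If a reference answer is provided, accept semantically equivalent phrasing.",
)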

The evaluator decides whether to call the judge as a multimodal or text-only model based on the shape of case.input:

  • If case.input is a MultimodalInput carrying non-empty media, the composer returns a list of content blocks (media first, then the evaluation text), and the judge is invoked as an MLLM.
  • If case.input is anything else (e.g., a plain string, or a MultimodalInput whose media is empty), the composer returns a plain text prompt, and the judge behaves exactly like OutputEvaluator.

This means the same evaluator can grade a mixed experiment containing both multimodal and text-only cases without any extra configuration, as in the sketch below.
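
A minimal sketch of such a mixed experiment (the image path is a placeholder; the API matches the quick-start example further down):

from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import OVERALL_QUALITY_RUBRIC_V0
from strands_evals.types import MultimodalInput

evaluator = MultimodalOutputEvaluator(rubric=OVERALL_QUALITY_RUBRIC_V0)

cases = [
    # Multimodal case: the judge is invoked with content blocks (image first, then evaluation text).
    Case(
        name="describe-chart",
        input=MultimodalInput(
            media="/path/to/chart.png",
            instruction="Describe what the chart shows.",
        ),
    ),
    # Text-only case: the judge receives a plain text prompt, exactly like OutputEvaluator.
    Case(
        name="summarize-policy",
        input="Summarize the refund policy in two sentences.",
    ),
]

experiment = Experiment(cases=cases, evaluators=[evaluator])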

MultimodalInput is the Pydantic model used for multimodal cases. It has three fields:

  • media (ImageData | list[ImageData] | str): one or more image sources. A plain string is accepted as shorthand for a single ImageData(source=...).
  • instruction (str): the user’s question or request about the media.
  • context (str | None): optional additional context surfaced to the judge.

MultimodalInput round-trips through Experiment.to_dict / from_dict: when a saved experiment is reloaded, raw dicts are coerced back to MultimodalInput at the prompt-composer boundary.
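
For example, the string shorthand and the explicit form below construct the same media; the explicit version also adds optional context for the judge:

from strands_evals.types import ImageData, MultimodalInput

# Shorthand: a plain string is coerced to ImageData(source=...).
short = MultimodalInput(
    media="/path/to/chart.png",
    instruction="What does the chart show?",
)

# Explicit form, with optional context surfaced to the judge.
explicit = MultimodalInput(
    media=ImageData(source="/path/to/chart.png"),
    instruction="What does the chart show?",
    context="The chart comes from the 2024 annual report.",
)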

ImageData normalizes image sources so the judge receives raw bytes in the correct format. Accepted forms for the source field:

  • Local file path: "/path/to/chart.png" or pathlib.Path(...)
  • Base64 string: "iVBORw0KGgo..."
  • Data URL: "data:image/png;base64,iVBORw0KGgo..."
  • HTTP(S) URL: "https://example.com/image.jpg"
  • Raw bytes: b"\x89PNG..."
  • PIL Image: a PIL.Image.Image instance (requires Pillow)

HTTP(S) URLs are auto-fetched via urllib.request from the standard library, so no extra dependencies are required. The format field ("jpeg", "png", "gif", "webp") is auto-detected from the extension or data URL when omitted. S3 URIs and boto3 auto-fetch are not supported; pre-download S3 objects to bytes or a local path before constructing ImageData.
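
A brief sketch of a few accepted source forms (the paths and URL are placeholders; format is set explicitly only for the raw-bytes case, where there is no extension to detect):

from pathlib import Path

from strands_evals.types import ImageData

from_path = ImageData(source=Path("/path/to/chart.png"))      # format detected from the .png extension
from_url = ImageData(source="https://example.com/image.jpg")  # fetched via urllib.request
from_bytes = ImageData(
    source=Path("/path/to/chart.png").read_bytes(),            # raw bytes, e.g. pre-downloaded from S3
    format="png",                                               # no extension available, so set the format explicitly
)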

The following rubric constants are available from strands_evals.evaluators.prompt_templates.multimodal:

  • OVERALL_QUALITY_RUBRIC_V0: Likert-5 scale (0.0, 0.25, 0.5, 0.75, 1.0) across visual accuracy, instruction adherence, completeness, and coherence
  • CORRECTNESS_RUBRIC_V0: strict binary (1.0 / 0.0) fact-check against the image
  • FAITHFULNESS_RUBRIC_V0: strict binary, catches hallucinations not grounded in the image
  • INSTRUCTION_FOLLOWING_RUBRIC_V0: strict binary, checks count/format/scope/order/completeness/style constraints

Each built-in rubric is the default for the matching subclass (see Related Evaluators).
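
For example, assuming the subclasses are importable from strands_evals.evaluators alongside MultimodalOutputEvaluator, you can rely on their default rubrics instead of passing one explicitly:

from strands_evals.evaluators import (
    MultimodalCorrectnessEvaluator,
    MultimodalOverallQualityEvaluator,
)

# Defaults to CORRECTNESS_RUBRIC_V0: strict binary fact-check against the image.
correctness = MultimodalCorrectnessEvaluator()

# Defaults to OVERALL_QUALITY_RUBRIC_V0: Likert-5 overall quality.
quality = MultimodalOverallQualityEvaluator()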

The score is a float between 0.0 and 1.0; the granularity is determined by the rubric:

  • Strict binary (CORRECTNESS_RUBRIC_V0, FAITHFULNESS_RUBRIC_V0, INSTRUCTION_FOLLOWING_RUBRIC_V0): 1.0 on success, 0.0 on any violation. A response passes only if the score is 1.0.
  • Likert-5 (OVERALL_QUALITY_RUBRIC_V0): 0.0, 0.25, 0.5, 0.75, or 1.0. A response typically passes if the score is >= 0.75.
  • Custom rubric: any granularity you define in the rubric text. Specify the pass threshold in the rubric itself (e.g., “Score 1.0 if …”).
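
As an illustrative sketch of the custom-rubric case (the rubric text below is an example, not a built-in constant):

from strands_evals.evaluators import MultimodalOutputEvaluator

# A custom three-level rubric: the score granularity and the pass criterion
# live entirely in the rubric text that the judge reads.
CHART_ACCURACY_RUBRIC = """
Evaluate whether the response accurately describes the chart in the image.
- Score 1.0 if every stated value and trend is visible in the chart.
- Score 0.5 if the overall trend is correct but some values are wrong or missing.
- Score 0.0 if the response contradicts the chart or invents data.
A response passes only with a score of 1.0.
"""

evaluator = MultimodalOutputEvaluator(rubric=CHART_ACCURACY_RUBRIC)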

The following end-to-end, reference-free example scores a single multimodal case against the overall-quality rubric:

from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import OVERALL_QUALITY_RUBRIC_V0
from strands_evals.types import MultimodalInput
from strands_evals.types.evaluation_report import EvaluationReport


def task_function(case: Case) -> str:
    # Replace with a call to your multimodal agent. For illustration we return
    # a fixed response so the evaluator has something to score.
    return "The chart is a bar chart showing quarterly revenue for three product lines in 2024."


cases = [
    Case(
        name="chart-overview",
        input=MultimodalInput(
            media="/path/to/your/chart.png",
            instruction="What kind of chart is shown and what does it represent?",
        ),
    ),
]

evaluator = MultimodalOutputEvaluator(rubric=OVERALL_QUALITY_RUBRIC_V0)
experiment = Experiment(cases=cases, evaluators=[evaluator])
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()

Setting expected_output on a case switches the evaluator into reference-based mode. The configured reference_suffix is appended to the rubric so the judge compares the response against the reference answer.

from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import CORRECTNESS_RUBRIC_V0
from strands_evals.types import ImageData, MultimodalInput

# Reuses the task_function defined in the reference-free example above.
cases = [
    Case(
        name="chart-value",
        input=MultimodalInput(
            media=ImageData(source="/path/to/revenue_chart.png"),
            instruction="What is the Q3 revenue for Product A?",
        ),
        expected_output="$4.2M",
    ),
]

evaluator = MultimodalOutputEvaluator(rubric=CORRECTNESS_RUBRIC_V0)
experiment = Experiment(cases=cases, evaluators=[evaluator])
reports = experiment.run_evaluations(task_function)

Pass a list of ImageData (or a list of source strings) to evaluate responses that reason over several images at once:

Case(
    name="before-after",
    input=MultimodalInput(
        media=[
            ImageData(source="/path/to/before.jpg"),
            ImageData(source="/path/to/after.jpg"),
        ],
        instruction="What changed between the two photos?",
        context="Both photos were taken at the same location one year apart.",
    ),
)

Like OutputEvaluator, MultimodalOutputEvaluator returns EvaluationOutput objects with:

  • score: Float between 0.0 and 1.0 (see Scoring System)
  • test_pass: Boolean indicating if the test passed
  • reason: String containing the judge’s reasoning for the score
  • label: Optional label categorizing the result

Best practices:

  1. Use a Multimodal-Capable Judge: The default Bedrock model must support image input. If you override model, confirm the model ID accepts image content blocks.
  2. Keep Images at Reasonable Resolution: Very large images increase latency and cost without improving judgment; resize to the minimum resolution needed for the task (see the sketch after this list).
  3. Write Rubrics That Reference the Image: Phrases like “based on what is visible in the image” anchor the judge to the media rather than to its prior knowledge.
  4. Use Reference Mode for Ground-Truth Tasks: Chart QA and VQA benchmarks with known answers benefit from expected_output and the reference suffix.
  5. Start with a Built-in Rubric: The four built-in rubrics ship with Strands Evals as the subclass defaults; customize only once you see where they fall short on your data.
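
As a sketch of practice 2, you can downscale with Pillow before handing the image to ImageData (PIL Images are an accepted source, per the table above; the 1024 px budget is arbitrary, not a library requirement):

from PIL import Image

from strands_evals.types import ImageData, MultimodalInput

# Downscale a large screenshot before evaluation.
img = Image.open("/path/to/large_screenshot.png")
img.thumbnail((1024, 1024))  # resizes in place, preserving aspect ratio

case_input = MultimodalInput(
    media=ImageData(source=img),  # PIL.Image.Image instances are accepted directly
    instruction="Which menu item is highlighted in this screenshot?",
)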