
Multimodal Output Evaluator

The MultimodalOutputEvaluator assesses the quality of agent outputs for tasks that involve images or documents alongside text. It uses a multimodal LLM (MLLM) as the judge, evaluating responses against a user-defined rubric, and extends OutputEvaluator with multimodal input support. A complete example can be found here.

  • MLLM-as-a-Judge: Uses a multimodal judge model to evaluate responses against the rubric and the media
  • Data-Driven Dispatch: Emits Strands SDK content blocks when the input carries media; falls back to a plain text prompt otherwise
  • Automatic Reference Comparison: Appends a reference suffix to the rubric when expected_output is provided on the case
  • Multiple Media Sources: Accepts file paths, base64 strings, data URLs, HTTP(S) URLs (auto-fetched via the stdlib), raw bytes, and PIL Images
  • Built-in Subclasses: MultimodalOverallQualityEvaluator, MultimodalCorrectnessEvaluator, MultimodalFaithfulnessEvaluator, and MultimodalInstructionFollowingEvaluator each default to a built-in rubric
  • Async Support: Supports both synchronous and asynchronous evaluation

Use the MultimodalOutputEvaluator when you need to:

  • Evaluate image or document understanding (VQA, chart QA, document QA, image captioning, OCR-style tasks)
  • Assess whether a text response is grounded in the visual content it describes
  • Score multimodal responses against a custom rubric that cannot be expressed as exact-match assertions
  • Compare multimodal agent configurations or prompts
  • Benchmark MLLM judges against text-only LLM judges on the same tasks

This evaluator operates at the OUTPUT_LEVEL, scoring a single agent response per case.

The evaluator accepts the following configuration options:

  • rubric (str): The evaluation criteria that define what constitutes a good response. Should include scoring guidelines (e.g., “Score 1.0 if …, 0.0 if …”). You can author your own or reuse a built-in (see Built-in Rubrics).
  • model (Model | str | None, default None): The multimodal judge model, either a model ID string or a Model instance. When unset, the default Bedrock model is used; whichever model you choose must support image input.
  • Judge system prompt (str | None, default None): Custom system prompt to guide the judge model’s behavior. When unset, the built-in MLLM_JUDGE_SYSTEM_PROMPT is used.
  • Include inputs (bool, default True): Whether to include the user instruction in the evaluation prompt. Set to False to score the response in isolation.
  • reference_suffix (str | None, default None): Text appended to the rubric when expected_output is present on the case. When unset, a built-in default suffix is used. Lets you customize how the judge should use the reference (e.g., strict vs. lenient grading).
  • Include environment state (bool, default False): Whether to include environment state in the evaluation prompt, enabling assessment of agent side effects alongside the output.
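
A minimal sketch of a customized evaluator, assuming model and reference_suffix are accepted as constructor keywords matching the options above (verify against your installed version; the rubric text and model ID are placeholders):

from strands_evals.evaluators import MultimodalOutputEvaluator

# Assumed keyword names based on the option descriptions above; illustrative values only.
evaluator = MultimodalOutputEvaluator(
    rubric="Score 1.0 if the answer is fully supported by the image; otherwise score 0.0.",
    model="<your-multimodal-model-id>",  # any judge that accepts image content blocks
    reference_suffix="If a reference answer is provided, accept semantically equivalent phrasing.",
)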

The evaluator decides whether to call the judge as a multimodal or text-only model based on the shape of case.input:

  • If case.input is a MultimodalInput carrying non-empty media, the composer returns a list of content blocks (media first, then the evaluation text), and the judge is invoked as an MLLM.
  • If case.input is anything else (e.g., a plain string, or a MultimodalInput whose media is empty), the composer returns a plain text prompt, and the judge behaves exactly like OutputEvaluator.

This means the same evaluator can grade a mixed experiment containing both multimodal and text-only cases without any extra configuration, as in the sketch below.
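
A minimal sketch of such a mixed experiment (the image path is a placeholder; the API matches the quick-start example further down):

from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import OVERALL_QUALITY_RUBRIC_V0
from strands_evals.types import MultimodalInput

evaluator = MultimodalOutputEvaluator(rubric=OVERALL_QUALITY_RUBRIC_V0)

cases = [
    # Multimodal case: the judge is invoked with content blocks (image first, then evaluation text).
    Case(
        name="describe-chart",
        input=MultimodalInput(
            media="/path/to/chart.png",
            instruction="Describe what the chart shows.",
        ),
    ),
    # Text-only case: the judge receives a plain text prompt, exactly like OutputEvaluator.
    Case(
        name="summarize-policy",
        input="Summarize the refund policy in two sentences.",
    ),
]

experiment = Experiment(cases=cases, evaluators=[evaluator])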

MultimodalInput is the Pydantic model used for multimodal cases. It has three fields:

  • media (ImageData | list[ImageData] | str): one or more image sources. A plain string is accepted as shorthand for a single ImageData(source=...).
  • instruction (str): the user’s question or request about the media.
  • context (str | None): optional additional context surfaced to the judge.

MultimodalInput round-trips through Experiment.to_dict / from_dict: when a saved experiment is reloaded, raw dicts are coerced back to MultimodalInput at the prompt-composer boundary.
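
For example, the string shorthand and the explicit form below construct the same media; the explicit version also adds optional context for the judge:

from strands_evals.types import ImageData, MultimodalInput

# Shorthand: a plain string is coerced to ImageData(source=...).
short = MultimodalInput(
    media="/path/to/chart.png",
    instruction="What does the chart show?",
)

# Explicit form, with optional context surfaced to the judge.
explicit = MultimodalInput(
    media=ImageData(source="/path/to/chart.png"),
    instruction="What does the chart show?",
    context="The chart comes from the 2024 annual report.",
)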

ImageData normalizes image sources so the judge receives raw bytes in the correct format. Accepted forms for the source field:

  • Local file path: "/path/to/chart.png" or pathlib.Path(...)
  • Base64 string: "iVBORw0KGgo..."
  • Data URL: "data:image/png;base64,iVBORw0KGgo..."
  • HTTP(S) URL: "https://example.com/image.jpg"
  • Raw bytes: b"\x89PNG..."
  • PIL Image: a PIL.Image.Image instance (requires Pillow)

HTTP(S) URLs are auto-fetched via urllib.request from the standard library, so no extra dependencies are required. The format field ("jpeg", "png", "gif", "webp") is auto-detected from the extension or data URL when omitted. S3 URIs and boto3 auto-fetch are not supported; pre-download S3 objects to bytes or a local path before constructing ImageData.
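
A brief sketch of a few accepted source forms (the paths and URL are placeholders; format is set explicitly only for the raw-bytes case, where there is no extension to detect):

from pathlib import Path

from strands_evals.types import ImageData

from_path = ImageData(source=Path("/path/to/chart.png"))      # format detected from the .png extension
from_url = ImageData(source="https://example.com/image.jpg")  # fetched via urllib.request
from_bytes = ImageData(
    source=Path("/path/to/chart.png").read_bytes(),            # raw bytes, e.g. pre-downloaded from S3
    format="png",                                               # no extension available, so set the format explicitly
)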

The following rubric constants are available from strands_evals.evaluators.prompt_templates.multimodal:

  • OVERALL_QUALITY_RUBRIC_V0: Likert-5 scale (0.0, 0.25, 0.5, 0.75, 1.0) across visual accuracy, instruction adherence, completeness, and coherence
  • CORRECTNESS_RUBRIC_V0: strict binary (1.0 / 0.0) fact-check against the image
  • FAITHFULNESS_RUBRIC_V0: strict binary, catches hallucinations not grounded in the image
  • INSTRUCTION_FOLLOWING_RUBRIC_V0: strict binary, checks count/format/scope/order/completeness/style constraints

Each built-in rubric is the default for the matching subclass (see Related Evaluators).
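
For example, assuming the subclasses are importable from strands_evals.evaluators alongside MultimodalOutputEvaluator, you can rely on their default rubrics instead of passing one explicitly:

from strands_evals.evaluators import (
    MultimodalCorrectnessEvaluator,
    MultimodalOverallQualityEvaluator,
)

# Defaults to CORRECTNESS_RUBRIC_V0: strict binary fact-check against the image.
correctness = MultimodalCorrectnessEvaluator()

# Defaults to OVERALL_QUALITY_RUBRIC_V0: Likert-5 overall quality.
quality = MultimodalOverallQualityEvaluator()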

The score is a float between 0.0 and 1.0; the granularity is determined by the rubric:

  • Strict binary (CORRECTNESS_RUBRIC_V0, FAITHFULNESS_RUBRIC_V0, INSTRUCTION_FOLLOWING_RUBRIC_V0): 1.0 on success, 0.0 on any violation. A response passes only if the score is 1.0.
  • Likert-5 (OVERALL_QUALITY_RUBRIC_V0): 0.0, 0.25, 0.5, 0.75, or 1.0. A response typically passes if the score is >= 0.75.
  • Custom rubric: any granularity you define in the rubric text. Specify the pass threshold in the rubric itself (e.g., “Score 1.0 if …”).
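
As an illustrative sketch of the custom-rubric case (the rubric text below is an example, not a built-in constant):

from strands_evals.evaluators import MultimodalOutputEvaluator

# A custom three-level rubric: the score granularity and the pass criterion
# live entirely in the rubric text that the judge reads.
CHART_ACCURACY_RUBRIC = """
Evaluate whether the response accurately describes the chart in the image.
- Score 1.0 if every stated value and trend is visible in the chart.
- Score 0.5 if the overall trend is correct but some values are wrong or missing.
- Score 0.0 if the response contradicts the chart or invents data.
A response passes only with a score of 1.0.
"""

evaluator = MultimodalOutputEvaluator(rubric=CHART_ACCURACY_RUBRIC)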

The following end-to-end, reference-free example scores a single multimodal case against the overall-quality rubric:

from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import OVERALL_QUALITY_RUBRIC_V0
from strands_evals.types import MultimodalInput
from strands_evals.types.evaluation_report import EvaluationReport


def task_function(case: Case) -> str:
    # Replace with a call to your multimodal agent. For illustration we return
    # a fixed response so the evaluator has something to score.
    return "The chart is a bar chart showing quarterly revenue for three product lines in 2024."


cases = [
    Case(
        name="chart-overview",
        input=MultimodalInput(
            media="/path/to/your/chart.png",
            instruction="What kind of chart is shown and what does it represent?",
        ),
    ),
]

evaluator = MultimodalOutputEvaluator(rubric=OVERALL_QUALITY_RUBRIC_V0)
experiment = Experiment(cases=cases, evaluators=[evaluator])
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()

Setting expected_output on a case switches the evaluator into reference-based mode. The configured reference_suffix is appended to the rubric so the judge compares the response against the reference answer.

from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import CORRECTNESS_RUBRIC_V0
from strands_evals.types import ImageData, MultimodalInput

# Reuses the task_function defined in the reference-free example above.
cases = [
    Case(
        name="chart-value",
        input=MultimodalInput(
            media=ImageData(source="/path/to/revenue_chart.png"),
            instruction="What is the Q3 revenue for Product A?",
        ),
        expected_output="$4.2M",
    ),
]

evaluator = MultimodalOutputEvaluator(rubric=CORRECTNESS_RUBRIC_V0)
experiment = Experiment(cases=cases, evaluators=[evaluator])
reports = experiment.run_evaluations(task_function)

Pass a list of ImageData (or a list of source strings) to evaluate responses that reason over several images at once:

Case(
    name="before-after",
    input=MultimodalInput(
        media=[
            ImageData(source="/path/to/before.jpg"),
            ImageData(source="/path/to/after.jpg"),
        ],
        instruction="What changed between the two photos?",
        context="Both photos were taken at the same location one year apart.",
    ),
)

Like OutputEvaluator, MultimodalOutputEvaluator returns EvaluationOutput objects with:

  • score: Float between 0.0 and 1.0 (see Scoring System)
  • test_pass: Boolean indicating if the test passed
  • reason: String containing the judge’s reasoning for the score
  • label: Optional label categorizing the result

Best practices:

  1. Use a Multimodal-Capable Judge: The default Bedrock model must support image input. If you override model, confirm the model ID accepts image content blocks.
  2. Keep Images at Reasonable Resolution: Very large images increase latency and cost without improving judgment; resize to the minimum resolution needed for the task (see the sketch after this list).
  3. Write Rubrics That Reference the Image: Phrases like “based on what is visible in the image” anchor the judge to the media rather than to its prior knowledge.
  4. Use Reference Mode for Ground-Truth Tasks: Chart QA and VQA benchmarks with known answers benefit from expected_output and the reference suffix.
  5. Start with a Built-in Rubric: The four built-in rubrics ship with Strands Evals as the subclass defaults; customize only once you see where they fall short on your data.
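
As a sketch of practice 2, you can downscale with Pillow before handing the image to ImageData (PIL Images are an accepted source, per the table above; the 1024 px budget is arbitrary, not a library requirement):

from PIL import Image

from strands_evals.types import ImageData, MultimodalInput

# Downscale a large screenshot before evaluation.
img = Image.open("/path/to/large_screenshot.png")
img.thumbnail((1024, 1024))  # resizes in place, preserving aspect ratio

case_input = MultimodalInput(
    media=ImageData(source=img),  # PIL.Image.Image instances are accepted directly
    instruction="Which menu item is highlighted in this screenshot?",
)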