# Multimodal Output Evaluator
## Overview

The `MultimodalOutputEvaluator` assesses the quality of agent outputs for tasks that involve images or documents alongside text. It uses an MLLM as the judge and evaluates responses against a user-defined rubric, extending `OutputEvaluator` with multimodal input support. A complete example can be found here.
## Key Features

- **MLLM-as-a-Judge**: Uses a multimodal judge model to evaluate responses against the rubric and the media
- **Data-Driven Dispatch**: Emits Strands SDK content blocks when the input carries media; falls back to a plain text prompt otherwise
- **Automatic Reference Comparison**: Appends a reference suffix to the rubric when `expected_output` is provided on the case
- **Multiple Media Sources**: Accepts file paths, base64 strings, data URLs, HTTP(S) URLs (auto-fetched via the stdlib), raw bytes, and PIL Images
- **Built-in Subclasses**: `MultimodalOverallQualityEvaluator`, `MultimodalCorrectnessEvaluator`, `MultimodalFaithfulnessEvaluator`, and `MultimodalInstructionFollowingEvaluator` each default to a built-in rubric
- **Async Support**: Supports both synchronous and asynchronous evaluation
## When to Use

Use the `MultimodalOutputEvaluator` when you need to:
- Evaluate image or document understanding (VQA, chart QA, document QA, image captioning, OCR-style tasks)
- Assess whether a text response is grounded in the visual content it describes
- Score multimodal responses against a custom rubric that cannot be expressed as exact-match assertions
- Compare multimodal agent configurations or prompts
- Benchmark MLLM judges against text-only LLM judges on the same tasks
## Evaluation Level

This evaluator operates at the `OUTPUT_LEVEL`, scoring a single agent response per case.
## Parameters

### rubric (required)

- **Type**: `str`
- **Description**: The evaluation criteria that define what constitutes a good response. Should include scoring guidelines (e.g., “Score 1.0 if …, 0.0 if …”). You can author your own or reuse a built-in (see Built-in Rubrics).
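For example, a custom rubric with explicit scoring guidelines might look like this (the criteria are illustrative, not part of the library):

```python
# A minimal custom rubric (illustrative). Explicit scoring guidelines
# tell the judge exactly which float to emit, which keeps scores
# comparable across runs.
CHART_READING_RUBRIC = """
Evaluate whether the response correctly reads values from the chart.
Score 1.0 if every value mentioned in the response matches the chart.
Score 0.5 if the response is mostly correct but contains one minor misreading.
Score 0.0 if the response misreads the chart or invents values.
"""
```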
### model (optional)

- **Type**: `Model | str | None`
- **Default**: `None` (uses the default Bedrock model)
- **Description**: The multimodal judge model. Can be a model ID string or a `Model` instance. The default must support image input.
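For instance, overriding the judge with a model ID string (the ID below is illustrative; substitute any model that accepts image content blocks):

```python
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import OVERALL_QUALITY_RUBRIC_V0

evaluator = MultimodalOutputEvaluator(
    rubric=OVERALL_QUALITY_RUBRIC_V0,
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",  # illustrative model ID
)
```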
### system_prompt (optional)

- **Type**: `str | None`
- **Default**: `None` (uses the built-in `MLLM_JUDGE_SYSTEM_PROMPT`)
- **Description**: Custom system prompt to guide the judge model’s behavior.
### include_inputs (optional)

- **Type**: `bool`
- **Default**: `True`
- **Description**: Whether to include the user instruction in the evaluation prompt. Set to `False` to score the response in isolation.
### reference_suffix (optional)

- **Type**: `str | None`
- **Default**: `None` (uses the built-in default suffix)
- **Description**: Text appended to the rubric when `expected_output` is present on the case. Lets you customize how the judge should use the reference (e.g., strict vs. lenient grading).
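A sketch of a lenient custom suffix (the wording is illustrative):

```python
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import CORRECTNESS_RUBRIC_V0

# Illustrative lenient suffix: appended to the rubric only when a case
# defines expected_output.
lenient_suffix = (
    "\nA reference answer is provided. Treat it as a guide rather than an "
    "exact string to match: accept paraphrases and equivalent formatting."
)

evaluator = MultimodalOutputEvaluator(
    rubric=CORRECTNESS_RUBRIC_V0,
    reference_suffix=lenient_suffix,
)
```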
### uses_environment_state (optional)

- **Type**: `bool`
- **Default**: `False`
- **Description**: Whether to include environment state in the evaluation prompt, enabling assessment of agent side effects alongside the output.
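Pulling the remaining optional parameters together, a configuration sketch (the system prompt text is illustrative):

```python
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import FAITHFULNESS_RUBRIC_V0

evaluator = MultimodalOutputEvaluator(
    rubric=FAITHFULNESS_RUBRIC_V0,
    # Illustrative system prompt; omit to use MLLM_JUDGE_SYSTEM_PROMPT.
    system_prompt="You are a meticulous grader. Judge only from the evidence shown.",
    include_inputs=True,           # include the user instruction in the judge prompt
    uses_environment_state=False,  # no environment side effects to assess here
)
```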
## Data-Driven Dispatch

The evaluator decides whether to call the judge as a multimodal or text-only model based on the shape of `case.input`:

- If `case.input` is a `MultimodalInput` carrying non-empty media, the composer returns a list of content blocks (media first, then the evaluation text), and the judge is invoked as an MLLM.
- If `case.input` is anything else (e.g., a plain string, or a `MultimodalInput` whose `media` is empty), the composer returns a plain text prompt, and the judge behaves exactly like `OutputEvaluator`.

This means the same evaluator can grade a mixed experiment containing both multimodal and text-only cases without any flags to set.
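For instance, one evaluator instance can serve a mixed experiment; only the first case below triggers the multimodal path (paths and questions are placeholders):

```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import CORRECTNESS_RUBRIC_V0
from strands_evals.types import MultimodalInput

cases = [
    # Multimodal case: the judge is invoked as an MLLM with content blocks.
    Case(
        name="chart-question",
        input=MultimodalInput(
            media="/path/to/chart.png",
            instruction="Which quarter had the highest revenue?",
        ),
    ),
    # Plain-string case: the judge receives a text-only prompt, exactly
    # like OutputEvaluator.
    Case(name="text-question", input="What is the capital of France?"),
]

evaluator = MultimodalOutputEvaluator(rubric=CORRECTNESS_RUBRIC_V0)
experiment = Experiment(cases=cases, evaluators=[evaluator])
```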
## MultimodalInput

`MultimodalInput` is the Pydantic model used for multimodal cases. It has three fields:

- `media` (`ImageData | list[ImageData] | str`): one or more image sources. A plain string is accepted as shorthand for a single `ImageData(source=...)`.
- `instruction` (`str`): the user’s question or request about the media.
- `context` (`str | None`): optional additional context surfaced to the judge.

`MultimodalInput` round-trips through `Experiment.to_dict` / `from_dict`: when a saved experiment is reloaded, raw dicts are coerced back to `MultimodalInput` at the prompt-composer boundary.
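A minimal sketch of both construction styles (the file path is a placeholder):

```python
from strands_evals.types import ImageData, MultimodalInput

# Explicit form: every field spelled out.
explicit = MultimodalInput(
    media=ImageData(source="/path/to/receipt.png"),
    instruction="What is the total amount on this receipt?",
    context="The receipt may include tax as a separate line item.",
)

# Shorthand form: a plain string for media is coerced to
# ImageData(source="/path/to/receipt.png").
shorthand = MultimodalInput(
    media="/path/to/receipt.png",
    instruction="What is the total amount on this receipt?",
)
```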
## ImageData

`ImageData` normalizes image sources so the judge receives raw bytes in the correct format. Accepted forms for the `source` field:
| Source | Example |
|---|---|
| Local file path | `"/path/to/chart.png"` or `pathlib.Path(...)` |
| Base64 string | `"iVBORw0KGgo..."` |
| Data URL | `"data:image/png;base64,iVBORw0KGgo..."` |
| HTTP(S) URL | `"https://example.com/image.jpg"` |
| Raw bytes | `b"\x89PNG..."` |
| PIL Image | `PIL.Image.Image` instance (requires Pillow) |
HTTP(S) URLs are auto-fetched via `urllib.request` from the standard library, so no extra dependencies are required. The `format` field (`"jpeg"`, `"png"`, `"gif"`, `"webp"`) is auto-detected from the file extension or data URL when omitted. S3 URIs and boto3 auto-fetch are not supported; pre-download S3 objects to bytes or a local path before constructing `ImageData`.
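A sketch of the accepted source forms in code (all values are placeholders; truncated strings stand in for real data):

```python
import pathlib

from strands_evals.types import ImageData

ImageData(source="/path/to/chart.png")                    # local file path
ImageData(source=pathlib.Path("/path/to/chart.png"))      # pathlib.Path works too
ImageData(source="iVBORw0KGgo...")                        # base64 string
ImageData(source="data:image/png;base64,iVBORw0KGgo...")  # data URL
ImageData(source="https://example.com/image.jpg")         # HTTP(S) URL, auto-fetched
ImageData(source=b"\x89PNG...")                           # raw bytes
ImageData(source="/path/to/clip.webp", format="webp")     # explicit format override

# PIL images are accepted directly (requires Pillow):
# from PIL import Image
# ImageData(source=Image.open("/path/to/chart.png"))
```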
## Built-in Rubrics

The following rubric constants are available from `strands_evals.evaluators.prompt_templates.multimodal`:

- `OVERALL_QUALITY_RUBRIC_V0`: Likert-5 scale (`0.0`, `0.25`, `0.5`, `0.75`, `1.0`) across visual accuracy, instruction adherence, completeness, and coherence
- `CORRECTNESS_RUBRIC_V0`: strict binary (`1.0`/`0.0`) fact-check against the image
- `FAITHFULNESS_RUBRIC_V0`: strict binary, catches hallucinations not grounded in the image
- `INSTRUCTION_FOLLOWING_RUBRIC_V0`: strict binary, checks count/format/scope/order/completeness/style constraints

Each built-in rubric is the default for the matching subclass (see Related Evaluators).
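Assuming the subclasses are exported alongside `MultimodalOutputEvaluator` (the import path below is an assumption), using a built-in rubric is a matter of picking the matching subclass:

```python
# Import path assumed to mirror MultimodalOutputEvaluator.
from strands_evals.evaluators import (
    MultimodalCorrectnessEvaluator,
    MultimodalOverallQualityEvaluator,
)

# Each subclass defaults to its built-in rubric, so no rubric argument is needed.
quality = MultimodalOverallQualityEvaluator()    # uses OVERALL_QUALITY_RUBRIC_V0
correctness = MultimodalCorrectnessEvaluator()   # uses CORRECTNESS_RUBRIC_V0
```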
## Scoring System

The score is a float between 0.0 and 1.0; the granularity is determined by the rubric:

- **Strict binary** (`CORRECTNESS_RUBRIC_V0`, `FAITHFULNESS_RUBRIC_V0`, `INSTRUCTION_FOLLOWING_RUBRIC_V0`): `1.0` on success, `0.0` on any violation. A response passes only if the score is `1.0`.
- **Likert-5** (`OVERALL_QUALITY_RUBRIC_V0`): `0.0`, `0.25`, `0.5`, `0.75`, or `1.0`. A response typically passes if the score is `>= 0.75`.
- **Custom rubric**: any granularity you define in the rubric text. Specify the pass threshold in the rubric itself (e.g., “Score 1.0 if …”).
## Basic Usage

### Reference-Free Evaluation

```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import OVERALL_QUALITY_RUBRIC_V0
from strands_evals.types import ImageData, MultimodalInput
from strands_evals.types.evaluation_report import EvaluationReport


def task_function(case: Case) -> str:
    # Replace with a call to your multimodal agent. For illustration we return
    # a fixed response so the evaluator has something to score.
    return "The chart is a bar chart showing quarterly revenue for three product lines in 2024."


cases = [
    Case(
        name="chart-overview",
        input=MultimodalInput(
            media="/path/to/your/chart.png",
            instruction="What kind of chart is shown and what does it represent?",
        ),
    ),
]

evaluator = MultimodalOutputEvaluator(rubric=OVERALL_QUALITY_RUBRIC_V0)
experiment = Experiment(cases=cases, evaluators=[evaluator])
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()
```

### Reference-Based Evaluation
Setting `expected_output` on a case switches the evaluator into reference-based mode. The configured `reference_suffix` is appended to the rubric so the judge compares the response against the reference answer.
```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalOutputEvaluator
from strands_evals.evaluators.prompt_templates.multimodal import CORRECTNESS_RUBRIC_V0
from strands_evals.types import ImageData, MultimodalInput

# Reuses the task_function defined in the reference-free example above.

cases = [
    Case(
        name="chart-value",
        input=MultimodalInput(
            media=ImageData(source="/path/to/revenue_chart.png"),
            instruction="What is the Q3 revenue for Product A?",
        ),
        expected_output="$4.2M",
    ),
]

evaluator = MultimodalOutputEvaluator(rubric=CORRECTNESS_RUBRIC_V0)
experiment = Experiment(cases=cases, evaluators=[evaluator])
reports = experiment.run_evaluations(task_function)
```

### Multiple Images per Case
Pass a list of `ImageData` (or a list of source strings) to evaluate responses that reason over several images at once:
```python
Case(
    name="before-after",
    input=MultimodalInput(
        media=[
            ImageData(source="/path/to/before.jpg"),
            ImageData(source="/path/to/after.jpg"),
        ],
        instruction="What changed between the two photos?",
        context="Both photos were taken at the same location one year apart.",
    ),
)
```

## Evaluation Output
Like `OutputEvaluator`, `MultimodalOutputEvaluator` returns `EvaluationOutput` objects with:

- `score`: Float between `0.0` and `1.0` (see Scoring System)
- `test_pass`: Boolean indicating if the test passed
- `reason`: String containing the judge’s reasoning for the score
- `label`: Optional label categorizing the result
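As a sketch of consuming these fields programmatically, continuing the Basic Usage example (this page only documents `run_display()`, so the attribute names used in the traversal below are hypothetical):

```python
reports = experiment.run_evaluations(task_function)

# Hypothetical traversal: the report object model is not specified on this
# page, so `evaluation_outputs` and its shape are assumptions for illustration.
for report in reports:
    for output in getattr(report, "evaluation_outputs", []):
        print(f"score={output.score} pass={output.test_pass}")
        print(f"reason: {output.reason}")
        if output.label is not None:
            print(f"label: {output.label}")
```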
## Best Practices

- **Use a Multimodal-Capable Judge**: The default Bedrock model must support image input. If you override `model`, confirm the model ID accepts image content blocks.
- **Keep Images at Reasonable Resolution**: Very large images increase latency and cost without improving judgment; resize to the minimum resolution needed for the task (see the resizing sketch after this list).
- **Write Rubrics That Reference the Image**: Phrases like “based on what is visible in the image” anchor the judge to the media rather than to its prior knowledge.
- **Use Reference Mode for Ground-Truth Tasks**: Chart QA and VQA benchmarks with known answers benefit from `expected_output` and the reference suffix.
- **Start with a Built-in Rubric**: The four built-in rubrics are the defaults that come with Strands Evals; customize only once you see where they fall short on your data.
## Related Evaluators

- `MultimodalOverallQualityEvaluator`: Likert-5 quality scoring
- `MultimodalCorrectnessEvaluator`: Strict binary factual correctness
- `MultimodalFaithfulnessEvaluator`: Strict binary hallucination detection against the image
- `MultimodalInstructionFollowingEvaluator`: Strict binary instruction compliance
- `OutputEvaluator`: Text-only parent class with the same `evaluate()` contract