Multimodal Overall Quality Evaluator
Overview
The `MultimodalOverallQualityEvaluator` assesses the overall quality of an agent response across four dimensions: visual accuracy, instruction adherence, completeness, and coherence/helpfulness. It produces a single Likert-5 score.
Key Features
- Output-Level Evaluation: Scores a single agent response per case
- Likert-5 Scoring: Score is one of `0.0`, `0.25`, `0.5`, `0.75`, `1.0`
- Automatic Reference Comparison: Appends a reference suffix to the rubric when `expected_output` is provided on the case (see the example after Basic Usage)
- Four-Dimension Rubric: Evaluates visual accuracy, instruction adherence, completeness, and coherence together
When to Use
Use the `MultimodalOverallQualityEvaluator` when you need to:
- Get a single, interpretable quality score for image-to-text responses
- Track overall quality trends across model or prompt versions
- Grade open-ended multimodal responses where binary judgments are too coarse
Evaluation Level
This evaluator operates at the `OUTPUT_LEVEL`, scoring a single agent response per case.
Parameters
Section titled “Parameters”rubric (optional)
- Type: `str | None`
- Default: `OVERALL_QUALITY_RUBRIC_V0`
- Description: Custom rubric. Leave unset to use the default rubric.
model (optional)
- Type: `Model | str | None`
- Default: `None` (uses the default Bedrock model)
- Description: Multimodal judge model.
include_inputs (optional)
- Type: `bool`
- Default: `True`
- Description: Whether to include the case inputs (the media and instruction) in the judge prompt.
system_prompt (optional)
- Type: `str | None`
- Default: `None` (uses the built-in `MLLM_JUDGE_SYSTEM_PROMPT`)
reference_suffix (optional)
- Type: `str | None`
- Default: An overall-quality-specific suffix that grades factual content rather than verbatim match
- Description: Override only if you want stricter or looser reference handling than the built-in default.
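All of these arguments can be passed at construction time. A minimal sketch; the model identifier below is a placeholder, not a value the library prescribes:

```python
from strands_evals.evaluators import MultimodalOverallQualityEvaluator

# Every argument is optional; omit one to keep the documented default.
evaluator = MultimodalOverallQualityEvaluator(
    model="your-multimodal-judge-model-id",  # placeholder; omit to use the default Bedrock model
    include_inputs=True,                     # pass the case inputs to the judge
)
```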
Scoring System
| Score | Label | Meaning |
|---|---|---|
| 1.0 | Excellent | Accurate, complete, directly addresses the instruction |
| 0.75 | Good | Mostly accurate with minor imprecisions |
| 0.5 | Average | Partially correct; misses important details or has minor errors |
| 0.25 | Poor | Weak response with multiple inaccuracies or significant omissions |
| 0.0 | Very Poor | Factually wrong, off-topic, or otherwise unhelpful |
A response typically passes if the score is >= 0.75.
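Because scores come from a fixed Likert-5 set, they map cleanly back to labels for reporting. A small helper, not part of the library, that applies the table above and the 0.75 pass threshold:

```python
# Label mapping from the scoring table above.
LIKERT_LABELS = {
    1.0: "Excellent",
    0.75: "Good",
    0.5: "Average",
    0.25: "Poor",
    0.0: "Very Poor",
}

def passes(score: float, threshold: float = 0.75) -> bool:
    # A response typically passes at "Good" (0.75) or better.
    return score >= threshold

print(LIKERT_LABELS[0.75], passes(0.75))  # -> Good True
```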
Basic Usage
```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalOverallQualityEvaluator
from strands_evals.types import MultimodalInput
from strands_evals.types.evaluation_report import EvaluationReport


def task_function(case: Case) -> str:
    # Replace with your multimodal agent invocation.
    return "The chart is a bar chart of quarterly revenue for products A, B, and C."


cases = [
    Case(
        name="chart-overview",
        input=MultimodalInput(
            media="/path/to/revenue_chart.png",
            instruction="What kind of chart is shown and what does it represent?",
        ),
    ),
]

experiment = Experiment(cases=cases, evaluators=[MultimodalOverallQualityEvaluator()])
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()
```
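To trigger the automatic reference comparison described under Key Features, set `expected_output` on a case; the evaluator then appends its reference suffix and grades factual content against the reference rather than requiring a verbatim match. A sketch, assuming `Case` accepts `expected_output` as a keyword argument:

```python
cases.append(
    Case(
        name="chart-overview-referenced",
        input=MultimodalInput(
            media="/path/to/revenue_chart.png",
            instruction="What kind of chart is shown and what does it represent?",
        ),
        # With a reference present, grading compares factual content,
        # not verbatim wording.
        expected_output="A bar chart of quarterly revenue for products A, B, and C.",
    )
)
```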
Related Evaluators
- `MultimodalOutputEvaluator`: Parent class with the full parameter reference
- `MultimodalCorrectnessEvaluator`: Strict binary factual correctness
- `MultimodalFaithfulnessEvaluator`: Strict binary hallucination detection
- `MultimodalInstructionFollowingEvaluator`: Strict binary instruction compliance