
Multimodal Overall Quality Evaluator

The MultimodalOverallQualityEvaluator assesses the overall quality of an agent response across four dimensions: visual accuracy, instruction adherence, completeness, and coherence/helpfulness. It produces a single Likert-5 score.

  • Output-Level Evaluation: Scores a single agent response per case
  • Likert-5 Scoring: Score is one of 0.0, 0.25, 0.5, 0.75, 1.0
  • Automatic Reference Comparison: Appends a reference suffix to the rubric when expected_output is provided on the case
  • Four-Dimension Rubric: Evaluates visual accuracy, instruction adherence, completeness, and coherence/helpfulness together

Use the MultimodalOverallQualityEvaluator when you need to:

  • Get a single, interpretable quality score for image-to-text responses
  • Track overall quality trends across model or prompt versions
  • Grade open-ended multimodal responses where binary judgments are too coarse

This evaluator operates at the OUTPUT_LEVEL, scoring a single agent response per case. It accepts the following configuration parameters:

  • Rubric
      • Type: str | None
      • Default: OVERALL_QUALITY_RUBRIC_V0
      • Description: Custom rubric. Leave unset to use the default rubric.
  • Judge model
      • Type: Model | str | None
      • Default: None (uses the default Bedrock model)
      • Description: Multimodal judge model used to grade responses.
  • Reference comparison
      • Type: bool
      • Default: True
      • Description: Whether to append the reference suffix to the rubric when expected_output is provided on the case.
  • Judge system prompt
      • Type: str | None
      • Default: None (uses the built-in MLLM_JUDGE_SYSTEM_PROMPT)
      • Description: Custom system prompt for the judge model.
  • Reference suffix
      • Type: str | None
      • Default: An overall-quality-specific suffix that grades factual content rather than verbatim match
      • Description: Override only if you want stricter or looser reference handling than the built-in default.
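
As a sketch, a customized evaluator might be constructed as follows. The keyword names (rubric, model, system_prompt) are assumptions inferred from the parameter descriptions above, not confirmed API; check the strands_evals reference for the exact signature.

evaluator = MultimodalOverallQualityEvaluator(
    rubric=None,                      # assumed keyword; None keeps OVERALL_QUALITY_RUBRIC_V0
    model="<your-bedrock-model-id>",  # assumed keyword; a Model instance should also work
    system_prompt=None,               # assumed keyword; None keeps MLLM_JUDGE_SYSTEM_PROMPT
)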
Score  Label      Meaning
1.0    Excellent  Accurate, complete, directly addresses the instruction
0.75   Good       Mostly accurate with minor imprecisions
0.5    Average    Partially correct; misses important details or has minor errors
0.25   Poor       Weak response with multiple inaccuracies or significant omissions
0.0    Very Poor  Factually wrong, off-topic, or unhelpful

A response typically passes if the score is >= 0.75.
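
For example, a simple pass/fail gate over the Likert-5 score (the threshold constant and helper below are illustrative, not part of the library):

PASS_THRESHOLD = 0.75

def passes(score: float) -> bool:
    # Scores take one of five values: 0.0, 0.25, 0.5, 0.75, 1.0.
    return score >= PASS_THRESHOLD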

A complete example:

from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalOverallQualityEvaluator
from strands_evals.types import MultimodalInput
from strands_evals.types.evaluation_report import EvaluationReport


def task_function(case: Case) -> str:
    # Replace with your multimodal agent invocation.
    return "The chart is a bar chart of quarterly revenue for products A, B, and C."


cases = [
    Case(
        name="chart-overview",
        input=MultimodalInput(
            media="/path/to/revenue_chart.png",
            instruction="What kind of chart is shown and what does it represent?",
        ),
    ),
]

# Run each case through the task function and score it with the evaluator.
experiment = Experiment(cases=cases, evaluators=[MultimodalOverallQualityEvaluator()])
reports = experiment.run_evaluations(task_function)

# Flatten per-case reports into a single view and display the results.
EvaluationReport.flatten(reports).run_display()
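
To exercise the automatic reference comparison, provide an expected_output on the case. A minimal sketch, assuming Case accepts expected_output as a keyword argument (per the behavior described above, the evaluator appends its reference suffix to the rubric when one is present):

reference_case = Case(
    name="chart-overview-with-reference",
    input=MultimodalInput(
        media="/path/to/revenue_chart.png",
        instruction="What kind of chart is shown and what does it represent?",
    ),
    # With expected_output set, the judge grades factual content against the
    # reference rather than requiring a verbatim match.
    expected_output="A bar chart of quarterly revenue for products A, B, and C.",
)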