Multimodal Faithfulness Evaluator

The MultimodalFaithfulnessEvaluator assesses whether an agent response is grounded in the image content, detecting hallucinations such as invented details, unsupported assumptions, external knowledge, and speculation.

  • Output-Level Evaluation: Scores a single agent response per case
  • Binary Scoring: 1.0 if fully grounded, 0.0 if any hallucination is found
  • Automatic Reference Comparison: Appends a reference suffix to the rubric when expected_output is provided on the case
  • Hallucination Detection: Designed to catch invented details, unsupported assumptions, and speculation
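For example, a case that carries an expected_output enables the reference comparison automatically. A minimal sketch, reusing the Case and MultimodalInput types from the usage example below; the expected_output field follows the description above, so verify it against your installed version:

from strands_evals import Case
from strands_evals.types import MultimodalInput

# Providing expected_output triggers the automatic reference comparison:
# a reference suffix is appended to the rubric before the judge scores the response.
case = Case(
    name="park-scene-with-reference",
    input=MultimodalInput(
        media="/path/to/picnic.jpg",
        instruction="Describe what is happening in the image.",
    ),
    expected_output="People are having a picnic on a lawn in a park.",
)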

Use the MultimodalFaithfulnessEvaluator when you need to:

  • Detect hallucinations in image captions or VQA answers
  • Verify that a response only states what is verifiable from the image
  • Screen for inferred-but-unseen details (emotions, off-screen events, brand names, locations)
  • Complement correctness checks with a groundedness check

This evaluator operates at the OUTPUT_LEVEL, scoring a single agent response per case.

The evaluator accepts the following optional configuration:

  • Rubric (str | None, default FAITHFULNESS_RUBRIC_V0): Custom rubric. Leave unset to use the default rubric.
  • Judge model (Model | str | None, default None, which uses the default Bedrock model): Multimodal judge model.
  • Reference comparison (bool, default True): Whether to append the reference suffix to the rubric when expected_output is provided on the case.
  • Judge system prompt (str | None, default None, which uses the built-in MLLM_JUDGE_SYSTEM_PROMPT): System prompt for the judge model.
  • Reference suffix (str | None, default None, which uses the built-in default suffix): Suffix appended to the rubric when comparing against expected_output.
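A configuration sketch; the keyword names below (rubric, model) are illustrative assumptions rather than confirmed parameter names, so check the MultimodalFaithfulnessEvaluator signature in your installed version:

from strands_evals.evaluators import MultimodalFaithfulnessEvaluator

# Keyword names are hypothetical -- verify them against the actual constructor.
evaluator = MultimodalFaithfulnessEvaluator(
    rubric="Score 1.0 only if every claim in the response is directly verifiable from the image.",
    model="your-bedrock-model-id",  # multimodal judge model; defaults to the library's Bedrock model
)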
Score   Label        Meaning
1.0     Faithful     Response only contains information verifiable from the image
0.0     Unfaithful   Response contains one or more hallucinations

A response passes only if the score is 1.0.

from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalFaithfulnessEvaluator
from strands_evals.types import MultimodalInput
from strands_evals.types.evaluation_report import EvaluationReport


def task_function(case: Case) -> str:
    # Replace with your multimodal agent invocation.
    return "A family is having a picnic in Central Park."


cases = [
    Case(
        name="park-scene",
        input=MultimodalInput(
            media="/path/to/picnic.jpg",
            instruction="Describe what is happening in the image.",
        ),
    ),
]

experiment = Experiment(cases=cases, evaluators=[MultimodalFaithfulnessEvaluator()])
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()
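Because scoring is binary, a quick summary of a run is the pass rate: the fraction of cases scored 1.0. A minimal sketch, assuming you have already pulled the per-case scores out of the flattened report (the exact report fields are not shown here and may differ in your version):

# Hypothetical: per-case faithfulness scores extracted from the evaluation report.
scores = [1.0, 0.0, 1.0, 1.0]

# A case passes only on a perfect 1.0; any hallucination drops it to 0.0.
pass_rate = sum(1 for s in scores if s == 1.0) / len(scores)
print(f"Faithful responses: {pass_rate:.0%}")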

Pair faithfulness with correctness to distinguish “wrong” (factually incorrect) from “ungrounded” (not supported by the image). Experiment.run_evaluations returns one report per evaluator, so use EvaluationReport.flatten to view them together:

from strands_evals import Experiment
from strands_evals.evaluators import (
    MultimodalCorrectnessEvaluator,
    MultimodalFaithfulnessEvaluator,
)
from strands_evals.types.evaluation_report import EvaluationReport

evaluators = [
    MultimodalCorrectnessEvaluator(),   # Are the claims factually right?
    MultimodalFaithfulnessEvaluator(),  # Are they supported by the image?
]

experiment = Experiment(cases=cases, evaluators=evaluators)
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()