Multimodal Faithfulness Evaluator
Overview
The `MultimodalFaithfulnessEvaluator` assesses whether an agent response is grounded in the image content, detecting hallucinations such as invented details, unsupported assumptions, external knowledge, and speculation.
Key Features
- Output-Level Evaluation: Scores a single agent response per case
- Binary Scoring: `1.0` if fully grounded, `0.0` if any hallucination is found
- Automatic Reference Comparison: Appends a reference suffix to the rubric when `expected_output` is provided on the case
- Hallucination Detection: Designed to catch invented details, unsupported assumptions, and speculation
When to Use
Use the `MultimodalFaithfulnessEvaluator` when you need to:
- Detect hallucinations in image captions or VQA answers
- Verify that a response only states what is verifiable from the image
- Screen for inferred-but-unseen details (emotions, off-screen events, brand names, locations)
- Complement correctness checks with a groundedness check
Evaluation Level
This evaluator operates at the `OUTPUT_LEVEL`, scoring a single agent response per case.
Parameters
Section titled “Parameters”rubric (optional)
- Type: `str | None`
- Default: `FAITHFULNESS_RUBRIC_V0`
- Description: Custom rubric. Leave unset to use the default rubric.
model (optional)
- Type: `Model | str | None`
- Default: `None` (uses the default Bedrock model)
- Description: Multimodal judge model.
include_inputs (optional)
- Type: `bool`
- Default: `True`
system_prompt (optional)
- Type: `str | None`
- Default: `None` (uses the built-in `MLLM_JUDGE_SYSTEM_PROMPT`)
reference_suffix (optional)
- Type: `str | None`
- Default: `None` (uses the built-in default suffix)
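Putting the parameters together, construction might look like the following. Only the parameter names come from the reference above; the rubric text, model identifier, and suffix text are illustrative placeholders:

```python
from strands_evals.evaluators import MultimodalFaithfulnessEvaluator

# Defaults: FAITHFULNESS_RUBRIC_V0 rubric, default Bedrock judge model,
# built-in system prompt and reference suffix.
evaluator = MultimodalFaithfulnessEvaluator()

# Customized judge. The rubric text, model ID, and suffix below are
# illustrative values, not library defaults.
strict_evaluator = MultimodalFaithfulnessEvaluator(
    rubric="Award 1.0 only if every claim is directly verifiable in the image.",
    model="placeholder-bedrock-model-id",  # replace with a real model ID
    include_inputs=True,
    reference_suffix="\nCompare against the provided reference description.",
)
```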
Scoring System
| Score | Label | Meaning |
|---|---|---|
| 1.0 | Faithful | Response only contains information verifiable from the image |
| 0.0 | Unfaithful | Response contains one or more hallucinations |
A response passes only if the score is 1.0.
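Because scoring is binary, aggregation reduces to counting exact `1.0` scores. A minimal sketch, assuming a plain list of scores (the `passed` and `pass_rate` names are hypothetical, not the library API):

```python
# Illustrative only: the pass criterion and a simple aggregate over cases.
def passed(score: float) -> bool:
    # A case passes only on an exact 1.0; any hallucination yields 0.0.
    return score == 1.0


def pass_rate(scores: list[float]) -> float:
    # Fraction of cases judged fully grounded.
    return sum(passed(s) for s in scores) / len(scores)
```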
Basic Usage
```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalFaithfulnessEvaluator
from strands_evals.types import MultimodalInput
from strands_evals.types.evaluation_report import EvaluationReport


def task_function(case: Case) -> str:
    # Replace with your multimodal agent invocation.
    return "A family is having a picnic in Central Park."


cases = [
    Case(
        name="park-scene",
        input=MultimodalInput(
            media="/path/to/picnic.jpg",
            instruction="Describe what is happening in the image.",
        ),
    ),
]

experiment = Experiment(cases=cases, evaluators=[MultimodalFaithfulnessEvaluator()])
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()
```
Combining with Other Evaluators
Pair with correctness to distinguish “wrong” from “ungrounded”. `Experiment.run_evaluations` returns one report per evaluator, so use `EvaluationReport.flatten` to view them together:
```python
from strands_evals import Experiment
from strands_evals.evaluators import (
    MultimodalCorrectnessEvaluator,
    MultimodalFaithfulnessEvaluator,
)
from strands_evals.types.evaluation_report import EvaluationReport

evaluators = [
    MultimodalCorrectnessEvaluator(),   # Are the claims factually right?
    MultimodalFaithfulnessEvaluator(),  # Are they supported by the image?
]

experiment = Experiment(cases=cases, evaluators=evaluators)
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()
```
Related Evaluators
- `MultimodalOutputEvaluator`: Parent class with full parameter reference
- `MultimodalCorrectnessEvaluator`: Strict factual correctness
- `FaithfulnessEvaluator`: Text-only counterpart grounded in conversation history