Multimodal Faithfulness Evaluator
Overview
The `MultimodalFaithfulnessEvaluator` assesses whether an agent response is grounded in the image content, detecting hallucinations such as invented details, unsupported assumptions, external knowledge, and speculation.
Key Features
- Output-Level Evaluation: Scores a single agent response per case
- Binary Scoring: `1.0` if fully grounded, `0.0` if any hallucination is found
- Automatic Reference Comparison: Appends a reference suffix to the rubric when `expected_output` is provided on the case
- Hallucination Detection: Designed to catch invented details, unsupported assumptions, and speculation
When to Use
Use the `MultimodalFaithfulnessEvaluator` when you need to:
- Detect hallucinations in image captions or VQA answers
- Verify that a response only states what is verifiable from the image
- Screen for inferred-but-unseen details (emotions, off-screen events, brand names, locations)
- Complement correctness checks with a groundedness check
Evaluation Level
This evaluator operates at the `OUTPUT_LEVEL`, scoring a single agent response per case.
Parameters
Section titled “Parameters”rubric (optional)
- Type: `str | None`
- Default: `FAITHFULNESS_RUBRIC_V0`
- Description: Custom rubric. Leave unset to use the default rubric.
model (optional)
- Type: `Model | str | None`
- Default: `None` (uses the default Bedrock model)
- Description: Multimodal judge model.
include_inputs (optional)
- Type: `bool`
- Default: `True`
system_prompt (optional)
- Type: `str | None`
- Default: `None` (uses the built-in `MLLM_JUDGE_SYSTEM_PROMPT`)
reference_suffix (optional)
- Type: `str | None`
- Default: `None` (uses the built-in default suffix)
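Putting the parameters together, construction might look like the following. Only the parameter names come from the reference above; the rubric text, model identifier, and suffix text are illustrative placeholders:

```python
from strands_evals.evaluators import MultimodalFaithfulnessEvaluator

# Defaults: FAITHFULNESS_RUBRIC_V0 rubric, default Bedrock judge model,
# built-in system prompt and reference suffix.
evaluator = MultimodalFaithfulnessEvaluator()

# Customized judge. The rubric text, model ID, and suffix below are
# illustrative values, not library defaults.
strict_evaluator = MultimodalFaithfulnessEvaluator(
    rubric="Award 1.0 only if every claim is directly verifiable in the image.",
    model="placeholder-bedrock-model-id",  # replace with a real model ID
    include_inputs=True,
    reference_suffix="\nCompare against the provided reference description.",
)
```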
Scoring System
| Score | Label | Meaning |
|---|---|---|
| 1.0 | Faithful | Response only contains information verifiable from the image |
| 0.0 | Unfaithful | Response contains one or more hallucinations |
A response passes only if the score is 1.0.
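Because scoring is binary, aggregation reduces to counting exact `1.0` scores. A minimal sketch, assuming a plain list of scores (the `passed` and `pass_rate` names are hypothetical, not the library API):

```python
# Illustrative only: the pass criterion and a simple aggregate over cases.
def passed(score: float) -> bool:
    # A case passes only on an exact 1.0; any hallucination yields 0.0.
    return score == 1.0


def pass_rate(scores: list[float]) -> float:
    # Fraction of cases judged fully grounded.
    return sum(passed(s) for s in scores) / len(scores)
```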
Basic Usage
```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import MultimodalFaithfulnessEvaluator
from strands_evals.types import MultimodalInput
from strands_evals.types.evaluation_report import EvaluationReport


def task_function(case: Case) -> str:
    # Replace with your multimodal agent invocation.
    return "A family is having a picnic in Central Park."


cases = [
    Case(
        name="park-scene",
        input=MultimodalInput(
            media="/path/to/picnic.jpg",
            instruction="Describe what is happening in the image.",
        ),
    ),
]

experiment = Experiment(cases=cases, evaluators=[MultimodalFaithfulnessEvaluator()])
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()
```
Combining with Other Evaluators
Pair with correctness to distinguish “wrong” from “ungrounded”. `Experiment.run_evaluations` returns one report per evaluator, so use `EvaluationReport.flatten` to view them together:
```python
from strands_evals import Experiment
from strands_evals.evaluators import (
    MultimodalCorrectnessEvaluator,
    MultimodalFaithfulnessEvaluator,
)
from strands_evals.types.evaluation_report import EvaluationReport

evaluators = [
    MultimodalCorrectnessEvaluator(),   # Are the claims factually right?
    MultimodalFaithfulnessEvaluator(),  # Are they supported by the image?
]

experiment = Experiment(cases=cases, evaluators=evaluators)
reports = experiment.run_evaluations(task_function)
EvaluationReport.flatten(reports).run_display()
```
Related Evaluators
- `MultimodalOutputEvaluator`: Parent class with full parameter reference
- `MultimodalCorrectnessEvaluator`: Strict factual correctness
- `FaithfulnessEvaluator`: Text-only counterpart grounded in conversation history