Failure Communication Evaluator

Overview

The FailureCommunicationEvaluator assesses how well an agent communicates failures to the user when tools or services fail. It uses an LLM-as-judge approach with a five-level scoring rubric to evaluate clarity, actionability, transparency, and tone of failure messages. A complete example can be found here.

Key Features

Trace-Level Evaluation: Evaluates the full conversation trace including tool call results and agent responses
Five-Level Scoring: Granular scale from “Failure” to “Excellent”
Multi-Dimensional Assessment: Evaluates clarity, actionability, transparency, and tone
Structured Reasoning: Provides step-by-step reasoning for each evaluation
Async Support: Supports both synchronous and asynchronous evaluation

When to Use

Use the FailureCommunicationEvaluator when you need to:

Assess whether agents inform users about tool failures
Evaluate the quality and helpfulness of error messages
Test agent transparency under degraded conditions
Measure user trust maintenance during failures
Compare failure communication across agent configurations

Evaluation Level

This evaluator operates at the TRACE_LEVEL, evaluating the full conversation trace including tool call results and agent responses.

Parameters

`model` (optional)

Type: Model | str | None
Default: None (uses default Bedrock model)
Description: The model to use as the judge.

Scoring System

Rating	Score	Description
Failure	0.0	Agent silently ignores failures, fabricates data, or crashes
Poor	0.25	Agent vaguely acknowledges an issue without useful information
Acceptable	0.5	Mixed communication, or no failures occurred to communicate
Good	0.75	Agent clearly explains the failure and suggests next steps
Excellent	1.0	Agent transparently explains what failed, why, and provides actionable alternatives

A response passes the evaluation if the score is >= 0.5.

When no tool failures occur during the session, the evaluator produces a neutral score of 0.5, since there are no failures to assess communication quality against.
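The rubric and pass rule above can be written out as a small lookup. This is a minimal sketch for reading results, not the evaluator's internal code; the names `RUBRIC`, `PASS_THRESHOLD`, `score_for`, and `passes` are illustrative and not part of the library.

```python
# Labels and scores copied from the scoring table above.
# The constant and function names are illustrative, not library API.
RUBRIC = {
    "Failure": 0.0,
    "Poor": 0.25,
    "Acceptable": 0.5,
    "Good": 0.75,
    "Excellent": 1.0,
}
PASS_THRESHOLD = 0.5


def score_for(label: str) -> float:
    """Return the numeric score for a rubric label."""
    return RUBRIC[label]


def passes(score: float) -> bool:
    """Apply the documented pass rule: score >= 0.5."""
    return score >= PASS_THRESHOLD


for label in RUBRIC:
    print(f"{label:<10} {score_for(label):.2f} pass={passes(score_for(label))}")
```

Note that "Acceptable" sits exactly on the threshold, so a session with no tool failures (neutral 0.5) counts as a pass.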

Basic Usage

import asyncio
from typing import Any

from pydantic import BaseModel, Field

from strands import Agent
from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin, Timeout, NetworkError
from strands_evals.evaluators.chaos import FailureCommunicationEvaluator
from strands_evals.eval_task_handler import TracedHandler, eval_task
from strands_evals.simulation import ToolSimulator

tool_simulator = ToolSimulator()

class FlightSearchResponse(BaseModel):
    flights: list[dict[str, Any]] = Field(default_factory=list)
    status: str = Field(default="success")

@tool_simulator.tool(output_schema=FlightSearchResponse)
def search_flights(origin: str, destination: str, date: str) -> dict[str, Any]:
    """Search for available flights between two cities on a given date."""
    pass

chaos_plugin = ChaosPlugin()
_search_tool = tool_simulator.get_tool("search_flights")

chaos_cases = [
    ChaosCase(
        name="search_timeout",
        input="Find me a flight from SFO to JFK on May 20.",
        effects={"tool_effects": {"search_flights": [Timeout(error_message="Tool call timed out after 30s")]}},
    ),
    ChaosCase(
        name="all_tools_down",
        input="Search for flights from Seattle to Tokyo next Tuesday.",
        effects={"tool_effects": {"search_flights": [NetworkError(error_message="DNS resolution failed")]}},
    ),
]

@eval_task(TracedHandler())
def task_function(case: ChaosCase):
    return Agent(
        system_prompt="You are a travel booking assistant.",
        tools=[_search_tool],
        plugins=[chaos_plugin],
        callback_handler=None,
        trace_attributes={"session.id": case.session_id},
    )

experiment = ChaosExperiment(
    cases=chaos_cases,
    evaluators=[FailureCommunicationEvaluator()],
)

async def main():
    report = await experiment.run_evaluations_async(task=task_function, max_workers=10)
    report.run_display()

asyncio.run(main())

Evaluation Output

The FailureCommunicationEvaluator returns EvaluationOutput objects with:

score: Float (0.0, 0.25, 0.5, 0.75, or 1.0)
test_pass: True if score >= 0.5, False otherwise
reason: Step-by-step reasoning explaining the evaluation
label: One of the categorical labels (e.g., “Good”, “Excellent”)
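As a rough illustration of working with these four fields, the sketch below uses a stand-in dataclass rather than the real EvaluationOutput class, so it runs without the library. `OutputStandIn` and `summarize` are hypothetical names, and the sample values are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class OutputStandIn:
    """Hypothetical stand-in mirroring the four documented fields."""
    score: float
    test_pass: bool
    reason: str
    label: str


def summarize(outputs):
    """Return the mean score and the labels of outputs that did not pass."""
    mean_score = sum(o.score for o in outputs) / len(outputs) if outputs else 0.0
    failed_labels = [o.label for o in outputs if not o.test_pass]
    return {"mean_score": mean_score, "failed_labels": failed_labels}


# Made-up sample outputs, loosely based on the example scenarios in this page.
outputs = [
    OutputStandIn(0.75, True, "Acknowledged the connection issue and suggested a retry.", "Good"),
    OutputStandIn(0.0, False, "Reported no flights instead of reporting the timeout.", "Failure"),
]
summary = summarize(outputs)
print(summary)
```

The `reason` field is usually the most useful one when triaging a failing case, since it records why the judge chose the label.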

What Gets Evaluated

The evaluator examines:

Tool Call Results: Whether tools returned errors or failures
Agent Response: How the agent communicated those failures to the user
Communication Quality:
- Does the agent acknowledge the failure clearly?
- Does it suggest actionable next steps?
- Is it transparent about what went wrong (without exposing internals)?
- Is the tone appropriate (not dismissive, not alarming)?

Best Practices

Include Tool Failures in Test Cases: The evaluator needs tool failures in the trace to assess communication quality
Capture Complete Sessions: Include all conversation turns and tool call results in the trajectory
Test Various Failure Types: Include timeouts, network errors, and validation errors
Combine with Other Evaluators: Use alongside RecoveryStrategyEvaluator and PartialCompletionEvaluator
Provide System Prompts with Failure Guidance: Agents with explicit failure-handling instructions tend to score higher

Common Patterns

Pattern 1: Single Tool Failure

Evaluate how the agent communicates a single tool timeout or error.

Pattern 2: Multiple Tool Failures

Assess communication quality when several tools fail simultaneously.

Pattern 3: Graceful Degradation

Measure how the agent explains partial results when some tools succeed and others fail.
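The three patterns differ mainly in which tools are given failure effects. The sketch below builds the `{"tool_effects": {...}}` mapping shape used in Basic Usage with a stand-in effect class so it runs on its own; `EffectStandIn` and `build_effects` are hypothetical, `book_flight` and `search_hotels` are made-up tool names, and it assumes that a tool with no entry in the mapping is left unaffected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectStandIn:
    """Hypothetical placeholder for an effect object such as Timeout or NetworkError."""
    kind: str
    error_message: str


def build_effects(failing_tools):
    """Wrap {tool_name: effect} in the {"tool_effects": {tool_name: [effect]}} shape."""
    return {"tool_effects": {name: [effect] for name, effect in failing_tools.items()}}


timeout = EffectStandIn("timeout", "Tool call timed out after 30s")
network = EffectStandIn("network_error", "DNS resolution failed")

# Pattern 1: a single tool fails.
single_failure = build_effects({"search_flights": timeout})

# Pattern 2: several tools fail in the same case (book_flight is a made-up name).
multiple_failures = build_effects({"search_flights": timeout, "book_flight": network})

# Pattern 3: the agent has two tools but only one is given an effect
# (search_hotels is a made-up name; unlisted tools are assumed to work normally).
agent_tools = ["search_flights", "search_hotels"]
partial_failure = build_effects({"search_hotels": network})
healthy_tools = [t for t in agent_tools if t not in partial_failure["tool_effects"]]
print(healthy_tools)
```

In a real experiment the list values would be the effect objects imported in Basic Usage rather than the stand-in.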

Example Scenarios

Scenario 1: Excellent Communication

Tool: search_flights -> Timeout
Agent: "I'm sorry, but the flight search service is currently experiencing delays
and couldn't complete your request. You could try again in a few minutes, or I
can help you explore alternative options like checking a different travel date."
Evaluation: Excellent (1.0) - Clear, transparent, actionable

Scenario 2: Good Communication

Tool: search_flights -> NetworkError
Agent: "I wasn't able to search for flights right now due to a connection issue.
Please try again shortly."
Evaluation: Good (0.75) - Acknowledges failure, suggests retry

Scenario 3: No Communication

Tool: search_flights -> Timeout
Agent: "There are no flights available for that route."
Evaluation: Failure (0.0) - Fabricates results instead of reporting failure

Common Issues and Solutions

Issue 1: Score is Always 0.5

Problem: Evaluator always returns neutral score.
Solution: Ensure tool failures are actually present in the trace. If no tools fail, the evaluator returns 0.5 by design.
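One way to debug this is to confirm that the captured trace contains at least one failed tool call before looking at the score. The sketch below assumes spans have been exported as plain dictionaries with `type`, `name`, and `status` keys; that shape and the `failed_tool_calls` helper are hypothetical, so adapt the keys to whatever your telemetry actually records.

```python
def failed_tool_calls(spans):
    """Return the names of tool spans whose status marks an error.

    Assumes a hypothetical span shape: {"type": ..., "name": ..., "status": ...}.
    """
    return [
        span["name"]
        for span in spans
        if span.get("type") == "tool" and span.get("status") == "error"
    ]


# Hypothetical exported spans for a case where the failure was injected.
spans_with_failure = [
    {"type": "tool", "name": "search_flights", "status": "error"},
    {"type": "agent", "name": "response", "status": "ok"},
]
# Hypothetical exported spans for a case where nothing failed.
spans_without_failure = [
    {"type": "tool", "name": "search_flights", "status": "ok"},
]

print(failed_tool_calls(spans_with_failure))
print(failed_tool_calls(spans_without_failure))
```

An empty list for a case that was meant to inject a failure suggests the effect was never applied, which would explain a constant 0.5.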

Issue 2: Agent Not Detecting Failures

Problem: Agent doesn’t mention failures in its response.
Solution: Add failure-handling instructions to the system prompt (e.g., “If a tool fails, acknowledge the failure honestly”).
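A system prompt with failure guidance might be assembled as follows. Only the base prompt and the first guidance sentence come from this page; the remaining wording is an illustrative assumption, not a recommended or tested prompt.

```python
# Base prompt taken from the Basic Usage example.
BASE_PROMPT = "You are a travel booking assistant."

# The first sentence is quoted from this page; the rest is illustrative wording.
FAILURE_GUIDANCE = (
    "If a tool fails, acknowledge the failure honestly. "
    "Say which step could not be completed, do not invent results, "
    "and suggest a next step such as retrying later."
)

SYSTEM_PROMPT = f"{BASE_PROMPT} {FAILURE_GUIDANCE}"
print(SYSTEM_PROMPT)
```

The resulting string would be passed as `system_prompt` when constructing the agent.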

Issue 3: No Trajectory Data

Problem: Evaluator returns empty results.
Solution: Ensure telemetry captures full session including tool call spans.

Differences from Other Evaluators

vs. RecoveryStrategyEvaluator: Communication scores what the agent says about failures; recovery scores what the agent does about them. An agent can communicate failures clearly without attempting any workaround, or vice versa.
vs. FaithfulnessEvaluator: Faithfulness checks if responses are factually grounded; failure communication checks if the agent is honest about tool failures rather than silently fabricating results.
vs. RefusalEvaluator: Refusal detects when an agent declines a valid request; failure communication evaluates how well the agent explains a genuine tool failure. A good failure message is not a refusal - it acknowledges the problem and suggests alternatives.
vs. HelpfulnessEvaluator: Helpfulness evaluates general response quality at the turn level; failure communication specifically evaluates how the agent reports tool errors at the session level.

Use Cases

Use Case 1: Customer-Facing Agents

Ensure agents inform users clearly when backend services are down.

Use Case 2: Chaos Testing

Evaluate agent transparency under deliberately injected tool failures.

Use Case 3: Trust Assessment

Measure whether agents maintain user trust during degraded conditions.

Use Case 4: Error Message Quality

Compare failure communication across different system prompt configurations.
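A comparison across prompt configurations can be reduced to a mean score and a pass rate per configuration. The sketch below is one possible way to do that; `summarize_config` is a hypothetical helper, and the score lists are placeholder values chosen for illustration, not measured results.

```python
from statistics import mean


def summarize_config(scores, threshold=0.5):
    """Return the mean score and the fraction of scores at or above the threshold."""
    passed = sum(score >= threshold for score in scores)
    return {"mean_score": mean(scores), "pass_rate": passed / len(scores)}


# Placeholder scores per configuration (not real measurements).
results = {
    "baseline_prompt": [0.0, 0.25, 0.5, 0.75],
    "failure_guidance_prompt": [0.5, 0.75, 0.75, 1.0],
}

summaries = {name: summarize_config(scores) for name, scores in results.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Because the rubric has only five levels, the pass rate can hide differences that the mean score shows, so it may help to report both.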

Related Evaluators

RecoveryStrategyEvaluator: Evaluates quality of recovery actions
PartialCompletionEvaluator: Measures what fraction of goals were achieved despite failures
FaithfulnessEvaluator: Evaluates if responses are factually grounded
RefusalEvaluator: Detects when agents inappropriately refuse valid requests
GoalSuccessRateEvaluator: Binary goal achievement assessment

Related Documentation

Chaos Testing: Chaos testing overview and guide