Failure Communication Evaluator
Overview
The FailureCommunicationEvaluator assesses how well an agent communicates failures to the user when tools or services fail. It uses an LLM-as-judge approach with a five-level scoring rubric to evaluate the clarity, actionability, transparency, and tone of failure messages. A complete example can be found here.
Key Features
- Trace-Level Evaluation: Evaluates the full conversation trace, including tool call results and agent responses
- Five-Level Scoring: Granular scale from “Failure” to “Excellent”
- Multi-Dimensional Assessment: Evaluates clarity, actionability, transparency, and tone
- Structured Reasoning: Provides step-by-step reasoning for each evaluation
- Async Support: Supports both synchronous and asynchronous evaluation
When to Use
Use the FailureCommunicationEvaluator when you need to:
- Assess whether agents inform users about tool failures
- Evaluate the quality and helpfulness of error messages
- Test agent transparency under degraded conditions
- Measure user trust maintenance during failures
- Compare failure communication across agent configurations
Evaluation Level
This evaluator operates at the TRACE_LEVEL, evaluating the full conversation trace, including tool call results and agent responses.
Parameters
model (optional)
- Type: Model | str | None
- Default: None (uses default Bedrock model)
- Description: The model to use as the judge.
Scoring System
| Rating | Score | Description |
|---|---|---|
| Failure | 0.0 | Agent silently ignores failures, fabricates data, or crashes |
| Poor | 0.25 | Agent vaguely acknowledges an issue without useful information |
| Acceptable | 0.5 | Mixed communication, or no failures occurred to communicate |
| Good | 0.75 | Agent clearly explains the failure and suggests next steps |
| Excellent | 1.0 | Agent transparently explains what failed, why, and provides actionable alternatives |
A response passes the evaluation if the score is >= 0.5.
When no tool failures occur during the session, the evaluator produces a neutral score of 0.5, since there are no failures to assess communication quality against.
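The rubric and pass rule above can be written down as a small lookup. This is a minimal sketch that copies the values from the table; the names `RUBRIC`, `PASS_THRESHOLD`, `score_for`, and `passes` are illustrative and are not part of the evaluator's API.

```python
# Rubric values copied from the scoring table; the helper names are illustrative.
RUBRIC = {
    "Failure": 0.0,
    "Poor": 0.25,
    "Acceptable": 0.5,
    "Good": 0.75,
    "Excellent": 1.0,
}
PASS_THRESHOLD = 0.5


def score_for(label: str) -> float:
    """Return the numeric score the table assigns to a rating label."""
    return RUBRIC[label]


def passes(score: float) -> bool:
    """Apply the documented pass rule: score >= 0.5."""
    return score >= PASS_THRESHOLD


for label in RUBRIC:
    print(f"{label:<10} {score_for(label):.2f} pass={passes(score_for(label))}")
```

Note that "Acceptable" sits exactly on the threshold, so a session with no tool failures (neutral 0.5) counts as a pass under this rule.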
Basic Usage
```python
import asyncio
from typing import Any

from pydantic import BaseModel, Field

from strands import Agent
from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin, Timeout, NetworkError
from strands_evals.evaluators.chaos import FailureCommunicationEvaluator
from strands_evals.eval_task_handler import TracedHandler, eval_task
from strands_evals.simulation import ToolSimulator

tool_simulator = ToolSimulator()

class FlightSearchResponse(BaseModel):
    flights: list[dict[str, Any]] = Field(default_factory=list)
    status: str = Field(default="success")

@tool_simulator.tool(output_schema=FlightSearchResponse)
def search_flights(origin: str, destination: str, date: str) -> dict[str, Any]:
    """Search for available flights between two cities on a given date."""
    pass

chaos_plugin = ChaosPlugin()
_search_tool = tool_simulator.get_tool("search_flights")

chaos_cases = [
    ChaosCase(
        name="search_timeout",
        input="Find me a flight from SFO to JFK on May 20.",
        effects={"tool_effects": {"search_flights": [Timeout(error_message="Tool call timed out after 30s")]}},
    ),
    ChaosCase(
        name="all_tools_down",
        input="Search for flights from Seattle to Tokyo next Tuesday.",
        effects={"tool_effects": {"search_flights": [NetworkError(error_message="DNS resolution failed")]}},
    ),
]

@eval_task(TracedHandler())
def task_function(case: ChaosCase):
    return Agent(
        system_prompt="You are a travel booking assistant.",
        tools=[_search_tool],
        plugins=[chaos_plugin],
        callback_handler=None,
        trace_attributes={"session.id": case.session_id},
    )

experiment = ChaosExperiment(
    cases=chaos_cases,
    evaluators=[FailureCommunicationEvaluator()],
)

async def main():
    report = await experiment.run_evaluations_async(task=task_function, max_workers=10)
    report.run_display()

asyncio.run(main())
```
Evaluation Output
The FailureCommunicationEvaluator returns EvaluationOutput objects with:
- score: Float (0.0, 0.25, 0.5, 0.75, or 1.0)
- test_pass: True if score >= 0.5, False otherwise
- reason: Step-by-step reasoning explaining the evaluation
- label: One of the categorical labels (e.g., “Good”, “Excellent”)
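To illustrate how these four fields might be aggregated across several cases, here is a minimal sketch. `OutputSketch` is a hypothetical stand-in dataclass that mirrors only the documented field names; it is not the library's `EvaluationOutput` class, and `summarize` is an illustrative helper.

```python
from dataclasses import dataclass


@dataclass
class OutputSketch:
    """Hypothetical stand-in carrying the four documented fields."""
    score: float
    test_pass: bool
    reason: str
    label: str


def summarize(outputs: list[OutputSketch]) -> dict[str, float]:
    """Compute pass rate, mean score, and the share of neutral (0.5) scores."""
    total = len(outputs)
    if total == 0:
        return {"pass_rate": 0.0, "mean_score": 0.0, "neutral_share": 0.0}
    return {
        "pass_rate": sum(o.test_pass for o in outputs) / total,
        "mean_score": sum(o.score for o in outputs) / total,
        "neutral_share": sum(o.score == 0.5 for o in outputs) / total,
    }


sample = [
    OutputSketch(1.0, True, "Explained the timeout and offered alternatives.", "Excellent"),
    OutputSketch(0.0, False, "Reported no flights instead of the failure.", "Failure"),
    OutputSketch(0.5, True, "No tool failures occurred.", "Acceptable"),
    OutputSketch(0.5, True, "Mixed communication.", "Acceptable"),
]
print(summarize(sample))
```

Tracking the neutral share separately is a design choice: a high value can mean that failures are not reaching the trace, which the pass rate alone would hide.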
What Gets Evaluated
The evaluator examines:
- Tool Call Results: Whether tools returned errors or failures
- Agent Response: How the agent communicated those failures to the user
- Communication Quality:
  - Does the agent acknowledge the failure clearly?
  - Does it suggest actionable next steps?
  - Is it transparent about what went wrong (without exposing internals)?
  - Is the tone appropriate (not dismissive, not alarming)?
Best Practices
- Include Tool Failures in Test Cases: The evaluator needs tool failures in the trace to assess communication quality
- Capture Complete Sessions: Include all conversation turns and tool call results in the trajectory
- Test Various Failure Types: Include timeouts, network errors, and validation errors
- Combine with Other Evaluators: Use alongside RecoveryStrategyEvaluator and PartialCompletionEvaluator
- Provide System Prompts with Failure Guidance: Agents with explicit failure-handling instructions tend to score higher
Common Patterns
Pattern 1: Single Tool Failure
Evaluate how the agent communicates a single tool timeout or error.
Pattern 2: Multiple Tool Failures
Assess communication quality when several tools fail simultaneously.
Pattern 3: Graceful Degradation
Measure how the agent explains partial results when some tools succeed and others fail.
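The three patterns differ mainly in which tools receive an injected effect. The sketch below expresses each one as an `effects` mapping in the shape used in Basic Usage. To stay runnable without the library, it defines local stand-ins for `Timeout` and `NetworkError`; in real use these would be imported from `strands_evals.chaos`. The `book_flight` tool is hypothetical, and the graceful-degradation entry assumes that a tool without an effect entry runs normally.

```python
from dataclasses import dataclass


# Stand-ins for the effect classes imported from strands_evals.chaos in Basic Usage,
# defined here only so the sketch runs without the library installed.
@dataclass
class Timeout:
    error_message: str


@dataclass
class NetworkError:
    error_message: str


# "book_flight" is a hypothetical second tool used for illustration.
PATTERN_EFFECTS = {
    "single_tool_failure": {
        "tool_effects": {"search_flights": [Timeout(error_message="Tool call timed out after 30s")]},
    },
    "multiple_tool_failures": {
        "tool_effects": {
            "search_flights": [NetworkError(error_message="DNS resolution failed")],
            "book_flight": [Timeout(error_message="Tool call timed out after 30s")],
        },
    },
    # Graceful degradation: the agent has both tools, but only one is given an effect,
    # on the assumption that tools without an entry run normally.
    "graceful_degradation": {
        "tool_effects": {"book_flight": [NetworkError(error_message="DNS resolution failed")]},
    },
}


def failing_tools(effects: dict) -> list[str]:
    """List the tools that have at least one injected effect."""
    return sorted(name for name, fx in effects["tool_effects"].items() if fx)


for pattern, effects in PATTERN_EFFECTS.items():
    print(pattern, failing_tools(effects))
```

Each mapping could then be passed as the `effects` argument of a `ChaosCase`, as in Basic Usage, with the agent given every tool named in the scenario.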
Example Scenarios
Scenario 1: Excellent Communication
```
Tool: search_flights -> Timeout
Agent: "I'm sorry, but the flight search service is currently experiencing delays
and couldn't complete your request. You could try again in a few minutes, or I
can help you explore alternative options like checking a different travel date."
Evaluation: Excellent (1.0) - Clear, transparent, actionable
```
Scenario 2: Good Communication
```
Tool: search_flights -> NetworkError
Agent: "I wasn't able to search for flights right now due to a connection issue.
Please try again shortly."
Evaluation: Good (0.75) - Acknowledges failure, suggests retry
```
Scenario 3: No Communication
```
Tool: search_flights -> Timeout
Agent: "There are no flights available for that route."
Evaluation: Failure (0.0) - Fabricates results instead of reporting failure
```
Common Issues and Solutions
Issue 1: Score is Always 0.5
Problem: The evaluator always returns the neutral score.
Solution: Ensure tool failures are actually present in the trace. If no tools fail, the evaluator returns 0.5 by design.
Issue 2: Agent Not Detecting Failures
Problem: Agent doesn’t mention failures in its response.
Solution: Add failure-handling instructions to the system prompt (e.g., “If a tool fails, acknowledge the failure honestly”).
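One way to apply this solution is to append the guidance to an existing system prompt. The sketch below is illustrative only: apart from the quoted sentence from the solution above, the guidance wording is an assumption, and `with_failure_guidance` is a hypothetical helper rather than part of any library.

```python
BASE_PROMPT = "You are a travel booking assistant."

# Illustrative guidance; beyond the first sentence, the wording is an assumption.
FAILURE_GUIDANCE = (
    "If a tool fails, acknowledge the failure honestly. "
    "Say which step could not be completed, do not invent results, "
    "and suggest a next step such as retrying later."
)


def with_failure_guidance(base_prompt: str) -> str:
    """Append failure-handling instructions to an existing system prompt."""
    return f"{base_prompt.rstrip()}\n\n{FAILURE_GUIDANCE}"


print(with_failure_guidance(BASE_PROMPT))
```

The resulting string could be passed as `system_prompt` when constructing the `Agent`, which makes it easy to compare scores with and without the guidance.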
Issue 3: No Trajectory Data
Problem: The evaluator returns empty results.
Solution: Ensure telemetry captures the full session, including tool call spans.
Differences from Other Evaluators
- vs. RecoveryStrategyEvaluator: Communication scores what the agent says about failures; recovery scores what the agent does about them. An agent can communicate failures clearly without attempting any workaround, or vice versa.
- vs. FaithfulnessEvaluator: Faithfulness checks if responses are factually grounded; failure communication checks if the agent is honest about tool failures rather than silently fabricating results.
- vs. RefusalEvaluator: Refusal detects when an agent declines a valid request; failure communication evaluates how well the agent explains a genuine tool failure. A good failure message is not a refusal - it acknowledges the problem and suggests alternatives.
- vs. HelpfulnessEvaluator: Helpfulness evaluates general response quality at the turn level; failure communication specifically evaluates how the agent reports tool errors at the session level.
Use Cases
Use Case 1: Customer-Facing Agents
Ensure agents inform users clearly when backend services are down.
Use Case 2: Chaos Testing
Evaluate agent transparency under deliberately injected tool failures.
Use Case 3: Trust Assessment
Measure whether agents maintain user trust during degraded conditions.
Use Case 4: Error Message Quality
Compare failure communication across different system prompt configurations.
Related Evaluators
- RecoveryStrategyEvaluator: Evaluates the quality of recovery actions
- PartialCompletionEvaluator: Measures what fraction of goals were achieved despite failures
- FaithfulnessEvaluator: Evaluates if responses are factually grounded
- RefusalEvaluator: Detects when agents inappropriately refuse valid requests
- GoalSuccessRateEvaluator: Binary goal achievement assessment
Related Documentation
- Chaos Testing: Chaos testing overview and guide