Recovery Strategy Evaluator
Overview
The RecoveryStrategyEvaluator scores the quality of an agent’s recovery actions when tools fail. It evaluates whether the agent attempts alternative approaches, retries appropriately, and varies its strategies rather than repeating the same failed action. A complete example can be found here.
Key Features
- Trace-Level Evaluation: Evaluates the full conversation trace including tool call patterns, retries, and alternative approaches
- Five-Level Scoring: Granular scale from “Failure” to “Excellent”
- Multi-Dimensional Assessment: Evaluates exploration breadth, retry discipline, and approach variation
- Structured Reasoning: Provides step-by-step reasoning for each evaluation
- Async Support: Supports both synchronous and asynchronous evaluation
When to Use
Use the RecoveryStrategyEvaluator when you need to:
- Assess whether agents attempt alternative approaches when tools fail
- Evaluate retry behavior (appropriate retries vs. infinite loops)
- Detect agents that give up immediately on first failure
- Measure quality and variety of recovery strategies
- Compare recovery sophistication across agent configurations
Evaluation Level
This evaluator operates at the TRACE_LEVEL, evaluating the full conversation trace including tool call patterns, retries, and alternative approaches.
Parameters
model (optional)
- Type: Model | str | None
- Default: None (uses default Bedrock model)
- Description: The model to use as the judge.
Scoring System
| Rating | Score | Description |
|---|---|---|
| Failure | 0.0 | Agent gives up immediately or crashes on first failure |
| Poor | 0.25 | Agent retries the same failed action with no variation |
| Acceptable | 0.5 | Minimal recovery, or no failures occurred to recover from |
| Good | 0.75 | Agent retries with variation or tries alternative tools |
| Excellent | 1.0 | Agent demonstrates sophisticated recovery: retries, fallbacks, escalation, and adaptation |
A response passes the evaluation if the score is >= 0.5.
When no tool failures occur during the session, the evaluator produces a neutral score of 0.5, since there are no failures to assess recovery behavior against.
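The scale and pass rule above can be written down as a small lookup. This is only an illustrative sketch: RATING_SCORES, PASS_THRESHOLD, and passes are hypothetical names chosen for this example and are not part of the strands_evals API.

```python
# Hypothetical names for illustration; not part of the strands_evals API.
RATING_SCORES = {
    "Failure": 0.0,
    "Poor": 0.25,
    "Acceptable": 0.5,
    "Good": 0.75,
    "Excellent": 1.0,
}
PASS_THRESHOLD = 0.5  # documented rule: pass if score >= 0.5


def passes(rating: str) -> bool:
    """Return True when the rating's score meets the documented pass threshold."""
    return RATING_SCORES[rating] >= PASS_THRESHOLD


passing = [rating for rating in RATING_SCORES if passes(rating)]
print(passing)  # ['Acceptable', 'Good', 'Excellent']
```

Note that the neutral 0.5 given to sessions without failures sits exactly on the threshold, so such sessions pass.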
Basic Usage
```python
import asyncio
from typing import Any

from pydantic import BaseModel, Field

from strands import Agent
from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin, ExecutionError, Timeout
from strands_evals.evaluators.chaos import RecoveryStrategyEvaluator
from strands_evals.eval_task_handler import TracedHandler, eval_task
from strands_evals.simulation import ToolSimulator

tool_simulator = ToolSimulator()


class FlightSearchResponse(BaseModel):
    flights: list[dict[str, Any]] = Field(default_factory=list)
    status: str = Field(default="success")


class HotelSearchResponse(BaseModel):
    hotels: list[dict[str, Any]] = Field(default_factory=list)
    status: str = Field(default="success")


@tool_simulator.tool(output_schema=FlightSearchResponse)
def search_flights(origin: str, destination: str, date: str) -> dict[str, Any]:
    """Search for available flights between two cities on a given date."""
    pass


@tool_simulator.tool(output_schema=HotelSearchResponse)
def search_hotels(city: str, check_in: str, check_out: str) -> dict[str, Any]:
    """Search for available hotels in a city for given dates."""
    pass


chaos_plugin = ChaosPlugin()
_flights_tool = tool_simulator.get_tool("search_flights")
_hotels_tool = tool_simulator.get_tool("search_hotels")

# Flight search times out but hotel search works: agent should pivot
chaos_cases = [
    ChaosCase(
        name="flight_timeout_hotel_available",
        input="Plan my trip to Tokyo: find flights from SFO and hotels for May 20-23.",
        effects={"tool_effects": {"search_flights": [Timeout()]}},
    ),
    ChaosCase(
        name="flight_and_booking_fail",
        input="Find a flight from NYC to London on June 1.",
        effects={"tool_effects": {"search_flights": [ExecutionError(error_message="Internal server error")]}},
    ),
]


@eval_task(TracedHandler())
def task_function(case: ChaosCase):
    return Agent(
        system_prompt=(
            "You are a travel planning assistant. If a tool fails, "
            "try alternative tools that can partially fulfill the request. "
            "Do NOT retry the same failed tool more than once."
        ),
        tools=[_flights_tool, _hotels_tool],
        plugins=[chaos_plugin],
        callback_handler=None,
        trace_attributes={"session.id": case.session_id},
    )


experiment = ChaosExperiment(
    cases=chaos_cases,
    evaluators=[RecoveryStrategyEvaluator()],
)


async def main():
    report = await experiment.run_evaluations_async(task=task_function, max_workers=10)
    report.run_display()


asyncio.run(main())
```
Evaluation Output
The RecoveryStrategyEvaluator returns EvaluationOutput objects with:
- score: Float (0.0, 0.25, 0.5, 0.75, or 1.0)
- test_pass: True if score >= 0.5, False otherwise
- reason: Step-by-step reasoning explaining the evaluation
- label: One of the categorical labels (e.g., “Good”, “Excellent”)
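As a rough illustration of how these four fields might be used for triage, the sketch below filters failing results and prints their labels and reasons. OutputSketch is a hypothetical stand-in that mirrors only the documented field names; it is not the library's EvaluationOutput class, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class OutputSketch:
    """Hypothetical stand-in mirroring the four documented fields."""

    score: float
    test_pass: bool
    reason: str
    label: str


def failing_reasons(outputs: list[OutputSketch]) -> list[str]:
    """Format label, score, and reason for every result that did not pass."""
    return [f"{o.label} ({o.score}): {o.reason}" for o in outputs if not o.test_pass]


# Made-up sample values, not real evaluator output.
sample = [
    OutputSketch(1.0, True, "Varied the retry, then pivoted.", "Excellent"),
    OutputSketch(0.25, False, "Repeated the same call.", "Poor"),
]
report_lines = failing_reasons(sample)
print(report_lines)  # ['Poor (0.25): Repeated the same call.']
```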
What Gets Evaluated
The evaluator examines:
- Tool Call Patterns: Sequence of tool calls and their results
- Retry Behavior: Whether the agent retried failed tools and how many times
- Recovery Quality:
  - Exploration breadth: Did the agent try alternative tools or approaches?
  - Retry discipline: Did it retry appropriately (not excessively)?
  - Approach variation: Did retries use different strategies (different parameters, different tools)?
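To make the three recovery-quality dimensions concrete, here is a minimal heuristic sketch that derives a rough signal for each one from a list of tool calls. It is not how the evaluator works internally (the evaluator uses a judge model); ToolCall and recovery_signals are hypothetical names, and the trace is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool call; not a strands_evals type."""

    tool: str
    args: dict = field(default_factory=dict)
    failed: bool = False


def recovery_signals(calls: list[ToolCall]) -> dict:
    """Compute rough signals for the three recovery dimensions."""
    first_failure = next((i for i, c in enumerate(calls) if c.failed), None)
    if first_failure is None:
        return {"had_failure": False}
    failed_call = calls[first_failure]
    after = calls[first_failure + 1:]
    retries = [c for c in after if c.tool == failed_call.tool]
    return {
        "had_failure": True,
        # Exploration breadth: distinct other tools tried after the failure.
        "alternative_tools": sorted({c.tool for c in after if c.tool != failed_call.tool}),
        # Retry discipline: how often the failed tool was called again.
        "retry_count": len(retries),
        # Approach variation: did any retry change the arguments?
        "varied_retry": any(c.args != failed_call.args for c in retries),
    }


# Invented trace: one failure, one varied retry, then a pivot to another tool.
trace = [
    ToolCall("search_flights", {"date": "May 20"}, failed=True),
    ToolCall("search_flights", {"date": "May 21"}, failed=True),
    ToolCall("search_hotels", {"city": "Tokyo"}),
]
signals = recovery_signals(trace)
print(signals)
```

For this trace the sketch reports one retry with changed arguments and one alternative tool, which is the shape of behavior the rubric describes at its upper levels.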
Best Practices
- Provide Alternative Tools: Give agents access to multiple tools that can partially fulfill the same goal
- Add Recovery Instructions: System prompts with explicit recovery guidance help agents score higher
- Capture Complete Sessions: Include all tool call attempts and retries in the trajectory
- Combine with Other Evaluators: Use alongside FailureCommunicationEvaluator and PartialCompletionEvaluator
- Test Various Failure Severities: Include single-tool failures and multi-tool failures
Common Patterns
Pattern 1: Fallback to Alternative Tools
Evaluate if the agent pivots to a different tool when the primary one fails.
Pattern 2: Retry with Variation
Assess if the agent retries with different parameters instead of repeating the same call.
Pattern 3: Graceful Escalation
Measure if the agent escalates to the user when all automated recovery options are exhausted.
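The three patterns can be pictured as one control flow. The sketch below hard-codes that flow purely to show what each pattern looks like. A real agent decides these steps through the model, not through a function like this. All names here (recover, flaky_flights, hotels) are hypothetical.

```python
from typing import Callable, Iterable


def recover(
    primary: Callable[[dict], str],
    variations: Iterable[dict],
    fallback: Callable[[], str],
) -> str:
    """Try primary with each argument variation, then a fallback, then escalate."""
    for args in variations:
        try:
            return primary(args)
        except Exception:
            continue  # Pattern 2: move on to a different set of arguments
    try:
        return fallback()  # Pattern 1: pivot to an alternative tool
    except Exception:
        # Pattern 3: nothing automated is left, so hand the decision to the user
        return "All options failed. How would you like to proceed?"


def flaky_flights(args: dict) -> str:
    """Simulated primary tool that always times out."""
    raise TimeoutError("search_flights timed out")


def hotels() -> str:
    """Simulated alternative tool that succeeds."""
    return "hotels found"


result = recover(flaky_flights, [{"date": "May 20"}, {"date": "May 21"}], hotels)
print(result)  # hotels found
```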
Example Scenarios
Scenario 1: Excellent Recovery
```
Tool: search_flights -> Timeout
Agent: [retries search_flights with broader date range -> still fails]
Agent: [calls search_hotels for the destination instead]
Final: "I couldn't find flight info, but I found hotels in Tokyo for your dates."
Evaluation: Excellent (1.0) - Tried variation, then pivoted to alternative
```
Scenario 2: Good Recovery
```
Tool: search_flights -> NetworkError
Agent: [retries search_flights once -> still fails]
Final: "Flight search is unavailable. Please try again later."
Evaluation: Good (0.75) - Retried once, then communicated clearly
```
Scenario 3: Poor Recovery
```
Tool: search_flights -> Timeout
Agent: [retries search_flights 5 times with identical parameters]
Final: "I'm having trouble finding flights."
Evaluation: Poor (0.25) - Excessive retries with no variation
```
Scenario 4: No Recovery
```
Tool: search_flights -> ExecutionError
Agent: "I can't help with that."
Evaluation: Failure (0.0) - Gave up immediately without any attempt
```
Common Issues and Solutions
Issue 1: Score is Always 0.5
Problem: Evaluator always returns neutral score.
Solution: Ensure tool failures are present in the trace. If no tools fail, the evaluator returns 0.5 by design.
Issue 2: Agent Retries Excessively
Problem: Agent retries the same tool many times, getting a low recovery score.
Solution: Add retry limits to the system prompt (e.g., “Do NOT retry more than once”).
Issue 3: No Trajectory Data
Problem: Evaluator returns empty results.
Solution: Ensure telemetry captures full session including all tool call spans.
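When debugging these issues, a quick check over the captured tool results can point to the likely cause before the evaluator is run. The following is a sketch under stated assumptions: the trace is represented as a list of plain dicts with hypothetical "tool" and "failed" keys (not a strands_evals format), and the limit of two failures per tool is an arbitrary choice standing for one call plus one retry.

```python
from collections import Counter


def diagnose_trace(tool_results: list[dict], max_failures_per_tool: int = 2) -> str:
    """Map a list of tool results onto the three issues described above."""
    if not tool_results:
        return "Issue 3: no tool call spans were captured"
    # Count failed calls per tool name.
    failures = Counter(r["tool"] for r in tool_results if r["failed"])
    if not failures:
        return "Issue 1: no tool failures, so expect the neutral 0.5"
    if max(failures.values()) > max_failures_per_tool:
        return "Issue 2: the same tool failed repeatedly, which suggests excessive retries"
    return "ok"


# Invented trace: the same tool fails five times in a row.
trace = [{"tool": "search_flights", "failed": True}] * 5
diagnosis = diagnose_trace(trace)
print(diagnosis)
```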
Differences from Other Evaluators
- vs. FailureCommunicationEvaluator: Recovery scores the agent’s actions (retries, fallbacks, tool switching); communication scores the agent’s words (how it explains failures). Both can be high, both can be low, or one without the other.
- vs. PartialCompletionEvaluator: Recovery scores the quality of recovery attempts regardless of outcome; partial completion scores the result regardless of how the agent got there. Excellent recovery may still yield low completion if all alternatives also fail.
- vs. TrajectoryEvaluator: Trajectory evaluates the full action sequence holistically for workflow adherence; recovery specifically targets the quality of failure-response actions within that sequence.
- vs. ToolSelectionEvaluator: Tool selection checks if correct tools were chosen under normal conditions; recovery evaluates whether the agent adapted its tool choices appropriately when failures occurred.
Use Cases
Use Case 1: Chaos Testing
Evaluate agent recovery strategies under deliberately injected tool failures.
Use Case 2: Agent Configuration Comparison
Compare how different system prompts affect recovery behavior.
Use Case 3: Retry Policy Validation
Verify agents follow expected retry policies (retry once, then fallback).
Use Case 4: Multi-Tool Resilience
Test whether agents leverage alternative tools when primary ones fail.
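For the configuration comparison in Use Case 2, one simple approach is to collect the recovery scores per configuration and rank by the mean. The sketch below assumes the scores have already been extracted into plain lists. rank_configs and the configuration names are hypothetical, and the numbers are made up.

```python
from statistics import mean


def rank_configs(scores_by_config: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Rank configurations by mean recovery score, highest first."""
    ranked = [
        (name, mean(scores))
        for name, scores in scores_by_config.items()
        if scores  # skip configurations with no results
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up scores for two hypothetical system prompt variants.
scores = {
    "no_recovery_guidance": [0.0, 0.25, 0.5],
    "explicit_recovery_guidance": [0.75, 1.0, 0.75],
}
ranking = rank_configs(scores)
print(ranking[0][0])  # explicit_recovery_guidance
```

A mean over a handful of cases is noisy, so a comparison like this is more convincing when each configuration runs against the same set of chaos cases.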
Related Evaluators
- FailureCommunicationEvaluator: Evaluates how well agents communicate failures
- PartialCompletionEvaluator: Measures what fraction of goals were achieved
- TrajectoryEvaluator: Evaluates the sequence of actions taken
- ToolSelectionEvaluator: Evaluates whether correct tools were selected
- GoalSuccessRateEvaluator: Binary goal achievement assessment
Related Documentation
- Chaos Testing: Chaos testing overview and guide