Recovery Strategy Evaluator

The RecoveryStrategyEvaluator scores the quality of an agent’s recovery actions when tools fail. It evaluates whether the agent attempts alternative approaches, retries appropriately, and varies its strategies rather than repeating the same failed action. A complete example can be found here.

  • Trace-Level Evaluation: Evaluates the full conversation trace including tool call patterns, retries, and alternative approaches
  • Five-Level Scoring: Granular scale from “Failure” to “Excellent”
  • Multi-Dimensional Assessment: Evaluates exploration breadth, retry discipline, and approach variation
  • Structured Reasoning: Provides step-by-step reasoning for each evaluation
  • Async Support: Supports both synchronous and asynchronous evaluation

Use the RecoveryStrategyEvaluator when you need to:

  • Assess whether agents attempt alternative approaches when tools fail
  • Evaluate retry behavior (appropriate retries vs. infinite loops)
  • Detect agents that give up immediately on first failure
  • Measure quality and variety of recovery strategies
  • Compare recovery sophistication across agent configurations

This evaluator operates at the TRACE_LEVEL, evaluating the full conversation trace including tool call patterns, retries, and alternative approaches.

  • Type: Model | str | None
  • Default: None (uses default Bedrock model)
  • Description: The model to use as the judge.

Rating       Score   Description
Failure      0.0     Agent gives up immediately or crashes on first failure
Poor         0.25    Agent retries the same failed action with no variation
Acceptable   0.5     Minimal recovery, or no failures occurred to recover from
Good         0.75    Agent retries with variation or tries alternative tools
Excellent    1.0     Agent demonstrates sophisticated recovery: retries, fallbacks, escalation, and adaptation

A response passes the evaluation if the score is >= 0.5.

When no tool failures occur during the session, the evaluator produces a neutral score of 0.5, since there are no failures to assess recovery behavior against.
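
The scale and threshold above can be written down as a small lookup. This is only an illustrative sketch: the names `RATING_SCALE`, `PASS_THRESHOLD`, `NEUTRAL_SCORE`, `passes`, and `label_for` are assumptions made for this example and are not part of the strands_evals API.

```python
# Rating scale as documented above; all names here are illustrative only.
RATING_SCALE = {
    "Failure": 0.0,
    "Poor": 0.25,
    "Acceptable": 0.5,
    "Good": 0.75,
    "Excellent": 1.0,
}
PASS_THRESHOLD = 0.5
NEUTRAL_SCORE = 0.5  # documented score when the trace contains no tool failures


def passes(score: float) -> bool:
    """Return True when a score meets the documented pass threshold."""
    return score >= PASS_THRESHOLD


def label_for(score: float) -> str:
    """Look up the rating label for one of the five documented scores."""
    for label, value in RATING_SCALE.items():
        if value == score:
            return label
    raise ValueError(f"{score} is not one of the five documented scores")
```

Note that a session with no failures lands on "Acceptable" and therefore passes, so a pass alone does not show that recovery behavior was exercised.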

import asyncio
from typing import Any

from pydantic import BaseModel, Field
from strands import Agent
from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin, ExecutionError, Timeout
from strands_evals.evaluators.chaos import RecoveryStrategyEvaluator
from strands_evals.eval_task_handler import TracedHandler, eval_task
from strands_evals.simulation import ToolSimulator

tool_simulator = ToolSimulator()


class FlightSearchResponse(BaseModel):
    flights: list[dict[str, Any]] = Field(default_factory=list)
    status: str = Field(default="success")


class HotelSearchResponse(BaseModel):
    hotels: list[dict[str, Any]] = Field(default_factory=list)
    status: str = Field(default="success")


@tool_simulator.tool(output_schema=FlightSearchResponse)
def search_flights(origin: str, destination: str, date: str) -> dict[str, Any]:
    """Search for available flights between two cities on a given date."""
    pass


@tool_simulator.tool(output_schema=HotelSearchResponse)
def search_hotels(city: str, check_in: str, check_out: str) -> dict[str, Any]:
    """Search for available hotels in a city for given dates."""
    pass


chaos_plugin = ChaosPlugin()
_flights_tool = tool_simulator.get_tool("search_flights")
_hotels_tool = tool_simulator.get_tool("search_hotels")

# Flight search times out but hotel search works: agent should pivot
chaos_cases = [
    ChaosCase(
        name="flight_timeout_hotel_available",
        input="Plan my trip to Tokyo: find flights from SFO and hotels for May 20-23.",
        effects={"tool_effects": {"search_flights": [Timeout()]}},
    ),
    ChaosCase(
        name="flight_and_booking_fail",
        input="Find a flight from NYC to London on June 1.",
        effects={"tool_effects": {"search_flights": [ExecutionError(error_message="Internal server error")]}},
    ),
]


@eval_task(TracedHandler())
def task_function(case: ChaosCase):
    return Agent(
        system_prompt=(
            "You are a travel planning assistant. If a tool fails, "
            "try alternative tools that can partially fulfill the request. "
            "Do NOT retry the same failed tool more than once."
        ),
        tools=[_flights_tool, _hotels_tool],
        plugins=[chaos_plugin],
        callback_handler=None,
        trace_attributes={"session.id": case.session_id},
    )


experiment = ChaosExperiment(
    cases=chaos_cases,
    evaluators=[RecoveryStrategyEvaluator()],
)


async def main():
    report = await experiment.run_evaluations_async(task=task_function, max_workers=10)
    report.run_display()


asyncio.run(main())

The RecoveryStrategyEvaluator returns EvaluationOutput objects with:

  • score: Float (0.0, 0.25, 0.5, 0.75, or 1.0)
  • test_pass: True if score >= 0.5, False otherwise
  • reason: Step-by-step reasoning explaining the evaluation
  • label: One of the categorical labels (e.g., “Good”, “Excellent”)
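
Once a run has produced several outputs, the four fields listed above are enough for a quick summary. The sketch below uses a hypothetical `OutputStandIn` dataclass that only mirrors those field names. It is not the library's EvaluationOutput class, and `summarize` is a helper invented for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class OutputStandIn:
    """Hypothetical stand-in mirroring the four documented fields."""
    score: float
    test_pass: bool
    reason: str
    label: str


def summarize(outputs: list[OutputStandIn]) -> dict:
    """Compute the pass rate and label counts, and collect reasons for failures."""
    if not outputs:
        return {"pass_rate": 0.0, "labels": {}, "failure_reasons": []}
    passed = sum(1 for o in outputs if o.test_pass)
    return {
        "pass_rate": passed / len(outputs),
        "labels": dict(Counter(o.label for o in outputs)),
        "failure_reasons": [o.reason for o in outputs if not o.test_pass],
    }


# Made-up sample values, used only to exercise the helper.
sample = [
    OutputStandIn(1.0, True, "Varied the retry, then pivoted to hotels.", "Excellent"),
    OutputStandIn(0.25, False, "Repeated the identical call five times.", "Poor"),
]
summary = summarize(sample)
```

Collecting the `reason` text for failing cases is usually the most useful part, since it shows which recovery dimension the judge found lacking.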

The evaluator examines:

  1. Tool Call Patterns: Sequence of tool calls and their results
  2. Retry Behavior: Whether the agent retried failed tools and how many times
  3. Recovery Quality:
    • Exploration breadth: Did the agent try alternative tools or approaches?
    • Retry discipline: Did it retry appropriately (not excessively)?
    • Approach variation: Did retries use different strategies (different parameters, different tools)?

  1. Provide Alternative Tools: Give agents access to multiple tools that can partially fulfill the same goal
  2. Add Recovery Instructions: System prompts with explicit recovery guidance help agents score higher
  3. Capture Complete Sessions: Include all tool call attempts and retries in the trajectory
  4. Combine with Other Evaluators: Use alongside FailureCommunicationEvaluator and PartialCompletionEvaluator
  5. Test Various Failure Severities: Include single-tool failures and multi-tool failures

Evaluate if the agent pivots to a different tool when the primary one fails.

Assess if the agent retries with different parameters instead of repeating the same call.

Measure if the agent escalates to the user when all automated recovery options are exhausted.

Tool: search_flights -> Timeout
Agent: [retries search_flights with broader date range -> still fails]
Agent: [calls search_hotels for the destination instead]
Final: "I couldn't find flight info, but I found hotels in Tokyo for your dates."
Evaluation: Excellent (1.0) - Tried variation, then pivoted to alternative

Tool: search_flights -> NetworkError
Agent: [retries search_flights once -> still fails]
Final: "Flight search is unavailable. Please try again later."
Evaluation: Good (0.75) - Retried once, then communicated clearly

Tool: search_flights -> Timeout
Agent: [retries search_flights 5 times with identical parameters]
Final: "I'm having trouble finding flights."
Evaluation: Poor (0.25) - Excessive retries with no variation

Tool: search_flights -> ExecutionError
Agent: "I can't help with that."
Evaluation: Failure (0.0) - Gave up immediately without any attempt
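
The four traces above can be roughly reproduced with a rule-based check, which can be handy as a cheap sanity test next to the evaluator. This is a deliberately crude sketch and not how the evaluator works: the real evaluator uses a judge model, and this heuristic ignores the agent's final message entirely. `ToolCall`, `rough_recovery_label`, and `max_identical_retries` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical, simplified record of one tool call in a trace."""
    tool: str
    params: tuple  # hashable snapshot of the call arguments
    failed: bool


def rough_recovery_label(calls: list[ToolCall], max_identical_retries: int = 1) -> str:
    """Approximate a rating label from the calls that follow the first failure."""
    first_fail = next((i for i, c in enumerate(calls) if c.failed), None)
    if first_fail is None:
        return "Acceptable"  # no failures: neutral, as documented
    failed = calls[first_fail]
    later = calls[first_fail + 1:]
    if not later:
        return "Failure"  # no further attempt after the failure
    same_tool = [c for c in later if c.tool == failed.tool]
    identical = [c for c in same_tool if c.params == failed.params]
    varied_retry = any(c.params != failed.params for c in same_tool)
    switched_tool = any(c.tool != failed.tool for c in later)
    if varied_retry and switched_tool:
        return "Excellent"  # variation first, then a pivot to another tool
    if varied_retry or switched_tool:
        return "Good"
    if len(identical) <= max_identical_retries:
        return "Good"  # a single disciplined retry, as in the second trace
    return "Poor"  # repeated identical retries
```

The `max_identical_retries` default of 1 is chosen only so that a single retry matches the second trace and five identical retries match the third.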

Problem: Evaluator always returns neutral score. Solution: Ensure tool failures are present in the trace. If no tools fail, the evaluator returns 0.5 by design.

Problem: Agent retries the same tool many times, getting a low recovery score. Solution: Add retry limits to the system prompt (e.g., “Do NOT retry more than once”).

Problem: Evaluator returns empty results. Solution: Ensure telemetry captures full session including all tool call spans.

  • vs. FailureCommunicationEvaluator: Recovery scores the agent’s actions (retries, fallbacks, tool switching); communication scores the agent’s words (how it explains failures). Both can be high, both can be low, or one without the other.
  • vs. PartialCompletionEvaluator: Recovery scores the quality of recovery attempts regardless of outcome; partial completion scores the result regardless of how the agent got there. Excellent recovery may still yield low completion if all alternatives also fail.
  • vs. TrajectoryEvaluator: Trajectory evaluates the full action sequence holistically for workflow adherence; recovery specifically targets the quality of failure-response actions within that sequence.
  • vs. ToolSelectionEvaluator: Tool selection checks if correct tools were chosen under normal conditions; recovery evaluates whether the agent adapted its tool choices appropriately when failures occurred.

Evaluate agent recovery strategies under deliberately injected tool failures.

Use Case 2: Agent Configuration Comparison

Compare how different system prompts affect recovery behavior.
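
One way to make such a comparison concrete is to collect per-case scores for each prompt variant and look at the means and per-case deltas. The sketch below assumes the scores have already been extracted from each run. The configuration names, case names, and numbers are made-up placeholders rather than measured results, and `compare_configs` is a hypothetical helper.

```python
from statistics import mean

# Placeholder per-case scores for two system-prompt variants (not real results).
scores_by_config = {
    "no_recovery_guidance": {"flight_timeout": 0.25, "flight_error": 0.0},
    "explicit_recovery_guidance": {"flight_timeout": 1.0, "flight_error": 0.75},
}


def compare_configs(
    scores: dict[str, dict[str, float]], baseline: str, candidate: str
) -> dict:
    """Return mean scores and per-case deltas (candidate minus baseline)."""
    base, cand = scores[baseline], scores[candidate]
    shared = sorted(base.keys() & cand.keys())  # only cases run under both configs
    if not shared:
        raise ValueError("the two configurations share no cases")
    return {
        "baseline_mean": mean(base[c] for c in shared),
        "candidate_mean": mean(cand[c] for c in shared),
        "deltas": {c: cand[c] - base[c] for c in shared},
    }


result = compare_configs(
    scores_by_config, "no_recovery_guidance", "explicit_recovery_guidance"
)
```

Restricting the comparison to cases present in both runs keeps the means comparable when one configuration was evaluated on a larger case set.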

Verify agents follow expected retry policies (retry once, then fallback).

Test whether agents leverage alternative tools when primary ones fail.