Chaos Testing
Overview
Chaos testing systematically evaluates agent resilience by injecting controlled failures into tool execution. Using ChaosPlugin, ChaosCase, and ChaosExperiment, you can test how agents handle tool timeouts, network errors, and corrupted responses without modifying agent code. A complete example can be found here.
This enables you to answer questions like:
- Does the agent gracefully communicate failures to users?
- Can the agent achieve partial goals when some tools fail?
- Does the agent employ effective recovery strategies?
Why Chaos Testing?
Traditional evaluation tests agents under ideal conditions. In production, tools fail unpredictably:
Standard Evaluation:
- Tools always return correct responses
- No network failures or timeouts
- Cannot reveal fragile error handling
- Misses degraded-mode behavior
Chaos Testing:
- Injects realistic tool failures (timeouts, network errors, validation errors)
- Corrupts tool responses (truncated fields, removed data, corrupted values)
- Tests agent resilience without live infrastructure failures
- Measures graceful degradation and recovery behavior
- Quantifies partial goal completion under failure
- Reveals which tools are single points of failure and which the agent can route around
When to Use Chaos Testing
Use chaos testing when you need to:
- Evaluate Resilience: Test how agents handle tool failures gracefully
- Assess Recovery: Verify agents try alternative approaches when tools fail
- Measure Degradation: Quantify how much of a goal agents achieve despite failures
- Test Communication: Ensure agents inform users clearly about failures
- Validate Robustness: Confirm agents don’t crash or loop on corrupted data
How It Works
Chaos testing integrates with Strands’ plugin system via BeforeToolCallEvent and AfterToolCallEvent hooks:
- ChaosCase: Extends `Case` with an `effects` field mapping tool names to failure effects
- ChaosPlugin: A Strands plugin that intercepts tool calls and applies effects transparently
- ChaosExperiment: Composes the base `Experiment` to manage chaos context per case
- ChaosEffect: A hierarchy of pre-hook effects (cancel tool calls) and post-hook effects (corrupt responses). Each tool can have only one effect per `ChaosCase`; use separate cases to test different failure modes for the same tool.
The workflow:
- You define `ChaosCase` objects with effects specifying which tools should fail and how
- `ChaosExperiment` sets a `ContextVar` with the active case before each task execution (thread/async safe)
- `ChaosPlugin` reads the active case from the `ContextVar` and applies effects at the appropriate hook point
- Your task function code has zero chaos concepts. Just add `ChaosPlugin()` to the agent’s plugins list
Basic Usage
Define chaos test cases with effects
Define your tools as usual with @tool, then create ChaosCase objects specifying which tools should fail. The effect map keys must match the tool function names exactly:

```python
from strands import tool
from strands_evals.chaos import ChaosCase, NetworkError, Timeout


@tool
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    return '{"temperature": 72, "condition": "sunny"}'


chaos_cases = [
    ChaosCase(
        name="weather_timeout",
        input="What's the weather in Seattle?",
        effects={"tool_effects": {"get_weather": [Timeout()]}},
    ),
    ChaosCase(
        name="network_failure",
        input="What's the weather in Seattle?",
        effects={"tool_effects": {"get_weather": [NetworkError()]}},
    ),
]
```

Add chaos plugin to your agent
Add ChaosPlugin() to the agent’s plugins list. No other code changes are needed:

```python
from strands import Agent
from strands_evals.chaos import ChaosPlugin
from strands_evals.eval_task_handler import TracedHandler, eval_task

chaos_plugin = ChaosPlugin()


@eval_task(TracedHandler())
def task_function(case: ChaosCase):
    return Agent(
        system_prompt="You are a helpful weather assistant.",
        tools=[get_weather],
        plugins=[chaos_plugin],
        callback_handler=None,
        trace_attributes={"session.id": case.session_id},
    )
```

Run chaos experiment
```python
from strands_evals.chaos import ChaosExperiment
from strands_evals.evaluators import GoalSuccessRateEvaluator

experiment = ChaosExperiment(
    cases=chaos_cases,
    evaluators=[GoalSuccessRateEvaluator()],
)
report = experiment.run_evaluations(task=task_function)
report.run_display()
```

Effect Types
Pre-hook Effects (Tool Call Failures)
These effects cancel the tool call entirely and return an error:
| Effect | Description |
|---|---|
| Timeout | Simulates a tool execution timeout |
| NetworkError | Simulates a network connectivity failure |
| ExecutionError | Simulates a runtime error during tool execution |
| ValidationError | Simulates an input/output validation failure |

```python
from strands_evals.chaos import ExecutionError, NetworkError, Timeout, ValidationError

effect_maps = {
    "timeout": {"tool_effects": {"my_tool": [Timeout()]}},
    "network": {"tool_effects": {"my_tool": [NetworkError()]}},
    "execution": {"tool_effects": {"my_tool": [ExecutionError()]}},
    "validation": {"tool_effects": {"my_tool": [ValidationError()]}},
}
```

Post-hook Effects (Response Corruption)
These effects let the tool execute but corrupt the response:
| Effect | Description | Parameters |
|---|---|---|
| TruncateFields | Truncates string fields in the response | max_length |
| RemoveFields | Randomly removes fields from the response | remove_ratio |
| CorruptValues | Corrupts field values with garbage data | corrupt_ratio |

```python
from strands_evals.chaos import TruncateFields, RemoveFields, CorruptValues

effect_maps = {
    "truncated": {"tool_effects": {"my_tool": [TruncateFields(max_length=10)]}},
    "missing_fields": {"tool_effects": {"my_tool": [RemoveFields(remove_ratio=0.5)]}},
    "corrupted": {"tool_effects": {"my_tool": [CorruptValues(corrupt_ratio=0.3)]}},
}
```

Compound Effects (Multiple Tools)
Target multiple tools in a single case to simulate cascading failures:

```python
from strands_evals.chaos import ChaosCase, Timeout, NetworkError, CorruptValues

chaos_case = ChaosCase(
    name="total_chaos",
    input="Book me a flight to Paris",
    effects={
        "tool_effects": {
            "search_flights": [Timeout()],
            "book_flight": [NetworkError()],
            "send_confirmation": [CorruptValues(corrupt_ratio=0.5)],
        }
    },
)
```

Note: Each tool can have only one effect per ChaosCase. Passing multiple effects for the same tool (e.g., "my_tool": [Timeout(), NetworkError()]) raises a ValueError. To test multiple failure modes for a single tool, create separate ChaosCase instances, one per effect. Pre-hook effects are inherently mutually exclusive (only one can cancel a tool call), whereas the runtime supports composing multiple post-hook effects sequentially, so this validator constraint may be relaxed in a future release.
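One way to stay within the one-effect-per-tool constraint is to split a list of failure modes into separate effect maps before building cases. The sketch below is a hypothetical helper, not part of the library: `one_effect_per_map` and its labeling scheme are assumptions, and effects are shown as plain strings standing in for effect instances such as Timeout().

```python
def one_effect_per_map(tool_name, effects):
    """Build one named effect map per effect so each map targets the tool exactly once.

    Hypothetical helper: the label format is an assumption made for this example.
    """
    maps = {}
    for effect in effects:
        label = f"{tool_name}_{effect}".lower()
        # Each map keeps the single-item list shape used by the examples above.
        maps[label] = {"tool_effects": {tool_name: [effect]}}
    return maps


# Strings stand in for effect objects to keep the sketch self-contained.
effect_maps = one_effect_per_map("my_tool", ["Timeout", "NetworkError"])
```

Each value in the result has the same shape as the effects argument shown earlier, so every entry can back its own ChaosCase.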
Expanding Cases Across Multiple Effects
When you have multiple base cases and want to test across several failure scenarios, use ChaosCase.expand() to generate the Cartesian product:

```python
from strands_evals import Case
from strands_evals.chaos import ChaosCase, NetworkError, Timeout

# Define base test cases
base_cases = [
    Case(name="weather-seattle", input="What's the weather in Seattle?"),
    Case(name="weather-tokyo", input="What's the weather in Tokyo?"),
]

# Define named effect maps
effect_maps = {
    "search_timeout": {
        "tool_effects": {"get_weather": [Timeout()]},
    },
    "network_failure": {
        "tool_effects": {"get_weather": [NetworkError()]},
    },
}

# Expand: 2 cases x (2 effect maps + 1 baseline) = 6 ChaosCase objects
chaos_cases = ChaosCase.expand(base_cases, effect_maps, include_no_effect_baseline=True)
```

Setting include_no_effect_baseline=True adds an extra variant of each base case with no effects applied. This gives you a clean comparison point: you can see how the agent scores under normal conditions versus under each failure scenario, making it easy to measure the delta that chaos introduces.
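The arithmetic behind the expansion can be checked with a small stand-alone sketch. It only models the counting, using `itertools.product` over plain names; the `expanded_names` helper and the `base|variant` naming are assumptions for illustration and may not match how ChaosCase.expand() names its output.

```python
from itertools import product

base_names = ["weather-seattle", "weather-tokyo"]
effect_names = ["search_timeout", "network_failure"]


def expanded_names(bases, effects, include_baseline=True):
    """Pair every base case with every effect map, plus an optional no-effect variant."""
    variants = list(effects) + (["baseline"] if include_baseline else [])
    # Hypothetical naming scheme, used only to make the pairs visible.
    return [f"{base}|{variant}" for base, variant in product(bases, variants)]


names = expanded_names(base_names, effect_names)
names_without_baseline = expanded_names(base_names, effect_names, include_baseline=False)
```

With the baseline included, the two base cases and two effect maps yield six combinations; without it, four.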
Integration with ToolSimulator
Chaos testing works naturally with ToolSimulator for fully controlled evaluation. Simulated tools provide reproducible responses, and chaos effects inject failures on top:

```python
from strands import Agent
from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin, Timeout, CorruptValues
from strands_evals.eval_task_handler import TracedHandler, eval_task
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.simulation import ToolSimulator
from pydantic import BaseModel, Field

tool_simulator = ToolSimulator()


class SearchResult(BaseModel):
    title: str = Field(..., description="Result title")
    snippet: str = Field(..., description="Result snippet")


@tool_simulator.tool(output_schema=SearchResult)
def web_search(query: str) -> dict:
    """Search the web for information."""
    pass


chaos_cases = [
    ChaosCase(
        name="search_timeout",
        input="Find recent news about AI agents",
        effects={"tool_effects": {"web_search": [Timeout()]}},
    ),
    ChaosCase(
        name="corrupted_results",
        input="Find recent news about AI agents",
        effects={"tool_effects": {"web_search": [CorruptValues(corrupt_ratio=0.5)]}},
    ),
]

chaos_plugin = ChaosPlugin()
_search_tool = tool_simulator.get_tool("web_search")


@eval_task(TracedHandler())
def task_function(case: ChaosCase):
    return Agent(
        tools=[_search_tool],
        plugins=[chaos_plugin],
        callback_handler=None,
        trace_attributes={"session.id": case.session_id},
    )


experiment = ChaosExperiment(
    cases=chaos_cases,
    evaluators=[GoalSuccessRateEvaluator()],
)
report = experiment.run_evaluations(task=task_function)
report.run_display()
```

Chaos Testing vs Simulators
Understanding when to use each:
| Aspect | Simulators | Chaos Testing |
|---|---|---|
| Role | Replace tool execution entirely | Inject failures into tool execution |
| Scope | All tool calls are simulated | Only targeted tools are affected |
| Use Case | Test without infrastructure | Test resilience under failure |
| Combination | Can be used together | Chaos effects apply on top of simulated tools |
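The layering in the last row can be pictured with plain function wrappers. This is a conceptual sketch only: `simulated_search`, `with_pre_hook_failure`, and `with_truncation` are hypothetical names, and the JSON error shape is an assumption rather than what the library returns.

```python
import json


def simulated_search(query):
    """Stand-in for a simulated tool: always returns the same canned response."""
    return json.dumps({"title": "AI agents", "snippet": "Recent news about AI agents"})


def with_pre_hook_failure(tool, message):
    """Pre-hook style: the wrapped tool is never called; an error is returned instead."""
    def wrapped(*args, **kwargs):
        return json.dumps({"error": message})
    return wrapped


def with_truncation(tool, max_length):
    """Post-hook style: the tool runs, then string fields in its response are truncated."""
    def wrapped(*args, **kwargs):
        payload = json.loads(tool(*args, **kwargs))
        truncated = {
            key: value[:max_length] if isinstance(value, str) else value
            for key, value in payload.items()
        }
        return json.dumps(truncated)
    return wrapped


timed_out = with_pre_hook_failure(simulated_search, "timeout")("AI agents")
truncated = with_truncation(simulated_search, max_length=5)("AI agents")
```

The simulator decides what a healthy response looks like, and the chaos layer decides whether the agent ever sees it intact.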
Resilience Evaluators
Chaos testing ships with three specialized evaluators designed to assess agent behavior under failure:
| Evaluator | What It Measures | Scoring | Baseline |
|---|---|---|---|
| FailureCommunicationEvaluator | Clarity, actionability, transparency, and tone of failure messages | Five-level (0.0, 0.25, 0.5, 0.75, 1.0) | 0.5 when no failures occur |
| PartialCompletionEvaluator | Fraction of user goal achieved despite failures | Continuous (0.0 to 1.0) | ~1.0 when task completes fully |
| RecoveryStrategyEvaluator | Quality of recovery actions: exploration breadth, retry discipline, approach variation | Five-level (0.0, 0.25, 0.5, 0.75, 1.0) | 0.5 when no failures occur |
Interpreting Results
When reviewing evaluation outputs, look at evaluator scores together to identify patterns in your agent’s failure-handling behavior:
- High FailureCommunication + low PartialCompletion: Agent explains failures well but cannot work around them. Add fallback tools or alternative approaches.
- High RecoveryStrategy + low PartialCompletion: Agent tries hard (retries, alternatives) but all options also fail. The failure is too severe for the available tools, or the agent’s fallback tools are also broken.
- Low FailureCommunication + high PartialCompletion: Agent completes the task despite failures but doesn’t inform the user about degraded results. Add failure-awareness instructions to the system prompt.
- Low RecoveryStrategy + low PartialCompletion: Agent gives up immediately without attempting alternatives. Add retry logic, fallback tools, or system prompt guidance about recovery behavior.
Check the reason field in each evaluation output for specific details about what the judge observed in the trace.
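The four patterns above can be turned into a small triage helper. This is a rough sketch under stated assumptions: `diagnose` is a hypothetical function, and the 0.5 cut-off between "low" and "high" is an arbitrary choice for illustration, not a threshold defined by the evaluators.

```python
def diagnose(communication, completion, recovery, threshold=0.5):
    """Map a triple of evaluator scores to the patterns listed above.

    `threshold` is an assumed cut-off; a score counts as high only if it exceeds it.
    """
    comm_high = communication > threshold
    done_high = completion > threshold
    recov_high = recovery > threshold

    findings = []
    if comm_high and not done_high:
        findings.append("explains failures but cannot work around them")
    if recov_high and not done_high:
        findings.append("tries alternatives but they also fail")
    if not comm_high and done_high:
        findings.append("completes the task but hides degraded results")
    if not recov_high and not done_high:
        findings.append("gives up without attempting alternatives")
    return findings


# Illustrative scores, not real evaluation output.
findings = diagnose(communication=0.75, completion=0.2, recovery=0.25)
```

Because a score of exactly 0.5 is also the baseline reported when no failures occur, the strict comparison treats it as not high; the reason field remains the place to confirm which situation applies.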
Advanced Chaos Testing Patterns
Pattern 1: Comparing Agent Configurations Under Chaos
Compare how different system prompts affect resilience:

```python
from strands import Agent
from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin
from strands_evals.eval_task_handler import TracedHandler, eval_task
from strands_evals.evaluators.chaos import PartialCompletionEvaluator


def compare_agents_under_chaos(chaos_cases, configs):
    """Compare how different agent configs handle the same failures."""
    results = {}

    for config_name, system_prompt in configs.items():
        # Factory binds the prompt for this config to its own task function
        def make_task(prompt):
            @eval_task(TracedHandler())
            def task_function(case: ChaosCase):
                return Agent(
                    system_prompt=prompt,
                    plugins=[ChaosPlugin()],
                    callback_handler=None,
                    trace_attributes={"session.id": case.session_id},
                )
            return task_function

        experiment = ChaosExperiment(
            cases=chaos_cases,
            evaluators=[PartialCompletionEvaluator()],
        )
        report = experiment.run_evaluations(task=make_task(system_prompt))
        results[config_name] = report

    return results
```

Pattern 2: Degradation Sweep
Map the resilience curve of your agent by sweeping corruption intensity from 0% to 100%. This reveals the critical threshold where your agent breaks, and whether degradation is gradual or cliff-edge:

```python
from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin, CorruptValues
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.evaluators.chaos import PartialCompletionEvaluator

# Sweep corrupt_ratio from mild to total corruption
sweep_cases = [
    ChaosCase(
        name=f"corrupt_{int(ratio*100)}pct",
        input="Find the cheapest flight to Paris next Tuesday",
        effects={"tool_effects": {"search_flights": [CorruptValues(corrupt_ratio=ratio)]}},
    )
    for ratio in [0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9, 1.0]
]

experiment = ChaosExperiment(
    cases=sweep_cases,
    evaluators=[GoalSuccessRateEvaluator(), PartialCompletionEvaluator()],
)

# Analyze: at what ratio does goal success drop below 0.5?
# Gradual degradation = resilient agent; cliff-edge = fragile agent
```

Pattern 3: Multi-turn Chaos Testing with User Simulator
Combine chaos testing with user simulation for multi-turn resilience evaluation:

```python
from strands import Agent
from strands_evals import ActorSimulator
from strands_evals.chaos import ChaosCase, ChaosPlugin
from strands_evals.eval_task_handler import TracedHandler, eval_task


@eval_task(TracedHandler())
def task_function(case: ChaosCase):
    user_sim = ActorSimulator.from_case_for_user_simulator(
        case=case, max_turns=8
    )

    agent = Agent(
        system_prompt="You are a helpful assistant.",
        plugins=[ChaosPlugin()],
        callback_handler=None,
        trace_attributes={"session.id": case.session_id},
    )

    # Alternate agent and simulated-user turns until the simulator is done
    user_message = case.input
    while user_sim.has_next():
        agent_response = agent(user_message)
        user_result = user_sim.act(str(agent_response))
        user_message = str(user_result.structured_output.message)

    return agent
```

Best Practices
1. Start with Baseline Comparisons
Always include a no-effect baseline to compare agent performance with and without failures. When using ChaosCase.expand():

```python
chaos_cases = ChaosCase.expand(cases, effect_maps, include_no_effect_baseline=True)
```

2. Gradually Increase Chaos Severity
Start with single-tool failures to understand how your agent handles each failure point in isolation. Once you understand the baseline behavior, move to compound failures (multiple tools failing simultaneously) and then to advanced patterns like degradation sweeps. When a compound test fails, single-tool results tell you which tool failure is responsible:

```python
from strands_evals.chaos import ChaosCase, NetworkError, Timeout

# Start simple: one tool, one effect
single_case = ChaosCase(
    name="search_fails",
    input="Find flights to Paris",
    effects={"tool_effects": {"search": [Timeout()]}},
)

# Then escalate: multiple tools failing together
compound_case = ChaosCase(
    name="total_chaos",
    input="Find flights to Paris",
    effects={
        "tool_effects": {
            "search": [Timeout()],
            "database": [NetworkError()],
        }
    },
)
```

3. Use Resilience Evaluators Together
Combine all three resilience evaluators for a complete picture:

```python
evaluators = [
    FailureCommunicationEvaluator(),  # Did the agent tell the user?
    PartialCompletionEvaluator(),     # How much was achieved?
    RecoveryStrategyEvaluator(),      # Did it try alternatives?
]
```

4. Match Error Types to Tool Semantics
Choose failure types that reflect realistic production failures:
- `NetworkError` for external API tools
- `Timeout` for slow or overloaded services
- `ExecutionError` for local computation tools
- `ValidationError` for tools with strict input schemas
5. Read the Reasoning, Not Just Pass/Fail
Evaluator scores alone don’t tell the full story. Check the reason field in evaluation outputs to understand why the agent scored the way it did. A score of 0.5 may mean “barely passes” or “no failures occurred to evaluate against,” and the reasoning explains which.
6. Iterate: Diagnose, Fix, Validate
Treat chaos testing as an iterative improvement loop:
- Run the experiment and identify which tool-failure combinations produce low scores
- Fix the agent (add retry logic, fallback tools, or better system prompt guidance)
- Re-run the same experiment and verify that previously failing cases now pass
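The validation step in this loop amounts to comparing per-case scores from two runs. A minimal sketch, assuming you have already extracted a `{case_name: score}` mapping from each report (how to do that depends on the report object and is not shown); `compare_runs`, the 0.5 pass threshold, and the sample scores are all hypothetical.

```python
def compare_runs(before, after, pass_threshold=0.5):
    """Classify cases by how their score moved relative to an assumed pass threshold."""
    fixed, regressed, still_failing = [], [], []
    for name, old_score in before.items():
        # A case missing from the second run is treated as failing.
        new_score = after.get(name, 0.0)
        old_pass = old_score >= pass_threshold
        new_pass = new_score >= pass_threshold
        if not old_pass and new_pass:
            fixed.append(name)
        elif old_pass and not new_pass:
            regressed.append(name)
        elif not old_pass and not new_pass:
            still_failing.append(name)
    return {"fixed": fixed, "regressed": regressed, "still_failing": still_failing}


# Illustrative scores only, not output from a real experiment.
before = {"weather_timeout": 0.25, "network_failure": 0.0, "corrupted_results": 0.75}
after = {"weather_timeout": 0.75, "network_failure": 0.25, "corrupted_results": 0.5}
summary = compare_runs(before, after)
```

Tracking regressions alongside fixes matters because a prompt change that improves one failure mode can quietly degrade another.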
7. Monitor Token Usage Under Chaos
Agents under failure often burn tokens on retry storms (repeated failed tool calls). Compare token consumption between baseline and chaos cases to detect runaway costs. A sharp increase signals excessive retries; a sharp decrease signals the agent is giving up too early.
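A simple ratio check captures both signals. The sketch below assumes you can obtain a total token count per case from your own telemetry; `token_ratio_flags`, the 2.0 and 0.5 ratio thresholds, and the sample counts are hypothetical and should be tuned to your workload.

```python
def token_ratio_flags(baseline_tokens, chaos_tokens, high=2.0, low=0.5):
    """Flag chaos cases whose token usage departs sharply from the baseline.

    `high` and `low` are assumed ratio thresholds, not values defined by the library.
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")

    flags = {}
    for name, tokens in chaos_tokens.items():
        ratio = tokens / baseline_tokens
        if ratio >= high:
            flags[name] = "possible retry storm"
        elif ratio <= low:
            flags[name] = "possibly giving up early"
        else:
            flags[name] = "within expected range"
    return flags


# Illustrative token counts only.
flags = token_ratio_flags(
    1000,
    {"weather_timeout": 3200, "network_failure": 300, "corrupted_results": 1100},
)
```

A flag is a prompt to read the trace, not a verdict: a case that uses more tokens may simply be trying a legitimate fallback.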
Related Documentation
- Tool Simulation: Simulate tool behavior for reproducible tests
- Goal Success Rate Evaluator: Assess goal completion
- Simulators Overview: Simulator framework
- Evaluators: All available evaluators