Chaos Testing

Chaos testing systematically evaluates agent resilience by injecting controlled failures into tool execution. Using ChaosPlugin, ChaosCase, and ChaosExperiment, you can test how agents handle tool timeouts, network errors, and corrupted responses without modifying agent code. A complete example can be found here.

This enables you to answer questions like:

  • Does the agent gracefully communicate failures to users?
  • Can the agent achieve partial goals when some tools fail?
  • Does the agent employ effective recovery strategies?

Traditional evaluation tests agents under ideal conditions. In production, tools fail unpredictably:

Standard Evaluation:

  • Tools always return correct responses
  • No network failures or timeouts
  • Cannot reveal fragile error handling
  • Misses degraded-mode behavior

Chaos Testing:

  • Injects realistic tool failures (timeouts, network errors, validation errors)
  • Corrupts tool responses (truncated fields, removed data, corrupted values)
  • Tests agent resilience without live infrastructure failures
  • Measures graceful degradation and recovery behavior
  • Quantifies partial goal completion under failure
  • Reveals which tools are single points of failure and which the agent can route around

Use chaos testing when you need to:

  • Evaluate Resilience: Test how agents handle tool failures gracefully
  • Assess Recovery: Verify agents try alternative approaches when tools fail
  • Measure Degradation: Quantify how much of a goal agents achieve despite failures
  • Test Communication: Ensure agents inform users clearly about failures
  • Validate Robustness: Confirm agents don’t crash or loop on corrupted data

Chaos testing integrates with Strands’ plugin system via BeforeToolCallEvent and AfterToolCallEvent hooks:

  1. ChaosCase: Extends Case with an effects field mapping tool names to failure effects
  2. ChaosPlugin: A Strands plugin that intercepts tool calls and applies effects transparently
  3. ChaosExperiment: Composes the base Experiment to manage chaos context per case
  4. ChaosEffect: A hierarchy of pre-hook effects (cancel tool calls) and post-hook effects (corrupt responses). Each tool can have only one effect per ChaosCase; use separate cases to test different failure modes for the same tool.

The workflow:

  1. You define ChaosCase objects with effects specifying which tools should fail and how
  2. ChaosExperiment sets a ContextVar with the active case before each task execution (thread/async safe)
  3. ChaosPlugin reads the active case from the ContextVar and applies effects at the appropriate hook point
  4. Your task function code has zero chaos concepts. Just add ChaosPlugin() to the agent’s plugins list
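The ContextVar hand-off in steps 2 and 3 can be sketched with the standard library alone. This illustrates the pattern and is not the strands_evals implementation: `ACTIVE_EFFECTS`, `run_case`, and `call_tool` are hypothetical names, and the plain dict of strings stands in for a ChaosCase effect map.

```python
import contextvars

# Hypothetical stand-in for the context variable the experiment sets per case.
ACTIVE_EFFECTS = contextvars.ContextVar("active_effects", default=None)


def call_tool(name, func, *args):
    """Hypothetical hook point: consult the active effect map before running a tool."""
    effects = ACTIVE_EFFECTS.get() or {}
    if effects.get(name) == "timeout":
        # Pre-hook style effect: cancel the call and report an error instead.
        return {"status": "error", "content": f"{name} timed out"}
    return {"status": "success", "content": func(*args)}


def run_case(effects, task):
    """Set the active effects for one case, run the task, then restore the previous value."""
    token = ACTIVE_EFFECTS.set(effects)
    try:
        return task()
    finally:
        ACTIVE_EFFECTS.reset(token)


def get_weather(city):
    return f"sunny in {city}"


def task():
    # The task contains no chaos-specific code; it only calls the tool.
    return call_tool("get_weather", get_weather, "Seattle")


baseline = run_case({}, task)
chaos = run_case({"get_weather": "timeout"}, task)
```

Because the variable is reset after each case, one case's effects cannot leak into the next, and each thread or asyncio task sees its own value.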

Define your tools as usual with @tool, then create ChaosCase objects specifying which tools should fail. The effect map keys must match the tool function names exactly:

from strands import tool
from strands_evals.chaos import ChaosCase, NetworkError, Timeout


@tool
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    return '{"temperature": 72, "condition": "sunny"}'


chaos_cases = [
    ChaosCase(
        name="weather_timeout",
        input="What's the weather in Seattle?",
        effects={"tool_effects": {"get_weather": [Timeout()]}},
    ),
    ChaosCase(
        name="network_failure",
        input="What's the weather in Seattle?",
        effects={"tool_effects": {"get_weather": [NetworkError()]}},
    ),
]

Add ChaosPlugin() to the agent’s plugins list. No other code changes are needed:

from strands import Agent
from strands_evals.chaos import ChaosPlugin
from strands_evals.eval_task_handler import TracedHandler, eval_task

chaos_plugin = ChaosPlugin()


@eval_task(TracedHandler())
def task_function(case: ChaosCase):
    return Agent(
        system_prompt="You are a helpful weather assistant.",
        tools=[get_weather],
        plugins=[chaos_plugin],
        callback_handler=None,
        trace_attributes={"session.id": case.session_id},
    )


import asyncio
from strands_evals.chaos import ChaosExperiment
from strands_evals.evaluators import GoalSuccessRateEvaluator

experiment = ChaosExperiment(
    cases=chaos_cases,
    evaluators=[GoalSuccessRateEvaluator()]
)


async def main():
    report = await experiment.run_evaluations_async(task=task_function, max_workers=1)
    report.run_display()


asyncio.run(main())

These effects cancel the tool call entirely and return an error:

| Effect | Description |
| --- | --- |
| Timeout | Simulates a tool execution timeout |
| NetworkError | Simulates a network connectivity failure |
| ExecutionError | Simulates a runtime error during tool execution |
| ValidationError | Simulates invalid input/output validation failure |

from strands_evals.chaos import ExecutionError, NetworkError, Timeout, ValidationError

effect_maps = {
    "timeout": {"tool_effects": {"my_tool": [Timeout()]}},
    "network": {"tool_effects": {"my_tool": [NetworkError()]}},
    "execution": {"tool_effects": {"my_tool": [ExecutionError()]}},
    "validation": {"tool_effects": {"my_tool": [ValidationError()]}},
}

These effects let the tool execute but corrupt the response:

| Effect | Description | Parameters |
| --- | --- | --- |
| TruncateFields | Truncates string fields in the response | max_length |
| RemoveFields | Randomly removes fields from the response | remove_ratio |
| CorruptValues | Corrupts field values with garbage data | corrupt_ratio |

from strands_evals.chaos import TruncateFields, RemoveFields, CorruptValues

effect_maps = {
    "truncated": {"tool_effects": {"my_tool": [TruncateFields(max_length=10)]}},
    "missing_fields": {"tool_effects": {"my_tool": [RemoveFields(remove_ratio=0.5)]}},
    "corrupted": {"tool_effects": {"my_tool": [CorruptValues(corrupt_ratio=0.3)]}},
}
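To make the post-hook effects more concrete, here is a rough sketch of what truncating and removing fields could look like on a flat response dict. The helper functions and the sample response are hypothetical, written only for illustration. How the library's effects actually traverse and corrupt responses is not documented here and may differ.

```python
import random


def truncate_fields(response, max_length):
    """Shorten every string value in a flat dict to at most max_length characters."""
    return {
        key: value[:max_length] if isinstance(value, str) else value
        for key, value in response.items()
    }


def remove_fields(response, remove_ratio, rng):
    """Drop a remove_ratio share of the keys (rounded down), chosen at random."""
    keys = list(response)
    drop_count = int(len(keys) * remove_ratio)
    dropped = set(rng.sample(keys, drop_count))
    return {key: value for key, value in response.items() if key not in dropped}


# Made-up tool response used only for this sketch.
response = {
    "title": "Agents in production",
    "snippet": "A long summary of the article",
    "rank": 1,
    "source": "news",
}
rng = random.Random(0)  # seeded so repeated runs drop the same fields

truncated = truncate_fields(response, max_length=10)
reduced = remove_fields(response, remove_ratio=0.5, rng=rng)
```

In both cases the tool still "succeeds" from the agent's point of view, which is what makes these failures harder to notice than a cancelled call.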

Target multiple tools in a single case to simulate cascading failures:

from strands_evals.chaos import ChaosCase, Timeout, NetworkError, CorruptValues

chaos_case = ChaosCase(
    name="total_chaos",
    input="Book me a flight to Paris",
    effects={
        "tool_effects": {
            "search_flights": [Timeout()],
            "book_flight": [NetworkError()],
            "send_confirmation": [CorruptValues(corrupt_ratio=0.5)],
        }
    },
)

Note: Each tool can have only one effect per ChaosCase. Passing multiple effects for the same tool (e.g., "my_tool": [Timeout(), NetworkError()]) raises a ValueError. To test multiple failure modes for a single tool, create separate ChaosCase instances, one per effect. Pre-hook effects are inherently mutually exclusive, since only one can cancel a given tool call. The runtime does support composing multiple post-hook effects sequentially, so this validator constraint may be relaxed in a future release.
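The one-effect-per-tool rule, and the workaround of splitting failure modes into separate cases, can be sketched as follows. `validate_tool_effects` and `split_into_cases` are hypothetical helpers, and plain strings stand in for effect objects. The library's own validator may word its error differently.

```python
def validate_tool_effects(tool_effects):
    """Raise ValueError if any tool maps to more than one effect."""
    for tool_name, effects in tool_effects.items():
        if len(effects) > 1:
            raise ValueError(
                f"Tool '{tool_name}' has {len(effects)} effects; expected at most one per case"
            )


def split_into_cases(tool_name, effects):
    """Build one single-effect map per failure mode for the same tool."""
    return [{"tool_effects": {tool_name: [effect]}} for effect in effects]


# Two failure modes for one tool become two separate effect maps.
maps = split_into_cases("my_tool", ["timeout", "network_error"])
for effect_map in maps:
    validate_tool_effects(effect_map["tool_effects"])  # passes: one effect each
```

Each resulting map could then back its own ChaosCase, so every failure mode gets a separately scored run.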

When you have multiple base cases and want to test across several failure scenarios, use ChaosCase.expand() to generate the Cartesian product:

from strands_evals import Case
from strands_evals.chaos import ChaosCase, NetworkError, Timeout

# Define base test cases
base_cases = [
    Case(name="weather-seattle", input="What's the weather in Seattle?"),
    Case(name="weather-tokyo", input="What's the weather in Tokyo?"),
]

# Define named effect maps
effect_maps = {
    "search_timeout": {
        "tool_effects": {"get_weather": [Timeout()]},
    },
    "network_failure": {
        "tool_effects": {"get_weather": [NetworkError()]},
    },
}

# Expand: 2 cases x (2 effect maps + 1 baseline) = 6 ChaosCase objects
chaos_cases = ChaosCase.expand(base_cases, effect_maps, include_no_effect_baseline=True)

Setting include_no_effect_baseline=True adds an extra variant of each base case with no effects applied. This gives you a clean comparison point: you can see how the agent scores under normal conditions versus under each failure scenario, making it easy to measure the delta that chaos introduces.
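The arithmetic behind the expansion is an ordinary Cartesian product. The sketch below reproduces the count with itertools only. `expand_names` is a hypothetical helper that pairs names rather than building ChaosCase objects, and it makes no claim about how the library names or orders the generated cases.

```python
from itertools import product


def expand_names(base_names, effect_names, include_baseline):
    """Pair every base case with every effect map, plus an optional no-effect variant."""
    variants = list(effect_names)
    if include_baseline:
        variants.append(None)  # None marks the no-effect baseline
    return list(product(base_names, variants))


pairs = expand_names(
    ["weather-seattle", "weather-tokyo"],
    ["search_timeout", "network_failure"],
    include_baseline=True,
)
# 2 base cases x (2 effect maps + 1 baseline) = 6 pairs
```

The case count grows multiplicatively, so a large set of base cases combined with many effect maps can become expensive to evaluate.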

Chaos testing works naturally with ToolSimulator for fully controlled evaluation. Simulated tools provide reproducible responses, and chaos effects inject failures on top:

import asyncio

from strands import Agent
from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin, Timeout, CorruptValues
from strands_evals.eval_task_handler import TracedHandler, eval_task
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.simulation import ToolSimulator
from pydantic import BaseModel, Field

tool_simulator = ToolSimulator()


class SearchResult(BaseModel):
    title: str = Field(..., description="Result title")
    snippet: str = Field(..., description="Result snippet")


@tool_simulator.tool(output_schema=SearchResult)
def web_search(query: str) -> dict:
    """Search the web for information."""
    pass


chaos_cases = [
    ChaosCase(
        name="search_timeout",
        input="Find recent news about AI agents",
        effects={"tool_effects": {"web_search": [Timeout()]}},
    ),
    ChaosCase(
        name="corrupted_results",
        input="Find recent news about AI agents",
        effects={"tool_effects": {"web_search": [CorruptValues(corrupt_ratio=0.5)]}},
    ),
]

chaos_plugin = ChaosPlugin()
_search_tool = tool_simulator.get_tool("web_search")


@eval_task(TracedHandler())
def task_function(case: ChaosCase):
    return Agent(
        tools=[_search_tool],
        plugins=[chaos_plugin],
        callback_handler=None,
        trace_attributes={"session.id": case.session_id},
    )


experiment = ChaosExperiment(
    cases=chaos_cases,
    evaluators=[GoalSuccessRateEvaluator()]
)


async def main():
    report = await experiment.run_evaluations_async(task=task_function, max_workers=1)
    report.run_display()


asyncio.run(main())

Understanding when to use each:

| Aspect | Simulators | Chaos Testing |
| --- | --- | --- |
| Role | Replace tool execution entirely | Inject failures into tool execution |
| Scope | All tool calls are simulated | Only targeted tools are affected |
| Use Case | Test without infrastructure | Test resilience under failure |
| Combination | Can be used together | Chaos effects apply on top of simulated tools |

Chaos testing ships with three specialized evaluators designed to assess agent behavior under failure:

| Evaluator | What It Measures | Scoring | Baseline |
| --- | --- | --- | --- |
| FailureCommunicationEvaluator | Clarity, actionability, transparency, and tone of failure messages | Five-level (0.0, 0.25, 0.5, 0.75, 1.0) | 0.5 when no failures occur |
| PartialCompletionEvaluator | Fraction of user goal achieved despite failures | Continuous (0.0 to 1.0) | ~1.0 when task completes fully |
| RecoveryStrategyEvaluator | Quality of recovery actions: exploration breadth, retry discipline, approach variation | Five-level (0.0, 0.25, 0.5, 0.75, 1.0) | 0.5 when no failures occur |

When reviewing evaluation outputs, look at evaluator scores together to identify patterns in your agent’s failure-handling behavior:

  • High FailureCommunication + low PartialCompletion: Agent explains failures well but cannot work around them. Add fallback tools or alternative approaches.
  • High RecoveryStrategy + low PartialCompletion: Agent tries hard (retries, alternatives) but all options also fail. The failure is too severe for the available tools, or the agent’s fallback tools are also broken.
  • Low FailureCommunication + high PartialCompletion: Agent completes the task despite failures but doesn’t inform the user about degraded results. Add failure-awareness instructions to the system prompt.
  • Low RecoveryStrategy + low PartialCompletion: Agent gives up immediately without attempting alternatives. Add retry logic, fallback tools, or system prompt guidance about recovery behavior.

Check the reason field in each evaluation output for specific details about what the judge observed in the trace.
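When triaging many cases, the score patterns listed above can be encoded as a small lookup. This is only a sketch: `diagnose` is a hypothetical helper, and the cut-offs for "high" (0.75 and up) and "low" (0.25 and down) are assumptions chosen for illustration, not thresholds defined by the evaluators.

```python
def diagnose(failure_communication, partial_completion, recovery_strategy,
             high=0.75, low=0.25):
    """Map a trio of evaluator scores to the failure-handling patterns they suggest."""
    findings = []
    if partial_completion <= low:
        if failure_communication >= high:
            findings.append("explains failures but cannot work around them")
        if recovery_strategy >= high:
            findings.append("tries alternatives but they also fail")
        if recovery_strategy <= low:
            findings.append("gives up without attempting alternatives")
    if partial_completion >= high and failure_communication <= low:
        findings.append("completes the task but hides degraded results")
    return findings


# Clear failure messages, no goal progress, no recovery attempts.
print(diagnose(failure_communication=1.0, partial_completion=0.0, recovery_strategy=0.0))
```

A helper like this only narrows down where to look; the judge's reasoning for each case remains the primary evidence.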

Pattern 1: Comparing Agent Configurations Under Chaos

Compare how different system prompts affect resilience:

from strands import Agent
from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin
from strands_evals.eval_task_handler import TracedHandler, eval_task
from strands_evals.evaluators.chaos import PartialCompletionEvaluator


async def compare_agents_under_chaos(chaos_cases, configs):
    """Compare how different agent configs handle the same failures."""
    results = {}
    for config_name, system_prompt in configs.items():

        def make_task(prompt):
            @eval_task(TracedHandler())
            def task_function(case: ChaosCase):
                return Agent(
                    system_prompt=prompt,
                    plugins=[ChaosPlugin()],
                    callback_handler=None,
                    trace_attributes={"session.id": case.session_id},
                )

            return task_function

        experiment = ChaosExperiment(
            cases=chaos_cases,
            evaluators=[PartialCompletionEvaluator()]
        )
        report = await experiment.run_evaluations_async(task=make_task(system_prompt), max_workers=1)
        results[config_name] = report
    return results

Map the resilience curve of your agent by sweeping corruption intensity from 0% to 100%. This reveals the critical threshold where your agent breaks, and whether degradation is gradual or cliff-edge:

from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin, CorruptValues
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.evaluators.chaos import PartialCompletionEvaluator

# Sweep corrupt_ratio from mild to total corruption
sweep_cases = [
    ChaosCase(
        name=f"corrupt_{int(ratio*100)}pct",
        input="Find the cheapest flight to Paris next Tuesday",
        effects={"tool_effects": {"search_flights": [CorruptValues(corrupt_ratio=ratio)]}},
    )
    for ratio in [0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9, 1.0]
]

experiment = ChaosExperiment(
    cases=sweep_cases,
    evaluators=[GoalSuccessRateEvaluator(), PartialCompletionEvaluator()]
)

# Analyze: at what ratio does goal success drop below 0.5?
# Gradual degradation = resilient agent; cliff-edge = fragile agent
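Once the sweep has run, the analysis described in the comments above comes down to two questions: where the score first falls below 0.5, and how large the steepest single step is. The sketch below assumes you have already collected (corrupt_ratio, score) pairs from the report. How to extract them depends on the report object and is not shown, and the numbers are placeholders, not measured results.

```python
def first_ratio_below(points, threshold=0.5):
    """Return the smallest corrupt_ratio whose score is below threshold, or None."""
    for ratio, score in sorted(points):
        if score < threshold:
            return ratio
    return None


def largest_drop(points):
    """Return the biggest score decrease between adjacent sweep steps."""
    ordered = sorted(points)
    drops = [prev[1] - cur[1] for prev, cur in zip(ordered, ordered[1:])]
    return max(drops, default=0.0)


# Placeholder (corrupt_ratio, goal_success) pairs for illustration only.
points = [(0.0, 1.0), (0.1, 1.0), (0.3, 0.75), (0.5, 0.25), (1.0, 0.0)]

breaking_point = first_ratio_below(points)
steepest_step = largest_drop(points)
```

A large `steepest_step` concentrated in one interval points to cliff-edge behavior, while many small drops spread across the sweep indicate gradual degradation.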

Pattern 3: Multi-turn Chaos Testing with User Simulator

Combine chaos testing with user simulation for multi-turn resilience evaluation:

from strands import Agent
from strands_evals import ActorSimulator
from strands_evals.chaos import ChaosCase, ChaosPlugin
from strands_evals.eval_task_handler import TracedHandler, eval_task


@eval_task(TracedHandler())
def task_function(case: ChaosCase):
    user_sim = ActorSimulator.from_case_for_user_simulator(
        case=case, max_turns=8
    )
    agent = Agent(
        system_prompt="You are a helpful assistant.",
        plugins=[ChaosPlugin()],
        callback_handler=None,
        trace_attributes={"session.id": case.session_id},
    )
    user_message = case.input
    while user_sim.has_next():
        agent_response = agent(user_message)
        user_result = user_sim.act(str(agent_response))
        user_message = str(user_result.structured_output.message)
    return agent

Always include a no-effect baseline to compare agent performance with and without failures. When using ChaosCase.expand():

chaos_cases = ChaosCase.expand(cases, effect_maps, include_no_effect_baseline=True)

Start with single-tool failures to understand how your agent handles each failure point in isolation. Once you understand the baseline behavior, move to compound failures (multiple tools failing simultaneously) and then to advanced patterns like degradation sweeps. When a compound test fails, single-tool results tell you which tool failure is responsible:

from strands_evals.chaos import ChaosCase, NetworkError, Timeout

# Start simple: one tool, one effect
single_case = ChaosCase(
    name="search_fails",
    input="Find flights to Paris",
    effects={"tool_effects": {"search": [Timeout()]}},
)

# Then escalate: multiple tools failing together
compound_case = ChaosCase(
    name="total_chaos",
    input="Find flights to Paris",
    effects={
        "tool_effects": {
            "search": [Timeout()],
            "database": [NetworkError()],
        }
    },
)

Combine all three resilience evaluators for a complete picture:

evaluators = [
    FailureCommunicationEvaluator(),  # Did the agent tell the user?
    PartialCompletionEvaluator(),  # How much was achieved?
    RecoveryStrategyEvaluator(),  # Did it try alternatives?
]

Choose failure types that reflect realistic production failures:

  • NetworkError for external API tools
  • Timeout for slow or overloaded services
  • ExecutionError for local computation tools
  • ValidationError for tools with strict input schemas

Evaluator scores alone don’t tell the full story. Check the reason field in evaluation outputs to understand why the agent scored the way it did. A score of 0.5 may mean “barely passes” or “no failures occurred to evaluate against,” and the reasoning explains which.

Treat chaos testing as an iterative improvement loop:

  1. Run the experiment and identify which tool-failure combinations produce low scores
  2. Fix the agent (add retry logic, fallback tools, or better system prompt guidance)
  3. Re-run the same experiment and verify that previously failing cases now pass

Agents under failure often burn tokens on retry storms (repeated failed tool calls). Compare token consumption between baseline and chaos cases to detect runaway costs. A sharp increase signals excessive retries; a sharp decrease signals the agent is giving up too early.
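The token comparison can be reduced to a ratio check per case. This is a minimal sketch: `classify_token_usage` is a hypothetical helper, the 2.0x and 0.5x cut-offs are arbitrary assumptions to tune for your agent, and obtaining token counts from your traces or reports is left out.

```python
def classify_token_usage(baseline_tokens, chaos_tokens, high_factor=2.0, low_factor=0.5):
    """Compare a chaos run's token count against the no-effect baseline for the same case."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    ratio = chaos_tokens / baseline_tokens
    if ratio >= high_factor:
        return "possible retry storm"
    if ratio <= low_factor:
        return "possible early give-up"
    return "within expected range"


# Example with made-up token counts.
print(classify_token_usage(baseline_tokens=1000, chaos_tokens=4500))
```

Pairing each chaos case with its own no-effect baseline keeps the comparison fair, since different inputs naturally consume different amounts of tokens.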