Simulators

Simulators enable dynamic evaluation of agents by generating realistic interaction patterns. Unlike static evaluators that assess single outputs, simulators actively participate in the evaluation loop — driving multi-turn conversations or generating realistic tool responses — to create authentic evaluation scenarios.

Traditional evaluation approaches have limitations when assessing conversational agents:

Static Evaluators:

  • Evaluate single input/output pairs
  • Cannot test multi-turn conversation flow
  • Miss context-dependent behaviors
  • Don’t capture goal-oriented interactions

Simulators:

  • Generate dynamic, multi-turn conversations
  • Adapt responses based on agent behavior
  • Test goal completion in realistic scenarios
  • Evaluate conversation flow and context maintenance
  • Enable testing without predefined scripts
  • Simulate tool behavior without live infrastructure

Use simulators when you need to:

  • Evaluate Multi-turn Conversations: Test agents across multiple conversation turns
  • Assess Goal Completion: Verify agents can achieve user objectives through dialogue
  • Test Conversation Flow: Evaluate how agents handle context and follow-up questions
  • Generate Diverse Interactions: Create varied conversation patterns automatically
  • Evaluate Without Scripts: Test agents without predefined conversation paths
  • Simulate Real Users: Generate realistic user behavior patterns
  • Test Tool Usage Without Infrastructure: Evaluate agent tool-use behavior without live APIs, databases, or services

The ActorSimulator is the core simulator class in Strands Evals. It’s a general-purpose simulator that can simulate any type of actor in multi-turn conversations. An “actor” is any conversational participant: users, customer service representatives, domain experts, adversarial testers, or any other entity that engages in dialogue.

The simulator maintains actor profiles, generates contextually appropriate responses based on conversation history, and tracks goal completion. By configuring different actor profiles and system prompts, you can simulate diverse interaction patterns.

The most common use of ActorSimulator is user simulation - simulating realistic end-users interacting with your agent during evaluation. This is the primary use case covered in our documentation.
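Whichever actor it plays, a simulator is consumed through a simple turn-taking protocol: check `has_next()`, pass the agent's latest reply to `act()`, and read the next simulated user message from `structured_output.message`. A stdlib-only stand-in (not the real `ActorSimulator`; the class name, scripted replies, and `<stop/>` marker are illustrative) sketches the shape of that loop:

```python
from dataclasses import dataclass

# Minimal stand-in mirroring the has_next()/act() protocol. Everything
# here (class names, scripted replies, "<stop/>" marker) is illustrative.
@dataclass
class _StructuredOutput:
    message: str  # next simulated user message


@dataclass
class _ActResult:
    structured_output: _StructuredOutput


class ScriptedUserSim:
    def __init__(self, script: list[str], max_turns: int = 5):
        self.script = script
        self.max_turns = max_turns
        self.turn = 0

    def has_next(self) -> bool:
        # Stop when the script is exhausted or the turn budget is spent.
        return self.turn < min(len(self.script), self.max_turns)

    def act(self, agent_response: str) -> _ActResult:
        reply = self.script[self.turn]
        self.turn += 1
        return _ActResult(_StructuredOutput(message=reply))


# Driver loop in the same shape used with ActorSimulator
sim = ScriptedUserSim(["One-way or round trip?", "<stop/>"])
user_message = "I need to book a flight"
transcript = []
while sim.has_next():
    agent_response = f"(agent reply to: {user_message})"  # placeholder agent
    result = sim.act(agent_response)
    user_message = str(result.structured_output.message)
    transcript.append(user_message)
```

The real simulator generates replies with an LLM and tracks goal completion; the control flow your harness writes is the same.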

Complete User Simulation Guide →

While user simulation is the primary use case, ActorSimulator can simulate other actor types by providing custom actor profiles:

  • Customer Support Representatives: Test agent-to-agent interactions
  • Domain Experts: Simulate specialized knowledge conversations
  • Adversarial Actors: Test robustness and edge cases
  • Internal Staff: Evaluate internal tooling workflows

The ToolSimulator enables LLM-powered simulation of tool behavior for controlled agent evaluation. Instead of calling real tools, registered tools are executed by an LLM that generates realistic, schema-validated responses while maintaining state across calls.

This is useful when real tools require live infrastructure, when you need deterministic behavior for evaluation, or when tools are still under development.

```python
from typing import Any

from pydantic import BaseModel, Field
from strands import Agent
from strands_evals.simulation.tool_simulator import ToolSimulator

tool_simulator = ToolSimulator()

class WeatherResponse(BaseModel):
    temperature: float = Field(..., description="Temperature in Fahrenheit")
    conditions: str = Field(..., description="Weather conditions")

@tool_simulator.tool(output_schema=WeatherResponse)
def get_weather(city: str) -> dict[str, Any]:
    """Get current weather for a city."""
    pass

weather_tool = tool_simulator.get_tool("get_weather")
agent = Agent(tools=[weather_tool], callback_handler=None)
response = agent("What's the weather in Seattle?")
```

Key capabilities:

  • Decorator-based registration with automatic metadata extraction from function signatures
  • Schema-validated responses via Pydantic output models
  • Shared state across related tools via share_state_id (e.g., sensor + controller operating on the same environment)
  • Stateful context with initial state descriptions and bounded call history cache
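Shared state is what keeps related tools consistent: a write made through one tool is reflected in subsequent reads through another. An illustrative sketch, assuming `share_state_id` is passed alongside `output_schema` on the registration decorator (the keyword placement and the `SensorReading`/`ThermostatAck` models are assumptions; see the Tool Simulation guide for the exact signature):

```python
# Both tools register against the same simulated environment ("room-1"),
# so a thermostat change shows up in later temperature readings.
@tool_simulator.tool(output_schema=SensorReading, share_state_id="room-1")
def read_temperature(room: str) -> dict[str, Any]:
    """Read the current room temperature."""
    ...

@tool_simulator.tool(output_schema=ThermostatAck, share_state_id="room-1")
def set_thermostat(room: str, target_f: float) -> dict[str, Any]:
    """Set the thermostat target temperature."""
    ...
```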

Complete Tool Simulation Guide →

The simulator framework is designed to be extensible. ActorSimulator and ToolSimulator provide general-purpose foundations, and additional specialized simulators can be built for specific evaluation patterns as needs emerge.

Understanding when to use simulators versus evaluators:

| Aspect | Evaluators | ActorSimulator | ToolSimulator |
| --- | --- | --- | --- |
| Role | Passive assessment | Active conversation participant | Simulated tool execution |
| Turns | Single turn | Multi-turn | Per tool call |
| Adaptation | Static criteria | Dynamic responses | Stateful responses |
| Use Case | Output quality | Conversation flow | Tool-use behavior |
| Goal | Score responses | Drive interactions | Replace infrastructure |

Use Together: Simulators and evaluators complement each other. Use simulators to generate multi-turn conversations, then use evaluators to assess the quality of those interactions.

Simulators work seamlessly with trace-based evaluators:

```python
from strands import Agent
from strands_evals import Case, Experiment, ActorSimulator
from strands_evals.evaluators import HelpfulnessEvaluator, GoalSuccessRateEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Setup telemetry
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter

def task_function(case: Case) -> dict:
    # Create simulator to drive conversation
    simulator = ActorSimulator.from_case_for_user_simulator(
        case=case,
        max_turns=10
    )

    # Create agent to evaluate
    agent = Agent(
        trace_attributes={
            "gen_ai.conversation.id": case.session_id,
            "session.id": case.session_id
        },
        callback_handler=None
    )

    # Run multi-turn conversation
    user_message = case.input
    while simulator.has_next():
        agent_response = agent(user_message)
        user_result = simulator.act(str(agent_response))
        user_message = str(user_result.structured_output.message)

    all_spans = memory_exporter.get_finished_spans()

    # Map to session for evaluation
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(all_spans, session_id=case.session_id)
    return {"output": str(agent_response), "trajectory": session}

# Use evaluators to assess simulated conversations
evaluators = [
    HelpfulnessEvaluator(),
    GoalSuccessRateEvaluator()
]

# Setup test cases
test_cases = [
    Case(
        input="I need to book a flight to Paris",
        metadata={"task_description": "Flight booking confirmed"}
    ),
    Case(
        input="Help me write a Python function to sort a list",
        metadata={"task_description": "Programming assistance"}
    )
]

experiment = Experiment(cases=test_cases, evaluators=evaluators)
reports = experiment.run_evaluations(task_function)
```

Simulators work best with well-defined objectives:

```python
case = Case(
    input="I need to book a flight",
    metadata={
        "task_description": "Flight booked with confirmation number and email sent"
    }
)
```

Balance thoroughness with efficiency:

```python
# Simple tasks: 3-5 turns
simulator = ActorSimulator.from_case_for_user_simulator(case=case, max_turns=5)

# Complex tasks: 8-15 turns
simulator = ActorSimulator.from_case_for_user_simulator(case=case, max_turns=12)
```

Assess different aspects of simulated conversations:

```python
evaluators = [
    HelpfulnessEvaluator(),      # User experience
    GoalSuccessRateEvaluator(),  # Task completion
    FaithfulnessEvaluator()      # Response accuracy
]
```

Capture conversation details for debugging:

```python
conversation_log = []
while simulator.has_next():
    # ... conversation logic (produces turn_number, agent_message,
    # simulator_message, and simulator_reasoning) ...
    conversation_log.append({
        "turn": turn_number,
        "agent": agent_message,
        "simulator": simulator_message,
        "reasoning": simulator_reasoning
    })
```
```python
def test_goal_completion(case: Case) -> bool:
    simulator = ActorSimulator.from_case_for_user_simulator(case=case)
    agent = Agent(system_prompt="Your prompt")

    user_message = case.input
    while simulator.has_next():
        agent_response = agent(user_message)
        user_result = simulator.act(str(agent_response))
        user_message = str(user_result.structured_output.message)
        if "<stop/>" in user_message:
            return True
    return False
```
```python
def analyze_conversation_flow(case: Case) -> dict:
    simulator = ActorSimulator.from_case_for_user_simulator(case=case)
    agent = Agent(system_prompt="Your prompt")

    metrics = {
        "turns": 0,
        "agent_questions": 0,
        "user_clarifications": 0
    }

    user_message = case.input
    while simulator.has_next():
        agent_response = agent(user_message)
        if "?" in str(agent_response):
            metrics["agent_questions"] += 1
        user_result = simulator.act(str(agent_response))
        user_message = str(user_result.structured_output.message)
        metrics["turns"] += 1
    return metrics
```
```python
def compare_agent_configurations(case: Case, configs: list) -> dict:
    results = {}
    for config in configs:
        simulator = ActorSimulator.from_case_for_user_simulator(case=case)
        agent = Agent(**config)
        # Run conversation and collect metrics
        # ... evaluation logic (produces `metrics`) ...
        results[config["name"]] = metrics
    return results
```