Task Decorator
Overview
The `@eval_task` decorator simplifies writing task functions for evaluation. Instead of manually wiring up telemetry, session mapping, and result normalization, you decorate a function and let the framework handle the boilerplate.
Basic Usage
In the simplest form, your function returns an `Agent` and the decorator invokes it with `case.input` automatically:
```python
from strands import Agent
from strands_evals import eval_task, Case, Experiment
from strands_evals.evaluators import OutputEvaluator

@eval_task()
def my_task():
    return Agent(model="us.anthropic.claude-sonnet-4-20250514-v1:0", callback_handler=None)

cases = [Case(name="greeting", input="Hello!")]
evaluator = OutputEvaluator(rubric="Score 1.0 if friendly. Score 0.0 otherwise.")
experiment = Experiment(cases=cases, evaluators=[evaluator])
reports = experiment.run_evaluations(my_task)
```

How It Works
The decorator wraps your function so that `Experiment.run_evaluations` receives a properly formatted task callable. Your function can:
- Take no arguments: the decorator calls it once per case and invokes the returned `Agent` with `case.input`
- Take a `Case` argument: for per-case customization (different tools, system prompts, etc.)
- Return an `Agent`: auto-invoked with `case.input`
- Return a `str`: used directly as the output
- Return a `dict`: passed through as-is (must have at least an `"output"` key; see the sketch below)
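For instance, a task can skip the agent entirely and return a dict directly. A minimal sketch (the `echo_task` name and its canned output are purely illustrative):

```python
@eval_task()
def echo_task(case):
    # Returning a dict passes it through as-is; it must include an "output" key.
    return {"output": f"echo: {case.input}"}
```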
Per-Case Customization
Accept a `Case` parameter to customize agent behavior per test case:
```python
from strands import Agent
from strands_evals import eval_task, Case
from strands_tools import calculator

@eval_task()
def my_task(case):
    tools = [calculator] if case.metadata.get("use_calc") else []
    return Agent(tools=tools, callback_handler=None)

cases = [
    Case(name="math", input="What is 15 * 23?", metadata={"use_calc": True}),
    Case(name="chat", input="Tell me a joke", metadata={"use_calc": False}),
]
```

Collecting Traces with TracedHandler
For evaluators that need trajectory data (`HelpfulnessEvaluator`, `CorrectnessEvaluator`, etc.), use `TracedHandler`. It automatically collects OpenTelemetry spans and maps them to a `Session`:
```python
from strands_evals import eval_task, TracedHandler
from strands_evals.evaluators import HelpfulnessEvaluator, CorrectnessEvaluator

@eval_task(TracedHandler())
def my_task():
    return Agent(callback_handler=None)

experiment = Experiment(
    cases=cases,
    evaluators=[HelpfulnessEvaluator(), CorrectnessEvaluator()]
)
reports = experiment.run_evaluations(my_task)
```

`TracedHandler` handles:
- Clearing the span exporter before each case
- Collecting finished spans after the task runs
- Mapping spans to a `Session` via `StrandsInMemorySessionMapper`
- Adding the session as `trajectory` in the result dict (see the sketch below)
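Put together, the result each case hands to evaluators has roughly this shape (an illustrative sketch; it mirrors the manual version in the comparison section below):

```python
# Illustrative shape of the per-case result dict produced by TracedHandler:
#
#     {
#         "output": "<stringified agent response>",
#         "trajectory": <Session built from the collected spans>,
#     }
#
# Trajectory-based evaluators read the "trajectory" entry.
```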
Custom Handlers
Create custom handlers by subclassing `EvalTaskHandler`:
```python
from strands_evals import EvalTaskHandler

class MyHandler(EvalTaskHandler):
    def before(self, case):
        print(f"Running case: {case.name}")

    def after(self, case, result):
        processed = super().after(case, result)
        processed["metadata"] = {"custom": True}
        return processed

@eval_task(MyHandler())
def my_task():
    return Agent(callback_handler=None)
```
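As a further sketch, the same hooks can time each case. This assumes `before` and `after` fire sequentially around each case; the `duration_s` key is hypothetical, not part of the framework:

```python
import time

from strands_evals import EvalTaskHandler

class TimingHandler(EvalTaskHandler):
    def before(self, case):
        # Mark the start just before the task runs for this case.
        self._start = time.perf_counter()

    def after(self, case, result):
        processed = super().after(case, result)
        # "duration_s" is an illustrative key added to the result dict.
        processed["duration_s"] = time.perf_counter() - self._start
        return processed
```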
Before and After: Comparison
Without the decorator:
```python
from strands import Agent
from strands_evals.telemetry import StrandsEvalsTelemetry
from strands_evals.mappers import StrandsInMemorySessionMapper

telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def task_function(case):
    telemetry.in_memory_exporter.clear()
    agent = Agent(
        trace_attributes={"session.id": case.session_id},
        callback_handler=None
    )
    response = agent(case.input)
    spans = telemetry.in_memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(spans, session_id=case.session_id)
    return {"output": str(response), "trajectory": session}
```

With the decorator:
```python
@eval_task(TracedHandler())
def task_function():
    return Agent(callback_handler=None)
```

Related Documentation
- Getting Started: Quickstart guide
- Evaluators Overview: Available evaluators
- Remote Trace Providers: Evaluate traces from production backends