Result Caching
Overview
The EvaluationDataStore lets you cache task execution results so you can re-run evaluators against the same data without re-invoking your agent. This is useful when agent calls are expensive, slow, or non-deterministic: you run the agent once, cache the results, then iterate on evaluators.
Quick Start
    from strands_evals import Case, Experiment, LocalFileTaskResultStore
    from strands_evals.evaluators import OutputEvaluator

    store = LocalFileTaskResultStore("./cached_results")

    cases = [
        Case(name="q1", input="What is the capital of France?"),
        Case(name="q2", input="What is 2 + 2?"),
    ]

    experiment = Experiment(cases=cases, evaluators=[OutputEvaluator(rubric="Score 1.0 if correct.")])

    # my_task is your own task function, defined elsewhere.

    # First run: executes the task and caches results
    reports = experiment.run_evaluations(my_task, evaluation_data_store=store)

    # Second run: loads cached results, skips task execution
    reports = experiment.run_evaluations(my_task, evaluation_data_store=store)

How It Works
When you pass an evaluation_data_store to run_evaluations:

- For each case, the store is checked for a cached result using case.name as the key
- If found, the cached EvaluationData is used directly; the task function is not called
- If not found, the task runs normally and the result is saved to the store
- Evaluators then run against the (cached or fresh) EvaluationData
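The lookup steps above can be sketched in a few lines. This is only an illustration of the check-then-run pattern, not the library's code: `get_result`, `fake_task`, and the plain `dict` standing in for a store are all hypothetical names chosen for the example.

```python
from typing import Any, Callable


def get_result(
    case_name: str,
    case_input: str,
    task: Callable[[str], Any],
    cache: dict[str, Any],
) -> Any:
    """Return the cached result for a case, running the task only on a miss."""
    cached = cache.get(case_name)  # stands in for store.load(case_name)
    if cached is not None:
        return cached  # cache hit: the task function is not called
    result = task(case_input)  # cache miss: run the task normally
    cache[case_name] = result  # stands in for store.save(case_name, result)
    return result


calls: list[str] = []


def fake_task(text: str) -> str:
    # Hypothetical task that records each invocation so hits can be told from misses.
    calls.append(text)
    return text.upper()


cache: dict[str, Any] = {}
first = get_result("q1", "paris", fake_task, cache)   # runs the task
second = get_result("q1", "paris", fake_task, cache)  # served from the cache
```

Because the case name is the only key in this sketch, two cases sharing a name would share a cached result, so unique names matter.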
LocalFileTaskResultStore
The built-in LocalFileTaskResultStore saves one JSON file per case in a directory:

    from strands_evals import LocalFileTaskResultStore

    store = LocalFileTaskResultStore("./my_cache")
    # Creates: ./my_cache/q1.json, ./my_cache/q2.json, etc.

Each file contains the full EvaluationData (input, output, trajectory, environment state, etc.) serialized as JSON.
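To make the one-file-per-case layout concrete, here is a minimal stand-in built only on the standard library. It is an assumption-laden sketch, not the implementation of LocalFileTaskResultStore: the class name `JsonFileStore` is hypothetical, and a plain dict takes the place of EvaluationData.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


class JsonFileStore:
    """Hypothetical stand-in that writes one JSON file per case name."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, case_name: str) -> Path:
        # The case name becomes the file name, e.g. q1 -> <directory>/q1.json
        return self.directory / f"{case_name}.json"

    def load(self, case_name: str) -> Optional[dict[str, Any]]:
        path = self._path(case_name)
        if not path.exists():
            return None  # nothing cached for this case
        return json.loads(path.read_text(encoding="utf-8"))

    def save(self, case_name: str, result: dict[str, Any]) -> None:
        self._path(case_name).write_text(json.dumps(result, indent=2), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    store = JsonFileStore(tmp)
    store.save("q1", {"input": "What is 2 + 2?", "output": "4"})
    files = sorted(p.name for p in Path(tmp).iterdir())
    loaded = store.load("q1")
    missing = store.load("q2")
```

In a layout like this, deleting a single file is enough to make that one case count as uncached on the next lookup. The sketch does not sanitize case names, so names containing path separators would need extra handling.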
Custom Stores
Implement the EvaluationDataStore protocol for any storage backend:

    from strands_evals.evaluation_data_store import EvaluationDataStore
    from strands_evals.types.evaluation import EvaluationData

    class S3ResultStore:
        def __init__(self, bucket: str, prefix: str):
            self.bucket = bucket
            self.prefix = prefix

        def load(self, case_name: str) -> EvaluationData | None:
            # Fetch from S3, return None if not found
            ...

        def save(self, case_name: str, result: EvaluationData) -> None:
            # Upload to S3
            ...

The protocol requires just two methods: load(case_name) -> EvaluationData | None and save(case_name, result) -> None.
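A two-method protocol means a custom store needs no base class, only matching methods. The sketch below illustrates that idea with a locally defined protocol so it runs without the library installed: `StoreLike` and `InMemoryStore` are hypothetical names, `Any` stands in for EvaluationData, and whether the real EvaluationDataStore supports runtime `isinstance` checks is not something this example assumes.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class StoreLike(Protocol):
    """Local stand-in for a two-method store protocol (not the library's class)."""

    def load(self, case_name: str) -> Optional[Any]: ...

    def save(self, case_name: str, result: Any) -> None: ...


class InMemoryStore:
    """Keeps results in a dict; handy for tests where nothing should touch disk."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def load(self, case_name: str) -> Optional[Any]:
        return self._data.get(case_name)  # None when the case is not cached

    def save(self, case_name: str, result: Any) -> None:
        self._data[case_name] = result


store = InMemoryStore()
conforms = isinstance(store, StoreLike)  # structural check: both methods exist
store.save("q1", {"output": "Paris"})
hit = store.load("q1")
miss = store.load("unknown")
```

Note that InMemoryStore never inherits from StoreLike; it satisfies the protocol purely by shape, which is the same reason the S3ResultStore above needs no parent class.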
Related Documentation
- Task Decorator: Simplify task function boilerplate
- Experiment Management: Save and load experiments
- Serialization: Experiment serialization formats