
Result Caching

The EvaluationDataStore lets you cache task execution results so you can re-run evaluators against the same data without re-invoking your agent. This is useful when agent calls are expensive, slow, or non-deterministic: you run the agent once, cache the results, then iterate on evaluators against the cached data.

from strands_evals import Case, Experiment, LocalFileTaskResultStore
from strands_evals.evaluators import OutputEvaluator

store = LocalFileTaskResultStore("./cached_results")

cases = [
    Case(name="q1", input="What is the capital of France?"),
    Case(name="q2", input="What is 2 + 2?"),
]

experiment = Experiment(cases=cases, evaluators=[OutputEvaluator(rubric="Score 1.0 if correct.")])

# First run: executes the task and caches results
reports = experiment.run_evaluations(my_task, evaluation_data_store=store)

# Second run: loads cached results, skips task execution
reports = experiment.run_evaluations(my_task, evaluation_data_store=store)

When you pass an evaluation_data_store to run_evaluations:

  1. For each case, the store is checked for a cached result using case.name as the key
  2. If found, the cached EvaluationData is used directly — the task function is not called
  3. If not found, the task runs normally and the result is saved to the store
  4. Evaluators then run against the (cached or fresh) EvaluationData
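
The lookup steps above can be sketched as a small "load, else run and save" helper. This is only an illustration of the flow, not the library's implementation: FakeCase, DictStore, and get_or_run are hypothetical stand-ins, and plain dictionaries take the place of EvaluationData.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class FakeCase:
    """Hypothetical stand-in for a case; only the fields this sketch needs."""
    name: str
    input: str


class DictStore:
    """Hypothetical in-memory store with load/save keyed by case name."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def load(self, case_name: str) -> Optional[Any]:
        return self._data.get(case_name)

    def save(self, case_name: str, result: Any) -> None:
        self._data[case_name] = result


def get_or_run(case: FakeCase, task: Callable[[FakeCase], Any], store: DictStore) -> Any:
    """Return the cached result for case.name, or run the task and cache it."""
    cached = store.load(case.name)
    if cached is not None:
        return cached              # cache hit: the task is not called
    result = task(case)            # cache miss: run the task normally
    store.save(case.name, result)  # save so the next run can skip the task
    return result


calls: list[str] = []


def counting_task(case: FakeCase) -> dict:
    """Toy task that records each invocation so cache hits are visible."""
    calls.append(case.name)
    return {"input": case.input, "output": case.input.upper()}


store = DictStore()
case = FakeCase(name="q1", input="hello")
first = get_or_run(case, counting_task, store)   # runs the task
second = get_or_run(case, counting_task, store)  # served from the store
```

After both calls, the toy task has been invoked once and both results are identical. Because the key is the case name alone, changing the task without changing case names would still return the previously cached result in this sketch.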

The built-in LocalFileTaskResultStore saves one JSON file per case in a directory:

from strands_evals import LocalFileTaskResultStore
store = LocalFileTaskResultStore("./my_cache")
# Creates: ./my_cache/q1.json, ./my_cache/q2.json, etc.

Each file contains the full EvaluationData (input, output, trajectory, environment state, etc.) serialized as JSON.
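
Because each case maps to its own file, a cached result can be inspected or discarded with ordinary file operations. The sketch below assumes files are named after the case (as in the q1.json example above); cached_case_names and invalidate are hypothetical helpers, not part of the library, and the file contents written here are placeholders rather than the real EvaluationData schema.

```python
import json
import tempfile
from pathlib import Path


def cached_case_names(cache_dir: Path) -> list[str]:
    """List case names that have a JSON file in the cache directory."""
    return sorted(p.stem for p in cache_dir.glob("*.json"))


def invalidate(cache_dir: Path, case_name: str) -> bool:
    """Delete one case's cache file; return True if a file was removed."""
    path = cache_dir / f"{case_name}.json"
    if path.exists():
        path.unlink()
        return True
    return False


# Demo against a throwaway directory with placeholder file contents
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    for name in ("q1", "q2"):
        (cache_dir / f"{name}.json").write_text(json.dumps({"input": name}))

    before = cached_case_names(cache_dir)
    removed = invalidate(cache_dir, "q1")
    after = cached_case_names(cache_dir)
```

Given the lookup behavior described earlier, removing a case's file should cause that case's task to run again on the next evaluation, while the other cases remain cached.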

Implement the EvaluationDataStore protocol for any storage backend:

from strands_evals.evaluation_data_store import EvaluationDataStore
from strands_evals.types.evaluation import EvaluationData


class S3ResultStore:
    def __init__(self, bucket: str, prefix: str):
        self.bucket = bucket
        self.prefix = prefix

    def load(self, case_name: str) -> EvaluationData | None:
        # Fetch from S3, return None if not found
        ...

    def save(self, case_name: str, result: EvaluationData) -> None:
        # Upload to S3
        ...

The protocol requires just two methods: load(case_name) -> EvaluationData | None and save(case_name, result) -> None.
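
As a minimal sketch of what satisfying a two-method protocol looks like, the example below defines a stand-in protocol and an in-memory store that could be handy in unit tests. StoreLike and InMemoryResultStore are hypothetical names; whether the library's EvaluationDataStore is a typing.Protocol that supports runtime isinstance checks is an assumption not confirmed here, so the sketch checks against its own stand-in.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class StoreLike(Protocol):
    """Hypothetical stand-in mirroring the two required methods."""

    def load(self, case_name: str) -> Optional[Any]: ...

    def save(self, case_name: str, result: Any) -> None: ...


class InMemoryResultStore:
    """Keeps results in a dict; conforms structurally, with no base class."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def load(self, case_name: str) -> Optional[Any]:
        # Return None on a miss, matching the protocol's contract
        return self._results.get(case_name)

    def save(self, case_name: str, result: Any) -> None:
        self._results[case_name] = result


store = InMemoryResultStore()
store.save("q1", {"output": "Paris"})

conforms = isinstance(store, StoreLike)  # checks that both methods exist
hit = store.load("q1")
miss = store.load("unknown")
```

The runtime check only verifies that load and save are present; it does not validate their signatures, so a static type checker remains the more reliable way to confirm conformance.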