Result Caching

Overview

The EvaluationDataStore lets you cache task execution results so you can re-run evaluators against the same data without re-invoking your agent. This is useful when agent calls are expensive, slow, or non-deterministic — you run the agent once, cache the results, then iterate on evaluators.

Quick Start

from strands_evals import Case, Experiment, LocalFileTaskResultStore
from strands_evals.evaluators import OutputEvaluator

store = LocalFileTaskResultStore("./cached_results")

cases = [
    Case(name="q1", input="What is the capital of France?"),
    Case(name="q2", input="What is 2 + 2?"),
]

experiment = Experiment(cases=cases, evaluators=[OutputEvaluator(rubric="Score 1.0 if correct.")])

# First run: executes the task and caches results
reports = experiment.run_evaluations(my_task, evaluation_data_store=store)

# Second run: loads cached results, skips task execution
reports = experiment.run_evaluations(my_task, evaluation_data_store=store)

How It Works

When you pass an evaluation_data_store to run_evaluations:

For each case, the store is checked for a cached result using case.name as the key
If found, the cached EvaluationData is used directly — the task function is not called
If not found, the task runs normally and the result is saved to the store
Evaluators then run against the (cached or fresh) EvaluationData

LocalFileTaskResultStore

The built-in LocalFileTaskResultStore saves one JSON file per case in a directory:

from strands_evals import LocalFileTaskResultStore

store = LocalFileTaskResultStore("./my_cache")
# Creates: ./my_cache/q1.json, ./my_cache/q2.json, etc.

Each file contains the full EvaluationData (input, output, trajectory, environment state, etc.) serialized as JSON.

Custom Stores

Implement the EvaluationDataStore protocol for any storage backend:

from strands_evals.evaluation_data_store import EvaluationDataStore
from strands_evals.types.evaluation import EvaluationData

class S3ResultStore:
    def __init__(self, bucket: str, prefix: str):
        self.bucket = bucket
        self.prefix = prefix

    def load(self, case_name: str) -> EvaluationData | None:
        # Fetch from S3, return None if not found
        ...

    def save(self, case_name: str, result: EvaluationData) -> None:
        # Upload to S3
        ...

The protocol requires just two methods: load(case_name) -> EvaluationData | None and save(case_name, result) -> None.

Task Decorator: Simplify task function boilerplate
Experiment Management: Save and load experiments
Serialization: Experiment serialization formats