Checkpointing
The checkpoint system enables durable agent execution by allowing agents to pause at safe cycle boundaries, persist their state externally, and resume later — potentially in a fresh process after a crash. When a checkpoint is emitted, the agent stops its loop and returns a serializable snapshot of its state. The user persists the snapshot anywhere they choose (database, workflow engine, object storage) and passes it back to resume. The agent rebuilds its state from the snapshot and continues execution from the same cycle boundary. Checkpoints fire at two positions per ReAct cycle — after the model call (before tools execute) and after tool execution (before the next model call) — both controlled by a single flag on the agent. The general flow looks as follows:
flowchart TD A[Invoke Agent] --> B[Model Call] B --> C{stop_reason?} C -->|end_turn| Z[Return Result] C -->|tool_use| CP1["`**Checkpoint** after_model`"] CP1 -->|checkpointing=False<br/>or cancelled| E[Execute Tools] CP1 -->|checkpointing=True| P[Persist + Return<br/>stop_reason=checkpoint] P --> R[Resume with<br/>checkpointResume block] R --> RS[Restore State<br/>from snapshot] RS --> E E --> IX{Interrupt<br/>raised?} IX -->|Yes| IR[Return stop_reason=interrupt<br/>no checkpoint emitted] IX -->|No| CP2["`**Checkpoint** after_tools`"] CP2 -->|checkpointing=False<br/>or cancelled| B CP2 -->|checkpointing=True| P
classDef checkpoint fill:#fef3c7,stroke:#b45309,stroke-width:2px,color:#1f2937; class CP1,CP2 checkpoint;Checkpoints only bracket tool execution. An agent whose model call returns end_turn completes the cycle without emitting a checkpoint. After an after_model resume, state is restored and tools execute next. After an after_tools resume, state is restored and control loops back to the next model call — the restored messages already contain the tool results from the previous cycle, so those tools are not re-executed.
Basic Usage
Section titled “Basic Usage”Enable checkpointing by passing checkpointing=True when constructing the agent. The agent pauses at cycle boundaries and returns an AgentResult with stop_reason="checkpoint" and a populated checkpoint field. To resume, pass the serialized checkpoint back as a checkpointResume content block.
import json
from strands import Agent, tool
@tooldef get_weather(city: str) -> str: """Return the current weather for the given city.""" return "sunny"
@tooldef get_forecast(city: str) -> str: """Return the weather forecast for the given city.""" return "clear skies"
agent = Agent( tools=[get_weather, get_forecast], system_prompt="When asked about weather, call the get_weather and get_forecast tools.", checkpointing=True, callback_handler=None,)
result = agent("What is the weather and forecast for Seattle?")
while result.stop_reason == "checkpoint": # Persist anywhere: database, workflow engine, object storage, etc. persisted = json.dumps(result.checkpoint.to_dict())
# Later (potentially in a fresh process), reload and resume. checkpoint_dict = json.loads(persisted) result = agent([{"checkpointResume": {"checkpoint": checkpoint_dict}}])
print(f"MESSAGE: {json.dumps(result.message)}")The key invariant: completed tool calls are never re-run on resume. At an after_tools checkpoint, the tool result message has already been appended to the agent’s messages and captured in the snapshot, so the resumed agent’s next model call sees the results and continues rather than retrying the tools.
⚠️ Checkpoints only emit when the model chooses to call a tool. If the model answers directly from its own knowledge, the loop exits with
stop_reason="end_turn"on the first call and thewhileloop runs zero times. Encourage tool use through prompt and tool docstrings when you want to observe checkpointing end-to-end.
Components
Section titled “Components”Checkpointing in Strands is comprised of the following components:
checkpointing=True- Enables checkpoint emission at cycle boundaries. Defaults toFalse, so existing agents see no behavioral change.result.stop_reason- Check if the agent stopped due to"checkpoint".result.checkpoint- TheCheckpointobject containing the captured state.checkpoint.position- Either"after_model"or"after_tools", indicating which boundary fired.checkpoint.cycle_index- Zero-based index of the ReAct cycle that emitted the checkpoint.checkpoint.to_dict()- Serialize to a JSON-compatible dict for persistence.Checkpoint.from_dict(data)- Reconstruct aCheckpointfrom a persisted dict. RaisesCheckpointExceptionon schema version mismatch.
checkpointResume- Content block type for resuming from a persisted checkpoint.- The
checkpointfield must be a dict produced bycheckpoint.to_dict(). - Must be the only content block in the prompt list.
- The
Strands enforces the following rules for checkpointing:
- Checkpoints are only emitted on tool-use cycles. A cycle whose model call returns
end_turnskips both boundaries and returns normally — an agent whose first call returnsend_turncompletes with no checkpoint at all. - A resume prompt must contain exactly one
checkpointResumecontent block and nothing else. Mixing with other content types raises aTypeError, and multiple blocks in the same prompt also raise aTypeError. - The
checkpointResumeblock must contain acheckpointkey; otherwise aKeyErroris raised. - Passing a
checkpointResumeblock to an agent created withcheckpointing=Falseraises aValueError. - A checkpoint whose
schema_versiondoes not match the current SDK raises aCheckpointExceptionwhen loaded byCheckpoint.from_dict. - Interrupts take precedence over checkpoints. If a tool raises an interrupt during a checkpointing cycle, the agent returns
stop_reason="interrupt"and theafter_toolscheckpoint for that cycle is not emitted. - Cancellation takes precedence over checkpoints. A cancel signal set at either checkpoint boundary suppresses emission and returns
stop_reason="cancelled". - If
request_state["stop_event_loop"]or a structured output stop signal fires after tools execute, the loop stops before theafter_toolscheckpoint and returns the normal stop reason rather than"checkpoint".
Fresh Agent Instances
Section titled “Fresh Agent Instances”The primary motivation for checkpointing is surviving process-level failures. The pattern: persist the checkpoint to durable storage on pause, build a new Agent instance on resume (same model, same tools, same system prompt), and pass the persisted checkpoint back.
import json
from strands import Agent, tool
@tooldef expensive_lookup(key: str) -> str: # Implementation here return "result"
def build_agent() -> Agent: return Agent( tools=[expensive_lookup], checkpointing=True, callback_handler=None, )
def save(checkpoint_dict: dict) -> None: # Persist to database, blob storage, workflow event history, etc. pass
def load() -> dict | None: # Read from durable storage. return None
# --- First process ---agent = build_agent()result = agent("Look up several keys and summarize the results")
if result.stop_reason == "checkpoint": save(result.checkpoint.to_dict())
# Process crashes, is restarted, etc. No agent state survives.
# --- Fresh process ---checkpoint_dict = load()if checkpoint_dict is not None: agent = build_agent() result = agent([{"checkpointResume": {"checkpoint": checkpoint_dict}}]) # Tools that already ran before the crash are not re-invoked. print(f"MESSAGE: {json.dumps(result.message)}")Components
Section titled “Components”Crash recovery builds on the core checkpointing components plus:
checkpoint.to_dict()/Checkpoint.from_dict()- The serialization boundary. The output is JSON-compatible, so any storage backend that accepts structured data works.- Deterministic agent construction - The resuming agent must be configured with the same tools, model, and system prompt as the original. State travels in the checkpoint; configuration does not.
Crash-resilient checkpointing follows these additional rules:
- Snapshot data is opaque. Treat the output of
checkpoint.to_dict()as a black box — do not modify it between persist and resume. EventLoopMetricsresets on each resume call. If you need aggregate metrics across a durable run, accumulate them yourself at the persistence layer.BeforeInvocationEventandAfterInvocationEventfire on every resume call, matching interrupt semantics. Hooks counting invocations will see each resume as a separate invocation.- Streaming callbacks do not re-emit on replay. Only the post-resume cycle streams to the callback handler.
Limitations
Section titled “Limitations”Checkpointing is experimental and has the following limitations in its current form:
- At
after_tools,result.messageis the assistant message that requested the tools. Tool results live insidecheckpoint.snapshotand are restored on resume. OpenAIResponsesModel(stateful=True)is not supported. The server-sideresponse_idis not captured in the snapshot.- Per-tool granularity within a single cycle is not supported. Checkpoints operate at cycle boundaries; if a cycle contains multiple tools and one crashes mid-execution, the entire cycle re-runs on resume. For true per-tool durability, use a custom
ToolExecutorthat routes each tool to a separate durable activity.
Interrupts
Section titled “Interrupts”Checkpointing composes with the interrupt system. An agent can use both in the same run: interrupts handle human-in-the-loop pauses, checkpoints handle worker-level durability. When both would fire in the same cycle, interrupts take precedence — the agent returns stop_reason="interrupt" and the after_tools checkpoint for that cycle is not emitted. Once all interrupts are resolved, the agent resumes normally and the next cycle can emit a checkpoint as usual.