[EAAPL-AGT005] Agent Checkpoint and Recovery

Category: Agentic AI
Sub-category: Reliability Architecture
Version: 1.4
Maturity: Proven
Tags: checkpoint, durable-execution, idempotency, recovery, workflow-orchestration, state-serialisation, human-pause-resume
Regulatory Relevance: APRA CPS 230 (Operational Resilience), ISO 22301 (BCM), NIST AI RMF (MANAGE 4.1), EU AI Act (Art. 9)

1. Executive Summary

The Agent Checkpoint and Recovery Pattern defines the durable execution architecture that enables AI agents running long-horizon tasks to survive infrastructure failures, model provider outages, and intentional human pauses without losing progress or re-executing actions that have already completed. Without checkpointing, a 45-minute agent execution that fails at iteration 38 must restart from the beginning — at full cost, with risk of re-triggering side effects (duplicate emails, duplicate payments, duplicate database records).

For CIO/CTO audiences: this pattern is the difference between a production-grade AI agent and a fragile experiment. In financial services, a reconciliation agent that re-executes after failure without idempotency protection could create duplicate transactions. In healthcare, a treatment plan agent that replays all its tool calls after a crash could submit duplicate orders. This pattern addresses both failure modes by ensuring that each side-effecting action takes effect only once (through idempotency keys and cached tool results), that failures are recoverable without losing checkpointed progress, and that humans can pause and resume agent tasks at defined checkpoints. It is a prerequisite for deploying agents on tasks where the consequence of re-execution is unacceptable.

2. Problem Statement

Business Problem

Long-horizon agent tasks — multi-document review, multi-step research, complex data processing — take minutes to hours. In any distributed system, infrastructure failures during this window are not exceptional events; they are expected occurrences. An agent with no recovery mechanism treats these failures as task failures, wasting all prior work and risking incorrect outcomes from partial re-execution.

Technical Problem

LLM agent loops are inherently stateful but execute on stateless infrastructure. The state — conversation history, tool call results, partial outputs, memory references — lives in process memory. A process crash loses all of it. Restarting the task from scratch without idempotency guarantees on tool calls causes duplicate side effects on non-idempotent external systems (APIs, databases, message queues).

Symptoms of Absence

Failed agent tasks require complete restart from scratch, consuming full token and compute budget again
Duplicate records appear in downstream systems after agent failures (duplicate emails, duplicate API calls)
Human-pause functionality does not exist; humans cannot safely interrupt a running agent without losing all progress
A single LLM provider timeout causes a cascading task failure with no recovery
Operations team has no visibility into how far through a long task a failed agent had progressed

Cost of Inaction

Financial: Re-execution from scratch on long tasks doubles or triples LLM token costs per failure event
Risk: Duplicate side effects from non-idempotent re-execution create data integrity issues and potential regulatory events
Operational: APRA CPS 230 requires RTO/RPO for material business services; an agent with no recovery cannot meet any realistic RTO
Human Oversight: Inability to pause and resume means human-in-the-loop controls cannot be applied mid-task

3. Context

When to Apply

Agent tasks that routinely exceed 5 minutes of wall-clock time
Tasks that invoke non-idempotent external systems (payment APIs, email APIs, database mutations)
Tasks that require human approval at intermediate steps (see EAAPL-MAG003)
Environments with regulated RTO requirements (APRA CPS 230, ISO 22301)
Tasks where partial results have value (a 90%-complete document review is useful even if the last 10% failed)

When NOT to Apply

Tasks that complete in under 60 seconds with a single LLM call and ≤3 tool calls (overhead not justified)
Fully idempotent tasks where re-execution from scratch is safe and cost-acceptable
Tasks where the checkpoint store introduces unacceptable latency on each iteration

Prerequisites

Durable, low-latency state store (Redis, DynamoDB, Cosmos DB, or equivalent) accessible from agent runtime
Idempotency key generation per tool call
Workflow orchestration integration (optional but recommended for complex flows)
Human approval queue infrastructure (if pause/resume is needed for HITL gates)

Industry Applicability

Industry	Use Case	Recovery RTO Requirement	Checkpointing Priority
Financial Services	Multi-document reconciliation, regulatory report generation	Minutes	Critical
Healthcare	Clinical summary generation, multi-system data aggregation	Minutes	Critical
Legal / Professional Services	Multi-contract review, due diligence	Hours	High
Technology / SaaS	Large codebase refactoring agent, multi-repo analysis	Hours	High
Government	Complex case assessment, multi-agency data aggregation	Hours	High

4. Architecture Overview

The Agent Checkpoint and Recovery Pattern introduces three core mechanisms to the baseline agent loop: state serialisation at each checkpoint, idempotency keys on all tool calls, and a recovery protocol that replays the logical execution plan while skipping already-completed actions.

Why checkpoint at every iteration rather than every N iterations? The answer is the cost of re-execution. Each LLM iteration consumes tokens, and each tool call may have side effects. If a failure loses N iterations, restoring state costs N × (token cost per iteration), plus the risk of repeating side effects. Checkpointing at every iteration ensures that at most one iteration is lost to recovery, regardless of when the failure occurs. For short iterations (sub-second), the checkpoint write overhead is small relative to LLM inference latency. For expensive iterations (multi-second tool calls), the per-iteration write is even easier to justify.
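As a rough illustration of this trade-off, the sketch below counts how many iterations must be redone after a crash for a given checkpoint interval. It assumes a checkpoint is written after every `checkpoint_every`-th completed iteration and that the in-flight iteration is always lost. The function names and the idea of a flat per-iteration token cost are hypothetical simplifications, not part of the pattern.

```python
def iterations_lost(failed_iteration: int, checkpoint_every: int) -> int:
    """Iterations to redo when the process dies during `failed_iteration` (1-indexed).

    Assumes a checkpoint is written after every `checkpoint_every`-th
    completed iteration.
    """
    if failed_iteration < 1 or checkpoint_every < 1:
        raise ValueError("both arguments must be >= 1")
    completed = failed_iteration - 1
    unsaved = completed % checkpoint_every  # completed since the last checkpoint
    return unsaved + 1                      # plus the iteration that was in flight


def replay_tokens(failed_iteration: int, checkpoint_every: int,
                  tokens_per_iteration: int) -> int:
    """Tokens spent again to get back to the point of failure (flat-cost assumption)."""
    return iterations_lost(failed_iteration, checkpoint_every) * tokens_per_iteration


if __name__ == "__main__":
    # Failure during iteration 38, as in the Executive Summary example.
    for interval in (1, 5, 10):
        print(interval, iterations_lost(38, interval))
```

Under these assumptions, a failure during iteration 38 redoes 1 iteration with per-iteration checkpointing, 3 with an interval of 5, and 8 with an interval of 10.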

State Serialisation

At the end of each loop iteration, the agent's execution state is serialised to a durable checkpoint store. The state object is a versioned JSON document containing: task_id, current iteration number, tool call history (IDs, arguments, results), memory references (episodic/semantic record IDs), current context window snapshot (or a reference to it), partial results, and metadata (timestamps, token consumption, cost so far). The serialisation is an atomic write: either the full state is written or nothing is written. Partial checkpoint states are detected and rejected during recovery.
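A minimal sketch of such a state document is shown below, using only the Python standard library. The field names follow the list above. The SHA-256 envelope used to detect truncated or corrupted checkpoints is an assumption of this example, one possible way to reject partial states, and all class and function names are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field

SCHEMA_VERSION = 1  # hypothetical version number for the state schema


class CorruptCheckpoint(Exception):
    """Raised when a stored checkpoint is truncated or fails its checksum."""


@dataclass
class CheckpointState:
    task_id: str
    iteration: int
    tool_history: list = field(default_factory=list)   # IDs, arguments, results
    memory_refs: list = field(default_factory=list)    # episodic/semantic record IDs
    context_ref: str = ""                              # snapshot or reference to it
    partial_results: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)       # timestamps, tokens, cost
    schema_version: int = SCHEMA_VERSION


def serialise(state: CheckpointState) -> str:
    """Wrap the JSON body in an envelope carrying its SHA-256 digest."""
    body = json.dumps(asdict(state), sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"sha256": digest, "body": body})


def deserialise(blob: str) -> CheckpointState:
    """Rebuild the state, rejecting anything that is not a complete checkpoint."""
    try:
        envelope = json.loads(blob)
        body = envelope["body"]
        expected = envelope["sha256"]
    except (ValueError, KeyError, TypeError) as exc:
        raise CorruptCheckpoint("unreadable checkpoint envelope") from exc
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != expected:
        raise CorruptCheckpoint("checksum mismatch")
    return CheckpointState(**json.loads(body))
```

The checksum only detects damage after the fact; atomicity of the write itself still has to come from the checkpoint store.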

Checkpoint Store Design

The checkpoint store must provide: durability (survives process and node failures), low write latency (ideally ≤10ms per checkpoint write, so that it does not dominate iteration latency), and support for conditional writes (compare-and-swap, CAS) to prevent concurrent checkpoint writes from two instances of the same task. Redis with AOF persistence, DynamoDB with conditional writes, or Azure Cosmos DB with optimistic concurrency are appropriate. For highest durability requirements, a write-ahead log pattern (checkpoint written to durable log first, then to fast store) provides strong guarantees.
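The conditional-write requirement can be illustrated with an in-memory stand-in. The sketch below is not a Redis, DynamoDB, or Cosmos DB client; it only mimics the compare-and-swap contract those stores would provide, using a per-task version counter. The class and method names are hypothetical.

```python
import threading
from typing import Dict, Optional, Tuple


class CasConflict(Exception):
    """Raised when another writer has already advanced the checkpoint version."""


class InMemoryCheckpointStore:
    """Illustrative stand-in for a durable store with conditional writes."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # task_id -> (version, serialised state)
        self._records: Dict[str, Tuple[int, str]] = {}

    def read(self, task_id: str) -> Tuple[int, Optional[str]]:
        """Return (version, blob); version 0 means no checkpoint exists yet."""
        with self._lock:
            return self._records.get(task_id, (0, None))

    def write(self, task_id: str, blob: str, expected_version: int) -> int:
        """Write only if the stored version still equals `expected_version`."""
        with self._lock:
            current, _ = self._records.get(task_id, (0, None))
            if current != expected_version:
                raise CasConflict(
                    f"task {task_id}: expected v{expected_version}, found v{current}"
                )
            self._records[task_id] = (current + 1, blob)
            return current + 1
```

If two instances of the same task both read version 0, only the first write succeeds; the second receives `CasConflict` and can abort rather than overwrite the winner's state.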

Idempotency Keys

Every tool call is issued with a unique idempotency key: a UUID generated when the call is first planned, persisted in the checkpoint state before the call is sent, and reused on replay. External APIs that support idempotency keys (for example Stripe, and other REST APIs that accept an idempotency header) deduplicate re-submissions with the same key, returning the original response rather than executing the operation again. For APIs that do not natively support idempotency keys, the checkpoint includes the tool result, and the recovery protocol skips re-calling the tool entirely, returning the stored result. This is the "result cache" pattern for recovery.
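The two behaviours described above (key reuse on replay and the result cache) can be sketched as follows. The `call_id` values such as `"iter-3/charge"` are a hypothetical naming scheme for logical tool calls, and `FakePaymentApi` is an invented stand-in for an external API that deduplicates on the key; neither comes from a real library.

```python
import uuid
from typing import Any, Callable, Dict, Optional


class IdempotencyManager:
    """Tracks one idempotency key and, once known, one result per logical tool call."""

    def __init__(self, records: Optional[Dict[str, dict]] = None) -> None:
        # call_id -> {"idempotency_key": str, "done": bool, "result": Any}
        # On recovery this dict would be restored from the checkpoint.
        self.records: Dict[str, dict] = records if records is not None else {}

    def key_for(self, call_id: str) -> str:
        """Return the key for a planned call, generating it only the first time."""
        record = self.records.setdefault(
            call_id,
            {"idempotency_key": str(uuid.uuid4()), "done": False, "result": None},
        )
        return record["idempotency_key"]

    def execute(self, call_id: str, tool: Callable[..., Any], **arguments: Any) -> Any:
        record = self.records.get(call_id)
        if record is not None and record["done"]:
            return record["result"]  # result cache: skip the external call
        key = self.key_for(call_id)
        # A real agent would persist self.records here, before sending the call.
        result = tool(idempotency_key=key, **arguments)
        self.records[call_id].update(done=True, result=result)
        return result


class FakePaymentApi:
    """Invented stand-in for an external API that deduplicates on the key."""

    def __init__(self) -> None:
        self.charges: Dict[str, dict] = {}

    def charge(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "charge_no": len(self.charges) + 1,
                "amount": amount,
            }
        return self.charges[idempotency_key]
```

The key is generated once per logical call and never regenerated, so a crash between sending the call and recording its result leads to a resubmission with the same key rather than a second side effect.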

Recovery Protocol

On task startup, the agent checks the checkpoint store for an existing checkpoint for the task_id. If found, it loads the checkpoint state, reconstructs the context (injecting the stored tool call history and partial results), and resumes from the iteration after the last checkpointed iteration. Tool calls in the history are marked as complete and their results are returned from the checkpoint rather than re-executed. This ensures that external systems see at-most-once execution semantics for non-idempotent calls.
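A minimal sketch of this startup check is shown below. To keep it self-contained, a plain dict stands in for the checkpoint store and a fixed list of step functions stands in for the LLM-driven loop; both are simplifying assumptions, and `run_task` is a hypothetical name.

```python
import json
from typing import Callable, List, MutableMapping


def run_task(task_id: str, store: MutableMapping[str, str],
             steps: List[Callable[[list], object]]) -> list:
    """Run `steps` in order, checkpointing after each one and resuming if possible.

    Each step receives the list of prior results and must return a
    JSON-serialisable value.
    """
    raw = store.get(task_id)
    if raw is None:
        # Fresh execution: no checkpoint found for this task_id.
        state = {"task_id": task_id, "iteration": 0, "results": [], "status": "running"}
    else:
        # Recovery: restore the state written after the last completed iteration.
        state = json.loads(raw)
        if state["status"] == "complete":
            return state["results"]

    # Resume at the iteration after the last checkpointed one.
    for index in range(state["iteration"], len(steps)):
        result = steps[index](state["results"])
        state["results"].append(result)
        state["iteration"] = index + 1
        store[task_id] = json.dumps(state)  # checkpoint after every iteration

    state["status"] = "complete"
    store[task_id] = json.dumps(state)
    return state["results"]
```

Calling `run_task` again with the same `task_id` after a crash skips every step whose result is already in the checkpoint and re-runs only the step that was in flight.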

Human Pause and Resume

The checkpoint mechanism naturally supports human-controlled pause and resume. When a human sends a pause signal (via the management API), the Pause Controller sets a pause flag in the task state. The Termination Controller checks this flag at each iteration boundary and, when set, writes a checkpoint with status: paused and stops execution. The task remains in the checkpoint store, frozen in time. When the human sends a resume signal, the agent restarts from the paused checkpoint exactly as if recovering from a failure, with all prior context intact. This enables safe human review of partial results and context injection before resumption.
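The flag check at the iteration boundary can be sketched in a few lines. The state layout, the `injected_context` field, and all function names below are hypothetical; `save` stands in for the checkpoint writer.

```python
from typing import Callable, List, Optional


def new_state(task_id: str) -> dict:
    return {"task_id": task_id, "iteration": 0, "results": [],
            "injected_context": [], "pause_requested": False, "status": "running"}


def request_pause(state: dict) -> None:
    """What the Pause Controller would do on a human pause signal."""
    state["pause_requested"] = True


def resume(state: dict, human_note: Optional[str] = None) -> None:
    """Clear the flag and optionally inject human context before continuing."""
    state["pause_requested"] = False
    state["status"] = "running"
    if human_note is not None:
        state["injected_context"].append(human_note)


def run(state: dict, steps: List[Callable[[dict], object]],
        save: Callable[[dict], None]) -> dict:
    while state["iteration"] < len(steps):
        # Iteration boundary: honour a pending pause before starting more work.
        if state["pause_requested"]:
            state["status"] = "paused"
            save(state)
            return state
        step = steps[state["iteration"]]
        state["results"].append(step(state))
        state["iteration"] += 1
        save(state)
    # A pause requested during the final iteration has nothing left to stop.
    state["status"] = "complete"
    save(state)
    return state
```

Because the pause is only honoured between iterations, the paused checkpoint never captures a half-finished tool call, and resuming is the same code path as recovery.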

Workflow Orchestration Integration

For complex multi-stage agent tasks (where the agent itself is a step in a larger workflow), this pattern integrates with workflow orchestration engines. Temporal and Azure Durable Functions provide built-in state persistence and replay-safe execution semantics at the workflow level. In this mode, the agent loop is implemented as a Temporal Workflow or Durable Function, and the platform handles checkpointing automatically via its event-sourced execution model. This is the preferred implementation for tasks that compose multiple agent instances or require saga-style compensation logic.

5. Architecture Diagram

flowchart TD
    subgraph Input["Input Layer"]
        A[Task Request]
        B[Human Control API]
    end
    subgraph Core["Agent Execution Core"]
        C[Task Initialiser]
        D{Checkpoint Exists?}
        E[Agent Loop]
        F[Idempotency Manager]
    end
    subgraph Storage["State Storage"]
        G[(Checkpoint Store)]
        H[(Audit Log)]
    end
    subgraph Output["Output Layer"]
        I[Final Output]
        J[Paused State]
    end
    A --> C
    B -->|pause/resume| G
    C --> D
    D -->|found| E
    D -->|not found| E
    E --> F
    F -->|cached result| E
    F -->|new call + save| G
    G -->|restore state| E
    E -->|complete| I
    E -->|paused| J
    J -->|resume| E
    E --> H
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#fee2e2,stroke:#ef4444

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Checkpoint Store	Durable State Store	Stores serialised task state per iteration; supports CAS writes; low latency	Redis (AOF) + conditional SET, DynamoDB (condition expressions), Azure Cosmos DB (ETag CAS)	Critical
Task Initialiser	Orchestration	Checks for existing checkpoint; routes to restore or fresh execution	Custom; part of agent framework	Critical
State Restorer	Orchestration	Loads checkpoint; reconstructs context; marks completed tool calls	Custom; integrated into agent loop	Critical
Idempotency Key Manager	Reliability	Generates and stores UUID idempotency keys per tool call; retrieves keys on replay	Custom; UUID v4 generation; stored in checkpoint state	Critical
Checkpoint Writer	Persistence	Atomically serialises and writes state after each iteration	Custom; Redis SETNX + pipeline; DynamoDB PutItem with condition	Critical
Pause Controller	Human Control	Sets pause flag in checkpoint state on human signal; ensures clean checkpoint before stop	Custom management API	High
Retry Controller	Reliability	Implements exponential backoff retry policy on transient failures; respects max retry limit	Custom; Temporal retry policy; AWS Step Functions	High
Management API	Operations	Exposes pause, resume, cancel, and status endpoints for human operators	REST API; FastAPI, Express, Azure Functions	High
Task Status Viewer	Operations	Reads checkpoint store to display current task state, progress, and cost	Dashboard UI; custom + Grafana	Medium
Temporal / Durable Functions Engine	Workflow Orchestration	Provides event-sourced durable execution natively (optional but recommended)	Temporal OSS, Temporal Cloud, Azure Durable Functions, AWS Step Functions	High (if used)
Audit Log	Compliance	Records checkpoint writes, restores, pauses, and resumes with timestamps	WORM store: S3 Object Lock, Azure Immutable Blob	Critical

7. Data Flow

Fresh Execution with Checkpointing

Step	Actor	Action	Output
1	Calling System	Submits task with unique task_id	Task queued
2	Task Initialiser	Queries checkpoint store for task_id	No checkpoint found
3	Agent Loop	Executes iteration 1: context assembly → plan → tool call → result	Iteration 1 result
4	Idempotency Manager	Generates UUID idempotency key for tool call; stores in pending state	Keyed tool call record
5	Checkpoint Writer	Atomically writes state: `{task_id, iteration: 1, tool_history: [{tool_id, idempotency_key, result}], partial_output, token_count}`	Checkpoint record v1
6	Agent Loop	Continues for iterations 2..N	Checkpoint written after each iteration
7	Termination	Task completes; final output returned; checkpoint marked `status: complete`	Final output

Recovery from Mid-Task Failure

Step	Actor	Action	Output
1	Retry Controller	Detects failure; initiates recovery after backoff	Recovery signal
2	Task Initialiser	Queries checkpoint store for task_id	Checkpoint found at iteration K
3	State Restorer	Loads checkpoint; reconstructs context with full tool call history up to iteration K	Restored context
4	Idempotency Manager	For each tool call in history with a stored result: marks as complete, returns cached result	Cached results injected
5	Agent Loop	Resumes from iteration K+1; LLM receives full context including all prior tool results	Execution continues
6	External API	If iteration K+1 tool call has idempotency key already stored (sent before failure): API returns original response	No duplicate side effect

Error Flow

Error	Detection	Recovery
Checkpoint write failure (store unavailable)	Write exception; circuit breaker	Retry write with backoff; if checkpoint store unavailable for > threshold, abort task cleanly; alert
Checkpoint CAS failure (concurrent write)	Conditional write rejection	Indicates duplicate execution; one instance wins; other aborts; coordination via distributed lock
Idempotency key not accepted by external API	HTTP 422 / API-specific error	Log; attempt with new key if API behaviour permits; escalate if duplicate detected
State deserialisation failure on restore	Schema version mismatch	Versioned state schema; migration function for minor versions; fresh execution if major version mismatch
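The first row of the error flow (retry the write with backoff, then abort cleanly once the store has been unavailable for too long) might look like the sketch below. The attempt count, base delay, and the choice of `ConnectionError` as the transient failure type are assumptions of this example; `write` is any zero-argument callable that performs the checkpoint write.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CheckpointStoreUnavailable(Exception):
    """Raised when the write still fails after the final retry."""


def write_with_backoff(write: Callable[[], T], max_attempts: int = 5,
                       base_delay: float = 0.05,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `write`, retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return write()
        except ConnectionError as exc:
            if attempt == max_attempts - 1:
                # Threshold reached: let the caller abort the task and alert.
                raise CheckpointStoreUnavailable(
                    f"gave up after {max_attempts} attempts"
                ) from exc
            sleep(base_delay * (2 ** attempt))  # base, 2x base, 4x base, ...
    raise ValueError("max_attempts must be >= 1")
```

Passing `sleep` as a parameter keeps the function testable without real delays; a production version would typically also add jitter to the delay.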

8. Security Considerations

Checkpoint State Protection

Checkpoint state may contain sensitive intermediate tool results (customer data, partial financial records); it must be encrypted at rest with a customer-managed key (CMK)
Checkpoint store access is restricted to the agent service identity; human operators can view task status via the management API but cannot read raw checkpoint state without elevated access
Checkpoint states are automatically expired after the task retention period; no indefinite accumulation of sensitive data

Idempotency Key Exposure

Idempotency keys must be treated as sensitive: if an attacker can obtain the key for a payment API call, they could potentially replay or probe the API
Keys are stored in the encrypted checkpoint state, not in logs or observable metadata

OWASP LLM Top 10

OWASP LLM Risk	Checkpoint Relevance	Mitigation
LLM08 Excessive Agency	Recovery replay could re-execute a previously blocked action if policy changed after checkpoint	Policy check is re-evaluated at each iteration after restore, not skipped based on checkpoint history
LLM01 Prompt Injection	Checkpoint state could contain injected content from a compromised tool result	Content validation applied to restored context before injection into LLM prompt
LLM06 Sensitive Information Disclosure	Checkpoint state contains intermediate sensitive data	Encryption at rest; access controls on checkpoint store; expiry policy

9. Governance Considerations

Audit Trail

Every checkpoint write, restore, pause, resume, and cancel event is recorded in the immutable audit log
The audit trail enables reconstruction of a complete task execution timeline, including any human interventions
For regulated tasks (financial, clinical), the checkpoint audit trail is a material compliance artefact

Human Override Records

Every pause, resume, and cancel action through the management API is recorded with the human operator's identity, timestamp, and justification
Context injected at resume points (human feedback, corrected data) is appended to the task audit trail

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Task Execution Audit Trail	Platform Engineering	Per task	Complete timeline of execution, checkpoints, human interventions
Recovery Incident Log	Operations	Per recovery event	Records failure, recovery attempt, outcome, and any re-execution anomalies
Idempotency Violation Report	Operations	Monthly	Documents any detected duplicate side effects; root cause analysis
Checkpoint Store Capacity Report	Platform Engineering	Monthly	Storage growth, TTL expiry rates, capacity planning

10. Operational Considerations

SLOs

SLO	Target	Window	Alert
Checkpoint write latency	≤ 15ms p95	1-hour rolling	> 50ms triggers P2
Recovery time (from failure detection to resumed execution)	≤ 60 seconds	Per event	> 5 minutes triggers P1
Task completion rate (including recovered tasks)	≥ 98%	24-hour rolling	< 95% triggers P2
Checkpoint store availability	99.99%	Monthly	Any degradation triggers P1

Monitoring

Checkpoint write success/failure rate per task_id
Task recovery event rate — spike indicates infrastructure instability
Average iterations per task — significant increase may indicate agent getting stuck in recovery loops
Cost per task with recovery events — quantifies the financial impact of failures

DR and Capacity

DR Tier	Checkpoint Store Config	RTO	RPO
Standard	Redis with AOF + daily snapshot to S3	5 min	1 iteration (1 checkpoint interval)
Enhanced	DynamoDB Global Tables (multi-region)	< 1 min	Near-zero (cross-region replication is asynchronous by default, typically with about a second of lag)
Premium	Temporal Cloud (fully managed)	< 30 sec	Zero (event-sourced; no data loss)

11. Cost Considerations

Cost Drivers

Cost Driver	Scaling Behaviour	Control Lever
Checkpoint store writes	Linear with tasks × iterations per task	Checkpoint every N iterations instead of every 1 (trade recovery granularity for cost)
Checkpoint state storage	Linear with concurrent tasks × state size × retention period	Compress checkpoint state; aggressive TTL on completed tasks
Recovery LLM calls	Proportional to failure rate × iterations replayed	Improve infrastructure reliability to reduce failure rate; idempotent cached results avoid re-inference

Indicative Cost Impact

Scenario	Cost Impact vs. No Checkpoint
0% failure rate, 100% completion	+2–5% (checkpoint write overhead only)
5% failure rate, recovery from checkpoint	Break-even to 10% savings (avoid full re-execution cost)
10% failure rate without recovery	2–3× cost for each failed task (full re-execution, plus risk of duplicate side effects)

12. Trade-Off Analysis

Checkpointing Strategy Options

Option	Granularity	Write Overhead	Recovery Granularity	Best For
A: Per-Iteration (Recommended)	Every loop iteration	Low (Redis ≤15ms)	Maximum — at most 1 iteration lost	Tasks with expensive or side-effectful iterations
B: Per-N-Iterations	Every N iterations	Proportionally lower	At most N iterations lost	Short, cheap iterations where overhead matters
C: Workflow Engine Native	Platform-managed (Temporal/Durable Functions)	Platform overhead	No completed step lost (event-sourced replay)	Complex multi-stage workflows; regulated workloads
D: No Checkpointing (Idempotent Tasks Only)	N/A — full re-execution	Zero	Full re-execution from start	Fully idempotent tasks where re-execution is safe and cheap

Architectural Tensions

Tension	Left Pole	Right Pole	Balance
Recovery granularity vs. Checkpoint overhead	Checkpoint after every action (microseconds apart)	Checkpoint only at major milestones	Per-iteration checkpointing is the practical optimum
Idempotency key lifespan vs. Storage cost	Keep keys indefinitely	Expire after task completion	Expire with task retention period; dedup window for external APIs is typically 24h
Human pause flexibility vs. Task complexity	Allow pause at any iteration	Only allow pause at defined milestones	Pause at any iteration for safety; display milestone progress for human context

13. Failure Modes

Failure Mode	Likelihood	Impact	Detection	Recovery
Checkpoint store failure during write	Low	Medium — last iteration lost; replay from prior checkpoint	Write exception + circuit breaker	Recover from last successful checkpoint; at most 1 iteration re-executed
Duplicate task execution (two instances race)	Low	High — duplicate side effects if not protected by idempotency keys	CAS conflict on checkpoint write	One instance wins CAS; other detects conflict and aborts
Checkpoint state corruption	Very Low	High — task cannot recover; must restart from scratch	Deserialisation failure	Versioned schema migration; if unrecoverable, fresh start with duplicate side effect audit
Idempotency key rejected by external API	Low	Medium — tool call may not complete	API error code	Log and investigate; if API does not support idempotency, implement result cache in checkpoint
Recovery loop (agent stuck, keeps recovering)	Low	High — cost overrun	Recovery attempt count in checkpoint; alert after N recoveries	Kill task after max recovery attempts; alert; human review
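The recovery-loop guard in the last row can be sketched as a counter stored in the checkpoint state. The field name `recovery_attempts`, the limit of 3, and the function name are hypothetical choices for this example.

```python
MAX_RECOVERY_ATTEMPTS = 3  # hypothetical limit; tune per task class


class RecoveryLimitExceeded(Exception):
    """Raised when a task has been recovered too many times."""


def begin_recovery(state: dict, max_attempts: int = MAX_RECOVERY_ATTEMPTS) -> dict:
    """Count this recovery in the checkpoint state and stop runaway loops.

    The counter lives in the state so that it survives each crash along
    with the rest of the checkpoint.
    """
    attempts = state.get("recovery_attempts", 0) + 1
    state["recovery_attempts"] = attempts
    if attempts > max_attempts:
        state["status"] = "failed"  # kill the task; a human reviews it
        raise RecoveryLimitExceeded(
            f"task {state.get('task_id')} exceeded {max_attempts} recovery attempts"
        )
    return state
```

The caller would write the updated state back to the checkpoint store before resuming, and raise an alert when `RecoveryLimitExceeded` is caught.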

14. Regulatory Considerations

APRA CPS 230 (Operational Resilience)

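CPS 230 requires tolerance levels for disruption to critical operations, including maximum disruption time and maximum data loss (in practice, documented RTO/RPO); this pattern enables RTO measured in seconds/minutes for agent-powered services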
Recovery testing is required; the checkpoint restore path must be exercised in DR testing

ISO 22301 (Business Continuity)

Agent checkpoint/recovery maps to §8.4 (Business Continuity Plans); the restore protocol is the BCM procedure for agent task failure

EU AI Act

Art. 9 (Risk Management): the ability to safely pause and inspect a running agent implements a key risk management control
Art. 14 (Human Oversight): the pause/resume mechanism is a direct implementation of the human oversight requirement — humans can halt agent execution at any point without losing task progress

NIST AI RMF

MANAGE 4.1: The recovery protocol and idempotency design implement the incident response and recovery management requirement

15. Reference Implementations

AWS

Component	Service
Checkpoint Store	Amazon DynamoDB (conditional PutItem; strong consistency)
Workflow Engine	AWS Step Functions (built-in state persistence)
Management API	AWS API Gateway + Lambda
Audit Log	AWS CloudTrail + S3 Object Lock

Azure

Component	Service
Checkpoint Store	Azure Cosmos DB (ETag-based optimistic concurrency)
Workflow Engine	Azure Durable Functions (event-sourced; built-in checkpointing)
Management API	Azure Functions + API Management

GCP

Component	Service
Checkpoint Store	Cloud Spanner (strong consistency; CAS transactions)
Workflow Engine	Cloud Workflows + Cloud Run
Management API	Cloud Run API endpoint

On-Premises

Component	Technology
Checkpoint Store	Redis 7+ with AOF persistence (fsync always for highest durability)
Workflow Engine	Temporal OSS (self-hosted on Kubernetes)
Management API	FastAPI on Kubernetes

16. Related Patterns

Pattern	ID	Relationship Type	Notes
Single Agent Pattern	EAAPL-AGT001	Extends	Checkpoint adds durable state to the base agent loop
Stateful Agent Memory	EAAPL-AGT002	Integrates With	Checkpoint references memory record IDs; restore reloads referenced memories
Long-Running Agent	EAAPL-AGT007	Depends On	Long-running agent requires checkpointing as a foundational capability
Human-in-the-Loop Agent	EAAPL-MAG003	Integrates With	Pause/resume mechanism is the implementation enabler for HITL gates mid-task
Agent Handoff Protocol	EAAPL-MAG006	Peer	Handoff payload includes checkpoint state for seamless context transfer between agents

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Evidence
Checkpointing Technology Maturity	5	Redis, DynamoDB, Cosmos DB battle-tested at hyperscale; Temporal/Durable Functions proven
Idempotency Pattern Maturity	5	Industry-standard pattern (Stripe, Twilio, etc.) with established implementation guidance
AI-Agent-Specific Integration	3	Agent framework integration (LangGraph, Temporal) maturing; some custom implementation still required
Human Pause/Resume UX	3	Management API well-defined; UI tooling for operator visibility still developing
Regulatory Evidence	4	CPS 230 and ISO 22301 mapping well-established; audit trail design proven

18. Revision History

Version	Date	Author	Changes
1.0	2024-03-15	Architecture Board	Initial publication
1.1	2024-06-01	Platform Engineering	Added Temporal integration option; idempotency key management detail
1.2	2024-09-15	Reliability Engineering	Added DR tiers; SLO table; recovery loop detection failure mode
1.3	2025-01-05	Architecture Board	Added EU AI Act Art. 14 mapping; pause/resume governance artefacts
1.4	2025-04-20	Reliability Engineering	Added CAS conflict resolution; Cosmos DB ETag implementation reference
