[EAAPL-AGT005] Agent Checkpoint and Recovery
Category: Agentic AI
Sub-category: Reliability Architecture
Version: 1.4
Maturity: Proven
Tags: checkpoint, durable-execution, idempotency, recovery, workflow-orchestration, state-serialisation, human-pause-resume
Regulatory Relevance: APRA CPS 230 (Operational Resilience), ISO 22301 (BCM), NIST AI RMF (MANAGE 4.1), EU AI Act (Art. 9, Art. 14)
1. Executive Summary
The Agent Checkpoint and Recovery Pattern defines the durable execution architecture that enables AI agents running long-horizon tasks to survive infrastructure failures, model provider outages, and intentional human pauses without losing progress or re-executing actions that have already completed. Without checkpointing, a 45-minute agent execution that fails at iteration 38 must restart from the beginning — at full cost, with risk of re-triggering side effects (duplicate emails, duplicate payments, duplicate database records).
For CIO/CTO audiences: this pattern is the difference between a production-grade AI agent and a fragile experiment. In financial services, a reconciliation agent that re-executes after failure without idempotency protection could create duplicate transactions. In healthcare, a treatment plan agent that replays all its tool calls after a crash could submit duplicate orders. This pattern addresses both failure modes by ensuring that each side-effecting action takes effect only once (completed calls are served from the checkpoint, and in-flight calls are retried under their original idempotency key), that failures are recoverable without losing checkpointed progress, and that humans can pause and resume agent tasks at defined checkpoints. It is a prerequisite for deploying agents on tasks where the consequence of re-execution is unacceptable.
2. Problem Statement
Business Problem
Long-horizon agent tasks — multi-document review, multi-step research, complex data processing — take minutes to hours. In any distributed system, infrastructure failures during this window are not exceptional events; they are expected occurrences. An agent with no recovery mechanism treats these failures as task failures, wasting all prior work and risking incorrect outcomes from partial re-execution.
Technical Problem
LLM agent loops are inherently stateful but execute on stateless infrastructure. The state — conversation history, tool call results, partial outputs, memory references — lives in process memory. A process crash loses all of it. Restarting the task from scratch without idempotency guarantees on tool calls causes duplicate side effects on non-idempotent external systems (APIs, databases, message queues).
Symptoms of Absence
- Failed agent tasks require complete restart from scratch, consuming full token and compute budget again
- Duplicate records appear in downstream systems after agent failures (duplicate emails, duplicate API calls)
- Human-pause functionality does not exist; humans cannot safely interrupt a running agent without losing all progress
- A single LLM provider timeout causes a cascading task failure with no recovery
- Operations team has no visibility into how far through a long task a failed agent had progressed
Cost of Inaction
- Financial: Re-execution from scratch on long tasks doubles or triples LLM token costs per failure event
- Risk: Duplicate side effects from non-idempotent re-execution create data integrity issues and potential regulatory events
- Operational: APRA CPS 230 requires tolerance levels for disruption to critical operations (maximum outage time and data loss, commonly expressed as RTO/RPO); an agent with no recovery cannot meet any realistic RTO
- Human Oversight: Inability to pause and resume means human-in-the-loop controls cannot be applied mid-task
3. Context
When to Apply
- Agent tasks that routinely exceed 5 minutes of wall-clock time
- Tasks that invoke non-idempotent external systems (payment APIs, email APIs, database mutations)
- Tasks that require human approval at intermediate steps (see EAAPL-MAG003)
- Environments with regulated RTO requirements (APRA CPS 230, ISO 22301)
- Tasks where partial results have value (a 90%-complete document review is useful even if the last 10% failed)
When NOT to Apply
- Tasks that complete in under 60 seconds with a single LLM call and ≤3 tool calls (overhead not justified)
- Fully idempotent tasks where re-execution from scratch is safe and cost-acceptable
- Tasks where the checkpoint store introduces unacceptable latency on each iteration
Prerequisites
- Durable, low-latency state store (Redis, DynamoDB, Cosmos DB, or equivalent) accessible from agent runtime
- Idempotency key generation per tool call
- Workflow orchestration integration (optional but recommended for complex flows)
- Human approval queue infrastructure (if pause/resume is needed for HITL gates)
Industry Applicability
| Industry | Use Case | Recovery RTO Requirement | Checkpointing Priority |
|---|---|---|---|
| Financial Services | Multi-document reconciliation, regulatory report generation | Minutes | Critical |
| Healthcare | Clinical summary generation, multi-system data aggregation | Minutes | Critical |
| Legal / Professional Services | Multi-contract review, due diligence | Hours | High |
| Technology / SaaS | Large codebase refactoring agent, multi-repo analysis | Hours | High |
| Government | Complex case assessment, multi-agency data aggregation | Hours | High |
4. Architecture Overview
The Agent Checkpoint and Recovery Pattern introduces three core mechanisms to the baseline agent loop: state serialisation at each checkpoint, idempotency keys on all tool calls, and a recovery protocol that replays the logical execution plan while skipping already-completed actions.
Why checkpoint at every iteration rather than every N iterations?
The answer is the cost of re-execution. Each LLM iteration consumes tokens; each tool call may have side effects. Re-executing N iterations to restore state costs N × (token cost + potential side effects). Checkpointing at every iteration ensures that at most one iteration is ever lost to recovery, regardless of when the failure occurs. For short iterations (sub-second), the checkpoint write overhead is small relative to the LLM inference latency. For expensive iterations (multi-second tool calls), each checkpoint write is even more justified.
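As a rough illustration of the granularity argument, the sketch below counts how many iterations must be re-run after a failure for a given checkpoint interval. The function name and the assumption that a checkpoint is written after every `checkpoint_every` completed iterations are illustrative, not part of the pattern specification.

```python
def iterations_to_rerun(failed_iteration: int, checkpoint_every: int) -> int:
    """Iterations re-executed when the task dies during `failed_iteration` (1-based).

    Assumes a checkpoint is written after every `checkpoint_every` completed
    iterations, and that the in-flight iteration always has to be repeated.
    """
    if failed_iteration < 1 or checkpoint_every < 1:
        raise ValueError("both arguments must be positive")
    completed = failed_iteration - 1
    lost_completed = completed % checkpoint_every  # completed since the last checkpoint
    return lost_completed + 1                      # plus the in-flight iteration


if __name__ == "__main__":
    for interval in (1, 5, 10):
        print(interval, iterations_to_rerun(38, interval))
```

For the failure at iteration 38 used in the Executive Summary, per-iteration checkpointing re-runs 1 iteration, an interval of 5 re-runs 3, and an interval of 10 re-runs 8.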
State Serialisation
At the end of each loop iteration, the agent's execution state is serialised to a durable checkpoint store. The state object is a versioned JSON document containing: task_id, current iteration number, tool call history (IDs, arguments, results), memory references (episodic/semantic record IDs), current context window snapshot (or a reference to it), partial results, and metadata (timestamps, token consumption, cost so far). The serialisation is an atomic write — either the full state is written or nothing is written; partial checkpoint states are detected and rejected during recovery.
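A minimal sketch of such a state document follows, assuming a SHA-256 digest is stored alongside the JSON body so that a truncated or corrupted checkpoint can be rejected on load. The class name, field names, and envelope layout are hypothetical; the digest only detects partial writes and does not by itself make the write atomic, which remains the store's job.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field

SCHEMA_VERSION = 1  # hypothetical version number for this sketch


@dataclass
class CheckpointState:
    task_id: str
    iteration: int
    tool_history: list = field(default_factory=list)  # [{tool_id, idempotency_key, result}]
    memory_refs: list = field(default_factory=list)   # episodic/semantic record IDs
    context_ref: str = ""                             # reference to the context snapshot
    partial_output: str = ""
    token_count: int = 0
    status: str = "running"
    schema_version: int = SCHEMA_VERSION


def serialise(state: CheckpointState) -> str:
    """Return a JSON envelope holding the state body and its SHA-256 digest."""
    body = json.dumps(asdict(state), sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"body": body, "sha256": digest})


def deserialise(blob: str) -> CheckpointState:
    """Rebuild the state, rejecting truncated or corrupted checkpoints."""
    try:
        envelope = json.loads(blob)
        body = envelope["body"]
        expected = envelope["sha256"]
    except (ValueError, KeyError, TypeError) as exc:
        raise ValueError("checkpoint is incomplete or malformed") from exc
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != expected:
        raise ValueError("checkpoint digest mismatch")
    return CheckpointState(**json.loads(body))
```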
Checkpoint Store Design
The checkpoint store must provide: durability (survives process and node failures), low write latency (ideally ≤10ms per checkpoint write to not dominate iteration latency), and support for conditional writes (CAS — compare-and-swap — to prevent concurrent checkpoint writes from two instances of the same task). Redis with AOF persistence, DynamoDB with conditional writes, or Azure Cosmos DB with optimistic concurrency are appropriate. For highest durability requirements, a write-ahead log pattern (checkpoint written to durable log first, then to fast store) provides strong guarantees.
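The conditional-write requirement can be sketched with an in-memory stand-in. The class below is a toy, not durable, and its names are hypothetical; a real deployment would map `write_if_version` onto the store's own conditional-write primitive (for example a DynamoDB condition expression or a Cosmos DB ETag check).

```python
import threading


class VersionConflict(Exception):
    """Raised when another writer has already advanced the checkpoint."""


class InMemoryCheckpointStore:
    """Toy stand-in for a store that supports compare-and-swap writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}  # task_id -> (version, payload)

    def read(self, task_id):
        """Return (version, payload); version 0 means no checkpoint exists."""
        with self._lock:
            return self._records.get(task_id, (0, None))

    def write_if_version(self, task_id, expected_version, payload):
        """Write only if the stored version still equals expected_version."""
        with self._lock:
            current_version, _ = self._records.get(task_id, (0, None))
            if current_version != expected_version:
                raise VersionConflict(
                    f"expected v{expected_version}, found v{current_version}"
                )
            new_version = expected_version + 1
            self._records[task_id] = (new_version, payload)
            return new_version
```

If two instances of the same task both read version 1 and both try to write, only the first write succeeds; the second raises `VersionConflict` and that instance should abort.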
Idempotency Keys
Every tool call is issued with a unique idempotency key: a UUID generated when the call is first planned, stored in the checkpoint state before the call is sent, and reused on replay. External APIs that support idempotency keys (for example, Stripe's Idempotency-Key request header) deduplicate re-submissions that carry the same key, returning the original response rather than executing the operation again. For APIs that do not natively support idempotency keys, the checkpoint includes the tool result, and the recovery protocol skips re-calling the tool entirely, returning the stored result. This is the "result cache" pattern for recovery.
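The key-reuse and result-cache behaviour might look like the sketch below. `IdempotencyManager`, the `call_id` naming scheme, and the `idempotency_key` keyword passed to the tool are assumptions made for illustration; the checkpoint write that should happen after the key is minted is indicated only by a comment.

```python
import uuid


class IdempotencyManager:
    """Tracks one idempotency key and, once known, one result per logical call."""

    def __init__(self, tool_history=None):
        # call_id -> {"key": str, "done": bool, "result": object}
        # In the pattern this mapping lives inside the checkpoint state.
        self.tool_history = dict(tool_history or {})

    def key_for(self, call_id: str) -> str:
        """Return the stored key for this call, minting one on first sight."""
        if call_id not in self.tool_history:
            self.tool_history[call_id] = {
                "key": str(uuid.uuid4()), "done": False, "result": None,
            }
        return self.tool_history[call_id]["key"]

    def execute(self, call_id: str, tool_fn, *args):
        """Call tool_fn unless a result for call_id is already recorded."""
        key = self.key_for(call_id)  # a real loop would checkpoint here, before the call
        entry = self.tool_history[call_id]
        if entry["done"]:
            return entry["result"]   # result cache: skip the external call entirely
        result = tool_fn(*args, idempotency_key=key)
        entry.update(done=True, result=result)
        return result
```

A manager rebuilt from a restored `tool_history` returns the cached result for completed calls and reuses the original key for any call that had been planned but not completed.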
Recovery Protocol
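On task startup, the agent checks the checkpoint store for an existing checkpoint for the task_id. If found, it loads the checkpoint state, reconstructs the context (injecting the stored tool call history and partial results), and resumes from the iteration after the last checkpointed iteration. Tool calls in the history are marked as complete and their results are returned from the checkpoint rather than re-executed. A call that was in flight when the failure occurred is re-issued under its stored idempotency key, so an external system that honours the key applies it only once. Together these give non-idempotent calls effectively-once semantics; for an API without idempotency support, a call that executed but was not yet checkpointed remains the one window in which a duplicate is possible.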
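The resume logic can be sketched as follows. This is a simplification under stated assumptions: a fixed list of steps stands in for the LLM-driven loop, a plain dict stands in for the checkpoint store, and `run_task` and `execute_step` are hypothetical names.

```python
def run_task(task_id, store, plan, execute_step):
    """Run the plan, resuming after the last checkpointed iteration if one exists."""
    state = store.get(task_id) or {"iteration": 0, "results": [], "status": "running"}
    for index in range(state["iteration"], len(plan)):
        result = execute_step(plan[index])
        # Checkpoint after every iteration so at most one iteration is repeated.
        state = {
            "iteration": index + 1,
            "results": state["results"] + [result],
            "status": "running",
        }
        store[task_id] = state
    state = dict(state, status="complete")
    store[task_id] = state
    return state["results"]
```

Calling `run_task` again with the same `task_id` after a crash skips every step already recorded in the checkpoint and re-runs only the step that was in flight.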
Human Pause and Resume
The checkpoint mechanism naturally supports human-controlled pause and resume. When a human sends a pause signal (via the management API), the Pause Controller sets a pause flag in the task state. The Termination Controller checks this flag at each iteration boundary and, when set, writes a checkpoint with status: paused and stops execution. The task remains in the checkpoint store, frozen in time. When the human sends a resume signal, the agent restarts from the paused checkpoint exactly as if recovering from a failure — with all prior context intact. This enables safe human review of partial results and context injection before resumption.
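A minimal sketch of the pause check at the iteration boundary is shown here. The state is returned rather than written to a store to keep the example short, and `run_until_paused`, `pause_requested`, and the `operator_note` field are hypothetical names.

```python
def run_until_paused(state, steps, pause_requested):
    """Advance the task, checking the pause flag at each iteration boundary."""
    state = dict(state, status="running")
    while state["iteration"] < len(steps):
        if pause_requested():
            # In the pattern this state is written as a checkpoint before stopping.
            return dict(state, status="paused")
        output = steps[state["iteration"]](state)
        state = dict(
            state,
            iteration=state["iteration"] + 1,
            outputs=state["outputs"] + [output],
        )
    return dict(state, status="complete")
```

Resuming is the same call with the paused state as input; any context the operator wants to inject can be merged into that state first, for example `dict(paused, operator_note="approved")`.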
Workflow Orchestration Integration
For complex multi-stage agent tasks (where the agent itself is a step in a larger workflow), this pattern integrates with workflow orchestration engines. Temporal and Azure Durable Functions provide built-in state persistence and replay-safe execution semantics at the workflow level. In this mode, the agent loop is implemented as a Temporal Workflow or a Durable Functions orchestration, with LLM and tool calls run as activities so that the orchestration code stays deterministic, and the platform handles checkpointing automatically via its event-sourced execution model. This is the preferred implementation for tasks that compose multiple agent instances or require saga-style compensation logic.
5. Architecture Diagram
flowchart TD
subgraph Input["Input Layer"]
A[Task Request]
B[Human Control API]
end
subgraph Core["Agent Execution Core"]
C[Task Initialiser]
D{Checkpoint Exists?}
E[Agent Loop]
F[Idempotency Manager]
end
subgraph Storage["State Storage"]
G[(Checkpoint Store)]
H[(Audit Log)]
end
subgraph Output["Output Layer"]
I[Final Output]
J[Paused State]
end
A --> C
B -->|pause/resume| G
C --> D
D -->|found| E
D -->|not found| E
E --> F
F -->|cached result| E
F -->|new call + save| G
G -->|restore state| E
E -->|complete| I
E -->|paused| J
J -->|resume| E
E --> H
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#dbeafe,stroke:#3b82f6
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#f3e8ff,stroke:#a855f7
style E fill:#f0fdf4,stroke:#22c55e
style F fill:#f0fdf4,stroke:#22c55e
style G fill:#fef9c3,stroke:#eab308
style H fill:#fef9c3,stroke:#eab308
style I fill:#d1fae5,stroke:#10b981
style J fill:#fee2e2,stroke:#ef4444
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Checkpoint Store | Durable State Store | Stores serialised task state per iteration; supports CAS writes; low latency | Redis (AOF) with WATCH/MULTI/EXEC or Lua compare-and-set, DynamoDB (condition expressions), Azure Cosmos DB (ETag CAS) | Critical |
| Task Initialiser | Orchestration | Checks for existing checkpoint; routes to restore or fresh execution | Custom; part of agent framework | Critical |
| State Restorer | Orchestration | Loads checkpoint; reconstructs context; marks completed tool calls | Custom; integrated into agent loop | Critical |
| Idempotency Key Manager | Reliability | Generates and stores UUID idempotency keys per tool call; retrieves keys on replay | Custom; UUID v4 generation; stored in checkpoint state | Critical |
| Checkpoint Writer | Persistence | Atomically serialises and writes state after each iteration | Custom; Redis WATCH/MULTI/EXEC transaction or Lua script (SETNX alone only guards first creation); DynamoDB PutItem with condition expression | Critical |
| Pause Controller | Human Control | Sets pause flag in checkpoint state on human signal; ensures clean checkpoint before stop | Custom management API | High |
| Retry Controller | Reliability | Implements exponential backoff retry policy on transient failures; respects max retry limit | Custom; Temporal retry policy; AWS Step Functions | High |
| Management API | Operations | Exposes pause, resume, cancel, and status endpoints for human operators | REST API; FastAPI, Express, Azure Functions | High |
| Task Status Viewer | Operations | Reads checkpoint store to display current task state, progress, and cost | Dashboard UI; custom + Grafana | Medium |
| Temporal / Durable Functions Engine | Workflow Orchestration | Provides event-sourced durable execution natively (optional but recommended) | Temporal OSS, Temporal Cloud, Azure Durable Functions, AWS Step Functions | High (if used) |
| Audit Log | Compliance | Records checkpoint writes, restores, pauses, and resumes with timestamps | WORM store: S3 Object Lock, Azure Immutable Blob | Critical |
7. Data Flow
Fresh Execution with Checkpointing
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Calling System | Submits task with unique task_id | Task queued |
| 2 | Task Initialiser | Queries checkpoint store for task_id | No checkpoint found |
| 3 | Agent Loop | Executes iteration 1: context assembly → plan → tool call → result | Iteration 1 result |
| 4 | Idempotency Manager | Generates UUID idempotency key for tool call; stores in pending state | Keyed tool call record |
| 5 | Checkpoint Writer | Atomically writes state: {task_id, iteration: 1, tool_history: [{tool_id, idempotency_key, result}], partial_output, token_count} | Checkpoint record v1 |
| 6 | Agent Loop | Continues for iterations 2..N | Checkpoint written after each iteration |
| 7 | Termination | Task completes; final output returned; checkpoint marked status: complete | Final output |
Recovery from Mid-Task Failure
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Retry Controller | Detects failure; initiates recovery after backoff | Recovery signal |
| 2 | Task Initialiser | Queries checkpoint store for task_id | Checkpoint found at iteration K |
| 3 | State Restorer | Loads checkpoint; reconstructs context with full tool call history up to iteration K | Restored context |
| 4 | Idempotency Manager | For each tool call in history with a stored result: marks as complete, returns cached result | Cached results injected |
| 5 | Agent Loop | Resumes from iteration K+1; LLM receives full context including all prior tool results | Execution continues |
| 6 | External API | If iteration K+1 tool call has idempotency key already stored (sent before failure): API returns original response | No duplicate side effect |
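The backoff in step 1 can be sketched as below. `TransientError`, the default delays, and the injectable `sleep` parameter are assumptions for illustration; production retry policies usually add jitter, which is omitted here to keep the delays deterministic.

```python
import time


class TransientError(Exception):
    """Stand-in for failures worth retrying (timeouts, throttling, and so on)."""


def retry_with_backoff(operation, max_retries=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call operation(), doubling the wait after each transient failure."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the failure
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

With the defaults, the waits before successive retries are 1, 2, 4, 8, and 16 seconds, capped at `max_delay`.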
Error Flow
| Error | Detection | Recovery |
|---|---|---|
| Checkpoint write failure (store unavailable) | Write exception; circuit breaker | Retry write with backoff; if checkpoint store unavailable for > threshold, abort task cleanly; alert |
| Checkpoint CAS failure (concurrent write) | Conditional write rejection | Indicates duplicate execution; one instance wins; other aborts; coordination via distributed lock |
| Idempotency key not accepted by external API | HTTP 422 / API-specific error | Log; attempt with new key if API behaviour permits; escalate if duplicate detected |
| State deserialisation failure on restore | Schema version mismatch | Versioned state schema; migration function for minor versions; fresh execution if major version mismatch |
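The schema-mismatch row can be sketched as follows, assuming version strings of the form "major.minor". The version numbers, the `recovery_attempts` field added by the example migration, and the function names are all hypothetical.

```python
CURRENT_VERSION = "2.1"  # hypothetical schema version for this sketch


def _migrate_2_0_to_2_1(state: dict) -> dict:
    """Example minor migration: add a field introduced in 2.1."""
    migrated = dict(state)
    migrated.setdefault("recovery_attempts", 0)
    migrated["schema_version"] = "2.1"
    return migrated


MINOR_MIGRATIONS = {"2.0": _migrate_2_0_to_2_1}


def restore_or_none(state: dict):
    """Return a state at CURRENT_VERSION, or None if a fresh start is needed."""
    current_major = CURRENT_VERSION.split(".")[0]
    while state["schema_version"] != CURRENT_VERSION:
        version = state["schema_version"]
        if version.split(".")[0] != current_major:
            return None  # major mismatch: fall back to fresh execution
        migrate = MINOR_MIGRATIONS.get(version)
        if migrate is None:
            return None  # no known migration path: treat as unrecoverable
        state = migrate(state)
    return state
```

A `None` result signals the Task Initialiser to start fresh, which per the failure-mode guidance should be accompanied by an audit for duplicate side effects.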
8. Security Considerations
Checkpoint State Protection
- Checkpoint state may contain sensitive intermediate tool results (customer data, partial financial records); it must be encrypted at rest with CMK
- Checkpoint store access is restricted to the agent service identity; human operators can view task status via the management API but cannot read raw checkpoint state without elevated access
- Checkpoint states are automatically expired after the task retention period; no indefinite accumulation of sensitive data
Idempotency Key Exposure
- Idempotency keys must be treated as sensitive: if an attacker can obtain the key for a payment API call, they could potentially replay or probe the API
- Keys are stored in the encrypted checkpoint state, not in logs or observable metadata
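One way to keep keys out of logs is a logging filter that masks anything shaped like a UUID. This is a blunt sketch under that assumption: it also masks other UUIDs (task IDs, for instance), and the filter name is hypothetical.

```python
import logging
import re

UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


class RedactIdempotencyKeys(logging.Filter):
    """Masks UUID-shaped substrings in log messages before they are emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message first so keys passed as %-style args are covered too.
        record.msg = UUID_RE.sub("[redacted-key]", record.getMessage())
        record.args = ()
        return True  # keep the record, now redacted
```

The filter would be attached with `logger.addFilter(RedactIdempotencyKeys())` on the agent's logger or its handlers.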
OWASP LLM Top 10
| OWASP LLM Risk | Checkpoint Relevance | Mitigation |
|---|---|---|
| LLM08 Excessive Agency | Recovery replay could re-execute a previously blocked action if policy changed after checkpoint | Policy check is re-evaluated at each iteration after restore, not skipped based on checkpoint history |
| LLM01 Prompt Injection | Checkpoint state could contain injected content from a compromised tool result | Content validation applied to restored context before injection into LLM prompt |
| LLM06 Sensitive Information Disclosure | Checkpoint state contains intermediate sensitive data | Encryption at rest; access controls on checkpoint store; expiry policy |
9. Governance Considerations
Audit Trail
- Every checkpoint write, restore, pause, resume, and cancel event is recorded in the immutable audit log
- The audit trail enables reconstruction of a complete task execution timeline, including any human interventions
- For regulated tasks (financial, clinical), the checkpoint audit trail is a material compliance artefact
Human Override Records
- Every pause, resume, and cancel action through the management API is recorded with the human operator's identity, timestamp, and justification
- Context injected at resume points (human feedback, corrected data) is appended to the task audit trail
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Task Execution Audit Trail | Platform Engineering | Per task | Complete timeline of execution, checkpoints, human interventions |
| Recovery Incident Log | Operations | Per recovery event | Records failure, recovery attempt, outcome, and any re-execution anomalies |
| Idempotency Violation Report | Operations | Monthly | Documents any detected duplicate side effects; root cause analysis |
| Checkpoint Store Capacity Report | Platform Engineering | Monthly | Storage growth, TTL expiry rates, capacity planning |
10. Operational Considerations
SLOs
| SLO | Target | Window | Alert |
|---|---|---|---|
| Checkpoint write latency | ≤ 15ms p95 | 1-hour rolling | > 50ms triggers P2 |
| Recovery time (from failure detection to resumed execution) | ≤ 60 seconds | Per event | > 5 minutes triggers P1 |
| Task completion rate (including recovered tasks) | ≥ 98% | 24-hour rolling | < 95% triggers P2 |
| Checkpoint store availability | 99.99% | Monthly | Any degradation triggers P1 |
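As an illustration of the first SLO row, the sketch below evaluates a window of checkpoint write latencies against the 15ms target and 50ms alert threshold. The nearest-rank percentile method and the status labels are assumptions; a monitoring stack would normally compute the percentile itself.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def checkpoint_latency_status(samples_ms, target_ms=15, alert_ms=50):
    """Classify a window of write latencies against the SLO table thresholds."""
    value = p95(samples_ms)
    if value > alert_ms:
        return "P2"        # alert threshold breached
    if value > target_ms:
        return "slo_miss"  # over target but below the paging threshold
    return "ok"
```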
Monitoring
- Checkpoint write success/failure rate per task_id
- Task recovery event rate — spike indicates infrastructure instability
- Average iterations per task: a significant increase may indicate an agent stuck in recovery loops
- Cost per task with recovery events — quantifies the financial impact of failures
DR and Capacity
| DR Tier | Checkpoint Store Config | RTO | RPO |
|---|---|---|---|
| Standard | Redis with AOF + daily snapshot to S3 | 5 min | 1 iteration (1 checkpoint interval) |
| Enhanced | DynamoDB Global Tables (multi-region) | < 1 min | Near-zero (cross-region replication is asynchronous by default, typically within seconds) |
| Premium | Temporal Cloud (fully managed) | < 30 sec | Zero (event-sourced; no data loss) |
11. Cost Considerations
Cost Drivers
| Cost Driver | Scaling Behaviour | Control Lever |
|---|---|---|
| Checkpoint store writes | Linear with tasks × iterations per task | Checkpoint every N iterations instead of every 1 (trade recovery granularity for cost) |
| Checkpoint state storage | Linear with concurrent tasks × state size × retention period | Compress checkpoint state; aggressive TTL on completed tasks |
| Recovery LLM calls | Proportional to failure rate × iterations replayed | Improve infrastructure reliability to reduce failure rate; idempotent cached results avoid re-inference |
Indicative Cost Impact
| Scenario | Cost Impact vs. No Checkpoint |
|---|---|
| 0% failure rate, 100% completion | +2–5% (checkpoint write overhead only) |
| 5% failure rate, recovery from checkpoint | Break-even to 10% savings (avoid full re-execution cost) |
| 10% failure rate without recovery | 2–3× effective cost (full re-execution + risk of duplicate side effects) |
12. Trade-Off Analysis
Checkpointing Strategy Options
| Option | Granularity | Write Overhead | Recovery Granularity | Best For |
|---|---|---|---|---|
| A: Per-Iteration (Recommended) | Every loop iteration | Low (Redis ≤15ms) | Maximum: at most 1 iteration lost | Tasks with expensive or side-effectful iterations |
| B: Per-N-Iterations | Every N iterations | Proportionally lower | At most N iterations lost | Short, cheap iterations where overhead matters |
| C: Workflow Engine Native | Platform-managed (Temporal/Durable Functions) | Platform overhead | No completed work lost (event-sourced replay) | Complex multi-stage workflows; regulated workloads |
| D: No Checkpointing (Idempotent Tasks Only) | N/A (full re-execution) | Zero | Full re-execution from start | Fully idempotent tasks where re-execution is safe and cheap |
Architectural Tensions
| Tension | Left Pole | Right Pole | Balance |
|---|---|---|---|
| Recovery granularity vs. Checkpoint overhead | Checkpoint after every action (microseconds apart) | Checkpoint only at major milestones | Per-iteration checkpointing is the practical optimum |
| Idempotency key lifespan vs. Storage cost | Keep keys indefinitely | Expire after task completion | Expire with task retention period; dedup window for external APIs is typically 24h |
| Human pause flexibility vs. Task complexity | Allow pause at any iteration | Only allow pause at defined milestones | Pause at any iteration for safety; display milestone progress for human context |
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Checkpoint store failure during write | Low | Medium: last iteration lost; replay from prior checkpoint | Write exception + circuit breaker | Recover from last successful checkpoint; at most 1 iteration re-executed |
| Duplicate task execution (two instances race) | Low | High: duplicate side effects if not protected by idempotency keys | CAS conflict on checkpoint write | One instance wins CAS; other detects conflict and aborts |
| Checkpoint state corruption | Very Low | High: task cannot recover; must restart from scratch | Deserialisation failure | Versioned schema migration; if unrecoverable, fresh start with duplicate side effect audit |
| Idempotency key rejected by external API | Low | Medium: tool call may not complete | API error code | Log and investigate; if API does not support idempotency, implement result cache in checkpoint |
| Recovery loop (agent stuck, keeps recovering) | Low | High: cost overrun | Recovery attempt count in checkpoint; alert after N recoveries | Kill task after max recovery attempts; alert; human review |
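The recovery-loop guard in the last row can be sketched as a counter carried in the checkpoint state. The limit of 3, the field name `recovery_attempts`, and the status strings are assumptions for illustration.

```python
MAX_RECOVERY_ATTEMPTS = 3  # hypothetical limit; tune per task class


def begin_recovery(state: dict, max_attempts: int = MAX_RECOVERY_ATTEMPTS) -> dict:
    """Count this recovery and decide whether the task may resume."""
    attempts = state.get("recovery_attempts", 0) + 1
    if attempts > max_attempts:
        status = "killed_needs_review"  # stop, alert, and hand over to a human
    else:
        status = "recovering"
    return dict(state, recovery_attempts=attempts, status=status)
```

The returned state would be checkpointed before execution resumes, so the count survives a further crash.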
14. Regulatory Considerations
APRA CPS 230 (Operational Resilience)
- Critical operations must have documented tolerance levels for outage time and data loss (commonly expressed as RTO/RPO); this pattern enables RTO measured in seconds/minutes for agent-powered services
- Recovery testing is required; the checkpoint restore path must be exercised in DR testing
ISO 22301 (Business Continuity)
- Agent checkpoint/recovery maps to §8.4 (Business Continuity Plans); the restore protocol is the BCM procedure for agent task failure
EU AI Act
- Art. 9 (Risk Management): the ability to safely pause and inspect a running agent implements a key risk management control
- Art. 14 (Human Oversight): the pause/resume mechanism is a direct implementation of the human oversight requirement — humans can halt agent execution at any point without losing task progress
NIST AI RMF
- MANAGE 4.1: The recovery protocol and idempotency design implement the incident response and recovery management requirement
15. Reference Implementations
AWS
| Component | Service |
|---|---|
| Checkpoint Store | Amazon DynamoDB (conditional PutItem; strong consistency) |
| Workflow Engine | AWS Step Functions (built-in state persistence) |
| Management API | AWS API Gateway + Lambda |
| Audit Log | AWS CloudTrail + S3 Object Lock |
Azure
| Component | Service |
|---|---|
| Checkpoint Store | Azure Cosmos DB (ETag-based optimistic concurrency) |
| Workflow Engine | Azure Durable Functions (event-sourced; built-in checkpointing) |
| Management API | Azure Functions + API Management |
GCP
| Component | Service |
|---|---|
| Checkpoint Store | Cloud Spanner (strong consistency; CAS transactions) |
| Workflow Engine | Cloud Workflows + Cloud Run |
| Management API | Cloud Run API endpoint |
On-Premises
| Component | Technology |
|---|---|
| Checkpoint Store | Redis 7+ with AOF persistence (fsync always for highest durability) |
| Workflow Engine | Temporal OSS (self-hosted on Kubernetes) |
| Management API | FastAPI on Kubernetes |
16. Related Patterns
| Pattern | ID | Relationship Type | Notes |
|---|---|---|---|
| Single Agent Pattern | EAAPL-AGT001 | Extends | Checkpoint adds durable state to the base agent loop |
| Stateful Agent Memory | EAAPL-AGT002 | Integrates With | Checkpoint references memory record IDs; restore reloads referenced memories |
| Long-Running Agent | EAAPL-AGT007 | Depends On | Long-running agent requires checkpointing as a foundational capability |
| Human-in-the-Loop Agent | EAAPL-MAG003 | Integrates With | Pause/resume mechanism is the implementation enabler for HITL gates mid-task |
| Agent Handoff Protocol | EAAPL-MAG006 | Peer | Handoff payload includes checkpoint state for seamless context transfer between agents |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Evidence |
|---|---|---|
| Checkpointing Technology Maturity | 5 | Redis, DynamoDB, Cosmos DB battle-tested at hyperscale; Temporal/Durable Functions proven |
| Idempotency Pattern Maturity | 5 | Industry-standard pattern (Stripe, Twilio, etc.) with established implementation guidance |
| AI-Agent-Specific Integration | 3 | Agent framework integration (LangGraph, Temporal) maturing; some custom implementation still required |
| Human Pause/Resume UX | 3 | Management API well-defined; UI tooling for operator visibility still developing |
| Regulatory Evidence | 4 | CPS 230 and ISO 22301 mapping well-established; audit trail design proven |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-03-15 | Architecture Board | Initial publication |
| 1.1 | 2024-06-01 | Platform Engineering | Added Temporal integration option; idempotency key management detail |
| 1.2 | 2024-09-15 | Reliability Engineering | Added DR tiers; SLO table; recovery loop detection failure mode |
| 1.3 | 2025-01-05 | Architecture Board | Added EU AI Act Art. 14 mapping; pause/resume governance artefacts |
| 1.4 | 2025-04-20 | Reliability Engineering | Added CAS conflict resolution; Cosmos DB ETag implementation reference |