[EAAPL-AGT007] Long-Running Agent
Category: Agentic AI
Sub-category: Async Execution Architecture
Version: 1.2
Maturity: Proven
Tags: long-running, async, task-queue, heartbeat, cost-budget, partial-results, deadline-management, human-checkin
Regulatory Relevance: APRA CPS 230 (Operational Resilience), ISO 22301, NIST AI RMF (MANAGE 4.1), EU AI Act (Art. 9, 14)
1. Executive Summary
The Long-Running Agent Pattern defines the architecture for AI agents that execute tasks over hours or days — due diligence analysis, large codebase refactoring, enterprise-wide data reconciliation, or extended research synthesis. These tasks cannot fit within the synchronous request-response paradigm: calling systems cannot hold a connection open for hours, LLM context windows cannot hold 48 hours of tool results, and cost controls require active monitoring rather than post-hoc billing surprises.
For CIO/CTO audiences: this pattern transforms AI agents from interactive request-responders into asynchronous workforce members — entities you assign a task to on Monday morning and receive a deliverable from by Friday, with status updates throughout and the ability to pause or redirect them at any point. It defines how to decompose multi-day tasks into manageable segments, how to monitor and control running costs, how to ensure partial results are safely preserved if the task is interrupted, and how to maintain human oversight over extended autonomous operation. The resulting architecture is what separates a toy AI demo from a production AI workforce capability.
2. Problem Statement
Business Problem
High-value knowledge work tasks take hours or days. A due diligence review of 500 contracts, a codebase-wide security audit, or a multi-source research synthesis cannot complete in seconds. If AI agents are restricted to short tasks, the most valuable automation opportunities remain out of reach.
Technical Problem
Synchronous agent execution (the HTTP request/response model) breaks down for long tasks: connections time out, the LLM context window fills, token cost becomes unpredictable, and there is no point at which to inject a human checkpoint. Context window exhaustion is particularly severe on multi-hour tasks: a 100K-token context window fills after 60–100 tool calls with moderate result sizes.
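The 60–100 call figure can be checked with simple arithmetic. The sketch below is illustrative only: the function names and the `reserved_tokens` parameter (space held back for the system prompt and task objective) are assumptions, not part of the pattern.

```python
CONTEXT_WINDOW_TOKENS = 100_000  # window size quoted in the text


def tokens_per_call(window_tokens: int, calls_until_full: int) -> float:
    """Average tokens each tool call may add if the window fills after N calls."""
    return window_tokens / calls_until_full


def calls_until_full(window_tokens: int, avg_tokens_per_call: int,
                     reserved_tokens: int = 0) -> int:
    """Whole tool calls that fit before the window is exhausted."""
    return (window_tokens - reserved_tokens) // avg_tokens_per_call


# 60-100 calls corresponds to roughly 1,000-1,667 tokens per call.
low = tokens_per_call(CONTEXT_WINDOW_TOKENS, 100)
high = tokens_per_call(CONTEXT_WINDOW_TOKENS, 60)
fit = calls_until_full(CONTEXT_WINDOW_TOKENS, 1_500)
```

Under these assumptions, "moderate result sizes" means each call (request plus result) adds on the order of 1,000 to 1,700 tokens.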
Symptoms of Absence
- Tasks taking longer than 30 minutes are decomposed manually by humans into shorter subtasks, negating automation benefits
- Cost surprises: a long agent task consumes 10–50× the anticipated token budget with no warning
- Partial work is lost when infrastructure restarts or LLM provider timeouts occur at hour 3 of a 5-hour task
- No mechanism for human course-correction once a long task is launched
Cost of Inaction
- High-value automation opportunities (due diligence, audit, research) remain manual
- Ad hoc workarounds (manually splitting tasks) create brittle processes that fail when task sizes vary
- Infrastructure teams field escalations about unexplained high AI inference costs from long tasks without budget controls
3. Context
When to Apply
- Expected task duration is > 30 minutes
- Task involves processing a large corpus (hundreds of documents, thousands of records)
- Human review or approval at intermediate milestones is required
- Cost predictability and budget control are required
- Partial results have value (delivering results incrementally is better than delivering nothing if the task is interrupted)
When NOT to Apply
- Tasks that complete in < 5 minutes (async overhead not justified)
- Tasks that require a synchronous response in the same user session
- Tasks with no natural decomposition into independently useful segments
Prerequisites
- EAAPL-AGT005 (Checkpoint and Recovery) — mandatory for multi-hour tasks
- Durable task queue with dead-letter handling
- Async notification infrastructure (webhooks, event bus, push notifications)
- Cost monitoring and kill switch capability
- Human management API (pause, redirect, cancel)
Industry Applicability
| Industry | Long-Running Task | Duration | Human Check-in Frequency |
|---|---|---|---|
| Legal / M&A | Due diligence (500+ documents) | 4–24 hours | At task creation, 50% progress, completion |
| Financial Services | Regulatory report generation, reconciliation | 2–12 hours | At key milestones; anomaly-triggered |
| Technology | Large codebase security audit, refactoring | 4–48 hours | At phase boundaries |
| Healthcare | Multi-source patient cohort analysis | 2–8 hours | At each data source completion |
| Research | Literature synthesis, competitive analysis | 8–72 hours | Daily check-in |
4. Architecture Overview
The Long-Running Agent Pattern addresses four fundamental challenges of extended autonomous execution: context window management, task decomposition and progress tracking, cost budget enforcement, and human oversight at meaningful checkpoints.
Task Decomposition and Segment Orchestration
A long task is decomposed by the Task Planner into an ordered sequence of segments — bounded sub-tasks each of which can complete within the single-agent pattern's standard execution model (typically < 30 minutes, < 50K tokens). The segment plan is stored durably at task creation and is the master execution schedule. Each segment produces a partial result that is stored in the Partial Result Store. If the task is interrupted, the segment plan acts as the recovery map: completed segments are skipped; the next incomplete segment is resumed.
The segment plan is not a rigid pre-specified plan. The Task Planner can be queried to revise the remaining segment plan based on discoveries made in early segments (adaptive planning). For example, if segment 3 discovers that 200 additional documents need to be reviewed, the plan is revised to add segments 3a–3n before segment 4.
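As a minimal sketch of how a segment plan can serve as both recovery map and revisable schedule, consider the following. The class and method names (`Segment`, `SegmentPlan`, `next_incomplete`, `insert_before`) are hypothetical, and the in-memory list stands in for a durable Task Plan Store.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    segment_id: str
    sub_objective: str
    done: bool = False


@dataclass
class SegmentPlan:
    segments: List[Segment] = field(default_factory=list)

    def next_incomplete(self) -> Optional[Segment]:
        """Recovery map: skip completed segments, resume at the first open one."""
        for seg in self.segments:
            if not seg.done:
                return seg
        return None

    def insert_before(self, segment_id: str, new_segments: List[Segment]) -> None:
        """Adaptive planning: splice extra segments in ahead of an existing one."""
        for idx, seg in enumerate(self.segments):
            if seg.segment_id == segment_id:
                self.segments[idx:idx] = new_segments
                return
        raise KeyError(f"unknown segment_id: {segment_id}")


plan = SegmentPlan([Segment(str(n), f"review batch {n}") for n in range(1, 5)])
for seg in plan.segments[:3]:
    seg.done = True  # segments 1-3 finished before the discovery

# Segment 3 found extra documents: add 3a and 3b before segment 4.
plan.insert_before("4", [Segment("3a", "extra docs A"), Segment("3b", "extra docs B")])
resume_at = plan.next_incomplete()
```

After the revision, execution resumes at segment 3a rather than 4, and a restart at any point would land on the same segment.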
Context Window Management Across Segments
Each segment executes in a fresh context window. The context for segment N includes: the original task objective, a summary of results from segments 1 through N-1 (produced by the Context Summariser component), the current segment's specific sub-objective, and the relevant tools. The summary is a lossy compression of prior results — the Task Planner specifies what information must be preserved across segment boundaries in the task plan.
This approach solves context exhaustion by design: no single segment accumulates more context than the window can hold. The cost is that inter-segment reasoning is mediated through the summary, which may lose nuance. For tasks that require tight consistency across many segments (e.g., a legal review where clause 400 must reference clause 12), the Task Planner must preserve the critical cross-references in the carry-forward summary.
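The context assembly described above can be sketched as follows. This is illustrative only: `summarise` is a stand-in for the Context Summariser (a real implementation would call an LLM), and the `must_preserve` list is an assumed way for the Task Planner to pin cross-references such as "clause 12".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentContext:
    task_objective: str
    carry_forward_summary: str
    sub_objective: str
    tool_names: List[str]


def summarise(prior_results: List[str], must_preserve: List[str],
              max_chars: int = 500) -> str:
    """Lossy compression of prior results; pinned items are copied verbatim."""
    pinned = [r for r in prior_results if any(key in r for key in must_preserve)]
    rest = [r for r in prior_results if r not in pinned]
    compressed = " ".join(rest)[:max_chars]  # crude truncation, not a real summary
    return "\n".join(["PINNED:"] + pinned + ["SUMMARY:", compressed])


def build_context(objective: str, prior_results: List[str], sub_objective: str,
                  tools: List[str], must_preserve: List[str]) -> SegmentContext:
    """Fresh context for segment N: objective, summary of 1..N-1, sub-objective, tools."""
    summary = summarise(prior_results, must_preserve)
    return SegmentContext(objective, summary, sub_objective, tools)


prior = ["clause 12 defines the Liability Cap", "routine boilerplate " * 100]
ctx = build_context("Review supplier contracts", prior,
                    "Review clauses 390-410", ["doc_search"], ["clause 12"])
```

The pinned section is what keeps a later segment (clause 400) able to reference an early finding (clause 12) even though everything else has been compressed.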
Heartbeat and Progress Monitoring
The long-running agent emits a heartbeat event to the monitoring system at the completion of each segment and at configurable intervals within a segment. The heartbeat includes: current segment number, total segments estimated, cost consumed so far, cost projected to completion (cost consumed so far plus average cost per completed segment × remaining segments), elapsed time, and an ETA for completion. The Heartbeat Monitor triggers alerts if heartbeat events are not received within the expected interval, which indicates a stuck or crashed agent.
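A minimal sketch of the heartbeat payload and the missed-heartbeat check follows. Several details are assumptions: the projected cost is taken as cost so far plus average segment cost times remaining segments, the ETA uses average elapsed time per completed segment, and the 2× tolerance mirrors the heartbeat SLO in Section 10. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    current_segment: int
    total_segments: int
    cost_so_far: float
    projected_total_cost: float
    elapsed_seconds: float
    eta_seconds: float


def build_heartbeat(completed: int, total: int, cost_so_far: float,
                    elapsed_seconds: float) -> Heartbeat:
    """Project cost and ETA from per-segment averages observed so far."""
    if completed <= 0:
        raise ValueError("need at least one completed segment to project")
    remaining = total - completed
    avg_cost = cost_so_far / completed
    avg_seconds = elapsed_seconds / completed
    return Heartbeat(completed, total, cost_so_far,
                     cost_so_far + avg_cost * remaining,
                     elapsed_seconds, avg_seconds * remaining)


def is_heartbeat_missed(last_heartbeat_ts: float, now_ts: float,
                        expected_interval_s: float, tolerance: float = 2.0) -> bool:
    """True when the silence exceeds tolerance x the expected interval."""
    return (now_ts - last_heartbeat_ts) > tolerance * expected_interval_s


hb = build_heartbeat(completed=4, total=10, cost_so_far=20.0, elapsed_seconds=3600.0)
```

With 4 of 10 segments done at a cost of 20.0 in one hour, the projection is a total cost of 50.0 and 5,400 seconds remaining.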
Human Check-in Points
The task plan defines human check-in points — typically at task creation (human reviews and approves the decomposition plan), at significant milestones (e.g., 50% completion), and at completion. At check-in points, the long-running agent pauses execution (using the checkpoint mechanism from EAAPL-AGT005), delivers the partial results and a progress summary to the human via a notification, and waits for human acknowledgment or instruction. The human can: approve and resume, redirect (modify the remaining segment plan), or cancel. This implements EU AI Act Art. 14 human oversight for high-risk long-running tasks.
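The three human responses at a check-in can be modelled as a small decision function. This is a sketch under assumptions: the `Decision` enum and `apply_decision` are hypothetical names, and segments are represented by their IDs only.

```python
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    APPROVE = "approve"
    REDIRECT = "redirect"
    CANCEL = "cancel"


def apply_decision(remaining: List[str], decision: Decision,
                   revised: Optional[List[str]] = None) -> List[str]:
    """Return the segment IDs to execute after the human responds."""
    if decision is Decision.APPROVE:
        return list(remaining)          # resume the plan unchanged
    if decision is Decision.REDIRECT:
        if revised is None:
            raise ValueError("redirect requires a revised segment plan")
        return list(revised)            # replace the remaining plan
    return []                           # cancel: nothing further runs


after_redirect = apply_decision(["5", "6", "7"], Decision.REDIRECT, ["5", "7"])
```

While the agent waits, it holds no compute: the checkpoint carries the state, and the returned list is what gets enqueued next.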
Cost Budget and Kill Switch
Before execution begins, the calling system specifies a cost budget (maximum token spend for the task). The Cost Controller monitors cumulative spend at each segment boundary. If projected cost-to-completion exceeds the budget, the Cost Controller pauses the task and notifies the human with the current partial results and a cost projection. The human can approve budget extension or accept the partial results. A hard kill switch (emergency stop) is available to humans at any time, delivering immediately available partial results and a clean task termination.
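The segment-boundary budget check can be sketched as below. The names (`BudgetAction`, `projected_total_cost`, `check_budget`) are hypothetical, and the projection assumes remaining segments cost the same on average as completed ones. Costs and budget are unit-agnostic (tokens or currency) as long as both use the same unit.

```python
from enum import Enum
from typing import Sequence


class BudgetAction(Enum):
    CONTINUE = "continue"
    PAUSE_FOR_HUMAN = "pause_for_human"


def projected_total_cost(segment_costs: Sequence[float], total_segments: int) -> float:
    """Spend so far plus average segment cost times segments still to run."""
    if not segment_costs:
        raise ValueError("need at least one completed segment to project")
    spent = sum(segment_costs)
    remaining = max(total_segments - len(segment_costs), 0)
    return spent + (spent / len(segment_costs)) * remaining


def check_budget(segment_costs: Sequence[float], total_segments: int,
                 budget: float) -> BudgetAction:
    """Pause for a human decision when the projection exceeds the budget."""
    if projected_total_cost(segment_costs, total_segments) > budget:
        return BudgetAction.PAUSE_FOR_HUMAN
    return BudgetAction.CONTINUE


projection = projected_total_cost([4.0, 6.0], total_segments=5)
```

Because the check runs on the projection rather than on actual spend, the pause happens while budget is still available to deliver partial results.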
Partial Result Delivery
Each completed segment's output is written to the Partial Result Store immediately upon completion. A Partial Result Aggregator compiles the running partial results into a human-consumable intermediate deliverable. The calling system can request partial results at any time via the management API, regardless of whether the task is still running. This enables progressive value delivery — a due diligence review that identifies 20 critical issues in the first 30% of documents is actionable immediately, before the full 500-document review completes.
5. Architecture Diagram
flowchart TD
subgraph Input["Task Initiation"]
A[Long Task Request]
B[Task Planner]
end
subgraph Execution["Async Execution Engine"]
C[Task Queue]
D[Segment Worker]
E[Cost Controller]
end
subgraph Storage["State and Results"]
F[(Checkpoint Store)]
G[(Partial Result Store)]
end
A --> B
B -->|segment plan + human approval| C
C --> D
D -->|checkpoint each segment| F
D -->|segment output| G
D --> E
E -->|over budget| B
F -->|recover on failure| D
G -->|final aggregation| A
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#f0fdf4,stroke:#22c55e
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#f3e8ff,stroke:#a855f7
style F fill:#fef9c3,stroke:#eab308
style G fill:#fef9c3,stroke:#eab308
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Task Planner | Orchestration / AI | Decomposes long task into segment plan; revises plan adaptively | LLM-based planner; rule-based decomposition for structured tasks | Critical |
| Task Plan Store | Persistence | Stores segment plan; tracks segment completion status | DynamoDB, PostgreSQL, Azure Cosmos DB | Critical |
| Task Queue | Message Queue | Durable queue for segment execution; dead-letter for failed segments | SQS, Azure Service Bus, Google Pub/Sub, Kafka | Critical |
| Segment Worker | Compute | Executes each segment as a standard agent loop (EAAPL-AGT001) | Containerised agent runtime; ECS, AKS, Cloud Run | Critical |
| Context Summariser | AI Component | Compresses prior segment results into carry-forward context | LLM summarisation; structured summary template per task type | High |
| Partial Result Store | Persistence | Stores completed segment outputs; supports partial result queries | PostgreSQL, S3 + DynamoDB index, Cosmos DB | High |
| Partial Result Aggregator | Orchestration | Compiles segment outputs into intermediate deliverable | Custom; LLM-assisted for natural language outputs | Medium |
| Heartbeat Emitter | Monitoring | Emits heartbeat events at configurable intervals | Custom; part of segment worker | High |
| Heartbeat Monitor | Monitoring | Detects missed heartbeats; triggers recovery | CloudWatch Alarms, Azure Monitor, custom | High |
| Cost Controller | Governance | Tracks cumulative cost; projects to completion; enforces budget ceiling | Custom + LLM provider usage APIs | Critical |
| Management API | Operations | Exposes pause, redirect, cancel, status, partial-result endpoints | REST API; API Gateway + Lambda/Functions | High |
| Human Check-in Queue | Human Oversight | Delivers milestone notifications to human approvers; collects decisions | Email, Slack, Teams, custom approval portal | High |
| Checkpoint Store | Recovery | Stores segment-level checkpoints (EAAPL-AGT005) | Redis, DynamoDB, Cosmos DB | Critical |
| Deadline Manager | SLA | Monitors task ETA vs. deadline; alerts if deadline at risk | Custom scheduler + ETA calculation | Medium |
7. Data Flow
Task Initiation
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Calling System | Submits long task: instruction, corpus reference, cost_budget, deadline, checkin_points | Task request |
| 2 | Task Planner | Analyses task; decomposes into N segments; assigns cost estimate per segment; identifies checkin milestones | Segment plan: [{segment_id, sub_objective, input_scope, estimated_cost, checkin: bool}] |
| 3 | Human Check-in | Delivers plan to human for review; awaits approval | Approved / Modified plan |
| 4 | Task Queue | Enqueues segment 1 for execution | Segment 1 in queue |
Segment Execution
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Segment Worker | Dequeues segment N; loads carry-forward context from Context Summariser | Assembled context |
| 2 | Agent Loop | Executes standard agent loop for segment N scope | Segment N result |
| 3 | Partial Result Store | Writes segment N result | Partial result record |
| 4 | Cost Controller | Updates cumulative cost; projects remaining cost | Cost status |
| 5 | Heartbeat Emitter | Emits segment completion heartbeat | Heartbeat event |
| 6 | Checkpoint | Writes segment N checkpoint | Recovery state |
| 7 | Context Summariser | Produces carry-forward summary including segment N findings | Updated cross-segment summary |
| 8 | Checkin Gate | If checkin milestone: pause; notify human; await instruction | Human instruction |
| 9 | Task Queue | Enqueues segment N+1 (or revised plan if redirected) | Next segment queued |
Error Flow
| Error | Detection | Recovery |
|---|---|---|
| Segment worker crashes mid-execution | Missed heartbeat | Heartbeat monitor triggers recovery; resume from last checkpoint within segment |
| Task queue message lost | Dead-letter queue | DLQ alarm; reprocess segment from last checkpoint |
| LLM provider outage | Segment worker invocation failure | Exponential backoff retry; failover to secondary LLM provider if configured; alert |
| Cost overrun projection | Cost Controller | Pause task; notify human; await budget decision |
| Deadline at risk | Deadline Manager | Alert human; option to increase parallelism or reduce scope |
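The recovery for an LLM provider outage (exponential backoff, then failover) can be sketched as follows. This is an assumption-laden illustration: `call_with_backoff` and the provider callables are hypothetical, the delays double from `base_delay`, and the broad `except Exception` would be narrowed to the provider's transient error types in a real worker.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_backoff(providers: Sequence[Callable[[], T]], max_attempts: int = 4,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Try each provider in order, retrying with doubling delays before failing over."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(max_attempts):
            try:
                return provider()
            except Exception as exc:  # illustrative: catch only transient errors in practice
                last_error = exc
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all providers failed") from last_error


def primary() -> str:
    raise ConnectionError("primary provider unavailable")


def secondary() -> str:
    return "ok"


delays = []  # record delays instead of sleeping, so the demo runs instantly
result = call_with_backoff([primary, secondary], max_attempts=3, sleep=delays.append)
```

Here the primary is retried three times with delays of 1 and 2 seconds between attempts, then the call fails over to the secondary. Injecting `sleep` keeps the retry policy testable without real waiting.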
8. Security Considerations
Long-Running Identity Tokens
- Agent authentication tokens for accessing external tools must not expire during a multi-hour task
- Implement token refresh within the segment worker; use long-lived service account credentials, not short-lived user tokens
- Dynamic secrets (auto-rotating) must have rotation intervals longer than the maximum task duration
Data Retention of In-Progress Tasks
- Partial results contain sensitive intermediate data; they must be encrypted and access-controlled
- Partial results for cancelled tasks must be cleaned up according to the data retention policy
- Cross-segment context summaries may contain PII extracted from processed documents; apply the same classification and retention rules as the source data
OWASP LLM Top 10
| OWASP LLM Risk | Long-Running Applicability | Mitigation |
|---|---|---|
| LLM08 Excessive Agency | A long-running agent operating autonomously for hours may drift from its initial scope without human awareness | Mandatory human check-in at milestones; segment plan visible to humans from task creation; Management API enables real-time course correction at any point |
| LLM04 DoS | Runaway long tasks consume excessive compute and API quotas | Hard cost ceiling; segment count limit; deadline enforcement |
| LLM01 Prompt Injection | Documents processed by the agent may contain injected instructions | Content sanitisation on all ingested documents before task planning and segment execution |
| LLM09 Overreliance | Business stakeholders may trust long-running agent outputs without appropriate scrutiny | Output metadata includes confidence and completeness indicators; human check-in at completion is mandatory for high-stakes tasks |
9. Governance Considerations
Human Oversight for Long-Running Tasks
- All long-running tasks must have a named human owner who is notified of check-in points and receives partial results
- Tasks exceeding a configured duration (default: 4 hours) automatically escalate to the human owner's manager
- No task may run longer than 72 hours without a human re-approval of the segment plan
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Task Execution Log | Platform Engineering | Per task | Complete segment-by-segment execution record with costs, durations, and human decisions |
| Cost Budget Report | FinOps | Monthly | Aggregate long-task spend vs. budget; overrun analysis |
| Missed Deadline Report | Operations | Monthly | Tasks that exceeded deadline; root cause analysis |
| Human Check-in Audit | AI Governance | Quarterly | Review of human check-in compliance; decision quality audit |
10. Operational Considerations
SLOs
| SLO | Target | Window | Alert |
|---|---|---|---|
| Heartbeat interval compliance | 100% heartbeats within 2× expected interval | Per task | Any missed heartbeat triggers P2 |
| Task completion rate | ≥ 95% of started tasks complete | Monthly | < 90% triggers investigation |
| Segment retry rate | ≤ 5% of segments require retry | 24-hour rolling | > 10% indicates infrastructure instability |
| Human check-in response time | ≤ 4 hours for milestone approvals | Per check-in | > 8 hours triggers escalation to task owner's manager |
Capacity
- Segment workers are stateless containers; horizontal scaling is bounded by LLM provider quota and tool API rate limits
- Estimate: 1 worker per 5 concurrent segments for 30-minute segments; scale up to 1 worker per concurrent segment for 5-minute segments
- Partial result storage grows with task count × average output size; provision for 30-day retention of all partial results
11. Cost Considerations
Cost Drivers
| Cost Driver | Example | Control |
|---|---|---|
| Total token consumption | 500-doc due diligence: ~5M tokens | Budget ceiling; scope reduction option |
| Context summarisation overhead | 5–10% of total tokens for summaries | Efficient summarisation prompt; smaller model for summaries |
| Segment retry cost | Redundant work on retry | Checkpoint granularity; reliable infrastructure |
| Long-running compute | Worker idle time between segments | Event-driven scaling; scale-to-zero between segments |
Indicative Cost Range (USD)
| Task Type | Scale | Estimated Token Count | Estimated LLM Cost |
|---|---|---|---|
| Contract review (50 documents) | Medium | ~1.5M tokens | $15–60 |
| Contract review (500 documents) | Large | ~12M tokens | $120–480 |
| Codebase security audit (100K LOC) | Large | ~8M tokens | $80–320 |
| Research synthesis (200 papers) | Large | ~6M tokens | $60–240 |
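Every row in the table is consistent with a blended rate of roughly $10–40 per million tokens (for example, ~1.5M tokens at $15–60). That rate is inferred from the table, not stated by the source, and the function name below is hypothetical. A minimal estimator under that assumption:

```python
from typing import Tuple


def estimate_cost_range(tokens: int, low_per_million: float = 10.0,
                        high_per_million: float = 40.0) -> Tuple[float, float]:
    """Return (low, high) USD estimates for a token count at assumed blended rates."""
    millions = tokens / 1_000_000
    return (millions * low_per_million, millions * high_per_million)


contract_review_50 = estimate_cost_range(1_500_000)
contract_review_500 = estimate_cost_range(12_000_000)
```

Actual rates depend on the model and the input/output token mix, so the defaults should be replaced with the provider's current pricing before the estimate is used as a budget.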
12. Trade-Off Analysis
Task Decomposition Options
| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| A: LLM-Planned Segmentation (Recommended) | Task Planner uses LLM to decompose task into segments | Adaptive; handles irregular corpora | Planner itself consumes tokens; plan quality depends on model | Complex, variable tasks |
| B: Rule-Based Segmentation | Fixed rules decompose by document count, page count, or time estimate | Predictable; no LLM planning overhead | Inflexible; poor fit for varied task types | Well-structured, homogeneous tasks |
| C: User-Defined Milestones | Human specifies segment boundaries upfront | Maximum human control | Requires human upfront effort; may mis-estimate | Regulated tasks where human defines scope |
| D: Workflow Engine Native | Temporal or Durable Functions handle segmentation | Built-in persistence and retry; mature tooling | Less LLM-native; segment boundaries are code-defined | Engineering-intensive regulated workloads |
Architectural Tensions
| Tension | Left Pole | Right Pole | Balance |
|---|---|---|---|
| Segment granularity vs. Context continuity | Many small segments: low risk per segment | Few large segments: better cross-segment reasoning | 20–30 minute segments balancing context continuity and recovery granularity |
| Cost certainty vs. Completeness | Hard budget ceiling: task may not complete | Best-effort: may overrun budget | Budget ceiling with human escalation at 80% spend; partial results delivered at ceiling |
| Human oversight frequency vs. Task latency | Check-in after every segment | Single check-in at completion | Risk-tiered: check-in at task creation, major milestones, and completion |
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Agent drifts from task scope over many segments | Medium | High: wasted work; wrong outputs | Human check-in reveals drift; output quality monitoring | Re-anchor with task objective in carry-forward context; human redirect |
| Cross-segment context summary loses critical information | Medium | High: logical inconsistencies in final output | Human review of final output; quality scoring | Preserve critical references explicitly in summary template; test on sample tasks |
| Task never terminates (segment count grows adaptively) | Low | High: cost overrun | Segment count limit alert; cost ceiling | Hard limit on total segment count; cost ceiling enforcement |
| Partial results delivered to wrong principal | Very Low | Critical: data breach | Access control on partial result store | IAM on partial result endpoints; audit of all retrievals |
| Infrastructure change invalidates checkpoint schema | Low | Medium: recovery fails | Checkpoint deserialisation failure | Schema versioning; migration function |
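The "schema versioning; migration function" recovery can be sketched as a chain of per-version upgrades applied at load time. Everything here is hypothetical: the `schema_version` field, the version numbers, and the example change (renaming `cost` to `cost_usd`) are invented for illustration.

```python
from typing import Callable, Dict

CURRENT_SCHEMA_VERSION = 2


def _v1_to_v2(checkpoint: dict) -> dict:
    """Hypothetical change: v2 renamed the "cost" field to "cost_usd"."""
    upgraded = dict(checkpoint)
    upgraded["cost_usd"] = upgraded.pop("cost", 0.0)
    upgraded["schema_version"] = 2
    return upgraded


# One migration per source version; each returns the next version's shape.
MIGRATIONS: Dict[int, Callable[[dict], dict]] = {1: _v1_to_v2}


def load_checkpoint(raw: dict) -> dict:
    """Upgrade an older checkpoint step by step instead of failing to deserialise."""
    checkpoint = dict(raw)
    version = checkpoint.get("schema_version", 1)
    if version > CURRENT_SCHEMA_VERSION:
        raise ValueError(f"checkpoint version {version} is newer than this worker")
    while version < CURRENT_SCHEMA_VERSION:
        if version not in MIGRATIONS:
            raise ValueError(f"no migration from schema version {version}")
        checkpoint = MIGRATIONS[version](checkpoint)
        version = checkpoint["schema_version"]
    return checkpoint


restored = load_checkpoint({"schema_version": 1, "segment_id": "3", "cost": 1.5})
```

Storing the version in every checkpoint lets a worker deployed mid-task resume checkpoints written by its predecessor.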
14. Regulatory Considerations
APRA CPS 230
- Long-running agents supporting material business services require defined recovery time and recovery point objectives (RTO/RPO); the checkpoint + segmentation architecture enables sub-segment RTO
- Multi-hour tasks interacting with critical systems require operational risk assessment and business impact analysis
EU AI Act
- Art. 14 (Human Oversight): mandatory human check-ins at task creation and significant milestones implement the "meaningful human oversight" requirement for high-risk long-running agents
- For high-risk AI systems: the complete task execution log (all segments, costs, human decisions, partial results) is a required audit artefact
15. Reference Implementations
AWS
| Component | Service |
|---|---|
| Task Queue | Amazon SQS (FIFO with DLQ) |
| Segment Worker | AWS ECS Fargate (event-triggered) |
| Task Plan + Partial Results | Amazon DynamoDB |
| Workflow | AWS Step Functions (for structured decomposition) |
| Heartbeat Monitor | CloudWatch Alarms |
| Human Check-in | Amazon SNS + custom approval portal or AWS Step Functions human task |
Azure
| Component | Service |
|---|---|
| Task Queue | Azure Service Bus |
| Segment Worker | Azure Container Apps |
| Task Plan + Partial Results | Azure Cosmos DB |
| Workflow | Azure Durable Functions |
| Human Check-in | Azure Logic Apps + Adaptive Cards (Teams) |
On-Premises
| Component | Technology |
|---|---|
| Task Queue | Apache Kafka or RabbitMQ |
| Segment Worker | Kubernetes Jobs |
| Task Plan + Partial Results | PostgreSQL |
| Workflow | Temporal OSS |
16. Related Patterns
| Pattern | ID | Relationship Type | Notes |
|---|---|---|---|
| Single Agent Pattern | EAAPL-AGT001 | Extended By | Each segment is a single agent loop execution |
| Agent Checkpoint and Recovery | EAAPL-AGT005 | Depends On | Checkpointing is mandatory for multi-hour tasks |
| Agent Cost Governance | EAAPL-AGT010 | Integrates With | Budget ceiling and kill switch are cost governance capabilities |
| Human-in-the-Loop Agent | EAAPL-MAG003 | Extends | Human check-in at milestones is a specialised application of HITL |
| Supervisor Agent | EAAPL-MAG002 | Related | Supervisor can orchestrate long-running segments; alternative decomposition model |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Evidence |
|---|---|---|
| Core Technology (Queuing + Checkpointing) | 5 | Durable queues and checkpointing are mature distributed systems patterns |
| Context Summarisation Quality | 3 | Cross-segment context compression is a known challenge; LLM summarisation quality varies |
| Human Check-in UX | 3 | Tooling for human review of multi-hour tasks improving; no standard UX pattern yet |
| Cost Estimation Accuracy | 3 | Per-segment cost estimates improve with task history; initial estimates are rough |
| Adaptive Re-planning | 2 | Adaptive segment plan revision is emerging; limited production evidence |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-06-01 | Architecture Board | Initial publication |
| 1.1 | 2024-10-15 | Platform Engineering | Added adaptive re-planning; deadline manager; partial result aggregator |
| 1.2 | 2025-03-01 | Architecture Board | Added EU AI Act Art. 14 mapping; human check-in escalation policy; cost estimation table |