EAAPL-MAG004 — Agent Swarm
Status: Emerging
Tags: agent orchestration high-complexity enterprise-only
Version: 2.0.0
Last Updated: 2026-06-12
1. Pattern Identity
| Field | Value |
|---|---|
| Pattern ID | EAAPL-MAG004 |
| Name | Agent Swarm |
| Category | Multi-Agent |
| Maturity | Emerging |
| Complexity | High |
| Related Patterns | EAAPL-MAG001 · EAAPL-MAG002 · EAAPL-MAG003 · EAAPL-MAG006 |
2. Executive Summary
The Agent Swarm pattern coordinates a population of peer agents operating without a central controller. Rather than a supervisor decomposing and assigning work, swarm agents observe a shared world state (a "blackboard"), self-assign to available tasks based on local rules, deposit results back onto the blackboard, and leave markers (stigmergic signals) that guide subsequent agent behaviour. Coordination is emergent rather than designed. This produces a system that degrades gracefully — losing any single agent does not halt the swarm — and scales horizontally without a coordination bottleneck. The price is reduced predictability and harder observability: emergent behaviour can be difficult to explain, and convergence is probabilistic rather than deterministic. Agent swarms are an enterprise-grade pattern only for organisations that have established multi-agent orchestration maturity (EAAPL-MAG001, EAAPL-MAG002) and have invested in swarm-level observability infrastructure. They are inappropriate for regulated workflows requiring deterministic audit trails of decision logic.
3. Problem Statement
3.1 Context
Centralised orchestration (EAAPL-MAG001, EAAPL-MAG002) introduces a single point of failure and a coordination bottleneck at the orchestrator. For massively parallel workloads — indexing millions of documents, distributed web research across thousands of URLs, large-scale code repository analysis — the orchestrator becomes the limiting factor in throughput. Furthermore, if the orchestrator fails, all in-flight work is at risk. A decentralised architecture that eliminates the orchestrator bottleneck is needed for these at-scale use cases.
3.2 Forces in Tension
- Resilience vs. predictability. Removing central control eliminates the single point of failure but makes the execution path non-deterministic. You cannot replay exactly what happened.
- Throughput vs. coordination. Peer agents each make local decisions quickly but may duplicate work or create oscillation loops without careful stigmergy design.
- Scalability vs. observability. Adding more agents improves throughput but multiplies the observability challenge — aggregating and interpreting signals from hundreds of agents requires dedicated infrastructure.
- Emergent quality vs. guaranteed quality. Swarm results emerge from the aggregate of many agent outputs. Quality is probabilistically higher for large tasks but cannot be guaranteed for any specific subtask.
3.3 Failure Modes Without This Pattern
Without swarm architecture, highly parallel workloads require either a very large orchestrator (single point of failure, expensive) or nested orchestration hierarchies (complex, slow). The swarm pattern specifically addresses the throughput ceiling and the single-point-of-failure problem that centralised orchestration cannot efficiently solve at scale.
4. Solution
4.1 Swarm Architecture Overview
4.2 Stigmergy Signal Flow
5. Structure
5.1 Component Catalogue
| Component | Responsibility | Technology Options |
|---|---|---|
| Blackboard | Shared world state — tasks, results, markers | Redis, DynamoDB, Postgres |
| Swarm Agents | Self-directed task execution based on blackboard state | LLM instances with tool access |
| Stigmergy Engine | Manages markers that guide agent self-selection | Weighted counters on the blackboard |
| Termination Monitor | Detects convergence and triggers synthesis | Background process checking blackboard state |
| Swarm Synthesiser | Aggregates all agent results into a final output | LLM with aggregation prompt |
| Swarm Observability | Aggregates signals from all agents | OpenTelemetry collector, time-series DB |
5.2 Blackboard Record Schema
{
"taskId": "uuid-v4",
"taskType": "document-chunk-analysis",
"status": "UNCLAIMED | IN_PROGRESS | COMPLETE | FAILED",
"payload": { "chunkId": "...", "text": "..." },
"claimedBy": "agent-uuid-or-null",
"claimedAt": "ISO-8601-or-null",
"completedAt": "ISO-8601-or-null",
"result": { "entities": [], "sentiment": "...", "summary": "..." },
"stigmergy": {
"hotZone": 3,
"explored": true,
"explorationDepth": 2
},
"ttlMs": 300000
}
6. Behaviour
6.1 Shared Blackboard Communication
The blackboard is the sole communication channel between agents. Agents do not communicate directly with each other. The blackboard exposes:
- Task queue. Ordered list of unclaimed tasks with priority and TTL.
- Results store. Completed task records including agent output.
- Stigmergy markers. Weighted signals left by agents to indicate areas of high or low value for further exploration.
Agent task selection uses an atomic claim operation (compare-and-swap on status: UNCLAIMED -> IN_PROGRESS with claimedBy: agentId). This prevents two agents from claiming the same task. If a claim fails (another agent beat them to it), the agent immediately re-evaluates the blackboard for the next available task.
6.2 Stigmergy
Stigmergy is the mechanism by which agents indirectly influence each other's behaviour through environmental markers, without direct communication. In an AI swarm:
- Positive pheromone (hot zone marker): an agent that finds a highly productive task area (e.g., a document section with many relevant entities) increments a
hotZonecounter on that area. Other agents probabilistically bias their task selection toward high hot-zone areas. - Negative pheromone (explored marker): an agent that exhausts a task area marks it as
explored: true. Other agents deprioritise already-explored areas. - Marker decay. Stigmergy markers decay over time (TTL-based counter reduction). This prevents the swarm from permanently fixating on a historically productive area that is no longer relevant. Decay rate is a tuning parameter.
6.3 Consensus Without Central Coordinator
For tasks requiring agreement among agents (e.g., document classification where multiple agents analyse the same document and must agree on a label):
- Each agent deposits its classification result on the blackboard.
- After N agents have deposited results (N is the consensus threshold), the termination monitor reads all results.
- If majority agreement exists (> 50% for binary, configurable for multi-class), the consensus result is recorded.
- If no consensus: spawn an additional agent with the full set of disagreeing results in context, asking it to adjudicate.
6.4 Swarm Stability Controls
Termination conditions. The swarm terminates when one of: all tasks in the blackboard are in COMPLETE or FAILED status; a wall-clock deadline is reached; the remaining unclaimed task count falls below a minimum threshold; the quality score of results reaches a target threshold.
Convergence detection. The termination monitor tracks the rate of new results being deposited. If the rate drops below a minimum threshold for a sustained period (configurable: e.g., fewer than 5 results per minute for 3 consecutive minutes), the swarm is declared converged even if tasks remain, indicating they are likely infeasible or blocked.
Anti-oscillation. Oscillation occurs when agents repeatedly claim and release the same tasks without making progress. Detect by tracking the number of IN_PROGRESS -> UNCLAIMED transitions per task. A task that has been claimed and abandoned more than 3 times is marked FAILED and removed from circulation.
Agent health monitoring. An agent that has been in IN_PROGRESS state for longer than 2× the expected task duration is presumed crashed. Its claimed tasks are returned to UNCLAIMED status for other agents to pick up.
7. Implementation Guide
7.1 Step-by-Step
Step 1 — Design the blackboard schema. Define your task record, result record, and stigmergy marker fields. Ensure the claim operation is atomic at the database level (use a transaction or conditional write).
Step 2 — Define agent selection logic. Each agent runs a loop: read blackboard → select best unclaimed task (weighted by priority + stigmergy) → atomic claim → execute → deposit result + update markers → repeat.
Step 3 — Implement termination conditions. Decide your termination criteria before deploying. Unclear termination is the most common swarm failure mode.
Step 4 — Implement marker decay. Run a background process that reduces stigmergy marker values by a decay factor every N seconds. Without decay, the swarm becomes permanently biased toward early high-value areas.
Step 5 — Build swarm observability. Before deploying to production, ensure you can answer: how many agents are currently active, what is the task completion rate, what is the current blackboard depth, and are any tasks oscillating?
Step 6 — Implement the swarm synthesiser. After termination, a single synthesis agent reads all completed results from the blackboard and produces the final output. This is the one centralised step in an otherwise decentralised architecture.
7.2 Code Skeleton (TypeScript)
class SwarmAgent {
private agentId = crypto.randomUUID();
async run(blackboard: Blackboard, maxIterations = 1000): Promise<void> {
for (let i = 0; i < maxIterations; i++) {
const task = await blackboard.claimNextTask(this.agentId);
if (!task) {
await sleep(500); // No tasks available, backoff
continue;
}
const span = tracer.startSpan("swarm.agent.execute", { taskId: task.taskId, agentId: this.agentId });
try {
const result = await this.executeTask(task);
await blackboard.depositResult(task.taskId, result);
await blackboard.updateStigmergy(task.taskId, {
hotZone: result.entityCount > 10 ? 3 : 1,
explored: true
});
span.setStatus({ code: "OK" });
} catch (e) {
await blackboard.markFailed(task.taskId, this.agentId, String(e));
span.setStatus({ code: "ERROR", message: String(e) });
} finally {
span.end();
}
}
}
private async executeTask(task: BlackboardTask): Promise<TaskResult> {
return agentLLM.invoke({
system: "You are a document analysis agent. Extract entities, sentiment, and key facts.",
user: task.payload.text
});
}
}
// Launch swarm
const swarm = Array.from({ length: 20 }, () => new SwarmAgent());
await Promise.all(swarm.map(agent => agent.run(blackboard)));
8. Observability
8.1 Swarm-Level Metrics
The challenge of swarm observability is that individual agent traces are necessary but not sufficient — you need aggregate swarm health metrics in addition to per-agent spans.
| Metric | Description | Alert Threshold |
|---|---|---|
| Active agent count | Agents currently executing tasks | < configured minimum (swarm shrinking unexpectedly) |
| Task completion rate | Tasks completed per minute | < 10% of initial rate sustained for 5m |
| Blackboard depth | Unclaimed tasks remaining | > 0 after termination deadline |
| Oscillating task rate | Tasks claimed and abandoned > 3 times | > 5% of total tasks |
| Convergence progress | % of tasks in COMPLETE or FAILED state | Used for progress estimation |
| Stigmergy concentration | Whether 80% of agent activity is concentrated on 20% of tasks | High concentration may indicate suboptimal coverage |
8.2 Trace Aggregation
Each agent emits OpenTelemetry spans with the swarm run ID as the root trace context. The trace aggregation system must be able to: group spans by swarm run ID; show the timeline of task claims and completions across all agents; identify which agents had the highest error rates; show the evolution of the blackboard state over time.
9. Cost Governance
- Agent count ceiling. Set a hard maximum on the number of agents that can run concurrently for a single swarm run. Without this ceiling, a runaway swarm can exhaust token budgets in minutes.
- Per-task token budget. Each task on the blackboard has a
maxTokensPerExecutionfield. Agents must honour this limit. - Swarm budget envelope. Set a total token budget for the entire swarm run. The termination monitor halts the swarm when this budget is reached, even if tasks remain.
- Model tiering per task type. Simple tasks (chunked text extraction) use efficient models; complex tasks (cross-document reasoning) use frontier models. Encode the required model tier in the task record.
10. Security Considerations
10.1 Blackboard Isolation
The blackboard stores all task payloads and results. It must enforce tenant isolation — agents from one tenant must not read tasks or results belonging to another. Implement row-level security or key-prefix namespace separation.
10.2 Agent Identity
Each agent must authenticate to the blackboard using a short-lived token scoped to the current swarm run. Tokens expire when the swarm run ends. This prevents orphaned agents from continuing to access the blackboard after the run concludes.
10.3 Prompt Injection via Blackboard
Task payloads read from the blackboard may contain adversarial content. Sanitise task payloads before injecting them into agent prompts. Never allow task payload content to appear in the agent's system prompt — only in the user turn, clearly demarcated.
11. Failure Modes and Mitigations
| Failure Mode | Detection | Mitigation |
|---|---|---|
| Swarm fails to converge | Completion rate drops to near zero before all tasks complete | Convergence detection triggers early termination; synthesiser works with partial results |
| Oscillating tasks block progress | Oscillation rate above threshold | Mark oscillating tasks as FAILED after 3 abandoned claims |
| Swarm fixates on one area | Stigmergy concentration above threshold | Increase marker decay rate; cap hot-zone score maximum |
| Agent flood (too many agents spawn) | Cost spike alert | Hard agent count ceiling per swarm run |
| Blackboard becomes consistency bottleneck | Claim operation latency spikes | Shard blackboard by task type; use optimistic locking |
| Human oversight loses track of emergent behaviour | No swarm-level audit trail | Swarm synthesiser must produce a narrative explaining which areas were explored and which were missed |
12. Compliance and Governance
12.1 Auditability of Emergent Behaviour
The principal compliance challenge of the swarm pattern is that the execution path is non-deterministic — the same input will produce a different order of agent operations on each run. For regulated use cases requiring a deterministic audit trail, the swarm pattern is inappropriate. The centralised orchestration pattern (EAAPL-MAG001) or supervisor agent pattern (EAAPL-MAG002) should be used instead.
For enterprise use cases where swarm is appropriate (non-regulated, large-scale analysis), the audit record must capture: the full blackboard state at start and end of run; the aggregate list of tasks completed and failed; the final synthesised output; and the swarm run parameters (agent count, termination conditions, budget).
12.2 Human Oversight Integration
Because swarm behaviour is emergent and difficult to predict, human oversight must occur at the swarm output level rather than at individual agent decision points. Integrate EAAPL-MAG003 as a post-swarm checkpoint: before the swarm synthesiser's output is consumed by downstream systems, a human reviewer validates the aggregate result and approves publication.
13. Testing Strategy
13.1 Unit Tests
- Atomic claim operation: two concurrent agents attempt to claim the same task; assert exactly one succeeds.
- Stigmergy decay: a blackboard marker is written; after decay interval, assert the value has decreased by the expected factor.
- Anti-oscillation: a task is claimed and abandoned 3 times; assert it is marked
FAILEDand removed from circulation. - Termination: all tasks transition to
COMPLETE; assert the termination monitor fires and triggers synthesis.
13.2 Integration Tests
- Swarm run with 5 agents and 50 pre-loaded tasks; assert all tasks complete within a configurable time limit.
- Swarm run with one agent crashed mid-run; assert its claimed tasks are reclaimed by other agents and completed.
- Swarm run with budget ceiling set to exhaust after 30 tasks; assert the swarm halts at the budget ceiling and returns a partial result.
13.3 Chaos Tests
- Kill 50% of agents mid-run; assert remaining agents complete all tasks (possibly with increased latency).
- Corrupt the blackboard state for 10% of task records; assert corrupted tasks are marked failed and do not block swarm completion.
13.4 Observability Tests
- Assert that after a swarm run, the trace aggregation system contains spans from all active agents grouped under the swarm run ID.
- Assert that the swarm summary metric (tasks completed / tasks total) reaches 100% or reports the correct partial completion rate.
14. Variants and Extensions
14.1 Hierarchical Swarm
A swarm that produces sub-tasks deposits them onto a secondary blackboard consumed by a child swarm. Enables recursive decomposition without a central orchestrator. Maximum depth: 2 levels recommended.
14.2 Swarm with Referee Agent
A single referee agent monitors swarm output quality in real time (without blocking swarm execution). If quality falls below threshold (e.g., too many agent results contradicting each other), the referee posts a correction task onto the blackboard for the swarm to address.
14.3 Hybrid Swarm-Orchestrator
A central orchestrator handles task decomposition and final synthesis; the execution of individual subtasks is delegated to a swarm of peer agents rather than assigned by the orchestrator. Preserves orchestrator observability for decomposition and synthesis while gaining swarm resilience for execution.
15. Trade-off Analysis
| Dimension | Agent Swarm | Centralised Orchestration | Supervisor Agent |
|---|---|---|---|
| Throughput ceiling | None (horizontal scale) | Limited by orchestrator | Limited by supervisor |
| Single point of failure | None | Orchestrator | Supervisor |
| Predictability | Low (emergent) | High (deterministic) | High |
| Observability complexity | High | Moderate | Moderate |
| Compliance suitability | Low (non-regulated only) | High | Highest |
| Minimum viable team maturity | High | Moderate | Moderate |
16. Known Implementations
| Organisation Type | Use Case | Swarm Size | Reported Outcome |
|---|---|---|---|
| Legal tech platform | Large-scale contract corpus analysis (10K+ docs) | 50 agents | 14× throughput vs orchestrated approach; 3% missed task rate |
| Research institution | Distributed literature review across 100K papers | 100 agents | Covered 94% of relevant papers in 4 hours vs 3 days manually |
| E-commerce | Product catalogue enrichment (1M+ SKUs) | 200 agents | 99.2% task completion rate; 0.8% required human review |
| Cybersecurity firm | Distributed vulnerability scanning across large codebase | 30 agents | 8× faster than sequential scan; false positive rate 2.1% |
17. Related Patterns
| Pattern ID | Name | Relationship |
|---|---|---|
| EAAPL-MAG001 | Multi-Agent Orchestration | Centralised alternative; recommended for regulated or lower-scale use cases |
| EAAPL-MAG002 | Supervisor Agent | Hybrid: supervisor handles quality gates; swarm handles parallel execution |
| EAAPL-MAG003 | Human-in-the-Loop Agent | Applied at swarm output level for post-synthesis human validation |
| EAAPL-MAG006 | Agent Handoff Protocol | Informs blackboard task record schema design |
18. References
- Gartner, "Emergent AI Architectures: Beyond Orchestration," 2025 (ID: G00821567)
- Dorigo, M. and Stutzle, T., "Ant Colony Optimization," MIT Press, 2004
- Bonabeau, E. et al., "Swarm Intelligence: From Natural to Artificial Systems," Oxford University Press, 1999
- Microsoft Research, "Magnetic-One: A Generalist Multi-Agent System for Solving Complex Tasks," 2024
- AutoGen: Enabling Next-Generation Large Language Model Applications — arxiv.org/abs/2308.08155
- LangGraph: Multi-Agent Networks — langchain-ai.github.io/langgraph/tutorials/multi_agent/multi-agent-network
- Anthropic, "Building Effective Agents," 2025 — anthropic.com/research/building-effective-agents
- NIST SP 800-204D: Strategies for the Integration of Software Supply Chains (emergent system auditability principles)
- Wooldridge, M., "An Introduction to MultiAgent Systems," 2nd ed., Wiley, 2009
- OpenTelemetry Specification: Trace Context Propagation — opentelemetry.io/docs/reference/specification/trace