
EAAPL-MAG004 — Agent Swarm

Status: Emerging
Tags: agent orchestration high-complexity enterprise-only
Version: 2.0.0
Last Updated: 2026-06-12

1. Pattern Identity

Field	Value
Pattern ID	EAAPL-MAG004
Name	Agent Swarm
Category	Multi-Agent
Maturity	Emerging
Complexity	High
Related Patterns	EAAPL-MAG001 · EAAPL-MAG002 · EAAPL-MAG003 · EAAPL-MAG006

2. Executive Summary

The Agent Swarm pattern coordinates a population of peer agents operating without a central controller. Rather than a supervisor decomposing and assigning work, swarm agents observe a shared world state (a "blackboard"), self-assign to available tasks based on local rules, deposit results back onto the blackboard, and leave markers (stigmergic signals) that guide subsequent agent behaviour. Coordination is emergent rather than designed. This produces a system that degrades gracefully (losing any single agent does not halt the swarm) and scales horizontally without an orchestrator bottleneck. The price is reduced predictability and harder observability: emergent behaviour can be difficult to explain, and convergence is probabilistic rather than deterministic. The pattern is suitable for enterprise use only in organisations that have established multi-agent orchestration maturity (EAAPL-MAG001, EAAPL-MAG002) and have invested in swarm-level observability infrastructure. It is inappropriate for regulated workflows requiring deterministic audit trails of decision logic.

3. Problem Statement

3.1 Context

Centralised orchestration (EAAPL-MAG001, EAAPL-MAG002) introduces a single point of failure and a coordination bottleneck at the orchestrator. For massively parallel workloads — indexing millions of documents, distributed web research across thousands of URLs, large-scale code repository analysis — the orchestrator becomes the limiting factor in throughput. Furthermore, if the orchestrator fails, all in-flight work is at risk. A decentralised architecture that eliminates the orchestrator bottleneck is needed for these at-scale use cases.

3.2 Forces in Tension

Resilience vs. predictability. Removing central control eliminates the single point of failure but makes the execution path non-deterministic. You cannot replay exactly what happened.
Throughput vs. coordination. Peer agents each make local decisions quickly but may duplicate work or create oscillation loops without careful stigmergy design.
Scalability vs. observability. Adding more agents improves throughput but multiplies the observability challenge — aggregating and interpreting signals from hundreds of agents requires dedicated infrastructure.
Emergent quality vs. guaranteed quality. Swarm results emerge from the aggregate of many agent outputs. Quality is probabilistically higher for large tasks but cannot be guaranteed for any specific subtask.

3.3 Failure Modes Without This Pattern

Without swarm architecture, highly parallel workloads require either a very large orchestrator (single point of failure, expensive) or nested orchestration hierarchies (complex, slow). The swarm pattern specifically addresses the throughput ceiling and the single-point-of-failure problem that centralised orchestration cannot efficiently solve at scale.

4. Solution

4.1 Swarm Architecture Overview

ARCHITECTURE DIAGRAM

flowchart TD
    subgraph Input["Task Entry"]
        A[Task Posted to Blackboard]
    end
    subgraph Swarm["Swarm Agents"]
        B[Agent Alpha]
        C[Agent Beta]
        D[Agent Gamma]
        E[Agent Delta]
    end
    subgraph Shared["Shared Blackboard"]
        F[Task Queue]
        G[Results Store]
        H[Stigmergy Markers]
    end
    subgraph Output["Convergence"]
        I{Termination Check}
        J[Swarm Output Synthesiser]
        K[Final Result]
    end
    A --> F
    F --> B
    F --> C
    F --> D
    F --> E
    B --> G
    C --> G
    D --> G
    E --> H
    H --> F
    G --> I
    I -->|not done| F
    I -->|done| J --> K
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#f3e8ff,stroke:#a855f7
    style J fill:#f0fdf4,stroke:#22c55e
    style K fill:#d1fae5,stroke:#10b981

4.2 Stigmergy Signal Flow

ARCHITECTURE DIAGRAM

flowchart TD
    subgraph Agent["Agent Processing"]
        A[Agent Reads Blackboard]
        B[Claims Available Task]
        C[Executes Task]
        D[Deposits Result]
        E[Deposits Stigmergy Marker]
    end
    subgraph Board["Blackboard State"]
        F[Task: Unclaimed]
        G[Task: In-Progress]
        H[Task: Complete]
        I[Marker: HotZone]
        J[Marker: Explored]
    end
    A --> B
    B --> G
    C --> D --> H
    D --> E --> I
    I --> F
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#f3e8ff,stroke:#a855f7
    style J fill:#fef9c3,stroke:#eab308

5. Structure

5.1 Component Catalogue

Component	Responsibility	Technology Options
Blackboard	Shared world state — tasks, results, markers	Redis, DynamoDB, Postgres
Swarm Agents	Self-directed task execution based on blackboard state	LLM instances with tool access
Stigmergy Engine	Manages markers that guide agent self-selection	Weighted counters on the blackboard
Termination Monitor	Detects convergence and triggers synthesis	Background process checking blackboard state
Swarm Synthesiser	Aggregates all agent results into a final output	LLM with aggregation prompt
Swarm Observability	Aggregates signals from all agents	OpenTelemetry collector, time-series DB

5.2 Blackboard Record Schema

{
  "taskId": "uuid-v4",
  "taskType": "document-chunk-analysis",
  "status": "UNCLAIMED | IN_PROGRESS | COMPLETE | FAILED",
  "payload": { "chunkId": "...", "text": "..." },
  "claimedBy": "agent-uuid-or-null",
  "claimedAt": "ISO-8601-or-null",
  "completedAt": "ISO-8601-or-null",
  "result": { "entities": [], "sentiment": "...", "summary": "..." },
  "stigmergy": {
    "hotZone": 3,
    "explored": true,
    "explorationDepth": 2
  },
  "ttlMs": 300000
}
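
The record above could be given a typed shape along the following lines. This is a minimal sketch: the field names come from the schema, while the generic parameters, the nullable result, the newRecord helper, and its neutral initial marker values are assumptions made for illustration.

```typescript
// Typed view of the blackboard record. Field names follow the JSON schema above;
// the generics and the newRecord helper are illustrative assumptions.
type TaskStatus = "UNCLAIMED" | "IN_PROGRESS" | "COMPLETE" | "FAILED";

interface StigmergyMarkers {
  hotZone: number;
  explored: boolean;
  explorationDepth: number;
}

interface BlackboardRecord<P, R = unknown> {
  taskId: string;
  taskType: string;
  status: TaskStatus;
  payload: P;
  claimedBy: string | null;
  claimedAt: string | null;   // ISO-8601 timestamp, null until claimed
  completedAt: string | null; // ISO-8601 timestamp, null until finished
  result: R | null;           // assumed null until a result is deposited
  stigmergy: StigmergyMarkers;
  ttlMs: number;
}

// Hypothetical helper: builds a fresh, unclaimed record with neutral markers.
function newRecord<P, R = unknown>(
  taskId: string,
  taskType: string,
  payload: P,
  ttlMs = 300_000,
): BlackboardRecord<P, R> {
  return {
    taskId,
    taskType,
    status: "UNCLAIMED",
    payload,
    claimedBy: null,
    claimedAt: null,
    completedAt: null,
    result: null,
    stigmergy: { hotZone: 0, explored: false, explorationDepth: 0 },
    ttlMs,
  };
}
```

Encoding status as a string-literal union lets the compiler reject any state outside the four the schema lists.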

6. Behaviour

6.1 Shared Blackboard Communication

The blackboard is the sole communication channel between agents. Agents do not communicate directly with each other. The blackboard exposes:

Task queue. Ordered list of unclaimed tasks with priority and TTL.
Results store. Completed task records including agent output.
Stigmergy markers. Weighted signals left by agents to indicate areas of high or low value for further exploration.

Agent task selection uses an atomic claim operation (compare-and-swap on status: UNCLAIMED -> IN_PROGRESS with claimedBy: agentId). This prevents two agents from claiming the same task. If a claim fails (another agent beat them to it), the agent immediately re-evaluates the blackboard for the next available task.
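
The claim semantics described above can be sketched with an in-memory stand-in for the blackboard. This is illustrative only: InMemoryBlackboard, tryClaim, and claimNext are hypothetical names, and the check-then-write is atomic here only because it runs synchronously on a single JavaScript thread. A real store would need a conditional write or a transaction to give the same guarantee.

```typescript
type Status = "UNCLAIMED" | "IN_PROGRESS" | "COMPLETE" | "FAILED";

interface Task {
  taskId: string;
  status: Status;
  claimedBy: string | null;
  claimedAt: string | null;
}

// In-memory stand-in for the blackboard (assumption: a real deployment would
// back this with a store that supports conditional writes).
class InMemoryBlackboard {
  private tasks = new Map<string, Task>();

  add(taskId: string): void {
    this.tasks.set(taskId, { taskId, status: "UNCLAIMED", claimedBy: null, claimedAt: null });
  }

  get(taskId: string): Task | undefined {
    return this.tasks.get(taskId);
  }

  // Compare-and-swap: the write happens only if the task is still UNCLAIMED.
  tryClaim(taskId: string, agentId: string): boolean {
    const task = this.tasks.get(taskId);
    if (!task || task.status !== "UNCLAIMED") return false;
    task.status = "IN_PROGRESS";
    task.claimedBy = agentId;
    task.claimedAt = new Date().toISOString();
    return true;
  }

  // A losing agent simply moves on to the next candidate.
  claimNext(agentId: string): Task | undefined {
    let claimed: Task | undefined;
    this.tasks.forEach((task) => {
      if (!claimed && this.tryClaim(task.taskId, agentId)) claimed = task;
    });
    return claimed;
  }
}
```

With one task on the board, the first tryClaim call succeeds and any later call for the same task returns false, which is the behaviour the unit test in section 13.1 asks for.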

6.2 Stigmergy

Stigmergy is the mechanism by which agents indirectly influence each other's behaviour through environmental markers, without direct communication. In an AI swarm:

Positive pheromone (hot zone marker): an agent that finds a highly productive task area (e.g., a document section with many relevant entities) increments a hotZone counter on that area. Other agents probabilistically bias their task selection toward high hot-zone areas.
Negative pheromone (explored marker): an agent that exhausts a task area marks it as explored: true. Other agents deprioritise already-explored areas.
Marker decay. Stigmergy markers decay over time (TTL-based counter reduction). This prevents the swarm from permanently fixating on a historically productive area that is no longer relevant. Decay rate is a tuning parameter.
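
One way to turn these three rules into code is a weighted random pick plus a periodic decay step. The sketch below is an assumption-laden illustration, not the pattern's prescribed formula: the weight function, the 0.1 penalty for explored areas, and the multiplicative decay factor are all hypothetical tuning choices.

```typescript
interface Candidate {
  taskId: string;
  hotZone: number;   // positive pheromone counter
  explored: boolean; // negative pheromone flag
}

// Assumed weighting: baseline of 1 plus the hot-zone counter, scaled down
// sharply (but not to zero) for areas already marked as explored.
function selectionWeight(c: Candidate, exploredPenalty = 0.1): number {
  const base = 1 + Math.max(0, c.hotZone);
  return c.explored ? base * exploredPenalty : base;
}

// Roulette-wheel selection: each candidate is picked with probability
// proportional to its weight. `rand` is injectable so tests are deterministic.
function pickTask(cands: Candidate[], rand: () => number = Math.random): Candidate | undefined {
  const weights = cands.map((c) => selectionWeight(c));
  const total = weights.reduce((a, b) => a + b, 0);
  if (total <= 0) return undefined;
  let r = rand() * total;
  for (let i = 0; i < cands.length; i++) {
    r -= weights[i] ?? 0;
    if (r < 0) return cands[i];
  }
  return cands[cands.length - 1]; // guards against floating-point leftovers
}

// Marker decay: called on each decay tick, shrinks the counter by a fixed factor.
function decayHotZone(hotZone: number, factor = 0.5): number {
  return hotZone * factor;
}
```

Keeping explored areas at a small non-zero weight means they are deprioritised rather than excluded, which matches the wording above; setting the penalty to 0 would exclude them outright.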

6.3 Consensus Without Central Coordinator

For tasks requiring agreement among agents (e.g., document classification where multiple agents analyse the same document and must agree on a label):

Each agent deposits its classification result on the blackboard.
After N agents have deposited results (N is the consensus threshold), the termination monitor reads all results.
If majority agreement exists (> 50% for binary, configurable for multi-class), the consensus result is recorded.
If no consensus: spawn an additional agent with the full set of disagreeing results in context, asking it to adjudicate.
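
The four steps above can be sketched as a pure function that the termination monitor might call each time a result is deposited. The names checkConsensus and ConsensusOutcome are hypothetical, and the single majority parameter is an assumed simplification of the "configurable for multi-class" rule.

```typescript
type ConsensusOutcome =
  | { status: "PENDING" }                          // fewer than N results so far
  | { status: "CONSENSUS"; label: string }         // majority agreement reached
  | { status: "ADJUDICATE"; candidates: string[] } // hand off to an adjudicating agent

function checkConsensus(labels: string[], threshold: number, majority = 0.5): ConsensusOutcome {
  if (labels.length === 0 || labels.length < threshold) return { status: "PENDING" };

  // Tally how many agents voted for each label.
  const counts = new Map<string, number>();
  for (const label of labels) counts.set(label, (counts.get(label) ?? 0) + 1);

  let topLabel = "";
  let topCount = 0;
  counts.forEach((count, label) => {
    if (count > topCount) {
      topLabel = label;
      topCount = count;
    }
  });

  // Strictly greater than the majority fraction, so a 50/50 split is not consensus.
  if (topCount / labels.length > majority) return { status: "CONSENSUS", label: topLabel };
  return { status: "ADJUDICATE", candidates: Array.from(counts.keys()) };
}
```

For example, three votes of spam, spam, ham with a threshold of 3 yield a consensus on spam, while an even two-against-two split falls through to adjudication.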

6.4 Swarm Stability Controls

Termination conditions. The swarm terminates when any one of the following holds: all tasks on the blackboard are in COMPLETE or FAILED status; a wall-clock deadline is reached; the remaining unclaimed task count falls below a minimum threshold; the quality score of results reaches a target threshold; the swarm token budget (section 9) is exhausted.

Convergence detection. The termination monitor tracks the rate of new results being deposited. If the rate drops below a minimum threshold for a sustained period (configurable: e.g., fewer than 5 results per minute for 3 consecutive minutes), the swarm is declared converged even if tasks remain, indicating they are likely infeasible or blocked.
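
A termination monitor combining the deadline check with the rate-based convergence rule might look like the sketch below. It is a simplified illustration: SwarmSnapshot, TerminationConfig, and terminationReason are hypothetical names, the per-minute rate array is an assumed input format, and the unclaimed-count and quality-score conditions are omitted for brevity.

```typescript
interface SwarmSnapshot {
  totalTasks: number;
  completeOrFailed: number;
  elapsedMs: number;
  resultsPerMinute: number[]; // oldest first, one entry per elapsed minute
}

interface TerminationConfig {
  deadlineMs: number;
  minResultsPerMinute: number; // e.g. 5
  sustainedMinutes: number;    // e.g. 3
}

type TerminationReason = "ALL_TASKS_SETTLED" | "DEADLINE" | "CONVERGED";

// Converged when every one of the most recent `sustainedMinutes` samples
// is below the minimum rate.
function isConverged(rates: number[], cfg: TerminationConfig): boolean {
  if (cfg.sustainedMinutes < 1 || rates.length < cfg.sustainedMinutes) return false;
  return rates.slice(-cfg.sustainedMinutes).every((r) => r < cfg.minResultsPerMinute);
}

// Returns the first matching reason, or null if the swarm should keep running.
function terminationReason(s: SwarmSnapshot, cfg: TerminationConfig): TerminationReason | null {
  if (s.totalTasks > 0 && s.completeOrFailed >= s.totalTasks) return "ALL_TASKS_SETTLED";
  if (s.elapsedMs >= cfg.deadlineMs) return "DEADLINE";
  if (isConverged(s.resultsPerMinute, cfg)) return "CONVERGED";
  return null;
}
```

Requiring a full window of samples before declaring convergence avoids a false positive in the first minutes of a run, when the rate history is still too short to be meaningful.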

Anti-oscillation. Oscillation occurs when agents repeatedly claim and release the same tasks without making progress. Detect by tracking the number of IN_PROGRESS -> UNCLAIMED transitions per task. A task that has been claimed and abandoned more than 3 times is marked FAILED and removed from circulation.

Agent health monitoring. An agent whose claimed task has remained in IN_PROGRESS state for longer than 2× the expected task duration is presumed crashed. Its claimed tasks are returned to UNCLAIMED status for other agents to pick up.
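
The anti-oscillation and health-monitoring rules can share one release path, as in the sketch below. The names TrackedTask, releaseTask, and reclaimStale are hypothetical, and counting a crash reclaim as an abandonment is an assumption; a deployment could equally track the two separately.

```typescript
type Status = "UNCLAIMED" | "IN_PROGRESS" | "COMPLETE" | "FAILED";

interface TrackedTask {
  taskId: string;
  status: Status;
  claimedBy: string | null;
  claimedAtMs: number | null;
  abandonCount: number; // IN_PROGRESS -> UNCLAIMED transitions so far
}

const MAX_ABANDONS = 3;

// Returns a task to the pool, or marks it FAILED once it has been
// abandoned more than MAX_ABANDONS times.
function releaseTask(task: TrackedTask): void {
  task.abandonCount += 1;
  task.claimedBy = null;
  task.claimedAtMs = null;
  task.status = task.abandonCount > MAX_ABANDONS ? "FAILED" : "UNCLAIMED";
}

// Releases every task whose claim is older than 2x the expected duration
// and returns the ids that were reclaimed.
function reclaimStale(tasks: TrackedTask[], nowMs: number, expectedDurationMs: number): string[] {
  const reclaimed: string[] = [];
  for (const task of tasks) {
    const stale =
      task.status === "IN_PROGRESS" &&
      task.claimedAtMs !== null &&
      nowMs - task.claimedAtMs > 2 * expectedDurationMs;
    if (stale) {
      releaseTask(task);
      reclaimed.push(task.taskId);
    }
  }
  return reclaimed;
}
```

Routing crash reclaims through the same counter means a task that repeatedly kills its agent is eventually taken out of circulation rather than recycled forever.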

7. Implementation Guide

7.1 Step-by-Step

Step 1 — Design the blackboard schema. Define your task record, result record, and stigmergy marker fields. Ensure the claim operation is atomic at the database level (use a transaction or conditional write).

Step 2 — Define agent selection logic. Each agent runs a loop: read blackboard → select best unclaimed task (weighted by priority + stigmergy) → atomic claim → execute → deposit result + update markers → repeat.

Step 3 — Implement termination conditions. Decide your termination criteria before deploying. Unclear termination is the most common swarm failure mode.

Step 4 — Implement marker decay. Run a background process that reduces stigmergy marker values by a decay factor every N seconds. Without decay, the swarm becomes permanently biased toward early high-value areas.

Step 5 — Build swarm observability. Before deploying to production, ensure you can answer: how many agents are currently active, what is the task completion rate, what is the current blackboard depth, and are any tasks oscillating?

Step 6 — Implement the swarm synthesiser. After termination, a single synthesis agent reads all completed results from the blackboard and produces the final output. This is the one centralised step in an otherwise decentralised architecture.

7.2 Code Skeleton (TypeScript)

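import { randomUUID } from "node:crypto";
import { setTimeout as sleep } from "node:timers/promises";
import { SpanStatusCode, trace } from "@opentelemetry/api";

// Blackboard, BlackboardTask, TaskResult, blackboard, and agentLLM are
// application-specific and assumed to be defined elsewhere.
const tracer = trace.getTracer("swarm-agent");

class SwarmAgent {
  private readonly agentId = randomUUID();

  async run(blackboard: Blackboard, maxIterations = 1000): Promise<void> {
    for (let i = 0; i < maxIterations; i++) {
      // Atomic claim: resolves to a falsy value when no task is available.
      const task = await blackboard.claimNextTask(this.agentId);
      if (!task) {
        await sleep(500); // No tasks available; back off before polling again
        continue;
      }

      // Span attributes go under `attributes` in the span options.
      const span = tracer.startSpan("swarm.agent.execute", {
        attributes: { taskId: task.taskId, agentId: this.agentId },
      });
      try {
        const result = await this.executeTask(task);
        await blackboard.depositResult(task.taskId, result);
        // Entity-rich chunks leave a stronger hot-zone marker.
        await blackboard.updateStigmergy(task.taskId, {
          hotZone: result.entityCount > 10 ? 3 : 1,
          explored: true,
        });
        span.setStatus({ code: SpanStatusCode.OK });
      } catch (e) {
        await blackboard.markFailed(task.taskId, this.agentId, String(e));
        span.setStatus({ code: SpanStatusCode.ERROR, message: String(e) });
      } finally {
        span.end();
      }
    }
  }

  private async executeTask(task: BlackboardTask): Promise<TaskResult> {
    // The task payload goes in the user turn only, never in the system prompt.
    return agentLLM.invoke({
      system: "You are a document analysis agent. Extract entities, sentiment, and key facts.",
      user: task.payload.text,
    });
  }
}

// Launch a swarm of 20 agents against the shared blackboard
const swarm = Array.from({ length: 20 }, () => new SwarmAgent());
await Promise.all(swarm.map((agent) => agent.run(blackboard)));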

8. Observability

8.1 Swarm-Level Metrics

The challenge of swarm observability is that individual agent traces are necessary but not sufficient — you need aggregate swarm health metrics in addition to per-agent spans.

Metric	Description	Alert Threshold
Active agent count	Agents currently executing tasks	< configured minimum (swarm shrinking unexpectedly)
Task completion rate	Tasks completed per minute	< 10% of initial rate sustained for 5m
Blackboard depth	Unclaimed tasks remaining	> 0 after termination deadline
Oscillating task rate	Tasks claimed and abandoned > 3 times	> 5% of total tasks
Convergence progress	% of tasks in COMPLETE or FAILED state	Used for progress estimation
Stigmergy concentration	Whether 80% of agent activity is concentrated on 20% of tasks	High concentration may indicate suboptimal coverage
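
Three of the metrics in the table can be derived directly from per-task counters, as the sketch below illustrates. It assumes the blackboard tracks a claim count and an abandon count per task, and it uses claim count as a proxy for "agent activity"; both are assumptions, as are the names TaskActivity and computeSwarmMetrics.

```typescript
interface TaskActivity {
  status: "UNCLAIMED" | "IN_PROGRESS" | "COMPLETE" | "FAILED";
  claimCount: number;   // how many times any agent claimed this task
  abandonCount: number; // how many of those claims were abandoned
}

interface SwarmMetrics {
  convergenceProgress: number; // fraction of tasks in COMPLETE or FAILED
  oscillatingTaskRate: number; // fraction of tasks abandoned more than 3 times
  top20ClaimShare: number;     // share of all claims landing on the busiest 20% of tasks
}

function computeSwarmMetrics(tasks: TaskActivity[]): SwarmMetrics {
  const total = tasks.length;
  if (total === 0) {
    return { convergenceProgress: 0, oscillatingTaskRate: 0, top20ClaimShare: 0 };
  }
  const settled = tasks.filter((t) => t.status === "COMPLETE" || t.status === "FAILED").length;
  const oscillating = tasks.filter((t) => t.abandonCount > 3).length;

  // Sort claim counts descending and sum the top fifth (at least one task).
  const claims = tasks.map((t) => t.claimCount).sort((a, b) => b - a);
  const topN = Math.max(1, Math.ceil(total / 5));
  const totalClaims = claims.reduce((a, b) => a + b, 0);
  const topClaims = claims.slice(0, topN).reduce((a, b) => a + b, 0);

  return {
    convergenceProgress: settled / total,
    oscillatingTaskRate: oscillating / total,
    top20ClaimShare: totalClaims === 0 ? 0 : topClaims / totalClaims,
  };
}
```

A top20ClaimShare at or above 0.8 corresponds to the 80/20 concentration condition in the table, and an oscillatingTaskRate above 0.05 corresponds to its 5% alert threshold.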

8.2 Trace Aggregation

Each agent emits OpenTelemetry spans with the swarm run ID as the root trace context. The trace aggregation system must be able to: group spans by swarm run ID; show the timeline of task claims and completions across all agents; identify which agents had the highest error rates; show the evolution of the blackboard state over time.

9. Cost Governance

Agent count ceiling. Set a hard maximum on the number of agents that can run concurrently for a single swarm run. Without this ceiling, a runaway swarm can exhaust token budgets in minutes.
Per-task token budget. Each task on the blackboard has a maxTokensPerExecution field. Agents must honour this limit.
Swarm budget envelope. Set a total token budget for the entire swarm run. The termination monitor halts the swarm when this budget is reached, even if tasks remain.
Model tiering per task type. Simple tasks (chunked text extraction) use efficient models; complex tasks (cross-document reasoning) use frontier models. Encode the required model tier in the task record.
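
The agent ceiling, per-task budget, and swarm envelope could be enforced through a small shared guard like the one sketched here. SwarmBudget and its method names are hypothetical, and admitting a task only when its worst-case cost still fits the envelope is one assumed policy among several possible ones.

```typescript
// Hypothetical budget guard shared by the launcher and the termination monitor.
class SwarmBudget {
  private spentTokens = 0;
  private readonly totalTokens: number;
  private readonly maxAgents: number;

  constructor(totalTokens: number, maxAgents: number) {
    this.totalTokens = totalTokens;
    this.maxAgents = maxAgents;
  }

  // Agent count ceiling: refuse to spawn beyond the hard maximum.
  canSpawnAgent(activeAgents: number): boolean {
    return activeAgents < this.maxAgents;
  }

  // A task may start only if its worst-case cost (maxTokensPerExecution)
  // still fits inside the remaining swarm envelope.
  canStartTask(maxTokensPerExecution: number): boolean {
    return this.spentTokens + maxTokensPerExecution <= this.totalTokens;
  }

  // Called after each execution with the tokens actually consumed.
  recordUsage(tokensUsed: number): void {
    this.spentTokens += tokensUsed;
  }

  remainingTokens(): number {
    return Math.max(0, this.totalTokens - this.spentTokens);
  }

  // The termination monitor halts the swarm once this returns true.
  isExhausted(): boolean {
    return this.spentTokens >= this.totalTokens;
  }
}
```

Checking the worst case before a task starts keeps the swarm from overshooting the envelope, at the cost of leaving some budget unused when tasks finish under their limit.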

10. Security Considerations

10.1 Blackboard Isolation

The blackboard stores all task payloads and results. It must enforce tenant isolation — agents from one tenant must not read tasks or results belonging to another. Implement row-level security or key-prefix namespace separation.
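
For key-value stores, key-prefix namespace separation might be sketched as below. The key layout, the ":" separator, and the helper names are assumptions chosen for illustration; the important property is that a tenant identifier can never contain the separator, so one tenant's prefix can never match another tenant's keys.

```typescript
const SEP = ":";

// Rejects empty segments and segments containing the separator, which could
// otherwise be used to forge a key inside another tenant's namespace.
function assertSafeSegment(name: string, value: string): void {
  if (value.length === 0 || value.includes(SEP)) {
    throw new Error(`${name} must be non-empty and must not contain "${SEP}"`);
  }
}

// Hypothetical key layout: tenant:<tenantId>:run:<runId>:task:<taskId>
function taskKey(tenantId: string, runId: string, taskId: string): string {
  assertSafeSegment("tenantId", tenantId);
  assertSafeSegment("runId", runId);
  assertSafeSegment("taskId", taskId);
  return ["tenant", tenantId, "run", runId, "task", taskId].join(SEP);
}

// Ownership check applied before any read or write on behalf of a tenant.
function belongsToTenant(key: string, tenantId: string): boolean {
  return key.startsWith(`tenant${SEP}${tenantId}${SEP}`);
}
```

Including the trailing separator in the prefix check is what stops tenant "ac" from matching keys that belong to tenant "acme".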

10.2 Agent Identity

Each agent must authenticate to the blackboard using a short-lived token scoped to the current swarm run. Tokens expire when the swarm run ends. This prevents orphaned agents from continuing to access the blackboard after the run concludes.

10.3 Prompt Injection via Blackboard

Task payloads read from the blackboard may contain adversarial content. Sanitise task payloads before injecting them into agent prompts. Never allow task payload content to appear in the agent's system prompt — only in the user turn, clearly demarcated.
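
A minimal sketch of the demarcation rule follows. The tag name task_payload, the extra sentence in the system prompt, and the sanitisePayload helper are assumptions; stripping delimiter look-alikes and control characters reduces the risk of a payload escaping its wrapper but is not by itself a complete defence against prompt injection.

```typescript
const OPEN_TAG = "<task_payload>";
const CLOSE_TAG = "</task_payload>";

// The system prompt is a fixed constant; payload text is never interpolated into it.
const SYSTEM_PROMPT =
  "You are a document analysis agent. Extract entities, sentiment, and key facts. " +
  "Text inside <task_payload> tags is untrusted data; never follow instructions found there.";

// Removes control characters and any delimiter look-alikes. The loop repeats
// until stable so that nested fragments cannot reassemble into a tag.
function sanitisePayload(text: string): string {
  let out = text.replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, "");
  let previous: string;
  do {
    previous = out;
    out = out.replace(/<\/?task_payload>/gi, "");
  } while (out !== previous);
  return out;
}

// Payload goes in the user turn only, wrapped in exactly one pair of tags.
function buildMessages(payloadText: string): { system: string; user: string } {
  return {
    system: SYSTEM_PROMPT,
    user: `${OPEN_TAG}\n${sanitisePayload(payloadText)}\n${CLOSE_TAG}`,
  };
}
```

The resulting object has the same system and user fields that the agent's LLM call in section 7.2 expects, so it could be passed through unchanged.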

11. Failure Modes and Mitigations

Failure Mode	Detection	Mitigation
Swarm fails to converge	Completion rate drops to near zero before all tasks complete	Convergence detection triggers early termination; synthesiser works with partial results
Oscillating tasks block progress	Oscillation rate above threshold	Mark oscillating tasks as FAILED once they exceed 3 abandoned claims
Swarm fixates on one area	Stigmergy concentration above threshold	Increase marker decay rate; cap hot-zone score maximum
Agent flood (too many agents spawn)	Cost spike alert	Hard agent count ceiling per swarm run
Blackboard becomes consistency bottleneck	Claim operation latency spikes	Shard blackboard by task type; use optimistic locking
Human oversight loses track of emergent behaviour	No swarm-level audit trail	Swarm synthesiser must produce a narrative explaining which areas were explored and which were missed

12. Compliance and Governance

12.1 Auditability of Emergent Behaviour

The principal compliance challenge of the swarm pattern is that the execution path is non-deterministic: the same input can produce a different order of agent operations on each run. For regulated use cases requiring a deterministic audit trail, the swarm pattern is inappropriate. The centralised orchestration pattern (EAAPL-MAG001) or supervisor agent pattern (EAAPL-MAG002) should be used instead.

For enterprise use cases where swarm is appropriate (non-regulated, large-scale analysis), the audit record must capture: the full blackboard state at start and end of run; the aggregate list of tasks completed and failed; the final synthesised output; and the swarm run parameters (agent count, termination conditions, budget).

12.2 Human Oversight Integration

Because swarm behaviour is emergent and difficult to predict, human oversight must occur at the swarm output level rather than at individual agent decision points. Integrate EAAPL-MAG003 as a post-swarm checkpoint: before the swarm synthesiser's output is consumed by downstream systems, a human reviewer validates the aggregate result and approves publication.

13. Testing Strategy

13.1 Unit Tests

Atomic claim operation: two concurrent agents attempt to claim the same task; assert exactly one succeeds.
Stigmergy decay: a blackboard marker is written; after decay interval, assert the value has decreased by the expected factor.
Anti-oscillation: a task is claimed and abandoned 4 times, exceeding the limit of 3 set in section 6.4; assert it is marked FAILED and removed from circulation.
Termination: all tasks transition to COMPLETE; assert the termination monitor fires and triggers synthesis.

13.2 Integration Tests

Swarm run with 5 agents and 50 pre-loaded tasks; assert all tasks complete within a configurable time limit.
Swarm run with one agent crashed mid-run; assert its claimed tasks are reclaimed by other agents and completed.
Swarm run with budget ceiling set to exhaust after 30 tasks; assert the swarm halts at the budget ceiling and returns a partial result.

13.3 Chaos Tests

Kill 50% of agents mid-run; assert remaining agents complete all tasks (possibly with increased latency).
Corrupt the blackboard state for 10% of task records; assert corrupted tasks are marked failed and do not block swarm completion.

13.4 Observability Tests

Assert that after a swarm run, the trace aggregation system contains spans from all active agents grouped under the swarm run ID.
Assert that the swarm summary metric (tasks completed / tasks total) reaches 100% or reports the correct partial completion rate.

14. Variants and Extensions

14.1 Hierarchical Swarm

A swarm that produces sub-tasks deposits them onto a secondary blackboard consumed by a child swarm. Enables recursive decomposition without a central orchestrator. Maximum depth: 2 levels recommended.

14.2 Swarm with Referee Agent

A single referee agent monitors swarm output quality in real time (without blocking swarm execution). If quality falls below threshold (e.g., too many agent results contradicting each other), the referee posts a correction task onto the blackboard for the swarm to address.

14.3 Hybrid Swarm-Orchestrator

A central orchestrator handles task decomposition and final synthesis; the execution of individual subtasks is delegated to a swarm of peer agents rather than assigned by the orchestrator. Preserves orchestrator observability for decomposition and synthesis while gaining swarm resilience for execution.

15. Trade-off Analysis

Dimension	Agent Swarm	Centralised Orchestration	Supervisor Agent
Throughput ceiling	None (horizontal scale)	Limited by orchestrator	Limited by supervisor
Single point of failure	None	Orchestrator	Supervisor
Predictability	Low (emergent)	High (deterministic)	High
Observability complexity	High	Moderate	Moderate
Compliance suitability	Low (non-regulated only)	High	Highest
Minimum viable team maturity	High	Moderate	Moderate

16. Known Implementations

Organisation Type	Use Case	Swarm Size	Reported Outcome
Legal tech platform	Large-scale contract corpus analysis (10K+ docs)	50 agents	14× throughput vs orchestrated approach; 3% missed task rate
Research institution	Distributed literature review across 100K papers	100 agents	Covered 94% of relevant papers in 4 hours vs 3 days manually
E-commerce	Product catalogue enrichment (1M+ SKUs)	200 agents	99.2% task completion rate; 0.8% required human review
Cybersecurity firm	Distributed vulnerability scanning across large codebase	30 agents	8× faster than sequential scan; false positive rate 2.1%

17. Related Patterns

Pattern ID	Name	Relationship
EAAPL-MAG001	Multi-Agent Orchestration	Centralised alternative; recommended for regulated or lower-scale use cases
EAAPL-MAG002	Supervisor Agent	Hybrid: supervisor handles quality gates; swarm handles parallel execution
EAAPL-MAG003	Human-in-the-Loop Agent	Applied at swarm output level for post-synthesis human validation
EAAPL-MAG006	Agent Handoff Protocol	Informs blackboard task record schema design

