EAAPL-INT006 — AI Webhook Pattern

Tags: event-driven, audit-logging, asynchronous, low-complexity
Status: Proven | Version: 1.0 | Domain: Integration

1. Executive Summary

The AI Webhook Pattern delivers AI-generated notifications, inference results, and asynchronous AI processing outcomes to external systems via HTTP callbacks (webhooks). As AI workloads increasingly run asynchronously — long-running batch jobs, background inference pipelines, AI agent task completions — consuming systems need a reliable, secure, and idempotent mechanism to receive AI results without polling.

This pattern extends classical webhook design with AI-specific payload fields — confidence scores, model version, processing time, and result schemas — enabling receivers to make informed decisions about whether to act on an AI result, which version of a model produced it, and how long inference took. Security hardening addresses the unique risks of AI webhook delivery: HMAC signature verification, replay attack prevention, idempotent receiver design, and dead letter queue handling for undeliverable AI results.

For CIOs and CTOs, the pattern is the glue between AI processing backends and the consuming business applications that act on AI outputs. It enables a clean architectural boundary: AI processing systems produce results when ready and deliver them reliably; consuming systems do not need to know when or how AI processing occurred. The result is loosely coupled, independently deployable integration that supports AI system upgrades without consumer redeployment.

2. Problem Statement

Business Problem

AI processing is inherently asynchronous. A document intelligence pipeline may take 30 seconds to process a contract. A multi-step AI agent may take 5 minutes to complete a research task. A nightly batch enrichment job completes at 3 AM. In all cases, a consuming business application needs to receive the result when it is ready — not by polling continuously, and not by waiting synchronously.

Technical Problem

Without a structured webhook pattern, teams implement ad-hoc polling loops, shared database tables as notification queues, or synchronous long-polling connections. These approaches are unreliable under load, difficult to monitor, and create tight coupling between AI systems and consuming applications. When the AI system is upgraded, every polling consumer must be updated.

Symptoms

Consuming applications poll AI status APIs every 5 seconds, creating unnecessary load on AI systems.
AI results are written to a shared database table that consumers query continuously — creating a polling bottleneck that also exposes the AI system's internal data model.
Webhook delivery failures are discovered by consumers noticing missing results — hours after the delivery failure occurred.
There is no retry logic — a single delivery failure means the result is permanently lost.

Cost of Inaction

Reliability: Without idempotent delivery and retry logic, AI results are silently lost on any transient receiver failure. In regulated workflows, lost results mean compliance gaps.
Coupling: Polling consumers are tightly coupled to AI system availability; AI system maintenance windows affect all consuming applications simultaneously.
Cost: Continuous polling of AI status APIs generates 10–100× more API calls than event-driven delivery, increasing infrastructure costs and rate limit exposure.
Developer experience: Building and maintaining custom polling logic in every consuming application is duplicated effort that disappears with a well-designed webhook pattern.

3. Context

When to Apply

AI processing is asynchronous and the consuming system needs to receive results when ready.
The consuming system is a third-party system or an internal system operated by a different team.
Result delivery must be reliable, with retry and dead letter queue handling for failures.
The consuming system does not have access to the AI processing system's internal message bus.

When NOT to Apply

Consuming system needs real-time AI results (< 1s) — use synchronous request/response or streaming.
Consuming system is an internal service in the same deployment environment — use direct message bus consumption (EAAPL-INT001) instead.
The consuming system cannot receive inbound HTTP connections (firewall restrictions, no public endpoint) — use polling API or message queue instead.
The number of events is very small and ad-hoc — synchronous delivery or email notification may be simpler.

Prerequisites

Consuming system can expose an HTTPS webhook endpoint.
Consuming system can implement HMAC signature verification.
AI processing system has a job/event identifier system for idempotency key generation.
A dead letter queue and alerting infrastructure for failed delivery tracking.

Industry Applicability

Industry	Applicability	Use Case	Receiver Type
Financial Services	High	Loan application AI assessment result → origination system; fraud alert → operations platform	Internal systems, partner banks
SaaS Platforms	Very High	AI feature result delivery to customer systems (document intelligence, content generation)	External customer systems
Healthcare	High	Clinical AI result → EMR system; prior authorisation AI outcome → payer system	Partner healthcare systems
Government	Medium	AI document classification result → case management system	Internal government systems
eCommerce	High	AI recommendation update → personalisation platform; content moderation result → content management	Internal and partner systems
Legal Tech	High	Contract AI analysis result → matter management system	Internal and customer systems

4. Architecture Overview

The AI Webhook Pattern is a push-based asynchronous notification system with six design concerns: payload design, security (HMAC + replay prevention), delivery management (retry + exponential backoff), idempotent receiver design, dead letter queue handling, and subscription management.

Webhook Payload Design for AI Results. The webhook payload carries two categories of fields. Standard webhook fields: event_id (globally unique UUID v4; the idempotency key), event_type (namespaced event type: e.g., ai.document.classified.v1), occurred_at (ISO-8601 UTC timestamp of the AI event; not the delivery timestamp), webhook_id (unique per delivery attempt; different from event_id — enables distinguishing original from retry deliveries). AI-specific fields: ai_result (the structured AI output; schema versioned), ai_result_schema_version (enables receivers to handle schema evolution), ai_confidence_score (0.0–1.0; receivers may filter low-confidence results), ai_model_id (the specific model that produced the result), ai_model_version (model version tag), ai_processing_time_ms (how long inference took; enables SLA tracking), ai_cost_usd (optional cost attribution field), ai_fallback_used (boolean; indicates whether a fallback model was used instead of the primary model). These fields enable receivers to make intelligent decisions about AI results without needing to query the AI system for metadata.
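As an illustration of the field list above, the following minimal sketch assembles one delivery payload. The function names (build_payload, serialize) and the default schema version "1.0" are assumptions made for this example; only the field names come from the pattern.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional


def build_payload(ai_result: dict, *, event_id: str, event_type: str,
                  occurred_at: datetime, confidence: float, model_id: str,
                  model_version: str, processing_time_ms: int,
                  schema_version: str = "1.0", fallback_used: bool = False,
                  cost_usd: Optional[float] = None) -> dict:
    """Assemble the standard and AI-specific fields for one delivery attempt."""
    if occurred_at.tzinfo is None:
        raise ValueError("occurred_at must be timezone-aware")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("ai_confidence_score must be between 0.0 and 1.0")
    payload = {
        # Standard webhook fields
        "event_id": event_id,              # idempotency key, stable across retries
        "event_type": event_type,          # e.g. ai.document.classified.v1
        "occurred_at": occurred_at.astimezone(timezone.utc).isoformat(),
        "webhook_id": str(uuid.uuid4()),   # fresh value for every delivery attempt
        # AI-specific fields
        "ai_result": ai_result,
        "ai_result_schema_version": schema_version,
        "ai_confidence_score": confidence,
        "ai_model_id": model_id,
        "ai_model_version": model_version,
        "ai_processing_time_ms": processing_time_ms,
        "ai_fallback_used": fallback_used,
    }
    if cost_usd is not None:               # optional cost attribution field
        payload["ai_cost_usd"] = cost_usd
    return payload


def serialize(payload: dict) -> bytes:
    """Serialise once so that the same bytes can be signed and sent."""
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")
```

Calling build_payload twice for the same event keeps event_id constant while webhook_id changes, which is what lets a receiver tell an original delivery from a retry.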

HMAC Security. Every webhook delivery is signed with HMAC-SHA256. The sender computes the signature over the raw request body using a shared secret (one secret per webhook subscription). The signature is delivered in the X-AI-Signature-SHA256 header as sha256=<hex digest>. The receiver verifies the signature before processing: compute the expected signature over the received body using the stored shared secret; compare to the received signature using a constant-time comparison function to prevent timing attacks. Any webhook with an invalid signature is rejected with HTTP 401 and logged as a security event.
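The signing and verification steps described above can be sketched with the Python standard library. This is a minimal illustration, not a reference implementation; the function names sign and verify are hypothetical, while the header name and the sha256=<hex digest> format come from the pattern.

```python
import hashlib
import hmac

SIGNATURE_HEADER = "X-AI-Signature-SHA256"


def sign(body: bytes, secret: bytes) -> str:
    """Sender side: HMAC-SHA256 over the raw request body."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify(body: bytes, secret: bytes, received_signature: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = sign(body, secret)
    # compare_digest avoids leaking the mismatch position through timing
    return hmac.compare_digest(expected.encode("utf-8"),
                               received_signature.encode("utf-8"))
```

The receiver should verify against the raw bytes it received, before any JSON parsing or re-serialisation, since a re-encoded body may differ byte-for-byte from what the sender signed.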

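Replay Attack Prevention. The occurred_at field in the payload serves as a time bound for replay attack prevention: receivers reject webhooks where the occurred_at timestamp is more than 5 minutes in the past. The event_id serves as the idempotency key for within-window deduplication. Together, these two controls prevent both delayed replay attacks (timestamp check) and within-window duplicate delivery (event_id check). Because occurred_at records the event time rather than the delivery time, legitimate retries sent later in the retry schedule would also fail a strict 5-minute check; implementers need to reconcile the replay window with the retry schedule, for example by evaluating freshness against a signed per-attempt timestamp.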
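A minimal sketch of the receiver-side timestamp check follows, assuming occurred_at arrives as an ISO-8601 string with a UTC offset or a trailing "Z". The helper names (parse_occurred_at, is_fresh) are hypothetical; the 5-minute window is the value stated in the pattern.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

REPLAY_WINDOW = timedelta(minutes=5)


def parse_occurred_at(value: str) -> datetime:
    """Parse an ISO-8601 timestamp and insist on an explicit offset."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError("occurred_at must carry a UTC offset")
    return parsed


def is_fresh(occurred_at: str, now: Optional[datetime] = None,
             window: timedelta = REPLAY_WINDOW) -> bool:
    """Return False when the event is older than the replay window."""
    now = now or datetime.now(timezone.utc)
    # Only staleness is checked here; future-dated timestamps are not rejected
    return now - parse_occurred_at(occurred_at) <= window
```

A receiver would typically run this check only after the HMAC signature has been verified, so that the timestamp being tested is known to be the one the sender signed.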

Retry with Exponential Backoff. Delivery failure (HTTP 5xx, connection timeout, or HTTP 4xx other than 400, 401, and 410) triggers the retry schedule: immediate retry, then 1 minute, 5 minutes, 30 minutes, 2 hours, 8 hours, 24 hours. Jitter (±10% of interval) prevents retry storms when multiple webhooks fail simultaneously. After all 7 retries fail, the delivery is moved to the dead letter queue. HTTP 400 (malformed request or failed replay check), HTTP 401 (invalid HMAC signature), and HTTP 410 (subscription cancelled) terminate retries immediately because they are permanent failures, not transient ones.
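The retry schedule and failure classification can be sketched as below. The names (RETRY_DELAYS_S, is_retryable, next_delay) are hypothetical. Treating HTTP 401 as permanent is an assumption taken from the error flow table, which specifies no retry for an invalid HMAC signature.

```python
import random
from typing import Optional

# Immediate, 1 min, 5 min, 30 min, 2 h, 8 h, 24 h (seconds)
RETRY_DELAYS_S = [0, 60, 300, 1800, 7200, 28800, 86400]
JITTER = 0.10  # +/-10% of the interval
# 401 is included on the assumption that signature failures are never retried
PERMANENT_STATUSES = {400, 401, 410}


def is_retryable(status: Optional[int]) -> bool:
    """status is None for a timeout or connection error."""
    if status is None:
        return True
    if status in PERMANENT_STATUSES:
        return False
    return 400 <= status < 600


def next_delay(retry_number: int, rng: random.Random = random) -> Optional[float]:
    """Delay in seconds before retry N (1-based); None once retries are exhausted."""
    if retry_number < 1 or retry_number > len(RETRY_DELAYS_S):
        return None  # caller moves the delivery to the dead letter queue
    base = RETRY_DELAYS_S[retry_number - 1]
    return base * (1 + rng.uniform(-JITTER, JITTER))
```

Passing a seeded random.Random instance as rng makes the jitter reproducible in tests.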

Idempotent Receiver Design. Receivers must handle duplicate delivery. Even when the sender intends to deliver each event once, timeouts and lost acknowledgements mean delivery is at-least-once in practice: a receiver may process an event successfully while the sender, never seeing the HTTP 200, retries it. The receiver's idempotency implementation: check the event_id against the processed events store before processing; if already processed, return HTTP 200 (to terminate the sender's retry) without reprocessing; if not processed, process and insert event_id into the processed events store atomically. The processed events store must retain event IDs for at least the duration of the retry window (24 hours minimum).
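One way to approximate the check-then-process sequence is a claim-and-release store, sketched here in memory. ProcessedEventStore and handle_event are hypothetical names, and a production receiver would more likely use Redis, DynamoDB, or a database transaction. Releasing the claim on failure is a design assumption that lets the sender's retry reprocess the event.

```python
import threading
import time
from typing import Callable, Dict


class ProcessedEventStore:
    """In-memory stand-in for a TTL-based idempotency store."""

    def __init__(self, ttl_seconds: float = 25 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires: Dict[str, float] = {}
        self._lock = threading.Lock()

    def claim(self, event_id: str) -> bool:
        """Atomically record event_id; True only for the first caller."""
        with self._lock:
            now = self._clock()
            # Drop expired entries so the store does not grow without bound
            self._expires = {k: v for k, v in self._expires.items() if v > now}
            if event_id in self._expires:
                return False
            self._expires[event_id] = now + self._ttl
            return True

    def release(self, event_id: str) -> None:
        with self._lock:
            self._expires.pop(event_id, None)


def handle_event(store: ProcessedEventStore, payload: dict,
                 process: Callable[[dict], None]) -> int:
    """Return the HTTP status the receiver should send back."""
    event_id = payload["event_id"]
    if not store.claim(event_id):
        return 200  # duplicate: acknowledge without reprocessing
    try:
        process(payload)
    except Exception:
        store.release(event_id)  # let the sender's retry try again
        return 500
    return 200
```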

Dead Letter Queue for Failed Webhooks. Once all 7 retries are exhausted, the undeliverable webhook is written to the DLQ with the event payload, all delivery attempt records (timestamp, HTTP status, response body), and subscription metadata. DLQ alerting: any DLQ entry triggers an alert to the integration team, and a rising DLQ growth rate triggers escalation. The DLQ provides a manual replay capability: operators can re-trigger delivery after the receiver endpoint has been fixed.
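The DLQ record and the manual replay step might look like the following sketch. All class and method names are hypothetical, and the in-memory list stands in for whichever durable store is used; the deliver callback is assumed to return True on a successful redelivery.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DeliveryAttempt:
    timestamp: str
    http_status: Optional[int]  # None when the attempt timed out
    response_body: str


@dataclass
class DlqEntry:
    payload: dict
    subscription: dict
    attempts: List[DeliveryAttempt] = field(default_factory=list)
    resolved: bool = False


class DeadLetterQueue:
    def __init__(self, alert: Callable[[DlqEntry], None]):
        self._entries: List[DlqEntry] = []
        self._alert = alert

    def add(self, entry: DlqEntry) -> None:
        self._entries.append(entry)
        self._alert(entry)  # every DLQ entry raises an alert

    def replay(self, deliver: Callable[[dict, dict], bool]) -> int:
        """Re-trigger delivery for unresolved entries; returns how many succeeded."""
        replayed = 0
        for entry in self._entries:
            if not entry.resolved and deliver(entry.payload, entry.subscription):
                entry.resolved = True
                replayed += 1
        return replayed
```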

Subscription Management. Consumers register webhook subscriptions via a management API: endpoint URL, event types to receive, authentication method (HMAC shared secret; receiver may also provide a Bearer token for the sender to include in the Authorization header). On registration, the sender delivers a test event to verify endpoint reachability and HMAC configuration. Subscriptions can be paused, deleted, and replayed. Event type filtering: consumers receive only the event types they have subscribed to — not a firehose of all AI events.
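Event type filtering can be sketched as a simple lookup over registered subscriptions. The Subscription class and matching_subscriptions function are hypothetical names for this example. The HTTPS check at registration reflects the requirement elsewhere in this pattern that delivery to non-TLS endpoints is rejected.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List
from urllib.parse import urlparse


@dataclass(frozen=True)
class Subscription:
    subscription_id: str
    endpoint_url: str
    event_types: FrozenSet[str]
    paused: bool = False

    def __post_init__(self) -> None:
        # Refuse plain-HTTP endpoints at registration time
        if urlparse(self.endpoint_url).scheme != "https":
            raise ValueError("webhook endpoints must use HTTPS")


def matching_subscriptions(subscriptions: Iterable[Subscription],
                           event_type: str) -> List[Subscription]:
    """Active subscriptions that asked for this event type."""
    return [s for s in subscriptions
            if not s.paused and event_type in s.event_types]
```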

5. Architecture Diagram

flowchart TD
    subgraph Source["AI Source"]
        A[AI Processing System]
        B[Webhook Dispatcher + HMAC Sign]
        C[Subscription Manager]
    end
    subgraph Delivery["Delivery Layer"]
        D{Delivery Outcome}
        E[Retry Scheduler]
        F[(Dead Letter Queue)]
    end
    subgraph Receiver["Receiver Side"]
        G[Idempotency Check]
        H[Process AI Result]
        I[(Audit Logger)]
    end
    A --> B
    C -->|subscription config| B
    B --> D
    D -->|success| G
    D -->|transient fail| E
    D -->|permanent fail| F
    E -->|retry| B
    G -->|new event| H
    G -->|duplicate| A
    B --> I
    H --> I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#fee2e2,stroke:#ef4444
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#fef9c3,stroke:#eab308

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Webhook Dispatcher	Service	Build signed payload, deliver to receiver endpoint, handle HTTP responses, trigger retry	Custom service, Svix, Hook0, Hookdeck	Critical
Retry Scheduler	Service	Track pending retries, apply exponential backoff with jitter, trigger redelivery	Redis sorted set, AWS SQS delay queues, Azure Service Bus scheduled messages	Critical
Subscription Manager	Service	CRUD for webhook subscriptions; event type filtering; shared secret management	Custom service + PostgreSQL, Svix subscription management	High
HMAC Signer	Library	Compute HMAC-SHA256 signature over request body with subscription secret	Standard library (Python hmac, Node.js crypto, Java javax.crypto)	Critical
Idempotency Store (Receiver)	Storage	Store processed event IDs for deduplication; TTL 25 hours minimum	Redis with TTL, DynamoDB with TTL attribute, PostgreSQL with expiry	Critical
Dead Letter Queue	Infrastructure	Store undeliverable webhooks with full delivery history for manual review	AWS SQS DLQ, Azure Service Bus DLQ, custom DB table	High
DLQ Review Interface	UI/Service	View, investigate, and manually replay DLQ entries	Custom admin UI, Svix DLQ interface	High
Audit Logger	Service	Log every delivery attempt: event_id, subscription, attempt number, status, latency	Structured logging → SIEM, observability platform	Critical
Test Event Service	Service	Deliver test events on subscription registration to verify endpoint reachability and HMAC configuration	Built into webhook dispatcher	Medium

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	AI Processing System	AI inference completes; produces result with event_id, model metadata, confidence score	Result event submitted to Webhook Dispatcher
2	Subscription Manager	Dispatcher looks up active subscriptions matching event_type	Subscriber list (endpoint URL, HMAC secret, event filter config)
3	Webhook Dispatcher	Constructs payload with all AI fields + standard webhook fields; signs with HMAC-SHA256 using subscription secret	Signed HTTP POST ready for delivery
4	Webhook Dispatcher	Delivers HTTP POST to receiver endpoint; sets 30-second connection timeout	HTTP response from receiver
5	Receiver	Verifies HMAC signature; checks occurred_at timestamp (replay prevention); checks event_id idempotency store	Valid + new: process; Valid + duplicate: 200 without reprocessing; Invalid signature: 401
6	Receiver	Processes AI result (updates database, triggers workflow, sends notification)	Business action taken
7	Receiver	Returns HTTP 200–299	Delivery confirmed
8	Audit Logger	Records: event_id, subscription_id, attempt=1, status=200, latency_ms	Audit record persisted

Error Flow

Step	Error Condition	Detection	Recovery
4	Receiver returns 5xx	HTTP 5xx status	Add to retry scheduler with immediate first retry
4	Connection timeout	No response within 30 seconds	Add to retry scheduler; log timeout event
4	Receiver returns 4xx (except 400, 401, 410)	HTTP 4xx status	Add to retry scheduler; investigate receiver configuration
5	HMAC signature invalid	Receiver returns 401	Log security event; no retry; alert integration team — possible key mismatch or payload tampering
5	Timestamp replay check fails	Receiver returns 400	Log potential replay attack; no retry; alert if pattern repeats
Retry	All 7 retries exhausted	Retry count threshold reached	Move to DLQ; alert integration team; await manual resolution + replay
Ongoing	DLQ growth rate increases	DLQ message count metric	Escalation alert; investigate receiver endpoint health

8. Security Considerations

Authentication and Authorisation

HMAC-SHA256 signature verification is mandatory for every delivery; no exceptions for "trusted" internal endpoints.
Sender may optionally include a Bearer token (subscription-specific, not a master key) in the Authorization header for receivers that require additional authentication.
Subscription management API protected by API key or OAuth 2.0; only authorised teams can register webhook subscriptions.
One HMAC shared secret per subscription — compromise of one receiver's secret does not affect other subscriptions.

Secrets Management

HMAC shared secrets stored in centralised secrets manager; one secret per subscription ID.
Shared secrets minimum 32 bytes of cryptographically random data (256-bit).
Secret rotation: provide a 24-hour grace period where both old and new secrets are valid simultaneously, enabling zero-downtime rotation.
Secrets never logged; log only the subscription ID and event_id, not the secret or the full payload with sensitive fields.
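During the rotation grace period a receiver has to accept signatures made with either secret. The sketch below shows one way to do that, assuming the receiver holds a short list of currently valid secrets; new_secret and verify_any are hypothetical names.

```python
import hashlib
import hmac
import secrets
from typing import Iterable


def new_secret() -> bytes:
    """32 bytes (256 bits) of cryptographically random data."""
    return secrets.token_bytes(32)


def verify_any(body: bytes, received_signature: str,
               active_secrets: Iterable[bytes]) -> bool:
    """Accept the signature if it matches any currently valid secret."""
    matched = False
    for secret in active_secrets:
        expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
        # Check every candidate rather than returning on the first match
        matched |= hmac.compare_digest(expected.encode("utf-8"),
                                       received_signature.encode("utf-8"))
    return matched
```

Once the grace period ends, the old secret is removed from the list and signatures made with it stop verifying.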

Replay Attack Prevention

occurred_at field in every payload: receivers reject events where occurred_at is > 5 minutes in the past.
event_id uniqueness: each event is delivered with a globally unique UUID v4; idempotency store at receiver deduplicates within the retry window.
webhook_id (per attempt): distinguishes delivery attempts from the same event_id; receivers process on event_id, not webhook_id.

Data Classification

Webhook payloads containing PII must only be delivered to endpoints with appropriate data handling agreements.
AI result fields may contain inferred sensitive attributes; payload schema version enables receivers to handle sensitive field sets differently.
TLS 1.3 mandatory for all webhook delivery endpoints — reject delivery to HTTP (non-TLS) endpoints.

Encryption

All delivery over TLS 1.3; receiver certificate validated against CA bundle.
Sender should reject self-signed certificates from receivers in production (configurable for development environments).
DLQ storage encrypted at rest; DLQ entries may contain PII-adjacent AI inference results.

Auditability

Complete delivery audit: every attempt (event_id, subscription_id, attempt number, HTTP status, response body excerpt, latency, timestamp) logged immutably.
HMAC validation failures at receiver logged as security events and surfaced to SIEM.
Subscription creation and deletion events logged with operator identity and timestamp.

OWASP LLM Top 10 Mitigations

OWASP LLM Risk	Relevance	Mitigation in This Pattern
LLM01 — Prompt Injection	Low	Webhook delivers AI results, not prompts; no LLM invoked in delivery path
LLM02 — Insecure Output Handling	High	AI result in webhook payload is schema-versioned structured JSON; receivers should validate against schema before acting on content
LLM03 — Training Data Poisoning	Low	Delivery pattern only; no training pipeline
LLM04 — Model Denial of Service	Low	Webhook delivery is downstream of AI inference; DDoS of receiver does not affect AI processing system
LLM05 — Supply Chain Vulnerabilities	Low	Webhook dispatcher is a delivery service; AI model supply chain is upstream
LLM06 — Sensitive Information Disclosure	High	TLS mandatory; HMAC prevents payload tampering; payload schema versioning enables receivers to strip sensitive fields for logging
LLM07 — Insecure Plugin Design	Low	Webhook delivery pattern has no plugin or function-calling surface
LLM08 — Excessive Agency	Medium	Webhook payload delivers AI result; any automated action by the receiver on the AI result is the receiver's responsibility — receiver should implement confidence threshold gates before automated action
LLM09 — Overreliance	Medium	`ai_confidence_score` and `ai_fallback_used` fields enable receivers to apply quality thresholds; document recommended thresholds in subscription onboarding documentation
LLM10 — Model Theft	Low	Webhook delivers outputs, not model weights or parameters

9. Governance Considerations

Responsible AI

Webhook subscription registration requires the subscribing team to acknowledge the AI output usage policy: confidence threshold requirements for automated action, human review requirements for high-stakes outcomes, data retention policy for received AI results.
ai_confidence_score and ai_fallback_used fields in every payload are governance controls — receivers cannot claim they were unaware the result came from a fallback model or had low confidence.

Model Risk Management

ai_model_id and ai_model_version in every payload enable receivers to track which model version produced each result.
When an AI model is upgraded, existing webhook subscribers receive the new ai_model_version in payloads automatically; subscribers can filter by model version during a validation period.

Human Approval Gates

Webhook subscription configuration includes a requires_human_review flag for high-stakes event types; receivers that subscribe to these event types are contractually required to implement human review before automated action.
The webhook payload schema for high-stakes event types includes a human_review_required: true field as a reminder at the application level.
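A receiver-side gate built on these payload fields could be sketched as follows. The function name decide_action, the return values, and the 0.8 default threshold are assumptions for illustration only; appropriate thresholds depend on the use case and the usage policy.

```python
def decide_action(payload: dict, min_confidence: float = 0.8) -> str:
    """Return "auto" or "human_review" for a received AI result."""
    # High-stakes event types always go to a person first
    if payload.get("human_review_required", False):
        return "human_review"
    # A fallback model produced the result: do not act automatically
    if payload.get("ai_fallback_used", False):
        return "human_review"
    score = payload.get("ai_confidence_score")
    # A missing or low score is treated as not good enough to automate
    if score is None or score < min_confidence:
        return "human_review"
    return "auto"
```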

Policy and Traceability

Every delivered webhook creates an immutable audit record linking the AI event to the subscriber that received it — enabling traceability of AI result propagation across systems.
DLQ events are traceability gaps — a DLQ event means an AI result reached the delivery system but not the consuming application. DLQ resolution is required to close the traceability gap.

Governance Artefacts

Artefact	Owner	Update Frequency	Storage Location
Webhook Subscription Registry	Platform Engineering	Per subscription change	API gateway / subscription manager DB
AI Webhook Usage Policy	Chief AI Officer	Annually	Policy management system
Delivery Audit Log	Platform Engineering	Continuous	Immutable audit log store (7-year retention for regulated use cases)
DLQ Review Log	AI Governance	Per DLQ event	Governance dashboard
HMAC Secret Rotation Log	Security Engineering	Per rotation	Secrets manager audit log
High-Stakes Event Type Register	AI Governance	Per risk assessment	Risk register

10. Operational Considerations

Monitoring and SLOs

SLO	Target	Measurement	Alert Threshold
Successful delivery rate (first attempt)	> 95%	First-attempt successes / total deliveries	< 90% in any 15-min window
Overall delivery success rate (with retries)	> 99.9%	Total successes / total events (excluding DLQ)	< 99.5%
DLQ rate	< 0.1%	DLQ events / total events	Any DLQ entries — alert
Delivery latency (p99)	< 5s from event to receiver HTTP 200	Time from AI result ready to first delivery success	> 15s sustained
Retry queue depth	< 1000 events	Retry queue message count	> 5000 — investigate receiver health
HMAC validation failures at receiver	0	Receiver 401 responses logged in audit	Any occurrence — security event

Logging

Dispatcher: event_id, subscription_id, attempt number, endpoint URL, HTTP status, response body (first 500 chars), latency_ms, timestamp.
Retry scheduler: event_id, subscription_id, next_attempt_time, attempt number, backoff duration.
DLQ: full event payload, all attempt records, subscription metadata, DLQ timestamp.
Security events: HMAC validation failures, timestamp replay rejections, unexpected response patterns.

Incident Response

Receiver endpoint down: retry schedule activates; DLQ rate alert fires if retries exhausted; integration team contacts receiver team; manual replay from DLQ after endpoint recovery.
HMAC validation failures: security team alerted; investigate: key mismatch (normal during rotation grace period), payload tampering (security incident), wrong endpoint receiving delivery.
DLQ accumulation: root cause investigation required before replay — systematic receiver error, schema mismatch, or endpoint misconfiguration must be resolved first.
Retry storm: large volume of failed deliveries creates retry queue depth spike; backoff jitter limits peak retry concurrency; dispatcher rate limiting prevents receiver overload during recovery.

Disaster Recovery

Scenario	RTO	RPO	Recovery Procedure
Webhook dispatcher failure	5 minutes	0 (events queued durably before dispatch)	Kubernetes restart; pending events re-queued from durable store
Retry scheduler failure	10 minutes	0 (retry metadata in durable queue)	Restore; pending retries rehydrated from queue
Subscription manager DB failure	15 minutes	< 1 minute (replicated DB)	Failover to replica; dispatcher caches subscription data for short outages
DLQ storage failure	30 minutes	Up to recovery time	DLQ entries rebuilt from audit log; replay after storage recovery

Capacity Planning

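Dispatcher throughput: (events per second) × (average subscribers per event type) = peak delivery rate in deliveries per second; multiplying that rate by the average delivery round-trip time gives the number of concurrent in-flight deliveries to provision for.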
Retry queue storage: (events per hour × expected failure rate × max retry window in hours) = maximum queue depth.
Idempotency store (receiver-side): (events per day) × (event_id record size of ~100 bytes) × 1.1 (TTL buffer) = daily storage requirement; TTL 25 hours auto-clears.
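The three sizing formulas above can be written out as small helper functions. The function names are hypothetical. The in_flight_deliveries helper is an addition that applies Little's law (rate × latency) to turn a delivery rate into an estimate of concurrent requests.

```python
def deliveries_per_second(events_per_second: float,
                          avg_subscribers: float) -> float:
    """Each event fans out to every subscriber of its event type."""
    return events_per_second * avg_subscribers


def in_flight_deliveries(rate_per_second: float,
                         avg_latency_seconds: float) -> float:
    """Little's law: concurrent requests = arrival rate x time in system."""
    return rate_per_second * avg_latency_seconds


def max_retry_queue_depth(events_per_hour: float, failure_rate: float,
                          retry_window_hours: float) -> float:
    """Upper bound on events waiting in the retry queue."""
    return events_per_hour * failure_rate * retry_window_hours


def idempotency_store_bytes(events_per_day: float, record_bytes: int = 100,
                            ttl_buffer: float = 1.1) -> float:
    """Receiver-side storage for one day of event_id records."""
    return events_per_day * record_bytes * ttl_buffer
```

For example, at 10 events per second with 5 subscribers each and an assumed 0.2-second average round trip, the dispatcher issues 50 deliveries per second and holds roughly 10 requests in flight at any moment.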

11. Cost Considerations

Cost Drivers

Cost Driver	Description	Typical Proportion
Dispatcher Compute	Container runtime for dispatcher service; scales with delivery volume	30–45%
Retry Queue Storage	Message queue for pending retries; proportional to failure rate	5–10%
Idempotency Store	Redis or DynamoDB for event_id deduplication; proportional to event volume	10–20%
DLQ Storage	Typically low volume but requires durable, reliable storage	3–5%
Audit Log Storage	Every delivery attempt logged; volume proportional to delivery count	15–25%
Subscription Manager DB	Low; single-digit GB for even large subscription catalogs	3–7%

Scaling Risks

High retry rate (many failing receivers) multiplies delivery attempts and associated compute cost; receiver health monitoring and subscription auto-pause on persistent failure limits the blast radius.
Audit log write volume is proportional to total delivery attempts including retries; high-failure-rate periods drive unexpected log storage cost.

Cost Optimisations

Auto-pause subscriptions after 24 hours of continuous delivery failure — prevents retry cost accumulation for persistently unavailable receivers.
Batch delivery for non-latency-sensitive event types (group multiple events into a single HTTP call) — reduces per-event HTTP overhead.
Tiered audit log retention: full detail for 90 days; compressed summary for 7 years.
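The auto-pause optimisation can be sketched as a tracker that remembers when each subscription started failing. FailureTracker and its record method are hypothetical names, and timestamps are passed in as Unix seconds so the logic stays easy to test.

```python
from typing import Dict

PAUSE_AFTER_S = 24 * 3600  # 24 hours of continuous failure


class FailureTracker:
    def __init__(self, pause_after_s: float = PAUSE_AFTER_S):
        self._pause_after_s = pause_after_s
        self._failing_since: Dict[str, float] = {}

    def record(self, subscription_id: str, success: bool, now: float) -> bool:
        """Record one delivery outcome; True means the subscription should be paused."""
        if success:
            # Any success ends the continuous-failure streak
            self._failing_since.pop(subscription_id, None)
            return False
        started = self._failing_since.setdefault(subscription_id, now)
        return now - started >= self._pause_after_s
```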

Indicative Cost Range

Scale	Monthly Compute	Queue + Storage	Total Monthly
Small (100K events/mo, 5 subscribers avg)	$200–$600	$100–$300	$300–$900
Medium (10M events/mo, 10 subscribers avg)	$2,000–$5,000	$500–$1,500	$2,500–$6,500
Large (500M events/mo, 10 subscribers avg)	$15,000–$40,000	$5,000–$15,000	$20,000–$55,000

12. Trade-Off Analysis

Architectural Options Comparison

Option	Reliability	Latency	Coupling	Operations Overhead	Best For
Option A — Webhook (this pattern)	High (retry + DLQ)	Seconds to minutes	Low	Medium	External system delivery; push-based notification
Option B — Polling API	Medium (depends on poll interval)	Poll interval latency	Medium	Low (receiver manages)	Receivers that cannot accept inbound connections
Option C — Shared Message Queue (EAAPL-INT001)	Very High	Sub-second	Low	High (broker infrastructure)	Internal enterprise integration; high throughput
Option D — Synchronous API	High	Sub-second	High	Low	Low-latency requirements; tight coupling acceptable

Architectural Tensions

Tension	Trade-Off	Resolution
Retry aggressiveness vs. Receiver overload	More retries = higher delivery guarantee; aggressive retries can overload recovering receivers	Exponential backoff with jitter; circuit breaker per subscription on persistent failure
Payload completeness vs. Size	Full AI metadata in every payload enables receiver autonomy; large payloads increase delivery cost and latency	Include all AI-specific fields in base payload; large result payloads delivered as reference + retrieval URL for bodies > 1MB
Security strictness vs. Onboarding friction	HMAC + TLS + replay protection requires receiver implementation effort	Provide SDK and reference implementation for major languages to reduce onboarding friction
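The reference-plus-retrieval-URL resolution for large results might be sketched as below. The field names ai_result_url and ai_result_size_bytes, the upload callback, and reading "1MB" as 1,000,000 bytes are all assumptions made for this example.

```python
import json
from typing import Callable

MAX_INLINE_BYTES = 1_000_000  # assumed interpretation of the 1MB limit


def result_fields(ai_result: dict, upload: Callable[[bytes], str],
                  max_inline_bytes: int = MAX_INLINE_BYTES) -> dict:
    """Inline small results; replace large ones with a retrieval reference."""
    raw = json.dumps(ai_result).encode("utf-8")
    if len(raw) <= max_inline_bytes:
        return {"ai_result": ai_result}
    # upload() is a hypothetical callback that stores the bytes and returns a URL
    return {"ai_result_url": upload(raw), "ai_result_size_bytes": len(raw)}
```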

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Receiver endpoint returns 5xx consistently	Medium	Medium — retries accumulate; DLQ fills	Delivery failure rate alert; retry queue depth	Investigate receiver; DLQ replay after fix
HMAC secret mismatch after rotation	Medium	Medium — deliveries rejected with 401 during grace period	401 response spike in audit log	Grace period allows both old and new secrets; alert if 401s persist beyond rotation window
Clock skew causing timestamp replay rejection	Low	Low — deliveries rejected unnecessarily	400 response spike; error message in receiver log	Synchronise receiver clock to NTP; increase replay window from 5 to 10 minutes if needed
Dispatcher memory leak causing OOM	Low	High — dispatcher restarts; in-flight deliveries retried	Container OOM alert	Kubernetes OOM restart; pending events redelivered from durable queue
DLQ not monitored — silent result loss	High (if not configured)	High — AI results never reach consumers	DLQ size metric (if monitored)	Mandatory DLQ monitoring with alert; no exceptions
Receiver implements non-idempotent processing	Medium	Medium — duplicate processing on retry	Duplicate records in receiver's database	Receiver must implement event_id deduplication; SDK provides reference implementation

Cascading Failure Scenarios

Mass receiver outage + no auto-pause + large event volume: All subscribers return 5xx → 7 retries per event × (events per hour) → retry queue depth spikes → dispatcher CPU saturated by retry attempts → new events cannot be dispatched → AI processing results stack up in delivery queue. Mitigation: subscription auto-pause after configurable failure threshold; dispatcher rate limit per subscription.
DLQ not monitored + regulated workflow: Systematic delivery failure → DLQ accumulates → regulated workflow (AML screening result, credit decision result) never delivered → receiving system operates without AI results → compliance gap → regulatory inquiry. Mitigation: DLQ monitoring is non-optional for regulated event types; alert on first DLQ entry, not on threshold.

14. Regulatory Considerations

APRA CPS 230 — Operational Risk

Clause 36: Webhook delivery for AI results embedded in critical business workflows (credit decisions, AML alerts) must have documented RTO/RPO. The retry architecture with DLQ provides documented maximum delivery latency (7 attempts over 24 hours).
Clause 52: Webhook delivery platform providers (Svix, Hookdeck) are third-party service providers under CPS 230 if the enterprise relies on them for critical AI result delivery.

APRA CPS 234 — Information Security

Clause 15: HMAC signature, TLS 1.3, secrets manager for shared secrets, and subscription-scoped keys implement proportional information security controls for webhook delivery of AI results.
Clause 36: HMAC validation failures at receivers are potential security incidents — payload tampering or credential theft — and must be assessed under the CPS 234 incident notification framework.

Australian Privacy Act 1988

APP 6: Delivering AI results containing personal data (inferred risk scores, classification results about individuals) to third-party systems via webhook is a disclosure under APP 6 — must be within the scope of the collection purpose or a specific exception.
APP 11: Webhook payloads in transit are covered by the security obligation; TLS 1.3 mandatory.

EU AI Act (2024)

Article 12 (Record-keeping): Webhook delivery audit log (every attempt, event_id, subscription, timestamp, status) satisfies the logging requirement for AI output delivery in high-risk AI systems.
Article 13 (Transparency): ai_model_id, ai_model_version, and ai_confidence_score in every webhook payload provide the transparency information required for AI systems influencing decisions about natural persons.

ISO 42001

Clause 8.5 (AI System Lifecycle): Model version tracking in webhook payloads supports the AI lifecycle management requirement — receivers can correlate results with model versions for performance monitoring.

NIST AI RMF (2023)

MEASURE 2.5: Delivery success rate, retry rate, and DLQ rate are performance measurements of the AI output delivery system — required under NIST AI RMF for measuring AI system reliability.

15. Reference Implementations

AWS

Dispatcher: AWS Lambda triggered by SQS queue (delivery job per event × subscription)
Retry Scheduler: SQS delay queues (configurable delay per retry attempt)
Subscription Manager: Amazon API Gateway + Lambda + DynamoDB subscriptions table
Idempotency Store (receiver): Amazon DynamoDB with TTL attribute on event_id records
DLQ: SQS DLQ connected to retry queue; CloudWatch alarm on DLQ depth
Audit Logger: CloudWatch Logs + S3 for long-term retention
Secrets: AWS Secrets Manager (one secret per subscription_id)

Azure

Dispatcher: Azure Functions (Service Bus triggered) or Azure Event Grid (native webhook delivery with retry)
Retry Scheduler: Azure Service Bus scheduled messages; or Azure Event Grid built-in retry
Subscription Manager: Azure API Management + Functions + Cosmos DB subscriptions collection
Idempotency Store (receiver): Azure Cache for Redis with TTL or Cosmos DB with TTL
DLQ: Azure Service Bus dead-letter queue; Azure Monitor alert on DLQ depth
Audit Logger: Application Insights + Azure Blob Storage for long-term retention
Secrets: Azure Key Vault (one secret per subscription_id)

GCP

Dispatcher: Cloud Run service (Pub/Sub triggered)
Retry Scheduler: Pub/Sub retry with Cloud Tasks for custom retry schedules
Subscription Manager: Cloud Run API + Firestore subscriptions collection
Idempotency Store (receiver): Memorystore (Redis) with TTL or Firestore with TTL
DLQ: Pub/Sub dead-letter topic; Cloud Monitoring alert on DLQ subscription undelivered count
Audit Logger: Cloud Logging + Cloud Storage for long-term retention
Secrets: Secret Manager (one secret per subscription_id)

On-Premises / Private Cloud

Dispatcher: Custom Python/Node.js service on Kubernetes
Retry Scheduler: Redis sorted set (score = next_attempt_unix_timestamp) + scheduler loop
Subscription Manager: FastAPI service + PostgreSQL subscriptions table
Idempotency Store (receiver): Redis with TTL or PostgreSQL with expiry column
DLQ: PostgreSQL dlq_events table with full event and attempt history; Grafana alert
Audit Logger: Fluentd → Elasticsearch → Kibana; 7-year retention with ILM policy
Secrets: HashiCorp Vault (one KV entry per subscription_id)
Commercial option: Svix (self-hosted Docker) provides all components above with management UI
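The sorted-set retry scheduler listed above (score = next_attempt_unix_timestamp, polled by a scheduler loop) can be sketched with a heap as an in-memory stand-in for the Redis sorted set. The class and function names are hypothetical, and the backoff values (30-second base, one-hour cap) are placeholder assumptions rather than the retry schedule this pattern specifies.

```python
import heapq


class RetrySchedule:
    """Orders delivery jobs by next attempt time (stand-in for a sorted set)."""

    def __init__(self):
        self._heap = []  # entries are (next_attempt_unix_timestamp, job_id)

    def schedule(self, job_id, next_attempt_unix_timestamp):
        heapq.heappush(self._heap, (next_attempt_unix_timestamp, job_id))

    def pop_due(self, now):
        """Remove and return job ids whose next attempt time is <= now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, job_id = heapq.heappop(self._heap)
            due.append(job_id)
        return due


def backoff_delay(attempt, base_seconds=30, cap_seconds=3600):
    """Exponential backoff in seconds for a 1-based attempt number (assumed values)."""
    return min(cap_seconds, base_seconds * 2 ** (attempt - 1))
```

A scheduler loop would call `pop_due(time.time())` on an interval, hand each due job to the dispatcher, and reschedule failures at `now + backoff_delay(attempt)` until the attempt limit routes the job to the DLQ.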

16. Related Patterns

Pattern	Relationship	Notes
EAAPL-INT001 — Enterprise AI Service Bus	Complementary	Service bus for internal enterprise integration; webhook for external or cross-organisational delivery
EAAPL-INT005 — Batch AI Processing	Enables	Batch jobs use webhook pattern to notify consuming systems of job completion and result availability
EAAPL-INT007 — AI Circuit Breaker	Enables	Circuit breaker wraps the HTTP delivery call in the dispatcher — opens on persistent receiver failures
EAAPL-INT008 — Bidirectional AI Sync	Complementary	Webhook is one delivery channel for sync events; AI sync pattern uses it for push-based update delivery

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Justification
Architectural Completeness	5	HMAC security, idempotency, retry schedule, DLQ, subscription management, AI-specific payload fields all specified
Operational Readiness	5	SLOs, monitoring, incident response, DR all defined; well-established operational pattern
Security Coverage	5	HMAC + timestamp + event_id triple defence; TLS mandatory; secrets management; OWASP LLM Top 10 addressed
Governance Coverage	4	Audit trail, model version tracking, usage policy; human approval gate is receiver responsibility
Cost Predictability	5	Low-complexity infrastructure; costs are predictable and low relative to other integration patterns
Implementation Complexity	4	Low-medium complexity — well-understood pattern; HMAC implementation requires care; idempotency at receiver often missed
Industry Validation	5	Universal pattern in production across all industries; GitHub, Stripe, Twilio webhook patterns widely studied

18. Revision History

Version	Date	Author	Changes
1.0	2026-06-12	EAAPL Working Group	Initial publication — integration patterns series
