EAAPL-INT006 — AI Webhook Pattern

Tags: event-driven, audit-logging, asynchronous, low-complexity
Status: Proven | Version: 1.0 | Domain: Integration


1. Executive Summary

The AI Webhook Pattern delivers AI-generated notifications, inference results, and asynchronous AI processing outcomes to external systems via HTTP callbacks (webhooks). As AI workloads increasingly run asynchronously — long-running batch jobs, background inference pipelines, AI agent task completions — consuming systems need a reliable, secure, and idempotent mechanism to receive AI results without polling.

This pattern extends classical webhook design with AI-specific payload fields — confidence scores, model version, processing time, and result schemas — enabling receivers to make informed decisions about whether to act on an AI result, which version of a model produced it, and how long inference took. Security hardening addresses the unique risks of AI webhook delivery: HMAC signature verification, replay attack prevention, idempotent receiver design, and dead letter queue handling for undeliverable AI results.

For CIOs and CTOs, the pattern is the glue between AI processing backends and the consuming business applications that act on AI outputs. It enables a clean architectural boundary: AI processing systems produce results when ready and deliver them reliably; consuming systems do not need to know when or how AI processing occurred. The result is loosely coupled, independently deployable integration that supports AI system upgrades without consumer redeployment.


2. Problem Statement

Business Problem

AI processing is inherently asynchronous. A document intelligence pipeline may take 30 seconds to process a contract. A multi-step AI agent may take 5 minutes to complete a research task. A nightly batch enrichment job completes at 3 AM. In all cases, a consuming business application needs to receive the result when it is ready — not by polling continuously, and not by waiting synchronously.

Technical Problem

Without a structured webhook pattern, teams implement ad-hoc polling loops, shared database tables as notification queues, or synchronous long-polling connections. These approaches are unreliable under load, difficult to monitor, and create tight coupling between AI systems and consuming applications. When the AI system is upgraded, every polling consumer must be updated.

Symptoms

  • Consuming applications poll AI status APIs every 5 seconds, creating unnecessary load on AI systems.
  • AI results are written to a shared database table that consumers query continuously — creating a polling bottleneck that also exposes the AI system's internal data model.
  • Webhook delivery failures are discovered by consumers noticing missing results — hours after the delivery failure occurred.
  • There is no retry logic — a single delivery failure means the result is permanently lost.

Cost of Inaction

  • Reliability: Without idempotent delivery and retry logic, AI results are silently lost on any transient receiver failure. In regulated workflows, lost results mean compliance gaps.
  • Coupling: Polling consumers are tightly coupled to AI system availability; AI system maintenance windows affect all consuming applications simultaneously.
  • Cost: Continuous polling of AI status APIs generates 10–100× more API calls than event-driven delivery, increasing infrastructure costs and rate limit exposure.
  • Developer experience: Building and maintaining custom polling logic in every consuming application is duplicated effort that disappears with a well-designed webhook pattern.

3. Context

When to Apply

  • AI processing is asynchronous and the consuming system needs to receive results when ready.
  • The consuming system is a third-party system or an internal system operated by a different team.
  • Result delivery must be reliable, with retry and dead letter queue handling for failures.
  • The consuming system does not have access to the AI processing system's internal message bus.

When NOT to Apply

  • Consuming system needs real-time AI results (< 1s) — use synchronous request/response or streaming.
  • Consuming system is an internal service in the same deployment environment — use direct message bus consumption (EAAPL-INT001) instead.
  • The consuming system cannot receive inbound HTTP connections (firewall restrictions, no public endpoint) — use polling API or message queue instead.
  • The number of events is very small and ad-hoc — synchronous delivery or email notification may be simpler.

Prerequisites

  • Consuming system can expose an HTTPS webhook endpoint.
  • Consuming system can implement HMAC signature verification.
  • AI processing system has a job/event identifier system for idempotency key generation.
  • A dead letter queue and alerting infrastructure for failed delivery tracking.

Industry Applicability

Industry | Applicability | Use Case | Receiver Type
Financial Services | High | Loan application AI assessment result → origination system; fraud alert → operations platform | Internal systems, partner banks
SaaS Platforms | Very High | AI feature result delivery to customer systems (document intelligence, content generation) | External customer systems
Healthcare | High | Clinical AI result → EMR system; prior authorisation AI outcome → payer system | Partner healthcare systems
Government | Medium | AI document classification result → case management system | Internal government systems
eCommerce | High | AI recommendation update → personalisation platform; content moderation result → content management | Internal and partner systems
Legal Tech | High | Contract AI analysis result → matter management system | Internal and customer systems

4. Architecture Overview

The AI Webhook Pattern is a push-based asynchronous notification system with six design concerns: payload design, security (HMAC + replay prevention), delivery management (retry + exponential backoff), idempotent receiver design, dead letter queue handling, and subscription management.

Webhook Payload Design for AI Results. The webhook payload carries two categories of fields.

Standard webhook fields:

  • event_id: globally unique UUID v4; the idempotency key.
  • event_type: namespaced event type, e.g., ai.document.classified.v1.
  • occurred_at: ISO-8601 UTC timestamp of the AI event; not the delivery timestamp.
  • webhook_id: unique per delivery attempt; different from event_id, which lets receivers distinguish original from retry deliveries.

AI-specific fields:

  • ai_result: the structured AI output; schema versioned.
  • ai_result_schema_version: enables receivers to handle schema evolution.
  • ai_confidence_score: 0.0–1.0; receivers may filter low-confidence results.
  • ai_model_id: the specific model that produced the result.
  • ai_model_version: model version tag.
  • ai_processing_time_ms: how long inference took; enables SLA tracking.
  • ai_cost_usd: optional cost attribution field.
  • ai_fallback_used: boolean; indicates whether a fallback model was used instead of the primary model.

These fields enable receivers to make intelligent decisions about AI results without needing to query the AI system for metadata.
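The field list above can be sketched as a small payload builder. This is a minimal illustration, not a reference implementation: the function name `build_ai_webhook_payload`, its parameters, and the example values are assumptions; only the field names come from this pattern.

```python
import json
import uuid
from datetime import datetime, timezone


def build_ai_webhook_payload(event_type, ai_result, *, confidence, model_id,
                             model_version, processing_time_ms,
                             schema_version="1", fallback_used=False,
                             cost_usd=None, event_id=None, occurred_at=None):
    """Assemble standard and AI-specific fields into one JSON-ready dict.

    Hypothetical helper: pass the same event_id on a retry so that only
    webhook_id changes between delivery attempts.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("ai_confidence_score must be within 0.0-1.0")
    payload = {
        # Standard webhook fields
        "event_id": event_id or str(uuid.uuid4()),  # idempotency key
        "event_type": event_type,
        "occurred_at": occurred_at or datetime.now(timezone.utc).isoformat(),
        "webhook_id": str(uuid.uuid4()),            # fresh per delivery attempt
        # AI-specific fields
        "ai_result": ai_result,
        "ai_result_schema_version": schema_version,
        "ai_confidence_score": confidence,
        "ai_model_id": model_id,
        "ai_model_version": model_version,
        "ai_processing_time_ms": processing_time_ms,
        "ai_fallback_used": fallback_used,
    }
    if cost_usd is not None:  # ai_cost_usd is optional
        payload["ai_cost_usd"] = cost_usd
    return payload


if __name__ == "__main__":
    example = build_ai_webhook_payload(
        "ai.document.classified.v1",
        {"label": "contract"},          # illustrative result body
        confidence=0.93,
        model_id="doc-classifier",      # illustrative model name
        model_version="v7",
        processing_time_ms=412,
    )
    print(json.dumps(example, indent=2))
```

Keeping event_id stable and regenerating webhook_id on each attempt is what lets a receiver deduplicate on one field while still telling attempts apart in its logs.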

HMAC Security. Every webhook delivery is signed with HMAC-SHA256. The sender computes the signature over the raw request body using a shared secret (one secret per webhook subscription). The signature is delivered in the X-AI-Signature-SHA256 header as sha256=<hex digest>. The receiver verifies the signature before processing: compute the expected signature over the received body using the stored shared secret; compare to the received signature using a constant-time comparison function to prevent timing attacks. Any webhook with an invalid signature is rejected with HTTP 401 and logged as a security event.
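A minimal sketch of the signing and verification steps described above, using only the Python standard library. The function names are illustrative; the header name and the sha256=<hex digest> format are taken from this pattern. The signature is computed over the raw bytes of the request body, so a receiver should verify before parsing or re-serialising the JSON.

```python
import hashlib
import hmac

SIGNATURE_HEADER = "X-AI-Signature-SHA256"


def sign_body(secret: bytes, raw_body: bytes) -> str:
    """Sender side: return the header value for the raw request body."""
    digest = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_signature(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = sign_body(secret, raw_body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), (header_value or "").encode())


if __name__ == "__main__":
    secret = b"\x01" * 32  # placeholder; real secrets come from a secrets manager
    body = b'{"event_id": "example"}'
    header = sign_body(secret, body)
    print(SIGNATURE_HEADER, header)
    print(verify_signature(secret, body, header))         # valid signature
    print(verify_signature(secret, body + b" ", header))  # body altered in transit
```

A receiver would return HTTP 401 and log a security event whenever `verify_signature` returns False.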

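Replay Attack Prevention. The occurred_at field in the payload serves as a time bound for replay attack prevention: receivers reject webhooks whose occurred_at timestamp is more than 5 minutes in the past. The event_id serves as the idempotency key for within-window deduplication. Together, these two controls prevent both delayed replay attacks (timestamp check) and within-window duplicate delivery (event_id check). Because occurred_at records the AI event time rather than the delivery time, a strict 5-minute check also rejects legitimate retries scheduled later in the backoff window and manual DLQ replays; receivers that must accept those deliveries need a window sized to the retry schedule and should rely on event_id deduplication inside it.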
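The two checks can be sketched together as follows. This is an illustration under stated assumptions: the function names are hypothetical, an in-memory set stands in for the receiver's idempotency store, and the 5-minute window is the value given above. Future-dated timestamps are not checked here.

```python
from datetime import datetime, timedelta, timezone

REPLAY_WINDOW = timedelta(minutes=5)


def parse_occurred_at(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, treating a trailing 'Z' or no offset as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def check_delivery(payload, seen_event_ids, now=None, window=REPLAY_WINDOW):
    """Return 'reject_stale', 'duplicate', or 'accept' for a verified payload."""
    now = now or datetime.now(timezone.utc)
    # Timestamp check: anything older than the window is treated as a replay.
    if now - parse_occurred_at(payload["occurred_at"]) > window:
        return "reject_stale"   # receiver would answer HTTP 400
    # event_id check: deduplicate deliveries that arrive inside the window.
    if payload["event_id"] in seen_event_ids:
        return "duplicate"      # receiver would answer HTTP 200, no reprocessing
    seen_event_ids.add(payload["event_id"])
    return "accept"
```

Run the signature verification first; the timestamp is only trustworthy once the body it lives in has been authenticated.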

Retry with Exponential Backoff. A delivery failure (HTTP 4xx except 400 and 410, HTTP 5xx, or connection timeout) triggers the retry schedule: an immediate retry, then retries at 1 minute, 5 minutes, 30 minutes, 2 hours, 8 hours, and 24 hours, for 7 retries in total. Jitter (±10% of the interval) prevents retry storms when multiple webhooks fail simultaneously. Once all 7 retries have failed, the delivery is moved to the dead letter queue. HTTP 400 (malformed request) and HTTP 410 (subscription cancelled) terminate retries immediately because they are permanent failures, not transient ones.
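A minimal sketch of the retry decision and schedule described above. The names are illustrative, and the code assumes each listed interval is the wait applied before the corresponding retry; if the intervals are meant as offsets from the first failure, the table would need adjusting.

```python
import random

# Wait before retry 1..7, in seconds: immediate, 1 min, 5 min, 30 min, 2 h, 8 h, 24 h.
RETRY_DELAYS_SECONDS = [0, 60, 300, 1800, 7200, 28800, 86400]
JITTER_FRACTION = 0.10                  # plus or minus 10% of the interval
PERMANENT_FAILURE_STATUSES = {400, 410}


def should_retry(status):
    """Decide whether a delivery outcome is retryable.

    status is the HTTP status code, or None for a connection timeout.
    """
    if status is None:
        return True                     # timeout is treated as transient
    if status in PERMANENT_FAILURE_STATUSES:
        return False                    # malformed request or cancelled subscription
    return 400 <= status < 600          # other 4xx and all 5xx are retried


def next_retry_delay(retry_number, rng=random):
    """Return the jittered delay in seconds before the given retry (1-based).

    Returns None once the schedule is exhausted, which is the signal to
    move the delivery to the dead letter queue.
    """
    if retry_number > len(RETRY_DELAYS_SECONDS):
        return None
    base = RETRY_DELAYS_SECONDS[retry_number - 1]
    return base * (1 + rng.uniform(-JITTER_FRACTION, JITTER_FRACTION))
```

Passing a seeded `random.Random` instance as `rng` makes the jitter reproducible in tests.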

Idempotent Receiver Design. Receivers must handle duplicate delivery. Even when the sender intends exactly-once delivery, retries after lost or delayed acknowledgements mean receivers see at-least-once semantics in practice. The receiver's idempotency implementation: check the event_id against the processed events store before processing; if already processed, return HTTP 200 (to terminate the sender's retry) without reprocessing; if not processed, process and insert event_id into the processed events store atomically. The processed events store needs to retain event IDs for at least the duration of the retry window (24 hours); the 25-hour TTL used elsewhere in this pattern adds a one-hour buffer.
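One way to get the atomic "process and record" step is to do both inside a single database transaction. The sketch below uses SQLite from the standard library as a stand-in for the receiver's store; the table name, function names, and the assumption that the business action writes to the same database are all illustrative. TTL cleanup of old event IDs is not shown.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the processed-events table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events ("
        "event_id TEXT PRIMARY KEY, "
        "processed_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.commit()
    return conn


def handle_event(conn, payload, process):
    """Process a verified payload at most once per event_id.

    Returns (http_status, outcome). If process() raises, the insert is
    rolled back so the sender's next retry is handled as a new event.
    """
    with conn:  # one transaction: commit on success, roll back on exception
        try:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (payload["event_id"],),
            )
        except sqlite3.IntegrityError:
            # Primary-key clash: this event_id was already processed.
            return 200, "duplicate"
        process(payload)
    return 200, "processed"
```

Returning 200 for a duplicate is deliberate: it tells the sender to stop retrying without repeating the business action.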

Dead Letter Queue for Failed Webhooks. Once the 7 retries are exhausted, the undeliverable webhook is written to the DLQ with the event payload, all delivery attempt records (timestamp, HTTP status, response body), and the subscription metadata. Any DLQ entry triggers an alert to the integration team, and a rising DLQ growth rate triggers escalation. The DLQ also provides a manual replay capability: operators can re-trigger delivery after the receiver endpoint has been fixed.

Subscription Management. Consumers register webhook subscriptions via a management API: endpoint URL, event types to receive, authentication method (HMAC shared secret; receiver may also provide a Bearer token for the sender to include in the Authorization header). On registration, the sender delivers a test event to verify endpoint reachability and HMAC configuration. Subscriptions can be paused, deleted, and replayed. Event type filtering: consumers receive only the event types they have subscribed to — not a firehose of all AI events.
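Event type filtering can be sketched as a lookup over active subscriptions. The `Subscription` class, its fields, and the example URLs are hypothetical; a real subscription manager would back this with its database and management API.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    subscription_id: str
    endpoint_url: str
    event_types: set = field(default_factory=set)
    paused: bool = False


def matching_subscriptions(subscriptions, event_type):
    """Return active subscriptions that asked for this event type."""
    return [
        sub for sub in subscriptions
        if not sub.paused and event_type in sub.event_types
    ]


if __name__ == "__main__":
    subs = [
        Subscription("sub-1", "https://example.com/hooks/ai",
                     {"ai.document.classified.v1"}),
        Subscription("sub-2", "https://example.org/hooks/ai",
                     {"ai.document.classified.v1"}, paused=True),
    ]
    for sub in matching_subscriptions(subs, "ai.document.classified.v1"):
        print(sub.subscription_id)
```

The dispatcher would create one delivery job per returned subscription, each signed with that subscription's own secret.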


5. Architecture Diagram

flowchart TD
    subgraph Source["AI Source"]
        A[AI Processing System]
        B[Webhook Dispatcher + HMAC Sign]
        C[Subscription Manager]
    end
    subgraph Delivery["Delivery Layer"]
        D{Delivery Outcome}
        E[Retry Scheduler]
        F[(Dead Letter Queue)]
    end
    subgraph Receiver["Receiver Side"]
        G[Idempotency Check]
        H[Process AI Result]
        I[(Audit Logger)]
    end
    A --> B
    C -->|subscription config| B
    B --> D
    D -->|success| G
    D -->|transient fail| E
    D -->|permanent fail| F
    E -->|retry| B
    G -->|new event| H
    G -->|duplicate| A
    B --> I
    H --> I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#fee2e2,stroke:#ef4444
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#fef9c3,stroke:#eab308

6. Components

Component | Type | Responsibility | Technology Options | Criticality
Webhook Dispatcher | Service | Build signed payload, deliver to receiver endpoint, handle HTTP responses, trigger retry | Custom service, Svix, Hook0, Hookdeck | Critical
Retry Scheduler | Service | Track pending retries, apply exponential backoff with jitter, trigger redelivery | Redis sorted set, AWS SQS delay queues, Azure Service Bus scheduled messages | Critical
Subscription Manager | Service | CRUD for webhook subscriptions; event type filtering; shared secret management | Custom service + PostgreSQL, Svix subscription management | High
HMAC Signer | Library | Compute HMAC-SHA256 signature over request body with subscription secret | Standard library (Python hmac, Node.js crypto, Java javax.crypto) | Critical
Idempotency Store (Receiver) | Storage | Store processed event IDs for deduplication; TTL 25 hours minimum | Redis with TTL, DynamoDB with TTL attribute, PostgreSQL with expiry | Critical
Dead Letter Queue | Infrastructure | Store undeliverable webhooks with full delivery history for manual review | AWS SQS DLQ, Azure Service Bus DLQ, custom DB table | High
DLQ Review Interface | UI/Service | View, investigate, and manually replay DLQ entries | Custom admin UI, Svix DLQ interface | High
Audit Logger | Service | Log every delivery attempt: event_id, subscription, attempt number, status, latency | Structured logging → SIEM, observability platform | Critical
Test Event Service | Service | Deliver test events on subscription registration to verify endpoint reachability and HMAC configuration | Built into webhook dispatcher | Medium

7. Data Flow

Primary Flow

Step | Actor | Action | Output
1 | AI Processing System | AI inference completes; produces result with event_id, model metadata, confidence score | Result event submitted to Webhook Dispatcher
2 | Subscription Manager | Dispatcher looks up active subscriptions matching event_type | Subscriber list (endpoint URL, HMAC secret, event filter config)
3 | Webhook Dispatcher | Constructs payload with all AI fields + standard webhook fields; signs with HMAC-SHA256 using subscription secret | Signed HTTP POST ready for delivery
4 | Webhook Dispatcher | Delivers HTTP POST to receiver endpoint; sets 30-second connection timeout | HTTP response from receiver
5 | Receiver | Verifies HMAC signature; checks occurred_at timestamp (replay prevention); checks event_id idempotency store | Valid + new: process; Valid + duplicate: 200 without reprocessing; Invalid signature: 401
6 | Receiver | Processes AI result (updates database, triggers workflow, sends notification) | Business action taken
7 | Receiver | Returns HTTP 200–299 | Delivery confirmed
8 | Audit Logger | Records: event_id, subscription_id, attempt=1, status=200, latency_ms | Audit record persisted

Error Flow

Step | Error Condition | Detection | Recovery
4 | Receiver returns 5xx | HTTP 5xx status | Add to retry scheduler with immediate first retry
4 | Connection timeout | No response within 30 seconds | Add to retry scheduler; log timeout event
4 | Receiver returns 4xx (except 400, 410) | HTTP 4xx status | Add to retry scheduler; investigate receiver configuration
5 | HMAC signature invalid | Receiver returns 401 | Log security event; no retry; alert integration team (possible key mismatch or payload tampering)
5 | Timestamp replay check fails | Receiver returns 400 | Log potential replay attack; no retry; alert if pattern repeats
Retry | All 7 retries exhausted | Retry count threshold reached | Move to DLQ; alert integration team; await manual resolution + replay
Ongoing | DLQ growth rate increases | DLQ message count metric | Escalation alert; investigate receiver endpoint health

8. Security Considerations

Authentication and Authorisation

  • HMAC-SHA256 signature verification is mandatory for every delivery; no exceptions for "trusted" internal endpoints.
  • Sender may optionally include a Bearer token (subscription-specific, not a master key) in the Authorization header for receivers that require additional authentication.
  • Subscription management API protected by API key or OAuth 2.0; only authorised teams can register webhook subscriptions.
  • One HMAC shared secret per subscription — compromise of one receiver's secret does not affect other subscriptions.

Secrets Management

  • HMAC shared secrets stored in centralised secrets manager; one secret per subscription ID.
  • Shared secrets minimum 32 bytes of cryptographically random data (256-bit).
  • Secret rotation: provide a 24-hour grace period where both old and new secrets are valid simultaneously, enabling zero-downtime rotation.
  • Secrets never logged; log only the subscription ID and event_id, not the secret or the full payload with sensitive fields.
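The rotation grace period can be sketched on the receiver side as verification against every currently valid secret. This is an assumption-laden illustration: the function name and the idea of passing a list of active secrets are hypothetical, and removing the old secret after 24 hours is left to the caller.

```python
import hashlib
import hmac


def verify_with_rotation(raw_body: bytes, header_value: str, active_secrets) -> bool:
    """Accept the signature if any currently valid secret produces it.

    active_secrets holds the new secret and, during the grace period,
    the old one as well.
    """
    received = (header_value or "").encode()
    matched = False
    for secret in active_secrets:
        digest = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
        expected = b"sha256=" + digest.encode()
        # Check every secret instead of returning early, so timing does not
        # reveal which secret matched.
        matched = hmac.compare_digest(expected, received) or matched
    return matched
```

Once the grace period ends, dropping the old secret from `active_secrets` makes any delivery still signed with it fail verification, which should surface as a 401 alert.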

Replay Attack Prevention

  • occurred_at field in every payload: receivers reject events where occurred_at is > 5 minutes in the past.
  • event_id uniqueness: each event is delivered with a globally unique UUID v4; idempotency store at receiver deduplicates within the retry window.
  • webhook_id (per attempt): distinguishes delivery attempts from the same event_id; receivers process on event_id, not webhook_id.

Data Classification

  • Webhook payloads containing PII must only be delivered to endpoints with appropriate data handling agreements.
  • AI result fields may contain inferred sensitive attributes; payload schema version enables receivers to handle sensitive field sets differently.
  • TLS 1.3 mandatory for all webhook delivery endpoints — reject delivery to HTTP (non-TLS) endpoints.

Encryption

  • All delivery over TLS 1.3; receiver certificate validated against CA bundle.
  • Sender should reject self-signed certificates from receivers in production (configurable for development environments).
  • DLQ storage encrypted at rest; DLQ entries may contain PII-adjacent AI inference results.

Auditability

  • Complete delivery audit: every attempt (event_id, subscription_id, attempt number, HTTP status, response body excerpt, latency, timestamp) logged immutably.
  • HMAC validation failures at receiver logged as security events and surfaced to SIEM.
  • Subscription creation and deletion events logged with operator identity and timestamp.

OWASP LLM Top 10 Mitigations

OWASP LLM Risk | Relevance | Mitigation in This Pattern
LLM01 — Prompt Injection | Low | Webhook delivers AI results, not prompts; no LLM invoked in delivery path
LLM02 — Insecure Output Handling | High | AI result in webhook payload is schema-versioned structured JSON; receivers should validate against schema before acting on content
LLM03 — Training Data Poisoning | Low | Delivery pattern only; no training pipeline
LLM04 — Model Denial of Service | Low | Webhook delivery is downstream of AI inference; DDoS of receiver does not affect AI processing system
LLM05 — Supply Chain Vulnerabilities | Low | Webhook dispatcher is a delivery service; AI model supply chain is upstream
LLM06 — Sensitive Information Disclosure | High | TLS mandatory; HMAC prevents payload tampering; payload schema versioning enables receivers to strip sensitive fields for logging
LLM07 — Insecure Plugin Design | Low | Webhook delivery pattern has no plugin or function-calling surface
LLM08 — Excessive Agency | Medium | Webhook payload delivers AI result; any automated action by the receiver on the AI result is the receiver's responsibility; receiver should implement confidence threshold gates before automated action
LLM09 — Overreliance | Medium | ai_confidence_score and ai_fallback_used fields enable receivers to apply quality thresholds; document recommended thresholds in subscription onboarding documentation
LLM10 — Model Theft | Low | Webhook delivers outputs, not model weights or parameters

9. Governance Considerations

Responsible AI

  • Webhook subscription registration requires the subscribing team to acknowledge the AI output usage policy: confidence threshold requirements for automated action, human review requirements for high-stakes outcomes, data retention policy for received AI results.
  • ai_confidence_score and ai_fallback_used fields in every payload are governance controls — receivers cannot claim they were unaware the result came from a fallback model or had low confidence.

Model Risk Management

  • ai_model_id and ai_model_version in every payload enable receivers to track which model version produced each result.
  • When an AI model is upgraded, existing webhook subscribers receive the new ai_model_version in payloads automatically; subscribers can filter by model version during a validation period.

Human Approval Gates

  • Webhook subscription configuration includes a requires_human_review flag for high-stakes event types; receivers that subscribe to these event types are contractually required to implement human review before automated action.
  • The webhook payload schema for high-stakes event types includes a human_review_required: true field as a reminder at the application level.
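A receiver-side gate that combines these governance fields might look like the sketch below. The function name, the returned labels, and the 0.85 threshold are assumptions for illustration; the actual threshold belongs in the AI output usage policy, not in code defaults.

```python
def decide_action(payload, min_confidence=0.85):
    """Route a verified AI result to automated handling or human review.

    min_confidence is a placeholder value, not a recommendation.
    """
    if payload.get("human_review_required"):
        return "human_review"      # high-stakes event type
    if payload.get("ai_fallback_used"):
        return "human_review"      # produced by a fallback model
    score = payload.get("ai_confidence_score")
    if score is None or score < min_confidence:
        return "human_review"      # missing or low confidence
    return "auto_process"
```

Treating a missing confidence score as a reason for review keeps the gate fail-safe when a payload schema changes.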

Policy and Traceability

  • Every delivered webhook creates an immutable audit record linking the AI event to the subscriber that received it — enabling traceability of AI result propagation across systems.
  • DLQ events are traceability gaps — a DLQ event means an AI result reached the delivery system but not the consuming application. DLQ resolution is required to close the traceability gap.

Governance Artefacts

Artefact | Owner | Update Frequency | Storage Location
Webhook Subscription Registry | Platform Engineering | Per subscription change | API gateway / subscription manager DB
AI Webhook Usage Policy | Chief AI Officer | Annually | Policy management system
Delivery Audit Log | Platform Engineering | Continuous | Immutable audit log store (7-year retention for regulated use cases)
DLQ Review Log | AI Governance | Per DLQ event | Governance dashboard
HMAC Secret Rotation Log | Security Engineering | Per rotation | Secrets manager audit log
High-Stakes Event Type Register | AI Governance | Per risk assessment | Risk register

10. Operational Considerations

Monitoring and SLOs

SLO | Target | Measurement | Alert Threshold
Successful delivery rate (first attempt) | > 95% | First-attempt successes / total deliveries | < 90% in any 15-min window
Overall delivery success rate (with retries) | > 99.9% | Total successes / total events (excluding DLQ) | < 99.5%
DLQ rate | < 0.1% | DLQ events / total events | Any DLQ entry: alert
Delivery latency (p99) | < 5s from event to receiver HTTP 200 | Time from AI result ready to first delivery success | > 15s sustained
Retry queue depth | < 1000 events | Retry queue message count | > 5000: investigate receiver health
HMAC validation failures at receiver | 0 | Receiver 401 responses logged in audit | Any occurrence: security event

Logging

  • Dispatcher: event_id, subscription_id, attempt number, endpoint URL, HTTP status, response body (first 500 chars), latency_ms, timestamp.
  • Retry scheduler: event_id, subscription_id, next_attempt_time, attempt number, backoff duration.
  • DLQ: full event payload, all attempt records, subscription metadata, DLQ timestamp.
  • Security events: HMAC validation failures, timestamp replay rejections, unexpected response patterns.

Incident Response

  • Receiver endpoint down: retry schedule activates; DLQ rate alert fires if retries exhausted; integration team contacts receiver team; manual replay from DLQ after endpoint recovery.
  • HMAC validation failures: security team alerted; investigate: key mismatch (normal during rotation grace period), payload tampering (security incident), wrong endpoint receiving delivery.
  • DLQ accumulation: root cause investigation required before replay — systematic receiver error, schema mismatch, or endpoint misconfiguration must be resolved first.
  • Retry storm: large volume of failed deliveries creates retry queue depth spike; backoff jitter limits peak retry concurrency; dispatcher rate limiting prevents receiver overload during recovery.

Disaster Recovery

Scenario | RTO | RPO | Recovery Procedure
Webhook dispatcher failure | 5 minutes | 0 (events queued durably before dispatch) | Kubernetes restart; pending events re-queued from durable store
Retry scheduler failure | 10 minutes | 0 (retry metadata in durable queue) | Restore; pending retries rehydrated from queue
Subscription manager DB failure | 15 minutes | < 1 minute (replicated DB) | Failover to replica; dispatcher caches subscription data for short outages
DLQ storage failure | 30 minutes | Up to recovery time | DLQ entries rebuilt from audit log; replay after storage recovery

Capacity Planning

  • Dispatcher throughput: (events per second) × (average subscribers per event type) = peak delivery concurrency.
  • Retry queue storage: (events per hour × expected failure rate × max retry window in hours) = maximum queue depth.
  • Idempotency store (receiver-side): (events per day) × (event_id record size of ~100 bytes) × 1.1 (TTL buffer) = daily storage requirement; TTL 25 hours auto-clears.
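The three sizing formulas above translate directly into arithmetic. The function names and the input figures in the example are hypothetical, chosen only to show the calculation; the 100-byte record size and 1.1 buffer are the values stated above.

```python
def peak_delivery_concurrency(events_per_second, avg_subscribers_per_event_type):
    """Dispatcher throughput: events per second times subscribers per event type."""
    return events_per_second * avg_subscribers_per_event_type


def max_retry_queue_depth(events_per_hour, failure_rate, retry_window_hours):
    """Retry queue storage: failed events that can accumulate over the window."""
    return events_per_hour * failure_rate * retry_window_hours


def idempotency_store_bytes_per_day(events_per_day, record_bytes=100, ttl_buffer=1.1):
    """Receiver-side idempotency store: daily storage requirement in bytes."""
    return events_per_day * record_bytes * ttl_buffer


if __name__ == "__main__":
    # Illustrative inputs only: 50 events/s with 10 subscribers per event type,
    # 100,000 events/hour at a 2% failure rate over a 24-hour retry window,
    # and 1,000,000 events/day at the receiver.
    print(peak_delivery_concurrency(50, 10))
    print(round(max_retry_queue_depth(100_000, 0.02, 24)))
    print(round(idempotency_store_bytes_per_day(1_000_000) / 1_000_000), "MB/day")
```

With these inputs the idempotency store works out to roughly 110 MB per day, which the 25-hour TTL keeps from growing beyond about one day of data.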

11. Cost Considerations

Cost Drivers

Cost Driver | Description | Typical Proportion
Dispatcher Compute | Container runtime for dispatcher service; scales with delivery volume | 30–45%
Retry Queue Storage | Message queue for pending retries; proportional to failure rate | 5–10%
Idempotency Store | Redis or DynamoDB for event_id deduplication; proportional to event volume | 10–20%
DLQ Storage | Typically low volume but requires durable, reliable storage | 3–5%
Audit Log Storage | Every delivery attempt logged; volume proportional to delivery count | 15–25%
Subscription Manager DB | Low; single-digit GB for even large subscription catalogs | 3–7%

Scaling Risks

  • High retry rate (many failing receivers) multiplies delivery attempts and associated compute cost; receiver health monitoring and subscription auto-pause on persistent failure limits the blast radius.
  • Audit log write volume is proportional to total delivery attempts including retries; high-failure-rate periods drive unexpected log storage cost.

Cost Optimisations

  • Auto-pause subscriptions after 24 hours of continuous delivery failure — prevents retry cost accumulation for persistently unavailable receivers.
  • Batch delivery for non-latency-sensitive event types (group multiple events into a single HTTP call) — reduces per-event HTTP overhead.
  • Tiered audit log retention: full detail for 90 days; compressed summary for 7 years.

Indicative Cost Range

Scale | Monthly Compute | Queue + Storage | Total Monthly
Small (100K events/mo, 5 subscribers avg) | $200–$600 | $100–$300 | $300–$900
Medium (10M events/mo, 10 subscribers avg) | $2,000–$5,000 | $500–$1,500 | $2,500–$6,500
Large (500M events/mo, 10 subscribers avg) | $15,000–$40,000 | $5,000–$15,000 | $20,000–$55,000

12. Trade-Off Analysis

Architectural Options Comparison

Option | Reliability | Latency | Coupling | Operations Overhead | Best For
Option A — Webhook (this pattern) | High (retry + DLQ) | Seconds to minutes | Low | Medium | External system delivery; push-based notification
Option B — Polling API | Medium (depends on poll interval) | Poll interval latency | Medium | Low (receiver manages) | Receivers that cannot accept inbound connections
Option C — Shared Message Queue (EAAPL-INT001) | Very High | Sub-second | Low | High (broker infrastructure) | Internal enterprise integration; high throughput
Option D — Synchronous API | High | Sub-second | High | Low | Low-latency requirements; tight coupling acceptable

Architectural Tensions

Tension | Trade-Off | Resolution
Retry aggressiveness vs. receiver overload | More retries = higher delivery guarantee; aggressive retries can overload recovering receivers | Exponential backoff with jitter; circuit breaker per subscription on persistent failure
Payload completeness vs. size | Full AI metadata in every payload enables receiver autonomy; large payloads increase delivery cost and latency | Include all AI-specific fields in base payload; large result payloads delivered as reference + retrieval URL for bodies > 1MB
Security strictness vs. onboarding friction | HMAC + TLS + replay protection requires receiver implementation effort | Provide SDK and reference implementation for major languages to reduce onboarding friction

13. Failure Modes

Failure | Likelihood | Impact | Detection | Recovery
Receiver endpoint returns 5xx consistently | Medium | Medium: retries accumulate; DLQ fills | Delivery failure rate alert; retry queue depth | Investigate receiver; DLQ replay after fix
HMAC secret mismatch after rotation | Medium | Medium: deliveries rejected with 401 during grace period | 401 response spike in audit log | Grace period allows both old and new secrets; alert if 401s persist beyond rotation window
Clock skew causing timestamp replay rejection | Low | Low: deliveries rejected unnecessarily | 400 response spike; error message in receiver log | Synchronise receiver clock to NTP; increase replay window from 5 to 10 minutes if needed
Dispatcher memory leak causing OOM | Low | High: dispatcher restarts; in-flight deliveries retried | Container OOM alert | Kubernetes OOM restart; pending events redelivered from durable queue
DLQ not monitored (silent result loss) | High (if not configured) | High: AI results never reach consumers | DLQ size metric (if monitored) | Mandatory DLQ monitoring with alert; no exceptions
Receiver implements non-idempotent processing | Medium | Medium: duplicate processing on retry | Duplicate records in receiver's database | Receiver must implement event_id deduplication; SDK provides reference implementation

Cascading Failure Scenarios

  • Mass receiver outage + no auto-pause + large event volume: All subscribers return 5xx → 7 retries per event × (events per hour) → retry queue depth spikes → dispatcher CPU saturated by retry attempts → new events cannot be dispatched → AI processing results stack up in delivery queue. Mitigation: subscription auto-pause after configurable failure threshold; dispatcher rate limit per subscription.
  • DLQ not monitored + regulated workflow: Systematic delivery failure → DLQ accumulates → regulated workflow (AML screening result, credit decision result) never delivered → receiving system operates without AI results → compliance gap → regulatory inquiry. Mitigation: DLQ monitoring is non-optional for regulated event types; alert on first DLQ entry, not on threshold.

14. Regulatory Considerations

APRA CPS 230 — Operational Risk

  • Clause 36: Webhook delivery for AI results embedded in critical business workflows (credit decisions, AML alerts) must have documented RTO/RPO. The retry architecture with DLQ provides documented maximum delivery latency (7 attempts over 24 hours).
  • Clause 52: Webhook delivery platform providers (Svix, Hookdeck) are third-party service providers under CPS 230 if the enterprise relies on them for critical AI result delivery.

APRA CPS 234 — Information Security

  • Clause 15: HMAC signature, TLS 1.3, secrets manager for shared secrets, and subscription-scoped keys implement proportional information security controls for webhook delivery of AI results.
  • Clause 36: HMAC validation failures at receivers are potential security incidents — payload tampering or credential theft — and must be assessed under the CPS 234 incident notification framework.

Australian Privacy Act 1988

  • APP 6: Delivering AI results containing personal data (inferred risk scores, classification results about individuals) to third-party systems via webhook is a disclosure under APP 6 — must be within the scope of the collection purpose or a specific exception.
  • APP 11: Webhook payloads in transit are covered by the security obligation; TLS 1.3 mandatory.

EU AI Act (2024)

  • Article 12 (Record-keeping): Webhook delivery audit log (every attempt, event_id, subscription, timestamp, status) satisfies the logging requirement for AI output delivery in high-risk AI systems.
  • Article 13 (Transparency): ai_model_id, ai_model_version, and ai_confidence_score in every webhook payload provide the transparency information required for AI systems influencing decisions about natural persons.

ISO 42001

  • Clause 8.5 (AI System Lifecycle): Model version tracking in webhook payloads supports the AI lifecycle management requirement — receivers can correlate results with model versions for performance monitoring.

NIST AI RMF (2023)

  • MEASURE 2.5: Delivery success rate, retry rate, and DLQ rate are performance measurements of the AI output delivery system — required under NIST AI RMF for measuring AI system reliability.

15. Reference Implementations

AWS

  • Dispatcher: AWS Lambda triggered by SQS queue (delivery job per event × subscription)
  • Retry Scheduler: SQS delay queues (configurable delay per retry attempt)
  • Subscription Manager: Amazon API Gateway + Lambda + DynamoDB subscriptions table
  • Idempotency Store (receiver): Amazon DynamoDB with TTL attribute on event_id records
  • DLQ: SQS DLQ connected to retry queue; CloudWatch alarm on DLQ depth
  • Audit Logger: CloudWatch Logs + S3 for long-term retention
  • Secrets: AWS Secrets Manager (one secret per subscription_id)

Azure

  • Dispatcher: Azure Functions (Service Bus triggered) or Azure Event Grid (native webhook delivery with retry)
  • Retry Scheduler: Azure Service Bus scheduled messages; or Azure Event Grid built-in retry
  • Subscription Manager: Azure API Management + Functions + Cosmos DB subscriptions collection
  • Idempotency Store (receiver): Azure Cache for Redis with TTL or Cosmos DB with TTL
  • DLQ: Azure Service Bus dead-letter queue; Azure Monitor alert on DLQ depth
  • Audit Logger: Application Insights + Azure Blob Storage for long-term retention
  • Secrets: Azure Key Vault (one secret per subscription_id)

GCP

  • Dispatcher: Cloud Run service (Pub/Sub triggered)
  • Retry Scheduler: Pub/Sub retry with Cloud Tasks for custom retry schedules
  • Subscription Manager: Cloud Run API + Firestore subscriptions collection
  • Idempotency Store (receiver): Memorystore (Redis) with TTL or Firestore with TTL
  • DLQ: Pub/Sub dead-letter topic; Cloud Monitoring alert on DLQ subscription undelivered count
  • Audit Logger: Cloud Logging + Cloud Storage for long-term retention
  • Secrets: Secret Manager (one secret per subscription_id)

On-Premises / Private Cloud

  • Dispatcher: Custom Python/Node.js service on Kubernetes
  • Retry Scheduler: Redis sorted set (score = next_attempt_unix_timestamp) + scheduler loop
  • Subscription Manager: FastAPI service + PostgreSQL subscriptions table
  • Idempotency Store (receiver): Redis with TTL or PostgreSQL with expiry column
  • DLQ: PostgreSQL dlq_events table with full event and attempt history; Grafana alert
  • Audit Logger: Fluentd → Elasticsearch → Kibana; 7-year retention with ILM policy
  • Secrets: HashiCorp Vault (one KV entry per subscription_id)
  • Commercial option: Svix (self-hosted Docker) provides all components above with management UI
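The "sorted set plus scheduler loop" idea in the on-premises retry scheduler can be illustrated without Redis. The sketch below uses Python's heapq as an in-process stand-in for a sorted set scored by next_attempt_unix_timestamp; the class and method names are hypothetical, and a production scheduler would need the durability that an in-memory heap lacks.

```python
import heapq


class RetryQueue:
    """In-memory stand-in for a sorted set keyed by next attempt time."""

    def __init__(self):
        self._heap = []
        self._sequence = 0  # tie-breaker so equal timestamps pop in insert order

    def schedule(self, delivery_id, next_attempt_unix_timestamp):
        """Register a delivery to be retried at or after the given time."""
        heapq.heappush(
            self._heap,
            (next_attempt_unix_timestamp, self._sequence, delivery_id),
        )
        self._sequence += 1

    def pop_due(self, now_unix_timestamp):
        """Remove and return every delivery whose retry time has arrived."""
        due = []
        while self._heap and self._heap[0][0] <= now_unix_timestamp:
            due.append(heapq.heappop(self._heap)[2])
        return due

    def __len__(self):
        return len(self._heap)
```

A scheduler loop would call `pop_due(time.time())` on a short interval, hand each returned delivery back to the dispatcher, and use the queue length as the retry queue depth metric.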

16. Related Patterns

Pattern | Relationship | Notes
EAAPL-INT001 — Enterprise AI Service Bus | Complementary | Service bus for internal enterprise integration; webhook for external or cross-organisational delivery
EAAPL-INT005 — Batch AI Processing | Enables | Batch jobs use webhook pattern to notify consuming systems of job completion and result availability
EAAPL-INT007 — AI Circuit Breaker | Enables | Circuit breaker wraps the HTTP delivery call in the dispatcher; opens on persistent receiver failures
EAAPL-INT008 — Bidirectional AI Sync | Complementary | Webhook is one delivery channel for sync events; AI sync pattern uses it for push-based update delivery

17. Maturity Assessment

Overall Maturity: Proven

Dimension | Score (1–5) | Justification
Architectural Completeness | 5 | HMAC security, idempotency, retry schedule, DLQ, subscription management, AI-specific payload fields all specified
Operational Readiness | 5 | SLOs, monitoring, incident response, DR all defined; well-established operational pattern
Security Coverage | 5 | HMAC + timestamp + event_id triple defence; TLS mandatory; secrets management; OWASP LLM Top 10 addressed
Governance Coverage | 4 | Audit trail, model version tracking, usage policy; human approval gate is receiver responsibility
Cost Predictability | 5 | Low-complexity infrastructure; costs are predictable and low relative to other integration patterns
Implementation Complexity | 4 | Low-medium complexity: well-understood pattern; HMAC implementation requires care; idempotency at receiver often missed
Industry Validation | 5 | Universal pattern in production across all industries; GitHub, Stripe, Twilio webhook patterns widely studied

18. Revision History

Version | Date | Author | Changes
1.0 | 2026-06-12 | EAAPL Working Group | Initial publication (integration patterns series)