EAAPL-INT001 — Enterprise AI Service Bus

Tags: event-driven, asynchronous, enterprise-only, high-complexity
Status: Proven | Version: 1.0 | Domain: Integration

1. Executive Summary

The Enterprise AI Service Bus pattern establishes an event-driven integration backbone that routes, mediates, and governs AI capability consumption across the enterprise. Rather than allowing each business unit to wire directly to model providers, the pattern inserts a durable, schema-governed event mesh between AI producers (models, pipelines, agents) and AI consumers (applications, dashboards, downstream processes).

The pattern extends the CloudEvents 1.0 specification with AI-specific fields—model identity, prompt version, token usage, confidence score, latency, and cost—ensuring that every AI inference event is a first-class, auditable artefact. Topic design decouples consumers from model changes: one topic per AI use-case domain, not per model, so upgrading GPT-4 to GPT-4o does not require re-wiring 30 downstream subscribers.

For CIOs and CTOs, the bus provides three strategic outcomes: (1) unified cost visibility across all AI workloads through event-level cost attribution; (2) replay capability to reprocess historical inputs when a better model becomes available; (3) a single enforcement point for data classification, rate limiting, and policy compliance before any AI event reaches a consumer.

2. Problem Statement

Business Problem

AI capabilities are being procured and integrated independently by individual teams. There is no central visibility into total AI spend, no consistent governance of what data enters AI models, and no mechanism to upgrade models without coordinated redeployment across all consuming systems.

Technical Problem

Point-to-point integrations between applications and AI APIs create a tangled dependency graph. Each integration handles retries, error logging, cost tracking, and schema evolution differently. When a model API changes its response format or is deprecated, every consuming application must be updated independently.

Symptoms

Multiple teams have separate API keys for the same AI provider with no consolidated billing.
A model deprecation notice causes a multi-team incident requiring weeks of parallel migration work.
There is no audit trail linking a business decision to the specific AI model version and prompt that produced it.
AI inference costs are allocated to cloud infrastructure budgets rather than business unit P&Ls.
Failed AI inference events are silently discarded, making root cause analysis impossible.

Cost of Inaction

Financial: Duplicate AI spend across business units; inability to negotiate volume discounts without consolidated usage data. Typical over-spend: 30–60% of actual AI API cost.
Operational: Every model upgrade requires coordinated change across all consuming teams — 4 to 12 weeks of migration effort per model generation.
Risk: No audit trail for AI-assisted decisions exposes the organisation to regulatory non-compliance under EU AI Act Article 13 (transparency) and APRA CPS 230 operational risk standards.
Strategic: Inability to replay historical workloads with improved models forfeits compounding model improvement value.

3. Context

When to Apply

The enterprise has ≥3 distinct teams consuming AI capabilities.
AI inference is embedded in business-critical workflows where auditability is required.
The organisation operates under financial services, healthcare, or government regulatory regimes.
Model upgrade cycles must not require coordinated consumer redeployment.
Cost attribution to business units is a finance or governance requirement.

When NOT to Apply

Single-team AI workload with no cross-system integration.
Proof-of-concept or exploratory AI workloads where operational overhead is not justified.
Ultra-low-latency requirements (< 50ms) where broker overhead is architecturally incompatible.
Simple request/response integrations where event-driven complexity adds no value.

Prerequisites

A mature enterprise messaging platform (Kafka, Azure Service Bus, AWS EventBridge, Pub/Sub).
A schema registry capable of enforcing Avro, Protobuf, or JSON Schema evolution compatibility.
Centralised secrets management for AI provider API keys.
Observability platform capable of ingesting event-level metrics.

Industry Applicability

Industry	Applicability	Primary Driver
Financial Services	High	Regulatory auditability, cost attribution, model risk governance
Government	High	Data classification enforcement, audit trail requirements
Healthcare	High	PHI data governance, model version traceability for clinical decisions
Retail / eCommerce	Medium	Multi-team AI consumption, cost management
Telecommunications	Medium	High-volume event streams, multi-domain AI use cases
Startups (< 50 engineers)	Low	Overhead exceeds benefit at this scale

4. Architecture Overview

The Enterprise AI Service Bus is a layered event-driven architecture consisting of five logical planes: the ingestion plane, the governance plane, the routing plane, the processing plane, and the consumer plane.

Ingestion Plane. AI event producers (applications initiating AI inference requests) publish to the bus using an extended CloudEvents envelope. The CloudEvents 1.0 required attributes (id, source, specversion, type) and the optional time and datacontenttype attributes are preserved intact. The AI extension fields are added as CloudEvents extension attributes: ai_model_id, ai_model_version, ai_prompt_version, ai_token_usage_prompt, ai_token_usage_completion, ai_confidence_score, ai_latency_ms, ai_cost_usd, ai_use_case_domain, ai_data_classification. CloudEvents 1.0 restricts attribute names to lowercase ASCII letters and digits, so a strictly conformant binding would carry these names without underscores (for example, aimodelid); the underscore spellings should be read as logical field names. Producers never call AI provider APIs directly. The AI SDK client library handles envelope construction, ensuring extension field completeness before the event is published.
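As a rough illustration of the envelope described above, the following Python sketch builds a request event as a plain dictionary. The function name `build_request_event`, the choice of which extension fields are mandatory at request time, and the example source and type values are assumptions made for illustration; the pattern does not prescribe an SDK API.

```python
import uuid
from datetime import datetime, timezone

# Assumption: only these two extension fields must be known when the request
# is created; model, token, latency, and cost fields are filled in later.
REQUIRED_AT_REQUEST = ("ai_use_case_domain", "ai_data_classification")


def build_request_event(source, event_type, data, extensions):
    """Return a CloudEvents 1.0 style dict carrying the AI extension fields."""
    missing = [name for name in REQUIRED_AT_REQUEST if name not in extensions]
    if missing:
        raise ValueError(f"missing AI extension fields: {missing}")
    event = {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }
    # Field names are copied from the text as logical names; a real CloudEvents
    # SDK may require a lowercase alphanumeric spelling on the wire.
    event.update(extensions)
    return event


event = build_request_event(
    "/apps/loan-origination",                      # hypothetical source
    "ai.creditrisk.application-assessment.v1",
    {"applicant_ref": "A-1"},
    {"ai_use_case_domain": "creditrisk", "ai_data_classification": "internal"},
)
```

Rejecting incomplete envelopes in the client keeps malformed events off the raw topic, although the governance plane still has to re-validate because it cannot trust every producer.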

Governance Plane. A policy enforcement processor subscribes to the raw inbound topic, validates the CloudEvents schema against the schema registry, applies data classification rules (blocking PII fields from reaching models not cleared for that classification), enforces per-producer rate limits, and re-publishes validated events to the routed topic. Failed validation events are routed to the governance dead letter queue with the specific violation reason attached. This plane is the single enforcement point for the enterprise AI usage policy.
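The decision logic of the governance plane can be sketched roughly as follows. This is a simplified stand-in, not the pattern's implementation: the required-field check substitutes for real schema registry validation, and the `CLEARANCE` table, the `ai.governance.dlq.v1` topic name, and the convention that the event type doubles as the domain topic name are all assumptions.

```python
from dataclasses import dataclass

GOVERNANCE_DLQ = "ai.governance.dlq.v1"  # hypothetical topic name
REQUIRED = ("id", "source", "specversion", "type",
            "ai_use_case_domain", "ai_data_classification")
# Hypothetical clearance table: classifications each domain's model may receive.
CLEARANCE = {
    "creditrisk": {"public", "internal", "pii"},
    "customerservice": {"public", "internal"},
}


@dataclass(frozen=True)
class Decision:
    allowed: bool
    topic: str
    reason: str = ""


def evaluate(event: dict, sent_in_window: int, rate_limit: int) -> Decision:
    """Apply schema, classification, and rate-limit checks in that order."""
    missing = [name for name in REQUIRED if name not in event]
    if missing:
        return Decision(False, GOVERNANCE_DLQ, f"schema: missing {missing}")
    domain = event["ai_use_case_domain"]
    label = event["ai_data_classification"]
    if label not in CLEARANCE.get(domain, set()):
        return Decision(False, GOVERNANCE_DLQ,
                        f"classification: {label!r} not cleared for {domain!r}")
    if sent_in_window >= rate_limit:
        return Decision(False, GOVERNANCE_DLQ, "rate_limit: producer quota exhausted")
    # Assumption: the event type doubles as the domain topic name.
    return Decision(True, event["type"])
```

Returning the violation reason alongside the target topic lets the caller attach it to the dead-lettered event, which is what the compliance reporting described later relies on.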

Routing Plane. Topic design follows the domain-per-topic principle, not model-per-topic. Topics are named by business domain and event type: ai.creditrisk.application-assessment.v1, ai.customerservice.intent-classification.v1, ai.fraud.transaction-scoring.v1. This topology means upgrading the underlying model from GPT-4 to GPT-4o requires no change to topic names or consumer configurations — the model is a deployment detail of the AI inference worker, not an integration concern.
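The naming convention above is regular enough to check mechanically. The sketch below assumes the shape `ai.<domain>.<event-type>.v<version>` inferred from the three example names; the regular expression, the helper names, and the `WORKER_MODEL` mapping are illustrative only.

```python
import re

# Shape inferred from the example topic names in the text.
TOPIC_RE = re.compile(
    r"^ai\.(?P<domain>[a-z0-9]+)"
    r"\.(?P<event_type>[a-z0-9]+(?:-[a-z0-9]+)*)"
    r"\.v(?P<version>[1-9][0-9]*)$"
)


def topic_name(domain: str, event_type: str, version: int) -> str:
    """Build a domain topic name and reject anything off-convention."""
    name = f"ai.{domain}.{event_type}.v{version}"
    if TOPIC_RE.match(name) is None:
        raise ValueError(f"not a valid domain topic name: {name}")
    return name


def parse_topic(name: str):
    """Split a topic name into (domain, event_type, version)."""
    match = TOPIC_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid domain topic name: {name}")
    return match["domain"], match["event_type"], int(match["version"])


# The model lives in worker configuration, never in the topic name, so a model
# upgrade changes only this hypothetical mapping.
WORKER_MODEL = {topic_name("fraud", "transaction-scoring", 1): "gpt-4o"}
```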

Processing Plane. AI inference workers subscribe to domain topics, execute inference against the configured model provider, and publish results to result topics following the same CloudEvents envelope pattern. The result event adds ai_result, ai_result_schema_version, and ai_fallback_used extension fields. Workers are stateless and horizontally scalable. Consumer group design ensures each logical consumer role (e.g., fraud-scorer, risk-ranker) receives every event independently without competing for the same partition offset.

Consumer Plane. Downstream applications subscribe to result topics. Consumers are shielded from model provider changes, prompt changes, and inference worker implementation details. The ai_result_schema_version field enables consumers to handle multiple result schema versions concurrently during rolling upgrades.

Replay Architecture. All events (requests and results) are retained in compacted topics or object storage with a configurable retention period (recommended: 90 days for standard, 7 years for regulated use cases). Because log compaction keeps only the latest record per key, compacted topics must be keyed by event id so that distinct events are never collapsed. Replay is initiated by re-publishing retained events to a replay topic. Replay events include the original id and an ai_replay_of extension field, enabling downstream deduplication and differentiation of original vs. replayed processing.
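One possible reading of the replay metadata is sketched below: the replayed event receives a fresh id and carries the original id in `ai_replay_of`. That interpretation, and the helper names, are assumptions; the text does not say whether the replay reuses the original id as its own.

```python
import uuid


def make_replay_event(original: dict) -> dict:
    """Copy an archived event and mark it as a replay of the original."""
    replay = dict(original)                 # shallow copy; the archive entry is untouched
    replay["id"] = str(uuid.uuid4())        # assumption: replays get their own id
    replay["ai_replay_of"] = original["id"]
    return replay


def is_replay(event: dict) -> bool:
    return "ai_replay_of" in event


def origin_id(event: dict) -> str:
    """Id of the event this one ultimately represents, for deduplication."""
    return event.get("ai_replay_of", event["id"])
```

A consumer that must not act twice on the same business event can deduplicate on `origin_id`, while one that wants to compare old and new model output can branch on `is_replay`.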

Back-Pressure Handling. AI inference is significantly slower than typical event processing (50ms–30s vs <1ms for simple transforms). Back-pressure is handled via consumer lag monitoring per consumer group: when lag exceeds the configured threshold, the auto-scaler adds inference worker instances. Hard rate limits per consumer group prevent a single workload from monopolising broker throughput.
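The lag-driven scaling rule can be approximated with a small calculation. The sketch assumes the auto-scaler aims to drain the backlog within a target time; `drain_seconds`, `per_worker_rate`, and `max_workers` are hypothetical tuning parameters that do not appear in the text.

```python
import math


def desired_workers(current_workers, lag, lag_threshold,
                    per_worker_rate, drain_seconds, max_workers):
    """Worker count needed to drain `lag` events within `drain_seconds`.

    per_worker_rate is events per second handled by one inference worker.
    The function only scales up; scale-down is left to a separate policy.
    """
    if lag <= lag_threshold:
        return current_workers
    needed = math.ceil(lag / (per_worker_rate * drain_seconds))
    return min(max_workers, max(current_workers, needed))
```

For example, with a backlog of 3,000 events, workers handling 2 events per second, and a 60-second drain target, the function asks for 25 workers unless `max_workers` caps it lower.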

Dead Letter Queue Architecture. Every consumer group has a corresponding DLQ topic. Events are routed to the DLQ after the configured maximum retry count with full event context preserved: original event, error message, retry count, last failure timestamp, and the consumer group that failed. DLQ topics are monitored; alerts fire at configurable message count thresholds. A replay-from-DLQ operator enables manual investigation and reprocessing.
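The dead-letter context listed above maps naturally onto a small record. The following sketch shows one way a worker might decide between retrying and dead-lettering; the function names and the `<consumer-group>.dlq` topic naming are assumptions.

```python
from datetime import datetime, timezone


def on_failure(event, error, retry_count, max_retries, consumer_group, now=None):
    """Return ("retry", None) or ("dlq", record) after a failed attempt.

    retry_count is the number of retries already made before this failure.
    """
    if retry_count < max_retries:
        return "retry", None
    failed_at = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "original_event": event,             # full event preserved unmodified
        "error_message": str(error),
        "retry_count": retry_count,
        "last_failure_timestamp": failed_at,
        "consumer_group": consumer_group,
    }
    return "dlq", record


def dlq_topic(consumer_group: str) -> str:
    # Hypothetical naming: one DLQ topic per consumer group.
    return f"{consumer_group}.dlq"
```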

5. Architecture Diagram

flowchart TD
    subgraph Ingestion["Ingestion and Governance"]
        A[AI Event Producers]
        B[CloudEvents SDK Client]
        C[Schema Validator + Policy Enforcer]
        D[Governance DLQ]
    end
    subgraph Routing["Domain Topic Routing"]
        E[Domain Topics per Use Case]
        F[AI Inference Workers]
        G[Model Provider]
    end
    subgraph Consumers["Consumer and Archive"]
        H[Result Topics]
        I[Downstream Consumers]
        J[(Event Archive + Replay)]
    end
    A --> B
    B --> C
    C -->|violation| D
    C -->|routed| E
    E --> F
    F --> G
    G -->|result| F
    F --> H
    H --> I
    H --> J
    J -->|replay| B
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#fee2e2,stroke:#ef4444
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#fef9c3,stroke:#eab308

6. Components

Component	Type	Responsibility	Technology Options	Criticality
AI SDK Client Library	Library	CloudEvents envelope construction, extension field population, publisher abstraction	Custom SDK (Python/Java/Node), Dapr SDK	Critical
Schema Registry	Infrastructure	Enforce event schema evolution compatibility; validate inbound events	Confluent Schema Registry, AWS Glue Schema Registry, Azure Schema Registry	Critical
Message Broker	Infrastructure	Durable topic management, consumer group offsets, replay retention	Apache Kafka, Azure Service Bus Premium, AWS MSK, Google Pub/Sub	Critical
Governance Processor	Service	Schema validation, data classification enforcement, rate limiting, governance DLQ routing	Kafka Streams app, Azure Stream Analytics, custom Flink job	Critical
AI Inference Worker	Service	Topic subscription, model provider API call, result event publication, retry logic	Containerised Python/Node service, AWS Lambda, Azure Functions	High
Dead Letter Queue Processor	Service	DLQ monitoring, alerting, manual replay tooling	Custom service + alerting integration	High
Event Archive	Storage	Long-term event retention for audit and replay	Kafka compacted topics + S3/ADLS/GCS, Apache Iceberg, Delta Lake	High
Replay Operator	Service	Re-publish archived events to inbound topic with replay metadata	Custom CLI/service	Medium
Observability Collector	Infrastructure	Consume all topics to extract cost, latency, quality metrics per domain	Kafka consumer + Prometheus metrics, Datadog, Splunk	High
Consumer Group Manager	Configuration	Define and enforce consumer group isolation across domains	Kafka AdminClient, Terraform-managed topic ACLs	Medium
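To illustrate the Observability Collector's cost extraction, the sketch below totals `ai_cost_usd` per `ai_use_case_domain` over a batch of result events. It assumes result events carry both fields, which the text implies but does not state outright, and the function name is hypothetical.

```python
from collections import defaultdict


def cost_by_domain(result_events):
    """Sum ai_cost_usd per ai_use_case_domain for a batch of result events."""
    totals = defaultdict(float)
    for event in result_events:
        totals[event["ai_use_case_domain"]] += event["ai_cost_usd"]
    return dict(totals)


batch = [
    {"ai_use_case_domain": "fraud", "ai_cost_usd": 0.25},
    {"ai_use_case_domain": "fraud", "ai_cost_usd": 0.5},
    {"ai_use_case_domain": "creditrisk", "ai_cost_usd": 0.125},
]
totals = cost_by_domain(batch)
```

A production collector would more likely export these as labelled counters to the metrics platform, and would use a decimal type if the totals feed finance reports.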

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Application	Calls AI SDK Client Library with domain payload and data classification label	CloudEvents envelope with AI extension fields populated
2	AI SDK	Publishes event to `ai.raw.inbound.v1` topic	Event persisted in broker with offset
3	Governance Processor	Validates event schema against registry; checks data classification vs model clearance; checks rate limit	Validated event forwarded to domain topic OR rejected to governance DLQ
4	AI Inference Worker	Subscribes to domain topic, receives event, constructs model provider API request	Model provider API call with prompt and context
5	Model Provider	Executes inference	AI response with token counts, finish reason
6	AI Inference Worker	Constructs result CloudEvent with `ai_result`, `ai_confidence_score`, `ai_latency_ms`, `ai_cost_usd`, `ai_fallback_used`	Result event published to result topic
7	Consumer Application	Subscribes to result topic, processes AI result, updates business state	Business process continues with AI-enriched data
8	Event Archive	Subscribes to all topics; archives events to long-term storage	Immutable event log for audit and replay

Error Flow

Step	Error Condition	Detection	Recovery
3	Schema validation failure	Governance processor validation against the schema registry fails	Event routed to governance DLQ with violation detail
3	Data classification violation	Policy enforcer classification check fails	Event rejected to governance DLQ; producer alerted
4	Model provider API error (5xx)	HTTP error or timeout from provider	Retry with exponential backoff; after max retries, route to inference DLQ
4	Model provider rate limit (429)	HTTP 429 response	Back-off per Retry-After header; consumer group lag accumulates; auto-scaler adjusts
6	Result schema validation failure	Result event fails schema check	Worker logs error; original event moved to inference DLQ with error context
7	Consumer processing failure	Consumer throws exception after N retries	Consumer framework routes to consumer-group-specific DLQ
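The retry behaviour in steps 4 of the error flow can be sketched as follows. The sketch uses exponential backoff with full jitter and honours a `Retry-After` value given in seconds; the retryable status set, the default base and cap, and the function names are assumptions, and the HTTP-date form of `Retry-After` is not handled.

```python
import random

RETRYABLE = {429, 500, 502, 503, 504}   # assumed set of retryable statuses


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return float(retry_after)        # provider-specified delay wins
    # Full jitter: a random delay up to the capped exponential ceiling.
    return rng() * min(cap, base * 2 ** attempt)


def next_step(status, attempt, max_retries, retry_after=None):
    """Return ("retry", delay_seconds) or ("dlq", 0.0) for a failed call."""
    if status not in RETRYABLE or attempt >= max_retries:
        return "dlq", 0.0
    return "retry", retry_delay(attempt, retry_after)
```

Jitter matters here because many workers tend to hit the same provider limit at once; without it they would all retry in lockstep.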

8. Security Considerations

Authentication and Authorisation

All producers authenticated to broker using mTLS client certificates or SASL/SCRAM.
Topic ACLs enforced: each producer has write access only to ai.raw.inbound.v1; each inference worker has read access only to its assigned domain topics.
AI provider API keys stored in centralised secrets manager (not in event payloads); injected into worker environment at runtime.
Consumer applications have read-only ACL to their subscribed result topics only.

Secrets Management

AI provider API keys rotated on a 90-day cycle; rotation does not require worker redeployment (secrets manager dynamic injection).
Broker TLS certificates managed by PKI infrastructure with automated renewal.
Schema registry credentials managed via service accounts with least-privilege access.

Data Classification

All events tagged with data classification at source; governance processor enforces model clearance against classification.
PII-tagged events are only routed to models with verified PII data processing agreements.
Event payloads in transit encrypted (TLS 1.3); at rest encrypted (AES-256) in broker storage and event archive.

Auditability

Every event carries a globally unique id (UUID v4); the full audit trail from request to result is reconstructable by correlating on id and ai_replay_of.
Governance DLQ events include the specific policy violation reason, enabling compliance reporting on rejected AI usage attempts.

OWASP LLM Top 10 Mitigations

OWASP LLM Risk	Relevance	Mitigation in This Pattern
LLM01 — Prompt Injection	High	Governance processor validates event payload schema; free-text fields flagged for prompt injection scanning before routing to inference workers
LLM02 — Insecure Output Handling	High	Result events validated against result schema before publication; consumers receive structured, schema-typed fields not raw model output
LLM03 — Training Data Poisoning	Medium	Read-only audit trail of all training-relevant events; replay events flagged separately to prevent replay data polluting training pipelines
LLM04 — Model Denial of Service	High	Per-producer and per-consumer-group rate limits enforced by governance processor; cost spike circuit breaker triggers circuit open
LLM05 — Supply Chain Vulnerabilities	Medium	Model provider API calls go through inference workers only; SDK pinned versions in worker container images; SBOM generated per release
LLM06 — Sensitive Information Disclosure	High	Data classification enforcement prevents PII reaching uncertified models; no raw prompt or response stored in topics beyond configurable retention
LLM07 — Insecure Plugin Design	Medium	Function-calling plugins not applicable to this pattern; inference workers expose no external plugin surface
LLM08 — Excessive Agency	High	Inference workers are passive responders; no autonomous action capability; all results require consumer application to act
LLM09 — Overreliance	Medium	`ai_confidence_score` field in every result event; consumers can implement confidence thresholds before acting on AI results
LLM10 — Model Theft	Medium	API keys never in event payloads; model provider credentials not accessible to consumers; inference workers isolated in dedicated network segment

9. Governance Considerations

Responsible AI

Every AI inference event carries ai_use_case_domain enabling post-hoc analysis of AI usage by domain against ethical use policies.
Confidence scores and model version in every result event support bias monitoring per domain over time.
Human override mechanism: consumers can publish to ai.[domain].human-override.v1 topic to record cases where AI result was rejected by a human decision-maker.

Model Risk Management

Schema registry enforces that breaking prompt changes result in a new ai_prompt_version value, enabling performance comparison between prompt versions using event analytics.
Model upgrade path: deploy new inference worker version subscribing to same domain topic; run shadow mode (dual-publish old and new results to separate result topics); compare result quality before cutover.
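The shadow-mode comparison can start from something as simple as an agreement rate between the two result topics. The sketch assumes results have been collected into dictionaries keyed by the originating request id; how a result event references its request is not specified in the text, so that keying is an assumption.

```python
def agreement_rate(old_results, new_results, key="ai_result"):
    """Fraction of shared request ids where both workers gave the same result.

    Returns None when the two result sets have no request ids in common.
    """
    shared = old_results.keys() & new_results.keys()
    if not shared:
        return None
    same = sum(1 for rid in shared if old_results[rid][key] == new_results[rid][key])
    return same / len(shared)
```

Exact equality suits classification-style results; free-text or numeric results would need a domain-specific similarity measure before a cutover decision.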

Human Approval Gates

High-stakes domains (credit decisions, medical recommendations) set a requires_human_review flag in the domain topic configuration. The governance processor enriches events with this flag before routing them to inference workers, and result events carry human_review_required: true to trigger the downstream approval workflow.
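On the consumer side, the approval gate reduces to a routing decision. The sketch below also applies a confidence threshold, as suggested for `ai_confidence_score` elsewhere in this pattern; the function name, the return labels, and the choice to send results without a score to review are assumptions.

```python
def route_result(result_event: dict, min_confidence: float) -> str:
    """Return "human_review" or "auto_process" for a result event."""
    if result_event.get("human_review_required", False):
        return "human_review"
    score = result_event.get("ai_confidence_score")
    # Assumption: a missing score is treated as untrusted rather than as a pass.
    if score is None or score < min_confidence:
        return "human_review"
    return "auto_process"
```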

Policy and Traceability

AI usage policy stored in policy-as-code repository; governance processor references versioned policy definitions; policy version embedded in governance validation result.
Full event lineage from source application through governance validation through inference to consumer available via event id correlation in the event archive.

Governance Artefacts

Artefact	Owner	Update Frequency	Storage Location
AI Usage Policy (policy-as-code)	Chief AI Risk Officer	Per policy change	Policy repository (Git-backed)
Schema Registry Schemas	Platform Engineering	Per event schema change	Schema Registry + Git backup
Topic ACL Configuration	Platform Engineering	Per onboarding/offboarding	Terraform state + Git
DLQ Review Report	AI Governance Team	Weekly	Governance dashboard
Model Upgrade Decision Record	AI Platform Team	Per model version change	Architecture Decision Record repository
Cost Attribution Report	Finance / FinOps	Monthly	FinOps platform

10. Operational Considerations

Monitoring and SLOs

SLO	Target	Measurement	Alert Threshold
Event end-to-end latency (p99)	< 10s for async; < 500ms for near-real-time	Time from producer publish on the inbound topic to consumer receipt of the result event	> 15s sustained for 5 min
Consumer group lag (all groups)	< 1000 events	Broker consumer lag metric	> 5000 events accumulating
Governance rejection rate	< 0.5%	DLQ event count / total events	> 2% in any 15-min window
Inference worker availability	99.9%	Worker health check success rate	< 99.5% over 5 min
DLQ growth rate	0 net new per hour (steady state)	DLQ message count delta	Any sustained growth
Event archive completeness	100%	Archive record count vs broker offset	Any gap

Logging

Every governance processor decision logged with: event id, producer, domain, classification, policy version, decision (allow/reject), rejection reason.
Every inference worker call logged with: event id, model provider, model id, prompt version, token usage, latency, cost, success/failure.
Logs shipped to SIEM for security analysis; to observability platform for operational analysis.

Incident Response

Governance processor failure: producers continue publishing to raw topic; events accumulate until processor recovers; no data loss (broker durability). Alert fires within 60 seconds of processor unavailability.
Inference worker failure: domain topic consumer lag accumulates; auto-scaler adds new worker instances within 3 minutes; SLO breach alert if lag exceeds 5000 events.
Model provider outage: circuit breaker opens after configured error rate threshold; fallback response or human queue escalation activated; incident ticket auto-created with cost-so-far and impacted domains.

Disaster Recovery

Scenario	RTO	RPO	Recovery Procedure
Single inference worker failure	3 minutes	0 (broker retains events)	Auto-scaling replaces worker; consumer group resumes from last committed offset
Governance processor failure	5 minutes	0	Kubernetes deployment restart; events accumulate in raw topic during outage
Broker node failure	10 minutes	0 (replicated partitions)	Kafka partition leader election; consumers reconnect automatically
Full broker cluster failure	4 hours	0 only with synchronous replication; asynchronous cross-region replicas can lose in-flight events	Failover to replica cluster; update producer/consumer connection strings
Event archive corruption	24 hours	Up to retention boundary	Restore from backup; replay from broker if within retention period

Capacity Planning

Broker storage: (average event size in KB) × (events per day) × (retention days) × 3 (replication factor).
Inference worker sizing: target throughput (events/min) / per-worker throughput (events/min) = minimum worker count; add 50% headroom for burst.
Schema registry: low resource requirements; size for HA (3-node ensemble) not throughput.
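The two sizing formulas above can be worked through directly. The input figures in the example (4 KB events, one million events per day, 1,200 events per minute target) are made-up illustrations, not recommendations.

```python
import math


def broker_storage_kb(avg_event_kb, events_per_day, retention_days, replication_factor=3):
    """Broker storage in KB: event size x daily volume x retention x replication."""
    return avg_event_kb * events_per_day * retention_days * replication_factor


def min_workers(target_per_min, per_worker_per_min, headroom=0.5):
    """Minimum worker count for the target throughput, plus burst headroom."""
    base = math.ceil(target_per_min / per_worker_per_min)
    return math.ceil(base * (1 + headroom))


# 4 KB x 1,000,000 events/day x 90 days x 3 replicas = 1,080,000,000 KB (about 1.08 TB)
storage = broker_storage_kb(4, 1_000_000, 90)

# 1,200 / 100 = 12 workers, plus 50% headroom = 18 workers
workers = min_workers(1200, 100)
```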

11. Cost Considerations

Cost Drivers

Cost Driver	Description	Typical Proportion
AI Model Provider API Costs	Token-based charges for every inference event; dominant cost driver	55–70%
Managed Broker (MSK/Service Bus)	Per-partition-hour + data transfer + storage	10–20%
Inference Worker Compute	Container/function runtime for worker fleet	8–15%
Event Archive Storage	Long-term event retention in object storage	3–8%
Schema Registry	Managed service or self-hosted compute	1–3%
Observability (metrics/logs)	Event-level metric ingestion volume	3–7%

Scaling Risks

AI provider token costs scale linearly with event volume; cost spike protection requires cost-rate circuit breaker or monthly budget alerts.
Kafka storage costs can grow unexpectedly with long retention periods on high-volume topics; topic-level retention policies must be actively managed.
Inference worker auto-scaling lags behind sudden traffic spikes by 2–5 minutes; pre-warm workers for known batch jobs.
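A cost-rate circuit breaker of the kind mentioned above can be approximated with a sliding window over `ai_cost_usd`. The class name, the window mechanics, and the explicit `now` argument (seconds on any monotonic clock) are assumptions made for this sketch.

```python
from collections import deque


class CostRateBreaker:
    """Opens when spend inside a sliding time window exceeds a budget."""

    def __init__(self, max_usd, window_seconds):
        self.max_usd = max_usd
        self.window_seconds = window_seconds
        self._events = deque()           # (timestamp, cost_usd), oldest first

    def _evict(self, now):
        # Drop entries that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def record(self, now, cost_usd):
        self._events.append((now, cost_usd))
        self._evict(now)

    def is_open(self, now):
        self._evict(now)
        return sum(cost for _, cost in self._events) > self.max_usd
```

A worker would call `record` after each inference and check `is_open` before the next one, pausing or alerting while the breaker is open. A monthly budget alert remains useful as a slower second line of defence.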

Cost Optimisations

Batch small events into micro-batches in the inference worker to reduce per-call API overhead and take advantage of batch inference pricing.
Use spot/preemptible instances for non-latency-sensitive inference workers (batch domains).
Implement caching layer in inference worker for identical or near-identical prompts (semantic deduplication) — typical cache hit rate 15–30% for structured workloads.
Compress event payloads (Snappy/LZ4 for Kafka) to reduce broker storage and network costs.
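The caching optimisation can be sketched for the exact-match case. The sketch keys the cache on model id, prompt version, and a canonical form of the payload; it does not attempt the near-identical (semantic) matching mentioned above, and the class and function names are hypothetical.

```python
import hashlib
import json


def cache_key(model_id: str, prompt_version: str, payload: dict) -> str:
    """Stable key: the same model, prompt version, and payload hash identically."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    material = "\n".join((model_id, prompt_version, canonical))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class InferenceCache:
    """In-memory exact-match cache placed in front of the provider call."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, model_id, prompt_version, payload, infer):
        key = cache_key(model_id, prompt_version, payload)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = infer(payload)          # only cache misses reach the provider
        self._store[key] = result
        return result
```

Including the model id and prompt version in the key prevents a model or prompt upgrade from silently serving stale results.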

Indicative Cost Range

Scale	Monthly Infrastructure	AI Provider API	Total Monthly
Small (10M events/mo, 3 domains)	$1,500–$3,000	$5,000–$15,000	$6,500–$18,000
Medium (100M events/mo, 10 domains)	$8,000–$15,000	$40,000–$120,000	$48,000–$135,000
Large (1B+ events/mo, 30+ domains)	$40,000–$80,000	$300,000–$800,000	$340,000–$880,000

12. Trade-Off Analysis

Architectural Options Comparison

Option	Description	Latency	Cost	Governance	Complexity	Recommended For
Option A — Enterprise AI Service Bus (this pattern)	Asynchronous event bus with schema governance, domain topics, replay	500ms–30s	Medium infrastructure + AI API	Centralised, strong	High	Large enterprise, regulated industries, multi-team AI consumption
Option B — Direct AI API Integration	Each application calls AI provider API directly	100ms–10s	Low infrastructure, highest AI API	Decentralised, weak	Low	Single-team, exploratory, non-regulated
Option C — Synchronous AI Gateway	Synchronous API gateway proxying AI provider calls; no broker	200ms–15s	Medium	Medium	Medium	Medium enterprise, request/response workloads, low replay requirement

Architectural Tensions

Tension	Trade-Off	Resolution
Latency vs. Governance	Adding governance processor to event path adds 50–200ms latency	Accept latency for regulated domains; implement fast-path bypass for pre-approved, non-sensitive use cases
Topic granularity vs. Consumer flexibility	Coarse domain topics couple unrelated use cases; fine-grained topics increase management overhead	One topic per domain AND event type version; avoid sub-domain splits until consumer count justifies it
Replay completeness vs. Storage cost	Full event retention enables unlimited replay; drives storage costs	Tiered retention: 90 days hot (broker), 7 years cold (object storage with restore latency)
Schema evolution rigidity vs. Innovation speed	Strict schema compatibility slows prompt experimentation	Use schema registry for result events (consumer-facing); allow looser schema for internal inference events behind the governance plane

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Governance processor becomes unavailable	Low	High — all new events blocked from routing	Consumer lag on raw topic grows; health check fails	Kubernetes restart; events accumulate durably in broker
AI provider API key expires or is revoked	Medium	High — all inference workers fail	HTTP 401 errors from provider; inference DLQ growth	Rotate key in secrets manager; workers pick up automatically
Schema registry unavailable	Low	High — new events cannot be validated	Governance processor errors; alert fires	Read-through cache on governance processor provides short-term continuity; restore registry
Consumer group offset corruption	Very Low	Medium — some events may be reprocessed	Duplicate events in consumer application	Idempotent consumer processing (dedup on event `id`); replay from known-good offset
Back-pressure causing broker disk exhaustion	Medium	Critical — broker stops accepting new events	Broker disk usage alert	Increase broker storage; add topic retention policy enforcement; throttle producers
Model provider rate limit hit	High	Medium — inference latency increases	HTTP 429 responses; consumer lag growth	Exponential backoff; distribute load across multiple provider API keys; activate fallback model
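Idempotent consumer processing, the recovery named for offset corruption, can be sketched as a wrapper that deduplicates on the event `id`. The bounded in-memory store and the class name are assumptions; a production consumer would more likely keep seen ids in a durable store shared across instances.

```python
from collections import OrderedDict


class IdempotentHandler:
    """Wraps a handler so each event id is processed at most once."""

    def __init__(self, handler, max_ids=100_000):
        self._handler = handler
        self._seen = OrderedDict()       # remembers the most recent ids only
        self._max_ids = max_ids

    def handle(self, event) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self._seen:
            self._seen.move_to_end(event_id)
            return False
        self._handler(event)
        # Recorded only after the handler succeeds, so a failed attempt can retry.
        self._seen[event_id] = True
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)
        return True
```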

Cascading Failure Scenarios

Governance processor failure + high event volume: Raw topic fills beyond retention period → events lost. Mitigation: extend raw topic retention to 7 days; alert on raw topic consumer lag within 60 seconds.
Inference DLQ accumulation + no DLQ monitoring: Silent event loss for hours; downstream consumers starved of results, triggering application-level failures. Mitigation: DLQ monitoring and alerting is mandatory, not optional.
Model provider global outage + no circuit breaker + no fallback: All inference workers retry every event until the retry budget is exhausted → all events land in DLQ → consumers receive no results → downstream business processes halt. Mitigation: circuit breaker with fallback response is non-negotiable for production deployments.
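A minimal circuit breaker with a fallback, of the kind the last scenario calls for, might look like the sketch below. The consecutive-failure threshold, the cooldown with a single trial call, and the injectable clock are design assumptions; the related circuit breaker pattern may specify different trip conditions.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold, cooldown_seconds, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None           # None means the circuit is closed

    def call(self, infer, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                return fallback()        # open: skip the provider entirely
            # Cooldown elapsed: fall through and let one trial call run.
        try:
            result = infer()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            return fallback()
        self._failures = 0
        self._opened_at = None           # success closes the circuit
        return result
```

While the circuit is open the worker returns the fallback immediately instead of burning the retry budget, which keeps events out of the DLQ during a provider outage.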

14. Regulatory Considerations

APRA CPS 230 — Operational Risk

Clause 36 (Business Continuity): The event bus must have documented RTO/RPO for each failure scenario. Replay capability directly addresses recovery of AI processing after outages.
Clause 52 (Service Provider Management): AI model providers are third-party service providers; the governance processor enforces usage controls required under third-party risk management.

APRA CPS 234 — Information Security

Clause 15 (Information Security Controls): mTLS authentication, topic ACLs, and data classification enforcement address the requirement for controls proportional to data sensitivity.
Clause 36 (Incident Notification): Governance DLQ violations and model provider outages must be assessed as potential security incidents under CPS 234 notification obligations.

Australian Privacy Act 1988 (as amended 2024)

APP 6 (Use and Disclosure): Data classification enforcement in the governance processor operationalises the requirement to use personal information only for the primary purpose disclosed at collection.
APP 8 (Cross-border Disclosure): Events routed to offshore model providers must have the country of processing recorded in the AI extension fields; governance processor must block cross-border routing for events exceeding permitted data sharing boundaries.

EU AI Act (2024)

Article 13 (Transparency): ai_model_id, ai_model_version, and ai_prompt_version in every event support the requirement to document the AI system used in automated decisions affecting natural persons.
Article 17 (Quality Management): Schema registry enforcement, DLQ monitoring, and replay capability are evidence of a quality management system for AI outputs.
Article 12 (Record-keeping): Event archive with 7-year retention for high-risk AI use cases supports the logging obligation for high-risk AI systems.

ISO 42001 — AI Management System

Clause 6.1.2 (AI Risk Assessment): Per-domain circuit breakers and confidence score tracking operationalise the risk assessment and monitoring requirements.
Clause 8.5 (AI System Lifecycle): Prompt versioning, model version tracking, and replay capability support the AI lifecycle management requirements.

NIST AI RMF (2023)

GOVERN 1.1: AI usage policy encoded in governance processor addresses the organisational risk governance requirement.
MEASURE 2.5: Confidence score monitoring and quality degradation circuit breaker conditions implement the performance measurement requirement.
MANAGE 2.4: DLQ with full context capture and replay capability addresses the AI risk treatment and incident response requirements.

15. Reference Implementations

AWS

Broker: Amazon MSK (Kafka-compatible) with MSK Connect for governance processor
Schema Registry: AWS Glue Schema Registry
Inference Workers: AWS Lambda (event-driven) or ECS Fargate containers
DLQ: Amazon SQS DLQ connected to MSK via Kafka SQS Sink Connector
Event Archive: S3 via Kafka S3 Sink Connector; query via Athena
Observability: Amazon CloudWatch + AWS Cost Explorer for per-event cost tracking
Secrets: AWS Secrets Manager with Lambda execution role access

Azure

Broker: Azure Event Hubs (Kafka-compatible surface) or Azure Service Bus Premium
Schema Registry: Azure Schema Registry (built into Event Hubs namespace)
Inference Workers: Azure Functions (event-driven triggers) or AKS pods
DLQ: Azure Service Bus dead-letter queues
Event Archive: Azure Data Lake Storage Gen2 via Event Hubs Capture
Observability: Azure Monitor + Application Insights; Cost Management for attribution
Secrets: Azure Key Vault with managed identity binding to workers

GCP

Broker: Google Cloud Pub/Sub (native) or GKE-hosted Kafka
Schema Registry: Confluent Schema Registry on GKE or Apicurio Registry
Inference Workers: Cloud Run (event-driven) or GKE deployments
DLQ: Pub/Sub dead-letter topics with subscription-level configuration
Event Archive: Cloud Storage via Pub/Sub export; query via BigQuery external tables
Observability: Cloud Monitoring + Cloud Logging; BigQuery for cost analytics
Secrets: Secret Manager with Workload Identity binding

On-Premises / Private Cloud

Broker: Apache Kafka (self-managed) on Kubernetes via Strimzi Operator
Schema Registry: Confluent Schema Registry OSS or Apicurio Registry
Inference Workers: Kubernetes Deployments with KEDA event-driven autoscaling
DLQ: Dedicated Kafka topics with Kafka UI for manual review
Event Archive: MinIO (S3-compatible) + Apache Iceberg for query
Observability: Prometheus + Grafana + Loki stack
Secrets: HashiCorp Vault with Kubernetes auth method

16. Related Patterns

Pattern	Relationship	Notes
EAAPL-INT007 — AI Circuit Breaker	Enables	Circuit breaker per model provider is a required sub-component of each inference worker in this pattern
EAAPL-INT004 — Real-Time AI Stream Processing	Specialises	Stream processing pattern is a specific consumer topology for this bus in low-latency domains
EAAPL-INT005 — Batch AI Processing	Specialises	Batch processing is a consumer topology for this bus in high-throughput, non-latency-sensitive domains
EAAPL-INT002 — Legacy System AI Augmentation	Complementary	Legacy systems publish to and consume from this bus through adapter components
EAAPL-INT008 — Bidirectional AI Sync	Complementary	Sync pattern consumes result events from this bus to update enterprise data stores

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Justification
Architectural Completeness	5	All integration, governance, processing, and consumer concerns addressed
Operational Readiness	4	Runbook templates defined; some DR procedures require organisation-specific customisation
Security Coverage	5	mTLS, ACLs, classification enforcement, OWASP LLM Top 10 addressed
Governance Coverage	5	Policy-as-code, audit trail, model risk management, human override all included
Cost Predictability	4	Indicative ranges provided; AI API costs remain variable; budget alerting required
Implementation Complexity	3	High — requires mature messaging platform and operational tooling; not suitable for small teams
Industry Validation	4	Pattern applied in production at major financial institutions and government agencies

18. Revision History

Version	Date	Author	Changes
1.0	2026-06-12	EAAPL Working Group	Initial publication — integration patterns series
