[EAAPL-SEC001] AI Gateway
Category: Security / API Control Plane
Sub-category: Traffic Management & Policy Enforcement
Version: 2.1
Maturity: Mature
Tags: api-gateway rate-limiting authentication cost-allocation circuit-breaker policy-enforcement ai-operations
Regulatory Relevance: APRA CPS234, EU AI Act Art. 9 (Risk Management), ISO 42001 §6.1, NIST AI RMF GOVERN 1.2
1. Executive Summary
The AI Gateway pattern establishes a centralised, enterprise-grade control plane through which all AI traffic flows — inbound requests from applications and users, and outbound calls to model providers. It functions as the "first and last line of defence" for every AI interaction in the enterprise.
From a business perspective, the AI Gateway solves three compounding problems that emerge when AI usage scales without discipline: uncontrolled spend (teams independently acquire model API keys, causing budget overruns with no visibility), inconsistent security posture (each team re-invents authentication, logging, and abuse controls), and regulatory exposure (no single audit trail for AI interactions).
The gateway provides authentication and authorisation for every AI request, enforces rate limits and cost budgets per team or product, routes traffic intelligently across multiple model providers, captures structured logs for compliance, and breaks the circuit when downstream models are degraded. Organisations that deploy this pattern typically report a 30–50% reduction in AI spend waste through visibility and quota enforcement, and can produce AI audit trails for regulators within 24 hours of a regulator's request.
This pattern is the foundation upon which all other AI security and observability patterns depend. It should be the first pattern deployed in any enterprise AI programme.
2. Problem Statement
Business Problem
Enterprise organisations adopting AI at scale face ungoverned sprawl: dozens of teams independently calling OpenAI, Anthropic, Azure OpenAI, and other providers with individual API keys. There is no budget control, no unified audit trail, no abuse detection, and no single point where policy can be enforced. A single misconfigured application or compromised key can generate hundreds of thousands of dollars in model API spend within hours. Regulatory bodies (APRA, ASIC, EU regulators) increasingly require organisations to demonstrate comprehensive audit trails for AI-assisted decisions — an impossibility without a centralised control point.
Technical Problem
Without a gateway:
- Each application must independently implement auth, rate limiting, retry logic, and logging — creating N inconsistent implementations.
- Model provider credentials are distributed across dozens of services, dramatically increasing the blast radius of a credential leak.
- There is no circuit breaker: a degraded model provider cascades into application failures.
- Cost attribution is impossible: spend cannot be allocated to teams, products, or use cases.
- Routing logic (e.g., fallback to a cheaper model for low-complexity requests) must be duplicated across every consuming application.
Symptoms of Absence
- Unexplained spikes in model API bills.
- Security incidents involving leaked model API keys.
- Different applications enforcing different content policies, creating inconsistent user experiences.
- Inability to produce AI usage reports for compliance audits.
- Cascading application failures when a model provider has an outage.
- No capacity to enforce organisational AI usage policies (e.g., "no patient data to external models").
Cost of Inaction
| Dimension | Impact |
|---|---|
| Financial | Uncontrolled model API spend; potential for runaway costs from abuse or bugs |
| Regulatory | Cannot demonstrate AI audit trail to APRA/EU AI Act auditors; enforcement risk |
| Security | Distributed credentials; no unified threat detection; full blast radius on key leak |
| Operational | N × duplicated retry/rate-limit/log implementations; no unified model health visibility |
| Reputational | Policy violations reach users (harmful content, data leakage) without a filter layer |
3. Context
When to Apply
- Organisation has more than one team or application calling AI model APIs.
- AI model API spend exceeds $5,000/month or is forecast to.
- Organisation operates in a regulated industry (financial services, healthcare, government).
- Multiple model providers are in use or planned.
- Security team requires audit trails for AI interactions.
- AI applications are user-facing and require content policy enforcement.
When NOT to Apply
- Single-team proof-of-concept with a 90-day sunset — gateway adds operational overhead disproportionate to PoC scope.
- Fully offline/on-premises model inference where the model is a library call within the same process — a gateway adds latency without security benefit at the network boundary.
- When a cloud-native AI platform (e.g., Azure AI Studio with built-in APIM integration) already provides all required controls natively and the team can accept vendor lock-in.
Prerequisites
| Prerequisite | Detail |
|---|---|
| Identity Provider | OIDC/SAML IdP capable of issuing JWT tokens to calling applications |
| Secrets Management | Vault or equivalent for model provider credentials |
| Observability Stack | Log aggregation and metrics platform to receive gateway telemetry |
| Network Topology | Gateway must be reachable by all AI-consuming applications; egress to model providers permitted |
| API Catalogue | Inventory of existing AI API calls to route through the gateway |
Industry Applicability
| Industry | Applicability | Key Driver |
|---|---|---|
| Financial Services | High | APRA CPS234, audit trails, cost governance |
| Healthcare | High | Patient data controls, regulatory AI traceability |
| Government | High | Sovereignty, audit, classification enforcement |
| Retail / E-commerce | Medium | Cost control, content policy |
| Technology / SaaS | Medium | Multi-team cost allocation, developer platform |
| Education | Medium | Content policy, budget governance |
4. Architecture Overview
The AI Gateway is deployed as a horizontally scalable reverse proxy that sits at the intersection of all AI-consuming workloads and all model provider endpoints. It is not a simple HTTP proxy — it is a stateful policy engine with its own data plane (real-time request processing) and control plane (policy configuration, key management, quota administration).
Why a dedicated gateway rather than embedding controls in each application?
The fundamental architectural reason is that cross-cutting concerns — authentication, rate limiting, cost allocation, audit logging, circuit breaking — are almost always implemented inconsistently when distributed across teams. The gateway externalises these concerns into a single, auditable, independently operated service. This mirrors the established API gateway pattern for REST/GraphQL APIs, extended with AI-specific capabilities.
Request Path Design
Inbound requests arrive from applications carrying a service identity token (mTLS client certificate or JWT). The gateway's authentication middleware validates the token against the enterprise IdP before any processing occurs. This ensures that unauthenticated requests fail fast and are never forwarded to model providers — preventing credential abuse if an internal application is compromised.
After authentication, the policy engine evaluates the request against a rule set: Does this caller have permission to use this model? Does this request exceed the caller's rate quota? Does this request carry a data classification label that prohibits forwarding to the requested external provider? Policy decisions are made in-process against an in-memory policy cache (refreshed from the policy store every 60 seconds) to keep decision latency under 1ms.
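The cached lookup described above might be sketched as follows. This is a minimal illustration, not the gateway's actual engine: the `load_policies` callable stands in for the policy store, and the rule shape (caller ID mapped to a set of permitted models) is an assumption made for brevity. The refresh here happens lazily on the request path; a production gateway would more likely refresh in the background so that no request pays the reload cost.

```python
import time
from typing import Callable, Dict, Optional, Set


class PolicyCache:
    """In-process policy cache, reloaded from the policy store once the TTL expires."""

    def __init__(self, load_policies: Callable[[], Dict[str, Set[str]]],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load_policies          # hypothetical policy-store fetch
        self._ttl = ttl_seconds             # 60 s refresh interval from the text
        self._clock = clock                 # injectable for testing
        self._policies: Dict[str, Set[str]] = {}
        self._loaded_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._policies = self._load()
            self._loaded_at = now

    def is_allowed(self, caller_id: str, model: str) -> bool:
        """Default-deny: a caller with no entry may use no models."""
        self._refresh_if_stale()
        return model in self._policies.get(caller_id, set())
```

Because every decision between refreshes is a dictionary lookup, a policy change can take up to one TTL to reach a given gateway instance.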
Routing and Provider Abstraction
The gateway abstracts model provider APIs behind a unified internal schema. Consuming applications call a single internal endpoint (/v1/chat/completions) regardless of whether the request will be served by GPT-4, Claude 3.7, or an on-premises Llama deployment. The routing layer maps requests to providers based on model name, caller preference, load, cost optimisation rules, and provider health. This abstraction is critical: it allows organisations to switch providers, add fallbacks, or introduce shadow routing for model evaluation without changing consuming applications.
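As a rough sketch of the routing decision, the function below picks the cheapest healthy provider configured for a requested model. It covers only three of the inputs named above (model name, health, cost); caller preference and load are omitted. The provider names, prices, and the `ROUTES` table are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Provider:
    name: str                   # hypothetical provider identifier
    cost_per_1k_tokens: float   # illustrative price, not a real tariff
    healthy: bool = True        # would be fed by the circuit breaker


def select_provider(model: str,
                    routes: Dict[str, List[Provider]]) -> Optional[Provider]:
    """Pick the cheapest healthy provider configured for the requested model."""
    candidates = [p for p in routes.get(model, []) if p.healthy]
    if not candidates:
        return None  # the caller would translate this into a 503
    return min(candidates, key=lambda p: p.cost_per_1k_tokens)


# Hypothetical routing table: one logical model, three possible backends.
ROUTES = {
    "chat-large": [
        Provider("external-a", 0.03, healthy=False),
        Provider("external-b", 0.04),
        Provider("on-prem-llama", 0.05),
    ],
}
```

With this table, a request for `chat-large` skips the unhealthy `external-a` and lands on `external-b`, without the consuming application knowing that a fallback occurred.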
Why circuit breaking at the gateway?
Model providers have variable availability SLAs, and LLM inference latency is orders of magnitude higher than typical microservice calls. Without a circuit breaker at the gateway, a degraded provider causes cascading timeouts across all consuming applications. The gateway's circuit breaker monitors error rates and latency per provider, opens the circuit when thresholds are breached (e.g., >10% 5xx over 60 seconds), routes traffic to the fallback provider, and attempts provider recovery with exponential backoff. This dramatically improves overall application resilience.
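The state machine described above can be sketched as below. The 10% threshold and 60-second window come from the text; the minimum sample size, the cooldown values, and the doubling backoff are assumptions chosen for illustration. The sketch tracks only error outcomes (not latency) and does not limit the number of concurrent probes in the HALF-OPEN state.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class CircuitBreaker:
    """Per-provider breaker: CLOSED -> OPEN -> HALF-OPEN -> CLOSED."""

    def __init__(self, error_threshold: float = 0.10, window_seconds: float = 60.0,
                 min_requests: int = 20, base_cooldown: float = 5.0,
                 max_cooldown: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.state = "CLOSED"
        self._threshold = error_threshold
        self._window = window_seconds
        self._min_requests = min_requests    # assumed guard against tiny samples
        self._base_cooldown = base_cooldown  # assumed value
        self._max_cooldown = max_cooldown    # assumed value
        self._cooldown = base_cooldown
        self._clock = clock
        self._opened_at = 0.0
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_error)

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self._clock() - self._opened_at >= self._cooldown:
                self.state = "HALF-OPEN"  # cooldown elapsed: let probes through
                return True
            return False  # caller should route to the fallback provider
        return True

    def record(self, is_error: bool) -> None:
        now = self._clock()
        if self.state == "OPEN":
            return  # ignore late results while the circuit is open
        if self.state == "HALF-OPEN":
            if is_error:
                # Probe failed: double the wait before the next probe.
                self._cooldown = min(self._cooldown * 2, self._max_cooldown)
                self._trip(now)
            else:
                self._reset()
            return
        self._events.append((now, is_error))
        while self._events and now - self._events[0][0] > self._window:
            self._events.popleft()
        total = len(self._events)
        errors = sum(1 for _, err in self._events if err)
        if total >= self._min_requests and errors / total > self._threshold:
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = "OPEN"
        self._opened_at = now
        self._events.clear()

    def _reset(self) -> None:
        self.state = "CLOSED"
        self._cooldown = self._base_cooldown
        self._events.clear()
```

The router would call `allow_request()` before forwarding and `record()` once the provider responds or times out.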
Cost Allocation Architecture
Each request is tagged with a cost allocation key (team, product, use-case, user) at ingress. The gateway calculates cost in real-time by multiplying token counts (extracted from the provider response) by the current pricing table (refreshed daily from a configuration store). Cost events are written to a time-series cost ledger. Budget monitors subscribe to this ledger and emit alerts or enforcement actions (soft-block, hard-block) when budgets are approached or exceeded. This gives finance teams the ability to allocate AI spend on a monthly basis without manual reconciliation.
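The cost arithmetic and budget check might look like the sketch below. The pricing values and model name are made up for illustration, and the 80% soft-block ratio is an assumption; the pattern itself only says that enforcement happens as budgets are "approached or exceeded". `Decimal` is used so that small per-request amounts accumulate without floating-point drift.

```python
from decimal import Decimal
from typing import Dict, Tuple

# Hypothetical pricing table: USD per 1,000 (prompt, completion) tokens.
PRICING: Dict[str, Tuple[Decimal, Decimal]] = {
    "model-a": (Decimal("0.0100"), Decimal("0.0300")),
}


def request_cost(model: str, prompt_tokens: int, completion_tokens: int,
                 pricing: Dict[str, Tuple[Decimal, Decimal]] = PRICING) -> Decimal:
    """Cost = token counts from the provider response x current price per 1k tokens."""
    prompt_price, completion_price = pricing[model]
    total = prompt_tokens * prompt_price + completion_tokens * completion_price
    return total / Decimal(1000)


def budget_action(spent: Decimal, budget: Decimal,
                  soft_ratio: Decimal = Decimal("0.8")) -> str:
    """Map cumulative spend to an enforcement action (soft_ratio is an assumed value)."""
    if spent >= budget:
        return "hard-block"
    if spent >= budget * soft_ratio:
        return "soft-block"
    return "allow"
```

For example, 1,000 prompt tokens and 500 completion tokens on `model-a` cost (1000 × 0.01 + 500 × 0.03) / 1000 = 0.025 USD under these illustrative prices.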
Audit Logging
Every request and response traverses the audit logger, which writes a structured log record to an immutable audit log store (append-only, tamper-evident). The log record captures: caller identity, request timestamp, model requested, model served, token counts, cost, policy decisions made, response status, and a truncated hash of the request content (full content logging is optional and controlled by data classification). This log is the evidentiary foundation for regulatory compliance.
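One common way to make an append-only log tamper-evident is to chain each record to the hash of the previous one. The pattern does not prescribe this technique, so the sketch below is an illustration of the idea rather than the gateway's actual mechanism; the field names and the in-memory list standing in for the log store are assumptions. It stores a full SHA-256 content hash rather than a truncated one.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def content_hash(body: bytes) -> str:
    """Hash of the request content, stored instead of the content itself."""
    return hashlib.sha256(body).hexdigest()


def _digest(record: Dict) -> str:
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


def append_record(log: List[Dict], fields: Dict) -> Dict:
    """Append a record that commits to the hash of the record before it."""
    prev = log[-1]["record_hash"] if log else GENESIS
    record = dict(fields, prev_hash=prev)
    record["record_hash"] = _digest(record)  # digest computed before this key is added
    log.append(record)
    return record


def verify_chain(log: List[Dict]) -> bool:
    """Recompute every hash; any edited or removed record breaks the chain."""
    prev = GENESIS
    for rec in log:
        body = {k: v for k, v in rec.items() if k != "record_hash"}
        if rec["prev_hash"] != prev or _digest(body) != rec["record_hash"]:
            return False
        prev = rec["record_hash"]
    return True
```

Storage-level immutability (for example, object lock) and hash chaining are complementary: the first prevents modification, the second makes any modification detectable.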
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| mTLS / JWT Auth | Security Middleware | Validates caller identity on every request; terminates unauthenticated requests immediately | Envoy, Kong, custom Go/Rust middleware | Critical |
| Policy Engine | Decision Engine | Evaluates per-request policy rules (model access, data classification, content type) against policy store | Open Policy Agent (OPA), Cedar, custom rule engine | Critical |
| Rate Limiter | Traffic Control | Enforces per-caller, per-model, and global token/request quotas; returns 429 on breach | Redis + Lua, Envoy rate limit service, Kong rate limiting plugin | Critical |
| Request Router | Routing Layer | Maps requests to model providers based on model name, load, cost, health; enables fallback routing | Envoy, Kong, NGINX + Lua, custom Go service | High |
| Prompt Firewall | Security Filter | Inline prompt injection and policy violation detection (see EAAPL-SEC002) | Custom classifier, AWS Guardrails, Azure Content Safety | High |
| Output Filter | Security Filter | Post-generation content and PII filtering (see EAAPL-SEC006) | Microsoft Presidio, AWS Comprehend, custom NLP pipeline | High |
| Cost Calculator | Cost Accounting | Real-time cost computation from token counts × pricing table; writes cost events | Custom service with pricing API, FinOps platform integration | Medium |
| Circuit Breaker | Resilience | Monitors provider health; opens/closes circuit; routes to fallback on failure | Resilience4j, Envoy outlier detection, Hystrix (now in maintenance mode) | High |
| Audit Logger | Compliance | Writes immutable, structured audit records for every request/response | Kafka → S3/GCS immutable store, Splunk, Datadog | Critical |
| Policy Store | Configuration | Authoritative store of gateway policies (model ACLs, data classification rules, content policies) | OPA Bundles, AWS S3 + IAM, HashiCorp Vault | Critical |
| Quota Store | State Store | Real-time quota counters per caller, per model, per period | Redis Cluster, DynamoDB, Dragonfly | High |
| Key Vault | Secrets | Stores and dispenses model provider credentials; see EAAPL-SEC008 | HashiCorp Vault, AWS Secrets Manager, Azure Key Vault | Critical |
| Cost Ledger | Financial | Time-series store of cost events for dashboarding and budget enforcement | InfluxDB, Prometheus, BigQuery, Snowflake | Medium |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Consumer Application | Sends HTTP POST to gateway /v1/chat/completions with mTLS client cert + JWT Bearer token in Authorization header | Inbound request at gateway TLS terminator |
| 2 | Auth Middleware | Validates mTLS client certificate against CA; validates JWT signature and claims (iss, aud, exp, scope) against IdP JWKS endpoint | Authenticated identity context attached to request |
| 3 | Policy Engine | Looks up caller in policy store; evaluates model access ACL, data classification label on request, and content type rules | ALLOW or DENY decision; deny returns 403 immediately |
| 4 | Rate Limiter | Atomically increments caller's token and request counters in Redis; checks against quota for current period | ALLOW or 429 Too Many Requests |
| 5 | Prompt Firewall | Scans request body for prompt injection patterns, PII, and policy violations | Sanitised request body or 400 Bad Request |
| 6 | Request Router | Evaluates routing rules; selects target model provider based on requested model, provider health, and load | Routing decision + provider credentials retrieved from vault |
| 7 | Circuit Breaker | Checks provider circuit state (CLOSED/OPEN/HALF-OPEN); if OPEN, routes to fallback provider | Forwarded request or fallback routing |
| 8 | Model Provider | Processes request; returns response with token usage metadata | Raw model response |
| 9 | Output Filter | Inspects response for PII leakage, harmful content, and policy violations | Filtered response or 502 if blocked |
| 10 | Cost Calculator | Extracts prompt_tokens + completion_tokens from response; multiplies by provider pricing; writes cost event | Cost-annotated response headers |
| 11 | Audit Logger | Writes structured log record (identity, model, tokens, cost, policy decisions, response status, content hash) | Audit record in immutable log store |
| 12 | Consumer Application | Receives filtered, cost-annotated response | Business logic continues |
Error Flow
| Error Condition | Gateway Behaviour | HTTP Status | Alert Triggered |
|---|---|---|---|
| Invalid/expired JWT | Reject at auth middleware; log failed auth attempt | 401 | Auth anomaly alert if >10/min |
| Policy DENY | Reject at policy engine; log policy violation | 403 | Policy violation alert |
| Rate limit exceeded | Reject at rate limiter; return Retry-After header | 429 | Quota alert to team budget owner |
| Prompt injection detected | Reject at prompt firewall; log sanitised indicator | 400 | Security incident alert |
| Provider circuit OPEN | Route to fallback; if no fallback, return 503 | 503 | Provider health alert |
| Output policy violation | Block response; return opaque error to caller | 502 | Content policy alert |
| Vault unavailable | Fail closed: all requests rejected until vault recovers | 503 | Critical infrastructure alert |
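The ingress half of the primary flow, together with the status codes in the error table, can be read as an ordered chain of checks that stops at the first rejection. The sketch below shows only that control flow: the request is a plain dict whose boolean flags (`jwt_valid`, `policy_allow`, and so on) are placeholders standing in for the real middleware results.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Rejection:
    stage: str
    status: int


Stage = Callable[[dict], Optional[Rejection]]


def check_auth(req: dict) -> Optional[Rejection]:
    return None if req.get("jwt_valid") else Rejection("auth", 401)


def check_policy(req: dict) -> Optional[Rejection]:
    return None if req.get("policy_allow") else Rejection("policy", 403)


def check_rate_limit(req: dict) -> Optional[Rejection]:
    return None if req.get("within_quota") else Rejection("rate_limit", 429)


def check_prompt(req: dict) -> Optional[Rejection]:
    return Rejection("prompt_firewall", 400) if req.get("injection_detected") else None


# Order mirrors steps 2-5 of the primary flow: the cheapest, most decisive checks run first.
INGRESS_PIPELINE: List[Stage] = [check_auth, check_policy, check_rate_limit, check_prompt]


def run_ingress(req: dict, pipeline: List[Stage] = INGRESS_PIPELINE) -> Optional[Rejection]:
    """Return the first rejection, or None if the request may be forwarded."""
    for stage in pipeline:
        rejection = stage(req)
        if rejection is not None:
            return rejection
    return None
```

Because the chain short-circuits, an unauthenticated request never consumes quota and never reaches the prompt firewall.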
8. Security Considerations
Authentication & Authorisation
- Mutual TLS (mTLS): All inbound connections from consumer applications require a client certificate issued by the enterprise CA. This provides cryptographic identity that cannot be forged with a stolen JWT alone.
- JWT Validation: Bearer tokens carry caller identity, scope (which models are accessible), and expiry. Tokens are validated against the IdP's JWKS endpoint with a local cache (refreshed every 5 minutes). Short token lifetimes (15–60 minutes) limit the window of a compromised token.
- Service-to-Service Identity: Consumer applications authenticate as service principals, not human users. Human-facing applications should not forward end-user tokens to the gateway — the application authenticates as itself and includes user context as a claim.
- Scope-Based Model Access: JWT scopes define which model families a caller may access. A customer service application should not have scope to access GPT-4 if it only requires GPT-3.5-turbo. Principle of least privilege applies to model access.
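The claim checks in the bullets above might be sketched as follows. The function assumes the token's signature has already been verified against the IdP's JWKS by a JWT library, and it only inspects the decoded claims. The scope naming (`models:small`) and the issuer URL in the usage note are hypothetical; the space-delimited `scope` string follows the usual OAuth 2.0 convention.

```python
import time
from typing import Mapping, Optional


def validate_claims(claims: Mapping, *, issuer: str, audience: str,
                    required_scope: str, now: Optional[float] = None) -> bool:
    """Check iss, aud, exp and scope on an already signature-verified JWT payload."""
    now = time.time() if now is None else now

    if claims.get("iss") != issuer:
        return False

    # 'aud' may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if audience not in audiences:
        return False

    # Reject tokens with a missing, malformed, or elapsed expiry.
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now >= exp:
        return False

    # Least privilege: the caller needs the scope for this model family.
    scopes = str(claims.get("scope", "")).split()
    return required_scope in scopes
```

A caller holding `scope: "models:small"` would pass a check for `required_scope="models:small"` and fail one for `required_scope="models:large"`, which is how the gateway keeps a customer service application away from models it does not need.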
Secrets Management
- Model provider API keys are never stored in gateway configuration files, environment variables, or source code. All credentials are retrieved at runtime from a vault (see EAAPL-SEC008).
- Gateway retrieves short-lived, dynamically generated credentials where the provider supports it (e.g., AWS Bedrock via IAM role assumption, Azure OpenAI via managed identity).
- Gateway logs never include raw API keys; only key IDs are logged for traceability.
Data Classification
- Requests carrying data classification labels above a permitted threshold for a given provider are blocked by the policy engine. For example, requests labelled CONFIDENTIAL may not be routed to external commercial model providers; only to on-premises inference endpoints.
- Data classification labels are injected by the consuming application or inferred by the prompt firewall (EAAPL-SEC005).
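The threshold rule can be expressed as an ordered comparison between a request's label and a per-provider ceiling. In the sketch below only CONFIDENTIAL comes from the text; the other levels, the provider class names, and the ceiling values are assumptions for illustration.

```python
from enum import IntEnum
from typing import Dict


class Classification(IntEnum):
    # Only CONFIDENTIAL is named in the pattern; the other levels are assumed.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical ceilings: the highest label each provider class may receive.
PROVIDER_CEILING: Dict[str, Classification] = {
    "external-commercial": Classification.INTERNAL,
    "on-premises": Classification.RESTRICTED,
}


def may_route(label: Classification, provider: str,
              ceilings: Dict[str, Classification] = PROVIDER_CEILING) -> bool:
    """Allow routing only if the label does not exceed the provider's ceiling."""
    ceiling = ceilings.get(provider)
    # Unknown providers are denied rather than assumed safe.
    return ceiling is not None and label <= ceiling
```

Under these assumed ceilings, a CONFIDENTIAL request is refused for `external-commercial` and accepted for `on-premises`, matching the example in the first bullet.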
Encryption
- All traffic in transit uses TLS 1.3. TLS 1.0/1.1/1.2 are disabled.
- Audit logs are encrypted at rest using AES-256. Log encryption keys are managed separately from gateway operational keys.
- Request/response content stored in audit logs (if enabled) is encrypted with per-record keys, limiting the impact of a log store breach.
Auditability
- Every request generates an audit record with: caller identity, timestamp (nanosecond precision), model requested, model served, token counts, cost, policy decisions, response HTTP status, and a SHA-256 hash of the request body.
- Audit logs are written to an append-only store (S3 Object Lock, Azure Immutable Blob Storage) with a minimum 7-year retention for regulated entities.
OWASP LLM Top 10 Coverage
| OWASP LLM Risk | Gateway Mitigation | Coverage |
|---|---|---|
| LLM01: Prompt Injection | Prompt Firewall (SEC002) inline at gateway; pattern and semantic analysis | High |
| LLM02: Insecure Output Handling | Output Filter (SEC006) inspects all responses before delivery | High |
| LLM03: Training Data Poisoning | Out of scope for gateway (training-time control); gateway logs anomalous output patterns for investigation | Low |
| LLM04: Model Denial of Service | Rate limiting per caller and globally; circuit breaker prevents provider overload from cascading | High |
| LLM05: Supply Chain Vulnerabilities | Provider allow-list enforced at routing layer; only approved providers are routable | Medium |
| LLM06: Sensitive Information Disclosure | Output Filter detects PII in responses; input sanitisation prevents PII from entering prompts | High |
| LLM07: Insecure Plugin Design | Secure Tool Invocation pattern (SEC004) enforced as a gateway policy for agent tool calls | Medium |
| LLM08: Excessive Agency | Human approval gates can be enforced at gateway for high-risk request types | Medium |
| LLM09: Overreliance | Out-of-scope for gateway; addressed in application layer | None |
| LLM10: Model Theft | Model provider credentials protected in vault; no credential exposure via gateway APIs | High |
9. Governance Considerations
Responsible AI
- The gateway is the enforcement point for the organisation's AI Acceptable Use Policy. Policy rules in the policy store codify the AUP into enforceable controls.
- Every AI interaction is logged with sufficient context to support post-hoc review of AI-assisted decisions — a core requirement of responsible AI frameworks.
Model Risk Management
- The gateway's routing rules enforce which models may be used for which use cases. High-risk use cases (credit decisions, medical triage) can be restricted to approved, validated models only.
- Model version pinning at the gateway ensures that model updates do not reach production applications without going through the change management process.
Human Approval Gates
- The policy engine can require human approval for request types flagged as high-risk (e.g., requests to execute code, send communications, or modify records). Human approval workflows are triggered via an integration with the organisation's ITSM platform.
Policy Management
- AI usage policies are maintained as code (OPA Rego or Cedar policies) in a version-controlled repository. Changes undergo PR review, automated policy testing, and staged rollout through gateway environments (dev → staging → production).
Traceability
- Every policy decision is logged with the rule ID that triggered it, enabling governance teams to audit which policies are most frequently triggered, identify policy gaps, and demonstrate regulatory compliance.
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| AI Usage Policy (OPA/Cedar) | AI Governance Team | Reviewed quarterly; updated as needed | Codifies AUP into enforceable gateway rules |
| Model Access Control List | AI Platform Team | Updated with each new model onboarding | Defines which teams may use which models |
| Audit Log Export | Compliance Team | Monthly extract; on-demand for incidents | Regulatory evidence; incident investigation |
| Cost Allocation Report | Finance + AI Platform | Monthly | AI spend governance; budget vs actuals |
| Policy Violation Report | Security Operations | Weekly | Identifies abuse patterns; tuning of policy rules |
| Circuit Breaker Runbook | AI Platform / SRE | Reviewed after each provider incident | Operational response to provider degradation |
10. Operational Considerations
Monitoring
- Gateway metrics must be collected at sub-second granularity: request rate, error rate, p50/p95/p99 latency per provider, token throughput, quota utilisation per caller, circuit breaker state, and cost rate.
- Dashboards provide both real-time operational view (SRE) and 30-day trend view (governance).
SLOs
| SLO | Target | Measurement Method |
|---|---|---|
| Gateway availability | 99.95% | Synthetic health checks from all availability zones every 30s |
| Request latency added by gateway (p99) | <10ms (excluding model latency) | Distributed trace: gateway entry → provider forward timestamp |
| Authentication success rate | >99.9% | 1 − (count of 401s / total requests) |
| Policy decision latency (p99) | <2ms | Internal span: policy_engine_start → policy_engine_end |
| Audit log write durability | 100% (zero lost records) | Log record count reconciliation; dead-letter queue for failed writes |
| Circuit breaker false positive rate | <0.1% | Manual review of circuit open events |
Logging
- Structured JSON logs. Mandatory fields: trace_id, span_id, caller_id, model_requested, model_served, request_tokens, response_tokens, cost_usd, policy_decision, http_status, latency_ms, timestamp_utc.
- Log level INFO for all requests; WARN for policy violations; ERROR for auth failures and circuit breaker events; AUDIT for all request/response pairs (separate immutable log stream).
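A small helper can enforce the mandatory-field list at the point where a log line is built. The sketch below uses the field names listed above; the helper name, the ISO 8601 timestamp format, and the choice to raise on missing fields are assumptions.

```python
import json
from datetime import datetime, timezone

MANDATORY_FIELDS = (
    "trace_id", "span_id", "caller_id", "model_requested", "model_served",
    "request_tokens", "response_tokens", "cost_usd", "policy_decision",
    "http_status", "latency_ms", "timestamp_utc",
)


def build_log_record(**fields) -> str:
    """Serialise one request log line, refusing to emit it if a mandatory field is absent."""
    # Fill in the timestamp if the caller did not supply one (ISO 8601 is an assumed format).
    fields.setdefault("timestamp_utc", datetime.now(timezone.utc).isoformat())
    missing = [name for name in MANDATORY_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing mandatory log fields: {', '.join(missing)}")
    return json.dumps(fields, sort_keys=True)
```

Failing loudly on an incomplete record keeps gaps out of the log stream that downstream compliance reports depend on.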
Incident Management
- P1: Gateway unavailable — all AI workloads impacted. Pager alert to AI Platform SRE + escalation to Architecture owner within 5 minutes.
- P2: Provider circuit open with no healthy fallback. Pager alert; initiate fallback provider activation.
- P3: Sustained rate of policy violations (>1% of requests). Alert to Security Operations for investigation.
DR
| Scenario | RTO | RPO | Recovery Approach |
|---|---|---|---|
| Single gateway instance failure | 30s | 0 (stateless data plane) | Load balancer removes unhealthy instance; autoscaling adds replacement |
| Redis quota store failure | 5min | Accept brief over-quota traffic | Fail-open mode: allow traffic with alert; quota store cluster with automatic failover |
| Vault unavailable | 2min | 0 | Gateway fails closed (no credentials = no traffic); vault HA cluster |
| Full gateway region failure | 15min | 0 | Active-active multi-region deployment; Route 53/Azure Traffic Manager DNS failover |
Capacity
- Gateway is stateless in the data plane (policy decisions made against in-process cache). Scale horizontally with demand.
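- Redis quota store: size for the number of live counters (callers × models × quota windows) rather than raw token volume, since each counter is incremented by a request's token count instead of storing one entry per token. Even a deliberately pessimistic 10M live counters at ~20 bytes of counter data each comes to ~200MB (before Redis per-key overhead), comfortably within Redis memory limits.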
- Provision for 3× normal peak to absorb burst without autoscaling lag.
11. Cost Considerations
Cost Drivers
| Cost Driver | Description | Relative Impact |
|---|---|---|
| Compute (gateway instances) | CPU/memory for request processing, policy evaluation, TLS termination | Medium |
| Redis (quota store) | Managed Redis cluster for rate limiting state | Low |
| Vault (secrets management) | HashiCorp Vault Enterprise or cloud-native equivalent | Low–Medium |
| Log storage (audit logs) | Immutable log storage for 7 years; grows linearly with request volume | Medium (long-term) |
| Egress (model provider calls) | Dominates total cost; gateway adds ~0.1% overhead per request | Low (gateway-specific) |
| Engineering (operations) | SRE time to operate, tune, and evolve the gateway | Medium |
Scaling Risks
- Audit log storage grows unboundedly. Implement tiered storage (hot → warm → cold → archive) with automated lifecycle policies.
- Redis memory pressure at extreme token volumes. Use token bucket algorithm with decay to limit state size.
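The reason a token bucket keeps state small is that each caller needs only two numbers, the current balance and the time of the last update, with refill computed lazily on access. The sketch below is an in-memory, single-process illustration of that idea; a shared deployment would perform the same update atomically in the quota store (for example, in a Redis Lua script), which is not shown here.

```python
import time
from typing import Callable


class TokenBucket:
    """Lazy-refill token bucket: constant state per caller (balance + timestamp)."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self._clock = clock
        self._tokens = float(capacity)   # start full
        self._last = clock()

    def try_consume(self, amount: float = 1.0) -> bool:
        """Deduct `amount` if available; otherwise leave the balance and refuse (429)."""
        now = self._clock()
        # Credit the tokens earned since the last call, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._last = now
        if amount <= self._tokens:
            self._tokens -= amount
            return True
        return False
```

For LLM traffic, `amount` would typically be the request's token count, so a single large prompt draws down the bucket faster than many small ones.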
Optimisations
- Cache policy decisions for stable caller+model combinations (1-minute TTL) to avoid OPA evaluation on every request.
- Use spot/preemptible instances for stateless gateway replicas (failover to on-demand automatically).
- Compress audit logs before writing to object storage (LZ4/Zstandard): structured JSON typically shrinks by 70–80%.
Indicative Cost Range
| Scale | Monthly AWS Cost (USD) | Notes |
|---|---|---|
| Small (< 1M requests/day) | $500–$1,500 | 2 ECS Fargate tasks, ElastiCache t3.medium, CloudWatch Logs |
| Medium (1M–50M requests/day) | $2,000–$8,000 | 4–8 ECS tasks, ElastiCache r6g.large cluster, S3 immutable logs |
| Large (> 50M requests/day) | $15,000–$40,000 | EKS cluster, ElastiCache r6g.4xlarge, dedicated log pipeline |
These figures cover gateway infrastructure only. Model provider API costs are the dominant expenditure and are not included.
12. Trade-Off Analysis
Option Comparison
| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| A: Build custom gateway | Develop gateway as an internal service using Envoy or Kong as a base | Full control; can add AI-specific features; no vendor lock-in | High development and maintenance cost; requires specialist expertise | Large enterprises with unique AI governance requirements |
| B: Cloud-native AI gateway | Use Azure APIM + Azure AI Content Safety, or AWS API Gateway + Bedrock Guardrails | Low operational overhead; native integration with cloud AI services; managed SLA | Vendor lock-in; limited multi-cloud support; less flexible policy engine | Organisations committed to a single cloud provider |
| C: Commercial or open-source AI gateway product | LiteLLM Proxy (open-source), Portkey, Martian, or similar | Purpose-built for LLM use cases; fast time-to-value; community support | Less mature enterprise features; vendor viability risk; may not meet all compliance requirements | Mid-market organisations; teams needing quick deployment |
| D: Service mesh with AI extensions | Extend existing Envoy-based service mesh (Istio/Consul) with AI-specific filters | Reuses existing investment; consistent with microservices observability | Significant customisation required; LLM-specific features (token counting, streaming) require custom WASM/Lua filters | Organisations with mature service mesh already deployed |
Architectural Tensions
| Tension | Trade-Off |
|---|---|
| Security vs Latency | Every security check (auth, policy, prompt firewall) adds latency. Target: <10ms gateway overhead. Achieve through in-process caching, async audit logging, and hardware-accelerated TLS. |
| Observability vs Privacy | Full request/response logging maximises audit capability but risks logging sensitive data. Resolution: log content hashes by default; full content logging opt-in per data classification level, with field-level redaction. |
| Centralisation vs Resilience | A gateway is a single logical control point; if poorly designed, it becomes a single point of failure. Resolution: active-active multi-region deployment; fail-open for quota (not for auth) to maintain availability. |
| Policy Strictness vs Developer Productivity | Overly strict policies block legitimate use; overly permissive policies defeat the purpose. Resolution: graduated enforcement (warn → soft-block → hard-block) with developer-visible explanations. |
| Cost Visibility vs Performance | Fine-grained cost tagging (per-request, per-user) requires token counting and cost ledger writes on every request. Resolution: async cost event writes to a queue; batch persist to ledger every 5 seconds. |
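The batching resolution in the last row might be sketched as a buffer that hands accumulated cost events to the ledger once the flush interval has elapsed. The class name and the `persist` callable are hypothetical. For simplicity the flush is triggered by the next incoming event; a real gateway would also flush from a background timer and on shutdown so that a quiet period cannot strand buffered events.

```python
import time
from typing import Callable, List


class BatchingLedgerWriter:
    """Buffers cost events and persists them in batches off the per-request hot path."""

    def __init__(self, persist: Callable[[List[dict]], None],
                 flush_interval: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._persist = persist           # hypothetical ledger write (one call per batch)
        self._interval = flush_interval   # 5 s batch interval from the table above
        self._clock = clock
        self._buffer: List[dict] = []
        self._last_flush = clock()

    def add(self, event: dict) -> None:
        """Queue an event; flush if the interval has elapsed since the last flush."""
        self._buffer.append(event)
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._persist(list(self._buffer))
            self._buffer.clear()
        self._last_flush = self._clock()
```

The trade-off is that budget monitors see spend up to one interval late, which is usually acceptable for soft limits but worth considering for hard-block thresholds.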
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Gateway instance crash | Low | High (if single instance) | Load balancer health check failure → alert | Autoscaling replaces instance; deploy minimum 3 instances in production |
| Redis quota store timeout | Medium | Medium (brief over-quota traffic) | Latency spike on rate limit check → alert | Fail-open for quota; Redis Sentinel/Cluster for HA |
| Vault unreachable | Low | Critical (all traffic blocked) | 503 spike → critical alert | Vault HA cluster; optionally hold already-fetched credentials in memory for up to 5min to ride out brief outages, then fail closed |
| Policy store stale | Medium | Medium (stale policy decisions) | Policy cache age metric → alert | Cache TTL 60s; background refresh; explicit invalidation API |
| Prompt firewall false positive rate spike | Medium | High (legitimate traffic blocked) | 400 rate spike from prompt firewall → alert | Tuning runbook; emergency bypass flag per caller (audited) |
| Audit log write failure | Low | Critical (regulatory compliance gap) | Dead-letter queue depth > 0 → critical alert | Retry with exponential backoff; dead-letter queue with separate drain process |
| TLS certificate expiry | Low | Critical (all traffic blocked) | Certificate expiry monitoring → 30-day warning | Automated certificate rotation via cert-manager or ACM |
| Model provider mass outage | Medium | High | Circuit breaker opens for multiple providers simultaneously | Fallback to on-premises model; queue non-urgent requests; alert users |
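The recovery approach for audit log write failures (retry with exponential backoff, then dead-letter) can be sketched as below. The `write` callable, the attempt count, and the delay values are assumptions; the dead-letter queue is represented by a plain list that a separate drain process would consume.

```python
import time
from typing import Callable, List


def write_with_retry(write: Callable[[dict], None], record: dict,
                     dead_letter: List[dict], max_attempts: int = 4,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to persist an audit record; park it in the dead-letter queue if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            write(record)
            return True
        except Exception:
            # Back off 1x, 2x, 4x ... the base delay between attempts.
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Never drop the record: a non-empty dead-letter queue is the alerting signal.
    dead_letter.append(record)
    return False
```

Since the detection rule is "dead-letter queue depth > 0", the record must reach the queue on every failure path, which is why the append sits outside the retry loop.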
Cascading Failure Scenarios
Scenario 1: Vault + Redis simultaneous failure
If both Vault (credential store) and Redis (quota store) fail simultaneously, the gateway can neither retrieve credentials nor enforce quotas. The gateway must fail closed (return 503): failing open on quotas while instances still hold cached provider credentials would let unmetered requests reach providers. Mitigation: Vault and Redis must be deployed on independent infrastructure with no shared failure domain.
Scenario 2: Policy store becomes unavailable during a security incident
If the policy store becomes unavailable at the same time as a security event requiring a policy update (e.g., a compromised caller key), the gateway will continue serving the last cached policy. Mitigation: emergency policy override API that writes directly to the in-memory cache on each gateway instance; secured with break-glass credentials.
14. Regulatory Considerations
| Regulation | Requirement | Gateway Implementation |
|---|---|---|
| APRA CPS234 (Information Security) | Maintain information security controls for third-party service providers | Model provider access through gateway enforces ACL; audit trail demonstrates access governance |
| APRA CPS230 (Operational Risk) | Identify and manage risks from third-party dependencies | Circuit breaker provides operational resilience; provider health metrics enable risk monitoring |
| Australian Privacy Act 1988 | Personal information must not be disclosed to overseas recipients without consent | Data classification enforcement in policy engine blocks requests containing PI from routing to non-approved providers |
| EU AI Act Article 9 (Risk Management) | High-risk AI systems must implement risk management measures | Gateway enforces model access controls for high-risk use cases; audit log supports risk documentation |
| EU AI Act Article 12 (Record-Keeping) | High-risk AI systems must maintain logs enabling post-hoc audit | Immutable audit log with 7-year retention supports this requirement |
| ISO/IEC 42001 §6.1 (Risk Treatment) | Implement controls for identified AI risks | Gateway operationalises risk treatment actions from AI risk register |
| NIST AI RMF GOVERN 1.2 | Accountability mechanisms for AI systems | Caller identity + audit log creates clear accountability chain for every AI request |
| NIST AI RMF MANAGE 2.4 | Monitor AI system performance | Gateway metrics and alerts implement continuous AI performance monitoring |
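
The Privacy Act row above relies on the policy engine blocking requests that contain personal information from non-approved providers. A deny-by-default version of that check might look like the sketch below. The classification labels, provider names, and the `may_route` function are hypothetical placeholders; the actual labels would come from the data classification pattern.

```python
from typing import FrozenSet, Mapping

# Hypothetical policy: which providers each classification label may reach.
APPROVED_PROVIDERS: Mapping[str, FrozenSet[str]] = {
    "public": frozenset({"external-provider-a", "external-provider-b", "onprem-llm"}),
    "personal": frozenset({"onprem-llm"}),
}


def may_route(
    label: str,
    provider: str,
    approved: Mapping[str, FrozenSet[str]] = APPROVED_PROVIDERS,
) -> bool:
    """Return True only if the provider is approved for this classification.

    Unknown labels map to an empty set, so unclassified data is blocked.
    """
    return provider in approved.get(label, frozenset())
```

Here `may_route("personal", "external-provider-a")` is `False`, while the same request routed to `"onprem-llm"` is permitted.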
15. Reference Implementations
AWS
| Component | AWS Service |
|---|---|
| Gateway compute | ECS Fargate (Kong or custom Go service) or API Gateway with Lambda authoriser |
| Auth | Cognito (IdP) + Lambda JWT authoriser |
| Policy engine | Lambda function hosting OPA with S3 policy bundle |
| Rate limiting | ElastiCache for Redis (token bucket counters) |
| Secrets | AWS Secrets Manager with automatic rotation |
| Routing | Application Load Balancer + ECS service discovery |
| Audit logs | Amazon Data Firehose (formerly Kinesis Data Firehose) → S3 with Object Lock (WORM) |
| Cost tracking | Custom Lambda → Cost and Usage Report + Athena |
| Monitoring | CloudWatch + X-Ray distributed tracing |
Azure
| Component | Azure Service |
|---|---|
| Gateway | Azure API Management (APIM) with custom policies |
| Auth | Microsoft Entra ID (formerly Azure AD) + APIM JWT validation policy |
| Policy engine | APIM policy expressions + Azure Functions for complex rules |
| Rate limiting | APIM built-in rate limiting + Azure Cache for Redis |
| Secrets | Azure Key Vault with managed identity |
| Routing | APIM backends + Azure Application Gateway |
| Audit logs | Event Hubs → Azure Blob Storage with immutability (WORM) policy |
| Content safety | Azure AI Content Safety (integrates natively with APIM) |
| Monitoring | Azure Monitor + Application Insights |
GCP
| Component | GCP Service |
|---|---|
| Gateway | Apigee API Management or Cloud Run (Kong/Envoy) |
| Auth | Google Identity Platform + Cloud IAP |
| Policy engine | Cloud Run (OPA) with Cloud Storage policy bundles |
| Rate limiting | Memorystore for Redis |
| Secrets | Secret Manager |
| Audit logs | Cloud Logging → Cloud Storage with retention lock |
| Monitoring | Cloud Monitoring + Cloud Trace |
On-Premises
| Component | Technology |
|---|---|
| Gateway | Kong Enterprise or Envoy Proxy with custom filters |
| Auth | Active Directory Federation Services + OAuth2 Proxy |
| Policy engine | OPA deployed as sidecar or standalone service |
| Rate limiting | Redis Sentinel cluster |
| Secrets | HashiCorp Vault Enterprise |
| Audit logs | Kafka → Elasticsearch with ILM read-only policy |
| Monitoring | Prometheus + Grafana + Jaeger |
16. Related Patterns
| Pattern | ID | Relationship |
|---|---|---|
| Prompt Firewall | EAAPL-SEC002 | Deployed inline within gateway; gateway calls firewall as a filter stage |
| LLM Input Sanitisation | EAAPL-SEC005 | Complementary to prompt firewall; deeper PII/schema validation within gateway pipeline |
| AI Output Filtering | EAAPL-SEC006 | Deployed as post-generation filter within gateway; shares audit log infrastructure |
| Zero-Trust AI Pipeline | EAAPL-SEC007 | Gateway is the primary enforcement point for zero-trust policy; SEC007 extends to intra-pipeline trust |
| Secrets Management for AI | EAAPL-SEC008 | Gateway depends on this pattern for all model provider credentials |
| AI Data Classification | EAAPL-SEC009 | Classification labels consumed by gateway policy engine for routing decisions |
| AI Telemetry | EAAPL-OBS001 | Gateway is the primary source of AI telemetry (token counts, latency, errors) |
| AI Cost Observability | EAAPL-OBS006 | Gateway's cost ledger is the primary data source for cost observability |
| Model Isolation | EAAPL-SEC003 | Gateway enforces network boundaries that complement model isolation at the compute layer |
17. Maturity Assessment
Overall Maturity: Mature
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Pattern definition clarity | 5 | Well-defined, unambiguous scope and responsibilities |
| Technology availability | 5 | Mature OSS and commercial options available across all major clouds |
| Industry adoption | 4 | Widely adopted in financial services and regulated industries; emerging in other sectors |
| Operational tooling | 4 | Strong monitoring and operations tooling; some AI-specific metrics require custom implementation |
| Regulatory alignment | 5 | Directly addresses APRA CPS234, EU AI Act, Privacy Act requirements |
| Reference implementation availability | 4 | Reference implementations available for all major clouds; AI-specific extensions require custom work |
| Community knowledge | 4 | Strong API gateway community; LLM-specific extensions are an emerging body of knowledge |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-01-15 | AI Architecture Team | Initial pattern definition |
| 1.1 | 2024-04-20 | AI Architecture Team | Added EU AI Act regulatory mapping; expanded DR scenarios |
| 2.0 | 2024-09-10 | AI Architecture Team | Major revision: added streaming support guidance; updated OWASP LLM Top 10 to 2024 edition; added GCP reference implementation |
| 2.1 | 2025-03-01 | AI Architecture Team | Added cost observability integration; expanded failure mode analysis; aligned with ISO 42001 §6.1 |