[EAAPL-PLT003] Model Routing
Category: Platform Engineering
Sub-category: Traffic Management
Version: 1.2
Maturity: Proven
Tags: model-routing, intelligent-routing, cost-based-routing, latency-routing, capability-routing, shadow-routing, a-b-routing, fallback, routing-rules-as-code
Regulatory Relevance: EU AI Act Article 9 (Risk Management), ISO 42001, NIST AI RMF MAP 2.1
1. Executive Summary
The Model Routing pattern establishes intelligent, policy-driven dispatch of AI inference requests to the optimal model from a pool of candidates. As organisations operate multiple model providers and tiers—frontier models for complex reasoning, mid-tier models for standard tasks, specialist models for domain-specific workloads—the routing layer translates business intent (minimise cost, maximise quality, meet latency SLO) into per-request model selection decisions without burdening product teams with this logic.
The commercial impact is significant: organisations that implement tiered routing consistently report a 30–50% reduction in model API spend by directing simple tasks to cheaper models while reserving frontier compute for genuinely complex requests. Additionally, shadow routing enables risk-free model evaluation on production traffic, and fallback routing maintains availability when individual providers degrade. Routing rules expressed as code integrate with GitOps workflows, giving governance teams an auditable, reviewable process for every routing policy change.
2. Problem Statement
Business Problem
Organisations pay frontier model prices for tasks that could be handled by models costing 10–20× less. There is no systematic mechanism to evaluate new model versions without exposing production traffic to risk. When a model provider has an outage, AI features fail rather than failing over to an available alternative.
Technical Problem
Routing logic is hardcoded in product team applications: each team selects a specific model endpoint and implements its own fallback logic. When routing strategy needs to change (e.g., switch primary model, adjust fallback order, enable cost-based routing), each team must make independent code changes. There is no A/B framework for comparing model quality systematically.
Symptoms
- 100% of AI requests going to the single most expensive model regardless of task complexity
- New model evaluation requiring full production deployment with rollback risk
- Model provider outage causing complete AI feature failure rather than graceful failover
- No mechanism to compare quality of two models on the same production traffic
- Teams spending engineering time implementing and maintaining per-team fallback logic
Cost of Inaction
- Unnecessary model API costs of 30–50% above optimal routing
- Model evaluation cycles of 4–8 weeks due to lack of production traffic comparison tooling
- Provider outage MTTR of hours instead of minutes due to hardcoded model selection
- Inability to demonstrate model governance to auditors (no audit trail of routing decisions)
3. Context
When to Apply
- Organisation operates ≥2 model providers or model tiers simultaneously
- Cost optimisation of AI spend is a priority
- Availability requirements demand provider failover capability
- Model evaluation and comparison is a recurring operational need
- Platform team centralises model access (see EAAPL-PLT001)
When NOT to Apply
- Single model, single provider with no plans for multi-provider: routing overhead not warranted
- Models are fundamentally incompatible in output format such that failover would break consuming applications
- Ultra-low latency requirements (<100ms total) where routing overhead is prohibitive (use direct integration)
Prerequisites
- AI API Gateway (EAAPL-PLT002) as the host for routing logic
- Model Registry with capability cards per model (PLT001 Layer 2)
- Multiple model provider credentials managed in Secrets Manager
- Observability infrastructure for routing decision logging and model performance metrics
- Response schema normalisation across providers (or application tolerance for schema variation)
Industry Applicability
| Industry | Applicability | Routing Strategy Priority |
| --- | --- | --- |
| Financial Services | High | Capability-based (accuracy critical); fallback for availability |
| Healthcare | High | Capability-based (clinical accuracy); cost-based for administrative tasks |
| Media / Content | Very High | Cost-based routing dominant; high volume, variable complexity |
| E-commerce | High | Latency-based for customer-facing; cost-based for batch enrichment |
| Technology / SaaS | Very High | Multi-strategy; A/B routing for model evaluation is core practice |
| Government | Medium | Capability and data-residency routing; complex policy rules |
4. Architecture Overview
The Model Routing layer sits within or immediately behind the AI API Gateway and executes per-request model selection before the upstream proxy forwards the call. Given the same input context, routing configuration, and endpoint state (circuit breaker status and latency metrics), the routing decision is deterministic, which makes it reproducible and auditable. The routing configuration is stored as code in a Git repository, enabling GitOps workflows for policy changes.
Intent Classification is the first stage of routing logic. The incoming request carries signals that inform routing: the declared use case tag in the request metadata (e.g., use-case: summarisation), the consumer's team namespace (which may have team-level routing overrides), the estimated complexity of the request (derived from prompt length, presence of structured data, declared reasoning requirement), and any explicit model hint from the consumer (which is subject to policy gating). Intent classification can be as simple as a rule lookup against the use-case tag or as sophisticated as a lightweight classifier that scores request complexity in <10ms.
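The rule-lookup end of that spectrum can be sketched as follows. This is a minimal illustration, not the pattern's reference implementation: the taxonomy entries, the `RequestContext` fields, and the 4,000-token threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical taxonomy: use-case tag -> baseline complexity.
USE_CASE_BASELINE = {
    "summarisation": "LOW",
    "classification": "LOW",
    "code-review": "MEDIUM",
}
ORDER = ["LOW", "MEDIUM", "HIGH"]


@dataclass(frozen=True)
class RequestContext:
    use_case: str
    team: str
    prompt_tokens: int
    has_structured_data: bool = False
    needs_reasoning: bool = False


def classify(ctx: RequestContext, long_prompt_tokens: int = 4000) -> str:
    """Start from the tag's baseline and raise it when complexity signals fire."""
    # Unknown tags fall back to MEDIUM rather than failing the request.
    level = ORDER.index(USE_CASE_BASELINE.get(ctx.use_case, "MEDIUM"))
    if ctx.prompt_tokens > long_prompt_tokens or ctx.has_structured_data:
        level += 1
    if ctx.needs_reasoning:
        level = len(ORDER) - 1  # declared reasoning requirement forces HIGH
    return ORDER[min(level, len(ORDER) - 1)]


print(classify(RequestContext("summarisation", "team-marketing", 800)))
```

A pure lookup like this has no model inference on the request path, which is why it fits comfortably inside the classification latency budget.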
Routing Strategy Evaluation applies the configured strategy for the consumer/use-case combination. Four primary strategies are defined:
Cost-based routing assigns a cost tier to each request (low/medium/high) based on complexity signals and routes to the cheapest model within that tier that meets the quality threshold. Cost tiers map to model families: low-cost (GPT-4o-mini, Claude Haiku, Gemini Flash), mid-cost (GPT-4o, Claude Sonnet), high-cost (o1, Claude Opus, Gemini Ultra). The quality threshold per tier is expressed as a minimum benchmark score on the organisation's evaluation dataset.
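The selection rule described above (cheapest model in the tier that clears the quality threshold) might look like the sketch below. Model names, prices, and benchmark scores are placeholders, not real quotes or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    tier: str             # "low", "mid", or "high"
    cost_per_mtok: float  # illustrative price per million tokens
    benchmark: float      # score on the organisation's evaluation dataset


def pick_cheapest(models, tier, min_score):
    """Return the cheapest model in `tier` whose benchmark meets `min_score`."""
    eligible = [m for m in models if m.tier == tier and m.benchmark >= min_score]
    return min(eligible, key=lambda m: m.cost_per_mtok) if eligible else None


# Placeholder catalogue for illustration only.
CATALOGUE = [
    Model("model-a", "low", 0.50, 0.71),
    Model("model-b", "low", 0.30, 0.62),
    Model("model-c", "mid", 3.00, 0.84),
]

choice = pick_cheapest(CATALOGUE, tier="low", min_score=0.70)
print(choice.name if choice else "no eligible model")
```

Here `model-b` is cheaper but falls below the threshold, so the router would pick `model-a`. Returning `None` leaves the caller free to escalate to the next tier or apply the default strategy.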
Latency-based routing selects the model with the lowest current P90 latency from real-time metrics. This is particularly valuable for interactive user-facing features where model quality differences are marginal but latency differences are perceived. The latency metric is maintained as a sliding 5-minute window per provider endpoint.
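A sliding-window P90 tracker of this kind can be sketched in a few lines. The class name, the nearest-rank percentile method, and the explicit `now` parameter (used so the example is testable without a real clock) are assumptions; a production deployment would more likely read this figure from its metrics system.

```python
import math
from collections import defaultdict, deque


class LatencyWindow:
    """Per-endpoint latency samples over a sliding time window (default 5 min)."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self._samples = defaultdict(deque)  # endpoint -> deque of (ts, latency_ms)

    def record(self, endpoint: str, latency_ms: float, now: float) -> None:
        self._samples[endpoint].append((now, latency_ms))

    def p90(self, endpoint: str, now: float):
        q = self._samples[endpoint]
        # Evict samples that have aged out of the window.
        while q and q[0][0] < now - self.window_s:
            q.popleft()
        if not q:
            return None
        values = sorted(v for _, v in q)
        rank = math.ceil(0.9 * len(values))  # nearest-rank percentile
        return values[rank - 1]

    def fastest(self, endpoints, now: float):
        """Endpoint with the lowest current P90, ignoring endpoints with no data."""
        scored = [(self.p90(e, now), e) for e in endpoints]
        scored = [(p, e) for p, e in scored if p is not None]
        return min(scored)[1] if scored else None
```

Endpoints with no recent samples are skipped rather than treated as fast, so a newly added endpoint would need warm-up traffic (for example from shadow routing) before it can win selection.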
Capability-based routing matches the request's declared requirements against model capability cards in the registry. A request requiring 128K+ context routes only to models with sufficient context windows; a request requiring tool use routes only to models with function-calling capability; a request requiring structured JSON output routes to models with reliable JSON mode. Capability routing is essentially a filter, often combined with cost or latency routing for final selection.
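Because capability routing is a filter, it reduces to a predicate over capability cards. The sketch below assumes a simplified card with three fields; real registry cards would carry more attributes, and the field names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityCard:
    name: str
    context_window: int  # maximum context in tokens
    tool_use: bool
    json_mode: bool


def filter_capable(cards, min_context=0, need_tools=False, need_json=False):
    """Keep only models that satisfy every declared requirement."""
    return [
        c.name
        for c in cards
        if c.context_window >= min_context
        and (c.tool_use or not need_tools)
        and (c.json_mode or not need_json)
    ]


# Placeholder cards for illustration only.
CARDS = [
    CapabilityCard("model-a", 32_000, tool_use=True, json_mode=True),
    CapabilityCard("model-b", 200_000, tool_use=True, json_mode=False),
    CapabilityCard("model-c", 128_000, tool_use=False, json_mode=True),
]

print(filter_capable(CARDS, min_context=128_000, need_tools=True))
```

The surviving list is then handed to the cost or latency strategy for final selection; an empty list corresponds to the no-capable-model error case.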
Fallback routing defines an ordered preference list for a given model alias. When the primary model's circuit breaker is open or the provider returns persistent errors, the router advances to the next candidate. The fallback chain is explicit and version-controlled, not implicit.
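Resolving a model alias through its fallback chain can be sketched as a walk over the ordered list that skips endpoints whose circuit is open. The alias, endpoint names, and the dictionary shapes are assumptions for illustration.

```python
def resolve(alias, chains, circuit_state):
    """Return the first endpoint in the alias's chain whose circuit is not open."""
    for endpoint in chains.get(alias, []):
        # Endpoints with no recorded state are treated as healthy (closed).
        if circuit_state.get(endpoint, "closed") != "open":
            return endpoint
    return None  # every candidate is unavailable


# Hypothetical version-controlled chain for one alias.
CHAINS = {"general-chat": ["provider-a/primary", "provider-b/secondary", "onprem/tertiary"]}

print(resolve("general-chat", CHAINS, {"provider-a/primary": "open"}))
```

A `None` result is the point at which the router would return the routing-exhausted error instead of silently substituting a model outside the approved chain.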
A/B and Shadow Routing are layered on top of the primary strategy. A/B routing sends a configurable percentage of traffic to a candidate model, comparing outputs against the primary on the organisation's quality metrics. Shadow routing duplicates requests to a candidate model asynchronously without serving its response to the consumer; this enables zero-risk production traffic evaluation. Both mechanisms write routing experiment metadata to the Evaluation Framework (EAAPL-PLT008) for analysis.
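Both mechanisms can be sketched briefly. The A/B splitter below uses hash-based bucketing rather than a random draw, an assumption made so that the same request always lands in the same variant; the shadow helper uses `asyncio` and hypothetical `call_primary` / `call_shadow` coroutines supplied by the caller.

```python
import asyncio
import hashlib


def ab_variant(request_id: str, experiment_id: str, candidate_pct: float) -> str:
    """Assign a request to 'candidate' for roughly candidate_pct percent of traffic."""
    digest = hashlib.sha256(f"{experiment_id}:{request_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "candidate" if bucket < candidate_pct * 100 else "primary"


async def serve_with_shadow(request, call_primary, call_shadow, record):
    """Serve the primary response; send a copy to the shadow model in the background."""

    async def _shadow():
        try:
            record(await call_shadow(request))
        except Exception as exc:  # shadow failures must never reach the consumer
            record(exc)

    # Keep a reference to the task so it is not garbage-collected mid-flight.
    task = asyncio.create_task(_shadow())
    response = await call_primary(request)
    return response, task
```

The shadow response is only passed to `record`, never returned to the consumer, which is what makes the evaluation zero-risk from the consumer's perspective.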
Circuit Breaker Integration makes routing resilient. Each model endpoint has an associated circuit breaker tracking success rate and latency over a rolling window. When a circuit opens, the router excludes that endpoint from selection for the duration of the open window (configurable, typically 60 seconds). After the open window, a half-open state tests with a single request. This means the router inherently implements provider failover without a separate failover mechanism.
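The closed / open / half-open cycle can be sketched as a small state machine. This version tracks only success rate (the latency dimension mentioned above is omitted for brevity), and the thresholds, window size, and explicit `now` parameter are assumptions for the example.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, min_calls=10, window=50, open_s=60.0):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls
        self.open_s = open_s
        self.results = deque(maxlen=window)  # rolling window of success flags
        self.state = "closed"
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow(self, now: float) -> bool:
        """May the router send a request to this endpoint right now?"""
        if self.state == "open" and now - self.opened_at >= self.open_s:
            self.state = "half_open"
            self.probe_in_flight = False
        if self.state == "half_open":
            if self.probe_in_flight:
                return False  # only a single probe request is admitted
            self.probe_in_flight = True
            return True
        return self.state == "closed"

    def record(self, success: bool, now: float) -> None:
        if self.state == "open":
            return  # ignore late results that arrive while the circuit is open
        if self.state == "half_open":
            self.probe_in_flight = False
            if success:
                self.state = "closed"
                self.results.clear()
            else:
                self._open(now)
            return
        self.results.append(success)
        failures = sum(1 for ok in self.results if not ok)
        if (len(self.results) >= self.min_calls
                and failures / len(self.results) >= self.failure_threshold):
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "open"
        self.opened_at = now
```

The router would call `allow()` during candidate selection and `record()` after each upstream call; a failed probe re-opens the circuit for another full open window.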
Routing Rules as Code is a first-class governance principle. All routing configuration—strategy assignments per use case and consumer, fallback chains, A/B experiment configurations, capability requirements, cost tier thresholds—is expressed in a structured configuration format (YAML/JSON) stored in the platform's Git repository. Changes go through pull request review with platform team approval and are applied to the routing engine via a configuration deployment pipeline. Every routing configuration version is recorded in the audit log alongside the routing decisions it produced.
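As an illustration of what such configuration could contain, the sketch below mirrors a routing rules file as a Python dictionary and resolves the strategy for a consumer and use case. The schema, the `*` wildcard, the `clinical-notes` tag, and the version string are all hypothetical; the pattern does not prescribe a specific format beyond YAML/JSON.

```python
# Hypothetical in-memory form of a routing rules file after parsing.
RULES = {
    "version": "example-git-sha",
    "default": {"strategy": "cost-based"},
    "rules": [
        # More specific rules are listed first; the first match wins.
        {"team": "team-marketing", "use_case": "summarisation", "strategy": "cost-based"},
        {"team": "*", "use_case": "clinical-notes", "strategy": "capability-based"},
    ],
}


def lookup_strategy(rules, team, use_case):
    """Return (strategy, used_default) for a consumer/use-case combination."""
    for rule in rules["rules"]:
        if rule["team"] in (team, "*") and rule["use_case"] == use_case:
            return rule["strategy"], False
    # No rule matched: fall back to the global default and let the caller log it.
    return rules["default"]["strategy"], True


print(lookup_strategy(RULES, "team-marketing", "summarisation"))
```

Carrying the `version` field alongside every decision is one way to tie an audit log entry back to the exact configuration commit that produced it.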
5. Architecture Diagram
flowchart TD
subgraph Request["Request + Config"]
A[Incoming Request]
B[Routing Rules GitOps]
C[Model Registry]
end
subgraph Router["Model Router"]
D[Intent Classifier]
E[Strategy Engine]
F{Circuit Breaker}
end
subgraph Models["Model Endpoints"]
G[Frontier Tier]
H[Mid-Cost Tier]
I[Efficiency Tier]
end
A --> D
B --> E
C --> E
D --> E
E --> F
F -->|primary| G
F -->|cost route| H
F -->|efficiency| I
E --> J[(Routing Audit Log)]
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#fef9c3,stroke:#eab308
style C fill:#fef9c3,stroke:#eab308
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#f0fdf4,stroke:#22c55e
style F fill:#f3e8ff,stroke:#a855f7
style G fill:#dbeafe,stroke:#3b82f6
style H fill:#dbeafe,stroke:#3b82f6
style I fill:#dbeafe,stroke:#3b82f6
style J fill:#fef9c3,stroke:#eab308
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Intent Classifier | Service | Estimate request complexity; extract use-case signals | Rule-based lookup, lightweight ML classifier (DistilBERT), regex patterns | High |
| Routing Strategy Engine | Service | Apply configured strategy to produce ranked model list | Custom rule engine, LiteLLM router, Envoy route configuration | Critical |
| Circuit Breaker State Store | Service | Maintain per-endpoint health state (closed/open/half-open) | Redis, in-memory (single instance), Resilience4j | Critical |
| A/B Traffic Splitter | Service | Distribute traffic according to experiment configuration | Custom weighted random, LaunchDarkly, feature flag service | Medium |
| Shadow Router | Service | Duplicate requests to shadow model asynchronously | Async task queue (Celery, asyncio), Kafka producer | Medium |
| Routing Rules Store | Configuration | Version-controlled routing configuration | Git repository + ConfigMap (Kubernetes), Consul K/V | High |
| Real-Time Metrics Collector | Service | Maintain sliding window of model performance metrics | Prometheus, in-memory metrics cache with TTL | High |
| Model Registry Client | Service | Query model capability cards for capability-based routing | gRPC/HTTP client to Model Registry service | High |
| Routing Decision Logger | Service | Write routing decision record to audit log | Async writer to Kafka/OpenTelemetry | High |
| Evaluation Integration | Service | Publish A/B results to Evaluation Framework | REST/event client to PLT008 | Medium |
7. Data Flow
Primary Flow — Cost-Based Routing Request
| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Incoming Request | Arrive at router with use-case tag summarisation and consumer team team-marketing | Request context with metadata |
| 2 | Intent Classifier | Look up summarisation in use-case taxonomy; estimate complexity as LOW from prompt token count | Complexity: LOW; Use case: summarisation |
| 3 | Routing Strategy Selector | Look up team-marketing + summarisation in routing rules; find strategy: cost-based | Strategy: cost-based |
| 4 | Cost-Based Strategy | Map LOW complexity to Tier 3 efficiency models; retrieve list: [Claude Haiku, GPT-4o-mini] | Candidate list: [Claude Haiku, GPT-4o-mini] |
| 5 | Circuit Breaker Check | Check circuit state for Claude Haiku (CLOSED) and GPT-4o-mini (CLOSED) | Both available |
| 6 | Final Selector | Select Claude Haiku (primary preference in rules); check A/B config: no active experiment for this consumer | Selected: Claude Haiku endpoint |
| 7 | Routing Decision Log | Emit routing record: {request_id, strategy, candidates, selected, reason, timestamp} | Audit log record written |
| 8 | Upstream Proxy | Forward request to Claude Haiku endpoint | Model response |
Error Flow
| Error Condition | Detection | Action | Consumer Impact |
| --- | --- | --- | --- |
| Primary model circuit open | Circuit breaker state check at step 5 | Advance to next candidate in fallback chain | Transparent; higher-cost model may be used |
| All candidates circuit open | Step 5 all candidates unavailable | Return 503 with routing-exhausted code; trigger incident alert | Service degraded; no AI response |
| Capability mismatch (no capable model available) | Capability filter produces empty list | Return 422 with no-capable-model code | Consumer must adjust request parameters |
| Routing rules not found for use case | Strategy selector miss | Apply default strategy (configured globally) | Potential non-optimal routing; logs warning |
| Intent classification timeout | <10ms budget exceeded | Apply default routing strategy without classification | Routing proceeds; log classification timeout |
8. Security Considerations
Authentication and Authorisation
- Consumer input must not influence model selection beyond the declared use-case tag and policy-gated model hints; raw model names in consumer requests are validated against the models authorised for that consumer
- Team-level routing overrides require platform team approval; they are stored in the version-controlled routing rules, not consumer-controllable at request time
Secrets Management
- Model provider credentials for each endpoint are retrieved from Secrets Manager at routing decision time; credentials are not embedded in routing rules
- Shadow routing uses separate credentials, read-only scoped where possible, to prevent the shadow model from being used for mutations
Data Classification and Encryption
- Routing decisions involving RESTRICTED or CONFIDENTIAL data are logged with the classification label for audit trail completeness
- Shadow requests must be subject to the same data classification and policy enforcement as primary requests
Auditability
- Every routing decision is logged with: strategy applied, candidates considered, circuit breaker states, selected endpoint, reason code, any experiment configuration active
- Routing configuration changes are version-controlled and auditable as Git commits with author, timestamp, and review record
OWASP LLM Top 10 Controls
| OWASP LLM Risk | Routing-Layer Control |
| --- | --- |
| LLM01 Prompt Injection | Routing does not modify prompts; injection risk handled at gateway layer |
| LLM04 Model DoS | Circuit breaker prevents failed model from absorbing continued traffic |
| LLM05 Supply Chain | Only models in the approved registry are eligible routing targets |
| LLM09 Overreliance | Routing logs which model produced each response; enables per-model quality monitoring |
9. Governance Considerations
Responsible AI
- Routing rules must not route high-risk AI use cases to models without a completed Model Risk Card
- A/B experiments involving high-risk use cases require explicit Governance Board approval before activation
- Shadow routing results feed into model evaluation decisions that are recorded in the Evaluation Framework
Model Risk Management
- The routing fallback chain defines the approved substitution hierarchy; arbitrary model substitution is not permitted
- When a new model is added to the registry and routing rules, a Model Risk Card delta review is required comparing the new model to existing candidates
- Routing telemetry (which model served which volume of requests) is a key input to the quarterly model risk review
Governance Artefacts
| Artefact | Owner | Cadence | Location |
| --- | --- | --- | --- |
| routing-rules.yaml | Platform Team | Per change via PR | Git repository |
| A/B experiment registry | Platform Team + Model Owner | Per experiment | Evaluation Framework |
| Fallback chain approval records | Platform Governance Board | Per change | GRC system / Git PR comments |
| Routing telemetry report | Platform Team | Monthly | Observability dashboard |
| Model substitution impact assessment | Risk Team | Per fallback chain change | Model Registry |
10. Operational Considerations
Monitoring
| Signal | Source | Alert Threshold | Owner |
| --- | --- | --- | --- |
| Fallback activation rate | Routing decision log | >5% of requests using non-primary model | Platform On-Call |
| Circuit breaker state changes | Circuit breaker events | Any circuit opening | Platform On-Call + Model Owner |
| Intent classification error rate | Intent classifier metrics | >1% classification errors | Platform Team |
| Routing rule miss rate | Routing engine logs | >0.1% of requests hitting the default strategy | Platform Team |
| A/B experiment quality delta | Evaluation Framework | Statistically significant quality degradation in B variant | Platform Team + Product Owner |
SLOs
| SLO | Target | Window |
| --- | --- | --- |
| Routing decision latency P99 | <15ms (overhead beyond gateway) | Rolling 7 days |
| Routing availability (decisions produced) | 99.99% | Rolling 30 days |
| Fallback success rate | >99% of requests served even when primary unavailable | Rolling 30 days |
| Circuit breaker false positive rate | <0.1% circuits opened without actual provider failure | Rolling 30 days |
Logging
- Routing decisions logged as structured JSON with correlation to the gateway request ID
- Circuit breaker state transitions logged separately for operational analysis
- A/B experiment decisions include experiment ID and variant for analysis join
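A structured routing decision record along these lines could be assembled as shown below. The field names follow the record described in the data flow (request_id, strategy, candidates, selected, reason, timestamp); `rules_version`, `circuit_states`, and the `experiment` shape are assumptions added to cover the auditability requirements.

```python
import json
from datetime import datetime, timezone


def routing_record(request_id, strategy, candidates, selected, reason,
                   rules_version, circuit_states, experiment=None, now=None):
    """Serialise one routing decision as a single JSON log line."""
    record = {
        "request_id": request_id,        # correlates with the gateway request ID
        "strategy": strategy,
        "candidates": candidates,
        "selected": selected,
        "reason": reason,
        "rules_version": rules_version,  # e.g. Git commit of the routing rules
        "circuit_states": circuit_states,
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
    }
    if experiment is not None:
        record["experiment"] = experiment  # e.g. {"id": ..., "variant": ...}
    return json.dumps(record, sort_keys=True)


print(routing_record(
    "req-123", "cost-based", ["endpoint-a", "endpoint-b"], "endpoint-a",
    "primary-preference", "example-git-sha",
    {"endpoint-a": "closed", "endpoint-b": "closed"},
))
```

Emitting one self-contained JSON object per decision keeps the analysis join on experiment ID and request ID straightforward in downstream tooling.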
Incident Response
| Incident | Detection | Response | RTO |
| --- | --- | --- | --- |
| Routing engine crash | Health check failure; 100% routing errors | Kubernetes pod restart; DNS failover to secondary | 2 min |
| All circuits open (full blackout) | Zero successful upstream calls | Activate static fallback responses; page platform + engineering leadership | 5 min |
| Routing misconfiguration deployed | Fallback rate spike after deployment | Roll back routing-rules.yaml via GitOps; reset circuit breakers | 10 min |
Disaster Recovery
| Component | RPO | RTO | Strategy |
| --- | --- | --- | --- |
| Routing engine (stateless) | 0 | 2 min | Multi-replica; pod auto-restart |
| Routing rules config | 0 | 5 min | Git-backed; ConfigMap reload |
| Circuit breaker state (Redis) | 5 min | 2 min | Redis Sentinel; acceptable brief stale state |
| Routing decision audit log | <1 min | 10 min | Kafka replication + S3 cross-region |
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Weight |
| --- | --- | --- |
| Routing engine compute | Stateless; minimal CPU; scales with request count | Very Low |
| Intent classifier inference | If ML-based, adds per-request compute | Low |
| Circuit breaker state (Redis) | Small memory footprint | Very Low |
| Cost savings from tier routing | Negative cost: 30–50% reduction in model API spend | Dominant positive ROI |
Optimisations
- Most valuable optimisation: aggressive Tier 3 routing for high-volume, low-complexity tasks (summarisation, classification, entity extraction)
- Intent classifier should be rule-based for speed (latency budget <5ms) unless complexity estimation materially improves routing quality
- Cache routing decisions for identical consumer + use-case combinations with short TTL (1 minute) to reduce routing computation
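The short-TTL decision cache from the last optimisation can be sketched as follows. The class name and the injectable `clock` (used so expiry can be tested without sleeping) are assumptions for illustration.

```python
import time


class DecisionCache:
    """Cache routing decisions per (consumer, use case) with a short TTL."""

    def __init__(self, ttl_s: float = 60.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries = {}  # (consumer, use_case) -> (expires_at, decision)

    def get(self, consumer: str, use_case: str):
        key = (consumer, use_case)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, decision = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: force a fresh routing computation
            return None
        return decision

    def put(self, consumer: str, use_case: str, decision) -> None:
        self._entries[(consumer, use_case)] = (self.clock() + self.ttl_s, decision)
```

One design consideration: a cached decision can outlive a circuit state change, so it seems prudent to cache the candidate list and still run the circuit breaker check on every request.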
Indicative Cost Range
| Item | Monthly Cost Impact | Notes |
| --- | --- | --- |
| Routing infrastructure (any scale) | $100–$500/month | Routing engine is minimal compute; ROI is entirely from model cost savings |
| Cost savings at medium scale (10M tokens/day) | -$3,000–$8,000/month | From tier routing directing 60% of traffic to Tier 3 models |
| Cost savings at large scale (100M tokens/day) | -$30,000–$80,000/month | Tier routing ROI dominates; dedicated cost optimisation team warranted |
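The arithmetic behind figures of this kind can be reproduced with a short calculation. The per-million-token prices below are assumed blended rates chosen for illustration, not vendor quotes; at 10M tokens/day with 60% of traffic rerouted, the range in the table corresponds to a blended price gap of roughly $17 to $44 per million tokens.

```python
def monthly_savings(tokens_per_day, share_rerouted, frontier_price, efficiency_price, days=30):
    """Estimated monthly saving from moving a share of traffic to a cheaper tier.

    Prices are per million tokens; all inputs are assumptions supplied by the caller.
    """
    rerouted_mtok = tokens_per_day * days * share_rerouted / 1_000_000
    return rerouted_mtok * (frontier_price - efficiency_price)


# Assumed blended prices: $21 (frontier) vs $1 (efficiency) per million tokens.
medium = monthly_savings(10_000_000, 0.6, frontier_price=21.0, efficiency_price=1.0)
print(f"Estimated medium-scale saving: ${medium:,.0f}/month")
```

With these assumed prices the estimate comes to about $3,600 per month at medium scale, and scales linearly with token volume.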
12. Trade-Off Analysis
Routing Strategy Options
| Strategy | Description | Pros | Cons | Best For |
| --- | --- | --- | --- | --- |
| Static Routing | Fixed model per use-case; no dynamic selection | Simplest; predictable; easy to audit | No cost optimisation; no failover | Initial deployment; highly regulated use cases |
| Cost-Based Routing | Route to cheapest model meeting quality threshold | 30–50% cost reduction | Requires quality benchmarks; threshold tuning effort | High-volume, mixed-complexity workloads |
| Capability-Based Routing | Filter by capability; then cost or latency within capable set | Accurate capability matching; prevents capability-mismatch errors | Requires maintained capability cards in registry | Multi-model deployments with specialised models |
| ML-Based Routing | Classify request complexity with ML model; route accordingly | Most accurate tier assignment | Adds latency; ML model requires training and maintenance | Very high volume where marginal accuracy gains justify overhead |
Intent Classification Options
| Option | Latency | Accuracy | Maintenance | Best For |
| --- | --- | --- | --- | --- |
| Rule-based (use-case tag lookup) | <1ms | Depends on caller discipline | Low | Structured internal API with disciplined callers |
| Regex + heuristics on prompt | 1–5ms | Moderate | Low-Medium | General purpose with structured prompts |
| Lightweight ML classifier | 5–15ms | High | Medium | High-volume workloads where routing accuracy has large cost impact |
Architectural Tensions
| Tension | Option A | Option B | Resolution |
| --- | --- | --- | --- |
| Routing transparency vs. complexity | Expose routing decision to consumers | Black box | Include X-Model-Used header in response; audit log accessible to consumers for own requests |
| Routing speed vs. accuracy | Rule-based (fast, less accurate) | ML classifier (slower, more accurate) | Rule-based default; ML opt-in for high-volume use cases where ROI justifies latency |
| Consumer control vs. platform governance | Allow consumers to specify exact model | Platform controls all routing | Allow model family hints; platform selects within family; override audited |
| Failover quality vs. consistency | Always fail over to available model | Return error if preferred model unavailable | Fail-over default for availability; consumer can opt for fail-fast if consistency required |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Intent classifier crash | Medium | Medium: all requests use default routing | Classifier health check; default routing rate spikes | Restart classifier; default routing adequate in interim |
| Routing rules desync (ConfigMap stale) | Low | Medium: requests using outdated routing policy | Rules version mismatch alert | Force ConfigMap reload; GitOps pipeline re-applies |
| Circuit breaker stuck open (false positive) | Low | Medium: model excluded despite being healthy | Provider health check succeeds while circuit open | Manual circuit reset; post-incident investigation |
| A/B experiment misconfiguration (100% to B) | Low | High: all traffic to unvalidated model | Traffic split monitoring alert | Roll back experiment config; route to primary |
| Model capability card stale in registry | Medium | Low-Medium: capability routing sends to incapable model | Capability mismatch error from model | Update registry; add error handler for capability mismatch |
Cascading Scenario
- Mass circuit opening storm: Under a broad cloud provider degradation, multiple circuits open simultaneously. The router falls back to the next tier for all requests. If the fallback tier is also degraded (same cloud region), the cascade proceeds through all fallback candidates and the router returns 503 for all requests. Mitigation: fallback chains must span cloud providers or include on-premises/alternative-region endpoints.
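The mitigation above lends itself to an automated check in the routing rules pipeline. The sketch below flags fallback chains whose endpoints all share one failure domain; the `failure_domain` mapping (endpoint to provider and region) and all names are hypothetical.

```python
def chains_without_diversity(chains, failure_domain):
    """Return aliases whose fallback chain sits entirely in one failure domain."""
    weak = []
    for alias, chain in chains.items():
        domains = {failure_domain[endpoint] for endpoint in chain}
        if len(domains) < 2:
            weak.append(alias)
    return sorted(weak)


# Hypothetical configuration for illustration.
CHAINS = {
    "general-chat": ["cloud-a/model-x", "cloud-a/model-y"],
    "summarise": ["cloud-a/model-y", "onprem/model-z"],
}
FAILURE_DOMAIN = {
    "cloud-a/model-x": ("cloud-a", "region-1"),
    "cloud-a/model-y": ("cloud-a", "region-1"),
    "onprem/model-z": ("onprem", "dc-1"),
}

print(chains_without_diversity(CHAINS, FAILURE_DOMAIN))
```

Run as a pull request check, a non-empty result could block a routing rules change that would leave an alias exposed to a single-provider outage.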
14. Regulatory Considerations
EU AI Act Article 9
- Routing decisions must be recorded to demonstrate that the risk management system controls which models process which use cases; the routing audit log satisfies this requirement
- High-risk AI systems must not fall back automatically to lower-quality or unapproved models unless human oversight is configured for that fallback
NIST AI RMF MAP 2.1
- The routing configuration explicitly documents the intended deployment context for each model, satisfying MAP 2.1's requirement to document AI deployment context
Audit and Record-Keeping
- Routing decision logs must be retained for the same period as the AI system's operational records (typically 7 years for regulated decisions)
- Routing configuration Git history constitutes an auditable record of every routing policy change with author and approval
15. Reference Implementations
AWS
| Component | AWS Service |
| --- | --- |
| Routing engine | LiteLLM Proxy on ECS, or custom Lambda function |
| Circuit breaker state | ElastiCache Redis |
| Routing rules | SSM Parameter Store or S3 config object |
| Intent classifier | Lambda + custom rules, or SageMaker endpoint (ML-based) |
| Model endpoints | Bedrock (Claude, Llama, Titan), SageMaker endpoints for self-hosted |
Azure
| Component | Azure Service |
| --- | --- |
| Routing engine | APIM with AI routing policies, or custom AKS deployment |
| Circuit breaker | APIM native circuit breaker policy |
| Routing rules | App Configuration |
| Model endpoints | Azure OpenAI multiple deployments |
GCP
| Component | GCP Service |
| --- | --- |
| Routing engine | Cloud Run service with LiteLLM or custom Python |
| Circuit breaker | Custom Redis-backed on Memorystore |
| Model endpoints | Vertex AI multiple model deployments |
On-Premises
| Component | Technology |
| --- | --- |
| Routing engine | LiteLLM Proxy or custom Python/Go service |
| Circuit breaker | Resilience4j (Java) or custom Redis-backed |
| Routing rules | Consul K/V or Git-synced ConfigMap |
| Model endpoints | vLLM serving multiple models on GPU cluster |
16. Related Patterns
| Pattern ID | Name | Relationship |
| --- | --- | --- |
| EAAPL-PLT001 | Enterprise AI Platform | Parent: routing is a core capability of the platform |
| EAAPL-PLT002 | AI API Gateway | Host: routing executes within or behind the gateway |
| EAAPL-PLT004 | LLM Cost Control | Complementary: cost-based routing is primary cost control lever |
| EAAPL-PLT008 | AI Experiment Tracking | Dependency: A/B and shadow routing results feed experiment tracking |
| EAAPL-INT007 | AI Circuit Breaker | Component: circuit breaker is embedded within routing |
17. Maturity Assessment
Overall Maturity: Proven
Model routing is production-proven across dozens of enterprise deployments. LiteLLM and Kong AI Gateway provide mature implementations. ML-based intent classification is still an emerging practice; rule-based routing is the proven approach.
Scoring Matrix
| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Pattern Completeness | 5 | All sections documented |
| Implementation Evidence | 4 | Core routing proven; ML-based intent classification less so |
| Tooling Stability | 4 | LiteLLM router mature; ML classification tooling evolving |
| Regulatory Alignment | 4 | Audit logging mapped; specific regulatory requirements vary by use case |
| Cost ROI Evidence | 5 | Consistent 30–50% cost reduction reported across multiple deployments |
18. Revision History
| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2024-03-10 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-09-15 | EAAPL Working Group | Added A/B and shadow routing sections; ML-based intent classification |
| 1.2 | 2025-06-12 | EAAPL Working Group | Cost savings data updated; cascading failure scenario added; GCP reference added |