[EAAPL-PLT003] Model Routing

Category: Platform Engineering
Sub-category: Traffic Management
Version: 1.2
Maturity: Proven
Tags: model-routing, intelligent-routing, cost-based-routing, latency-routing, capability-routing, shadow-routing, a-b-routing, fallback, routing-rules-as-code
Regulatory Relevance: EU AI Act Article 9 (Risk Management), ISO 42001, NIST AI RMF MAP 2.1

1. Executive Summary

The Model Routing pattern establishes intelligent, policy-driven dispatch of AI inference requests to the optimal model from a pool of candidates. As organisations operate multiple model providers and tiers—frontier models for complex reasoning, mid-tier models for standard tasks, specialist models for domain-specific workloads—the routing layer translates business intent (minimise cost, maximise quality, meet latency SLO) into per-request model selection decisions without burdening product teams with this logic.

The commercial impact is significant: organisations that implement tiered routing consistently report a 30–50% reduction in model API spend by directing simple tasks to cheaper models while reserving frontier compute for genuinely complex requests. Shadow routing lets teams evaluate candidate models on production traffic without serving candidate output to consumers, and fallback routing maintains availability when individual providers degrade. Routing rules expressed as code integrate with GitOps workflows, giving governance teams an auditable, reviewable process for every routing policy change.

2. Problem Statement

Business Problem

Organisations pay frontier model prices for tasks that could be handled by models costing 10–20× less. There is no systematic mechanism to evaluate new model versions without exposing production traffic to risk. When a model provider has an outage, AI features fail rather than failing over to an available alternative.

Technical Problem

Routing logic is hardcoded in product team applications: each team selects a specific model endpoint and implements its own fallback logic. When routing strategy needs to change (e.g., switch primary model, adjust fallback order, enable cost-based routing), each team must make independent code changes. There is no A/B framework for comparing model quality systematically.

Symptoms

100% of AI requests going to the single most expensive model regardless of task complexity
New model evaluation requiring full production deployment with rollback risk
Model provider outage causing complete AI feature failure rather than graceful failover
No mechanism to compare quality of two models on the same production traffic
Teams spending engineering time implementing and maintaining per-team fallback logic

Cost of Inaction

Unnecessary model API costs of 30–50% above optimal routing
Model evaluation cycles of 4–8 weeks due to lack of production traffic comparison tooling
Provider outage MTTR of hours instead of minutes due to hardcoded model selection
Inability to demonstrate model governance to auditors (no audit trail of routing decisions)

3. Context

When to Apply

Organisation operates ≥2 model providers or model tiers simultaneously
Cost optimisation of AI spend is a priority
Availability requirements demand provider failover capability
Model evaluation and comparison is a recurring operational need
Platform team centralises model access (see EAAPL-PLT001)

When NOT to Apply

Single model, single provider with no plans for multi-provider: routing overhead not warranted
Models are fundamentally incompatible in output format such that failover would break consuming applications
Ultra-low latency requirements (<100ms total) where routing overhead is prohibitive (use direct integration)

Prerequisites

AI API Gateway (EAAPL-PLT002) as the host for routing logic
Model Registry with capability cards per model (PLT001 Layer 2)
Multiple model provider credentials managed in Secrets Manager
Observability infrastructure for routing decision logging and model performance metrics
Response schema normalisation across providers (or application tolerance for schema variation)

Industry Applicability

Industry	Applicability	Routing Strategy Priority
Financial Services	High	Capability-based (accuracy critical); fallback for availability
Healthcare	High	Capability-based (clinical accuracy); cost-based for administrative tasks
Media / Content	Very High	Cost-based routing dominant; high volume, variable complexity
E-commerce	High	Latency-based for customer-facing; cost-based for batch enrichment
Technology / SaaS	Very High	Multi-strategy; A/B routing for model evaluation is core practice
Government	Medium	Capability and data-residency routing; complex policy rules

4. Architecture Overview

The Model Routing layer sits within or immediately behind the AI API Gateway and executes per-request model selection before the upstream proxy forwards the call. The routing decision is deterministic given the same input context, routing configuration, and runtime state (circuit breaker states, latency metrics, and experiment assignment), which makes it reproducible and auditable when that state is logged with the decision. The routing configuration is stored as code in a Git repository, enabling GitOps workflows for policy changes.

Intent Classification is the first stage of routing logic. The incoming request carries signals that inform routing: the declared use case tag in the request metadata (e.g., use-case: summarisation), the consumer's team namespace (which may have team-level routing overrides), the estimated complexity of the request (derived from prompt length, presence of structured data, declared reasoning requirement), and any explicit model hint from the consumer (which is subject to policy gating). Intent classification can be as simple as a rule lookup against the use-case tag or as sophisticated as a lightweight classifier that scores request complexity in <10ms.
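A minimal sketch of the rule-based end of that spectrum, assuming the signals listed above have already been extracted into a request context. The `RequestContext` fields, the `classify_complexity` function, and both token thresholds are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass

# Hypothetical thresholds (assumptions, not values from this pattern).
LOW_MAX_TOKENS = 500
MEDIUM_MAX_TOKENS = 4000


@dataclass(frozen=True)
class RequestContext:
    use_case: str                      # declared use-case tag, e.g. "summarisation"
    team: str                          # consumer team namespace
    prompt_tokens: int                 # estimated prompt length in tokens
    has_structured_data: bool = False  # prompt embeds tables, JSON, etc.
    needs_reasoning: bool = False      # caller declared a reasoning requirement


def classify_complexity(ctx: RequestContext) -> str:
    """Return "LOW", "MEDIUM" or "HIGH" from simple request signals."""
    if ctx.needs_reasoning or ctx.prompt_tokens > MEDIUM_MAX_TOKENS:
        return "HIGH"
    if ctx.prompt_tokens > LOW_MAX_TOKENS or ctx.has_structured_data:
        return "MEDIUM"
    return "LOW"


example = RequestContext(use_case="summarisation", team="team-marketing", prompt_tokens=200)
print(classify_complexity(example))
```

Because the function is a pure lookup over a handful of fields, it fits comfortably inside a sub-millisecond budget; an ML classifier would replace only the body of `classify_complexity`.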

Routing Strategy Evaluation applies the configured strategy for the consumer/use-case combination. Four primary strategies are defined:

Cost-based routing assigns a cost tier to each request (low/medium/high) based on complexity signals and routes to the cheapest model within that tier that meets the quality threshold. Cost tiers map to model families: low-cost (GPT-4o-mini, Claude Haiku, Gemini Flash), mid-cost (GPT-4o, Claude Sonnet), high-cost (o1, Claude Opus, Gemini Ultra). The quality threshold per tier is expressed as a minimum benchmark score on the organisation's evaluation dataset.
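The selection rule can be sketched as a filter followed by a minimum. The model names, prices, and quality scores below are invented placeholders, not figures for any real model, and `ModelOption` and `pick_cheapest` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ModelOption:
    name: str
    tier: str             # "low", "mid" or "high"
    cost_per_1k: float    # placeholder price, USD per 1K tokens
    quality_score: float  # score on the organisation's evaluation dataset


def pick_cheapest(models: Iterable[ModelOption], tier: str,
                  min_quality: float) -> Optional[ModelOption]:
    """Cheapest model in the tier whose score meets the threshold, else None."""
    eligible = [m for m in models
                if m.tier == tier and m.quality_score >= min_quality]
    return min(eligible, key=lambda m: m.cost_per_1k, default=None)


# Placeholder catalogue for illustration only.
CATALOGUE = [
    ModelOption("small-a", "low", 0.20, 0.71),
    ModelOption("small-b", "low", 0.15, 0.64),
    ModelOption("mid-a", "mid", 2.50, 0.83),
]

choice = pick_cheapest(CATALOGUE, tier="low", min_quality=0.70)
print(choice.name if choice else "no eligible model")
```

Here `small-b` is cheaper but falls below the threshold, so `small-a` is selected. Returning `None` rather than silently escalating to a higher tier leaves the escalation policy to the caller.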

Latency-based routing selects the model with the lowest current P90 latency from real-time metrics. This is particularly valuable for interactive user-facing features where model quality differences are marginal but latency differences are perceived. The latency metric is maintained as a sliding 5-minute window per provider endpoint.
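One way to maintain that metric is a per-endpoint deque of timestamped samples, sketched below. `LatencyWindow` and `fastest_endpoint` are hypothetical names, the percentile uses the nearest-rank method (an assumption; a metrics backend may interpolate differently), and timestamps are passed in explicitly so the behaviour is easy to test.

```python
import math
from collections import deque
from typing import Dict, Optional


class LatencyWindow:
    """Sliding window of latency samples for one provider endpoint."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp, latency_ms), oldest first

    def record(self, timestamp: float, latency_ms: float) -> None:
        self._samples.append((timestamp, latency_ms))

    def p90(self, now: float) -> Optional[float]:
        # Drop samples that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()
        if not self._samples:
            return None
        ordered = sorted(latency for _, latency in self._samples)
        rank = math.ceil(0.9 * len(ordered))  # nearest-rank percentile
        return ordered[rank - 1]


def fastest_endpoint(windows: Dict[str, LatencyWindow], now: float) -> Optional[str]:
    """Endpoint with the lowest current P90, ignoring endpoints with no data."""
    scored = {name: w.p90(now) for name, w in windows.items()}
    scored = {name: value for name, value in scored.items() if value is not None}
    return min(scored, key=scored.get) if scored else None
```

Endpoints with no recent samples are skipped rather than treated as fast, which avoids routing all traffic to an endpoint that has simply not been measured yet.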

Capability-based routing matches the request's declared requirements against model capability cards in the registry. A request requiring 128K+ context routes only to models with sufficient context windows; a request requiring tool use routes only to models with function-calling capability; a request requiring structured JSON output routes to models with reliable JSON mode. Capability routing is essentially a filter, often combined with cost or latency routing for final selection.
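As a filter, capability routing reduces to a predicate over capability cards. The card fields below (`context_window`, `tool_use`, `json_mode`) mirror the three examples in the paragraph above; the real registry schema is not specified in this pattern, so treat the field names and model names as assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CapabilityCard:
    name: str
    context_window: int  # maximum context length in tokens
    tool_use: bool       # supports function calling
    json_mode: bool      # supports reliable structured JSON output


@dataclass(frozen=True)
class Requirements:
    min_context: int = 0
    tool_use: bool = False
    json_mode: bool = False


def capable_models(cards: Iterable[CapabilityCard], req: Requirements) -> List[str]:
    """Names of models that satisfy every declared requirement, in input order."""
    return [
        card.name for card in cards
        if card.context_window >= req.min_context
        and (card.tool_use or not req.tool_use)
        and (card.json_mode or not req.json_mode)
    ]


CARDS = [
    CapabilityCard("model-a", 200_000, tool_use=True, json_mode=True),
    CapabilityCard("model-b", 32_000, tool_use=True, json_mode=False),
]
print(capable_models(CARDS, Requirements(min_context=128_000, tool_use=True)))
```

An empty result corresponds to the capability-mismatch condition in the error flow; the surviving names are then handed to cost or latency routing for final selection.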

Fallback routing defines an ordered preference list for a given model alias. When the primary model's circuit breaker is open or the provider returns persistent errors, the router advances to the next candidate. The fallback chain is explicit and version-controlled, not implicit.
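Resolving an alias against its chain is a first-match scan, sketched here. The alias and model names are placeholders, and `is_available` stands in for whatever circuit breaker lookup the router uses.

```python
from typing import Callable, Dict, List, Optional


def resolve_with_fallback(alias: str,
                          chains: Dict[str, List[str]],
                          is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first available model in the alias's ordered chain, else None."""
    for candidate in chains.get(alias, []):
        if is_available(candidate):
            return candidate
    return None


# Placeholder chain; in practice this comes from version-controlled rules.
CHAINS = {"summarise-default": ["model-a", "model-b", "model-c"]}
open_circuits = {"model-a"}

selected = resolve_with_fallback(
    "summarise-default", CHAINS, lambda model: model not in open_circuits
)
print(selected)
```

Because the chain is plain ordered data, the approved substitution hierarchy can be reviewed in a pull request without reading router code.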

A/B and Shadow Routing are layered on top of the primary strategy. A/B routing sends a configurable percentage of traffic to a candidate model, comparing outputs against the primary on the organisation's quality metrics. Shadow routing duplicates requests to a candidate model asynchronously without serving its response to the consumer; this enables zero-risk production traffic evaluation. Both mechanisms write routing experiment metadata to the Evaluation Framework (EAAPL-PLT008) for analysis.
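A sketch of the A/B split, assuming variant assignment is derived from a hash of the experiment and request IDs rather than a random draw. This is a substitution for illustration: hashing makes the assignment reproducible from the audit log, whereas the pattern itself does not prescribe how the percentage is applied. `assign_variant` is a hypothetical name.

```python
import hashlib


def assign_variant(experiment_id: str, request_id: str,
                   candidate_percent: float) -> str:
    """Map a request to "candidate" or "primary" for one experiment.

    candidate_percent is in the range 0 to 100, with 0.01 granularity.
    The same (experiment_id, request_id) pair always gets the same variant.
    """
    digest = hashlib.sha256(f"{experiment_id}:{request_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "candidate" if bucket < candidate_percent * 100 else "primary"


print(assign_variant("exp-2025-07", "req-0001", candidate_percent=10))
```

Including the experiment ID in the hash keeps assignments independent across experiments, so a request that lands in the candidate group of one experiment is not systematically in the candidate group of another.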

Circuit Breaker Integration makes routing resilient. Each model endpoint has an associated circuit breaker tracking success rate and latency over a rolling window. When a circuit opens, the router excludes that endpoint from selection for the duration of the open window (configurable, typically 60 seconds). After the open window, a half-open state tests with a single request. This means the router inherently implements provider failover without a separate failover mechanism.
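The closed, open, and half-open cycle can be sketched as a small state machine. For brevity this version opens on a count of consecutive failures, which is a simplification of the rolling success-rate and latency tracking described above; the class name, threshold, and method names are assumptions.

```python
class CircuitBreaker:
    """Per-endpoint breaker: closed -> open -> half_open -> closed or open."""

    def __init__(self, failure_threshold: int = 5, open_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._probe_in_flight = False

    def allow_request(self, now: float) -> bool:
        # After the open window elapses, move to half-open.
        if self.state == "open" and now - self._opened_at >= self.open_seconds:
            self.state = "half_open"
            self._probe_in_flight = False
        if self.state == "closed":
            return True
        if self.state == "half_open" and not self._probe_in_flight:
            self._probe_in_flight = True  # admit exactly one probe request
            return True
        return False

    def record_success(self) -> None:
        self.state = "closed"
        self._failures = 0
        self._probe_in_flight = False

    def record_failure(self, now: float) -> None:
        self._failures += 1
        # A failed probe, or too many consecutive failures, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = now
            self._probe_in_flight = False
```

The router calls `allow_request` during candidate selection and reports the outcome of each upstream call back to the breaker. With several router replicas, the state would live in a shared store instead of instance attributes.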

Routing Rules as Code is a first-class governance principle. All routing configuration—strategy assignments per use case and consumer, fallback chains, A/B experiment configurations, capability requirements, cost tier thresholds—is expressed in a structured configuration format (YAML/JSON) stored in the platform's Git repository. Changes go through pull request review with platform team approval and are applied to the routing engine via a configuration deployment pipeline. Every routing configuration version is recorded in the audit log alongside the routing decisions it produced.
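A configuration deployment pipeline would typically validate the rules before applying them. The sketch below assumes the YAML/JSON file has already been parsed into a dictionary; the key names (`default_strategy`, `routes`, `fallback_chains`) and the strategy identifiers are hypothetical, since this pattern does not fix a schema.

```python
from typing import List, Set

# Hypothetical strategy identifiers.
ALLOWED_STRATEGIES = {"static", "cost-based", "latency-based", "capability-based"}


def validate_rules(rules: dict, registry_models: Set[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if rules.get("default_strategy") not in ALLOWED_STRATEGIES:
        errors.append("default_strategy missing or unknown")
    for index, route in enumerate(rules.get("routes", [])):
        for key in ("team", "use_case", "strategy"):
            if key not in route:
                errors.append(f"routes[{index}]: missing '{key}'")
        strategy = route.get("strategy")
        if strategy is not None and strategy not in ALLOWED_STRATEGIES:
            errors.append(f"routes[{index}]: unknown strategy '{strategy}'")
    for alias, chain in rules.get("fallback_chains", {}).items():
        if not chain:
            errors.append(f"fallback_chains[{alias}]: chain is empty")
        for model in chain:
            if model not in registry_models:
                errors.append(f"fallback_chains[{alias}]: '{model}' not in registry")
    return errors


def lookup_strategy(rules: dict, team: str, use_case: str) -> str:
    """Strategy for a consumer and use case, falling back to the global default."""
    for route in rules.get("routes", []):
        if route.get("team") == team and route.get("use_case") == use_case:
            return route["strategy"]
    return rules["default_strategy"]


RULES = {
    "default_strategy": "static",
    "routes": [
        {"team": "team-marketing", "use_case": "summarisation", "strategy": "cost-based"},
    ],
    "fallback_chains": {"summarise-default": ["model-a", "model-b"]},
}

print(validate_rules(RULES, registry_models={"model-a", "model-b"}))
print(lookup_strategy(RULES, "team-marketing", "summarisation"))
```

Running `validate_rules` as a pull request check means a chain that references an unregistered model is rejected at review time rather than discovered during a failover.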

5. Architecture Diagram

flowchart TD
    subgraph Request["Request + Config"]
        A[Incoming Request]
        B[Routing Rules GitOps]
        C[Model Registry]
    end
    subgraph Router["Model Router"]
        D[Intent Classifier]
        E[Strategy Engine]
        F{Circuit Breaker}
    end
    subgraph Models["Model Endpoints"]
        G[Frontier Tier]
        H[Mid-Cost Tier]
        I[Efficiency Tier]
    end
    A --> D
    B --> E
    C --> E
    D --> E
    E --> F
    F -->|primary| G
    F -->|cost route| H
    F -->|efficiency| I
    E --> J[(Routing Audit Log)]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#fef9c3,stroke:#eab308
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#dbeafe,stroke:#3b82f6
    style H fill:#dbeafe,stroke:#3b82f6
    style I fill:#dbeafe,stroke:#3b82f6
    style J fill:#fef9c3,stroke:#eab308

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Intent Classifier	Service	Estimate request complexity; extract use-case signals	Rule-based lookup, lightweight ML classifier (DistilBERT), regex patterns	High
Routing Strategy Engine	Service	Apply configured strategy to produce ranked model list	Custom rule engine, LiteLLM router, Envoy route configuration	Critical
Circuit Breaker State Store	Service	Maintain per-endpoint health state (closed/open/half-open)	Redis, in-memory (single instance), Resilience4j	Critical
A/B Traffic Splitter	Service	Distribute traffic according to experiment configuration	Custom weighted random, LaunchDarkly, feature flag service	Medium
Shadow Router	Service	Duplicate requests to shadow model asynchronously	Async task queue (Celery, asyncio), Kafka producer	Medium
Routing Rules Store	Configuration	Version-controlled routing configuration	Git repository + ConfigMap (Kubernetes), Consul K/V	High
Real-Time Metrics Collector	Service	Maintain sliding window of model performance metrics	Prometheus, in-memory metrics cache with TTL	High
Model Registry Client	Service	Query model capability cards for capability-based routing	gRPC/HTTP client to Model Registry service	High
Routing Decision Logger	Service	Write routing decision record to audit log	Async writer to Kafka/OpenTelemetry	High
Evaluation Integration	Service	Publish A/B results to Evaluation Framework	REST/event client to PLT008	Medium

7. Data Flow

Primary Flow — Cost-Based Routing Request

Step	Actor	Action	Output
1	Incoming Request	Arrive at router with use-case tag `summarisation` and consumer team `team-marketing`	Request context with metadata
2	Intent Classifier	Look up `summarisation` in use-case taxonomy; estimate complexity as LOW from prompt token count	Complexity: LOW; Use case: summarisation
3	Routing Strategy Selector	Look up `team-marketing` + `summarisation` in routing rules; find strategy: `cost-based`	Strategy: cost-based
4	Cost-Based Strategy	Map LOW complexity to Tier 3 efficiency models; retrieve list: [Claude Haiku, GPT-4o-mini]	Candidate list: [Claude Haiku, GPT-4o-mini]
5	Circuit Breaker Check	Check circuit state for Claude Haiku (CLOSED) and GPT-4o-mini (CLOSED)	Both available
6	Final Selector	Select Claude Haiku (primary preference in rules); check A/B config — no active experiment for this consumer	Selected: Claude Haiku endpoint
7	Routing Decision Log	Emit routing record: {request_id, strategy, candidates, selected, reason, timestamp}	Audit log record written
8	Upstream Proxy	Forward request to Claude Haiku endpoint	Model response

Error Flow

Error Condition	Detection	Action	Consumer Impact
Primary model circuit open	Circuit breaker state check at step 5	Advance to next candidate in fallback chain	Transparent; higher cost model may be used
All candidates circuit open	Step 5 all candidates unavailable	Return 503 with routing-exhausted code; trigger incident alert	Service degraded; no AI response
Capability mismatch (no capable model available)	Capability filter produces empty list	Return 422 with no-capable-model code	Consumer must adjust request parameters
Routing rules not found for use case	Strategy selector miss	Apply default strategy (configured globally)	Potential non-optimal routing; logs warning
Intent classification timeout	<10ms budget exceeded	Apply default routing strategy without classification	Routing proceeds; log classification timeout
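The two hard-failure rows of the table (empty capability filter and exhausted candidates) can be sketched as a single selection function that raises a typed error. The status codes and error codes come from the table; `RoutingError` and `select_endpoint` are hypothetical names.

```python
from typing import Callable, List, Set


class RoutingError(Exception):
    """Routing failure carrying the HTTP status and error code for the consumer."""

    def __init__(self, status: int, code: str):
        super().__init__(f"{status} {code}")
        self.status = status
        self.code = code


def select_endpoint(candidates: List[str],
                    capable: Set[str],
                    circuit_closed: Callable[[str], bool]) -> str:
    """Pick the first capable candidate whose circuit is closed."""
    eligible = [c for c in candidates if c in capable]
    if not eligible:
        # Capability filter produced an empty list.
        raise RoutingError(422, "no-capable-model")
    for candidate in eligible:
        if circuit_closed(candidate):
            return candidate
    # Every capable candidate has an open circuit.
    raise RoutingError(503, "routing-exhausted")


try:
    select_endpoint(["model-a", "model-b"], {"model-a", "model-b"}, lambda m: False)
except RoutingError as err:
    print(err.status, err.code)
```

Separating the two errors matters to consumers: a 422 calls for changing the request, while a 503 calls for retrying later.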

8. Security Considerations

Authentication and Authorisation

Model selection must not be manipulable by consumer input beyond the declared use-case tag; raw model names in consumer requests are validated against the models authorised for that consumer
Team-level routing overrides require platform team approval; they are stored in the version-controlled routing rules, not consumer-controllable at request time
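A sketch of the model-name validation described in the first point, assuming each team has an allow-list of authorised models. Falling back to the platform default on an unauthorised hint is one possible policy (rejecting the request outright is another); the function name and reason strings are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def resolve_model_hint(team: str,
                       hint: Optional[str],
                       authorised: Dict[str, Set[str]],
                       default_model: str) -> Tuple[str, str]:
    """Return (model, reason); the reason is recorded in the routing audit log."""
    if hint is None:
        return default_model, "no-hint"
    if hint in authorised.get(team, set()):
        return hint, "hint-accepted"
    # Assumption: ignore unauthorised hints instead of failing the request.
    return default_model, "hint-rejected"


AUTHORISED = {"team-marketing": {"model-a", "model-b"}}
print(resolve_model_hint("team-marketing", "model-z", AUTHORISED, "model-a"))
```

Returning the reason alongside the model means a rejected hint still leaves a trace in the routing decision record.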

Secrets Management

Model provider credentials for each endpoint are retrieved from Secrets Manager at routing decision time; credentials are not embedded in routing rules
Shadow routing uses separate credentials with read-only scoping where possible, to prevent the shadow model from being used for mutations

Data Classification and Encryption

Routing decisions involving RESTRICTED or CONFIDENTIAL data are logged with the classification label for audit trail completeness
Shadow requests must be subject to the same data classification and policy enforcement as primary requests

Auditability

Every routing decision is logged with: strategy applied, candidates considered, circuit breaker states, selected endpoint, reason code, any experiment configuration active
Routing configuration changes are version-controlled and auditable as Git commits with author, timestamp, and review record

OWASP LLM Top 10 Controls

OWASP LLM Risk	Routing-Layer Control
LLM01 Prompt Injection	Routing does not modify prompts; injection risk handled at gateway layer
LLM04 Model DoS	Circuit breaker prevents failed model from absorbing continued traffic
LLM05 Supply Chain	Only models in the approved registry are eligible routing targets
LLM09 Overreliance	Routing logs which model produced each response; enables per-model quality monitoring

9. Governance Considerations

Responsible AI

Routing rules must not route high-risk AI use cases to models without a completed Model Risk Card
A/B experiments involving high-risk use cases require explicit Governance Board approval before activation
Shadow routing results feed into model evaluation decisions that are recorded in the Evaluation Framework

Model Risk Management

The routing fallback chain defines the approved substitution hierarchy; arbitrary model substitution is not permitted
When a new model is added to the registry and routing rules, a Model Risk Card delta review is required comparing the new model to existing candidates
Routing telemetry (which model served which volume of requests) is a key input to the quarterly model risk review

Governance Artefacts

Artefact	Owner	Cadence	Location
routing-rules.yaml	Platform Team	Per change via PR	Git repository
A/B experiment registry	Platform Team + Model Owner	Per experiment	Evaluation Framework
Fallback chain approval records	Platform Governance Board	Per change	GRC system / Git PR comments
Routing telemetry report	Platform Team	Monthly	Observability dashboard
Model substitution impact assessment	Risk Team	Per fallback chain change	Model Registry

10. Operational Considerations

Monitoring

Signal	Source	Alert Threshold	Owner
Fallback activation rate	Routing decision log	>5% of requests using non-primary model	Platform On-Call
Circuit breaker state changes	Circuit breaker events	Any circuit opening	Platform On-Call + Model Owner
Intent classification error rate	Intent classifier metrics	>1% classification errors	Platform Team
Routing rule miss rate	Routing engine logs	>0.1% requests hitting default fallback	Platform Team
A/B experiment quality delta	Evaluation Framework	Statistically significant quality degradation in B variant	Platform Team + Product Owner

SLOs

SLO	Target	Window
Routing decision latency P99	<15ms (overhead beyond gateway)	Rolling 7 days
Routing availability (decisions produced)	99.99%	Rolling 30 days
Fallback success rate	>99% of requests served even when primary unavailable	Rolling 30 days
Circuit breaker false positive rate	<0.1% circuits opened without actual provider failure	Rolling 30 days

Logging

Routing decisions logged as structured JSON with correlation to the gateway request ID
Circuit breaker state transitions logged separately for operational analysis
A/B experiment decisions include experiment ID and variant for analysis join
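A sketch of one structured routing decision record, using the fields named in the data flow (request ID, strategy, candidates, selected, reason, timestamp) plus the rules version and optional experiment details. The exact key names and the `build_decision_record` function are assumptions.

```python
import json
from datetime import datetime, timezone
from typing import List, Optional


def build_decision_record(request_id: str,
                          strategy: str,
                          candidates: List[str],
                          selected: str,
                          reason: str,
                          rules_version: str,
                          experiment: Optional[dict] = None,
                          now: Optional[datetime] = None) -> str:
    """Serialise one routing decision as a single JSON log line."""
    record = {
        "request_id": request_id,        # correlates with the gateway request ID
        "strategy": strategy,
        "candidates": candidates,
        "selected": selected,
        "reason": reason,
        "rules_version": rules_version,  # e.g. Git commit of the routing rules
        "experiment": experiment,        # {"id": ..., "variant": ...} or None
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
    }
    return json.dumps(record, sort_keys=True)


print(build_decision_record(
    request_id="req-0001",
    strategy="cost-based",
    candidates=["model-a", "model-b"],
    selected="model-a",
    reason="primary-preference",
    rules_version="abc1234",
))
```

Keeping the experiment ID and variant in the same record as the request ID lets the analysis join run on a single stream.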

Incident Response

Incident	Detection	Response	RTO
Routing engine crash	Health check failure; 100% routing errors	Kubernetes pod restart; DNS failover to secondary	2 min
All circuits open (full blackout)	Zero successful upstream calls	Activate static fallback responses; page platform + engineering leadership	5 min
Routing misconfiguration deployed	Fallback rate spike after deployment	Rollback routing-rules.yaml via GitOps; circuit breakers reset	10 min

Disaster Recovery

Component	RPO	RTO	Strategy
Routing engine (stateless)	0	2 min	Multi-replica; pod auto-restart
Routing rules config	0	5 min	Git-backed; ConfigMap reload
Circuit breaker state (Redis)	5 min	2 min	Redis Sentinel; acceptable brief stale state
Routing decision audit log	<1 min	10 min	Kafka replication + S3 cross-region

11. Cost Considerations

Cost Drivers

Driver	Description	Relative Weight
Routing engine compute	Stateless; minimal CPU; scales with request count	Very Low
Intent classifier inference	If ML-based, adds per-request compute	Low
Circuit breaker state (Redis)	Small memory footprint	Very Low
Cost savings from tier routing	Negative cost — 30–50% reduction in model API spend	Dominant positive ROI

Optimisations

Most valuable optimisation: aggressive Tier 3 routing for high-volume, low-complexity tasks (summarisation, classification, entity extraction)
Intent classifier should be rule-based for speed (latency budget <5ms) unless complexity estimation materially improves routing quality
Cache routing decisions for identical consumer + use-case combinations with short TTL (1 minute) to reduce routing computation
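The caching optimisation can be sketched as a small TTL map keyed on consumer and use case. `RoutingDecisionCache` is a hypothetical name, and the clock is passed in explicitly to keep the example testable.

```python
from typing import Dict, Optional, Tuple


class RoutingDecisionCache:
    """Short-lived cache of routing decisions keyed by (team, use_case)."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl_seconds = ttl_seconds
        self._entries: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def get(self, team: str, use_case: str, now: float) -> Optional[str]:
        entry = self._entries.get((team, use_case))
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at >= self.ttl_seconds:
            del self._entries[(team, use_case)]  # expired
            return None
        return value

    def put(self, team: str, use_case: str, value: str, now: float) -> None:
        self._entries[(team, use_case)] = (value, now)

    def clear(self) -> None:
        """Drop everything, e.g. when a circuit opens or rules are redeployed."""
        self._entries.clear()


cache = RoutingDecisionCache(ttl_seconds=60.0)
cache.put("team-marketing", "summarisation", "model-a", now=0.0)
print(cache.get("team-marketing", "summarisation", now=30.0))
```

A cached decision can outlive a circuit state change by up to one TTL, so clearing the cache on circuit transitions and rule deployments is worth considering. Requests inside an active A/B experiment would likely bypass the cache so that the traffic split stays accurate.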

Indicative Cost Range

Scenario	Monthly Cost Impact	Notes
Routing infrastructure, any scale	$100–$500/month	Routing engine is minimal compute; ROI is entirely from model cost savings
Cost savings at medium scale (10M tokens/day)	-$3,000 to -$8,000/month	From tier routing directing 60% of traffic to Tier 3 models
Cost savings at large scale (100M tokens/day)	-$30,000 to -$80,000/month	Tier routing ROI dominates; dedicated cost optimisation team warranted

12. Trade-Off Analysis

Routing Strategy Options

Strategy	Description	Pros	Cons	Best For
Static Routing	Fixed model per use-case; no dynamic selection	Simplest; predictable; easy to audit	No cost optimisation; no failover	Initial deployment; highly regulated use cases
Cost-Based Routing	Route to cheapest model meeting quality threshold	30–50% cost reduction	Requires quality benchmarks; threshold tuning effort	High-volume, mixed-complexity workloads
Capability-Based Routing	Filter by capability; then cost or latency within capable set	Accurate capability matching; prevents capability-mismatch errors	Requires maintained capability cards in registry	Multi-model deployments with specialised models
ML-Based Routing	Classify request complexity with ML model; route accordingly	Most accurate tier assignment	Adds latency; ML model requires training and maintenance	Very high volume where marginal accuracy gains justify overhead

Intent Classification Options

Option	Latency	Accuracy	Maintenance	Best For
Rule-based (use-case tag lookup)	<1ms	Depends on caller discipline	Low	Structured internal API with disciplined callers
Regex + heuristics on prompt	1–5ms	Moderate	Low-Medium	General purpose with structured prompts
Lightweight ML classifier	5–15ms	High	Medium	High-volume workloads where routing accuracy has large cost impact

Architectural Tensions

Tension	Option A	Option B	Resolution
Routing transparency vs. complexity	Expose routing decision to consumers	Black box	Include X-Model-Used header in response; audit log accessible to consumers for own requests
Routing speed vs. accuracy	Rule-based (fast, less accurate)	ML classifier (slower, more accurate)	Rule-based default; ML opt-in for high-volume use cases where ROI justifies latency
Consumer control vs. platform governance	Allow consumers to specify exact model	Platform controls all routing	Allow model family hints; platform selects within family; override audited
Failover quality vs. consistency	Always fail over to available model	Return error if preferred model unavailable	Fail-over default for availability; consumer can opt for fail-fast if consistency required

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Intent classifier crash	Medium	Medium — all requests use default routing	Classifier health check; default routing rate spikes	Restart classifier; default routing adequate in interim
Routing rules desync (ConfigMap stale)	Low	Medium — requests using outdated routing policy	Rules version mismatch alert	Force ConfigMap reload; GitOps pipeline re-applies
Circuit breaker stuck open (false positive)	Low	Medium — model excluded despite being healthy	Provider health check succeeds while circuit open	Manual circuit reset; post-incident investigation
A/B experiment misconfiguration (100% to B)	Low	High — all traffic to unvalidated model	Traffic split monitoring alert	Rollback experiment config; route to primary
Model capability card stale in registry	Medium	Low-Medium — capability routing sends to incapable model	Capability mismatch error from model	Update registry; add error handler for capability mismatch

Cascading Scenario

Mass circuit opening storm: Under a broad cloud provider degradation, multiple circuits open simultaneously. The router falls back to the next tier for all requests. If the fallback tier is also degraded (same cloud region), the cascade proceeds through all fallback candidates and the router returns 503 for all requests. Mitigation: fallback chains must span cloud providers or include on-premises/alternative-region endpoints.
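The mitigation lends itself to an automated check on the routing rules. The sketch below flags chains whose candidates all share one provider; the provider mapping would come from the Model Registry, and all names here are placeholders.

```python
from typing import Dict, List


def chains_lacking_diversity(chains: Dict[str, List[str]],
                             provider_of: Dict[str, str]) -> List[str]:
    """Aliases whose fallback chain does not span at least two providers."""
    return sorted(
        alias for alias, chain in chains.items()
        if len({provider_of[model] for model in chain}) < 2
    )


PROVIDER_OF = {"model-a": "provider-x", "model-b": "provider-x", "model-c": "provider-y"}
CHAINS_TO_CHECK = {
    "summarise-default": ["model-a", "model-b"],  # single provider
    "chat-default": ["model-a", "model-c"],       # spans two providers
}
print(chains_lacking_diversity(CHAINS_TO_CHECK, PROVIDER_OF))
```

The same check could key on cloud region instead of provider, since a regional degradation can take out several providers hosted in the same region.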

14. Regulatory Considerations

EU AI Act Article 9

Routing decisions must be recorded to demonstrate that the risk management system controls which models process which use cases; the routing audit log satisfies this requirement
High-risk AI systems must not fall back automatically to lower-quality or unapproved models unless human oversight has been configured for that substitution

NIST AI RMF MAP 2.1

The routing configuration explicitly documents the intended deployment context for each model, satisfying MAP 2.1's requirement to document AI deployment context

Audit and Record-Keeping

Routing decision logs must be retained for the same period as the AI system's operational records (typically 7 years for regulated decisions)
Routing configuration Git history constitutes an auditable record of every routing policy change with author and approval

15. Reference Implementations

AWS

Component	AWS Service
Routing engine	LiteLLM Proxy on ECS, or custom Lambda function
Circuit breaker state	ElastiCache Redis
Routing rules	SSM Parameter Store or S3 config object
Intent classifier	Lambda + custom rules, or SageMaker endpoint (ML-based)
Model endpoints	Bedrock (Claude, Llama, Titan), SageMaker endpoints for self-hosted

Azure

Component	Azure Service
Routing engine	APIM with AI routing policies, or custom AKS deployment
Circuit breaker	APIM native circuit breaker policy
Routing rules	App Configuration
Model endpoints	Azure OpenAI multiple deployments

GCP

Component	GCP Service
Routing engine	Cloud Run service with LiteLLM or custom Python
Circuit breaker	Custom Redis-backed on Memorystore
Model endpoints	Vertex AI multiple model deployments

On-Premises

Component	Technology
Routing engine	LiteLLM Proxy or custom Python/Go service
Circuit breaker	Resilience4j (Java) or custom Redis-backed
Routing rules	Consul K/V or Git-synced ConfigMap
Model endpoints	vLLM serving multiple models on GPU cluster

16. Related Patterns

Pattern ID	Name	Relationship
EAAPL-PLT001	Enterprise AI Platform	Parent — routing is a core capability of the platform
EAAPL-PLT002	AI API Gateway	Host — routing executes within or behind the gateway
EAAPL-PLT004	LLM Cost Control	Complementary — cost-based routing is primary cost control lever
EAAPL-PLT008	AI Experiment Tracking	Dependency — A/B and shadow routing results feed experiment tracking
EAAPL-INT007	AI Circuit Breaker	Component — circuit breaker is embedded within routing

17. Maturity Assessment

Overall Maturity: Proven

Model routing is production-proven across dozens of enterprise deployments. LiteLLM and Kong AI Gateway provide mature implementations. ML-based intent classification is still an emerging practice; rule-based routing is the proven approach.

Scoring Matrix

Dimension	Score (1–5)	Rationale
Pattern Completeness	5	All sections documented
Implementation Evidence	4	Core routing proven; ML-based intent classification less so
Tooling Stability	4	LiteLLM router mature; ML classification tooling evolving
Regulatory Alignment	4	Audit logging mapped; specific regulatory requirements vary by use case
Cost ROI Evidence	5	Consistent 30–50% cost reduction reported across multiple deployments

18. Revision History

Version	Date	Author	Changes
1.0	2024-03-10	EAAPL Working Group	Initial publication
1.1	2024-09-15	EAAPL Working Group	Added A/B and shadow routing sections; ML-based intent classification
1.2	2025-06-12	EAAPL Working Group	Cost savings data updated; cascading failure scenario added; GCP reference added
