[EAAPL-PLT004] LLM Cost Control
Category: Platform Engineering
Sub-category: FinOps / Cost Management
Version: 1.2
Maturity: Proven
Tags: finops, cost-management, token-budget, prompt-caching, model-tier-routing, cost-alerting, chargeback, spending-dashboards
Regulatory Relevance: APRA CPS 230 (Operational Risk — cost controls), ISO 42001
1. Executive Summary
LLM inference costs have a property that few earlier enterprise technologies share: they scale with usage in ways that stay invisible until the cloud bill arrives. A single poorly scoped prompt with an unbounded context window can consume more compute in one request than an hour of traditional API calls. Without systematic controls, one runaway AI feature or misconfigured pipeline can generate tens of thousands of dollars in unexpected spend within hours.
This pattern establishes a comprehensive cost control framework covering the full lifecycle of an LLM request: upfront budget enforcement (token limits per request, per consumer, and per time period), intelligent routing to cost-appropriate model tiers, prompt caching to eliminate redundant computation, batch versus real-time optimisation, and real-time spend alerting with dashboard visibility for FinOps and engineering leadership. Organisations that implement this pattern systematically report a 40–60% reduction in LLM spend compared with an unmanaged baseline while maintaining feature quality, so AI investment scales with genuine business value rather than inefficiency.
2. Problem Statement
Business Problem
LLM API costs appear as undifferentiated cloud charges with no attribution to products, teams, or decisions. When spend spikes, root cause analysis takes days. Budget sign-off for AI initiatives is difficult because cost projections are unreliable. AI spend is growing faster than business value in organisations without controls, triggering executive concern about AI investment ROI.
Technical Problem
Individual LLM requests have highly variable token consumption based on prompt construction, context window usage, and response length. Without per-request token limits, a buggy prompt template can send 100K-token requests when 2K was intended. Without model tier routing, all requests use frontier model pricing. Without caching, identical or near-identical prompts are computed fresh on every call. Without budget enforcement, a single batch job can exhaust a monthly budget.
Symptoms
- Monthly AI cloud bills with variance >50% month-to-month without corresponding business activity change
- No ability to attribute AI spend to individual products, teams, or features
- AI cost anomalies discovered only retrospectively, when the bill arrives, rather than through real-time alerts
- All AI traffic routed to the most expensive model regardless of task requirements
- Identical FAQ-style prompts computed fresh on every call with no caching
Cost of Inaction
- AI spend growing to unsustainable levels, threatening AI investment programme shutdown
- Executive loss of confidence in AI ROI due to uncontrolled cost growth
- Inability to negotiate volume discounts with providers without consolidated spend data
- Cross-team cost externalities: one team's runaway workload degrades token budget for all teams
3. Context
When to Apply
- Organisation's monthly AI API spend exceeds $5,000 or is projected to exceed this within 3 months
- Multiple teams or use cases share AI infrastructure without cost isolation
- FinOps team requires per-team or per-product cost attribution
- AI cost efficiency is an explicit KPI for the AI programme
When NOT to Apply
- Single small-scale proof of concept: overhead of full cost control not warranted
- Single team with a single predictable, fixed-cost workload: direct budget monitoring sufficient
- Air-gapped self-hosted deployments with no per-token cost: infrastructure cost management applies instead
Prerequisites
- AI API Gateway (PLT002) as enforcement point for budget controls
- Cost allocation taxonomy agreed between FinOps and engineering (team/product/environment dimensions)
- Observability stack for real-time cost event ingestion
- Stakeholder agreement on what constitutes a budget threshold and escalation path
Industry Applicability
| Industry | Applicability | Key Cost Driver |
| Technology / SaaS | Very High | AI features at scale; customer-facing token consumption |
| Retail / E-commerce | Very High | Product descriptions, search, personalisation at catalog scale |
| Financial Services | High | Research automation, document processing, customer service |
| Healthcare | High | Clinical documentation, patient communication at volume |
| Media / Content | Very High | Content generation, summarisation, moderation at scale |
| Government | Medium | Document processing; typically lower volume |
4. Architecture Overview
The LLM Cost Control pattern operates across three time horizons: per-request controls that enforce hard limits on individual calls, per-period budget controls that enforce cumulative spending limits over time windows (daily/weekly/monthly), and strategic optimisations that systematically reduce the per-token cost of all traffic.
Per-Request Token Budget Enforcement is the first line of defence. Every request entering the AI API Gateway is evaluated for its estimated token consumption. The max_tokens parameter is enforced as a hard ceiling; requests without an explicit max_tokens receive a platform default (configurable per model tier and use case). Input token limits per request prevent context window abuse: a request exceeding the configured input token limit for its use case classification is rejected with a 413 response and a recommendation to use the batch API instead. This single control eliminates the most common cause of surprise cost spikes.
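As a rough illustration of the per-request checks described above, the sketch below applies a platform default when max_tokens is missing and rejects oversized inputs with a 413. The limits table, the estimate_tokens heuristic (a stand-in for a real tokenizer such as tiktoken), and the 400 response for a max_tokens value above the ceiling are assumptions made for this example, not part of the pattern specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UseCaseLimits:
    max_input_tokens: int      # input ceiling for the use-case class
    default_max_tokens: int    # applied when the caller omits max_tokens
    max_output_ceiling: int    # hard ceiling on caller-supplied max_tokens


# Hypothetical configuration: the 8192 input ceiling mirrors the summarisation
# example in the data flow; the output values are placeholders.
LIMITS = {"summarisation": UseCaseLimits(8192, 1024, 2048)}


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer (assumes roughly 4 characters per token)."""
    return max(1, len(text) // 4)


def enforce_request(use_case: str, prompt: str, max_tokens: Optional[int] = None) -> dict:
    limits = LIMITS[use_case]
    input_tokens = estimate_tokens(prompt)

    # Oversized input: reject and point the caller at the batch API.
    if input_tokens > limits.max_input_tokens:
        return {"status": 413, "input_tokens": input_tokens,
                "limit": limits.max_input_tokens, "hint": "consider the batch API"}

    # Missing max_tokens: fall back to the platform default.
    if max_tokens is None:
        return {"status": 200, "input_tokens": input_tokens,
                "max_tokens": limits.default_max_tokens}

    # Caller asked for more output than the ceiling allows (assumed 400 here).
    if max_tokens > limits.max_output_ceiling:
        return {"status": 400, "limit": limits.max_output_ceiling}

    return {"status": 200, "input_tokens": input_tokens, "max_tokens": max_tokens}
```

In a real gateway the estimate would come from the tokenizer that matches the target model, since a character-based heuristic can be off by a wide margin.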
Consumer and Team Budget Tracking maintains real-time token consumption counters per consumer, team, project, and environment. These counters are maintained in a Redis data structure (sorted sets for time-windowed aggregation) and updated atomically on every request completion. Budget thresholds are configured at multiple levels: a soft warning threshold (80% of period budget consumed → alert to team lead), a hard throttle threshold (100% → requests rate-limited to a configured percentage), and an emergency ceiling (110% → requests blocked entirely until human approval to extend). The tiered response prevents hard stops from creating operational incidents while still enforcing accountability.
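The tiered response can be expressed as a small pure function. The sketch below covers only the threshold evaluation; it assumes the consumed and budgeted token counts have already been read from the counter store, and the BudgetAction names are hypothetical labels chosen for this example.

```python
from enum import Enum


class BudgetAction(Enum):
    ALLOW = "allow"
    WARN = "warn"          # 80%: alert the team lead, request continues
    THROTTLE = "throttle"  # 100%: rate-limit to a configured percentage
    BLOCK = "block"        # 110%: block until a human approves an extension


def evaluate_budget(used_tokens: int, budget_tokens: int) -> BudgetAction:
    """Map period consumption to the tiered response described above."""
    if budget_tokens <= 0:
        raise ValueError("budget_tokens must be positive")
    # Integer arithmetic avoids floating-point edge cases at exact thresholds.
    if used_tokens * 100 >= budget_tokens * 110:
        return BudgetAction.BLOCK
    if used_tokens * 100 >= budget_tokens * 100:
        return BudgetAction.THROTTLE
    if used_tokens * 100 >= budget_tokens * 80:
        return BudgetAction.WARN
    return BudgetAction.ALLOW
```

For example, a team that has used 9.6M of a 12M monthly budget sits exactly on the 80% line and would receive a warning while its requests continue.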
Model Tier Routing (see EAAPL-PLT003 for full treatment) is the largest lever for strategic cost reduction. The cost control layer maintains a cost model for each available model endpoint (cost per 1K input tokens, cost per 1K output tokens) and uses this in conjunction with the routing strategy to route each request to the cheapest model meeting the quality requirement. The cost model is updated automatically from provider pricing APIs where available. A/B routing experiments track cost efficiency alongside quality to inform routing policy updates.
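A minimal sketch of cost-based selection follows, assuming each endpoint carries a per-1K-token price pair and an ordinal quality tier. The quality_tier field, the model names, and the prices are illustrative assumptions; the full routing strategy is covered in EAAPL-PLT003.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ModelCost:
    name: str
    quality_tier: int          # hypothetical ordinal: higher means more capable
    usd_per_1k_input: float
    usd_per_1k_output: float

    def estimate_usd(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens / 1000) * self.usd_per_1k_input \
            + (output_tokens / 1000) * self.usd_per_1k_output


def cheapest_adequate(models: Sequence[ModelCost], required_tier: int,
                      input_tokens: int, output_tokens: int) -> ModelCost:
    """Pick the lowest estimated cost among models meeting the quality bar."""
    candidates = [m for m in models if m.quality_tier >= required_tier]
    if not candidates:
        raise LookupError("no model meets the required quality tier")
    return min(candidates, key=lambda m: m.estimate_usd(input_tokens, output_tokens))


# Illustrative prices only; real values would come from the cost model store.
MODELS = [
    ModelCost("efficiency-model", 1, 0.0001, 0.0002),
    ModelCost("frontier-model", 2, 0.003, 0.015),
]
```

With these placeholder prices, a low-complexity request (required tier 1) resolves to the efficiency model, while a tier-2 request can only be served by the frontier model.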
Prompt Caching operates at two levels. Provider-side prompt caching (supported by Anthropic Claude and OpenAI) caches the KV computation for prompt prefixes at the model provider level; this requires structuring prompts with stable system prompt prefixes at the beginning of the context. Platform-side semantic caching (PLT006) caches full responses for near-identical prompts at the gateway level. Both mechanisms reduce effective token consumption; the platform cost model tracks cache hit rates and attributable savings separately so the value of caching investment is visible.
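To make cache savings visible in the cost model, the billed input cost can be compared with what the same request would have cost uncached. The sketch below assumes cached prefix tokens are billed at a fraction of the base input price; the cache_read_multiplier default of 0.1 is an illustrative assumption (providers differ, and some also charge a premium for cache writes, which this example ignores).

```python
def input_cost_with_cache(total_input_tokens: int, cached_tokens: int,
                          usd_per_1k_input: float,
                          cache_read_multiplier: float = 0.1) -> dict:
    """Return billed input cost, attributable savings, and hit ratio."""
    if not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")

    uncached_tokens = total_input_tokens - cached_tokens
    full_price_usd = (total_input_tokens / 1000) * usd_per_1k_input
    # Cached tokens are charged at the discounted rate, uncached at full price.
    billable_units = uncached_tokens + cached_tokens * cache_read_multiplier
    billed_usd = (billable_units / 1000) * usd_per_1k_input

    hit_ratio = cached_tokens / total_input_tokens if total_input_tokens else 0.0
    return {"billed_usd": billed_usd,
            "saved_usd": full_price_usd - billed_usd,
            "hit_ratio": hit_ratio}
```

Recording saved_usd per request alongside the cost event is one way to report caching value separately from raw spend.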
Batch vs. Real-Time Optimisation provides a structural cost reduction for non-interactive workloads. The cost control layer routes requests tagged as execution-mode: batch through provider batch APIs (OpenAI Batch API, Anthropic Message Batches), which offer a 50% token cost reduction in exchange for asynchronous completion within a window of up to 24 hours. Product teams are guided to tag their use cases appropriately during onboarding; the developer portal surfaces the cost differential to encourage correct classification.
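The cost differential surfaced to product teams is simple arithmetic. The sketch below assumes the execution-mode tag arrives as a key in a request tag dictionary; the function names are hypothetical.

```python
BATCH_DISCOUNT = 0.5  # the 50% reduction cited for provider batch APIs


def execution_mode(tags: dict) -> str:
    """Treat a request as batch only when it is explicitly tagged as such."""
    return "batch" if tags.get("execution-mode") == "batch" else "realtime"


def estimated_cost_usd(realtime_cost_usd: float, tags: dict) -> float:
    """Estimate cost for the request given its execution-mode tag."""
    if execution_mode(tags) == "batch":
        return realtime_cost_usd * (1 - BATCH_DISCOUNT)
    return realtime_cost_usd
```

Untagged requests default to real-time pricing here, so a missing tag never delays an interactive workload.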
Cost Alerting and Dashboards provide the operational visibility layer. Real-time cost events from all requests are streamed to the Cost Management Service, which aggregates by team/product/environment dimensions and evaluates against configured budget thresholds. Alerts are delivered via PagerDuty (emergency), Slack (warning), and email (daily digest). The FinOps dashboard (Grafana or Superset) provides spend-by-team, spend-by-model, cache savings, and projection-to-period-end views.
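One simple way to produce the projection-to-period-end view is a linear run-rate extrapolation. This is an assumption for illustration; a production dashboard might prefer a model that accounts for weekday seasonality.

```python
import calendar
from datetime import date


def project_month_end_usd(spend_to_date_usd: float, today: date) -> float:
    """Extrapolate month-to-date spend linearly to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_run_rate = spend_to_date_usd / today.day
    return daily_run_rate * days_in_month
```

For instance, $1,500 spent by 15 June projects to $3,000 for the 30-day month, which can then be compared with the team's monthly budget.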
5. Architecture Diagram
flowchart TD
subgraph Enforcement["Request Enforcement"]
A[Incoming Request]
B[Token Limit Check]
C{Budget Tracker}
end
subgraph Routing["Cost-Aware Routing"]
D[Model Tier Router]
E[Prompt Cache Check]
end
subgraph Models["Model Endpoints"]
F[Efficiency Model]
G[Frontier Model]
H[Batch API]
end
A --> B
B --> C
C -->|within budget| D
C -->|over budget| I[Block + Alert]
D --> E
E -->|cache miss| F
E -->|complex task| G
E -->|batch tag| H
F --> J[(Token Counter)]
G --> J
H --> J
J --> K[FinOps Dashboard]
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#f0fdf4,stroke:#22c55e
style C fill:#f3e8ff,stroke:#a855f7
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#fef9c3,stroke:#eab308
style F fill:#d1fae5,stroke:#10b981
style G fill:#dbeafe,stroke:#3b82f6
style H fill:#dbeafe,stroke:#3b82f6
style I fill:#fee2e2,stroke:#ef4444
style J fill:#fef9c3,stroke:#eab308
style K fill:#d1fae5,stroke:#10b981
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
| Input Token Limit Enforcer | Middleware | Validate max_tokens parameter; enforce input token ceiling per use case | Custom gateway middleware; token counting library (tiktoken) | High |
| Consumer Budget Tracker | Service | Maintain real-time token consumption counters per consumer/team/period | Redis sorted sets (ZADD/ZRANGEBYSCORE for time windows) | Critical |
| Budget Threshold Evaluator | Service | Evaluate thresholds; trigger warnings and blocks | Custom service backed by Redis | Critical |
| Cost Model Store | Service | Maintain per-model pricing data; update from provider pricing APIs | Redis hash or PostgreSQL table | High |
| Model Tier Router | Service | Select cheapest adequate model for request (see PLT003) | LiteLLM cost-based routing, custom rule engine | Critical |
| Provider Prompt Cache Manager | Service | Structure prompts for provider-side KV cache; track cache hit rates | Custom, provider SDK integration | High |
| Semantic Cache (Platform-Side) | Service | Cache full responses for near-identical prompts (see PLT006) | GPTCache, Redis + vector index | High |
| Batch Route Classifier | Service | Classify requests as batch-eligible based on execution mode tag | Custom rule-based classifier | Medium |
| Cost Event Publisher | Service | Emit per-request cost events for aggregation | Kafka producer, CloudWatch PutMetricData | Critical |
| Alert Engine | Service | Evaluate budget thresholds; dispatch alerts | PagerDuty, Slack webhook, email (SES/Sendgrid) | High |
| Cost Dashboard | Service | Real-time and historical spend visualisation | Grafana, Apache Superset, PowerBI | Medium |
| Chargeback Report Generator | Service | Monthly per-team cost attribution reports | Custom SQL on cost events, Metabase | Medium |
7. Data Flow
Primary Flow — Request with Budget Enforcement
| Step | Actor | Action | Output |
| 1 | Consumer Application | Submit request with max_tokens: 2048, use-case: summarisation, team: marketing | Request at gateway cost control stage |
| 2 | Input Token Limit Enforcer | Count input tokens using tiktoken; compare to use-case ceiling (summarisation: 8192 input tokens) | Tokens within limit; proceed |
| 3 | Consumer Budget Tracker | Query Redis for marketing team's tokens used this month vs. monthly budget | Remaining: 2.4M tokens (80% used → warning threshold crossed) |
| 4 | Budget Threshold Evaluator | 80% threshold crossed; emit warning alert to team-marketing Slack channel | Warning alert dispatched; request continues |
| 5 | Cost Model Lookup | Retrieve cost model for routing: Claude Haiku ($0.0001/1K input, $0.0002/1K output) vs. Claude Sonnet ($0.003/1K input, $0.015/1K output); illustrative prices | Cost delta available for routing decision |
| 6 | Model Tier Router | Complexity LOW; select Claude Haiku (cost-based); circuit breaker CLOSED | Selected: Claude Haiku |
| 7 | Provider Prompt Cache Check | Check if prompt prefix is in Anthropic KV cache; cache HIT | Provider cache hit; ~90% of prompt tokens served from cache (treated as uncharged in this simplified example; providers typically bill cache reads at a reduced rate) |
| 8 | Upstream Call | Forward to Claude Haiku; cache hit reduces effective input tokens | Response returned; actual billed input tokens: ~200 (uncached suffix) |
| 9 | Cost Event Publish | Emit cost event: {team: marketing, model: claude-haiku, input_tokens: 200 (cache hit), output_tokens: 512, cost_usd: 0.000122} | Cost event in stream |
| 10 | Budget Counter Update | Update Redis counter for marketing team: +712 effective tokens | Counter updated atomically |
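The figures in steps 9 and 10 follow directly from the illustrative per-1K prices in step 5. The short calculation below reproduces them, treating cached prefix tokens as uncharged, as the example flow does; the function and variable names are hypothetical.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Cost of one request from billed token counts and per-1K prices."""
    return (input_tokens / 1000) * usd_per_1k_input \
        + (output_tokens / 1000) * usd_per_1k_output


billed_input_tokens = 200   # uncached suffix after the provider cache hit
output_tokens = 512

# 200/1000 * 0.0001 + 512/1000 * 0.0002 = 0.00002 + 0.0001024 = 0.0001224
cost = request_cost_usd(billed_input_tokens, output_tokens, 0.0001, 0.0002)

cost_event = {
    "team": "marketing",
    "model": "claude-haiku",
    "input_tokens": billed_input_tokens,
    "output_tokens": output_tokens,
    "cost_usd": round(cost, 6),
}

# Step 10: the budget counter grows by the effective (billed) token count.
counter_increment = billed_input_tokens + output_tokens
```

Rounding to six decimal places gives the 0.000122 shown in the cost event, and the counter increment is the 712 effective tokens in step 10.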
Error Flow
| Error Condition | Detection | Response |
| Input token count exceeds use-case ceiling | Token counter at step 2 | 413 Request Entity Too Large with token count details |
| Team budget at hard 100% limit | Budget tracker at step 3 | 429 with budget-exhausted code; retry after the budget period resets (e.g., 24h) or after manual approval |
| Cost model stale/unavailable | Cost model service timeout | Log warning; proceed with routing using last-known-good cost model |
| Batch API unavailable for batch-tagged request | Batch route check failure | Fall back to real-time API; log cost increase for later review |
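The last-known-good behaviour for a stale or unavailable cost model could look like the sketch below. The CostModelStore class and its fetch callable are hypothetical; fetch stands in for whatever call retrieves current pricing.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger("cost_model")


class CostModelStore:
    """Serve fresh pricing when possible, else the last successful copy."""

    def __init__(self, fetch: Callable[[], dict]):
        self._fetch = fetch
        self._last_good: Optional[dict] = None

    def get(self) -> Tuple[dict, bool]:
        """Return (cost_model, is_stale)."""
        try:
            self._last_good = self._fetch()
            return self._last_good, False
        except Exception as exc:  # e.g. a timeout from the pricing service
            if self._last_good is None:
                raise  # nothing to fall back to on a cold start
            logger.warning("cost model refresh failed, using last-known-good: %s", exc)
            return self._last_good, True
```

Returning the staleness flag lets the router proceed while the caller decides whether to emit a freshness alert.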
8. Security Considerations
- Budget bypass attempts (manually setting max_tokens above the enforced ceiling) are rejected at the gateway; the ceiling is a platform-enforced control, not a suggestion
- Consumer token counters are stored in a dedicated Redis instance with no direct consumer write access; only the gateway cost accounting service can increment counters
- Cost event stream is read-only for consumers; teams can view their own consumption data but not other teams'
- Chargeback reports are distributed per-team; cross-team visibility requires FinOps-level access
OWASP LLM Top 10 Controls
| OWASP LLM Risk | Cost Control Layer |
| LLM04 Model DoS | Token budget per consumer prevents any single consumer exhausting platform capacity; this is both a cost and availability control |
| LLM08 Excessive Agency | Agentic loops are bounded by per-session token budgets; runaway agent loops are expensive before they are harmful |
9. Governance Considerations
Budget Governance
- Monthly token budgets per team are approved by the AI Governance Board and FinOps team jointly; requests for budget increases require business case documentation
- Emergency budget extensions (overriding the hard ceiling) require explicit sign-off from the team's engineering manager and FinOps; all extensions are logged
Chargeback Model
- Costs attributed via the cost event stream are the official basis for internal chargeback; teams are responsible for their attributed costs
- The cost model is updated quarterly as provider pricing changes; teams are notified 30 days in advance of pricing model changes
Governance Artefacts
| Artefact | Owner | Cadence | Location |
| Team budget schedule | FinOps + AI Governance Board | Annual (reviewed quarterly) | Platform configuration + finance system |
| Budget extension approvals | Engineering Manager + FinOps | Per-event | GRC system |
| Monthly chargeback report | Platform Team | Monthly | Finance system + team dashboards |
| Cost model pricing updates | Platform Team | Quarterly | Platform configuration |
| Cost optimisation roadmap | FinOps + Platform Team | Quarterly | Internal wiki |
10. Operational Considerations
Monitoring
| Signal | Source | Alert Threshold | Owner |
| Team budget at 80% | Budget tracker | Event-driven | FinOps + Team Lead |
| Team budget at 100% | Budget tracker | Event-driven (high urgency) | FinOps + Team Lead + Engineering Manager |
| Daily spend > 1.5× previous day average | Cost event aggregation | Daily window | FinOps On-Call |
| Abnormal token counts per request (P99 spike) | Request metrics | >200% of rolling P99 baseline | Platform On-Call |
| Cache hit rate drop | Cache metrics | <10% sustained 1 hour | Platform Team |
| Budget tracker service unavailable | Health check | Immediate | Platform On-Call |
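The spend and P99 signals above share the same shape: compare a current value with a multiple of a recent baseline. A minimal sketch follows, assuming the baseline is the mean of a supplied history window; the window length and the exceeds_baseline name are choices made for this example.

```python
from statistics import mean
from typing import Sequence


def exceeds_baseline(current: float, history: Sequence[float], factor: float) -> bool:
    """True when current is strictly above factor times the mean of history."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return current > factor * mean(history)


# Daily spend check: alert when today's spend is above 1.5x the baseline.
spend_alert = exceeds_baseline(160.0, [100.0, 100.0, 100.0], factor=1.5)

# P99 token check: alert when the current P99 is above 200% of the baseline.
p99_alert = exceeds_baseline(9000, [4000, 4000], factor=2.0)
```

Both checks fire in this example; a value exactly at the threshold does not, because the comparison is strict.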
SLOs
| SLO | Target | Window |
| Cost event ingestion latency | <5 seconds from request completion | Rolling 7 days |
| Budget counter accuracy | <1% variance from actual provider charges | Monthly reconciliation |
| Alert delivery latency | <60 seconds from threshold breach | Per-event |
| Dashboard data freshness | <5 minutes lag | Rolling 7 days |
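The budget counter accuracy SLO can be checked during monthly reconciliation by comparing tracked cost with the provider invoice. The sketch below assumes both totals are available in the same currency; the function names are hypothetical.

```python
def reconciliation_variance(tracked_usd: float, invoiced_usd: float) -> float:
    """Relative difference between tracked cost and the provider invoice."""
    if invoiced_usd <= 0:
        raise ValueError("invoiced_usd must be positive")
    return abs(tracked_usd - invoiced_usd) / invoiced_usd


def within_accuracy_slo(tracked_usd: float, invoiced_usd: float,
                        tolerance: float = 0.01) -> bool:
    """True when variance is under the 1% target from the SLO table."""
    return reconciliation_variance(tracked_usd, invoiced_usd) < tolerance
```

A tracked total of $9,950 against a $10,000 invoice is a 0.5% variance and passes; $9,800 is 2% and would trigger a reconciliation report.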
Disaster Recovery
| Component | RPO | RTO | Strategy |
| Budget counter (Redis) | 5 min | 5 min | Redis Sentinel; brief window of over-limit requests acceptable |
| Cost event stream (Kafka) | <1 min | 10 min | Cross-region replication |
| Dashboard (read-only) | 1 hour | 30 min | Acceptable staleness for non-critical service |
| Chargeback report data | 0 | 24 hours | Recomputable from cost event archive |
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Weight |
| Redis for budget counters | Minimal memory footprint; high throughput needed | Very Low |
| Cost event stream (Kafka/Kinesis) | Volume proportional to request rate | Low |
| Dashboard hosting | Read-only service; moderate cost | Low |
| LLM API costs (controlled) | Primary cost being managed; all controls aimed here | Dominant |
Indicative Cost Range
| Scale | Monthly Cost Control Infra | LLM Savings from Controls |
| Small (<1M tokens/day) | $100–$300 | $500–$2,000 from tier routing + caching |
| Medium (1–50M tokens/day) | $500–$2,000 | $5,000–$30,000 from combined controls |
| Large (>50M tokens/day) | $2,000–$8,000 | $30,000–$150,000+ from combined controls |
12. Trade-Off Analysis
Budget Enforcement Options
| Option | Description | Pros | Cons | Best For |
| Hard Stop at 100% | Block all requests at budget limit | Absolute cost certainty | Operational incidents if budget misconfigured | Finance-controlled AI programmes; strict cost accountability |
| Soft Throttle at 100% | Allow requests at reduced rate (e.g., 10% of normal) after limit | Degraded not dead | Still accumulates cost above budget | Product-focused teams; uptime priority |
| Alert Only | No enforcement; only alerts at thresholds | No operational impact | No cost control; only cost visibility | Initial rollout; trust-based environment |
Caching Strategy Options
| Option | Description | Pros | Cons | Best For |
| Provider-Side Cache Only | Use provider KV cache for prefix caching | Zero additional infrastructure; reduces input tokens | Only prefix-level caching; no cross-request caching | Workloads with long stable system prompts |
| Semantic Cache Only | Platform-level near-match response caching | Cross-request caching; higher hit rate potential | Privacy considerations; false positive risk | FAQ, classification, search augmentation |
| Combined Provider + Semantic | Both layers active | Maximum cost reduction | Complexity; requires careful TTL management | High-volume mixed workloads |
Architectural Tensions
| Tension | Option A | Option B | Resolution |
| Strict per-request limits vs. flexible prompting | Hard input token ceiling | Soft guidance | Configurable per use-case class; creative use cases have higher limits |
| Team autonomy vs. cost governance | Teams set own budgets | Central FinOps sets all budgets | FinOps sets envelope; teams allocate within envelope by product/feature |
| Cache freshness vs. cost savings | Low TTL (fresh) | High TTL (cheap) | TTL per corpus type; static knowledge bases: long TTL; dynamic context: short/no TTL |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
| Budget tracker (Redis) failure | Medium | Medium — budget enforcement suspended | Redis health check fail | Fail-safe: revert to rate limiting only; alert FinOps |
| Cost model stale (pricing outdated) | Medium | Low — routing decisions suboptimal | Automatic freshness check alert | Manual pricing update; automated via provider pricing API |
| Token counter drift (Redis vs. actual spend) | Low | Medium — budget accountability gap | Monthly reconciliation vs. provider invoice | Reconciliation report triggers manual correction |
| Alert fatigue (too many budget warnings) | High | Low-Medium — alerts ignored | Alert volume metrics | Tune thresholds; consolidate daily digest vs. real-time alerts |
| Batch API failure causing real-time fallback | Medium | Medium — unexpected cost increase | Batch failure rate spike | Alert FinOps; teams approve real-time cost increase or pause workload |
14. Regulatory Considerations
APRA CPS 230 (Operational Risk)
- Cost control mechanisms are operational risk controls for the AI platform; the budget enforcement system must itself be resilient
- AI cost overruns that materially affect the organisation's operational budget may constitute an operational risk event reportable under CPS 230
Financial Reporting
- Internal cost attribution data must be accurate enough to support financial reporting; the cost event reconciliation process ensures chargeback data matches actual provider invoices
15. Reference Implementations
AWS
| Component | AWS Service |
| Budget counters | ElastiCache Redis |
| Cost events | Kinesis Data Streams → S3 → Athena |
| Alerts | CloudWatch Alarms + SNS → PagerDuty / Slack |
| Dashboard | CloudWatch custom dashboards + Grafana |
| Chargeback reports | Athena queries + S3 + QuickSight |
| Provider pricing API | Bedrock pricing API (where available) |
Azure
| Component | Azure Service |
| Budget counters | Azure Cache for Redis |
| Cost events | Event Hubs → Azure Data Lake Gen2 |
| Alerts | Azure Monitor Alerts + Action Groups |
| Dashboard | Azure Monitor Workbooks + Grafana |
On-Premises
| Component | Technology |
| Budget counters | Redis Enterprise |
| Cost events | Apache Kafka → ClickHouse |
| Dashboard | Grafana + ClickHouse data source |
| Alerts | Alertmanager → PagerDuty |
16. Related Patterns
| Pattern ID | Name | Relationship |
| EAAPL-PLT002 | AI API Gateway | Host — budget enforcement implemented within gateway pipeline |
| EAAPL-PLT003 | Model Routing | Component — cost-based routing is a cost control mechanism |
| EAAPL-PLT006 | LLM Caching Layer | Complementary — caching reduces effective token consumption |
| EAAPL-PLT001 | Enterprise AI Platform | Parent — cost management is a shared service |
| EAAPL-INT005 | Batch AI Processing | Complementary — batch routing reduces cost for async workloads |
17. Maturity Assessment
Overall Maturity: Proven
Token budget enforcement and model tier routing are production-proven at scale. Provider-side prompt caching is a relatively recent feature (2024) that is proving high-value. The combined pattern has strong ROI evidence across multiple enterprise deployments.
Scoring Matrix
| Dimension | Score (1–5) | Rationale |
| Pattern Completeness | 5 | All sections documented |
| Implementation Evidence | 4 | Core controls proven; provider cache integration emerging |
| ROI Evidence | 5 | Consistent 40–60% spend reduction documented |
| Tooling Maturity | 4 | Redis counters and dashboards mature; provider pricing APIs variable |
| Operational Complexity | Medium | Budget configuration requires FinOps discipline; manageable |
18. Revision History
| Version | Date | Author | Changes |
| 1.0 | 2024-04-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-11-10 | EAAPL Working Group | Provider-side prompt caching section added; batch API cost models updated |
| 1.2 | 2025-06-12 | EAAPL Working Group | Cost range data updated; tiered budget enforcement (soft/hard/ceiling) documented |