[EAAPL-PLT004] LLM Cost Control

Category: Platform Engineering
Sub-category: FinOps / Cost Management
Version: 1.2
Maturity: Proven
Tags: finops, cost-management, token-budget, prompt-caching, model-tier-routing, cost-alerting, chargeback, spending-dashboards
Regulatory Relevance: APRA CPS 230 (Operational Risk: cost controls), ISO 42001

1. Executive Summary

LLM inference costs have a property that few earlier enterprise technologies share: they scale with usage in ways that stay invisible until the cloud bill arrives. A single poorly scoped prompt with an unbounded context window can consume more compute in one request than an hour of traditional API calls. Without systematic controls, one runaway AI feature or misconfigured pipeline can generate tens of thousands of dollars in unexpected spend within hours.

This pattern establishes a comprehensive cost control framework covering the full lifecycle of an LLM request: upfront budget enforcement (token limits per request, per consumer, per time period), intelligent routing to cost-appropriate model tiers, prompt caching to eliminate redundant computation, batch versus real-time optimisation, and real-time spend alerting with dashboard visibility for FinOps and engineering leadership. Organisations that implement this pattern systematically report 40–60% reduction in LLM spend compared to unmanaged baseline while maintaining feature quality, enabling AI investment to scale with genuine business value rather than inefficiency.

2. Problem Statement

Business Problem

LLM API costs appear as undifferentiated cloud charges with no attribution to products, teams, or decisions. When spend spikes, root cause analysis takes days. Budget sign-off for AI initiatives is difficult because cost projections are unreliable. AI spend is growing faster than business value in organisations without controls, triggering executive concern about AI investment ROI.

Technical Problem

Individual LLM requests have highly variable token consumption based on prompt construction, context window usage, and response length. Without per-request token limits, a buggy prompt template can send 100K-token requests when 2K was intended. Without model tier routing, all requests use frontier model pricing. Without caching, identical or near-identical prompts are computed fresh on every call. Without budget enforcement, a single batch job can exhaust a monthly budget.

Symptoms

Monthly AI cloud bills with variance >50% month-to-month without corresponding business activity change
No ability to attribute AI spend to individual products, teams, or features
Alerts for AI cost anomalies discovered retrospectively when the bill arrives
All AI traffic routed to the most expensive model regardless of task requirements
Identical FAQ-style prompts computed fresh on every call with no caching

Cost of Inaction

AI spend growing to unsustainable levels, threatening AI investment programme shutdown
Executive loss of confidence in AI ROI due to uncontrolled cost growth
Inability to negotiate volume discounts with providers without consolidated spend data
Cross-team cost externalities: one team's runaway workload degrades token budget for all teams

3. Context

When to Apply

Organisation's monthly AI API spend exceeds $5,000 or is projected to exceed this within 3 months
Multiple teams or use cases share AI infrastructure without cost isolation
FinOps team requires per-team or per-product cost attribution
AI cost efficiency is an explicit KPI for the AI programme

When NOT to Apply

Single small-scale proof of concept: overhead of full cost control not warranted
Single team with a single predictable, fixed-cost workload: direct budget monitoring sufficient
Air-gapped self-hosted deployments with no per-token cost: infrastructure cost management applies instead

Prerequisites

AI API Gateway (PLT002) as enforcement point for budget controls
Cost allocation taxonomy agreed between FinOps and engineering (team/product/environment dimensions)
Observability stack for real-time cost event ingestion
Stakeholder agreement on what constitutes a budget threshold and escalation path

Industry Applicability

Industry	Applicability	Key Cost Driver
Technology / SaaS	Very High	AI features at scale; customer-facing token consumption
Retail / E-commerce	Very High	Product descriptions, search, personalisation at catalog scale
Financial Services	High	Research automation, document processing, customer service
Healthcare	High	Clinical documentation, patient communication at volume
Media / Content	Very High	Content generation, summarisation, moderation at scale
Government	Medium	Document processing; typically lower volume

4. Architecture Overview

The LLM Cost Control pattern operates across three time horizons: per-request controls that enforce hard limits on individual calls, per-period budget controls that enforce cumulative spending limits over time windows (daily/weekly/monthly), and strategic optimisations that systematically reduce the per-token cost of all traffic.

Per-Request Token Budget Enforcement is the first line of defence. Every request entering the AI API Gateway is evaluated for its estimated token consumption. The max_tokens parameter is enforced as a hard ceiling; requests without an explicit max_tokens receive a platform default (configurable per model tier and use case). Input token limits per request prevent context window abuse: a request exceeding the configured input token limit for its use case classification is rejected with a 413 response and a recommendation to use the batch API instead. This single control eliminates the most common cause of surprise cost spikes.
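The per-request check described above can be sketched as follows. This is a minimal illustration, not the gateway's actual middleware: the limit tables, the `DEFAULT_MAX_TOKENS` value, the `estimate_tokens` heuristic (a crude stand-in for a real tokenizer such as tiktoken), and the 400 status for an over-ceiling `max_tokens` are all assumptions. Only the 413 response for oversized input comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-use-case limits. The 8192 input ceiling for summarisation
# matches the data flow example; every other number is an assumption.
INPUT_TOKEN_CEILING = {"summarisation": 8192, "chat": 4096}
MAX_TOKENS_CEILING = {"summarisation": 2048, "chat": 1024}
DEFAULT_MAX_TOKENS = 512  # assumed platform default when max_tokens is omitted


@dataclass
class Decision:
    status: int      # HTTP status the gateway would return
    max_tokens: int  # effective output ceiling (0 when rejected)
    reason: str


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def enforce_limits(prompt: str, use_case: str,
                   max_tokens: Optional[int] = None) -> Decision:
    input_tokens = estimate_tokens(prompt)
    if input_tokens > INPUT_TOKEN_CEILING[use_case]:
        # Oversized input: reject with 413 and point the caller at batch.
        return Decision(413, 0, f"{input_tokens} input tokens exceeds ceiling; "
                                "consider the batch API")
    ceiling = MAX_TOKENS_CEILING[use_case]
    if max_tokens is None:
        # No explicit value: apply the platform default, capped by the ceiling.
        return Decision(200, min(DEFAULT_MAX_TOKENS, ceiling), "default applied")
    if max_tokens > ceiling:
        # The ceiling is platform-enforced; the status code is an assumption.
        return Decision(400, 0, f"max_tokens {max_tokens} exceeds ceiling {ceiling}")
    return Decision(200, max_tokens, "ok")
```

In a real deployment the token estimate would come from the tokenizer that matches the target model, since a character heuristic can be off by a wide margin for code or non-English text.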

Consumer and Team Budget Tracking maintains real-time token consumption counters per consumer, team, project, and environment. These counters are maintained in a Redis data structure (sorted sets for time-windowed aggregation) and updated atomically on every request completion. Budget thresholds are configured at multiple levels: a soft warning threshold (80% of period budget consumed → alert to team lead), a hard throttle threshold (100% → requests rate-limited to a configured percentage), and an emergency ceiling (110% → requests blocked entirely until human approval to extend). The tiered response prevents hard stops from creating operational incidents while still enforcing accountability.
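The tiered response can be expressed as a small pure function. The sketch below uses the 80% / 100% / 110% thresholds from the paragraph above; the `BudgetAction` names and the function signature are illustrative assumptions.

```python
from enum import Enum


class BudgetAction(Enum):
    ALLOW = "allow"
    WARN = "warn"          # soft threshold: alert the team lead, keep serving
    THROTTLE = "throttle"  # hard threshold: rate-limit to a configured share
    BLOCK = "block"        # emergency ceiling: stop until a human approves


def evaluate_budget(tokens_used: int, period_budget: int,
                    warn_at: float = 0.80, throttle_at: float = 1.00,
                    block_at: float = 1.10) -> BudgetAction:
    """Map cumulative consumption onto the tiered enforcement response."""
    if period_budget <= 0:
        raise ValueError("period_budget must be positive")
    ratio = tokens_used / period_budget
    # Check the most severe tier first so each ratio maps to exactly one action.
    if ratio >= block_at:
        return BudgetAction.BLOCK
    if ratio >= throttle_at:
        return BudgetAction.THROTTLE
    if ratio >= warn_at:
        return BudgetAction.WARN
    return BudgetAction.ALLOW
```

Keeping the evaluation free of side effects makes it easy to unit-test threshold changes before they are rolled out to the gateway.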

Model Tier Routing (see EAAPL-PLT003 for full treatment) is the largest lever for strategic cost reduction. The cost control layer maintains a cost model for each available model endpoint (cost per 1K input tokens, cost per 1K output tokens) and uses this in conjunction with the routing strategy to route each request to the cheapest model meeting the quality requirement. The cost model is updated automatically from provider pricing APIs where available. A/B routing experiments track cost efficiency alongside quality to inform routing policy updates.
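As a rough illustration of "cheapest model meeting the quality requirement", the sketch below filters a cost model by an assumed integer quality tier and picks the lowest estimated cost. The `ModelCost` type, the tier scale, and the model names are hypothetical; the per-1K rates are the illustrative figures used in the data flow example, not authoritative pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCost:
    name: str
    quality_tier: int     # higher means more capable (assumed scale)
    input_per_1k: float   # USD per 1K input tokens
    output_per_1k: float  # USD per 1K output tokens

    def estimate(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.input_per_1k
                + output_tokens * self.output_per_1k) / 1000


def cheapest_adequate(models, min_tier: int,
                      input_tokens: int, output_tokens: int) -> ModelCost:
    """Return the lowest-cost model whose tier meets the requirement."""
    candidates = [m for m in models if m.quality_tier >= min_tier]
    if not candidates:
        raise LookupError(f"no model satisfies quality tier {min_tier}")
    return min(candidates, key=lambda m: m.estimate(input_tokens, output_tokens))


# Hypothetical cost model with two tiers.
COST_MODEL = [
    ModelCost("efficiency-model", 1, 0.0001, 0.0002),
    ModelCost("frontier-model", 2, 0.003, 0.015),
]
```

For example, `cheapest_adequate(COST_MODEL, 1, 2000, 512)` selects the efficiency model, while raising `min_tier` to 2 forces the frontier model.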

Prompt Caching operates at two levels. Provider-side prompt caching (supported by Anthropic Claude and OpenAI) caches the KV computation for prompt prefixes at the model provider level; this requires structuring prompts with stable system prompt prefixes at the beginning of the context. Platform-side semantic caching (PLT006) caches full responses for near-identical prompts at the gateway level. Both mechanisms reduce effective token consumption; the platform cost model tracks cache hit rates and attributable savings separately so the value of caching investment is visible.
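Provider-side prefix caching only pays off when the leading portion of the prompt is byte-for-byte stable across requests. The sketch below shows one way a gateway might assemble prompts so that stable material comes first, and compute a fingerprint of that prefix for its own hit-rate tracking. The `build_prompt` helper, the sorting of reference documents, and the fingerprint are assumptions for illustration; they are not part of any provider API.

```python
import hashlib


def build_prompt(system_prompt: str, reference_docs: list[str],
                 user_query: str) -> tuple[str, str]:
    """Return (full_prompt, prefix_fingerprint) with stable parts first."""
    # Sort the reference documents so the same set always yields the same
    # prefix, regardless of the order in which the caller supplied them.
    stable_prefix = "\n\n".join([system_prompt, *sorted(reference_docs)])
    # The variable part (the user's query) goes last so it never disturbs
    # the cacheable prefix.
    full_prompt = f"{stable_prefix}\n\n{user_query}"
    fingerprint = hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()[:16]
    return full_prompt, fingerprint
```

Two requests that share a system prompt and document set produce the same fingerprint even when their queries differ, which gives the cost model a simple key for grouping cache hit statistics.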

Batch vs. Real-Time Optimisation provides a structural cost reduction for non-interactive workloads. The cost control layer routes requests tagged as execution-mode: batch through provider batch APIs (OpenAI Batch API, Anthropic Message Batches), which offer a 50% token cost reduction in exchange for asynchronous completion within a window of up to 24 hours. Product teams are guided to tag their use cases appropriately during onboarding; the developer portal surfaces the cost differential to encourage correct classification.
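A minimal sketch of tag-based batch classification and the resulting cost differential follows. The `execution-mode` tag and the 50% discount come from the text; the function names and the dictionary-shaped tags are assumptions.

```python
BATCH_DISCOUNT = 0.50  # discount cited in the text for provider batch APIs


def choose_route(tags: dict) -> str:
    """Route to the batch API only when the request is explicitly tagged."""
    return "batch" if tags.get("execution-mode") == "batch" else "realtime"


def estimated_cost(realtime_cost_usd: float, route: str) -> float:
    """Apply the batch discount to a cost computed at real-time rates."""
    if route == "batch":
        return realtime_cost_usd * (1 - BATCH_DISCOUNT)
    return realtime_cost_usd
```

Defaulting untagged requests to real-time is the conservative choice: a misclassified interactive request would otherwise wait hours for a response.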

Cost Alerting and Dashboards provide the operational visibility layer. Real-time cost events from all requests are streamed to the Cost Management Service, which aggregates by team/product/environment dimensions and evaluates against configured budget thresholds. Alerts are delivered via PagerDuty (emergency), Slack (warning), and email (daily digest). The FinOps dashboard (Grafana or Superset) provides spend-by-team, spend-by-model, cache savings, and projection-to-period-end views.
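The aggregation and projection-to-period-end views can be approximated with very little code. The sketch below assumes cost events are dictionaries with `team` and `cost_usd` keys (as in the data flow example) and uses a simple linear run-rate projection; a production Cost Management Service would likely use something more robust to weekday and seasonal effects.

```python
from collections import defaultdict


def spend_by_team(events) -> dict:
    """Sum cost_usd per team across a stream of cost events."""
    totals = defaultdict(float)
    for event in events:
        totals[event["team"]] += event["cost_usd"]
    return dict(totals)


def project_period_end(spend_to_date: float, days_elapsed: int,
                       days_in_period: int) -> float:
    """Linear projection: average daily spend so far times period length."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return spend_to_date / days_elapsed * days_in_period
```

For instance, a team that has spent $300 in the first 10 days of a 30-day period projects to $900 at period end under this model.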

5. Architecture Diagram

flowchart TD
    subgraph Enforcement["Request Enforcement"]
        A[Incoming Request]
        B[Token Limit Check]
        C{Budget Tracker}
    end
    subgraph Routing["Cost-Aware Routing"]
        D[Model Tier Router]
        E[Prompt Cache Check]
    end
    subgraph Models["Model Endpoints"]
        F[Efficiency Model]
        G[Frontier Model]
        H[Batch API]
    end
    A --> B
    B --> C
    C -->|within budget| D
    C -->|over budget| I[Block + Alert]
    D --> E
    E -->|cache miss| F
    E -->|complex task| G
    E -->|batch tag| H
    F --> J[(Token Counter)]
    G --> J
    H --> J
    J --> K[FinOps Dashboard]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#d1fae5,stroke:#10b981
    style G fill:#dbeafe,stroke:#3b82f6
    style H fill:#dbeafe,stroke:#3b82f6
    style I fill:#fee2e2,stroke:#ef4444
    style J fill:#fef9c3,stroke:#eab308
    style K fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Input Token Limit Enforcer	Middleware	Validate max_tokens parameter; enforce input token ceiling per use case	Custom gateway middleware; token counting library (tiktoken)	High
Consumer Budget Tracker	Service	Maintain real-time token consumption counters per consumer/team/period	Redis sorted sets (ZADD/ZRANGEBYSCORE for time windows)	Critical
Budget Threshold Evaluator	Service	Evaluate thresholds; trigger warnings and blocks	Custom service backed by Redis	Critical
Cost Model Store	Service	Maintain per-model pricing data; update from provider pricing APIs	Redis hash or PostgreSQL table	High
Model Tier Router	Service	Select cheapest adequate model for request (see PLT003)	LiteLLM cost-based routing, custom rule engine	Critical
Provider Prompt Cache Manager	Service	Structure prompts for provider-side KV cache; track cache hit rates	Custom, provider SDK integration	High
Semantic Cache (Platform-Side)	Service	Cache full responses for near-identical prompts (see PLT006)	GPTCache, Redis + vector index	High
Batch Route Classifier	Service	Classify requests as batch-eligible based on execution mode tag	Custom rule-based classifier	Medium
Cost Event Publisher	Service	Emit per-request cost events for aggregation	Kafka producer, CloudWatch PutMetricData	Critical
Alert Engine	Service	Evaluate budget thresholds; dispatch alerts	PagerDuty, Slack webhook, email (SES/Sendgrid)	High
Cost Dashboard	Service	Real-time and historical spend visualisation	Grafana, Apache Superset, PowerBI	Medium
Chargeback Report Generator	Service	Monthly per-team cost attribution reports	Custom SQL on cost events, Metabase	Medium
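The Consumer Budget Tracker row names Redis sorted sets for time-windowed aggregation. The sketch below mimics that access pattern in memory so the idea can be tested without a Redis instance: entries are kept ordered by timestamp (the role of the sorted-set score), and a range query sums the tokens inside a window. The class and method names are hypothetical, and this stand-in provides none of the atomicity or persistence a real Redis deployment would.

```python
import bisect


class WindowedTokenCounter:
    """In-memory stand-in for a sorted set scored by request timestamp."""

    def __init__(self) -> None:
        self._timestamps: list[float] = []
        self._tokens: list[int] = []

    def record(self, ts: float, tokens: int) -> None:
        # Insert in timestamp order, analogous to adding a scored member.
        i = bisect.bisect_right(self._timestamps, ts)
        self._timestamps.insert(i, ts)
        self._tokens.insert(i, tokens)

    def used_between(self, start: float, end: float) -> int:
        # Inclusive range query over the score, analogous to a range-by-score
        # lookup followed by a sum on the client side.
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_right(self._timestamps, end)
        return sum(self._tokens[lo:hi])
```

A daily, weekly, or monthly budget check then becomes a single `used_between(window_start, now)` call against the same data.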

7. Data Flow

Primary Flow — Request with Budget Enforcement

Step	Actor	Action	Output
1	Consumer Application	Submit request with `max_tokens: 2048`, `use-case: summarisation`, `team: marketing`	Request at gateway cost control stage
2	Input Token Limit Enforcer	Count input tokens using tiktoken; compare to use-case ceiling (summarisation: 8192 input tokens)	Tokens within limit; proceed
3	Consumer Budget Tracker	Query Redis for marketing team's tokens used this month vs. monthly budget	Remaining: 2.4M tokens (80% used → warning threshold crossed)
4	Budget Threshold Evaluator	80% threshold crossed; emit warning alert to team-marketing Slack channel	Warning alert dispatched; request continues
5	Cost Model Lookup	Retrieve cost model for routing (illustrative rates): Claude Haiku ($0.0001/1K input, $0.0002/1K output) vs. Claude Sonnet ($0.003/1K input, $0.015/1K output)	Cost delta available for routing decision
6	Model Tier Router	Complexity LOW; select Claude Haiku (cost-based); circuit breaker CLOSED	Selected: Claude Haiku
7	Provider Prompt Cache Check	Check if prompt prefix is in Anthropic KV cache; cache HIT	Provider cache hit; 90% of prompt tokens not charged
8	Upstream Call	Forward to Claude Haiku; cache hit reduces effective input tokens	Response returned; actual billed tokens: ~200 (uncached suffix)
9	Cost Event Publish	Emit cost event: {team: marketing, model: claude-haiku, input_tokens: 200 (cache hit), output_tokens: 512, cost_usd: 0.000122}	Cost event in stream
10	Budget Counter Update	Update Redis counter for marketing team: +712 effective tokens	Counter updated atomically
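The figures in steps 9 and 10 can be reproduced with a short calculation. The sketch below uses the per-1K rates quoted in step 5 for the selected model and the token counts from step 9; the `build_cost_event` helper and its field layout are assumptions modelled on the event shown in the table.

```python
# Per-1K-token rates quoted in step 5 for the selected model (example values).
INPUT_PER_1K_USD = 0.0001
OUTPUT_PER_1K_USD = 0.0002


def build_cost_event(team: str, model: str,
                     input_tokens: int, output_tokens: int) -> dict:
    """Compute billed cost and effective tokens for one completed request."""
    cost = (input_tokens * INPUT_PER_1K_USD
            + output_tokens * OUTPUT_PER_1K_USD) / 1000
    return {
        "team": team,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "effective_tokens": input_tokens + output_tokens,
        "cost_usd": round(cost, 6),
    }


event = build_cost_event("marketing", "claude-haiku", 200, 512)
```

Here 200 input tokens contribute $0.00002 and 512 output tokens contribute $0.0001024, giving the $0.000122 in the cost event, while 200 + 512 yields the 712 effective tokens added to the team counter.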

Error Flow

Error Condition	Detection	Response
Input token count exceeds use-case ceiling	Token counter at step 2	413 Request Entity Too Large with token count details
Team budget at hard 100% limit	Budget tracker at step 3	429 with budget-exhausted code; 24h until reset or manual approval needed
Cost model stale/unavailable	Cost model service timeout	Log warning; proceed with routing using last-known-good cost model
Batch API unavailable for batch-tagged request	Batch route check failure	Fall back to real-time API; log cost increase for later review
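The "last-known-good cost model" behaviour in the table above can be sketched as a thin wrapper around whatever fetches pricing. The `CostModelStore` class, its `stale` flag, and the injected `fetch` callable are hypothetical names; the only behaviour taken from the text is that routing proceeds with the previous cost model when the lookup fails.

```python
class CostModelStore:
    """Serve the latest cost model, falling back to the last good copy."""

    def __init__(self, fetch) -> None:
        self._fetch = fetch      # callable returning {model: (in_1k, out_1k)}
        self._last_good = None
        self.stale = False       # True when the returned data is a fallback

    def get(self) -> dict:
        try:
            self._last_good = self._fetch()
            self.stale = False
        except Exception:
            if self._last_good is None:
                # Nothing to fall back on: surface the failure to the caller.
                raise
            # Caller is expected to log a warning when stale is set.
            self.stale = True
        return self._last_good
```

Exposing staleness as a flag rather than swallowing the error silently lets the gateway emit the warning log the error flow calls for and feed a freshness alert.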

8. Security Considerations

Budget bypass attempts (manually setting max_tokens above the enforced ceiling) are rejected at the gateway; the ceiling is a platform-enforced control, not a suggestion
Consumer token counters are stored in a dedicated Redis instance with no direct consumer write access; only the gateway cost accounting service can increment counters
Cost event stream is read-only for consumers; teams can view their own consumption data but not other teams'
Chargeback reports are distributed per-team; cross-team visibility requires FinOps-level access

OWASP LLM Top 10 Controls

OWASP LLM Risk	Cost Control Layer
LLM04 Model DoS	Token budget per consumer prevents any single consumer exhausting platform capacity; this is both a cost and availability control
LLM08 Excessive Agency	Agentic loops are bounded by per-session token budgets; runaway agent loops are expensive before they are harmful

9. Governance Considerations

Budget Governance

Monthly token budgets per team are approved by the AI Governance Board and FinOps team jointly; requests for budget increases require business case documentation
Emergency budget extensions (overriding the hard ceiling) require explicit sign-off from the team's engineering manager and FinOps; all extensions are logged

Chargeback Model

Costs attributed via the cost event stream are the official basis for internal chargeback; teams are responsible for their attributed costs
The cost model is updated quarterly as provider pricing changes; teams are notified 30 days in advance of pricing model changes

Governance Artefacts

Artefact	Owner	Cadence	Location
Team budget schedule	FinOps + AI Governance Board	Annual (reviewed quarterly)	Platform configuration + finance system
Budget extension approvals	Engineering Manager + FinOps	Per-event	GRC system
Monthly chargeback report	Platform Team	Monthly	Finance system + team dashboards
Cost model pricing updates	Platform Team	Quarterly	Platform configuration
Cost optimisation roadmap	FinOps + Platform Team	Quarterly	Internal wiki

10. Operational Considerations

Monitoring

Signal	Source	Alert Threshold	Owner
Team budget at 80%	Budget tracker	Event-driven	FinOps + Team Lead
Team budget at 100%	Budget tracker	Event-driven (high urgency)	FinOps + Team Lead + Engineering Manager
Daily spend > 1.5× previous day average	Cost event aggregation	Daily window	FinOps On-Call
Abnormal token counts per request (P99 spike)	Request metrics	>200% of rolling P99 baseline	Platform On-Call
Cache hit rate drop	Cache metrics	<10% sustained 1 hour	Platform Team
Budget tracker service unavailable	Health check	Immediate	Platform On-Call
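Two of the signals above reduce to simple comparisons. The sketch below reads "1.5× previous day average" as 1.5 times the mean of recent daily spend, which is an interpretive assumption, and treats ">200% of rolling P99 baseline" as more than twice the baseline. Function names and the shape of the inputs are illustrative.

```python
from statistics import mean


def daily_spend_anomaly(recent_daily_spend: list[float], today: float,
                        factor: float = 1.5) -> bool:
    """Flag today's spend when it exceeds factor x the recent daily mean."""
    if not recent_daily_spend:
        return False  # no baseline yet, so nothing to compare against
    return today > factor * mean(recent_daily_spend)


def p99_token_spike(current_p99: float, baseline_p99: float,
                    ratio: float = 2.0) -> bool:
    """Flag a per-request token P99 above ratio x the rolling baseline."""
    return current_p99 > ratio * baseline_p99
```

Both checks are deliberately strict inequalities, so a value sitting exactly on the threshold does not page anyone.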

SLOs

SLO	Target	Window
Cost event ingestion latency	<5 seconds from request completion	Rolling 7 days
Budget counter accuracy	<1% variance from actual provider charges	Monthly reconciliation
Alert delivery latency	<60 seconds from threshold breach	Per-event
Dashboard data freshness	<5 minutes lag	Rolling 7 days
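The budget counter accuracy SLO is checked during monthly reconciliation. A minimal sketch of that check follows, assuming variance is measured as the absolute difference between tracked and invoiced spend relative to the invoice; the function names are hypothetical.

```python
def reconciliation_variance(tracked_usd: float, invoiced_usd: float) -> float:
    """Relative gap between platform-tracked spend and the provider invoice."""
    if invoiced_usd <= 0:
        raise ValueError("invoiced_usd must be positive")
    return abs(tracked_usd - invoiced_usd) / invoiced_usd


def within_accuracy_slo(tracked_usd: float, invoiced_usd: float,
                        tolerance: float = 0.01) -> bool:
    """True when variance is under the 1% target from the SLO table."""
    return reconciliation_variance(tracked_usd, invoiced_usd) < tolerance
```

With $9,950 tracked against a $10,000 invoice the variance is 0.5% and the SLO holds; at $9,800 tracked it is 2% and the reconciliation report should trigger a correction.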

Disaster Recovery

Component	RPO	RTO	Strategy
Budget counter (Redis)	5 min	5 min	Redis Sentinel; brief window of over-limit requests acceptable
Cost event stream (Kafka)	<1 min	10 min	Cross-region replication
Dashboard (read-only)	1 hour	30 min	Acceptable staleness for non-critical service
Chargeback report data	0	24 hours	Recomputable from cost event archive

11. Cost Considerations

Cost Drivers

Driver	Description	Relative Weight
Redis for budget counters	Minimal memory footprint; high throughput needed	Very Low
Cost event stream (Kafka/Kinesis)	Volume proportional to request rate	Low
Dashboard hosting	Read-only service; moderate cost	Low
LLM API costs (controlled)	Primary cost being managed; all controls aimed here	Dominant

Indicative Cost Range

Scale	Monthly Cost Control Infra	LLM Savings from Controls
Small (<1M tokens/day)	$100–$300	$500–$2,000 from tier routing + caching
Medium (1–50M tokens/day)	$500–$2,000	$5,000–$30,000 from combined controls
Large (>50M tokens/day)	$2,000–$8,000	$30,000–$150,000+ from combined controls

12. Trade-Off Analysis

Budget Enforcement Options

Option	Description	Pros	Cons	Best For
Hard Stop at 100%	Block all requests at budget limit	Absolute cost certainty	Operational incidents if budget misconfigured	Finance-controlled AI programmes; strict cost accountability
Soft Throttle at 100%	Allow requests at reduced rate (e.g., 10% of normal) after limit	Degraded not dead	Still accumulates cost above budget	Product-focused teams; uptime priority
Alert Only	No enforcement; only alerts at thresholds	No operational impact	No cost control; only cost visibility	Initial rollout; trust-based environment

Caching Strategy Options

Option	Description	Pros	Cons	Best For
Provider-Side Cache Only	Use provider KV cache for prefix caching	Zero additional infrastructure; reduces input tokens	Only prefix-level caching; no cross-request caching	Workloads with long stable system prompts
Semantic Cache Only	Platform-level near-match response caching	Cross-request caching; higher hit rate potential	Privacy considerations; false positive risk	FAQ, classification, search augmentation
Combined Provider + Semantic	Both layers active	Maximum cost reduction	Complexity; requires careful TTL management	High-volume mixed workloads

Architectural Tensions

Tension	Option A	Option B	Resolution
Strict per-request limits vs. flexible prompting	Hard input token ceiling	Soft guidance	Configurable per use-case class; creative use cases have higher limits
Team autonomy vs. cost governance	Teams set own budgets	Central FinOps sets all budgets	FinOps sets envelope; teams allocate within envelope by product/feature
Cache freshness vs. cost savings	Low TTL (fresh)	High TTL (cheap)	TTL per corpus type; static knowledge bases: long TTL; dynamic context: short/no TTL
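The "TTL per corpus type" resolution might be captured in configuration along these lines. The corpus type names and the specific durations are invented for illustration; only the principle (long TTL for static knowledge bases, short or no TTL for dynamic context) comes from the table.

```python
# Hypothetical TTL policy in seconds; 0 means "do not serve from cache".
CACHE_TTL_SECONDS = {
    "static-knowledge-base": 7 * 24 * 3600,  # long TTL: content rarely changes
    "product-catalogue": 6 * 3600,           # moderate TTL (assumed)
    "dynamic-context": 0,                    # never cached
}


def ttl_for(corpus_type: str) -> int:
    """Unknown corpus types default to no caching, the safe choice."""
    return CACHE_TTL_SECONDS.get(corpus_type, 0)


def is_fresh(cached_at: float, now: float, corpus_type: str) -> bool:
    """True when a cached response is still inside its corpus TTL."""
    ttl = ttl_for(corpus_type)
    return ttl > 0 and (now - cached_at) < ttl
```

Defaulting unclassified corpora to a TTL of zero trades some savings for correctness until the owning team assigns an explicit policy.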

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Budget tracker (Redis) failure	Medium	Medium — budget enforcement suspended	Redis health check fail	Fail-safe: revert to rate limiting only; alert FinOps
Cost model stale (pricing outdated)	Medium	Low — routing decisions suboptimal	Automatic freshness check alert	Manual pricing update; automated via provider pricing API
Token counter drift (Redis vs. actual spend)	Low	Medium — budget accountability gap	Monthly reconciliation vs. provider invoice	Reconciliation report triggers manual correction
Alert fatigue (too many budget warnings)	High	Low-Medium — alerts ignored	Alert volume metrics	Tune thresholds; consolidate daily digest vs. real-time alerts
Batch API failure causing real-time fallback	Medium	Medium — unexpected cost increase	Batch failure rate spike	Alert FinOps; teams approve real-time cost increase or pause workload

14. Regulatory Considerations

APRA CPS 230 (Operational Risk)

Cost control mechanisms are operational risk controls for the AI platform; the budget enforcement system must itself be resilient
AI cost overruns that materially affect the organisation's operational budget may constitute an operational risk event reportable under CPS 230

Financial Reporting

Internal cost attribution data must be accurate enough to support financial reporting; the cost event reconciliation process ensures chargeback data matches actual provider invoices

15. Reference Implementations

AWS

Component	AWS Service
Budget counters	ElastiCache Redis
Cost events	Kinesis Data Streams → S3 → Athena
Alerts	CloudWatch Alarms + SNS → PagerDuty / Slack
Dashboard	CloudWatch custom dashboards + Grafana
Chargeback reports	Athena queries + S3 + QuickSight
Provider pricing API	AWS Price List API for Bedrock models (where available)

Azure

Component	Azure Service
Budget counters	Azure Cache for Redis
Cost events	Event Hubs → Azure Data Lake Gen2
Alerts	Azure Monitor Alerts + Action Groups
Dashboard	Azure Monitor Workbooks + Grafana

On-Premises

Component	Technology
Budget counters	Redis Enterprise
Cost events	Apache Kafka → ClickHouse
Dashboard	Grafana + ClickHouse data source
Alerts	Alertmanager → PagerDuty

16. Related Patterns

Pattern ID	Name	Relationship
EAAPL-PLT002	AI API Gateway	Host — budget enforcement implemented within gateway pipeline
EAAPL-PLT003	Model Routing	Component — cost-based routing is a cost control mechanism
EAAPL-PLT006	LLM Caching Layer	Complementary — caching reduces effective token consumption
EAAPL-PLT001	Enterprise AI Platform	Parent — cost management is a shared service
EAAPL-INT005	Batch AI Processing	Complementary — batch routing reduces cost for async workloads

17. Maturity Assessment

Overall Maturity: Proven

Token budget enforcement and model tier routing are production-proven at scale. Provider-side prompt caching is a relatively recent feature (2024) that is proving high-value. The combined pattern has strong ROI evidence across multiple enterprise deployments.

Scoring Matrix

Dimension	Score (1–5)	Rationale
Pattern Completeness	5	All sections documented
Implementation Evidence	4	Core controls proven; provider cache integration emerging
ROI Evidence	5	Consistent 40–60% spend reduction documented
Tooling Maturity	4	Redis counters and dashboards mature; provider pricing APIs variable
Operational Complexity	Medium	Budget configuration requires FinOps discipline; manageable

18. Revision History

Version	Date	Author	Changes
1.0	2024-04-01	EAAPL Working Group	Initial publication
1.1	2024-11-10	EAAPL Working Group	Provider-side prompt caching section added; batch API cost models updated
1.2	2025-06-12	EAAPL Working Group	Cost range data updated; tiered budget enforcement (soft/hard/ceiling) documented
