[EAAPL-PLT006] LLM Caching Layer

Category: Platform Engineering
Sub-category: Performance / Cost Optimisation
Version: 1.1
Maturity: Proven
Tags: caching, semantic-cache, exact-match-cache, cache-invalidation, prompt-cache, cost-reduction, latency, privacy
Regulatory Relevance: Privacy Act, GDPR (cached PII), ISO 27001 (data retention)

1. Executive Summary

LLM inference is expensive and often redundant. Across enterprise deployments, 20–40% of LLM requests are near-duplicates of previous requests—particularly in customer service, FAQ answering, document summarisation, and search augmentation use cases. The LLM Caching Layer pattern systematically intercepts these redundant calls and serves responses from cache, reducing both token costs and end-user latency simultaneously.

This pattern defines two complementary caching mechanisms: exact-match caching for deterministic, templated prompts (100% identical prompt → 100% cache hit), and semantic caching for paraphrased or slightly varied prompts that should produce the same response (vector similarity above a configured threshold → cache hit). Together, these mechanisms create a caching layer that is transparent to consuming applications, requires no application code changes, and delivers measurable ROI from day one. The pattern also addresses the critical privacy and security concern of cached responses: ensuring cached content from one consumer is never served to another, and that PII-containing responses are handled with appropriate retention policies.

2. Problem Statement

Business Problem

LLM inference budgets are consumed by redundant computation. When a thousand customers ask the same product question in slightly different words, the enterprise pays for a thousand independent model inferences when a single cached response would suffice. This is not only an engineering concern: it is a direct charge to the business unit's operational budget with no corresponding value.

Technical Problem

Model inference has latency that varies from 500ms to 15+ seconds depending on model and response length. For customer-facing features, this latency is a user experience problem. Without caching, every request waits for model inference even when the answer is known. Furthermore, streaming responses are difficult to cache, requiring specialised handling.

Symptoms

Identical or near-identical prompts appearing in request logs at high frequency with no caching
Customer-facing AI features with P95 latency >2 seconds for common queries
LLM inference costs growing proportionally with user volume rather than with unique queries
Teams separately implementing ad hoc caching solutions (in-memory, Redis) in their applications

Cost of Inaction

20–40% of LLM spend attributable to redundant inference with no caching in place
Latency spikes during peak load as model providers throttle under increased concurrent requests
Engineering effort duplicated as teams build bespoke caching independently

3. Context

When to Apply

High volume of LLM requests with expected query repetition (FAQ, search, classification, summarisation)
Latency improvement for AI features is a product quality objective
An LLM spend optimisation initiative is under way
Centralised AI API Gateway (PLT002) exists as the appropriate host for shared caching

When NOT to Apply

Highly personalised prompts where every request contains unique user context: cache hit rate will be negligible
Creative generation use cases (story writing, code generation from unique specs): semantic similarity does not imply equivalent quality
Prompts with real-time data requirements (current price, today's news): stale cached responses would be incorrect
Applications requiring non-deterministic response variety: caching by definition reduces variety

Prerequisites

AI API Gateway as host for the caching layer (PLT002)
Redis or compatible vector-enabled cache infrastructure
Embedding model (local or API) for semantic similarity computation
Defined TTL policies per use case and corpus type
Privacy assessment completed for cache content retention

Industry Applicability

Industry	Applicability	Key Workload
Retail / E-commerce	Very High	Product description generation, FAQ, search augmentation
Financial Services	High	FAQ, policy explanation, templated document processing
Technology / SaaS	High	Developer documentation, code explanation, support chatbot
Healthcare	Medium	Clinical FAQ (carefully scoped); administrative queries
Media	High	Content classification, tagging, summarisation at scale
Government	Medium	Policy FAQ, document summarisation

4. Architecture Overview

The LLM Caching Layer is positioned as middleware within the AI API Gateway's request pipeline, executing after policy enforcement and before model routing. This positioning is deliberate: policy enforcement runs before caching to ensure that cache lookups do not bypass security controls, and caching runs before routing to prevent unnecessary routing computation for cache hits.

Exact-Match Caching targets the simplest case: when the complete prompt, once normalised, is identical to that of a previous request. A SHA-256 hash of the canonical prompt (normalised for whitespace, with dynamic variables stripped) is used as the cache key. On a cache hit, the stored response is returned immediately with an X-Cache: HIT header. This mechanism is particularly effective for system-prompt-heavy applications where the system prompt constitutes >90% of the input tokens and changes rarely.

The cache key design for exact-match is subtle. The full prompt including the system prompt must be hashed, not just the user message. The model name and version must be included in the cache key, since the same prompt produces different outputs from different models. The response format parameters (temperature, top_p, max_tokens, output format) must also be included, since they affect the response. A cache key that ignores any of these dimensions risks serving a response produced under different conditions.
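
The key design described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the payload field names, and the `namespace:digest` layout are assumptions chosen for the example, and stripping of dynamic variables is left out because it is use-case specific.

```python
import hashlib
import json
import re


def normalise_prompt(text: str) -> str:
    """Collapse runs of whitespace and trim the ends so that trivial
    formatting differences do not change the hash."""
    return re.sub(r"\s+", " ", text).strip()


def exact_cache_key(namespace: str, model: str, system_prompt: str,
                    user_prompt: str, params: dict) -> str:
    """Build a namespaced SHA-256 key over every dimension that affects output."""
    payload = {
        "model": model,  # model name and version
        "system": normalise_prompt(system_prompt),
        "user": normalise_prompt(user_prompt),
        "params": params,  # temperature, top_p, max_tokens, output format
    }
    # sort_keys gives a stable serialisation regardless of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"
```

Serialising the components as JSON rather than concatenating strings avoids accidental collisions between, for example, a prompt that ends with a model name and a different prompt paired with that model.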

Semantic Caching handles the more complex case: prompts that are paraphrases of each other and should produce equivalent responses. The user-facing portion of the prompt (not the system prompt, which is typically stable) is embedded using a local embedding model into a dense vector. This vector is queried against a vector store (Redis with RediSearch, pgvector, or a dedicated vector database) using approximate nearest-neighbour search with cosine similarity. If the nearest stored prompt vector has a similarity score above the configured threshold, the associated stored response is returned.
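
The lookup logic can be illustrated with a brute-force search over stored vectors. This sketch assumes embeddings are already computed and held in memory as plain lists; a production deployment would delegate the search to the vector store's approximate nearest-neighbour index instead of scanning every entry. The function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(query_vec, entries, threshold: float) -> Optional[tuple]:
    """entries is a list of (vector, response) pairs.

    Returns (response, score) for the nearest entry if its score is above
    the threshold, otherwise None (a semantic cache miss).
    """
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    if best_response is not None and best_score > threshold:
        return best_response, best_score
    return None
```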

The similarity threshold is the most operationally sensitive parameter in semantic caching. Too high (>0.99): essentially exact-match behaviour, low hit rate. Too low (<0.90): risk of serving an incorrect response to a superficially similar but semantically different prompt. The appropriate threshold depends on the use case: FAQ answering (high tolerance for paraphrase, threshold 0.92–0.95), code generation (low tolerance, threshold 0.98+), product description (medium tolerance, threshold 0.94–0.97). Thresholds should be validated against a test set of similar-but-different prompts to measure false positive rate.
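
Threshold validation against a labelled test set might look like the sketch below. It assumes each test pair has already been scored and labelled by a reviewer as either "same answer expected" or "different answer required"; the data shape and function names are illustrative assumptions.

```python
def false_positive_rate(scored_pairs, threshold: float) -> float:
    """scored_pairs is a list of (similarity_score, same_answer_expected).

    A false positive is a pair that would hit the cache (score above the
    threshold) although the two prompts require different answers. The rate
    is taken over all pairs that require different answers.
    """
    negatives = [score for score, same_answer in scored_pairs if not same_answer]
    if not negatives:
        return 0.0
    return sum(score > threshold for score in negatives) / len(negatives)


def true_hit_rate(scored_pairs, threshold: float) -> float:
    """Share of genuine paraphrase pairs that would be served from cache."""
    positives = [score for score, same_answer in scored_pairs if same_answer]
    if not positives:
        return 0.0
    return sum(score > threshold for score in positives) / len(positives)
```

Sweeping the threshold over a range and plotting both rates gives the prompt owner a concrete basis for choosing a value, rather than relying on the indicative ranges alone.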

Cache Invalidation is the hardest problem in caching, and LLM response caches are no exception. Three invalidation triggers are relevant: time-based TTL (responses expire after a configured period; appropriate for most use cases), corpus-update-triggered invalidation (when the knowledge base underlying the AI responses changes, all related cache entries must be invalidated; this requires tagging cache entries with corpus version metadata), and explicit invalidation (an API endpoint allowing platform operators or application owners to flush specific entries or entire namespaces). Corpus-update invalidation is the most complex; it requires the cache to maintain an index of which knowledge corpus version each entry was computed against.
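
The three triggers can be sketched with an in-memory store. This is a simplified illustration under stated assumptions: it checks the corpus version lazily on read instead of maintaining the index described above, it uses a Python dict in place of Redis, and the class and method names are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class CacheEntry:
    response: str
    expires_at: float    # absolute expiry time (time-based TTL trigger)
    corpus_version: str  # knowledge-base version the response was computed against


class ResponseCache:
    def __init__(self, clock=time.monotonic):
        self._entries = {}
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def put(self, key: str, response: str, ttl_seconds: float, corpus_version: str):
        self._entries[key] = CacheEntry(
            response, self._clock() + ttl_seconds, corpus_version
        )

    def get(self, key: str, current_corpus_version: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expired = self._clock() >= entry.expires_at
        stale = entry.corpus_version != current_corpus_version
        if expired or stale:
            del self._entries[key]  # drop the entry so it is never served again
            return None
        return entry.response

    def flush_namespace(self, namespace: str) -> int:
        """Explicit invalidation: remove every key under 'namespace:'."""
        doomed = [k for k in self._entries if k.startswith(namespace + ":")]
        for k in doomed:
            del self._entries[k]
        return len(doomed)
```

Lazy version checking keeps the write path simple, at the cost of leaving stale entries in memory until they are next read or their TTL expires.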

Privacy and Tenancy Controls are non-negotiable. Cache entries must be scoped to the consumer namespace by default: a cached response for Team A's prompt must not be returned for Team B's identical prompt unless explicitly configured as a shared cache. For responses that may contain PII (personalised summaries, user-specific answers), caching must be disabled at the per-request level via an X-Cache-Control: no-store header or use-case configuration. All cache entries must respect their TTL strictly. There must be no mechanism to enumerate cached prompt content.
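
A minimal sketch of these two controls follows. The `cache_enabled` configuration field, the `shared_pool` parameter, and the choice to treat caching as disabled unless a use case opts in are assumptions made for the example, not part of the pattern's specification.

```python
from typing import Optional


def is_cacheable(headers: dict, use_case_config: dict) -> bool:
    """Decide whether a request may be read from or written to the cache."""
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {k.lower(): str(v).strip().lower() for k, v in headers.items()}
    if lowered.get("x-cache-control") == "no-store":
        return False
    # Assumption: a use case must opt in to caching explicitly.
    return bool(use_case_config.get("cache_enabled", False))


def scoped_key(team: str, prompt_hash: str, shared_pool: Optional[str] = None) -> str:
    """Prefix the key with the consumer namespace, or with a shared pool
    name when the platform team has explicitly configured one."""
    namespace = shared_pool if shared_pool else team
    return f"{namespace}:{prompt_hash}"
```

Because the namespace is part of the key itself, two teams sending an identical prompt produce different keys unless both are mapped to the same shared pool.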

Cache Observability exposes the operational value of caching. Cache hit rate, cache size, eviction rate, average response latency (hit vs. miss), and attributable token savings are published as metrics. These metrics drive the FinOps dashboard's cache savings attribution and inform threshold tuning decisions.
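
The core counters behind these metrics can be sketched as below. The class is a hypothetical in-process accumulator; in practice the values would be exported through a metrics library instead of being held in a dataclass.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    tokens_saved: int = 0

    def record_hit(self, input_tokens: int, output_tokens: int) -> None:
        """Attribute the token count of the original cached entry as savings."""
        self.hits += 1
        self.tokens_saved += input_tokens + output_tokens

    def record_miss(self) -> None:
        self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```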

5. Architecture Diagram

flowchart TD
    subgraph Request["Request Pipeline"]
        A[Incoming Request]
        B[Policy + Auth]
        C[Prompt Normaliser]
    end
    subgraph Cache["Caching Layer"]
        D{Exact Hash Match?}
        E[Embedding Model]
        F{Semantic Match?}
    end
    subgraph Backends["Cache and Model"]
        G[(Exact-Match Store)]
        H[(Vector Store)]
        I[Model Router + Proxy]
    end
    A --> B
    B --> C
    C --> D
    D -->|hit| J[Return Cached Response]
    D -->|miss| E
    E --> F
    F -->|above threshold| J
    F -->|miss| I
    I --> G
    I --> H
    I --> J
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#dbeafe,stroke:#3b82f6
    style J fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Prompt Normaliser	Service	Strip dynamic variables; canonicalise whitespace for stable hashing	Custom pre-processing function	High
Exact-Match Hash Generator	Service	Compute deterministic cache key from normalised prompt + model + params	SHA-256 (standard library)	High
Exact-Match Cache Store	Service	Store and retrieve response by exact hash key	Redis (string key → JSON value), Memcached	High
Embedding Model	Service	Convert prompt to dense vector for semantic similarity	text-embedding-3-small (OpenAI), local sentence-transformers, BGE-M3	High
Embedding Cache	Service	Cache computed embeddings to avoid re-embedding same prompt	Redis hash keyed by prompt SHA-256	Medium
Vector Store	Service	Store and query prompt vectors for ANN search	Redis RediSearch, pgvector, Qdrant, Weaviate, Milvus	High
Similarity Threshold Evaluator	Service	Compare ANN search score to use-case-specific threshold	Custom middleware, GPTCache policy	High
Cache Write Handler	Service	Asynchronously write model response + vector to cache after inference	Async task (asyncio, Celery)	High
Cache Invalidation API	Service	Flush by namespace, corpus version, or individual key	Custom REST API	Medium
Cache Metrics Exporter	Service	Publish hit rate, latency, token savings metrics	Prometheus exporter, CloudWatch custom metrics	Medium
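
The Cache Write Handler's behaviour (return the response first, write to cache in the background) can be sketched with asyncio. `call_model` and `write_cache` are hypothetical coroutines supplied by the caller; the sketch only shows the scheduling and error-logging shape.

```python
import asyncio
import logging

logger = logging.getLogger("llm_cache")


def _log_write_failure(task: asyncio.Task) -> None:
    """Done-callback: surface a failed cache write without raising."""
    if not task.cancelled() and task.exception() is not None:
        logger.error("cache write failed: %s", task.exception())


async def respond_and_cache(call_model, write_cache, key: str, prompt: str):
    """Await the model, then schedule the cache write without waiting for it."""
    response = await call_model(prompt)
    # Schedule the write in the background; the consumer is not blocked by it.
    task = asyncio.create_task(write_cache(key, response))
    task.add_done_callback(_log_write_failure)
    # Return the task so the caller can keep a reference to it; the event
    # loop itself only holds a weak reference to running tasks.
    return response, task
```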

7. Data Flow

Primary Flow — Semantic Cache Hit

Step	Actor	Action	Output
1	Gateway	Receive request; pass policy enforcement	Authenticated, authorised request
2	Prompt Normaliser	Remove dynamic variables (user name, date); strip excess whitespace	Canonical prompt
3	Hash Generator	SHA-256 of canonical prompt + model name and version + generation parameters (temperature, top_p, max_tokens, output format)	Cache key: `sha256(prompt+model+params)`
4	Exact-Match Lookup	Query Redis with key; not found	Cache miss → proceed to semantic
5	Embedding Model	Compute 1536-dim vector for user message portion	Embedding vector
6	Vector Store Query	ANN search with cosine similarity; retrieve top-1 result	Score: 0.963; stored response for similar prompt
7	Threshold Check	0.963 > 0.95 threshold for use-case `faq`	Semantic cache hit
8	Response Return	Return stored response with headers: `X-Cache: HIT-SEMANTIC`, `X-Cache-Score: 0.963`	Consumer receives response; zero model API call
9	Metrics	Emit cache hit event; attribute token savings = (input_tokens + output_tokens) of original entry	FinOps savings attributed
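
The flow in the table can be condensed into a single lookup function. This is a sketch under assumptions: `embed` and `semantic_search` are injected callables standing in for the embedding model and vector store, the exact-match store is a plain dict, and the `X-Cache: MISS` header value is illustrative (the pattern only names the hit headers).

```python
from typing import Callable


def two_layer_lookup(exact_key: str, user_message: str, exact_store: dict,
                     semantic_search: Callable, embed: Callable,
                     threshold: float):
    """Return (response, headers) on a hit, or (None, headers) on a miss."""
    # Layer 1: exact-match lookup by hash key (steps 3 and 4).
    cached = exact_store.get(exact_key)
    if cached is not None:
        return cached, {"X-Cache": "HIT"}

    # Layer 2: embed the user message and query the vector store (steps 5 and 6).
    vector = embed(user_message)
    match = semantic_search(vector)  # expected to return (response, score) or None
    if match is not None:
        response, score = match
        # Step 7: only serve the stored response if the score clears the threshold.
        if score > threshold:
            headers = {"X-Cache": "HIT-SEMANTIC", "X-Cache-Score": f"{score:.3f}"}
            return response, headers

    # Both layers missed: the caller proceeds to model inference.
    return None, {"X-Cache": "MISS"}
```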

Error Flow

Error	Detection	Response
Redis unavailable	Health check; connection timeout	Bypass cache; proceed to model inference; log warning
Vector store query timeout	Query latency >50ms	Bypass semantic cache; attempt exact-match only; log warning
Embedding model failure	Embedding call error	Bypass semantic cache; exact-match only or full bypass
Cache write failure	Async write error	Log error; response still returned to consumer; cache miss on next identical request
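
Every row in the table above follows the same fail-open principle: a cache-side fault degrades to a miss and never fails the consumer's request. A minimal sketch, with a hypothetical wrapper name:

```python
import logging

logger = logging.getLogger("llm_cache")


def lookup_fail_open(lookup, *args, **kwargs):
    """Run a cache lookup and treat any cache-side error as a miss."""
    try:
        return lookup(*args, **kwargs)
    except Exception as exc:
        # Deliberately broad: connection errors, timeouts, and embedding
        # failures all lead to the same outcome, a bypass to model inference.
        logger.warning("cache bypassed: %s", exc)
        return None
```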

8. Security Considerations

Tenant Isolation

Cache namespace prefix is always scoped to consumer team namespace; {team}:{prompt_hash} prevents cross-tenant cache hits by default
Shared cache pools (opt-in) require explicit platform team configuration and are only appropriate for non-personalised, non-confidential content (e.g., public FAQ)
Redis ACLs enforce per-namespace read/write isolation; no consumer can access another tenant's cache entries

Privacy Controls

Requests tagged with X-Cache-Control: no-store (from use-case configuration or explicit consumer header) bypass both exact and semantic cache
Cache entries containing PII-derived responses must have reduced TTL (max 24 hours) and must be excluded from any cross-tenant sharing
Cached responses are stored as ciphertext (AES-256) in Redis when classified above INTERNAL sensitivity level

Data Retention

Cache TTLs are the mechanism for data retention compliance; they must be aligned with the privacy policy for the prompt content type
TTLs must be audited regularly to ensure cache retention does not exceed applicable data retention requirements

OWASP LLM Controls

OWASP LLM Risk	Cache Layer Control
LLM06 Sensitive Information Disclosure	Tenant isolation prevents cross-tenant response leakage; PII cache disable prevents PII retention
LLM09 Overreliance	Cache hit responses must carry the same `X-AI-Generated` header as live responses; staleness indicator on old cache entries

9. Governance Considerations

Cache Policy Governance

TTL policies per use case are defined by the Data Governance team and enforced by the platform; product teams cannot override TTLs beyond their configured maximum
Semantic similarity thresholds per use case are reviewed by the prompt owner on a quarterly basis or after any quality incident attributable to cache false positives

Governance Artefacts

Artefact	Owner	Cadence	Location
Cache TTL policy	Data Governance Team	Annual review	Platform configuration
Similarity threshold settings	Prompt Owner + Platform Team	Quarterly	Platform configuration
Cache privacy impact assessment	Privacy Team	Per new use case	Privacy register
Cache false positive incident log	Platform Team	Per incident	Incident management system

10. Operational Considerations

Monitoring

Signal	Source	Alert Threshold	Owner
Overall cache hit rate	Cache metrics	<10% sustained 30 min (may indicate threshold misconfiguration or genuinely uncacheable workload)	Platform Team
Exact-match hit rate	Cache metrics	Informational; track trend	Platform Team
Semantic false positive reports	User feedback + quality monitoring	Any confirmed false positive	Platform Team + Prompt Owner
Redis memory utilisation	Redis metrics	>80% of configured max memory	Platform On-Call
Cache write failure rate	Async write error metrics	>1%	Platform Team

SLOs

SLO	Target	Window
Cache lookup latency (exact-match)	<5ms	Rolling 7 days
Cache lookup latency (semantic, including embedding)	<50ms	Rolling 7 days
Cache hit rate (FAQ/summarisation use cases)	>20%	Rolling 7 days
Redis availability	99.9%	Rolling 30 days

Disaster Recovery

Component	RPO	RTO	Strategy
Redis exact-match cache	1 hour	5 min	Redis Sentinel; cache is soft state; rebuild on miss
Vector store	1 hour	15 min	Backup + restore; cache warms naturally over time
Embedding cache	1 hour	5 min	Redis Sentinel; recomputable

11. Cost Considerations

Cost Drivers

Driver	Description	Relative Weight
Redis memory	Proportional to cached response volume and TTLs	Medium
Vector store hosting	Varies by vector count and query rate	Medium
Embedding model API (if external)	Per embedding API call; offset by cache savings	Low if local model used
Cache infrastructure savings	Negative cost — reduces model API spend	Dominant positive ROI

Optimisations

Use a local embedding model (sentence-transformers, BGE) rather than an API embedding model to eliminate per-embedding costs and latency
Configure Redis LRU eviction policy to automatically evict least-recently-used entries when memory pressure is high, maintaining cache quality without manual management
Set TTLs aggressively short for use cases with dynamic context; longer for static knowledge-base use cases

Indicative Cost Range

Scale	Cache Infra Monthly Cost	Monthly Token Savings
Small (1M tokens/day, 20% hit rate)	$150–$400	$1,000–$3,000
Medium (10M tokens/day, 25% hit rate)	$500–$1,500	$8,000–$20,000
Large (100M tokens/day, 30% hit rate)	$2,000–$6,000	$60,000–$150,000
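
The savings figures depend heavily on the blended token price, which the table does not state. The sketch below makes the arithmetic explicit so that teams can substitute their own numbers; the parameter names and the 30-day month are assumptions.

```python
def monthly_token_savings(tokens_per_day: float, hit_rate: float,
                          price_per_million_tokens: float, days: int = 30) -> float:
    """Model spend avoided per month by serving hit_rate of traffic from cache."""
    tokens_saved = tokens_per_day * days * hit_rate
    return tokens_saved / 1_000_000 * price_per_million_tokens


def net_monthly_benefit(tokens_per_day: float, hit_rate: float,
                        price_per_million_tokens: float,
                        cache_infra_monthly_cost: float) -> float:
    """Avoided model spend minus the cost of running the cache itself."""
    savings = monthly_token_savings(tokens_per_day, hit_rate, price_per_million_tokens)
    return savings - cache_infra_monthly_cost
```

A negative result from `net_monthly_benefit` indicates that, at the given volume and price, the cache infrastructure costs more than the inference it avoids.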

12. Trade-Off Analysis

Caching Strategy Options

Strategy	Description	Pros	Cons	Best For
Exact-Match Only	Cache on full prompt SHA-256	Zero false positive risk; simple; deterministic	Low hit rate on natural language; paraphrases miss	Templated/structured prompts; deterministic workflows
Semantic Cache Only	Cache on embedding similarity	High hit rate on paraphrased queries	False positive risk; embedding latency adds overhead	Natural language FAQ, search, support
Two-Layer (Exact + Semantic)	Exact-match checked first; semantic as fallback	Maximum hit rate; zero FP risk for exact layer	Complexity; two infrastructure components	High-volume mixed workloads; recommended default

Similarity Threshold Options

Threshold Level	False Positive Rate	Hit Rate	Best For
Conservative (0.97–0.99)	Very Low	Low	Legal/clinical content where wrong answer has serious consequences
Balanced (0.93–0.96)	Low	Moderate	FAQ, general enterprise Q&A
Aggressive (0.88–0.92)	Medium	High	Simple classification, sentiment; tolerance for minor variation acceptable

Architectural Tensions

Tension	Option A	Option B	Resolution
Cache hit rate vs. response accuracy	Low threshold for high hit rate	High threshold for accuracy	Per-use-case thresholds configured and governed
Cache freshness vs. cost efficiency	Low TTL (always fresh)	High TTL (cost efficient)	TTL by corpus type; static corpus: 7 days; dynamic: 1 hour
Cross-tenant cache sharing vs. data isolation	Shared pool for efficiency	Per-tenant isolation	Per-tenant by default; opt-in sharing for non-sensitive public content

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Semantic false positive (wrong response served)	Low	High: incorrect response delivered to user	User feedback; quality monitoring	Remove offending cache entry; raise threshold for affected use case
Redis OOM (out of memory)	Medium	Medium — LRU eviction increases miss rate	Redis memory metrics	Increase Redis memory; tune TTLs; scale Redis cluster
Embedding model cold start latency	Medium	Low — first semantic lookup slow after deployment	Latency spike post-deployment	Pre-warm embedding model with representative prompts
Cache stampede on cold start	Low	Medium — burst of misses hitting model simultaneously	Response time spike after cache flush	Probabilistic early expiration (PER); staggered cache rebuild
Stale cache served after knowledge base update	Medium	High — outdated information served	User feedback; corpus version mismatch	Corpus-version-tagged invalidation; mandatory invalidation trigger on KB update
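
Probabilistic early expiration, named in the stampede row above, can be sketched in one common formulation: each reader independently decides to refresh slightly before expiry, with a probability that rises as expiry approaches. The function name, the `beta` tuning parameter, and the injectable random source are assumptions of this example.

```python
import math
import random


def should_refresh_early(now: float, expires_at: float, recompute_seconds: float,
                         beta: float = 1.0, rng=random.random) -> bool:
    """Return True if this reader should recompute the entry now.

    recompute_seconds is how long a fresh model call is expected to take;
    a larger beta makes early refresh more likely.
    """
    r = max(rng(), 1e-12)  # random.random() can return 0.0; avoid log(0)
    # -log(r) is non-negative, so the effective time is shifted forward by a
    # random amount scaled by the recompute cost.
    return now - recompute_seconds * beta * math.log(r) >= expires_at
```

Because each reader draws its own random number, refreshes are spread over time and do not all fire at the instant the entry expires.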

Cascading Scenario

Vector store failure + Redis failure simultaneously: Both caching layers unavailable; all requests fall through to model inference. At scale this can immediately exceed provider rate limits and trigger widespread 429 errors. Mitigation: circuit breaker on cache bypass path; pre-configured rate limit reduction to stay within provider rate limits during cache outage.
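
One way to implement the rate limit reduction on the bypass path is a token bucket that admits only as many requests per second as the provider quota allows. The class below is an illustrative sketch with a hypothetical name; what the gateway does with rejected requests (queue, shed, or return an error) is left to the deployment.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_second: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self._tokens = capacity
        self._clock = clock  # injectable so refill can be tested without sleeping
        self._last = clock()

    def try_acquire(self) -> bool:
        """Admit one request if a token is available, otherwise reject it."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```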

14. Regulatory Considerations

Privacy Act / GDPR

Cached responses derived from personal information require the same legal basis as the original processing; caching must not extend the effective retention period beyond what the privacy policy allows
TTL configuration is a data retention control; it must be managed by the Data Governance team, not left to engineering discretion
Data subject deletion requests may require cache entries derived from that subject's data to be invalidated; the corpus-version invalidation mechanism supports this if prompts can be correlated to data subjects

ISO 27001

Cache stores are information assets requiring appropriate access controls, encryption at rest, and audit logging in line with ISO 27001 Annex A.8

15. Reference Implementations

AWS

Component	AWS Service
Exact-match cache	ElastiCache Redis
Vector store	OpenSearch Service (k-NN index) or pgvector on RDS
Embedding model	Local container on ECS/Lambda, or Titan Embeddings via Bedrock
Metrics	CloudWatch custom metrics

Azure

Component	Azure Service
Exact-match cache	Azure Cache for Redis
Vector store	Azure AI Search (vector search)
Embedding model	Azure OpenAI Embeddings or local container

On-Premises

Component	Technology
Exact + semantic cache	GPTCache (open source) — combines both layers
Vector store	Qdrant or Weaviate self-hosted
Embedding model	sentence-transformers/all-MiniLM-L6-v2 (local, no API cost)

16. Related Patterns

Pattern ID	Name	Relationship
EAAPL-PLT002	AI API Gateway	Host — caching layer is a stage in the gateway pipeline
EAAPL-PLT004	LLM Cost Control	Complementary — caching is a primary cost reduction mechanism
EAAPL-PLT001	Enterprise AI Platform	Parent — caching is a platform shared service
EAAPL-PLT003	Model Routing	Complementary — cache hit bypasses routing entirely

17. Maturity Assessment

Overall Maturity: Proven

LLM caching is production-proven at scale. Exact-match caching is commodity. Semantic caching via GPTCache and Redis Vector Search is operationally mature. Corpus-version invalidation is the least-standardised component.

Scoring Matrix

Dimension	Score (1–5)	Rationale
Pattern Completeness	5	All sections documented
Implementation Evidence	4	Exact-match: 5; semantic: 4; invalidation: 3
Tooling Maturity	4	GPTCache, Redis RediSearch stable; pgvector rapidly maturing
Privacy / Compliance Rigor	4	TTL as retention control documented; automated data-subject deletion emerging
Cost ROI	5	Consistent 20–40% token savings documented

18. Revision History

Version	Date	Author	Changes
1.0	2024-06-01	EAAPL Working Group	Initial publication
1.1	2025-06-12	EAAPL Working Group	Corpus-version invalidation documented; privacy controls expanded; cascading failure scenario added
