[EAAPL-PLT006] LLM Caching Layer
Category: Platform Engineering
Sub-category: Performance / Cost Optimisation
Version: 1.1
Maturity: Proven
Tags: caching, semantic-cache, exact-match-cache, cache-invalidation, prompt-cache, cost-reduction, latency, privacy
Regulatory Relevance: Privacy Act, GDPR (cached PII), ISO 27001 (data retention)
1. Executive Summary
LLM inference is expensive and often redundant. Across enterprise deployments, 20–40% of LLM requests are near-duplicates of previous requests, particularly in customer service, FAQ answering, document summarisation, and search augmentation use cases. The LLM Caching Layer pattern systematically intercepts these redundant calls and serves responses from cache, reducing both token costs and end-user latency.
This pattern defines two complementary caching mechanisms: exact-match caching for deterministic, templated prompts (100% identical prompt → 100% cache hit), and semantic caching for paraphrased or slightly varied prompts that should produce the same response (vector similarity above a configured threshold → cache hit). Together, these mechanisms create a caching layer that is transparent to consuming applications, requires no application code changes, and delivers measurable ROI from day one. The pattern also addresses the critical privacy and security concern of cached responses: ensuring cached content from one consumer is never served to another, and that PII-containing responses are handled with appropriate retention policies.
2. Problem Statement
Business Problem
LLM inference budgets are consumed by redundant computation. When a thousand customers ask the same product question in slightly different words, the enterprise pays for a thousand independent model inferences when a single cached response would suffice. This is not an engineering concern—it is a direct charge to the business unit's operational budget with no corresponding value.
Technical Problem
Model inference has latency that varies from 500ms to 15+ seconds depending on model and response length. For customer-facing features, this latency is a user experience problem. Without caching, every request waits for model inference even when the answer is known. Furthermore, streaming responses are difficult to cache, requiring specialised handling.
Symptoms
- Identical or near-identical prompts appearing in request logs at high frequency with no caching
- Customer-facing AI features with P95 latency >2 seconds for common queries
- LLM inference costs growing proportionally with user volume rather than with unique queries
- Teams separately implementing ad hoc caching solutions (in-memory, Redis) in their applications
Cost of Inaction
- 20–40% of LLM spend attributable to redundant inference with no caching in place
- Latency spikes during peak load as model providers throttle under increased concurrent requests
- Engineering effort duplicated as teams build bespoke caching independently
3. Context
When to Apply
- High volume of LLM requests with expected query repetition (FAQ, search, classification, summarisation)
- Latency improvement for AI features is a product quality objective
- LLM spend optimisation is an active objective
- Centralised AI API Gateway (PLT002) exists as the appropriate host for shared caching
When NOT to Apply
- Highly personalised prompts where every request contains unique user context: cache hit rate will be negligible
- Creative generation use cases (story writing, code generation from unique specs): semantic similarity does not imply equivalent quality
- Prompts with real-time data requirements (current price, today's news): stale cached responses would be incorrect
- Applications requiring non-deterministic response variety: caching by definition reduces variety
Prerequisites
- AI API Gateway as host for the caching layer (PLT002)
- Redis or compatible vector-enabled cache infrastructure
- Embedding model (local or API) for semantic similarity computation
- Defined TTL policies per use case and corpus type
- Privacy assessment completed for cache content retention
Industry Applicability
| Industry | Applicability | Key Workload |
| --- | --- | --- |
| Retail / E-commerce | Very High | Product description generation, FAQ, search augmentation |
| Financial Services | High | FAQ, policy explanation, templated document processing |
| Technology / SaaS | High | Developer documentation, code explanation, support chatbot |
| Healthcare | Medium | Clinical FAQ (carefully scoped); administrative queries |
| Media | High | Content classification, tagging, summarisation at scale |
| Government | Medium | Policy FAQ, document summarisation |
4. Architecture Overview
The LLM Caching Layer is positioned as middleware within the AI API Gateway's request pipeline, executing after policy enforcement and before model routing. This positioning is deliberate: policy enforcement runs before caching to ensure that cache lookups do not bypass security controls, and caching runs before routing to prevent unnecessary routing computation for cache hits.
Exact-Match Caching targets the simplest case: when the complete prompt, after normalisation, is byte-for-byte identical to a previous request. A SHA-256 hash of the canonical prompt (normalised for whitespace, with dynamic variables stripped) is used as the cache key. On a cache hit, the stored response is returned immediately with an X-Cache: HIT header. This mechanism is particularly effective for system-prompt-heavy applications where the system prompt constitutes >90% of the input tokens and changes rarely.
The cache key design for exact-match is subtle. The full prompt including the system prompt must be hashed, not just the user message. The model name and version must be included in the cache key, since the same prompt produces different outputs from different models. The response format parameters (temperature, top_p, max_tokens, output format) must also be included, since they affect the response. A cache key that ignores any of these dimensions risks serving a response produced under different conditions.
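The key dimensions listed above can be sketched as follows. This is a minimal illustration, not the pattern's reference implementation: the function names, the JSON serialisation, and the `{namespace}:{digest}` layout are assumptions, and stripping of dynamic variables is assumed to happen upstream.

```python
import hashlib
import json
import re


def normalise_prompt(text: str) -> str:
    """Collapse runs of whitespace and trim the ends (hypothetical normaliser)."""
    return re.sub(r"\s+", " ", text).strip()


def exact_match_key(namespace: str, system_prompt: str, user_message: str,
                    model: str, params: dict) -> str:
    """Build a cache key covering every dimension that affects the response."""
    payload = {
        "system": normalise_prompt(system_prompt),  # full prompt, not just the user turn
        "user": normalise_prompt(user_message),
        "model": model,                             # model name and version
        "params": params,                           # temperature, top_p, max_tokens, format
    }
    # sort_keys makes the serialisation independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"
```

Two requests that differ only in whitespace or parameter ordering map to the same key, while a change of model or temperature yields a different key.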
Semantic Caching handles the more complex case: prompts that are paraphrases of each other and should produce equivalent responses. The user-facing portion of the prompt (not the system prompt, which is typically stable) is embedded using a local embedding model into a dense vector. This vector is queried against a vector store (Redis with RediSearch, pgvector, or a dedicated vector database) using approximate nearest-neighbour search with cosine similarity. If the nearest stored prompt vector has a similarity score above the configured threshold, the associated stored response is returned.
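The lookup described above can be sketched with a brute-force scan standing in for the approximate nearest-neighbour index. The vectors are assumed to come from an embedding model that is not shown, and `SemanticEntry` and `semantic_lookup` are hypothetical names used only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class SemanticEntry:
    vector: list[float]
    response: str


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(query_vector, entries, threshold):
    """Return (response, score) for the nearest entry if its score is above
    the threshold, otherwise (None, best_score)."""
    best_entry, best_score = None, -1.0
    for entry in entries:  # a real vector store would use an ANN index here
        score = cosine_similarity(query_vector, entry.vector)
        if score > best_score:
            best_entry, best_score = entry, score
    if best_entry is not None and best_score > threshold:
        return best_entry.response, best_score
    return None, best_score
```

A production deployment would delegate the search to the vector store; the threshold comparison is the only part that must stay in the caching layer.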
The similarity threshold is the most operationally sensitive parameter in semantic caching. Too high (>0.99): essentially exact-match behaviour, low hit rate. Too low (<0.90): risk of serving an incorrect response to a superficially similar but semantically different prompt. The appropriate threshold depends on the use case: FAQ answering (high tolerance for paraphrase, threshold 0.92–0.95), code generation (low tolerance, threshold 0.98+), product description (medium tolerance, threshold 0.94–0.97). Thresholds should be validated against a test set of similar-but-different prompts to measure false positive rate.
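One way to run the validation described above is to score a labelled set of prompt pairs and measure how each candidate threshold behaves. The sketch below assumes the similarity scores have already been computed; the function name and the shape of the labelled data are illustrative, not part of the pattern.

```python
def evaluate_threshold(scored_pairs, threshold):
    """scored_pairs: iterable of (similarity, should_match) tuples, where
    should_match is True for genuine paraphrases and False for
    similar-but-different prompts."""
    pairs = list(scored_pairs)
    hits = [(score, label) for score, label in pairs if score > threshold]
    false_positives = sum(1 for _, label in hits if not label)
    true_positives = len(hits) - false_positives
    total_negatives = sum(1 for _, label in pairs if not label)
    total_positives = len(pairs) - total_negatives
    return {
        # Share of similar-but-different pairs that would wrongly hit the cache.
        "false_positive_rate": false_positives / total_negatives if total_negatives else 0.0,
        # Share of genuine paraphrases that would be served from cache.
        "paraphrase_hit_rate": true_positives / total_positives if total_positives else 0.0,
    }
```

Sweeping the threshold over a range such as 0.90 to 0.99 and comparing the two rates gives a concrete basis for choosing a per-use-case value.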
Cache Invalidation is the hardest problem in caching, and LLM response caches are no exception. Three invalidation triggers are relevant: time-based TTL (responses expire after a configured period; appropriate for most use cases), corpus-update-triggered invalidation (when the knowledge base underlying the AI responses changes, all related cache entries must be invalidated; this requires tagging cache entries with corpus version metadata), and explicit invalidation (an API endpoint allowing platform operators or application owners to flush specific entries or entire namespaces). Corpus-update invalidation is the most complex; it requires the cache to maintain an index of which knowledge corpus version each entry was computed against.
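The three triggers can be illustrated with a small in-memory cache. This is a sketch under simplifying assumptions: a single knowledge corpus per cache, lazy invalidation at read time, and hypothetical class and method names. A Redis-backed version would use key expiry and tag indexes instead.

```python
import time
from dataclasses import dataclass


@dataclass
class Entry:
    response: str
    expires_at: float
    corpus_version: str


class InvalidatingCache:
    def __init__(self, corpus_version, clock=time.monotonic):
        self.corpus_version = corpus_version
        self._clock = clock
        self._entries = {}

    def put(self, key, response, ttl_seconds):
        # Tag each entry with the corpus version it was computed against.
        self._entries[key] = Entry(response, self._clock() + ttl_seconds,
                                   self.corpus_version)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expired = self._clock() >= entry.expires_at          # time-based TTL
        stale = entry.corpus_version != self.corpus_version  # corpus update
        if expired or stale:
            del self._entries[key]
            return None
        return entry.response

    def on_corpus_update(self, new_version):
        # Older entries become unreadable and are dropped lazily on next get.
        self.corpus_version = new_version

    def flush_namespace(self, namespace):
        # Explicit invalidation: remove every key under "<namespace>:".
        prefix = f"{namespace}:"
        doomed = [k for k in self._entries if k.startswith(prefix)]
        for k in doomed:
            del self._entries[k]
        return len(doomed)
```

Lazy invalidation keeps the corpus-update path cheap, at the cost of stale entries occupying memory until they are next read or their TTL expires.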
Privacy and Tenancy Controls are non-negotiable. Cache entries must be scoped to the consumer namespace by default: a cached response for Team A's prompt must not be returned for Team B's identical prompt unless explicitly configured as a shared cache. For responses that may contain PII (personalised summaries, user-specific answers), caching must be disabled at the per-request level via an X-Cache-Control: no-store header or use-case configuration. All cache entries must respect their TTL strictly; there must be no mechanism to enumerate cached prompt content.
Cache Observability exposes the operational value of caching. Cache hit rate, cache size, eviction rate, average response latency (hit vs. miss), and attributable token savings are published as metrics. These metrics drive the FinOps dashboard's cache savings attribution and inform threshold tuning decisions.
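As a rough illustration of how hit rate and attributable token savings might be accumulated before export, the sketch below uses a hypothetical `CacheStats` counter. Wiring it to Prometheus or CloudWatch is left out.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    tokens_saved: int = 0

    def record_hit(self, input_tokens: int, output_tokens: int) -> None:
        # A hit avoids both the prompt and completion tokens of an inference.
        self.hits += 1
        self.tokens_saved += input_tokens + output_tokens

    def record_miss(self) -> None:
        self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```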
5. Architecture Diagram
flowchart TD
subgraph Request["Request Pipeline"]
A[Incoming Request]
B[Policy + Auth]
C[Prompt Normaliser]
end
subgraph Cache["Caching Layer"]
D{Exact Hash Match?}
E[Embedding Model]
F{Semantic Match?}
end
subgraph Backends["Cache and Model"]
G[(Exact-Match Store)]
H[(Vector Store)]
I[Model Router + Proxy]
end
A --> B
B --> C
C --> D
D -->|hit| J[Return Cached Response]
D -->|miss| E
E --> F
F -->|above threshold| J
F -->|miss| I
I --> G
I --> H
I --> J
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#f0fdf4,stroke:#22c55e
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#f3e8ff,stroke:#a855f7
style E fill:#f0fdf4,stroke:#22c55e
style F fill:#f3e8ff,stroke:#a855f7
style G fill:#fef9c3,stroke:#eab308
style H fill:#fef9c3,stroke:#eab308
style I fill:#dbeafe,stroke:#3b82f6
style J fill:#d1fae5,stroke:#10b981
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Prompt Normaliser | Service | Strip dynamic variables; canonicalise whitespace for stable hashing | Custom pre-processing function | High |
| Exact-Match Hash Generator | Service | Compute deterministic cache key from normalised prompt + model + params | SHA-256 (standard library) | High |
| Exact-Match Cache Store | Service | Store and retrieve response by exact hash key | Redis (string key → JSON value), Memcached | High |
| Embedding Model | Service | Convert prompt to dense vector for semantic similarity | text-embedding-3-small (OpenAI), local sentence-transformers, BGE-M3 | High |
| Embedding Cache | Service | Cache computed embeddings to avoid re-embedding same prompt | Redis hash keyed by prompt SHA-256 | Medium |
| Vector Store | Service | Store and query prompt vectors for ANN search | Redis RediSearch, pgvector, Qdrant, Weaviate, Milvus | High |
| Similarity Threshold Evaluator | Service | Compare ANN search score to use-case-specific threshold | Custom middleware, GPTCache policy | High |
| Cache Write Handler | Service | Asynchronously write model response + vector to cache after inference | Async task (asyncio, Celery) | High |
| Cache Invalidation API | Service | Flush by namespace, corpus version, or individual key | Custom REST API | Medium |
| Cache Metrics Exporter | Service | Publish hit rate, latency, token savings metrics | Prometheus exporter, CloudWatch custom metrics | Medium |
7. Data Flow
Primary Flow — Semantic Cache Hit
| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Gateway | Receive request; pass policy enforcement | Authenticated, authorised request |
| 2 | Prompt Normaliser | Remove dynamic variables (user name, date); strip excess whitespace | Canonical prompt |
| 3 | Hash Generator | SHA-256 of canonical prompt + model name and version + response parameters (temperature, top_p, max_tokens, output format) | Cache key: sha256(prompt + model + params) |
| 4 | Exact-Match Lookup | Query Redis with key; not found | Cache miss → proceed to semantic |
| 5 | Embedding Model | Compute 1536-dim vector for user message portion | Embedding vector |
| 6 | Vector Store Query | ANN search with cosine similarity; retrieve top-1 result | Score: 0.963; stored response for similar prompt |
| 7 | Threshold Check | 0.963 > 0.95 threshold for use-case faq | Semantic cache hit |
| 8 | Response Return | Return stored response with headers: X-Cache: HIT-SEMANTIC, X-Cache-Score: 0.963 | Consumer receives response; zero model API call |
| 9 | Metrics | Emit cache hit event; attribute token savings = (input_tokens + output_tokens) of original entry | FinOps savings attributed |
Error Flow
| Error | Detection | Response |
| --- | --- | --- |
| Redis unavailable | Health check; connection timeout | Bypass cache; proceed to model inference; log warning |
| Vector store query timeout | Query latency >50ms | Bypass semantic cache; attempt exact-match only; log warning |
| Embedding model failure | Embedding call error | Bypass semantic cache; exact-match only or full bypass |
| Cache write failure | Async write error | Log error; response still returned to consumer; cache miss on next identical request |
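The common thread in the error flow above is that the cache fails open: a cache fault degrades to a miss rather than failing the request. A minimal sketch follows, with the cache and model passed in as plain callables. All names are hypothetical, and a real client would catch its library's specific exception types rather than `Exception`.

```python
import logging

logger = logging.getLogger("llm_cache")


def lookup_fail_open(cache_get, key):
    """Treat any cache read error as a miss so the request can proceed."""
    try:
        return cache_get(key)
    except Exception as exc:  # assumption: narrow this in real code
        logger.warning("cache read failed, bypassing cache: %s", exc)
        return None


def answer(key, cache_get, cache_put, call_model):
    cached = lookup_fail_open(cache_get, key)
    if cached is not None:
        return cached, "HIT"
    response = call_model()
    try:
        cache_put(key, response)
    except Exception as exc:
        # The consumer still receives the response; the next identical
        # request will simply miss again.
        logger.error("cache write failed: %s", exc)
    return response, "MISS"
```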
8. Security Considerations
Tenant Isolation
- Cache namespace prefix is always scoped to consumer team namespace; {team}:{prompt_hash} prevents cross-tenant cache hits by default
- Shared cache pools (opt-in) require explicit platform team configuration and are only appropriate for non-personalised, non-confidential content (e.g., public FAQ)
- Redis ACLs enforce per-namespace read/write isolation; no consumer can access another tenant's cache entries
Privacy Controls
- Requests tagged with Cache-Control: no-store (from use-case configuration or explicit consumer header) bypass both exact and semantic cache
- Cache entries containing PII-derived responses must have reduced TTL (max 24 hours) and must be excluded from any cross-tenant sharing
- Cached responses are stored as ciphertext (AES-256) in Redis when classified above INTERNAL sensitivity level
Data Retention
- Cache TTLs are the mechanism for data retention compliance; they must be aligned with the privacy policy for the prompt content type
- Regular TTL audit to ensure cache retention does not exceed applicable data retention requirements
OWASP LLM Controls
| OWASP LLM Risk | Cache Layer Control |
| --- | --- |
| LLM06 Sensitive Information Disclosure | Tenant isolation prevents cross-tenant response leakage; PII cache disable prevents PII retention |
| LLM09 Overreliance | Cache hit responses must carry the same X-AI-Generated header as live responses; staleness indicator on old cache entries |
9. Governance Considerations
Cache Policy Governance
- TTL policies per use case are defined by the Data Governance team and enforced by the platform; product teams cannot override TTLs beyond their configured maximum
- Semantic similarity thresholds per use case are reviewed by the prompt owner on a quarterly basis or after any quality incident attributable to cache false positives
Governance Artefacts
| Artefact | Owner | Cadence | Location |
| --- | --- | --- | --- |
| Cache TTL policy | Data Governance Team | Annual review | Platform configuration |
| Similarity threshold settings | Prompt Owner + Platform Team | Quarterly | Platform configuration |
| Cache privacy impact assessment | Privacy Team | Per new use case | Privacy register |
| Cache false positive incident log | Platform Team | Per incident | Incident management system |
10. Operational Considerations
Monitoring
| Signal | Source | Alert Threshold | Owner |
| --- | --- | --- | --- |
| Overall cache hit rate | Cache metrics | <10% sustained 30 min (may indicate threshold misconfiguration or genuinely uncacheable workload) | Platform Team |
| Exact-match hit rate | Cache metrics | Informational; track trend | Platform Team |
| Semantic false positive reports | User feedback + quality monitoring | Any confirmed false positive | Platform Team + Prompt Owner |
| Redis memory utilisation | Redis metrics | >80% of configured max memory | Platform On-Call |
| Cache write failure rate | Async write error metrics | >1% | Platform Team |
SLOs
| SLO | Target | Window |
| --- | --- | --- |
| Cache lookup latency (exact-match) | <5ms | Rolling 7 days |
| Cache lookup latency (semantic, including embedding) | <50ms | Rolling 7 days |
| Cache hit rate (FAQ/summarisation use cases) | >20% | Rolling 7 days |
| Redis availability | 99.9% | Rolling 30 days |
Disaster Recovery
| Component | RPO | RTO | Strategy |
| --- | --- | --- | --- |
| Redis exact-match cache | 1 hour | 5 min | Redis Sentinel; cache is soft state; rebuild on miss |
| Vector store | 1 hour | 15 min | Backup + restore; cache warms naturally over time |
| Embedding cache | 1 hour | 5 min | Redis Sentinel; recomputable |
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Weight |
| --- | --- | --- |
| Redis memory | Proportional to cached response volume and TTLs | Medium |
| Vector store hosting | Varies by vector count and query rate | Medium |
| Embedding model API (if external) | Per embedding API call; offset by cache savings | Low if local model used |
| Cache infrastructure savings | Negative cost: reduces model API spend | Dominant positive ROI |
Optimisations
- Use a local embedding model (sentence-transformers, BGE) rather than an API embedding model to eliminate per-embedding costs and latency
- Configure Redis LRU eviction policy to automatically evict least-recently-used entries when memory pressure is high, maintaining cache quality without manual management
- Set TTLs aggressively short for use cases with dynamic context; longer for static knowledge-base use cases
Indicative Cost Range
| Scale | Cache Infra Monthly Cost | Monthly Token Savings |
| --- | --- | --- |
| Small (1M tokens/day, 20% hit rate) | $150–$400 | $1,000–$3,000 |
| Medium (10M tokens/day, 25% hit rate) | $500–$1,500 | $8,000–$20,000 |
| Large (100M tokens/day, 30% hit rate) | $2,000–$6,000 | $60,000–$150,000 |
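To estimate savings for a specific workload, the underlying arithmetic is tokens per day × days × hit rate × price per token. The sketch below treats the blended price per million tokens as an input you must supply from your own provider contract. The value used in the usage note is a placeholder, not the basis of the table's figures.

```python
def monthly_token_savings(tokens_per_day: float, hit_rate: float,
                          price_per_million_tokens: float, days: int = 30) -> float:
    """Estimated monthly spend avoided by serving hits from cache."""
    tokens_saved = tokens_per_day * days * hit_rate
    return tokens_saved / 1_000_000 * price_per_million_tokens


def net_monthly_benefit(savings: float, cache_infra_cost: float) -> float:
    """Savings minus what the cache infrastructure itself costs."""
    return savings - cache_infra_cost
```

For example, with a hypothetical blended price of 10.0 per million tokens, `monthly_token_savings(10_000_000, 0.25, 10.0)` saves 75 million tokens per month, worth 750.0 in the same currency unit.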
12. Trade-Off Analysis
Caching Strategy Options
| Strategy | Description | Pros | Cons | Best For |
| --- | --- | --- | --- | --- |
| Exact-Match Only | Cache on full prompt SHA-256 | Zero false positive risk; simple; deterministic | Low hit rate on natural language; paraphrases miss | Templated/structured prompts; deterministic workflows |
| Semantic Cache Only | Cache on embedding similarity | High hit rate on paraphrased queries | False positive risk; embedding latency adds overhead | Natural language FAQ, search, support |
| Two-Layer (Exact + Semantic) | Exact-match checked first; semantic as fallback | Maximum hit rate; zero FP risk for exact layer | Complexity; two infrastructure components | High-volume mixed workloads; recommended default |
Similarity Threshold Options
| Threshold Level | False Positive Rate | Hit Rate | Best For |
| --- | --- | --- | --- |
| Conservative (0.97–0.99) | Very Low | Low | Legal/clinical content where wrong answer has serious consequences |
| Balanced (0.93–0.96) | Low | Moderate | FAQ, general enterprise Q&A |
| Aggressive (0.88–0.92) | Medium | High | Simple classification, sentiment; tolerance for minor variation acceptable |
Architectural Tensions
| Tension | Option A | Option B | Resolution |
| --- | --- | --- | --- |
| Cache hit rate vs. response accuracy | Low threshold for high hit rate | High threshold for accuracy | Per-use-case thresholds configured and governed |
| Cache freshness vs. cost efficiency | Low TTL (always fresh) | High TTL (cost efficient) | TTL by corpus type; static corpus: 7 days; dynamic: 1 hour |
| Cross-tenant cache sharing vs. data isolation | Shared pool for efficiency | Per-tenant isolation | Per-tenant by default; opt-in sharing for non-sensitive public content |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Semantic false positive (wrong response served) | Low | High: incorrect response delivered to user | User feedback; quality monitoring | Remove offending cache entry; raise threshold for affected use case |
| Redis OOM (out of memory) | Medium | Medium: LRU eviction increases miss rate | Redis memory metrics | Increase Redis memory; tune TTLs; scale Redis cluster |
| Embedding model cold start latency | Medium | Low: first semantic lookup slow after deployment | Latency spike post-deployment | Pre-warm embedding model with representative prompts |
| Cache stampede on cold start | Low | Medium: burst of misses hitting model simultaneously | Response time spike after cache flush | Probabilistic early expiration (PER); staggered cache rebuild |
| Stale cache served after knowledge base update | Medium | High: outdated information served | User feedback; corpus version mismatch | Corpus-version-tagged invalidation; mandatory invalidation trigger on KB update |
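Probabilistic early expiration, listed above as the stampede mitigation, can be sketched with one common formulation (often called XFetch): each reader independently decides to refresh slightly before the TTL, with a probability that rises as expiry approaches. The function name and parameters below are illustrative assumptions. `recompute_seconds` is the typical time a model call takes, and `beta` above 1.0 refreshes more eagerly.

```python
import math
import random


def should_refresh_early(now: float, expires_at: float, recompute_seconds: float,
                         beta: float = 1.0, rng=random.random) -> bool:
    """Return True if this request should recompute the entry before it expires."""
    # rng() may return 0.0; clamp it so that log() stays defined.
    r = max(rng(), 1e-12)
    # -log(r) is non-negative, so the entry is treated as if it expired
    # a random interval earlier than its real TTL.
    early_by = -recompute_seconds * beta * math.log(r)
    return now + early_by >= expires_at
```

Because the random offset differs per request, refreshes are spread out instead of every consumer missing at the same instant.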
Cascading Scenario
- Vector store failure + Redis failure simultaneously: Both caching layers unavailable; all requests fall through to model inference. At scale this can immediately exceed provider rate limits and trigger widespread 429 errors. Mitigation: circuit breaker on cache bypass path; pre-configured rate limit reduction to stay within provider rate limits during cache outage.
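The rate limit reduction during a cache outage could take the form of a token bucket that only gates traffic while the cache is unhealthy. This is a sketch under assumptions: the class and the `admit` helper are hypothetical, and the rate would be set below the provider's published limit.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_second: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self._tokens = capacity
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def admit(cache_healthy: bool, outage_limiter: TokenBucket) -> bool:
    """Pass everything while the cache is up; throttle the bypass path when it is down."""
    return True if cache_healthy else outage_limiter.allow()
```

Rejected requests would be queued or shed by the gateway before they reach the provider and trigger 429 responses there.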
14. Regulatory Considerations
Privacy Act / GDPR
- Cached responses derived from personal information require the same legal basis as the original processing; caching must not extend the effective retention period beyond what the privacy policy allows
- TTL configuration is a data retention control; it must be managed by the Data Governance team, not left to engineering discretion
- Data subject deletion requests may require cache entries derived from that subject's data to be invalidated; the corpus-version invalidation mechanism supports this if prompts can be correlated to data subjects
ISO 27001
- Cache stores are information assets requiring appropriate access controls, encryption at rest, and audit logging in line with ISO 27001 Annex A.8
15. Reference Implementations
AWS
| Component | AWS Service |
| --- | --- |
| Exact-match cache | ElastiCache Redis |
| Vector store | OpenSearch Service (k-NN index) or pgvector on RDS |
| Embedding model | Local container on ECS/Lambda, or Titan Embeddings via Bedrock |
| Metrics | CloudWatch custom metrics |
Azure
| Component | Azure Service |
| --- | --- |
| Exact-match cache | Azure Cache for Redis |
| Vector store | Azure AI Search (vector search) |
| Embedding model | Azure OpenAI Embeddings or local container |
On-Premises
| Component | Technology |
| --- | --- |
| Exact + semantic cache | GPTCache (open source): combines both layers |
| Vector store | Qdrant or Weaviate self-hosted |
| Embedding model | sentence-transformers/all-MiniLM-L6-v2 (local, no API cost) |
16. Related Patterns
| Pattern ID | Name | Relationship |
| --- | --- | --- |
| EAAPL-PLT002 | AI API Gateway | Host: caching layer is a stage in the gateway pipeline |
| EAAPL-PLT004 | LLM Cost Control | Complementary: caching is a primary cost reduction mechanism |
| EAAPL-PLT001 | Enterprise AI Platform | Parent: caching is a platform shared service |
| EAAPL-PLT003 | Model Routing | Complementary: cache hit bypasses routing entirely |
17. Maturity Assessment
Overall Maturity: Proven
LLM caching is production-proven at scale. Exact-match caching is commodity. Semantic caching via GPTCache and Redis Vector Search is operationally mature. Corpus-version invalidation is the least-standardised component.
Scoring Matrix
| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Pattern Completeness | 5 | All sections documented |
| Implementation Evidence | 4 | Exact-match: 5; semantic: 4; invalidation: 3 |
| Tooling Maturity | 4 | GPTCache, Redis RediSearch stable; pgvector rapidly maturing |
| Privacy / Compliance Rigor | 4 | TTL as retention control documented; automated data-subject deletion emerging |
| Cost ROI | 5 | Consistent 20–40% token savings documented |
18. Revision History
| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2024-06-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2025-06-12 | EAAPL Working Group | Corpus-version invalidation documented; privacy controls expanded; cascading failure scenario added |