[EAAPL-RAG001] Enterprise Retrieval-Augmented Generation
Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Foundational RAG Architecture
Version: 2.1
Maturity: Mature
Tags: rag retrieval embeddings vector-search llm grounding citation enterprise
Regulatory Relevance: APRA CPS 234, EU AI Act Article 13 (Transparency), ISO/IEC 42001, NIST AI RMF (Govern 1.1, Map 1.1)
1. Executive Summary
Retrieval-Augmented Generation (RAG) is the foundational architecture pattern that grounds Large Language Model (LLM) responses in verifiable enterprise knowledge. Rather than relying solely on parametric knowledge baked into model weights, RAG dynamically retrieves relevant documents at inference time, assembles them into a context window, and instructs the LLM to generate answers grounded exclusively in retrieved evidence.
For enterprise CIOs and CTOs, RAG directly addresses three business-critical concerns: accuracy (answers are anchored to current, authoritative sources rather than potentially stale training data), auditability (every claim can be traced to a source document for compliance and regulatory purposes), and control (the knowledge base is governed by the enterprise, not by a third-party model provider). RAG enables AI-powered enterprise search, internal knowledge assistants, customer service automation, and regulatory document Q&A without the cost and risk of fine-tuning proprietary models. When implemented correctly, RAG reduces hallucination rates by 60–80% compared to prompt-only LLM usage, and provides the citation infrastructure required to satisfy model explainability mandates in APRA, EU AI Act, and ISO 42001 frameworks.
2. Problem Statement
Business Problem
Enterprise knowledge is locked in unstructured repositories: SharePoint libraries, Confluence wikis, PDF policy archives, email threads, and ERP exports. Employees spend an average of 20% of their working week searching for information (McKinsey Global Institute). LLMs offer natural-language access to this knowledge, but they generate plausible-sounding yet factually incorrect answers (hallucinations) at a rate that is unacceptable for regulated industries.
Technical Problem
Standard LLM prompting cannot access documents outside the model's training data, cannot cite sources for claims, cannot reflect updates made after the model's training cutoff, and cannot respect per-user access controls on confidential documents. Context windows are finite; naively injecting entire document corpora is computationally prohibitive and degrades generation quality.
Symptoms of the Absence of this Pattern
- Help-desk chatbots that confidently cite policy sections that do not exist or have been superseded
- Internal search returning keyword-matched results with no synthesis or relevance ranking
- Compliance teams unable to audit the provenance of AI-generated regulatory summaries
- Knowledge workers spending >30 minutes constructing answers from multiple source documents
- Model answers that vary unpredictably across repeated identical queries
Cost of Inaction
- Regulatory exposure: ungrounded AI outputs used in decision-making violate EU AI Act Article 13 and APRA CPS 234 requirements for explainability
- Operational cost: manual document synthesis at scale costs $150–$400 per knowledge-worker hour
- Risk of reputational damage from hallucinated answers in customer-facing applications
- Inability to retire legacy knowledge portal investments without a viable AI-powered replacement
3. Context
When to Apply
- Enterprise Q&A systems over internal policy, procedure, or product documentation
- Customer service automation requiring grounded, citable answers
- Regulatory and compliance document interrogation
- Code generation assistants that reference internal SDK and API documentation
- Research synthesis across large document corpora (legal discovery, clinical guidelines, engineering standards)
- Any LLM use case where answer provenance and auditability are required
When NOT to Apply
- Tasks requiring real-time external data not yet ingested (use Streaming RAG, EAAPL-RAG006)
- Multi-hop reasoning across structured relational data (use Graph RAG, EAAPL-RAG009, or SQL-generation patterns)
- Use cases where the knowledge corpus is smaller than the context window (direct context injection is simpler)
- Creative generation tasks where factual grounding is not required
- Highly latency-sensitive applications (<100ms P99) where vector search overhead is unacceptable
Prerequisites
- A defined and governed knowledge corpus (documents, wikis, structured exports)
- An embedding model appropriate to the corpus language and domain
- A vector database provisioned and accessible from the inference runtime
- An LLM with sufficient context window to accommodate retrieved passages plus the user query
- A document ingestion pipeline with scheduling and delta-update capability
- Logging infrastructure capable of recording retrieval decisions and LLM inputs/outputs
Industry Applicability
| Industry | Primary Use Case | Criticality | Regulatory Consideration |
| --- | --- | --- | --- |
| Financial Services | Policy Q&A, compliance manuals, product disclosure | Mission-critical | APRA CPS234, MiFID II, Basel III documentation |
| Healthcare | Clinical guideline retrieval, formulary assistance | Mission-critical | TGA, AHPRA, HIPAA, clinical liability |
| Government | Legislation interpretation, service eligibility | High | FOI, Privacy Act 1988, APS values |
| Legal | Case law research, contract clause retrieval | High | Legal professional privilege, confidentiality |
| Retail/FMCG | Product knowledge bases, supplier documentation | Medium | ACCC consumer guarantees, product liability |
| Technology | Internal developer documentation, runbook Q&A | Medium | SOC2, ISO 27001 |
| Higher Education | Academic policy, research corpus search | Medium | Copyright Act, FERPA equivalents |
4. Architecture Overview
Enterprise RAG decomposes into two distinct temporal phases: an offline ingestion pipeline and an online retrieval-generation pipeline. Understanding the separation of these phases is critical to operating the system correctly at enterprise scale.
Offline Ingestion Pipeline
The ingestion pipeline transforms raw enterprise documents into a searchable vector index. This phase runs continuously or on a schedule and must be treated as a production data pipeline with monitoring, alerting, and schema versioning.
Document acquisition draws from multiple source connectors (SharePoint, Confluence, S3, SFTP, database exports). Each connector must capture not only document content but also metadata: document ID, version, owner, classification level, effective date, and expiry date. Metadata is as important as content for enterprise use cases — it drives filtering, citation generation, and access control enforcement.
Chunking is among the most consequential architectural decisions in any RAG system. The goal is to produce semantically coherent text units that are large enough to contain useful context but small enough to remain topically focused. Three strategies apply at enterprise scale: fixed-size chunking (split by token count, typically 256–512 tokens, with 10–20% overlap) is operationally simple and predictable; semantic chunking (split at natural paragraph or section boundaries) preserves document structure and is preferred for narrative documents such as policy manuals; hierarchical chunking (maintain parent-child relationships between summary chunks and detail chunks) enables retrieval at multiple granularities and is optimal for long technical documents. For regulated environments, hierarchical chunking with section-level metadata (clause number, effective date) is recommended because it enables citation at the regulatory clause level.
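The fixed-size strategy described above can be sketched in a few lines. This is a minimal illustration, not a production chunker: the function name `chunk_tokens` is hypothetical, and whitespace-separated words stand in for real tokenizer output, which is an assumption made to keep the example self-contained.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 512,
                 overlap_ratio: float = 0.1) -> List[List[str]]:
    """Split a token list into fixed-size windows that overlap by a fraction."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Assumption: plain words stand in for tokens from a real tokenizer.
words = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(words, chunk_size=512, overlap_ratio=0.1)
```

With 1,000 tokens, a 512-token window, and 10% overlap, the window advances 461 tokens at a time, so consecutive chunks share 51 tokens. Semantic and hierarchical chunking would replace the fixed `step` with boundaries taken from the document structure.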
Embedding converts each chunk into a dense vector representation using an embedding model. Model selection has long-term consequences: changing the embedding model requires re-embedding the entire corpus. For English-language enterprise corpora, text-embedding-3-large (OpenAI), textembedding-gecko (Google), or bge-large-en-v1.5 (BAAI, self-hostable) are strong choices. For multilingual corpora, multilingual-e5-large or bge-m3 are preferred. The embedding model must be evaluated on a domain-representative benchmark before production selection.
Vector storage persists embeddings alongside the full chunk text and metadata in a vector database. The vector index (typically HNSW — Hierarchical Navigable Small World) enables approximate nearest-neighbour search in milliseconds across tens of millions of vectors. Index construction parameters (ef_construction, M) directly affect recall/latency trade-offs and must be tuned per corpus.
Online Retrieval-Generation Pipeline
At inference time, the user query traverses a multi-stage pipeline before the LLM generates a response.
Query processing applies transformations that materially improve retrieval quality: query expansion (generating alternative phrasings of the question), HyDE (Hypothetical Document Embedding — generating a hypothetical answer and embedding it to find similar real documents), and query decomposition (splitting compound questions into atomic sub-queries). These transformations add 50–150ms latency but improve top-5 recall by 15–30% in empirical benchmarks.
Retrieval executes the vector similarity search against the index, returning the top-K chunks (K typically 5–20) ranked by cosine similarity. Pre-retrieval metadata filtering (by document class, department, effective date) reduces the search space and enforces access control at the vector layer.
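The retrieval step can be illustrated with an exact, brute-force search. This is a sketch under stated assumptions: a production system would use the vector database's approximate index rather than scanning every entry, and the record layout (`chunk_id`, `vector`, `metadata["doc_class"]`) and the function name `filtered_top_k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_top_k(query_vec: Sequence[float], index: List[Dict],
                   allowed_classes: Set[str], k: int = 5) -> List[Tuple[str, float]]:
    # The metadata filter runs BEFORE scoring, so chunks outside the
    # allowlist never become candidates.
    candidates = [e for e in index if e["metadata"]["doc_class"] in allowed_classes]
    scored = [(e["chunk_id"], cosine(query_vec, e["vector"])) for e in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional index (illustrative data only).
index = [
    {"chunk_id": "hr-001", "vector": [1.0, 0.0], "metadata": {"doc_class": "hr"}},
    {"chunk_id": "fin-001", "vector": [0.9, 0.1], "metadata": {"doc_class": "finance"}},
    {"chunk_id": "hr-002", "vector": [0.0, 1.0], "metadata": {"doc_class": "hr"}},
]
results = filtered_top_k([1.0, 0.0], index, allowed_classes={"hr"}, k=2)
```

Here the finance chunk is excluded even though it is a close match to the query, because the caller's allowlist contains only `hr`.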
Re-ranking applies a cross-encoder model to re-score the top-K retrieved chunks against the original query with higher precision than the bi-encoder embedding model. Cross-encoders (e.g., cross-encoder/ms-marco-MiniLM-L-12-v2) do not scale to full-index search but are highly effective on the top-K set.
Context assembly constructs the final prompt by ordering retrieved chunks (relevance-first or document-structure-first depending on the task), injecting system instructions, and appending the user query. The context budget (the maximum number of retrieved tokens that fit in the LLM's context window alongside the system prompt, the user query, and a reserve for the response) must be monitored and enforced.
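Budget enforcement during context assembly might look like the following sketch. It assumes relevance-first ordering and uses a whitespace word count as a rough proxy for tokens; a real deployment would count tokens with the LLM's own tokenizer. The function names and the `[chunk_id]` marker format are hypothetical.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    # Assumption: word count approximates token count for this sketch.
    return len(text.split())


def assemble_context(system_prompt: str, query: str,
                     ranked_chunks: List[Tuple[str, str]],
                     context_window: int, reserve: int) -> Tuple[str, List[str]]:
    """Add chunks in rank order until the token budget is exhausted."""
    budget = (context_window - reserve
              - count_tokens(system_prompt) - count_tokens(query))
    parts: List[str] = []
    used_ids: List[str] = []
    for chunk_id, text in ranked_chunks:
        cost = count_tokens(text) + 1  # +1 for the citation marker
        if cost > budget:
            break  # stop at the first chunk that does not fit
        parts.append(f"[{chunk_id}] {text}")
        used_ids.append(chunk_id)
        budget -= cost
    prompt = "\n\n".join([system_prompt, *parts, f"Question: {query}"])
    return prompt, used_ids


prompt, used = assemble_context(
    system_prompt="Answer only from the context.",
    query="What is the leave policy?",
    ranked_chunks=[
        ("c1", "Annual leave is twenty days per year"),
        ("c2", "Carer leave is ten days"),
    ],
    context_window=30,
    reserve=10,
)
```

In this toy run only `c1` fits within the budget. Stopping at the first chunk that does not fit preserves the re-ranker's ordering; skipping ahead to smaller chunks is an alternative design that trades ordering for fuller use of the window.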
Generation invokes the LLM with the assembled context. The system prompt must explicitly instruct the model to answer only from the provided context and to include citations. Post-generation, citations are extracted and validated against the retrieved chunk set to detect hallucinated references.
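Post-generation citation validation can be sketched as follows. The in-line marker format `[chunk-id]` is an assumption for illustration; the pattern does not prescribe one, and the function name `validate_citations` is hypothetical.

```python
import re
from typing import List, Set, Tuple

# Assumption: the LLM is instructed to cite sources as [chunk-id].
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def validate_citations(response: str,
                       retrieved_ids: Set[str]) -> Tuple[str, List[str], List[str]]:
    """Return (cleaned response, valid citations, hallucinated citations)."""
    cited = list(dict.fromkeys(CITATION_PATTERN.findall(response)))  # de-duplicate
    valid = [c for c in cited if c in retrieved_ids]
    invalid = [c for c in cited if c not in retrieved_ids]
    cleaned = response
    for cid in invalid:
        cleaned = cleaned.replace(f"[{cid}]", "")  # strip hallucinated marker
    cleaned = re.sub(r"\s+([.,;])", r"\1", cleaned)  # tidy space before punctuation
    cleaned = re.sub(r" {2,}", " ", cleaned).strip()
    return cleaned, valid, invalid


cleaned, valid, invalid = validate_citations(
    "Leave is 20 days [hr-001]. Unused leave expires [hr-999].",
    retrieved_ids={"hr-001", "hr-002"},
)
```

The `invalid` list is what would feed a hallucination counter and review queue. Note that stripping a marker leaves the unsupported sentence in place, so a stricter variant might remove or flag the whole sentence.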
The full pipeline must be instrumented end-to-end. Every query, the retrieved chunk IDs, the assembled context hash, the LLM response, and latency at each stage must be logged to enable quality monitoring, debugging, and audit trail maintenance.
5. Architecture Diagram
flowchart TD
subgraph Ingestion["Offline Ingestion"]
A[Source Connectors]
B[Chunk + Embed]
C[Vector Store]
end
subgraph Retrieval["Online Retrieval"]
D[User Query]
E[Query Processor]
F[Vector Search + Rerank]
end
subgraph Generation["Generation + Observability"]
G[LLM + Context]
H[Citation Validator]
I[Quality Monitor]
end
A --> B --> C
D --> E -->|filtered ANN search| C
C --> F --> G --> H --> D
G --> I
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#f0fdf4,stroke:#22c55e
style C fill:#fef9c3,stroke:#eab308
style D fill:#dbeafe,stroke:#3b82f6
style E fill:#f0fdf4,stroke:#22c55e
style F fill:#f0fdf4,stroke:#22c55e
style G fill:#fef9c3,stroke:#eab308
style H fill:#d1fae5,stroke:#10b981
style I fill:#fef9c3,stroke:#eab308
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Source Connectors | Integration | Pull documents from enterprise repositories on schedule or event trigger | Microsoft Graph API, Confluence REST API, S3 Event Notifications, custom JDBC connectors | High |
| Metadata Extractor | Data Processing | Parse and normalise document metadata; assign classification labels | Apache Tika, AWS Textract, Azure Document Intelligence, custom NLP pipeline | High |
| Chunking Engine | Data Processing | Segment documents into semantically coherent, appropriately-sized chunks | LangChain text splitters, LlamaIndex node parsers, custom Python chunkers | High |
| Embedding Model | ML Inference | Convert text chunks to dense vector representations | OpenAI text-embedding-3-large, Google textembedding-gecko, BAAI bge-large-en-v1.5, Cohere embed-v3 | Critical |
| Vector Database | Storage | Store and index embedding vectors; serve ANN queries | Pinecone, Weaviate, Qdrant, pgvector, OpenSearch k-NN, Chroma | Critical |
| Document Store | Storage | Persist full chunk text and metadata for context assembly | Amazon S3, Azure Blob Storage, Google Cloud Storage, PostgreSQL | High |
| Query Processor | Inference | Enrich and expand user queries before retrieval | LangChain, LlamaIndex, custom Python with LLM call | Medium |
| ACL Filter | Security | Enforce document-level access control before vector search | Custom middleware using identity provider claims; RBAC policy engine | Critical |
| Cross-Encoder Re-ranker | ML Inference | Re-rank top-K retrieved chunks with higher precision | Cohere Rerank, ms-marco cross-encoders, Voyage AI rerank | High |
| Context Assembler | Orchestration | Order chunks, enforce token budget, construct final prompt | LangChain, LlamaIndex, custom orchestration | High |
| LLM | ML Inference | Generate grounded natural-language response from context | OpenAI GPT-4o, Anthropic Claude 3.5, Google Gemini 1.5, Azure OpenAI, self-hosted Llama 3 | Critical |
| Citation Extractor | Post-processing | Extract and validate source references in generated output | Regex + structured output parsing, LLM-based extraction | High |
| Observability Layer | Operations | Log all pipeline stages; monitor quality and latency metrics | Datadog, Grafana + Prometheus, Langfuse, Arize AI | High |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Source Connector | Poll or receive webhook from document repository; fetch new/modified documents | Raw document bytes + source metadata |
| 2 | Metadata Extractor | Parse document format; extract title, author, classification, dates, section structure | Structured metadata record per document |
| 3 | Chunking Engine | Apply chunking strategy; assign chunk ID, parent document ID, position index | Ordered list of text chunks with metadata |
| 4 | Embedding Model | Generate dense vector for each chunk | (chunk_id, vector[1536], chunk_text, metadata) tuple |
| 5 | Vector Database | Upsert vector with metadata payload; rebuild/update HNSW index | Persisted vector index entry; confirmation receipt |
| 6 | Document Store | Persist full chunk text and metadata | Durable record accessible by chunk_id |
| 7 | User / Application | Submit natural-language query via API | Query string + user identity context |
| 8 | Query Processor | Expand query; optionally generate HyDE document | Enhanced query representation(s) |
| 9 | ACL Filter | Resolve user's permitted document classes from identity provider | Allowlist of namespace/metadata filters for vector search |
| 10 | Vector Database | Execute ANN search with metadata filters; return top-K (k=20) chunks | Ranked list of (chunk_id, score, metadata) |
| 11 | Document Store | Fetch full chunk text for top-K chunk IDs | Chunk text + source metadata for each candidate |
| 12 | Cross-Encoder Re-ranker | Score each chunk against original query; re-order by cross-encoder score | Re-ranked top-N (N=5) chunks |
| 13 | Context Assembler | Order chunks; prepend system prompt; enforce token budget (≤context_window − reserve) | Assembled prompt string |
| 14 | LLM | Generate response conditioned on assembled context | Raw response text with in-line citation markers |
| 15 | Citation Extractor | Parse citation markers; validate each against retrieved chunk IDs | Structured response: answer + verified citations |
| 16 | Observability Layer | Log query, chunk IDs, context hash, response, latency per stage | Audit log record; metrics increment |
| 17 | User / Application | Receive grounded answer with clickable source citations | End-user response |
Error Flow
| Error Condition | Detection Point | Recovery Action |
| --- | --- | --- |
| Embedding model unavailable | Step 4 (ingestion) or Step 8 (query) | Retry with exponential backoff; fall back to cached embeddings for known queries |
| Vector database query timeout | Step 10 | Retry up to 3 times; degrade to keyword search fallback; surface "reduced quality" indicator to user |
| Zero results returned after ACL filter | Step 10 | Return "No accessible documents found"; do NOT fall through to unfiltered search |
| LLM rate limit or timeout | Step 14 | Queue retry with jitter; return partial response with "generation pending" status |
| Citation validation failure (hallucinated source) | Step 15 | Strip hallucinated citation from response; increment hallucination counter; flag for review |
| Document ingestion failure | Step 2 | Dead-letter queue; alert pipeline operator; document remains on previous version in index |
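Several recovery actions above rely on bounded retries with exponential backoff and jitter. The following is a minimal sketch of that technique, not a prescribed implementation: the helper name `retry_with_backoff`, its default delays, and the injected `sleep` parameter (used here so the example runs without waiting) are assumptions.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], max_attempts: int = 3,
                       base_delay: float = 0.5, max_delay: float = 8.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry an operation, doubling the delay ceiling after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:  # a real system would catch only transient errors
            if attempt == max_attempts:
                raise  # retries exhausted; caller applies the fallback
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))  # "full jitter" spreads retries out
    raise AssertionError("unreachable")


# Simulated vector search that times out twice and then succeeds.
calls = {"n": 0}


def flaky_search() -> List[str]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("vector database query timeout")
    return ["chunk-1"]


delays: List[float] = []
result = retry_with_backoff(flaky_search, max_attempts=3, sleep=delays.append)
```

When the final attempt also fails, the exception propagates so the caller can apply the documented fallback, for example keyword search with a reduced-quality indicator.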
8. Security Considerations
Authentication and Authorisation
- All API endpoints require OAuth 2.0 / OIDC tokens from the enterprise identity provider (Entra ID, Okta, Ping)
- User identity claims are forwarded through the entire pipeline and recorded in audit logs
- Vector search is scoped by user-identity-derived metadata filters before execution — retrieval never returns documents the user cannot access
- Service-to-service calls between pipeline components use mTLS with short-lived certificates
Secrets Management
- Embedding model API keys stored in HashiCorp Vault or cloud-native secrets manager (AWS Secrets Manager, Azure Key Vault)
- LLM API keys rotated on a 90-day schedule; rotation must not require pipeline restart
- Database credentials never hardcoded; injected at runtime via environment variable from secrets manager
Data Classification
- Source document classification labels (OFFICIAL, SENSITIVE, PROTECTED, etc.) are preserved as metadata through the chunking and embedding pipeline
- Retrieved chunks inherit the classification of their parent document
- The assembled context window classification is the maximum of all included chunks
- LLM response is tagged with the classification of the highest-classified source included in context
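The "maximum of all included chunks" rule can be sketched as below. The ordering of labels is an assumption for illustration; the enterprise's own classification scheme defines the real ordering, and the function name is hypothetical.

```python
from typing import Iterable, List

# Assumption: labels ordered from least to most restrictive.
CLASSIFICATION_ORDER: List[str] = ["OFFICIAL", "SENSITIVE", "PROTECTED"]


def highest_classification(labels: Iterable[str]) -> str:
    """Return the most restrictive label; fail closed on unknown labels."""
    rank = {label: i for i, label in enumerate(CLASSIFICATION_ORDER)}
    seen = list(labels)
    if not seen:
        raise ValueError("no classification labels supplied")
    unknown = [label for label in seen if label not in rank]
    if unknown:
        raise ValueError(f"unknown classification labels: {unknown}")
    return max(seen, key=lambda label: rank[label])


context_label = highest_classification(["OFFICIAL", "PROTECTED", "SENSITIVE"])
```

Raising on unknown or missing labels, rather than defaulting to the lowest level, keeps the assembled context from being under-classified.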
Encryption
- Vectors and chunk text at rest: AES-256 encryption in vector database and document store
- Data in transit: TLS 1.3 minimum between all components
- Highly sensitive corpora: consider field-level encryption of metadata; evaluate format-preserving encryption for PII fields
Auditability
- Immutable audit log: every query, user ID, retrieved chunk IDs, context hash (SHA-256), LLM model version, and response hash
- Audit logs shipped to tamper-evident log store (WORM S3 bucket, Splunk, Azure Sentinel)
- Audit log retention: minimum 7 years for regulated industries
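An audit record carrying the fields listed above might be built as in this sketch. The field names and the function `build_audit_record` are hypothetical; only the use of SHA-256 for the context hash comes from the list above, and applying the same digest to the response hash is an assumption.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List


def sha256_hex(text: str) -> str:
    """Hex-encoded SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_audit_record(user_id: str, query: str, chunk_ids: List[str],
                       context: str, response: str,
                       model_version: str) -> Dict[str, object]:
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "query": query,
        "retrieved_chunk_ids": list(chunk_ids),
        "context_hash": sha256_hex(context),    # hash, not the full prompt
        "llm_model_version": model_version,
        "response_hash": sha256_hex(response),  # assumption: same digest
    }


record = build_audit_record(
    user_id="u-123",
    query="What is the leave policy?",
    chunk_ids=["hr-001", "hr-002"],
    context="assembled prompt text",
    response="Leave is 20 days [hr-001].",
    model_version="example-model-v1",  # placeholder value
)
log_line = json.dumps(record, sort_keys=True)  # one JSON object per line
```

Storing hashes rather than full text keeps the record compact while still allowing a later check that an archived prompt or response matches what was logged.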
OWASP LLM Top 10 Mitigations
| OWASP LLM Risk | Applicability | Mitigation in this Pattern |
| --- | --- | --- |
| LLM01: Prompt Injection | High | System prompt hardened; retrieved content treated as data, not instructions; input sanitisation before embedding |
| LLM02: Insecure Output Handling | High | Structured output parsing; citation validation; no execution of LLM-generated code in this pattern |
| LLM03: Training Data Poisoning | Medium | Not directly applicable post-training; mitigated by corpus quality gates (EAAPL-KNW006) |
| LLM04: Model Denial of Service | High | Rate limiting per user/tenant; query complexity limits; context window budget enforcement |
| LLM05: Supply Chain Vulnerabilities | Medium | Embedding and LLM model versions pinned; SBOM maintained; provider SLA reviewed |
| LLM06: Sensitive Information Disclosure | Critical | ACL pre-filter prevents retrieval; PII redaction post-retrieval; output scanning for classification leakage |
| LLM07: Insecure Plugin Design | Low | No plugin execution in foundational RAG; applicable in Agentic RAG (EAAPL-RAG007) |
| LLM08: Excessive Agency | Low | RAG is read-only; no write actions available to LLM in this pattern |
| LLM09: Overreliance | High | Confidence scores surfaced to users; citations presented for independent verification |
| LLM10: Model Theft | Medium | LLM accessed via API only; model weights not exposed; fine-tuned models stored in private registries |
9. Governance Considerations
Responsible AI
- RAG answers must always present source citations to enable human verification
- Confidence scoring should be implemented; low-confidence answers must be flagged
- Sensitive-topic classifiers (medical advice, legal advice, financial advice) should trigger "consult a professional" disclaimers
- Demographic bias monitoring on retrieval: ensure corpus is not systematically missing content relevant to specific user groups
Model Risk Management
- Embedding model versioning: the corpus must be re-embedded when the embedding model is upgraded; mixing embeddings from different models across document batches degrades retrieval quality because vectors produced by different models are not comparable
- LLM model versioning: changes to the generation model require regression testing against a held-out QA benchmark
- Hallucination rate KPI tracked as a model risk indicator; threshold breach triggers review gate
Human Approval Gates
- Corpus ingestion of Tier 1 (Critical) documents requires human review approval before the document is made retrievable
- Significant changes to system prompt (which governs LLM behaviour) require a change approval process
- Quarterly human review of a random sample (n ≥ 100) of query/response pairs for quality and safety
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
| --- | --- | --- | --- |
| Corpus Inventory | Knowledge Manager | Continuous (automated) | Track which documents are in the index, their versions, and owners |
| Embedding Model Card | ML Engineer | Per model version | Document model capabilities, limitations, evaluation benchmarks |
| RAG Quality Scorecard | AI Operations | Weekly | Track retrieval recall, precision, hallucination rate, answer faithfulness |
| Audit Log Export | Compliance | Monthly | Regulatory evidence of access controls and output traceability |
| Responsible AI Assessment | AI Governance Board | Quarterly | Bias, fairness, and explainability review |
| Data Lineage Record | Data Governance | Per ingestion run | Document-to-chunk-to-vector lineage for every item in the corpus |
10. Operational Considerations
Monitoring
| Metric | Type | Collection Method | Alert Threshold |
| --- | --- | --- | --- |
| Retrieval Latency P99 | Latency | OpenTelemetry trace | > 500ms |
| End-to-end Query Latency P99 | Latency | OpenTelemetry trace | > 3000ms |
| Embedding Model Availability | Availability | Synthetic probe every 60s | < 99.5% over 5 min |
| Vector DB Query Success Rate | Availability | API response monitoring | < 99.9% over 5 min |
| Hallucination Rate (weekly sample) | Quality | Manual review + LLM-as-judge | > 5% of sampled queries |
| Answer Faithfulness Score | Quality | Automated RAGAS evaluation | < 0.75 average |
| Index Freshness (hours since last update) | Freshness | Ingestion pipeline heartbeat | > 24 hours for Tier 1 docs |
| Context Budget Utilisation | Resource | Per-query logging | > 95% (approaching window limit) |
Service Level Objectives
| SLO | Target | Measurement Window |
| --- | --- | --- |
| Query Response Time P95 | ≤ 2 seconds | Rolling 7-day |
| Query Response Time P99 | ≤ 4 seconds | Rolling 7-day |
| Pipeline Availability | ≥ 99.9% | Monthly |
| Ingestion Pipeline SLA (document available within N hours of publish) | ≤ 4 hours (Tier 1), ≤ 24 hours (Tier 2) | Per document |
| Answer Faithfulness (RAGAS) | ≥ 0.80 | Weekly evaluation |
Logging
- Structured JSON logs for every pipeline stage
- Correlation ID propagated through entire query lifecycle
- PII fields in logs must be masked (hash user IDs, redact query text for high-classification corpora)
- Log retention: 90 days hot (searchable), 7 years cold (compliance archive)
Incident Response
| Incident Type | Detection | Severity | Response |
| --- | --- | --- | --- |
| Hallucinated citation in high-stakes answer | Citation validator alert / user report | P1 | Immediate rollback of affected system prompt; manual review of last 24h queries |
| Cross-tenant data leakage | ACL audit log anomaly | P0 | Immediate service suspension; security team activation; regulatory notification |
| Vector DB unavailability | Synthetic probe | P1 | Fail over to read replica; page on-call SRE; degrade to keyword search |
| Ingestion pipeline stall | Freshness SLO breach | P2 | Restart pipeline; alert knowledge manager; communicate staleness to users |
Disaster Recovery
| Component | RTO | RPO | DR Strategy |
| --- | --- | --- | --- |
| Vector Database | 1 hour | 1 hour | Cross-region replica; daily snapshot to object storage |
| Document Store | 30 minutes | 0 (versioned) | Multi-region S3 replication; versioning enabled |
| Ingestion Pipeline | 4 hours | N/A (re-runnable) | Infrastructure-as-code re-deploy; idempotent re-ingestion |
| LLM API | 15 minutes | N/A | Multi-provider fallback (primary + secondary LLM provider) |
11. Cost Considerations
Cost Drivers
| Cost Driver | Unit | Approximate Cost | Scaling Behaviour |
| --- | --- | --- | --- |
| Embedding model (batch ingestion) | Per million tokens | $0.02–$0.13 (OpenAI/Google) | Linear with corpus size; one-time then incremental |
| Embedding model (query time) | Per million tokens | $0.02–$0.13 | Linear with query volume |
| Vector database hosting | Per million vectors/month | $70–$200 (managed); $20–$80 (self-hosted) | Sub-linear with sharding |
| LLM generation | Per million tokens (input+output) | $2–$15 (GPT-4o class) | Linear with query volume × context length |
| Cross-encoder re-ranking | Per million tokens | $1–$3 (Cohere) | Linear with query volume × K |
| Object storage (document store) | Per TB/month | $20–$25 | Linear with corpus size |
| Compute (orchestration, ingestion workers) | Per vCPU-hour | $0.05–$0.15 | Bursty during ingestion; steady-state low |
Scaling Risks
- Context length growth: as users discover RAG and ask more complex queries, context tokens per query creep upward, driving LLM cost non-linearly
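- Re-embedding cost spike: a mandatory embedding model upgrade requires re-embedding the entire corpus; at $0.02–$0.13 per million tokens this costs roughly $2–$13 for a 100M-token corpus and $2,000–$13,000 for a 100B-token corpus, so at large scale it requires careful planning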
- Vector index rebuild: depending on the vector database, adding new metadata fields can require a full index rebuild, causing temporary retrieval degradation
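Embedding cost is linear in token count, so the size of a re-embedding spike can be estimated directly from the per-million-token rates in the Cost Drivers table ($0.02–$0.13). The sketch below is simple arithmetic under that assumption; the function name is hypothetical and the rates should be replaced with current provider pricing.

```python
def embedding_cost_usd(total_tokens: int, price_per_million_usd: float) -> float:
    """Cost of embedding a corpus at a flat per-million-token price."""
    return total_tokens / 1_000_000 * price_per_million_usd


# 100 million tokens at the low and high ends of the quoted range.
small_low = embedding_cost_usd(100_000_000, 0.02)
small_high = embedding_cost_usd(100_000_000, 0.13)

# 100 billion tokens at the same rates.
large_low = embedding_cost_usd(100_000_000_000, 0.02)
large_high = embedding_cost_usd(100_000_000_000, 0.13)
```

At these rates a 100M-token corpus costs about $2 to $13 to re-embed, while a 100B-token corpus costs about $2,000 to $13,000. API fees are only part of the spike: re-indexing time and dual-running old and new indexes during cut-over are not captured here.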
Cost Optimisations
- Use tiered embedding: cheap embedding model for initial retrieval, expensive model only for re-ranking candidates
- Implement semantic caching (cache responses for near-duplicate queries using embedding similarity)
- Batch embedding during off-peak hours to take advantage of batch API discounts (50% on OpenAI)
- Right-size the LLM: use a smaller/cheaper model for low-stakes queries, route to premium model only for complex or high-classification queries
- Compress stored vectors using scalar quantisation (INT8) to reduce storage by 4× with <2% recall degradation
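The 4× storage figure for INT8 scalar quantisation follows from replacing 4-byte float32 components with 1-byte integers. The sketch below shows one simple symmetric per-vector scheme; it is an assumption for illustration, since vector databases implement quantisation internally and may use different schemes. It does not measure recall impact.

```python
from array import array
from typing import List, Tuple


def quantize_int8(vector: List[float]) -> Tuple[List[int], float]:
    """Symmetric scalar quantisation: map floats to integers in [-127, 127]."""
    max_abs = max((abs(x) for x in vector), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


vector = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(vector)
restored = dequantize(codes, scale)

float32_bytes = len(array("f", vector).tobytes())  # 4 bytes per component
int8_bytes = len(array("b", codes).tobytes())      # 1 byte per component
```

Each restored component differs from the original by at most half a quantisation step (`scale / 2`), which is why similarity rankings usually change little. The scale factor must be stored alongside the codes, a small overhead ignored in the 4× figure.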
Indicative Cost Range
| Deployment Scale | Monthly Cost Range | Notes |
| --- | --- | --- |
| Small (< 1M vectors, < 10K queries/day) | $500 – $2,000 | Startup or departmental deployment |
| Medium (1M–10M vectors, 10K–100K queries/day) | $2,000 – $15,000 | Enterprise divisional deployment |
| Large (> 10M vectors, > 100K queries/day) | $15,000 – $80,000 | Enterprise-wide deployment; optimisation critical |
12. Trade-Off Analysis
Chunking Strategy Comparison
| Option | Recall Quality | Operational Complexity | Citation Granularity | Recommended For |
| --- | --- | --- | --- | --- |
| Fixed-size chunking (512 tokens, 10% overlap) | Moderate | Low | Low (mid-paragraph boundaries) | Initial deployments; homogeneous corpora |
| Semantic chunking (paragraph/section boundaries) | High | Medium | High (section-level) | Policy/procedure documents; structured reports |
| Hierarchical chunking (summary + detail chunks) | Very High | High | Very High (clause-level) | Regulated documents; long technical specifications |
| Sentence-level chunking | Low (context fragmentation) | Low | Very High | Not recommended for enterprise RAG |
Embedding Model Comparison
| Option | Quality (MTEB) | Cost | Hosting | Lock-in Risk |
| --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-large | Highest | $0.13/M tokens | Cloud API | High (OpenAI dependency) |
| Google textembedding-gecko-004 | High | $0.025/M tokens | Cloud API | High (GCP dependency) |
| BAAI bge-large-en-v1.5 | High | Compute cost only | Self-hosted | None |
| Cohere embed-v3 | High | $0.10/M tokens | Cloud API | Medium |
Architectural Tensions
| Tension | Option A | Option B | Recommended Resolution |
| --- | --- | --- | --- |
| Freshness vs. Ingestion Cost | Real-time ingestion (high cost) | Batch nightly ingestion (stale) | Risk-tiered: Tier 1 docs hourly, Tier 2 daily |
| Retrieval Depth (high K) vs. Latency | K=50 for high recall | K=5 for low latency | K=20 + cross-encoder re-rank to N=5 |
| Open-source self-hosting vs. Managed services | Lower ongoing cost, full control | Higher managed cost, faster time-to-value | Managed for initial deployment; migrate to self-hosted at >$5K/month savings threshold |
| Context richness vs. Context window cost | Large context (high accuracy) | Small context (low cost) | Adaptive context: scale K with query complexity score |
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Hallucinated citation (LLM invents source) | Medium | High | Citation validator comparing generated refs against retrieved chunk IDs | Strip invalid citation; log for model quality review |
| Index staleness (document updated but old version retrieved) | Medium | Medium | Freshness monitoring; version mismatch detection | Re-trigger ingestion for affected document; surface version warning to user |
| Embedding drift (new documents in a different semantic space) | Low | Medium | Retrieval quality metric degradation over time | Re-embed affected documents; monitor RAGAS faithfulness |
| ACL filter bypass (misconfiguration) | Low | Critical | Anomaly detection on retrieval patterns; classification label mismatches in outputs | Immediate service suspension; full ACL audit |
| Cross-encoder timeout causing degraded ranking | Medium | Low | P99 latency alert | Serve top-K from vector search without re-ranking; log degradation |
| Context window overflow (truncated context) | Medium | High | Token count monitoring per request | Reduce K; prioritise by re-rank score; alert when budget > 90% |
| LLM generates answer outside provided context | Medium | High | Faithfulness scoring via RAGAS or LLM-as-judge | Tighten system prompt; consider output classifier |
| Vector database corruption | Very Low | Critical | Data integrity checksums; retrieval anomaly detection | Restore from last snapshot; re-ingest since snapshot timestamp |
Cascading Failure Scenarios
- Embedding model API outage during peak query period: Query processor cannot embed queries → vector search cannot execute → entire RAG pipeline fails. Mitigation: implement query-embedding caching for recently seen queries; maintain a keyword search fallback with explicit quality degradation notice.
- ACL metadata missing from newly ingested documents: Documents ingest without access control metadata → ACL filter passes all requests → data leakage. Mitigation: mandatory ACL metadata validation before ingestion completion; reject documents lacking classification metadata.
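The second mitigation, rejecting documents that lack access control metadata, can be sketched as a validation gate at the end of ingestion. The required field names (`classification`, `allowed_groups`) and the document shape are hypothetical; the actual schema depends on the enterprise's ACL model.

```python
from typing import Dict, List, Tuple

# Assumption: these two fields are the minimum needed for ACL filtering.
REQUIRED_ACL_FIELDS = ("classification", "allowed_groups")


def validate_acl_metadata(metadata: Dict) -> List[str]:
    """Return a list of problems; an empty list means the metadata is usable."""
    return [f"missing or empty: {field}"
            for field in REQUIRED_ACL_FIELDS if not metadata.get(field)]


def partition_for_ingestion(docs: List[Dict]) -> Tuple[List[Dict], List[Tuple[str, List[str]]]]:
    """Split documents into accepted and rejected; never index a rejected one."""
    accepted: List[Dict] = []
    rejected: List[Tuple[str, List[str]]] = []
    for doc in docs:
        problems = validate_acl_metadata(doc.get("metadata", {}))
        if problems:
            rejected.append((doc["doc_id"], problems))  # route to dead-letter queue
        else:
            accepted.append(doc)
    return accepted, rejected


docs = [
    {"doc_id": "d1", "metadata": {"classification": "OFFICIAL", "allowed_groups": ["hr"]}},
    {"doc_id": "d2", "metadata": {"classification": "SENSITIVE", "allowed_groups": []}},
    {"doc_id": "d3", "metadata": {}},
]
accepted, rejected = partition_for_ingestion(docs)
```

An empty `allowed_groups` list is treated the same as a missing field, so a document can never reach the index in a state where the ACL filter has nothing to match against.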
14. Regulatory Considerations
| Regulation | Requirement | RAG Pattern Response |
|---|---|---|
| APRA CPS 230 (Operational Resilience) | Critical service continuity; third-party risk for cloud LLM | DR plan per component; LLM provider assessed as material service provider; multi-provider fallback |
| APRA CPS 234 (Information Security) | Information asset classification; access control | ACL-aware retrieval; classification labels preserved; data encrypted at rest and in transit |
| Privacy Act 1988 (Australia) | Collection limited to what is reasonably necessary (APP 3); destruction or de-identification of personal information no longer needed (APP 11.2) | PII detection before corpus ingestion; erasure procedure deletes chunk + vector + source document |
| EU AI Act Article 13 | Transparency: system output must be interpretable by deployers; users must know they are interacting with AI (disclosure duty set out in Article 50) | UI disclosure: "Answers generated by AI based on [source]"; citation of source documents |
| EU AI Act Article 14 | Human oversight for high-risk AI systems | Human review gate for high-stakes RAG answers (medical, legal, financial advice) |
| ISO/IEC 42001 (AI Management System) | Risk management; accountability; transparency | Corpus inventory; model card; quality scorecard; audit logs as required artefacts |
| NIST AI RMF (Govern 1.1, Map 1.1) | Document AI system context and intended use | System card documenting intended use, limitations, and risk mitigations |
| GDPR Article 22 | No solely automated decisions with legal or similarly significant effects on individuals | Human-in-the-loop for consequential decisions informed by RAG outputs |
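The erasure response in the Privacy Act row (delete chunk + vector + source document) implies a cascade across three stores. The sketch below shows one possible shape for that cascade using in-memory dictionaries; `Corpus`, its attributes, and the `doc_id#n` chunk ID scheme are hypothetical, and a real system would issue the equivalent deletes against its vector database and object store.

```python
from typing import Dict, List, Set, Tuple


class Corpus:
    """In-memory stand-in for the document store, chunk store, and vector index."""

    def __init__(self) -> None:
        self.documents: Dict[str, str] = {}
        self.chunks: Dict[str, str] = {}
        self.vectors: Dict[str, List[float]] = {}
        self.doc_chunks: Dict[str, Set[str]] = {}  # doc_id -> chunk IDs
        self.audit_log: List[Dict[str, object]] = []

    def add(self, doc_id: str, text: str, chunks: List[Tuple[str, List[float]]]) -> None:
        self.documents[doc_id] = text
        self.doc_chunks[doc_id] = set()
        for i, (chunk_text, vector) in enumerate(chunks):
            chunk_id = f"{doc_id}#{i}"
            self.chunks[chunk_id] = chunk_text
            self.vectors[chunk_id] = vector
            self.doc_chunks[doc_id].add(chunk_id)

    def erase_document(self, doc_id: str, reason: str) -> int:
        """Delete vectors, chunks, then the source document; return chunks removed."""
        chunk_ids = self.doc_chunks.pop(doc_id, set())
        for chunk_id in chunk_ids:
            self.vectors.pop(chunk_id, None)  # vectors first: stops retrieval early
            self.chunks.pop(chunk_id, None)
        self.documents.pop(doc_id, None)
        # Log identifiers only, never the erased content itself.
        self.audit_log.append(
            {"doc_id": doc_id, "reason": reason, "chunks_removed": len(chunk_ids)}
        )
        return len(chunk_ids)
```

Keeping a `doc_id` to chunk-ID mapping at ingestion time is what makes the erasure complete; without it, orphaned vectors can keep surfacing deleted content.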
15. Reference Implementations
AWS
- Source connectors: Amazon Kendra (managed) or custom Lambda + EventBridge
- Chunking & embedding: AWS Lambda (Python) + Amazon Bedrock Titan Embeddings v2
- Vector store: Amazon OpenSearch Service with k-NN plugin, or Amazon Aurora pgvector
- Document store: Amazon S3 with S3 Versioning
- LLM: Amazon Bedrock (Claude 3.5 Sonnet, Llama 3)
- Orchestration: AWS Step Functions + LangChain on Lambda
- Observability: Amazon CloudWatch + AWS X-Ray + Langfuse
Azure
- Source connectors: Azure Logic Apps + Microsoft Graph connector
- Chunking & embedding: Azure Functions + Azure OpenAI Service (text-embedding-3-large)
- Vector store: Azure AI Search (with vector search mode)
- Document store: Azure Blob Storage
- LLM: Azure OpenAI Service (GPT-4o)
- Orchestration: Azure AI Studio Prompt Flow
- Observability: Azure Monitor + Application Insights + Azure AI Content Safety
GCP
- Source connectors: Cloud Run jobs + Pub/Sub for event-driven ingestion
- Chunking & embedding: Cloud Run + Vertex AI Embeddings (textembedding-gecko)
- Vector store: Vertex AI Vector Search (formerly Matching Engine) or AlloyDB pgvector
- Document store: Google Cloud Storage
- LLM: Vertex AI (Gemini 1.5 Pro)
- Orchestration: Vertex AI Agent Builder or LangChain on Cloud Run
- Observability: Cloud Monitoring + Cloud Trace + Vertex AI Model Monitoring
On-Premises / Air-Gapped
- Source connectors: Custom Python connectors + Apache NiFi
- Chunking & embedding: GPU inference server (NVIDIA A10) + BAAI bge-large-en-v1.5
- Vector store: Weaviate or Qdrant self-hosted on Kubernetes
- Document store: MinIO (S3-compatible object storage)
- LLM: vLLM serving Llama 3.1 70B or Mistral Large on GPU cluster
- Orchestration: LangChain / LlamaIndex on Kubernetes
- Observability: Prometheus + Grafana + Langfuse self-hosted
16. Related Patterns
| Pattern ID | Pattern Name | Relationship |
|---|---|---|
| EAAPL-RAG002 | Multi-Source RAG | Extends RAG001 to heterogeneous source types; inherits all foundational components |
| EAAPL-RAG003 | Secure RAG | Extends RAG001 with enterprise ACL enforcement; recommended overlay for any regulated deployment |
| EAAPL-RAG004 | Federated RAG | Extends RAG001 to distributed knowledge bases; replaces centralised vector store |
| EAAPL-RAG005 | Hybrid RAG | Extends RAG001 retrieval layer with BM25 + RRF; drop-in upgrade to retrieval component |
| EAAPL-RAG006 | Streaming RAG | Extends RAG001 ingestion pipeline for real-time data sources |
| EAAPL-RAG007 | Agentic RAG | Wraps RAG001 in an AI agent loop for multi-hop retrieval |
| EAAPL-RAG008 | Multimodal RAG | Extends RAG001 embedding and retrieval for non-text modalities |
| EAAPL-RAG009 | Graph RAG | Replaces/augments vector retrieval with knowledge graph traversal |
| EAAPL-RAG010 | Contextual RAG with Metadata Filtering | Extends RAG001 with richer metadata schema and filter composition |
| EAAPL-KNW003 | AI Knowledge Corpus Management | Governs the document corpus that RAG001 indexes |
| EAAPL-KNW004 | Vector Database Management | Governs operational management of the vector store used by RAG001 |
| EAAPL-KNW006 | Corpus Quality Assurance | Provides quality gates for corpus ingested into RAG001 |
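For readers unfamiliar with the RRF step named in the EAAPL-RAG005 row, reciprocal rank fusion merges several ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below is a generic illustration and not taken from RAG005 itself; the function name is hypothetical and k = 60 is a commonly used default, assumed here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs (best first) into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["a", "b", "c"]    # keyword ranking
vector_hits = ["b", "c", "d"]  # dense ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # ['b', 'c', 'a', 'd']
```

Because RRF uses only rank positions, BM25 scores and cosine similarities never need to be normalised onto a common scale, which is what makes it a drop-in addition to the retrieval component.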
17. Maturity Assessment
Overall Maturity: Mature — Enterprise RAG is widely deployed across regulated industries; tooling is production-grade; best practices are documented; failure modes are well-understood.
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Technology Readiness | 5 | All components (vector DBs, embedding models, LLM APIs) are GA and production-proven |
| Tooling Ecosystem | 5 | LangChain, LlamaIndex, LlamaHub, Haystack, and cloud-native RAG services are mature |
| Operational Guidance | 4 | RAGAS evaluation framework, hallucination benchmarks, and SRE practices are established but evolving |
| Security & Compliance Guidance | 4 | ACL-aware retrieval and audit patterns are documented; regulatory mapping is still being formalised by standards bodies |
| Scalability Evidence | 4 | Production deployments at 100M+ vector scale are documented; optimisation at extreme scale requires expertise |
| Cost Predictability | 3 | LLM token costs are volatile; embedding model pricing changes frequently; cost modelling is an ongoing effort |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-01-15 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-04-20 | EAAPL Working Group | Added HyDE query expansion; updated OWASP LLM Top 10 to 2024 edition |
| 2.0 | 2024-09-10 | EAAPL Working Group | Major revision: hierarchical chunking strategy added; cross-encoder re-ranking formalised; regulatory section updated for EU AI Act final text |
| 2.1 | 2025-02-28 | EAAPL Working Group | Updated cost tables; added GCP Vertex AI reference implementation; expanded failure modes with cascading scenarios |