EAAPL-RAG001 · Proven · ↑ Trending

Enterprise Retrieval-Augmented Generation

Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Foundational RAG Architecture
Version: 2.1
Maturity: Mature
Tags: rag retrieval embeddings vector-search llm grounding citation enterprise
Regulatory Relevance: APRA CPS234, EU AI Act Article 13 (Transparency), ISO/IEC 42001, NIST AI RMF (Govern 1.1, Map 1.1)

1. Executive Summary

Retrieval-Augmented Generation (RAG) is the foundational architecture pattern that grounds Large Language Model (LLM) responses in verifiable enterprise knowledge. Rather than relying solely on parametric knowledge baked into model weights, RAG dynamically retrieves relevant documents at inference time, assembles them into a context window, and instructs the LLM to generate answers grounded exclusively in retrieved evidence.

For enterprise CIOs and CTOs, RAG directly addresses three business-critical concerns: accuracy (answers are anchored to current, authoritative sources rather than potentially stale training data), auditability (every claim can be traced to a source document for compliance and regulatory purposes), and control (the knowledge base is governed by the enterprise, not by a third-party model provider). RAG enables AI-powered enterprise search, internal knowledge assistants, customer service automation, and regulatory document Q&A without the cost and risk of fine-tuning proprietary models. When implemented correctly, RAG can substantially reduce hallucination rates compared to prompt-only LLM usage (this pattern's working estimate is 60–80%, which should be validated against a held-out benchmark for each corpus), and it provides the citation infrastructure needed to support the transparency and explainability expectations of APRA CPS 234, the EU AI Act, and ISO/IEC 42001.

2. Problem Statement

Business Problem

Enterprise knowledge is locked in unstructured repositories: SharePoint libraries, Confluence wikis, PDF policy archives, email threads, and ERP exports. Employees spend an average of 20% of their working week searching for information (McKinsey Global Institute). LLMs offer natural-language access to this knowledge, but they produce plausible-sounding yet factually incorrect answers (hallucinations) at a rate that is unacceptable for regulated industries.

Technical Problem

Standard LLM prompting cannot access documents outside the model's training window, cannot cite sources for claims, cannot reflect updates made after the model's training cutoff, and cannot respect per-user access controls on confidential documents. Context windows are finite; naively injecting entire document corpora is computationally prohibitive and degrades generation quality.

Symptoms of the Absence of this Pattern

Help-desk chatbots that confidently cite policy sections that do not exist or have been superseded
Internal search returning keyword-matched results with no synthesis or relevance ranking
Compliance teams unable to audit the provenance of AI-generated regulatory summaries
Knowledge workers spending >30 minutes constructing answers from multiple source documents
Model answers that vary unpredictably across repeated identical queries

Cost of Inaction

Regulatory exposure: ungrounded AI outputs used in decision-making are difficult to reconcile with the transparency requirements of EU AI Act Article 13 and the control and traceability expectations of APRA CPS 234
Operational cost: manual document synthesis at scale costs $150–$400 per knowledge-worker hour
Risk of reputational damage from hallucinated answers in customer-facing applications
Inability to retire legacy knowledge portal investments without a viable AI-powered replacement

3. Context

When to Apply

Enterprise Q&A systems over internal policy, procedure, or product documentation
Customer service automation requiring grounded, citable answers
Regulatory and compliance document interrogation
Code generation assistants that reference internal SDK and API documentation
Research synthesis across large document corpora (legal discovery, clinical guidelines, engineering standards)
Any LLM use case where answer provenance and auditability are required

When NOT to Apply

Tasks requiring real-time external data not yet ingested (use Streaming RAG, EAAPL-RAG006)
Multi-hop reasoning across structured relational data (use Graph RAG, EAAPL-RAG009, or SQL-generation patterns)
Use cases where the knowledge corpus is smaller than the context window (direct context injection is simpler)
Creative generation tasks where factual grounding is not required
Highly latency-sensitive applications (<100ms P99) where vector search overhead is unacceptable

Prerequisites

A defined and governed knowledge corpus (documents, wikis, structured exports)
An embedding model appropriate to the corpus language and domain
A vector database provisioned and accessible from the inference runtime
An LLM with sufficient context window to accommodate retrieved passages plus the user query
A document ingestion pipeline with scheduling and delta-update capability
Logging infrastructure capable of recording retrieval decisions and LLM inputs/outputs

Industry Applicability

Industry	Primary Use Case	Criticality	Regulatory Consideration
Financial Services	Policy Q&A, compliance manuals, product disclosure	Mission-critical	APRA CPS234, MiFID II, Basel III documentation
Healthcare	Clinical guideline retrieval, formulary assistance	Mission-critical	TGA, AHPRA, HIPAA, clinical liability
Government	Legislation interpretation, service eligibility	High	FOI, Privacy Act 1988, APS values
Legal	Case law research, contract clause retrieval	High	Legal professional privilege, confidentiality
Retail/FMCG	Product knowledge bases, supplier documentation	Medium	ACCC consumer guarantees, product liability
Technology	Internal developer documentation, runbook Q&A	Medium	SOC2, ISO 27001
Higher Education	Academic policy, research corpus search	Medium	Copyright Act, FERPA equivalents

4. Architecture Overview

Enterprise RAG decomposes into two distinct temporal phases: an offline ingestion pipeline and an online retrieval-generation pipeline. Understanding the separation of these phases is critical to operating the system correctly at enterprise scale.

Offline Ingestion Pipeline

The ingestion pipeline transforms raw enterprise documents into a searchable vector index. This phase runs continuously or on a schedule and must be treated as a production data pipeline with monitoring, alerting, and schema versioning.

Document acquisition draws from multiple source connectors (SharePoint, Confluence, S3, SFTP, database exports). Each connector must capture not only document content but also metadata: document ID, version, owner, classification level, effective date, and expiry date. Metadata is as important as content for enterprise use cases — it drives filtering, citation generation, and access control enforcement.

Chunking is among the most consequential architectural decisions in any RAG system. The goal is to produce semantically coherent text units that are large enough to contain useful context but small enough to remain topically focused. Three strategies apply at enterprise scale: fixed-size chunking (split by token count, typically 256–512 tokens, with 10–20% overlap) is operationally simple and predictable; semantic chunking (split at natural paragraph or section boundaries) preserves document structure and is preferred for narrative documents such as policy manuals; hierarchical chunking (maintain parent-child relationships between summary chunks and detail chunks) enables retrieval at multiple granularities and is optimal for long technical documents. For regulated environments, hierarchical chunking with section-level metadata (clause number, effective date) is recommended because it enables citation at the regulatory clause level.
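The fixed-size strategy described above can be sketched in a few lines. This is a minimal illustration, not a production chunker: the function name `chunk_fixed` is hypothetical, and the synthetic token list stands in for the output of whatever tokenizer matches the chosen embedding model (an assumption).

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (assumption).
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_fixed(tokens, size=512, overlap=64)  # 64/512 = 12.5% overlap
```

With 512-token windows and a 64-token overlap, each chunk repeats the tail of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk. Semantic and hierarchical chunking replace the fixed `step` with boundaries taken from the document structure.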

Embedding converts each chunk into a dense vector representation using an embedding model. Model selection has long-term consequences: changing the embedding model requires re-embedding the entire corpus. For English-language enterprise corpora, text-embedding-3-large (OpenAI), textembedding-gecko (Google), or bge-large-en-v1.5 (BAAI, self-hostable) are strong choices. For multilingual corpora, multilingual-e5-large or bge-m3 are preferred. The embedding model must be evaluated on a domain-representative benchmark before production selection.

Vector storage persists embeddings alongside the full chunk text and metadata in a vector database. The vector index (typically HNSW — Hierarchical Navigable Small World) enables approximate nearest-neighbour search in milliseconds across tens of millions of vectors. Index construction parameters (ef_construction, M) directly affect recall/latency trade-offs and must be tuned per corpus.

Online Retrieval-Generation Pipeline

At inference time, the user query traverses a multi-stage pipeline before the LLM generates a response.

Query processing applies transformations that can materially improve retrieval quality: query expansion (generating alternative phrasings of the question), HyDE (Hypothetical Document Embeddings: generating a hypothetical answer and embedding it to find similar real documents), and query decomposition (splitting compound questions into atomic sub-queries). As a working estimate, these transformations add 50–150ms of latency and improve top-5 recall by 15–30%; both figures depend on the corpus and on whether a transformation needs an extra LLM call, so they should be measured on a domain-representative benchmark.

Retrieval executes the vector similarity search against the index, returning the top-K chunks (K typically 5–20) ranked by cosine similarity. Pre-retrieval metadata filtering (by document class, department, effective date) reduces the search space and enforces access control at the vector layer.
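To make the order of operations concrete, the sketch below applies the metadata filter before scoring and then ranks the survivors by cosine similarity. It is a brute-force stand-in for a real ANN index, and the record layout (`chunk_id`, `vector`, `classification`) and the function names are assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, index, allowed_classes, k=5):
    """Filter by metadata first, then rank the remaining chunks by cosine similarity."""
    candidates = [e for e in index if e["classification"] in allowed_classes]
    scored = [(cosine(query_vec, e["vector"]), e["chunk_id"]) for e in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]


# Toy two-dimensional index; real embeddings have hundreds or thousands of dimensions.
index = [
    {"chunk_id": "pol-1#0", "vector": [1.0, 0.0], "classification": "OFFICIAL"},
    {"chunk_id": "pol-2#0", "vector": [0.9, 0.1], "classification": "PROTECTED"},
    {"chunk_id": "pol-3#0", "vector": [0.0, 1.0], "classification": "OFFICIAL"},
]
hits = filtered_search([1.0, 0.0], index, allowed_classes={"OFFICIAL"}, k=2)
```

The `PROTECTED` chunk is a close match for the query but never reaches the scoring step, which is the property that lets the filter double as an access control.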

Re-ranking applies a cross-encoder model to re-score the top-K retrieved chunks against the original query with higher precision than the bi-encoder embedding model. Cross-encoders (e.g., cross-encoder/ms-marco-MiniLM-L-12-v2) do not scale to full-index search but are highly effective on the top-K set.
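The two-stage shape of re-ranking can be shown without loading a model. In the sketch below, `rerank` accepts any scoring function; `toy_overlap_score` is a deliberately crude lexical stand-in for a cross-encoder and is an assumption made so that the example runs with the standard library only.

```python
from typing import Callable


def rerank(query: str, candidates: list[dict],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[dict]:
    """Re-score each candidate against the query and keep the best top_n."""
    scored = [dict(c, rerank_score=score_fn(query, c["text"])) for c in candidates]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, passage: str) -> float:
    """Jaccard word overlap; a placeholder for a real cross-encoder score."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p) if q | p else 0.0


candidates = [
    {"chunk_id": "a", "text": "Travel expenses require manager approval"},
    {"chunk_id": "b", "text": "Leave policy for annual leave requests"},
    {"chunk_id": "c", "text": "How to claim travel expenses after a trip"},
]
top = rerank("how do I claim travel expenses", candidates, toy_overlap_score, top_n=2)
```

In a deployment, `score_fn` would wrap the chosen cross-encoder or rerank API; the surrounding logic stays the same because the re-ranker only ever sees the top-K set.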

Context assembly constructs the final prompt by ordering retrieved chunks (relevance-first or document-structure-first depending on the task), injecting system instructions, and appending the user query. The context budget (the maximum number of retrieved tokens that fits in the LLM's context window once the system prompt, the user query, and a reserve for the response are accounted for) must be monitored and enforced.
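A sketch of budget enforcement during context assembly follows. The whitespace-based `count_tokens` is a rough stand-in for the LLM's own tokenizer, and the function names, prompt layout, and tiny window size are assumptions chosen to keep the example readable.

```python
def count_tokens(text: str) -> int:
    # Whitespace count approximates the real tokenizer (assumption).
    return len(text.split())


def assemble_context(system_prompt, query, chunks, context_window, reserve_for_answer):
    """Add chunks in ranked order until the retrieved-token budget is used up."""
    budget = (context_window - reserve_for_answer
              - count_tokens(system_prompt) - count_tokens(query))
    selected, used = [], 0
    for chunk in chunks:  # chunks are assumed to arrive in re-rank order
        cost = count_tokens(chunk["text"])
        if used + cost > budget:
            break  # stop rather than truncate a chunk mid-sentence
        selected.append(chunk)
        used += cost
    body = "\n\n".join(f"[{c['chunk_id']}] {c['text']}" for c in selected)
    prompt = f"{system_prompt}\n\nContext:\n{body}\n\nQuestion: {query}"
    utilisation = used / budget if budget > 0 else 1.0
    return prompt, [c["chunk_id"] for c in selected], utilisation


chunks = [
    {"chunk_id": "c1", "text": "Purchases above 5000 dollars need director approval first"},
    {"chunk_id": "c2", "text": "Purchases below that limit need manager approval"},
    {"chunk_id": "c3", "text": "Approval requests are lodged online"},
]
prompt, used_ids, utilisation = assemble_context(
    "Answer only from the context and cite chunk IDs.",
    "What is the approval limit?",
    chunks, context_window=40, reserve_for_answer=10,
)
```

The returned `utilisation` value is the figure to log per query so that budget pressure shows up in monitoring before truncation starts to affect answers.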

Generation invokes the LLM with the assembled context. The system prompt must explicitly instruct the model to answer only from the provided context and to include citations. Post-generation, citations are extracted and validated against the retrieved chunk set to detect hallucinated references.
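Post-generation citation validation can be sketched as a set-membership check. The bracketed `[chunk_id]` marker format and the function name are assumptions; the pattern itself does not prescribe a citation syntax.

```python
import re

# Matches markers such as [pol-7#2]; the marker syntax is an assumption.
CITATION = re.compile(r"\[([A-Za-z0-9_\-#.]+)\]")


def validate_citations(response: str, retrieved_ids: set[str]):
    """Split cited IDs into valid and invalid, and strip the invalid markers."""
    cited = CITATION.findall(response)
    valid = [c for c in cited if c in retrieved_ids]
    invalid = [c for c in cited if c not in retrieved_ids]
    cleaned = response
    for bad in invalid:
        cleaned = re.sub(r"\s*\[" + re.escape(bad) + r"\]", "", cleaned)
    return cleaned.strip(), valid, invalid


response = ("Claims are due within 30 days [pol-7#2]. "
            "Late claims are rejected [pol-99#1].")
cleaned, valid, invalid = validate_citations(response, {"pol-7#2", "pol-4#0"})
```

A non-empty `invalid` list is the signal to increment the hallucination counter and flag the response for review. Stripping the marker does not make the uncited sentence trustworthy, so high-stakes deployments may prefer to withhold the whole answer.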

The full pipeline must be instrumented end-to-end. Every query, the retrieved chunk IDs, the assembled context hash, the LLM response, and latency at each stage must be logged to enable quality monitoring, debugging, and audit trail maintenance.

5. Architecture Diagram

flowchart TD
    subgraph Ingestion["Offline Ingestion"]
        A[Source Connectors]
        B[Chunk + Embed]
        C[Vector Store]
    end
    subgraph Retrieval["Online Retrieval"]
        D[User Query]
        E[Query Processor]
        F[Vector Search + Rerank]
    end
    subgraph Generation["Generation + Observability"]
        G[LLM + Context]
        H[Citation Validator]
        I[Quality Monitor]
    end
    A --> B --> C
    D --> E -->|filtered ANN search| C
    C --> F --> G --> H --> D
    G --> I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#dbeafe,stroke:#3b82f6
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#fef9c3,stroke:#eab308

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Source Connectors	Integration	Pull documents from enterprise repositories on schedule or event trigger	Microsoft Graph API, Confluence REST API, S3 Event Notifications, custom JDBC connectors	High
Metadata Extractor	Data Processing	Parse and normalise document metadata; assign classification labels	Apache Tika, AWS Textract, Azure Document Intelligence, custom NLP pipeline	High
Chunking Engine	Data Processing	Segment documents into semantically coherent, appropriately-sized chunks	LangChain text splitters, LlamaIndex node parsers, custom Python chunkers	High
Embedding Model	ML Inference	Convert text chunks to dense vector representations	OpenAI text-embedding-3-large, Google textembedding-gecko, BAAI bge-large-en-v1.5, Cohere embed-v3	Critical
Vector Database	Storage	Store and index embedding vectors; serve ANN queries	Pinecone, Weaviate, Qdrant, pgvector, OpenSearch k-NN, Chroma	Critical
Document Store	Storage	Persist full chunk text and metadata for context assembly	Amazon S3, Azure Blob Storage, Google Cloud Storage, PostgreSQL	High
Query Processor	Inference	Enrich and expand user queries before retrieval	LangChain, LlamaIndex, custom Python with LLM call	Medium
ACL Filter	Security	Enforce document-level access control before vector search	Custom middleware using identity provider claims; RBAC policy engine	Critical
Cross-Encoder Re-ranker	ML Inference	Re-rank top-K retrieved chunks with higher precision	Cohere Rerank, ms-marco cross-encoders, Voyage AI rerank	High
Context Assembler	Orchestration	Order chunks, enforce token budget, construct final prompt	LangChain, LlamaIndex, custom orchestration	High
LLM	ML Inference	Generate grounded natural-language response from context	OpenAI GPT-4o, Anthropic Claude 3.5, Google Gemini 1.5, Azure OpenAI, self-hosted Llama 3	Critical
Citation Extractor	Post-processing	Extract and validate source references in generated output	Regex + structured output parsing, LLM-based extraction	High
Observability Layer	Operations	Log all pipeline stages; monitor quality and latency metrics	Datadog, Grafana + Prometheus, Langfuse, Arize AI	High

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Source Connector	Poll or receive webhook from document repository; fetch new/modified documents	Raw document bytes + source metadata
2	Metadata Extractor	Parse document format; extract title, author, classification, dates, section structure	Structured metadata record per document
3	Chunking Engine	Apply chunking strategy; assign chunk ID, parent document ID, position index	Ordered list of text chunks with metadata
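4	Embedding Model	Generate dense vector for each chunk	(chunk_id, vector[d], chunk_text, metadata) tuple, where d is the embedding model's output dimension (for example, 3072 by default for text-embedding-3-large and 1024 for bge-large-en-v1.5)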
5	Vector Database	Upsert vector with metadata payload; rebuild/update HNSW index	Persisted vector index entry; confirmation receipt
6	Document Store	Persist full chunk text and metadata	Durable record accessible by chunk_id
7	User / Application	Submit natural-language query via API	Query string + user identity context
8	Query Processor	Expand query; optionally generate HyDE document	Enhanced query representation(s)
9	ACL Filter	Resolve user's permitted document classes from identity provider	Allowlist of namespace/metadata filters for vector search
10	Vector Database	Execute ANN search with metadata filters; return top-K (k=20) chunks	Ranked list of (chunk_id, score, metadata)
11	Document Store	Fetch full chunk text for top-K chunk IDs	Chunk text + source metadata for each candidate
12	Cross-Encoder Re-ranker	Score each chunk against original query; re-order by cross-encoder score	Re-ranked top-N (N=5) chunks
13	Context Assembler	Order chunks; prepend system prompt; enforce token budget (≤context_window − reserve)	Assembled prompt string
14	LLM	Generate response conditioned on assembled context	Raw response text with in-line citation markers
15	Citation Extractor	Parse citation markers; validate each against retrieved chunk IDs	Structured response: answer + verified citations
16	Observability Layer	Log query, chunk IDs, context hash, response, latency per stage	Audit log record; metrics increment
17	User / Application	Receive grounded answer with clickable source citations	End-user response

Error Flow

Error Condition	Detection Point	Recovery Action
Embedding model unavailable	Step 4 (ingestion) or Step 8 (query)	Retry with exponential backoff; fall back to cached embeddings for known queries
Vector database query timeout	Step 10	Retry up to 3 times; degrade to keyword search fallback; surface "reduced quality" indicator to user
Zero results returned after ACL filter	Step 10	Return "No accessible documents found" — do NOT fall through to unfiltered search
LLM rate limit or timeout	Step 14	Queue retry with jitter; return partial response with "generation pending" status
Citation validation failure (hallucinated source)	Step 15	Strip hallucinated citation from response; increment hallucination counter; flag for review
Document ingestion failure	Step 2	Dead-letter queue; alert pipeline operator; document remains on previous version in index
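The retry-then-degrade behaviour in the error flow can be sketched as a small wrapper. The function names, the use of `TimeoutError`, and the delay values are assumptions; the `sleep` parameter is injectable only so the example runs instantly.

```python
import random
import time


def call_with_retry(primary, fallback, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Try `primary` up to max_attempts times, then degrade to `fallback`.

    Returns (result, degraded); degraded=True drives the "reduced quality" indicator.
    """
    for attempt in range(max_attempts):
        try:
            return primary(), False
        except TimeoutError:
            if attempt < max_attempts - 1:
                # Exponential backoff with jitter to avoid synchronised retries.
                sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    return fallback(), True


calls = {"n": 0}


def vector_search():
    calls["n"] += 1
    raise TimeoutError("vector database timed out")  # simulated outage


def keyword_search():
    return ["kw-hit-1"]  # the fallback must apply the same ACL filters


results, degraded = call_with_retry(vector_search, keyword_search, sleep=lambda s: None)
```

The fallback path is subject to the same rule as the zero-results case: it must never widen access, so any keyword index needs the same ACL filters as the vector search.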

8. Security Considerations

Authentication and Authorisation

All API endpoints require OAuth 2.0 / OIDC tokens from the enterprise identity provider (Entra ID, Okta, Ping)
User identity claims are forwarded through the entire pipeline and recorded in audit logs
Vector search is scoped by user-identity-derived metadata filters before execution — retrieval never returns documents the user cannot access
Service-to-service calls between pipeline components use mTLS with short-lived certificates

Secrets Management

Embedding model API keys stored in HashiCorp Vault or cloud-native secrets manager (AWS Secrets Manager, Azure Key Vault)
LLM API keys rotated on a 90-day schedule; rotation must not require pipeline restart
Database credentials never hardcoded; injected at runtime via environment variable from secrets manager

Data Classification

Source document classification labels (OFFICIAL, SENSITIVE, PROTECTED, etc.) are preserved as metadata through the chunking and embedding pipeline
Retrieved chunks inherit the highest classification of their parent document
The assembled context window classification is the maximum of all included chunks
LLM response is tagged with the classification of the highest-classified source included in context
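The "maximum of all included chunks" rule amounts to a max over an ordered label list. The ordering of `LEVELS` below is an assumption made for illustration; an organisation would substitute its own classification scheme and ranking.

```python
# Ascending sensitivity. The ordering and the exact label set are assumptions.
LEVELS = ["OFFICIAL", "SENSITIVE", "PROTECTED"]


def highest_classification(labels):
    """Return the most restrictive label; an unknown label raises ValueError."""
    return max(labels, key=LEVELS.index, default=LEVELS[0])


def classify_context(chunks):
    """Classification of an assembled context, which the LLM response inherits."""
    return highest_classification(c["classification"] for c in chunks)


context_chunks = [
    {"chunk_id": "hr-1#3", "classification": "OFFICIAL"},
    {"chunk_id": "fin-9#0", "classification": "PROTECTED"},
    {"chunk_id": "hr-2#1", "classification": "SENSITIVE"},
]
response_label = classify_context(context_chunks)
```

Raising on an unrecognised label is a deliberate fail-closed choice: a chunk whose classification cannot be ranked should block the response rather than be treated as the lowest level.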

Encryption

Vectors and chunk text at rest: AES-256 encryption in vector database and document store
Data in transit: TLS 1.3 minimum between all components
Highly sensitive corpora: consider field-level encryption of metadata; evaluate format-preserving encryption for PII fields

Auditability

Immutable audit log: every query, user ID, retrieved chunk IDs, context hash (SHA-256), LLM model version, and response hash
Audit logs shipped to tamper-evident log store (WORM S3 bucket, Splunk, Microsoft Sentinel)
Audit log retention: minimum 7 years for regulated industries
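One possible shape for the audit record listed above is sketched here. The field names and the `build_audit_record` helper are hypothetical; only the hashing call (`hashlib.sha256`) is standard library behaviour.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_audit_record(user_id, query, chunk_ids, context, model_version, response):
    """Assemble one audit entry; hashes let auditors verify content without storing it twice."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "query": query,
        "retrieved_chunk_ids": list(chunk_ids),
        "context_hash": sha256_hex(context),
        "llm_model_version": model_version,
        "response_hash": sha256_hex(response),
    }


record = build_audit_record(
    user_id="u-1042",
    query="What is the claims deadline?",
    chunk_ids=["pol-7#2", "pol-4#0"],
    context="[pol-7#2] Claims are due within 30 days.",
    model_version="example-model-2024-01",  # placeholder version string
    response="Claims are due within 30 days [pol-7#2].",
)
audit_line = json.dumps(record, sort_keys=True)  # one line per query in the log store
```

Storing the context hash rather than the full context keeps the log compact, but it only supports verification if the chunk text for the logged chunk IDs remains retrievable for the whole retention period.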

OWASP LLM Top 10 Mitigations

OWASP LLM Risk	Applicability	Mitigation in this Pattern
LLM01: Prompt Injection	High	System prompt hardened; retrieved content treated as data, not instructions; input sanitisation before embedding
LLM02: Insecure Output Handling	High	Structured output parsing; citation validation; no execution of LLM-generated code in this pattern
LLM03: Training Data Poisoning	Medium	Not directly applicable post-training; mitigated by corpus quality gates (EAAPL-KNW006)
LLM04: Model Denial of Service	High	Rate limiting per user/tenant; query complexity limits; context window budget enforcement
LLM05: Supply Chain Vulnerabilities	Medium	Embedding and LLM model versions pinned; SBOM maintained; provider SLA reviewed
LLM06: Sensitive Information Disclosure	Critical	ACL pre-filter prevents retrieval; PII redaction post-retrieval; output scanning for classification leakage
LLM07: Insecure Plugin Design	Low	No plugin execution in foundational RAG; applicable in Agentic RAG (EAAPL-RAG007)
LLM08: Excessive Agency	Low	RAG is read-only; no write actions available to LLM in this pattern
LLM09: Overreliance	High	Confidence scores surfaced to users; citations presented for independent verification
LLM10: Model Theft	Medium	LLM accessed via API only; model weights not exposed; fine-tuned models stored in private registries

9. Governance Considerations

Responsible AI

RAG answers must always present source citations to enable human verification
Confidence scoring should be implemented; low-confidence answers must be flagged
Sensitive-topic classifiers (medical advice, legal advice, financial advice) should trigger "consult a professional" disclaimers
Demographic bias monitoring on retrieval: ensure corpus is not systematically missing content relevant to specific user groups

Model Risk Management

Embedding model versioning: the corpus must be re-embedded when the embedding model is upgraded; running mixed embeddings (different models for different document batches) produces retrieval quality degradation
LLM model versioning: changes to the generation model require regression testing against a held-out QA benchmark
Hallucination rate KPI tracked as a model risk indicator; threshold breach triggers review gate

Human Approval Gates

Corpus ingestion of Tier 1 (Critical) documents requires human review approval before the document is made retrievable
Significant changes to system prompt (which governs LLM behaviour) require a change approval process
Quarterly human review of a random sample (n ≥ 100) of query/response pairs for quality and safety

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Corpus Inventory	Knowledge Manager	Continuous (automated)	Track which documents are in the index, their versions, and owners
Embedding Model Card	ML Engineer	Per model version	Document model capabilities, limitations, evaluation benchmarks
RAG Quality Scorecard	AI Operations	Weekly	Track retrieval recall, precision, hallucination rate, answer faithfulness
Audit Log Export	Compliance	Monthly	Regulatory evidence of access controls and output traceability
Responsible AI Assessment	AI Governance Board	Quarterly	Bias, fairness, and explainability review
Data Lineage Record	Data Governance	Per ingestion run	Document-to-chunk-to-vector lineage for every item in the corpus

10. Operational Considerations

Monitoring

Metric	Type	Collection Method	Alert Threshold
Retrieval Latency P99	Latency	OpenTelemetry trace	> 500ms
End-to-end Query Latency P99	Latency	OpenTelemetry trace	> 3000ms
Embedding Model Availability	Availability	Synthetic probe every 60s	< 99.5% over 5 min
Vector DB Query Success Rate	Availability	API response monitoring	< 99.9% over 5 min
Hallucination Rate (weekly sample)	Quality	Manual review + LLM-as-judge	> 5% of sampled queries
Answer Faithfulness Score	Quality	Automated RAGAS evaluation	< 0.75 average
Index Freshness (hours since last update)	Freshness	Ingestion pipeline heartbeat	> 24 hours for Tier 1 docs
Context Budget Utilisation	Resource	Per-query logging	> 95% (approaching window limit)

Service Level Objectives

SLO	Target	Measurement Window
Query Response Time P95	≤ 2 seconds	Rolling 7-day
Query Response Time P99	≤ 4 seconds	Rolling 7-day
Pipeline Availability	≥ 99.9%	Monthly
Ingestion Pipeline SLA (document available within N hours of publish)	≤ 4 hours (Tier 1), ≤ 24 hours (Tier 2)	Per document
Answer Faithfulness (RAGAS)	≥ 0.80	Weekly evaluation

Logging

Structured JSON logs for every pipeline stage
Correlation ID propagated through entire query lifecycle
PII fields in logs must be masked (hash user IDs, redact query text for high-classification corpora)
Log retention: 90 days hot (searchable), 7 years cold (compliance archive)
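The logging rules above (structured JSON, a propagated correlation ID, masked PII) could be combined as follows. The field names, the salt handling, and the `log_line` helper are assumptions; a real deployment would load the salt from the secrets manager rather than hardcode a default.

```python
import hashlib
import json
import uuid

REDACTED = "[REDACTED]"


def mask_user_id(user_id: str, salt: str) -> str:
    """One-way hash so entries can be grouped per user without logging the raw ID."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:16]


def log_line(correlation_id, stage, user_id, query, latency_ms,
             high_classification, salt="example-salt"):
    record = {
        "correlation_id": correlation_id,
        "stage": stage,
        "user": mask_user_id(user_id, salt),
        "query": REDACTED if high_classification else query,
        "latency_ms": latency_ms,
    }
    return json.dumps(record, sort_keys=True)


correlation_id = str(uuid.uuid4())  # generated once per query, reused by every stage
retrieval_line = log_line(correlation_id, "retrieval", "alice@example.com",
                          "What is the merger timeline?", 42, True)
generation_line = log_line(correlation_id, "generation", "alice@example.com",
                           "What is the merger timeline?", 910, True)
```

Because both lines carry the same `correlation_id`, a single query can be traced across stages even though the user ID is hashed and the query text is redacted.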

Incident Response

Incident Type	Detection	Severity	Response
Hallucinated citation in high-stakes answer	Citation validator alert / user report	P1	Immediate rollback of affected system prompt; manual review of last 24h queries
Cross-tenant data leakage	ACL audit log anomaly	P0	Immediate service suspension; security team activation; regulatory notification
Vector DB unavailability	Synthetic probe	P1	Fail over to read replica; page on-call SRE; degrade to keyword search
Ingestion pipeline stall	Freshness SLO breach	P2	Restart pipeline; alert knowledge manager; communicate staleness to users

Disaster Recovery

Component	RTO	RPO	DR Strategy
Vector Database	1 hour	1 hour	Cross-region replica; daily snapshot to object storage
Document Store	30 minutes	0 (versioned)	Multi-region S3 replication; versioning enabled
Ingestion Pipeline	4 hours	N/A (re-runnable)	Infrastructure-as-code re-deploy; idempotent re-ingestion
LLM API	15 minutes	N/A	Multi-provider fallback (primary + secondary LLM provider)

11. Cost Considerations

Cost Drivers

Cost Driver	Unit	Approximate Cost	Scaling Behaviour
Embedding model (batch ingestion)	Per million tokens	$0.02–$0.13 (OpenAI/Google)	Linear with corpus size; one-time then incremental
Embedding model (query time)	Per million tokens	$0.02–$0.13	Linear with query volume
Vector database hosting	Per million vectors/month	$70–$200 (managed); $20–$80 (self-hosted)	Sub-linear with sharding
LLM generation	Per million tokens (input+output)	$2–$15 (GPT-4o class)	Linear with query volume × context length
Cross-encoder re-ranking	Per million tokens	$1–$3 (Cohere)	Linear with query volume × K
Object storage (document store)	Per TB/month	$20–$25	Linear with corpus size
Compute (orchestration, ingestion workers)	Per vCPU-hour	$0.05–$0.15	Bursty during ingestion; steady-state low

Scaling Risks

Context length growth: as users discover RAG and ask more complex queries, context tokens per query creep upward, driving LLM cost non-linearly
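Re-embedding cost spike: at the API prices above ($0.02–$0.13 per million tokens), a mandatory embedding model upgrade costs roughly $2–$13 per 100M tokens, so a 100B-token corpus costs $2,000–$13,000 in embedding fees alone; re-indexing compute and cut-over also require careful planning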
Vector index rebuild: depending on the vector database, adding new filterable metadata fields can require a re-index or a full index rebuild, causing temporary retrieval degradation
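Re-embedding fees scale linearly with corpus size at the per-million-token prices in the Cost Drivers table. The sketch below works through that arithmetic for two corpus sizes; the function name is hypothetical, and the figures cover API fees only, not re-indexing compute.

```python
def embedding_cost_usd(corpus_tokens: int, price_per_million_usd: float) -> float:
    """API fee for embedding a corpus once at a flat per-million-token price."""
    return corpus_tokens / 1_000_000 * price_per_million_usd


PRICE_LOW, PRICE_HIGH = 0.02, 0.13  # USD per million tokens, from the Cost Drivers table

for corpus_tokens in (100_000_000, 100_000_000_000):
    low = embedding_cost_usd(corpus_tokens, PRICE_LOW)
    high = embedding_cost_usd(corpus_tokens, PRICE_HIGH)
    print(f"{corpus_tokens:,} tokens: ${low:,.2f} to ${high:,.2f}")
```

The same function can be applied to query-time embedding by substituting monthly query tokens for `corpus_tokens`.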

Cost Optimisations

Use tiered embedding: cheap embedding model for initial retrieval, expensive model only for re-ranking candidates
Implement semantic caching (cache responses for near-duplicate queries using embedding similarity)
Batch embedding during off-peak hours to take advantage of batch API discounts (50% on OpenAI)
Right-size the LLM: use a smaller/cheaper model for low-stakes queries, route to premium model only for complex or high-classification queries
Compress stored vectors using scalar quantisation (INT8) to reduce storage by 4× with <2% recall degradation
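The 4× storage figure for INT8 scalar quantisation follows from replacing each 4-byte float with a 1-byte code. The sketch below uses a simple per-vector min/max scheme, which is one of several possible quantisers and an assumption here; vector databases implement their own variants.

```python
from array import array


def quantise_int8(vector: list[float]):
    """Map each float onto one of 256 levels between the vector's min and max."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # guard against a constant vector
    codes = array("b", (round((x - lo) / scale) - 128 for x in vector))
    return codes, lo, scale


def dequantise(codes, lo: float, scale: float) -> list[float]:
    return [(c + 128) * scale + lo for c in codes]


vector = [0.12, -0.40, 0.33, 0.05]
codes, lo, scale = quantise_int8(vector)
restored = dequantise(codes, lo, scale)

float_bytes = array("f", vector).itemsize * len(vector)  # 32-bit floats
int8_bytes = codes.itemsize * len(codes)                 # 8-bit codes
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```

The per-vector `lo` and `scale` values add a small fixed overhead that the 4× figure ignores. The effect on recall depends on the corpus and the index, so it is worth measuring before enabling quantisation in production.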

Indicative Cost Range

Deployment Scale	Monthly Cost Range	Notes
Small (< 1M vectors, < 10K queries/day)	$500 – $2,000	Startup or departmental deployment
Medium (1M–10M vectors, 10K–100K queries/day)	$2,000 – $15,000	Enterprise divisional deployment
Large (> 10M vectors, > 100K queries/day)	$15,000 – $80,000	Enterprise-wide deployment; optimisation critical

12. Trade-Off Analysis

Chunking Strategy Comparison

Option	Recall Quality	Operational Complexity	Citation Granularity	Recommended For
Fixed-size chunking (512 tokens, 10% overlap)	Moderate	Low	Low (mid-paragraph boundaries)	Initial deployments; homogeneous corpora
Semantic chunking (paragraph/section boundaries)	High	Medium	High (section-level)	Policy/procedure documents; structured reports
Hierarchical chunking (summary + detail chunks)	Very High	High	Very High (clause-level)	Regulated documents; long technical specifications
Sentence-level chunking	Low (context fragmentation)	Low	Very High	Not recommended for enterprise RAG

Embedding Model Comparison

Option	Quality (MTEB)	Cost	Hosting	Lock-in Risk
OpenAI text-embedding-3-large	Highest	$0.13/M tokens	Cloud API	High (OpenAI dependency)
Google text-embedding-004	High	$0.025/M tokens	Cloud API	High (GCP dependency)
BAAI bge-large-en-v1.5	High	Compute cost only	Self-hosted	None
Cohere embed-v3	High	$0.10/M tokens	Cloud API	Medium

Architectural Tensions

Tension	Option A	Option B	Recommended Resolution
Freshness vs. Ingestion Cost	Real-time ingestion (high cost)	Batch nightly ingestion (stale)	Risk-tiered: Tier 1 docs hourly, Tier 2 daily
Retrieval Depth (high K) vs. Latency	K=50 for high recall	K=5 for low latency	K=20 + cross-encoder re-rank to N=5
Open-source self-hosting vs. Managed services	Lower ongoing cost, full control	Higher managed cost, faster time-to-value	Managed for initial deployment; migrate to self-hosted at >$5K/month savings threshold
Context richness vs. Context window cost	Large context (high accuracy)	Small context (low cost)	Adaptive context: scale K with query complexity score

13. Failure Modes

Failure Mode	Likelihood	Impact	Detection	Recovery
Hallucinated citation (LLM invents source)	Medium	High	Citation validator comparing generated refs against retrieved chunk IDs	Strip invalid citation; log for model quality review
Index staleness (document updated but old version retrieved)	Medium	Medium	Freshness monitoring; version mismatch detection	Re-trigger ingestion for affected document; surface version warning to user
Embedding drift (new documents in a different semantic space)	Low	Medium	Retrieval quality metric degradation over time	Re-embed affected documents; monitor RAGAS faithfulness
ACL filter bypass (misconfiguration)	Low	Critical	Anomaly detection on retrieval patterns; classification label mismatches in outputs	Immediate service suspension; full ACL audit
Cross-encoder timeout causing degraded ranking	Medium	Low	P99 latency alert	Serve top-K from vector search without re-ranking; log degradation
Context window overflow (truncated context)	Medium	High	Token count monitoring per request	Reduce K; prioritise by re-rank score; alert when budget > 90%
LLM generates answer outside provided context	Medium	High	Faithfulness scoring via RAGAS or LLM-as-judge	Tighten system prompt; consider output classifier
Vector database corruption	Very Low	Critical	Data integrity checksums; retrieval anomaly detection	Restore from last snapshot; re-ingest since snapshot timestamp

Cascading Failure Scenarios

Embedding model API outage during peak query period: Query processor cannot embed queries → vector search cannot execute → entire RAG pipeline fails. Mitigation: implement query-embedding caching for recently seen queries; maintain a keyword search fallback with explicit quality degradation notice.
ACL metadata missing from newly ingested documents: Documents ingest without access control metadata → ACL filter passes all requests → data leakage. Mitigation: mandatory ACL metadata validation before ingestion completion; reject documents lacking classification metadata.
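The ACL mitigation above is an ingestion gate that fails closed. A minimal sketch follows, assuming hypothetical field names (`document_id`, `classification`, `allowed_groups`) and an illustrative set of classification labels.

```python
REQUIRED_FIELDS = ("document_id", "classification", "allowed_groups")
KNOWN_CLASSIFICATIONS = {"OFFICIAL", "SENSITIVE", "PROTECTED"}  # illustrative label set


def validate_acl_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the document may be indexed."""
    problems = [f"missing {f}" for f in REQUIRED_FIELDS if not metadata.get(f)]
    label = metadata.get("classification")
    if label and label not in KNOWN_CLASSIFICATIONS:
        problems.append(f"unknown classification {label!r}")
    return problems


def admit_for_ingestion(documents: list[dict]):
    """Partition documents into accepted IDs and rejected (ID, problems) pairs."""
    accepted, rejected = [], []
    for doc in documents:
        problems = validate_acl_metadata(doc)
        if problems:
            rejected.append((doc.get("document_id"), problems))
        else:
            accepted.append(doc["document_id"])
    return accepted, rejected


documents = [
    {"document_id": "pol-1", "classification": "OFFICIAL", "allowed_groups": ["staff"]},
    {"document_id": "pol-2", "classification": "PROTECTED", "allowed_groups": []},
    {"document_id": "pol-3", "allowed_groups": ["finance"]},
]
accepted, rejected = admit_for_ingestion(documents)
```

An empty `allowed_groups` list is rejected along with a missing one, so a document never becomes retrievable in a state where the ACL filter has nothing to match against. Rejected documents would go to the dead-letter queue described in the error flow.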

14. Regulatory Considerations

Regulation	Requirement	RAG Pattern Response
APRA CPS 230 (Operational Risk Management)	Critical service continuity; third-party risk for cloud LLM	DR plan per component; LLM provider assessed as material service provider; multi-provider fallback
APRA CPS 234 (Information Security)	Information asset classification; access control	ACL-aware retrieval; classification labels preserved; encrypted at rest and in transit
Privacy Act 1988 (Australia)	Minimum necessary data collection; right to erasure	PII detection before corpus ingestion; erasure procedure deletes chunk + vector + source document
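EU AI Act Article 13	Transparency: high-risk AI systems must be transparent enough for deployers to interpret and use their output appropriately (the duty to tell users they are interacting with AI sits in Article 50)	UI disclosure: "Answers generated by AI based on [source]"; citation of source documents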
EU AI Act Article 14	Human oversight for high-risk AI systems	Human review gate for high-stakes RAG answers (medical, legal, financial advice)
ISO/IEC 42001 (AI Management System)	Risk management; accountability; transparency	Corpus inventory; model card; quality scorecard; audit logs as required artefacts
NIST AI RMF (Govern 1.1, Map 1.1)	Document AI system context and intended use	System card documenting intended use, limitations, and risk mitigations
GDPR Article 22	No solely automated decisions with legal or similarly significant effects on individuals	Human-in-the-loop for consequential decisions informed by RAG outputs

15. Reference Implementations

AWS

Source connectors: Amazon Kendra (managed) or custom Lambda + EventBridge
Chunking & embedding: AWS Lambda (Python) + Amazon Bedrock Titan Embeddings v2
Vector store: Amazon OpenSearch Service with k-NN plugin, or Amazon Aurora pgvector
Document store: Amazon S3 with S3 Versioning
LLM: Amazon Bedrock (Claude 3.5 Sonnet, Llama 3)
Orchestration: AWS Step Functions + LangChain on Lambda
Observability: Amazon CloudWatch + AWS X-Ray + Langfuse

Azure

Source connectors: Azure Logic Apps + Microsoft Graph connector
Chunking & embedding: Azure Functions + Azure OpenAI Service (text-embedding-3-large)
Vector store: Azure AI Search (with vector search mode)
Document store: Azure Blob Storage
LLM: Azure OpenAI Service (GPT-4o)
Orchestration: Azure AI Studio Prompt Flow
Observability: Azure Monitor + Application Insights + Azure AI Content Safety

GCP

Source connectors: Cloud Run jobs + Pub/Sub for event-driven ingestion
Chunking & embedding: Cloud Run + Vertex AI Embeddings (textembedding-gecko)
Vector store: Vertex AI Vector Search (formerly Matching Engine) or AlloyDB pgvector
Document store: Google Cloud Storage
LLM: Vertex AI (Gemini 1.5 Pro)
Orchestration: Vertex AI Agent Builder or LangChain on Cloud Run
Observability: Cloud Monitoring + Cloud Trace + Vertex AI Model Monitoring

On-Premises / Air-Gapped

Source connectors: Custom Python connectors + Apache NiFi
Chunking & embedding: GPU inference server (NVIDIA A10G) + BAAI bge-large-en-v1.5
Vector store: Weaviate or Qdrant self-hosted on Kubernetes
Document store: MinIO (S3-compatible object storage)
LLM: vLLM serving Llama 3.1 70B or Mistral Large on GPU cluster
Orchestration: LangChain / LlamaIndex on Kubernetes
Observability: Prometheus + Grafana + Langfuse self-hosted
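Whichever stack is chosen, the LLM slot can sit behind a thin wrapper that tries providers in priority order, which is one way to realise the multi-provider fallback listed against CPS 230 in the regulatory table. The sketch below is provider-agnostic and makes no real SDK calls: each provider is assumed to be a plain callable that takes a prompt and returns text, and the `generate_with_fallback` and `AllProvidersFailed` names are hypothetical.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every configured LLM provider has failed."""


def generate_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return (provider_name, answer)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad by design: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the answer lets the audit log record which model actually produced a given response. A production wrapper would normally add timeouts and retry limits, which are omitted here.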

16. Related Patterns

Pattern ID	Pattern Name	Relationship
EAAPL-RAG002	Multi-Source RAG	Extends RAG001 to heterogeneous source types; inherits all foundational components
EAAPL-RAG003	Secure RAG	Extends RAG001 with enterprise ACL enforcement; recommended overlay for any regulated deployment
EAAPL-RAG004	Federated RAG	Extends RAG001 to distributed knowledge bases; replaces centralised vector store
EAAPL-RAG005	Hybrid RAG	Extends RAG001 retrieval layer with BM25 + RRF; drop-in upgrade to retrieval component
EAAPL-RAG006	Streaming RAG	Extends RAG001 ingestion pipeline for real-time data sources
EAAPL-RAG007	Agentic RAG	Wraps RAG001 in an AI agent loop for multi-hop retrieval
EAAPL-RAG008	Multimodal RAG	Extends RAG001 embedding and retrieval for non-text modalities
EAAPL-RAG009	Graph RAG	Replaces/augments vector retrieval with knowledge graph traversal
EAAPL-RAG010	Contextual RAG with Metadata Filtering	Extends RAG001 with richer metadata schema and filter composition
EAAPL-KNW003	AI Knowledge Corpus Management	Governs the document corpus that RAG001 indexes
EAAPL-KNW004	Vector Database Management	Governs operational management of the vector store used by RAG001
EAAPL-KNW006	Corpus Quality Assurance	Provides quality gates for corpus ingested into RAG001

17. Maturity Assessment

Overall Maturity: Mature — Enterprise RAG is widely deployed across regulated industries; tooling is production-grade; best practices are documented; failure modes are well-understood.

Dimension	Score (1–5)	Rationale
Technology Readiness	5	All components (vector DBs, embedding models, LLM APIs) are GA and production-proven
Tooling Ecosystem	5	LangChain, LlamaIndex, LlamaHub, Haystack, and cloud-native RAG services are mature
Operational Guidance	4	RAGAS evaluation framework, hallucination benchmarks, and SRE practices are established but evolving
Security & Compliance Guidance	4	ACL-aware retrieval and audit patterns are documented; regulatory mapping is still being formalised by standards bodies
Scalability Evidence	4	Production deployments at 100M+ vector scale are documented; optimisation at extreme scale requires expertise
Cost Predictability	3	LLM token costs are volatile; embedding model pricing changes frequently; cost modelling is an ongoing effort

18. Revision History

Version	Date	Author	Changes
1.0	2024-01-15	EAAPL Working Group	Initial publication
1.1	2024-04-20	EAAPL Working Group	Added HyDE query expansion; updated OWASP LLM Top 10 to 2024 edition
2.0	2024-09-10	EAAPL Working Group	Major revision: hierarchical chunking strategy added; cross-encoder re-ranking formalised; regulatory section updated for EU AI Act final text
2.1	2025-02-28	EAAPL Working Group	Updated cost tables; added GCP Vertex AI reference implementation; expanded failure modes with cascading scenarios
