[EAAPL-RAG010] Contextual RAG with Metadata Filtering

Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Metadata-Driven Contextual Retrieval
Version: 1.2
Maturity: Proven
Tags: rag, metadata-filtering, contextual-retrieval, pre-retrieval-filtering, faceted-search, progressive-disclosure, schema-design
Regulatory Relevance: APRA CPS 234, Privacy Act 1988 APP 3 (data minimisation), ISO/IEC 42001 Section 8.4, EU AI Act Article 10

1. Executive Summary

Contextual RAG with Metadata Filtering extends the foundational RAG pattern with a rich, queryable metadata layer that enables users and system components to scope retrieval with precision before semantic similarity search executes. Rather than searching all vectors globally, the retrieval is constrained by metadata predicates — date ranges, document categories, departmental ownership, classification levels, language, and custom domain attributes — reducing the effective search space, improving precision, and enforcing contextual boundaries that pure semantic search cannot provide.

For enterprise architects and product managers, metadata filtering is the mechanism that transforms a general-purpose knowledge search into a domain-specific, context-aware knowledge assistant. A legal counsel who queries the AI assistant expects answers sourced only from legal documents that are in effect today, apply to the relevant jurisdiction, and fall within their clearance level, not answers drawn from a two-year-old draft or a policy from a different business unit. Metadata filtering encodes these contextual expectations as first-class retrieval parameters, rather than relying on users to state them in the query text and hoping the LLM respects them. The pattern is the recommended baseline for all production enterprise RAG deployments over large, heterogeneous corpora.

2. Problem Statement

Business Problem

Enterprise knowledge corpora are large and heterogeneous. A financial institution's knowledge base contains policies from multiple jurisdictions, in multiple languages, at multiple classification levels, some current and some superseded. A query for "the current refund policy for corporate clients" could, without metadata filtering, retrieve a superseded policy from 2019, a policy applicable to retail clients, or a draft policy not yet approved. Each of these is semantically similar to the query but contextually wrong.

Technical Problem

Semantic similarity search optimises for vector distance, not contextual appropriateness. A document can be highly semantically similar to a query while being contextually wrong (wrong date, wrong department, wrong jurisdiction, wrong classification). Without metadata filtering, the retrieval layer has no mechanism to enforce contextual constraints — the LLM must infer them from document text, which it does unreliably, or the system must hope that contextually appropriate documents happen to score higher than contextually wrong ones.

Symptoms

AI assistant returns answers based on superseded policies or procedures
Answers reference documents from the wrong department or business unit, confusing users
Multilingual queries return results in the wrong language despite the user's language preference
Date-sensitive queries ("what are our current obligations under X") return historical documents
Users add explicit temporal and domain constraints to every query ("current", "Australia only", "retail banking") because they have learned the system ignores context

Cost of Inaction

User trust erosion: users who receive contextually wrong answers (superseded policy, wrong jurisdiction) lose confidence rapidly
Compliance risk: AI system provides outdated regulatory guidance, leading to incorrect compliance decisions
Support burden: high volume of "the AI gave me the wrong policy" reports requiring manual correction

3. Context

When to Apply

Large heterogeneous corpora where documents have significant variation in date, category, jurisdiction, classification, or status
Multi-department deployments where users need results scoped to their business unit or domain
Regulated corpora where only current, approved, and appropriately classified documents should be retrieved
Multilingual corpora where language-specific retrieval is important
Any deployment where temporal freshness of answers is a user requirement

When NOT to Apply

Very small corpora (< 10,000 documents) where the global search space is manageable without metadata filtering
Corpora with no meaningful metadata differentiation (all documents have the same date, category, and status)
Use cases where cross-context retrieval is desired (comparative analysis across time periods or departments)

Prerequisites

A well-defined metadata schema agreed across all source systems
Metadata populated consistently at ingestion time (missing metadata values must trigger quality alerts)
A metadata extraction pipeline for documents that do not have explicit metadata (e.g., PDF files without embedded metadata)
A user interface or API that exposes metadata filter parameters to users or calling applications

Industry Applicability

Industry	Key Metadata Dimensions	Filter Examples
Financial Services	effective_date, jurisdiction, product_line, classification, regulatory_body	date_range: [today-7d, today]; jurisdiction: AU; status: CURRENT
Healthcare	clinical_specialty, formulary_version, guideline_body, publication_date	specialty: oncology; status: ACTIVE; guideline_body: NHMRC
Government	department, act_reference, security_classification, review_date	department: Treasury; classification: OFFICIAL-SENSITIVE; status: CURRENT
Legal	jurisdiction, court_level, decision_date, area_of_law	jurisdiction: NSW; area_of_law: employment; date_after: 2020-01-01
Technology	product_version, environment, doc_type, team	product: payment-gateway; version: >=3.2; doc_type: runbook; env: prod

4. Architecture Overview

Contextual RAG with Metadata Filtering introduces a carefully designed metadata schema as a first-class architectural concern, alongside mechanisms for metadata extraction, filter construction, and progressive disclosure of context.

Metadata Schema Design

The metadata schema is the foundation of the pattern. Schema design decisions made at deployment time are difficult to change later because they require re-ingestion of the entire corpus. The schema must balance completeness (capturing all contextually relevant attributes) with practicality (all fields must be extractable and populated reliably).

A canonical enterprise RAG metadata schema includes:

Temporal: effective_date, expiry_date, last_modified, publication_date
Provenance: source_system, source_document_id, document_version, author, owner_department
Classification: security_classification, data_sensitivity, contains_pii
Content type: document_type (policy/procedure/runbook/faq/report/contract), language, format
Domain: jurisdiction, product_line, regulatory_body, clinical_specialty (domain-specific)
Status: lifecycle_status (DRAFT/CURRENT/SUPERSEDED/ARCHIVED)

Every field in the schema must have a defined cardinality (single-value or multi-value), a defined value set or validation rule, and a defined population source (embedded in source document, extracted by NLP, or assigned by governance process).
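
As a hedged illustration of the three per-field requirements above (cardinality, value set or validation rule, population source), a schema registry entry could be sketched as follows. The names `FieldSpec`, `SCHEMA_V1`, and `check_value` are hypothetical, the population sources assigned to each field are illustrative assumptions, and only a subset of the canonical fields is shown.

```python
from dataclasses import dataclass
from typing import Optional

# Value sets taken from the Status and Content type entries of the canonical schema.
LIFECYCLE_STATUSES = frozenset({"DRAFT", "CURRENT", "SUPERSEDED", "ARCHIVED"})
DOCUMENT_TYPES = frozenset({"policy", "procedure", "runbook", "faq", "report", "contract"})


@dataclass(frozen=True)
class FieldSpec:
    """One registry entry per metadata field (hypothetical structure)."""
    name: str
    multi_valued: bool                   # cardinality: single-value or multi-value
    allowed_values: Optional[frozenset]  # None means "validated by a rule elsewhere"
    source: str                          # "embedded", "nlp", or "governance"


# Assumed assignments of cardinality and source, for illustration only.
SCHEMA_V1 = {
    spec.name: spec
    for spec in (
        FieldSpec("lifecycle_status", False, LIFECYCLE_STATUSES, "governance"),
        FieldSpec("document_type", False, DOCUMENT_TYPES, "nlp"),
        FieldSpec("jurisdiction", True, None, "nlp"),
        FieldSpec("language", False, None, "nlp"),
    )
}


def check_value(field_name: str, value) -> bool:
    """Check cardinality and, where a value set is defined, membership."""
    spec = SCHEMA_V1[field_name]
    is_collection = isinstance(value, (list, tuple, set, frozenset))
    if spec.multi_valued != is_collection:
        return False
    if spec.allowed_values is None:
        return True
    values = value if spec.multi_valued else [value]
    return all(v in spec.allowed_values for v in values)
```

Keeping the registry as data rather than code is one way to let the same definition drive both ingestion-time validation and query-time filter validation.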

Metadata Extraction Pipeline

Not all enterprise documents arrive with structured metadata. The extraction pipeline must infer metadata from document content when explicit metadata is absent. This involves:

Date extraction: identify effective dates, review dates, and publication dates from document headers, footers, and first paragraphs using NLP or document intelligence services
Document type classification: classify the document as policy/procedure/FAQ/contract/report using a fine-tuned text classifier
Language detection: identify the document language using langdetect or equivalent
Entity extraction for domain metadata: extract regulatory body names, product names, jurisdiction references for domain metadata population

Metadata quality gates enforce minimum field population requirements before a document is indexed. Documents that fail quality gates enter a "metadata review" queue rather than being indexed without metadata.
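
A minimal sketch of such a quality gate is shown below, assuming plain Python lists stand in for the indexing and metadata review queues. The required-field list and the function names `quality_gate` and `route` are assumptions for illustration; a real deployment would take the required fields from the agreed schema.

```python
from datetime import date

# Assumed minimum field set; the real list comes from the agreed metadata schema.
REQUIRED_FIELDS = ("effective_date", "lifecycle_status", "document_type", "language")
VALID_STATUSES = {"DRAFT", "CURRENT", "SUPERSEDED", "ARCHIVED"}


def quality_gate(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the document may be indexed."""
    problems = [
        f"missing: {name}"
        for name in REQUIRED_FIELDS
        if metadata.get(name) in (None, "", [])
    ]
    status = metadata.get("lifecycle_status")
    if status and status not in VALID_STATUSES:
        problems.append(f"invalid lifecycle_status: {status}")
    effective = metadata.get("effective_date")
    if effective is not None and not isinstance(effective, date):
        problems.append("effective_date must be a date")
    return problems


def route(doc_id: str, metadata: dict, index_queue: list, review_queue: list) -> None:
    """Send passing documents to indexing and failing ones to metadata review."""
    problems = quality_gate(metadata)
    if problems:
        review_queue.append((doc_id, problems))
    else:
        index_queue.append(doc_id)
```

Recording the specific problems alongside the document gives the human reviewer a starting point and feeds the completeness metrics.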

Filter Construction

Metadata filters can be applied at three levels:

System-imposed filters (always on, not user-controllable): ACL filters (EAAPL-RAG003), lifecycle status filter (CURRENT only by default, which excludes DRAFT, SUPERSEDED, and ARCHIVED), classification ceiling (user's maximum clearance)
User-provided explicit filters: the user or calling application specifies filter values explicitly (e.g., jurisdiction: AU, date_after: 2024-01-01, document_type: policy). These can be extracted from the user's query text (NLP-based filter extraction: "current policy" → status: CURRENT) or provided via a structured filter UI
Context-inferred filters: the system infers filters from context — the user's profile (department, role, jurisdiction), the conversation history, or the query classification. A user authenticated with a jurisdiction of "NSW" automatically has jurisdiction IN [NSW, AU] applied to their queries without specifying it explicitly

Filters are composed with AND logic by default; OR logic is available for multi-value fields (a user's permitted departments, multiple accepted languages).
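
The composition rule can be sketched as follows. The predicate is rendered as a MongoDB-style dictionary (`$and`, `$eq`, `$in`, `$gte`), a shape similar to what some vector stores accept; the exact syntax varies by product, so treat this as an assumption to adapt. The helper names and the in-memory `matches` evaluator are hypothetical and exist only to make the example testable.

```python
def eq(field, value):
    """Single-value equality clause."""
    return {field: {"$eq": value}}


def any_of(field, values):
    """OR within one field: the chunk matches if its value is any of `values`."""
    return {field: {"$in": list(values)}}


def gte(field, value):
    """Lower-bound clause; ISO-8601 date strings compare correctly as text."""
    return {field: {"$gte": value}}


def compose(*layers):
    """AND every clause from every layer (system, user-provided, context-inferred)."""
    return {"$and": [clause for layer in layers for clause in layer]}


def matches(metadata, predicate):
    """Evaluate the predicate against one chunk's metadata (illustration only)."""
    for clause in predicate["$and"]:
        field, condition = next(iter(clause.items()))
        op, expected = next(iter(condition.items()))
        actual = metadata.get(field)
        if actual is None:
            return False
        if op == "$eq" and actual != expected:
            return False
        if op == "$in" and actual not in expected:
            return False
        if op == "$gte" and actual < expected:
            return False
    return True


# Example layers: one system filter, two explicit user filters, one inferred filter.
system_filters = [eq("lifecycle_status", "CURRENT")]
user_filters = [gte("effective_date", "2024-01-01"), eq("document_type", "policy")]
context_filters = [any_of("jurisdiction", ["NSW", "AU"])]
predicate = compose(system_filters, user_filters, context_filters)
```

Because the layers are only ever ANDed together, a user-provided clause can narrow the system-imposed scope but cannot widen it.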

Progressive Disclosure

When metadata filtering produces too few results (fewer than a configured minimum number of relevant chunks), a progressive disclosure strategy relaxes filters in a defined order until sufficient results are found. For example: if {status: CURRENT, jurisdiction: NSW} produces 0 results, the system first relaxes jurisdiction to {AU} and retries; only if that also fails does it relax status to include SUPERSEDED. Each relaxation is logged, and the response includes a provenance note: "No current NSW-specific documents found; the following is based on the applicable national policy."

Progressive disclosure prevents "no results" responses while maintaining metadata filter transparency — the user always knows what filters were applied.
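
The relaxation loop might look like the sketch below. It assumes filters are held as a dictionary of field name to accepted values and that the relaxation order is supplied as configuration; `progressive_search`, `provenance_note`, and the in-memory `make_search` stand-in are hypothetical names, not part of any library.

```python
# System filters that must never be relaxed (assumed field names).
NON_RELAXABLE = frozenset({"acl", "classification"})


def progressive_search(search, filters, relaxations, min_results):
    """Run `search`, widening one filter at a time until enough results are found.

    filters:     dict of field -> set of accepted values
    relaxations: ordered list of (field, wider set of accepted values)
    Returns (results, relaxation_log).
    """
    for field, _ in relaxations:
        if field in NON_RELAXABLE:
            raise ValueError(f"filter '{field}' is non-relaxable")

    active = dict(filters)
    log = []
    results = search(active)
    for field, wider in relaxations:
        if len(results) >= min_results:
            break
        log.append({"field": field,
                    "from": sorted(active.get(field, ())),
                    "to": sorted(wider)})
        active[field] = set(wider)
        results = search(active)
    return results, log


def provenance_note(log):
    """Summarise any relaxations so the response can disclose them."""
    if not log:
        return "All requested filters were applied."
    steps = "; ".join(f"{e['field']} widened from {e['from']} to {e['to']}" for e in log)
    return f"Filters relaxed to find results: {steps}."


def make_search(corpus):
    """In-memory stand-in for a filtered vector search (illustration only)."""
    def search(filters):
        return [doc for doc in corpus
                if all(doc.get(f) in accepted for f, accepted in filters.items())]
    return search
```

Validating the relaxation configuration before the first search means a misconfigured rule fails loudly at call time rather than silently widening access.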

5. Architecture Diagram

flowchart TD
    subgraph Ingestion["Ingestion Pipeline"]
        A[Source Documents]
        B[Metadata Extractor]
        C[Vector Store]
    end
    subgraph Query["Query Pipeline"]
        D[User + Profile]
        E[Filter Composer]
        F{Threshold Check}
        G[Progressive Disclosure]
    end
    subgraph Generation["Generation"]
        H[Re-ranker]
        I[LLM + Context]
    end
    A --> B -->|metadata + chunks| C
    D -->|query + ACL filters| E
    E -->|filtered search| C
    C --> F
    F -->|too few results| G
    G -->|relax filters| E
    F -->|sufficient| H
    H --> I --> D
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#dbeafe,stroke:#3b82f6
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Metadata Schema Registry	Configuration	Define and version-control the canonical metadata schema	PostgreSQL schema table; Confluent Schema Registry; custom YAML definition	Critical
Metadata Extraction Pipeline	Data Processing	Extract metadata from documents lacking explicit metadata	AWS Comprehend, Azure AI Language, Apache Tika, custom NLP	High
Metadata Quality Gate	Data Quality	Validate metadata completeness and value correctness before indexing	great_expectations, pydantic, custom validators	Critical
Metadata Review Queue	Operations	Hold documents failing quality gate for human metadata review	SQS, Azure Service Bus, PostgreSQL queue table	High
Filter Extractor (NLP)	NLP	Extract metadata filter values from natural language query	LLM-based extraction; rule-based for common patterns	High
Profile-Based Filter Injector	Business Logic	Add filters derived from authenticated user's profile	Custom middleware reading from identity provider	High
System Filter Composer	Security	Add non-negotiable system filters (ACL, classification ceiling, status)	Custom security middleware	Critical
Filter Composer	Orchestration	Compose all filter sources into a single metadata filter predicate	Custom Python; vector DB filter syntax (Pinecone filter, Weaviate where clause, pgvector WHERE)	Critical
Progressive Disclosure Engine	Business Logic	Relax filters in defined order when results are insufficient	Custom Python with configurable relaxation rules	Medium
Filter Provenance Annotator	UX	Surface applied filters and any relaxations in response metadata	Custom formatter	High

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Source System	Produce document with raw metadata	Document bytes + source metadata
2	Metadata Extraction Pipeline	Infer missing metadata fields from document content	Enriched metadata record
3	Metadata Quality Gate	Validate all required fields present and valid	Pass or Fail decision
4	Chunking Engine	Split document; inherit full metadata per chunk	Chunks with `{metadata: {effective_date, status, category, jurisdiction, ...}}`
5	Vector DB	Upsert chunk with metadata payload	Indexed chunk with filterable metadata
6	User	Submit query + user profile context	Query + `{department, jurisdiction, clearance, language}`
7	Filter Extractor	Parse temporal, domain, and type cues from query text	Extracted filters: `{date_after: "2024-01-01", status: "CURRENT"}`
8	Profile-Based Filter Injector	Add user-profile-derived filters	Profile filters: `{jurisdiction: "AU", language: "en"}`
9	System Filter Composer	Add ACL filter + classification ceiling + default status filter	System filters: `{acl: user_groups, classification: <= user_clearance, status: CURRENT}`
10	Filter Composer	Combine all filter sources with AND logic	Final filter predicate
11	Vector ANN Search	Execute search with filter predicate	Top-K results within filtered space
12	Progressive Disclosure	If fewer results than the configured minimum: relax the next filter in the defined relaxation order; re-execute	Sufficient results; relaxation log
13	Re-ranker	Re-rank by semantic relevance and metadata freshness (recency bonus)	Top-N re-ranked chunks
14	Context Assembler	Assemble prompt with metadata annotations per chunk	Annotated prompt
15	LLM	Generate answer	Response
16	Filter Provenance Annotator	Append filter summary to response	"Results filtered to: current Australian policy documents (2024)"
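
Step 13 combines semantic relevance with a recency bonus but does not prescribe a formula. One possible sketch, assuming an additive bonus that halves every `half_life_days`, is shown below; the function name, the default weight, and the half-life are all illustrative assumptions to be tuned per corpus.

```python
from datetime import date


def rerank_with_recency(chunks, today, weight=0.1, half_life_days=365.0):
    """Sort chunks by similarity plus a decaying recency bonus (highest first).

    Each chunk is a dict with a `similarity` float and an `effective_date` date.
    """
    def score(chunk):
        age_days = max((today - chunk["effective_date"]).days, 0)
        bonus = weight * 0.5 ** (age_days / half_life_days)
        return chunk["similarity"] + bonus

    return sorted(chunks, key=score, reverse=True)


candidates = [
    {"id": "old", "similarity": 0.80, "effective_date": date(2023, 1, 1)},
    {"id": "new", "similarity": 0.78, "effective_date": date(2025, 1, 1)},
]
ranked = rerank_with_recency(candidates, today=date(2025, 1, 1))
```

With these example values the newer chunk overtakes the slightly more similar older one; setting `weight=0.0` recovers a pure similarity ordering.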

Error Flow

Error Condition	Detection	Recovery
All filter combinations return zero results	Zero-result detection after max progressive disclosure relaxations	Return "No documents found matching your query parameters"; do not generate from empty context
Metadata field missing on new document type	Quality gate failure	Route to metadata review queue; do not index without required fields
Filter NLP extraction produces incorrect date	Low-confidence extraction detection	Present extracted filter to user for confirmation ("Did you mean: results from 2024?")
Progressive disclosure relaxes ACL filter (must never happen)	ACL filter marked as non-relaxable	ACL filter is always in the non-relaxable set; throw error if relaxation touches ACL
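
The low-confidence recovery path in the table above can be sketched with a small rule-based extractor. The regular expressions, the confidence values, and the names `extract_filters` and `split_for_confirmation` are hypothetical; an LLM-based extractor would supply its own confidence scores.

```python
import re
from dataclasses import dataclass

CONFIRM_BELOW = 0.80  # assumed confirmation threshold


@dataclass(frozen=True)
class ExtractedFilter:
    field: str
    value: str
    confidence: float


def extract_filters(query: str) -> list[ExtractedFilter]:
    """Pull simple status and date cues out of the query text."""
    found = []
    if re.search(r"\bcurrent\b", query, re.IGNORECASE):
        found.append(ExtractedFilter("status", "CURRENT", 0.95))
    years = re.findall(r"\b(20\d{2})\b", query)
    if years:
        # More than one year in the query is ambiguous, so confidence drops.
        confidence = 0.90 if len(years) == 1 else 0.50
        found.append(ExtractedFilter("date_after", f"{years[0]}-01-01", confidence))
    return found


def split_for_confirmation(filters):
    """Separate filters to apply directly from those needing user confirmation."""
    to_apply = [f for f in filters if f.confidence >= CONFIRM_BELOW]
    to_confirm = []
    for f in filters:
        if f.confidence >= CONFIRM_BELOW:
            continue
        if f.field == "date_after":
            to_confirm.append(f"Did you mean: results from {f.value[:4]}?")
        else:
            to_confirm.append(f"Did you mean: {f.field} = {f.value}?")
    return to_apply, to_confirm
```

Low-confidence filters are held back rather than applied, so an incorrect extraction cannot silently narrow the search.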

8. Security Considerations

Filter Tamper Prevention

Metadata filter parameters must be constructed server-side from trusted sources (identity provider claims, server-side user profile). Client-supplied filter parameters must be validated against the user's authorised filter scope — a user must not be able to supply a filter parameter that elevates their access beyond their authorised scope. For example, a user with clearance: OFFICIAL-SENSITIVE must not be able to supply classification: PROTECTED as a filter parameter.
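
A hedged sketch of this validation follows. It assumes a simple linear ordering of classification levels and that `clearance` is read server-side from identity provider claims; the ordering tuple, the exception type, and the function names are illustrative assumptions rather than a prescribed implementation.

```python
# Assumed ordering, lowest sensitivity first; replace with the organisation's scheme.
CLASSIFICATION_ORDER = ("OFFICIAL", "OFFICIAL-SENSITIVE", "PROTECTED")


class FilterScopeError(Exception):
    """Raised when a client-supplied filter exceeds the caller's authorised scope."""


def permitted_classifications(clearance: str) -> list[str]:
    """All levels at or below the caller's clearance."""
    ceiling = CLASSIFICATION_ORDER.index(clearance)
    return list(CLASSIFICATION_ORDER[: ceiling + 1])


def sanitise_client_filter(client_filter: dict, clearance: str) -> dict:
    """Reject over-scoped requests and always pin a classification constraint.

    `clearance` must come from a trusted server-side source, never from the client.
    """
    permitted = permitted_classifications(clearance)
    requested = client_filter.get("classification")
    if requested is not None and requested not in permitted:
        raise FilterScopeError(f"classification '{requested}' exceeds clearance '{clearance}'")
    safe = dict(client_filter)
    # If the client asked for nothing specific, apply the full permitted set.
    safe["classification"] = [requested] if requested is not None else permitted
    return safe
```

The returned filter always carries a classification constraint, so omitting the parameter cannot be used to bypass the ceiling.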

OWASP LLM Top 10 Mitigations

OWASP LLM Risk	Metadata Filtering Specific Concern	Mitigation
LLM06: Sensitive Information Disclosure	Metadata filter bypass via client-supplied filter injection	Server-side filter construction only; validate all client filter parameters against authorised scope
LLM04: Model Denial of Service	Very broad filter (no constraints) causes full-index scan	Minimum filter requirements enforced (at least one of: status filter or date range or category filter)
LLM09: Overreliance	User assumes all relevant documents have been searched; filter silently excludes important context	Filter provenance notation in response; progressive disclosure when results < minimum
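
The minimum filter requirement listed as the LLM04 mitigation could be enforced with a check like the one below. The field names in `SCOPING_FIELDS` are assumptions chosen to mirror the status, date range, and category filters named in the table.

```python
# Assumed names for the status, date-range, and category filters.
SCOPING_FIELDS = ("status", "date_after", "date_before", "category")


def has_minimum_scope(filters: dict) -> bool:
    """True if at least one scoping filter carries a non-empty value."""
    return any(filters.get(name) for name in SCOPING_FIELDS)


def require_minimum_scope(filters: dict) -> dict:
    """Reject queries that would search the whole index unconstrained."""
    if not has_minimum_scope(filters):
        raise ValueError("query rejected: at least one scoping filter is required")
    return filters
```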

9. Governance Considerations

Metadata Schema Governance

The metadata schema is a shared enterprise data contract. Changes to the schema require a formal change management process: RFC, review by source system owners and data stewards, backward compatibility assessment, and a migration plan for existing indexed chunks. The schema registry must version all schema changes.

Filter Relaxation Policy Governance

The progressive disclosure relaxation order and conditions must be formally agreed by data stewards and approved by the AI governance board. The relaxation policy determines what contextual boundaries can be crossed automatically (e.g., can jurisdiction be relaxed from state to national? Can status be relaxed from CURRENT to SUPERSEDED?) — these are governance decisions, not engineering decisions.

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Metadata Schema Version Log	Data Architecture	Per version	Track schema evolution; migration history
Metadata Completeness Dashboard	Data Quality	Daily	Monitor field population rates per source and document type
Filter Relaxation Audit Log	AI Operations	Per event	Record every progressive disclosure event; detect over-relaxation
Filter Extraction Accuracy Report	AI Operations	Monthly	Validate NLP-based filter extraction against ground truth

10. Operational Considerations

Monitoring

Metric	Alert Threshold	Notes
Metadata completeness rate (per field, per source)	< 90% for required fields	Source-specific data quality issue
Progressive disclosure trigger rate	> 20% of queries	Filter schema too restrictive or corpus coverage gap
Zero-result rate (after max relaxation)	> 5% of queries	Corpus coverage issue; alert knowledge manager
Filter NLP extraction confidence (average)	< 0.80	Retrain or adjust NLP filter extractor
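
As an illustration of the first metric, per-field completeness can be computed directly from indexed metadata records. The function names are hypothetical, and the 0.90 default mirrors the alert threshold in the table above.

```python
def completeness_by_field(records: list[dict], required_fields) -> dict:
    """Fraction of records in which each required field is populated."""
    total = len(records)
    rates = {}
    for name in required_fields:
        populated = sum(1 for r in records if r.get(name) not in (None, "", []))
        rates[name] = populated / total if total else 0.0
    return rates


def fields_below_threshold(rates: dict, threshold: float = 0.90) -> list[str]:
    """Names of fields whose completeness rate should raise an alert."""
    return sorted(name for name, rate in rates.items() if rate < threshold)
```

Running this per source system, as the metric definition suggests, helps attribute a drop in completeness to a specific feed.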

Service Level Objectives

SLO	Target	Notes
Metadata quality gate pass rate	≥ 95% of ingested documents	Measured per source
Queries answered without progressive disclosure	≥ 80%	Most queries should find results within primary filter scope
Filtered retrieval latency overhead vs. unfiltered	≤ 10 ms additional	Metadata filter execution should be negligible

11. Cost Considerations

Cost Drivers

Cost Driver	Notes	Optimisation
Metadata extraction NLP (at ingestion)	Per-document inference cost	Batch processing; cache inference results per document content hash
Metadata quality review (human)	Manual review of documents failing quality gate	Improve extraction model quality to reduce manual review volume
Vector DB metadata storage	Each chunk stores a full metadata object; increases storage per vector	Compress metadata; store metadata in separate metadata store, reference by chunk_id
Filter index maintenance	Some vector DBs charge for metadata index updates	Use inverted metadata indexes for high-cardinality filterable fields
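
The content-hash caching optimisation in the first row can be sketched as below. `ExtractionCache` is a hypothetical in-memory wrapper; a production system would more likely back it with a persistent store, and `extractor` stands for whatever extraction call is being paid for.

```python
import hashlib


class ExtractionCache:
    """Cache metadata extraction results keyed by a hash of the document bytes."""

    def __init__(self, extractor):
        self._extractor = extractor  # callable: bytes -> dict of metadata
        self._results = {}
        self.misses = 0

    def extract(self, content: bytes) -> dict:
        key = hashlib.sha256(content).hexdigest()
        if key not in self._results:
            # Only unseen content triggers the (costly) extraction call.
            self.misses += 1
            self._results[key] = self._extractor(content)
        # Return a copy so callers cannot mutate the cached record.
        return dict(self._results[key])
```

Re-ingesting an unchanged document then costs one hash computation instead of a repeat inference call.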

Indicative Cost Range

Deployment Scale	Metadata Overhead vs. Base RAG
Small	+10–20% (extraction cost dominates)
Medium	+5–10% (extraction amortised; filter execution cheap)
Large	+3–5% (highly amortised; filter execution negligible)

12. Trade-Off Analysis

Filter Enforcement Strictness

Option	Result Completeness	User Experience Risk	Compliance Risk	Recommendation
Strict filters only (no relaxation)	May produce zero results	High (frustrating)	Low	For security-classified corpora only
Progressive disclosure with logging	High	Low (graceful degradation)	Low (transparent)	Default recommendation
No filters (global search)	Highest	High (contextually wrong results)	High	Not recommended for enterprise

Metadata Extraction Automation

Option	Metadata Quality	Ingestion Cost	Recommended For
Fully automated NLP extraction	Medium-High	Low	High-volume, homogeneous corpora
Automated + human review for failed	High	Medium	Mixed corpora; regulated use cases
Manual metadata assignment	Highest	Very High	Critical, low-volume corpora

Architectural Tensions

Tension	Trade-off	Recommendation
Filter granularity vs. schema complexity	More dimensions: precise filtering; more schema fields: harder to maintain	Start with 5–8 core dimensions; add domain-specific fields per use case
Pre-retrieval filtering vs. post-retrieval filtering	Pre-retrieval: fewer LLM tokens consumed; post-retrieval: higher recall	Pre-retrieval filtering always; post-retrieval classification labelling for output

13. Failure Modes

Failure Mode	Likelihood	Impact	Detection	Recovery
Metadata extraction assigns wrong effective_date (OCR error)	Medium	High	Spot-check QA; date anomaly detection	Human review queue; date correction pipeline
Progressive disclosure crosses ACL boundary (must not happen)	Very Low	Critical	ACL filter marked non-relaxable; audit log	Immediate alert; rollback progressive disclosure configuration
Status field not updated when document is superseded (stale CURRENT status)	Medium	High	Source system change notification + status sync job	Automated status sync from source system; stale status alert
Filter extractor identifies incorrect date from ambiguous query	Medium	Medium	Low-confidence extraction detection	Confirm extracted filters with user; "Did you mean: results from 2024?"

14. Regulatory Considerations

Regulation	Requirement	Metadata Filtering Response
Privacy Act 1988 APP 3	Collect only the minimum personal information necessary	Date/category/status metadata contains no personal information; domain metadata may reference individuals only via anonymised codes
APRA CPS 234	Access controls commensurate with information sensitivity	Classification metadata field drives access control enforcement; non-negotiable filter
EU AI Act Article 10	Appropriate data governance for AI operational data	Metadata schema governance; quality gate documentation; completeness metrics
GDPR Article 25 (Data Protection by Design and by Default)	Data minimisation by default	Status filter defaults to CURRENT; expired documents are excluded from retrieval unless explicitly overridden

15. Reference Implementations

AWS

Metadata extraction: Amazon Textract + Comprehend
Vector store with metadata: Amazon OpenSearch Service (supports rich filter queries on metadata fields)
Filter composition: Lambda function constructing OpenSearch filter DSL
Progressive disclosure: AWS Step Functions state machine invoking Lambda, with configurable relaxation rules

Azure

Metadata extraction: Azure AI Document Intelligence + Language Service
Vector store: Azure AI Search (native filter expressions on indexed fields)
Filter composition: Azure Functions; Azure AI Search OData filter syntax
Progressive disclosure: Logic Apps workflow with retry logic

GCP

Metadata extraction: Google Document AI + Cloud Natural Language
Vector store: Vertex AI Vector Search (with numeric/string filter constraints)
Filter composition: Cloud Run; Vertex AI filter expression builder
Progressive disclosure: Google Cloud Workflows (step-based workflow)

Self-Hosted

Metadata extraction: Apache Tika + spaCy
Vector store: Weaviate (where filter), Qdrant (filter), pgvector (WHERE clause)
Progressive disclosure: Custom Python orchestration class

16. Related Patterns

Pattern ID	Pattern Name	Relationship
EAAPL-RAG001	Enterprise RAG	Foundation; RAG010 extends with rich metadata schema and filter composition
EAAPL-RAG003	Secure RAG	ACL filter is a mandatory component of the system filter layer in RAG010
EAAPL-RAG005	Hybrid RAG	Metadata filters applied to both dense and BM25 retrieval paths in hybrid mode
EAAPL-KNW003	AI Knowledge Corpus Management	Corpus management policies include metadata completeness requirements
EAAPL-KNW006	Corpus Quality Assurance	Quality gates include metadata completeness and value validation

17. Maturity Assessment

Overall Maturity: Proven — Metadata filtering in vector databases is a well-established capability supported in all major platforms; the pattern is deployed in production across regulated industries; the primary challenges are metadata schema governance and extraction quality, not technology readiness.

Dimension	Score (1–5)	Rationale
Technology Readiness	5	Metadata filtering supported natively in all major vector databases
Tooling Ecosystem	4	Rich tooling for metadata extraction (Textract, Document Intelligence); filter composition is custom
Operational Guidance	4	Progressive disclosure patterns are well-understood; schema governance is organisation-specific
Security & Compliance	4	Classification and ACL filtering are well-established; filter tamper prevention requires careful implementation
Scalability Evidence	5	Metadata filter indexes scale to billions of vectors in managed services
Cost Predictability	4	Metadata extraction adds predictable per-document cost; filter execution is cheap

18. Revision History

Version	Date	Author	Changes
1.0	2024-03-15	EAAPL Working Group	Initial publication
1.1	2024-08-01	EAAPL Working Group	Progressive disclosure engine formalised; metadata schema canonical fields defined
1.2	2025-01-20	EAAPL Working Group	Filter tamper prevention security controls added; NLP filter extraction confidence monitoring added
