Contextual RAG with Metadata Filtering
[EAAPL-RAG010] Contextual RAG with Metadata Filtering
Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Metadata-Driven Contextual Retrieval
Version: 1.2
Maturity: Proven
Tags: rag metadata-filtering contextual-retrieval pre-retrieval-filtering faceted-search progressive-disclosure schema-design
Regulatory Relevance: APRA CPS 234, Privacy Act 1988 APP 3 (data minimisation), ISO/IEC 42001 Section 8.4, EU AI Act Article 10
1. Executive Summary
Contextual RAG with Metadata Filtering extends the foundational RAG pattern with a rich, queryable metadata layer that enables users and system components to scope retrieval with precision before semantic similarity search executes. Rather than searching all vectors globally, the retrieval is constrained by metadata predicates — date ranges, document categories, departmental ownership, classification levels, language, and custom domain attributes — reducing the effective search space, improving precision, and enforcing contextual boundaries that pure semantic search cannot provide.
For enterprise architects and product managers, metadata filtering is the mechanism that transforms a general-purpose knowledge search into a domain-specific, context-aware knowledge assistant. A legal counsel who queries the AI assistant expects answers sourced only from legal documents effective today, in the relevant jurisdiction, and classified at their clearance level, not answers drawn from a two-year-old draft or a policy from a different business unit. Metadata filtering encodes these contextual expectations as first-class retrieval parameters, rather than relying on users to state them in the query text and hoping the LLM respects them. The pattern is the recommended baseline for all production enterprise RAG deployments over large, heterogeneous corpora.
2. Problem Statement
Business Problem
Enterprise knowledge corpora are large and heterogeneous. A financial institution's knowledge base contains policies from multiple jurisdictions, in multiple languages, at multiple classification levels, some current and some superseded. A query for "the current refund policy for corporate clients" could, without metadata filtering, retrieve a superseded policy from 2019, a policy applicable to retail clients, or a draft policy not yet approved. Each of these is semantically similar to the query but contextually wrong.
Technical Problem
Semantic similarity search optimises for vector distance, not contextual appropriateness. A document can be highly semantically similar to a query while being contextually wrong (wrong date, wrong department, wrong jurisdiction, wrong classification). Without metadata filtering, the retrieval layer has no mechanism to enforce contextual constraints — the LLM must infer them from document text, which it does unreliably, or the system must hope that contextually appropriate documents happen to score higher than contextually wrong ones.
Symptoms
- AI assistant returns answers based on superseded policies or procedures
- Answers reference documents from the wrong department or business unit, confusing users
- Multilingual queries return results in the wrong language despite the user's language preference
- Date-sensitive queries ("what are our current obligations under X") return historical documents
- Users add explicit temporal and domain constraints to every query ("current", "Australia only", "retail banking") because they have learned the system ignores context
Cost of Inaction
- User trust erosion: users who receive contextually wrong answers (superseded policy, wrong jurisdiction) lose confidence rapidly
- Compliance risk: AI system provides outdated regulatory guidance, leading to incorrect compliance decisions
- Support burden: high volume of "the AI gave me the wrong policy" reports requiring manual correction
3. Context
When to Apply
- Large heterogeneous corpora where documents have significant variation in date, category, jurisdiction, classification, or status
- Multi-department deployments where users need results scoped to their business unit or domain
- Regulated corpora where only current, approved, and appropriately classified documents should be retrieved
- Multilingual corpora where language-specific retrieval is important
- Any deployment where temporal freshness of answers is a user requirement
When NOT to Apply
- Very small corpora (< 10,000 documents) where the global search space is manageable without metadata filtering
- Corpora with no meaningful metadata differentiation (all documents have the same date, category, and status)
- Use cases where cross-context retrieval is desired (comparative analysis across time periods or departments)
Prerequisites
- A well-defined metadata schema agreed across all source systems
- Metadata populated consistently at ingestion time (missing metadata values must trigger quality alerts)
- A metadata extraction pipeline for documents that do not have explicit metadata (e.g., PDF files without embedded metadata)
- A user interface or API that exposes metadata filter parameters to users or calling applications
Industry Applicability
| Industry | Key Metadata Dimensions | Filter Examples |
|---|---|---|
| Financial Services | effective_date, jurisdiction, product_line, classification, regulatory_body | date_range: [today-7d, today]; jurisdiction: AU; status: CURRENT |
| Healthcare | clinical_specialty, formulary_version, guideline_body, publication_date | specialty: oncology; status: ACTIVE; guideline_body: NHMRC |
| Government | department, act_reference, security_classification, review_date | department: Treasury; classification: OFFICIAL-SENSITIVE; status: CURRENT |
| Legal | jurisdiction, court_level, decision_date, area_of_law | jurisdiction: NSW; area_of_law: employment; date_after: 2020-01-01 |
| Technology | product_version, environment, doc_type, team | product: payment-gateway; version: >=3.2; doc_type: runbook; env: prod |
4. Architecture Overview
Contextual RAG with Metadata Filtering introduces a carefully designed metadata schema as a first-class architectural concern, alongside mechanisms for metadata extraction, filter construction, and progressive disclosure of context.
Metadata Schema Design
The metadata schema is the foundation of the pattern. Schema design decisions made at deployment time are difficult to change later because changing them requires re-ingesting the entire corpus. The schema must balance completeness (capturing all contextually relevant attributes) with practicality (every field must be reliably extractable and populated).
A canonical enterprise RAG metadata schema includes:
- Temporal: effective_date, expiry_date, last_modified, publication_date
- Provenance: source_system, source_document_id, document_version, author, owner_department
- Classification: security_classification, data_sensitivity, contains_pii
- Content type: document_type (policy/procedure/runbook/faq/report/contract), language, format
- Domain: jurisdiction, product_line, regulatory_body, clinical_specialty (domain-specific)
- Status: lifecycle_status (DRAFT/CURRENT/SUPERSEDED/ARCHIVED)
Every field in the schema must have a defined cardinality (single-value or multi-value), a defined value set or validation rule, and a defined population source (embedded in source document, extracted by NLP, or assigned by governance process).
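The three per-field requirements above (cardinality, value set or validation rule, population source) can be captured in a small declarative registry. The following is a minimal sketch, not a definitive implementation: `FieldSpec`, `SCHEMA`, `validate`, the choice of which fields are required, and the population source assigned to each field are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical schema entry; attribute names are assumptions."""
    name: str
    multi_value: bool                 # cardinality: False means single-value
    validator: Callable[[Any], bool]  # value set or validation rule
    source: str                       # "embedded", "nlp", or "governance"
    required: bool = True


def one_of(*values: str) -> Callable[[Any], bool]:
    """Build a validator that accepts only the listed values."""
    allowed = frozenset(values)
    return lambda value: value in allowed


def is_iso_date(value: Any) -> bool:
    """Accept ISO 8601 calendar dates such as 2024-07-01."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


# Illustrative subset of the canonical schema; sources are assumed.
SCHEMA = {spec.name: spec for spec in [
    FieldSpec("effective_date", False, is_iso_date, "embedded"),
    FieldSpec("lifecycle_status", False,
              one_of("DRAFT", "CURRENT", "SUPERSEDED", "ARCHIVED"), "governance"),
    FieldSpec("document_type", False,
              one_of("policy", "procedure", "runbook", "faq", "report", "contract"), "nlp"),
    FieldSpec("jurisdiction", True,
              lambda v: isinstance(v, str) and v.isupper(), "nlp", required=False),
]}


def validate(metadata: dict) -> list:
    """Return human-readable validation errors (empty list means valid)."""
    errors = []
    for spec in SCHEMA.values():
        if spec.name not in metadata:
            if spec.required:
                errors.append(f"{spec.name}: missing")
            continue
        value = metadata[spec.name]
        if isinstance(value, list) != spec.multi_value:
            expected = "a list" if spec.multi_value else "a single value"
            errors.append(f"{spec.name}: expected {expected}")
            continue
        for item in (value if spec.multi_value else [value]):
            if not spec.validator(item):
                errors.append(f"{spec.name}: invalid value {item!r}")
    return errors
```

Keeping the registry declarative means a schema change is a reviewable diff, which fits the versioning role of the Metadata Schema Registry component.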
Metadata Extraction Pipeline
Not all enterprise documents arrive with structured metadata. The extraction pipeline must infer metadata from document content when explicit metadata is absent. This involves:
- Date extraction: identify effective dates, review dates, and publication dates from document headers, footers, and first paragraphs using NLP or document intelligence services
- Document type classification: classify the document as policy/procedure/FAQ/contract/report using a fine-tuned text classifier
- Language detection: identify the document language using langdetect or equivalent
- Entity extraction for domain metadata: extract regulatory body names, product names, jurisdiction references for domain metadata population
Metadata quality gates enforce minimum field population requirements before a document is indexed. Documents that fail quality gates enter a "metadata review" queue rather than being indexed without metadata.
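As a rough illustration of the gate-and-queue behaviour, the sketch below routes each document either to the index or to a review queue. The class name `IngestionRouter`, the `REQUIRED_FIELDS` tuple, and the in-memory queue are assumptions for illustration; a real deployment would use one of the queue technologies listed in the Components table.

```python
from collections import deque

# Assumed minimum field set; the real list is a governance decision.
REQUIRED_FIELDS = (
    "effective_date",
    "lifecycle_status",
    "document_type",
    "language",
    "security_classification",
)


def missing_fields(metadata: dict) -> list:
    """List required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if metadata.get(f) in (None, "", [])]


class IngestionRouter:
    """Hypothetical router: index on pass, queue for human review on fail."""

    def __init__(self) -> None:
        self.indexed = []            # stand-in for the vector DB upsert
        self.review_queue = deque()  # stand-in for SQS or a queue table

    def submit(self, doc_id: str, metadata: dict) -> str:
        missing = missing_fields(metadata)
        if missing:
            # Never index a document that lacks required metadata.
            self.review_queue.append({"doc_id": doc_id, "missing": missing})
            return "REVIEW"
        self.indexed.append((doc_id, metadata))
        return "INDEXED"
```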
Filter Construction
Metadata filters can be applied at three levels:
- System-imposed filters (always on, not user-controllable): ACL filters (EAAPL-RAG003), lifecycle status filter (CURRENT only by default; DRAFT, SUPERSEDED, and ARCHIVED are excluded), classification ceiling (user's maximum clearance)
- User-provided explicit filters: the user or calling application specifies filter values explicitly (e.g., jurisdiction: AU, date_after: 2024-01-01, document_type: policy). These can be extracted from the user's query text (NLP-based filter extraction: "current policy" → status: CURRENT) or provided via a structured filter UI
- Context-inferred filters: the system infers filters from context, such as the user's profile (department, role, jurisdiction), the conversation history, or the query classification. A user authenticated with a jurisdiction of "NSW" automatically has jurisdiction IN [NSW, AU] applied to their queries without specifying it explicitly
Filters are composed with AND logic by default; OR logic is available for multi-value fields (a user's permitted departments, multiple accepted languages).
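One way to realise AND composition across the three filter sources is to treat every filter as a set of permitted values and intersect sets when two sources constrain the same field. The sketch below assumes equality-style filters only (range filters such as date_after would need separate handling); `compose_filters`, `matches`, and the field names are illustrative.

```python
def as_set(value) -> set:
    """Normalise a scalar or a collection into a set of values."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return set(value)
    return {value}


def compose_filters(*sources: dict) -> dict:
    """AND-compose filter sources; same-field constraints are intersected,
    so no source can widen what another source permits."""
    predicate = {}
    for source in sources:
        for field, allowed in source.items():
            allowed = as_set(allowed)
            if field in predicate:
                predicate[field] &= allowed
            else:
                predicate[field] = allowed
    return predicate


def matches(metadata: dict, predicate: dict) -> bool:
    """A chunk matches when every field shares at least one permitted value
    (OR within a multi-value field, AND across fields)."""
    return all(as_set(metadata.get(field)) & allowed
               for field, allowed in predicate.items())


# Example: the user asks for DRAFT as well, but the system filter wins.
predicate = compose_filters(
    {"lifecycle_status": ["CURRENT", "DRAFT"], "document_type": "policy"},  # user
    {"jurisdiction": ["NSW", "AU"]},                                        # profile
    {"lifecycle_status": "CURRENT", "acl_group": ["legal", "all-staff"]},   # system
)
```

Intersection keeps the system-imposed layer authoritative: a user-supplied value that falls outside the system's permitted set simply drops out of the final predicate.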
Progressive Disclosure
When metadata filtering produces too few results (fewer than K minimum relevant chunks), a progressive disclosure strategy relaxes filters in a defined order until sufficient results are found. For example: if {status: CURRENT, jurisdiction: NSW} produces 0 results, the system relaxes jurisdiction to {AU} and tries again before relaxing status to include SUPERSEDED. Each relaxation is logged, and the response includes a provenance note: "No current NSW-specific documents found; the following is based on the applicable national policy."
Progressive disclosure prevents "no results" responses while maintaining metadata filter transparency — the user always knows what filters were applied.
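The relaxation loop can be sketched as follows, using the jurisdiction-then-status example above. This is a simplified illustration under stated assumptions: `retrieve_with_relaxation`, `NON_RELAXABLE`, the field names, and the in-memory `toy_search` are hypothetical, and a real engine would call the vector database instead.

```python
# Fields that must never be relaxed (assumed names).
NON_RELAXABLE = frozenset({"acl_group", "classification"})

# Toy corpus standing in for the vector index.
CORPUS = [
    {"id": "au-refund-2024", "jurisdiction": "AU", "status": "CURRENT"},
    {"id": "nsw-refund-2019", "jurisdiction": "NSW", "status": "SUPERSEDED"},
]


def toy_search(filters: dict) -> list:
    """Return documents whose fields all fall within the permitted values."""
    return [doc for doc in CORPUS
            if all(doc.get(field) in allowed for field, allowed in filters.items())]


def retrieve_with_relaxation(search, filters: dict, steps: list, min_results: int):
    """Apply relaxation steps in order until enough results are found.

    steps is an ordered list of (field, relaxed_values) pairs.
    Returns (results, filters actually applied, relaxation log).
    """
    for field, _ in steps:
        if field in NON_RELAXABLE:
            raise ValueError(f"relaxation must not touch {field!r}")
    active = dict(filters)
    log = []
    results = search(active)
    for field, relaxed in steps:
        if len(results) >= min_results:
            break
        log.append({"field": field, "from": active.get(field), "to": relaxed})
        active[field] = relaxed
        results = search(active)
    return results, active, log


results, applied, log = retrieve_with_relaxation(
    toy_search,
    {"status": ["CURRENT"], "jurisdiction": ["NSW"]},
    steps=[("jurisdiction", ["NSW", "AU"]),
           ("status", ["CURRENT", "SUPERSEDED"])],
    min_results=1,
)
```

In this example the first relaxation (jurisdiction widened to include AU) is sufficient, so status is never relaxed and the log holds a single entry that can drive the provenance note.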
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Metadata Schema Registry | Configuration | Define and version-control the canonical metadata schema | PostgreSQL schema table; Confluent Schema Registry; custom YAML definition | Critical |
| Metadata Extraction Pipeline | Data Processing | Extract metadata from documents lacking explicit metadata | AWS Comprehend, Azure AI Language, Apache Tika, custom NLP | High |
| Metadata Quality Gate | Data Quality | Validate metadata completeness and value correctness before indexing | great_expectations, pydantic, custom validators | Critical |
| Metadata Review Queue | Operations | Hold documents failing quality gate for human metadata review | SQS, Azure Service Bus, PostgreSQL queue table | High |
| Filter Extractor (NLP) | NLP | Extract metadata filter values from natural language query | LLM-based extraction; rule-based for common patterns | High |
| Profile-Based Filter Injector | Business Logic | Add filters derived from authenticated user's profile | Custom middleware reading from identity provider | High |
| System Filter Composer | Security | Add non-negotiable system filters (ACL, classification ceiling, status) | Custom security middleware | Critical |
| Filter Composer | Orchestration | Compose all filter sources into a single metadata filter predicate | Custom Python; vector DB filter syntax (Pinecone filter, Weaviate where clause, pgvector WHERE) | Critical |
| Progressive Disclosure Engine | Business Logic | Relax filters in defined order when results are insufficient | Custom Python with configurable relaxation rules | Medium |
| Filter Provenance Annotator | UX | Surface applied filters and any relaxations in response metadata | Custom formatter | High |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Source System | Produce document with raw metadata | Document bytes + source metadata |
| 2 | Metadata Extraction Pipeline | Infer missing metadata fields from document content | Enriched metadata record |
| 3 | Metadata Quality Gate | Validate all required fields present and valid | Pass or Fail decision |
| 4 | Chunking Engine | Split document; inherit full metadata per chunk | Chunks with {metadata: {effective_date, status, category, jurisdiction, ...}} |
| 5 | Vector DB | Upsert chunk with metadata payload | Indexed chunk with filterable metadata |
| 6 | User | Submit query + user profile context | Query + {department, jurisdiction, clearance, language} |
| 7 | Filter Extractor | Parse temporal, domain, and type cues from query text | Extracted filters: {date_after: "2024-01-01", status: "CURRENT"} |
| 8 | Profile-Based Filter Injector | Add user-profile-derived filters | Profile filters: {jurisdiction: "AU", language: "en"} |
| 9 | System Filter Composer | Add ACL filter + classification ceiling + default status filter | System filters: {acl: user_groups, classification: <= user_clearance, status: CURRENT} |
| 10 | Filter Composer | Combine all filter sources with AND logic | Final filter predicate |
| 11 | Vector ANN Search | Execute search with filter predicate | Top-K results within filtered space |
| 12 | Progressive Disclosure | If result count < minimum: relax the next filter in the defined relaxation order; re-execute | Sufficient results; relaxation log |
| 13 | Re-ranker | Re-rank by semantic relevance and metadata freshness (recency bonus) | Top-N re-ranked chunks |
| 14 | Context Assembler | Assemble prompt with metadata annotations per chunk | Annotated prompt |
| 15 | LLM | Generate answer | Response |
| 16 | Filter Provenance Annotator | Append filter summary to response | "Results filtered to: current Australian policy documents (2024)" |
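Step 13 combines semantic relevance with a recency bonus, but the pattern does not prescribe a formula. One possible scoring rule, shown purely as an assumption, adds an exponentially decaying bonus based on the age of each chunk's effective_date. The weight, half-life, and function names below are illustrative and would need tuning per corpus.

```python
from datetime import date


def recency_bonus(effective_date: date, today: date,
                  half_life_days: float = 365.0) -> float:
    """Bonus in (0, 1] that halves every half_life_days (assumed decay rule)."""
    age_days = max((today - effective_date).days, 0)
    return 0.5 ** (age_days / half_life_days)


def rerank(chunks: list, today: date, weight: float = 0.1, top_n: int = 3) -> list:
    """Order chunks by similarity plus a weighted recency bonus."""
    def score(chunk: dict) -> float:
        return chunk["similarity"] + weight * recency_bonus(chunk["effective_date"], today)
    return sorted(chunks, key=score, reverse=True)[:top_n]


chunks = [
    {"id": "a", "similarity": 0.80, "effective_date": date(2024, 6, 1)},
    {"id": "b", "similarity": 0.80, "effective_date": date(2020, 6, 1)},
    {"id": "c", "similarity": 0.95, "effective_date": date(2019, 1, 1)},
]
ranked = rerank(chunks, today=date(2025, 1, 1))
```

With a small weight the bonus acts as a tie-breaker: chunk a outranks the equally similar but older chunk b, while the much more similar chunk c still ranks first despite its age.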
Error Flow
| Error Condition | Detection | Recovery |
|---|---|---|
| All filter combinations return zero results | Zero-result detection after max progressive disclosure relaxations | Return "No documents found matching your query parameters"; do not generate from empty context |
| Metadata field missing on new document type | Quality gate failure | Route to metadata review queue; do not index without required fields |
| Filter NLP extraction produces incorrect date | Low-confidence extraction detection | Present extracted filter to user for confirmation ("Did you mean: results from 2024?") |
| Progressive disclosure relaxes ACL filter (must never happen) | ACL filter marked as non-relaxable | ACL filter is always in the non-relaxable set; throw error if relaxation touches ACL |
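The low-confidence confirmation path can be illustrated with a rule-based extractor. The sketch below is an assumption-laden example rather than a recommended extractor: the regular expressions, the confidence values, and the function names are hypothetical, and the 0.80 threshold simply reuses the monitoring threshold from this pattern.

```python
import re


def extract_filters(query: str):
    """Return (filters, confidence) from simple lexical cues in the query."""
    text = query.lower()
    filters = {}
    confidence = 1.0
    if re.search(r"\bcurrent\b", text):
        filters["lifecycle_status"] = "CURRENT"
    explicit = re.search(r"\b(?:since|after|from)\s+(20\d{2})\b", text)
    if explicit:
        filters["date_after"] = f"{explicit.group(1)}-01-01"
    else:
        bare = re.search(r"\b(20\d{2})\b", text)
        if bare:
            # A bare year is ambiguous (it may be a topic, not a date bound),
            # so the extraction is marked low-confidence.
            filters["date_after"] = f"{bare.group(1)}-01-01"
            confidence = 0.5
    return filters, confidence


def confirmation_prompt(filters: dict, confidence: float, threshold: float = 0.80):
    """Return a confirmation question for low-confidence date filters, else None."""
    if confidence >= threshold or "date_after" not in filters:
        return None
    return f"Did you mean: results from {filters['date_after'][:4]}?"
```

For example, "current refund policy since 2024" yields both filters with full confidence, while "leave policy 2024" yields a date filter that is held for user confirmation.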
8. Security Considerations
Filter Tamper Prevention
Metadata filter parameters must be constructed server-side from trusted sources (identity provider claims, server-side user profile). Client-supplied filter parameters must be validated against the user's authorised filter scope — a user must not be able to supply a filter parameter that elevates their access beyond their authorised scope. For example, a user with clearance: OFFICIAL-SENSITIVE must not be able to supply classification: PROTECTED as a filter parameter.
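A minimal server-side check along these lines is sketched below. The classification ordering (OFFICIAL below OFFICIAL-SENSITIVE below PROTECTED), the set of client-filterable fields, and all function names are assumptions for illustration; the clearance value is expected to come from identity provider claims, never from the request body.

```python
# Assumed ordering, lowest to highest sensitivity.
CLASSIFICATION_ORDER = ["OFFICIAL", "OFFICIAL-SENSITIVE", "PROTECTED"]

# Assumed allow-list of fields a client may filter on.
CLIENT_FILTERABLE_FIELDS = frozenset(
    {"jurisdiction", "document_type", "date_after", "classification"}
)


class FilterScopeError(Exception):
    """Raised when a client filter exceeds the user's authorised scope."""


def authorised_classifications(clearance: str) -> list:
    """All classification levels at or below the user's clearance."""
    return CLASSIFICATION_ORDER[: CLASSIFICATION_ORDER.index(clearance) + 1]


def validate_client_filters(client_filters: dict, clearance: str) -> dict:
    """Reject out-of-scope filters and always apply the classification ceiling."""
    unknown = set(client_filters) - CLIENT_FILTERABLE_FIELDS
    if unknown:
        raise FilterScopeError(f"fields not client-filterable: {sorted(unknown)}")
    permitted = authorised_classifications(clearance)
    requested = client_filters.get("classification")
    if requested is not None and requested not in permitted:
        raise FilterScopeError(f"classification {requested!r} exceeds clearance")
    validated = dict(client_filters)
    # The ceiling is applied even when the client sent no classification filter.
    validated["classification"] = [requested] if requested is not None else permitted
    return validated
```

Rejecting the request outright, rather than silently clamping it, makes tampering attempts visible in logs.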
OWASP LLM Top 10 Mitigations
| OWASP LLM Risk | Metadata Filtering Specific Concern | Mitigation |
|---|---|---|
| LLM06: Sensitive Information Disclosure | Metadata filter bypass via client-supplied filter injection | Server-side filter construction only; validate all client filter parameters against authorised scope |
| LLM04: Model Denial of Service | Very broad filter (no constraints) causes full-index scan | Minimum filter requirements enforced (at least one of: status filter or date range or category filter) |
| LLM09: Overreliance | User assumes all relevant documents have been searched; filter silently excludes important context | Filter provenance notation in response; progressive disclosure when results < minimum |
9. Governance Considerations
Metadata Schema Governance
The metadata schema is a shared enterprise data contract. Changes to the schema require a formal change management process: RFC, review by source system owners and data stewards, backward compatibility assessment, and a migration plan for existing indexed chunks. The schema registry must version all schema changes.
Filter Relaxation Policy Governance
The progressive disclosure relaxation order and conditions must be formally agreed by data stewards and approved by the AI governance board. The relaxation policy determines what contextual boundaries can be crossed automatically (e.g., can jurisdiction be relaxed from state to national? Can status be relaxed from CURRENT to SUPERSEDED?) — these are governance decisions, not engineering decisions.
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Metadata Schema Version Log | Data Architecture | Per version | Track schema evolution; migration history |
| Metadata Completeness Dashboard | Data Quality | Daily | Monitor field population rates per source and document type |
| Filter Relaxation Audit Log | AI Operations | Per event | Record every progressive disclosure event; detect over-relaxation |
| Filter Extraction Accuracy Report | AI Operations | Monthly | Validate NLP-based filter extraction against ground truth |
10. Operational Considerations
Monitoring
| Metric | Alert Threshold | Notes |
|---|---|---|
| Metadata completeness rate (per field, per source) | < 90% for required fields | Source-specific data quality issue |
| Progressive disclosure trigger rate | > 20% of queries | Filter schema too restrictive or corpus coverage gap |
| Zero-result rate (after max relaxation) | > 5% of queries | Corpus coverage issue; alert knowledge manager |
| Filter NLP extraction confidence (average) | < 0.80 | Retrain or adjust NLP filter extractor |
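The completeness metric in the first row can be computed directly from ingested metadata records. The sketch below assumes records are plain dictionaries and treats None, empty strings, and empty lists as unpopulated; the function names are illustrative, and the 0.90 default mirrors the alert threshold in the table.

```python
def completeness_rates(records: list, required_fields: list) -> dict:
    """Fraction of records with each required field populated."""
    total = len(records)
    if total == 0:
        return {}
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "", [])) / total
        for field in required_fields
    }


def completeness_alerts(rates: dict, threshold: float = 0.90) -> list:
    """Fields whose population rate falls below the alert threshold."""
    return sorted(field for field, rate in rates.items() if rate < threshold)
```

In practice the rates would be grouped per source system and document type before alerting, as the Metadata Completeness Dashboard artefact describes.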
Service Level Objectives
| SLO | Target | Notes |
|---|---|---|
| Metadata quality gate pass rate | ≥ 95% of ingested documents | Measured per source |
| Query with zero progressive disclosure | ≥ 80% | Most queries should find results within primary filter scope |
| Filtered retrieval latency overhead vs. unfiltered | ≤ 10 ms additional | Metadata filter execution should be negligible |
11. Cost Considerations
Cost Drivers
| Cost Driver | Notes | Optimisation |
|---|---|---|
| Metadata extraction NLP (at ingestion) | Per-document inference cost | Batch processing; cache inference results per document content hash |
| Metadata quality review (human) | Manual review of documents failing quality gate | Improve extraction model quality to reduce manual review volume |
| Vector DB metadata storage | Each chunk stores a full metadata object; increases storage per vector | Compress metadata; store metadata in separate metadata store, reference by chunk_id |
| Filter index maintenance | Some vector DBs charge for metadata index updates | Use inverted metadata indexes for high-cardinality filterable fields |
Indicative Cost Range
| Deployment Scale | Metadata Overhead vs. Base RAG |
|---|---|
| Small | +10–20% (extraction cost dominates) |
| Medium | +5–10% (extraction amortised; filter execution cheap) |
| Large | +3–5% (highly amortised; filter execution negligible) |
12. Trade-Off Analysis
Filter Enforcement Strictness
| Option | Result Completeness | User Experience Risk | Compliance Risk | Recommendation |
|---|---|---|---|---|
| Strict filters only (no relaxation) | May produce zero results | High (frustrating) | Low | For security-classified corpora only |
| Progressive disclosure with logging | High | Low (graceful degradation) | Low (transparent) | Default recommendation |
| No filters (global search) | Highest | Contextually wrong results | High | Not recommended for enterprise |
Metadata Extraction Automation
| Option | Metadata Quality | Ingestion Cost | Recommended For |
|---|---|---|---|
| Fully automated NLP extraction | Medium-High | Low | High-volume, homogeneous corpora |
| Automated + human review for failed | High | Medium | Mixed corpora; regulated use cases |
| Manual metadata assignment | Highest | Very High | Critical, low-volume corpora |
Architectural Tensions
| Tension | Trade-off | Recommendation |
|---|---|---|
| Filter granularity vs. schema complexity | More dimensions: precise filtering; more schema fields: harder to maintain | Start with 5–8 core dimensions; add domain-specific fields per use case |
| Pre-retrieval filtering vs. post-retrieval filtering | Pre-retrieval: fewer LLM tokens consumed; post-retrieval: higher recall | Pre-retrieval filtering always; post-retrieval classification labelling for output |
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Metadata extraction assigns wrong effective_date (OCR error) | Medium | High | Spot-check QA; date anomaly detection | Human review queue; date correction pipeline |
| Progressive disclosure crosses ACL boundary (must not happen) | Very Low | Critical | ACL filter marked non-relaxable; audit log | Immediate alert; rollback progressive disclosure configuration |
| Status field not updated when document is superseded (stale CURRENT status) | Medium | High | Source system change notification + status sync job | Automated status sync from source system; stale status alert |
| Filter extractor identifies incorrect date from ambiguous query | Medium | Medium | Low-confidence extraction detection | Confirm extracted filters with user; "Did you mean: results from 2024?" |
14. Regulatory Considerations
| Regulation | Requirement | Metadata Filtering Response |
|---|---|---|
| Privacy Act 1988 APP 3 | Collect only the minimum personal information necessary | Date/category/status metadata contains no personal information; domain metadata may reference individuals only via anonymised codes |
| APRA CPS 234 | Access controls commensurate with information sensitivity | Classification metadata field drives access control enforcement; non-negotiable filter |
| EU AI Act Article 10 | Appropriate data governance for AI operational data | Metadata schema governance; quality gate documentation; completeness metrics |
| GDPR Article 25 (Data protection by design and by default) | Data minimisation by default | Status filter defaults to CURRENT; expired documents excluded from retrieval by default without explicit override |
15. Reference Implementations
AWS
- Metadata extraction: Amazon Textract + Comprehend
- Vector store with metadata: Amazon OpenSearch Service (supports rich filter queries on metadata fields)
- Filter composition: Lambda function constructing OpenSearch filter DSL
- Progressive disclosure: AWS Step Functions state machine orchestrating Lambda functions, with configurable relaxation rules
Azure
- Metadata extraction: Azure AI Document Intelligence + Language Service
- Vector store: Azure AI Search (native filter expressions on indexed fields)
- Filter composition: Azure Functions; Azure AI Search OData filter syntax
- Progressive disclosure: Logic Apps workflow with retry logic
GCP
- Metadata extraction: Google Document AI + Cloud Natural Language
- Vector store: Vertex AI Vector Search (with numeric/string filter constraints)
- Filter composition: Cloud Run; Vertex AI filter expression builder
- Progressive disclosure: Cloud Workflows workflow
Self-Hosted
- Metadata extraction: Apache Tika + spaCy
- Vector store: Weaviate (where filter), Qdrant (filter), pgvector (WHERE clause)
- Progressive disclosure: Custom Python orchestration class
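For the pgvector option, a composed filter predicate can be translated into a parameterised WHERE clause ahead of the nearest-neighbour ordering. The sketch below only builds the SQL text and parameter list and does not connect to a database. The table name `chunks`, the column names, and the psycopg-style `%s` placeholders are assumptions; `<=>` is pgvector's cosine distance operator.

```python
# Assumed allow-list of metadata columns on a hypothetical "chunks" table.
FILTERABLE_COLUMNS = frozenset(
    {"jurisdiction", "lifecycle_status", "document_type", "language", "acl_group"}
)


def build_pgvector_query(predicate: dict, query_embedding: str, top_k: int):
    """Build (sql, params) for a filtered nearest-neighbour search.

    predicate maps column name to a list of permitted values.
    query_embedding is the vector in text form, e.g. "[0.1,0.2]".
    """
    if not predicate:
        # Mirrors the minimum-filter requirement: no unconstrained scans.
        raise ValueError("at least one metadata filter is required")
    clauses, params = [], []
    for column in sorted(predicate):
        if column not in FILTERABLE_COLUMNS:
            # Column names are interpolated into SQL, so allow-list them.
            raise ValueError(f"column not filterable: {column!r}")
        clauses.append(f"{column} = ANY(%s)")
        params.append(list(predicate[column]))
    sql = (
        "SELECT chunk_id, content FROM chunks "
        f"WHERE {' AND '.join(clauses)} "
        "ORDER BY embedding <=> %s::vector LIMIT %s"
    )
    params.extend([query_embedding, top_k])
    return sql, params
```

Values travel as bound parameters and only allow-listed column names are interpolated, which keeps filter construction safe against injection through client-supplied filter keys.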
16. Related Patterns
| Pattern ID | Pattern Name | Relationship |
|---|---|---|
| EAAPL-RAG001 | Enterprise RAG | Foundation; RAG010 extends with rich metadata schema and filter composition |
| EAAPL-RAG003 | Secure RAG | ACL filter is a mandatory component of the system filter layer in RAG010 |
| EAAPL-RAG005 | Hybrid RAG | Metadata filters applied to both dense and BM25 retrieval paths in hybrid mode |
| EAAPL-KNW003 | AI Knowledge Corpus Management | Corpus management policies include metadata completeness requirements |
| EAAPL-KNW006 | Corpus Quality Assurance | Quality gates include metadata completeness and value validation |
17. Maturity Assessment
Overall Maturity: Proven — Metadata filtering in vector databases is a well-established capability supported in all major platforms; the pattern is deployed in production across regulated industries; the primary challenges are metadata schema governance and extraction quality, not technology readiness.
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Technology Readiness | 5 | Metadata filtering supported natively in all major vector databases |
| Tooling Ecosystem | 4 | Rich tooling for metadata extraction (Textract, Document Intelligence); filter composition is custom |
| Operational Guidance | 4 | Progressive disclosure patterns are well-understood; schema governance is organisation-specific |
| Security & Compliance | 4 | Classification and ACL filtering are well-established; filter tamper prevention requires careful implementation |
| Scalability Evidence | 5 | Metadata filter indexes scale to billions of vectors in managed services |
| Cost Predictability | 4 | Metadata extraction adds predictable per-document cost; filter execution is cheap |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-03-15 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-08-01 | EAAPL Working Group | Progressive disclosure engine formalised; metadata schema canonical fields defined |
| 1.2 | 2025-01-20 | EAAPL Working Group | Filter tamper prevention security controls added; NLP filter extraction confidence monitoring added |