[EAAPL-RAG010] Contextual RAG with Metadata Filtering

Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Metadata-Driven Contextual Retrieval
Version: 1.2
Maturity: Proven
Tags: rag, metadata-filtering, contextual-retrieval, pre-retrieval-filtering, faceted-search, progressive-disclosure, schema-design
Regulatory Relevance: APRA CPS 234, Privacy Act 1988 APP 3 (data minimisation), ISO/IEC 42001 Section 8.4, EU AI Act Article 10


1. Executive Summary

Contextual RAG with Metadata Filtering extends the foundational RAG pattern with a rich, queryable metadata layer that enables users and system components to scope retrieval with precision before semantic similarity search executes. Rather than searching all vectors globally, the retrieval is constrained by metadata predicates — date ranges, document categories, departmental ownership, classification levels, language, and custom domain attributes — reducing the effective search space, improving precision, and enforcing contextual boundaries that pure semantic search cannot provide.

For enterprise architects and product managers, metadata filtering is the mechanism that transforms a general-purpose knowledge search into a domain-specific, context-aware knowledge assistant. A legal counsel who queries the AI assistant expects answers sourced only from legal documents that are in effect today, apply to the relevant jurisdiction, and are classified at or below their clearance level, not answers drawn from a two-year-old draft or a policy from a different business unit. Metadata filtering encodes these contextual expectations as first-class retrieval parameters, rather than leaving them in the query text in the hope that the LLM respects them. The pattern is the recommended baseline for all production enterprise RAG deployments over large, heterogeneous corpora.


2. Problem Statement

Business Problem

Enterprise knowledge corpora are large and heterogeneous. A financial institution's knowledge base contains policies from multiple jurisdictions, in multiple languages, at multiple classification levels, some current and some superseded. A query for "the current refund policy for corporate clients" could, without metadata filtering, retrieve a superseded policy from 2019, a policy applicable to retail clients, or a draft policy not yet approved. Each of these is semantically similar to the query but contextually wrong.

Technical Problem

Semantic similarity search optimises for vector distance, not contextual appropriateness. A document can be highly semantically similar to a query while being contextually wrong (wrong date, wrong department, wrong jurisdiction, wrong classification). Without metadata filtering, the retrieval layer has no mechanism to enforce contextual constraints — the LLM must infer them from document text, which it does unreliably, or the system must hope that contextually appropriate documents happen to score higher than contextually wrong ones.

Symptoms

  • AI assistant returns answers based on superseded policies or procedures
  • Answers reference documents from the wrong department or business unit, confusing users
  • Multilingual queries return results in the wrong language despite the user's language preference
  • Date-sensitive queries ("what are our current obligations under X") return historical documents
  • Users add explicit temporal and domain constraints to every query ("current", "Australia only", "retail banking") because they have learned the system ignores context

Cost of Inaction

  • User trust erosion: users who receive contextually wrong answers (superseded policy, wrong jurisdiction) lose confidence rapidly
  • Compliance risk: AI system provides outdated regulatory guidance, leading to incorrect compliance decisions
  • Support burden: high volume of "the AI gave me the wrong policy" reports requiring manual correction

3. Context

When to Apply

  • Large heterogeneous corpora where documents have significant variation in date, category, jurisdiction, classification, or status
  • Multi-department deployments where users need results scoped to their business unit or domain
  • Regulated corpora where only current, approved, and appropriately classified documents should be retrieved
  • Multilingual corpora where language-specific retrieval is important
  • Any deployment where temporal freshness of answers is a user requirement

When NOT to Apply

  • Very small corpora (< 10,000 documents) where the global search space is manageable without metadata filtering
  • Corpora with no meaningful metadata differentiation (all documents have the same date, category, and status)
  • Use cases where cross-context retrieval is desired (comparative analysis across time periods or departments)

Prerequisites

  • A well-defined metadata schema agreed across all source systems
  • Metadata populated consistently at ingestion time (missing metadata values must trigger quality alerts)
  • A metadata extraction pipeline for documents that do not have explicit metadata (e.g., PDF files without embedded metadata)
  • A user interface or API that exposes metadata filter parameters to users or calling applications

Industry Applicability

| Industry | Key Metadata Dimensions | Filter Examples |
|---|---|---|
| Financial Services | effective_date, jurisdiction, product_line, classification, regulatory_body | date_range: [today-7d, today]; jurisdiction: AU; status: CURRENT |
| Healthcare | clinical_specialty, formulary_version, guideline_body, publication_date | specialty: oncology; status: ACTIVE; guideline_body: NHMRC |
| Government | department, act_reference, security_classification, review_date | department: Treasury; classification: OFFICIAL-SENSITIVE; status: CURRENT |
| Legal | jurisdiction, court_level, decision_date, area_of_law | jurisdiction: NSW; area_of_law: employment; date_after: 2020-01-01 |
| Technology | product_version, environment, doc_type, team | product: payment-gateway; version: >=3.2; doc_type: runbook; env: prod |

4. Architecture Overview

Contextual RAG with Metadata Filtering introduces a carefully designed metadata schema as a first-class architectural concern, alongside mechanisms for metadata extraction, filter construction, and progressive disclosure of context.

Metadata Schema Design

The metadata schema is the foundation of the pattern. Schema design decisions made at deployment time are difficult to change later because they require re-ingestion of the entire corpus. The schema must balance completeness (capturing all contextually relevant attributes) with practicality (all fields must be extractable and populated reliably).

A canonical enterprise RAG metadata schema includes:

  • Temporal: effective_date, expiry_date, last_modified, publication_date
  • Provenance: source_system, source_document_id, document_version, author, owner_department
  • Classification: security_classification, data_sensitivity, contains_pii
  • Content type: document_type (policy/procedure/runbook/faq/report/contract), language, format
  • Domain: jurisdiction, product_line, regulatory_body, clinical_specialty (domain-specific)
  • Status: lifecycle_status (DRAFT/CURRENT/SUPERSEDED/ARCHIVED)

Every field in the schema must have a defined cardinality (single-value or multi-value), a defined value set or validation rule, and a defined population source (embedded in source document, extracted by NLP, or assigned by governance process).
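As a minimal sketch of what one schema registry entry could look like, the snippet below records cardinality, value set, and population source per field. The class names (`FieldSpec`, `Cardinality`, `Source`) and the choice of which fields are multi-value are illustrative assumptions, not part of the pattern definition.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Cardinality(Enum):
    SINGLE = "single"
    MULTI = "multi"


class Source(Enum):
    EMBEDDED = "embedded"      # present in the source document
    EXTRACTED = "extracted"    # inferred by NLP
    GOVERNANCE = "governance"  # assigned by a governance process


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical registry entry for one metadata field."""
    name: str
    cardinality: Cardinality
    source: Source
    allowed_values: Optional[frozenset] = None  # None: validated by another rule

    def accepts(self, value) -> bool:
        """Check cardinality, then value-set membership if a value set is defined."""
        is_collection = isinstance(value, (list, tuple, set, frozenset))
        if is_collection != (self.cardinality is Cardinality.MULTI):
            return False
        values = list(value) if is_collection else [value]
        if not values:
            return False
        if self.allowed_values is None:
            return True
        return all(v in self.allowed_values for v in values)


# Illustrative subset of the canonical schema (assignments are assumptions).
SCHEMA = {
    spec.name: spec
    for spec in (
        FieldSpec("lifecycle_status", Cardinality.SINGLE, Source.GOVERNANCE,
                  frozenset({"DRAFT", "CURRENT", "SUPERSEDED", "ARCHIVED"})),
        FieldSpec("jurisdiction", Cardinality.MULTI, Source.EXTRACTED),
        FieldSpec("language", Cardinality.SINGLE, Source.EXTRACTED),
    )
}
```

Keeping the three attributes on a single record means the quality gate and the filter composer can both read the same definition instead of duplicating validation rules.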

Metadata Extraction Pipeline

Not all enterprise documents arrive with structured metadata. The extraction pipeline must infer metadata from document content when explicit metadata is absent. This involves:

  • Date extraction: identify effective dates, review dates, and publication dates from document headers, footers, and first paragraphs using NLP or document intelligence services
  • Document type classification: classify the document as policy/procedure/FAQ/contract/report using a fine-tuned text classifier
  • Language detection: identify the document language using langdetect or equivalent
  • Entity extraction for domain metadata: extract regulatory body names, product names, jurisdiction references for domain metadata population
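The date extraction step above can be illustrated with a rule-based stand-in. A production pipeline would more likely use an NLP or document intelligence service, as the list notes; the regular expression, the `extract_effective_date` name, and the 2,000-character header window are all assumptions made for this sketch.

```python
import re
from datetime import date, datetime
from typing import Optional

# Hypothetical pattern: matches phrases such as "Effective Date: 1 July 2024"
# or "Effective from 2024-07-01" near the top of a document.
_EFFECTIVE = re.compile(
    r"effective\s+(?:date|from)?\s*:?\s*"
    r"(\d{4}-\d{2}-\d{2}|\d{1,2}\s+[A-Za-z]+\s+\d{4})",
    re.IGNORECASE,
)


def extract_effective_date(text: str, head_chars: int = 2000) -> Optional[date]:
    """Return the first effective date found in the document header, if any."""
    match = _EFFECTIVE.search(text[:head_chars])
    if match is None:
        return None
    raw = " ".join(match.group(1).split())  # normalise internal whitespace
    for fmt in ("%Y-%m-%d", "%d %B %Y", "%d %b %Y"):
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None  # matched text was not a parseable date
```

Returning `None` rather than guessing lets the quality gate decide whether the document can be indexed or must go to review.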

Metadata quality gates enforce minimum field population requirements before a document is indexed. Documents that fail quality gates enter a "metadata review" queue rather than being indexed without metadata.
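A quality gate of this kind can be sketched in a few lines. The required field list, the function names, and the in-memory `index` and `review_queue` containers are assumptions for illustration; a real deployment would read the requirements from the schema registry and write to a durable queue.

```python
from typing import Any

# Illustrative minimum field set; the real list comes from the schema registry.
REQUIRED_FIELDS = ("effective_date", "lifecycle_status", "document_type", "language")
VALID_STATUS = {"DRAFT", "CURRENT", "SUPERSEDED", "ARCHIVED"}


def quality_gate(metadata: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the document passes."""
    problems = [
        f"missing:{field}"
        for field in REQUIRED_FIELDS
        if metadata.get(field) in (None, "", [])
    ]
    status = metadata.get("lifecycle_status")
    if status and status not in VALID_STATUS:
        problems.append(f"invalid:lifecycle_status={status}")
    return problems


def route(doc_id: str, metadata: dict[str, Any], index: dict, review_queue: list) -> None:
    """Index passing documents; send failures to the metadata review queue."""
    problems = quality_gate(metadata)
    if problems:
        review_queue.append((doc_id, problems))  # never indexed without metadata
    else:
        index[doc_id] = metadata
```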

Filter Construction

Metadata filters can be applied at three levels:

  1. System-imposed filters (always on, not user-controllable): ACL filters (EAAPL-RAG003), lifecycle status filter (exclude ARCHIVED and DRAFT by default), classification ceiling (user's maximum clearance)

  2. User-provided explicit filters: the user or calling application specifies filter values explicitly (e.g., jurisdiction: AU, date_after: 2024-01-01, document_type: policy). These can be extracted from the user's query text (NLP-based filter extraction: "current policy" → status: CURRENT) or provided via a structured filter UI

  3. Context-inferred filters: the system infers filters from context — the user's profile (department, role, jurisdiction), the conversation history, or the query classification. A user authenticated with a jurisdiction of "NSW" automatically has jurisdiction IN [NSW, AU] applied to their queries without specifying it explicitly

Filters are composed with AND logic by default; OR logic is available for multi-value fields (a user's permitted departments, multiple accepted languages).
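The composition rule can be sketched as follows. This is an in-memory illustration under stated assumptions: the `Clause` type, the operator names, and the numeric `classification_rank` field are hypothetical, and a real implementation would translate the composed predicate into the vector database's own filter syntax rather than evaluate it in Python.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass(frozen=True)
class Clause:
    field: str
    op: str     # one of "eq", "in", "lte", "gte"
    value: Any


def compose(system: Iterable[Clause], inferred: Iterable[Clause],
            user: Iterable[Clause]) -> list[Clause]:
    """Concatenate the three filter layers; every clause is ANDed at match time."""
    return list(system) + list(inferred) + list(user)


def matches(metadata: dict, predicate: Iterable[Clause]) -> bool:
    for clause in predicate:
        actual = metadata.get(clause.field)
        if actual is None:
            return False                     # missing metadata never matches
        if clause.op == "eq":
            ok = actual == clause.value
        elif clause.op == "in":
            ok = actual in clause.value      # OR across the permitted values
        elif clause.op == "lte":
            ok = actual <= clause.value
        elif clause.op == "gte":
            ok = actual >= clause.value
        else:
            raise ValueError(f"unsupported operator: {clause.op}")
        if not ok:
            return False
    return True


predicate = compose(
    system=[Clause("status", "eq", "CURRENT"),
            Clause("classification_rank", "lte", 1)],       # clearance ceiling
    inferred=[Clause("jurisdiction", "in", frozenset({"NSW", "AU"}))],
    user=[Clause("effective_date", "gte", "2024-01-01")],   # ISO dates sort as text
)
```

Because the layers are only ever ANDed together, a user-supplied clause can narrow the system-imposed scope but cannot widen it.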

Progressive Disclosure

When metadata filtering produces too few results (fewer than a configured minimum number of relevant chunks), a progressive disclosure strategy relaxes filters in a defined order until sufficient results are found. For example, if {status: CURRENT, jurisdiction: NSW} produces 0 results, the system first relaxes jurisdiction to {AU} and retries, and only then relaxes status to include SUPERSEDED. Each relaxation is logged, and the response includes a provenance note: "No current NSW-specific documents found; the following is based on the applicable national policy."

Progressive disclosure minimises "no results" responses while keeping metadata filtering transparent: the user always knows which filters were applied and which were relaxed.
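A minimal sketch of the relaxation loop, using the NSW example above, might look like this. The function name, the `NON_RELAXABLE` set, and the toy in-memory corpus are assumptions; the `search` callable stands in for a filtered vector search.

```python
from typing import Callable

Filters = dict[str, frozenset]

# Filters that no relaxation policy may touch (illustrative set).
NON_RELAXABLE = frozenset({"acl", "classification"})


def progressive_search(
    search: Callable[[Filters], list],
    filters: Filters,
    relaxation_steps: list[tuple[str, frozenset]],
    min_results: int,
) -> tuple[list, Filters, list[str]]:
    """Run the search, widening one filter per step until enough results exist."""
    forbidden = [field for field, _ in relaxation_steps if field in NON_RELAXABLE]
    if forbidden:
        raise ValueError(f"relaxation policy touches non-relaxable filters: {forbidden}")

    log: list[str] = []
    results = search(filters)
    for field, widened in relaxation_steps:
        if len(results) >= min_results:
            break
        before = sorted(filters.get(field, frozenset()))
        log.append(f"{field}: {before} -> {sorted(widened)}")
        filters = {**filters, field: widened}
        results = search(filters)
    return results, filters, log


# Toy in-memory corpus standing in for the vector store.
CORPUS = [
    {"id": 1, "status": "CURRENT", "jurisdiction": "AU", "acl": "staff"},
    {"id": 2, "status": "SUPERSEDED", "jurisdiction": "NSW", "acl": "staff"},
]


def toy_search(filters: Filters) -> list:
    return [d for d in CORPUS if all(d[f] in allowed for f, allowed in filters.items())]
```

The returned log is the raw material for the provenance note, and validating the policy before the first search means a misconfigured relaxation order fails loudly instead of silently widening access.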


5. Architecture Diagram

flowchart TD
    subgraph Ingestion["Ingestion Pipeline"]
        A[Source Documents]
        B[Metadata Extractor]
        C[Vector Store]
    end
    subgraph Query["Query Pipeline"]
        D[User + Profile]
        E[Filter Composer]
        F{Threshold Check}
        G[Progressive Disclosure]
    end
    subgraph Generation["Generation"]
        H[Re-ranker]
        I[LLM + Context]
    end
    A --> B -->|metadata + chunks| C
    D -->|query + ACL filters| E
    E -->|filtered search| C
    C --> F
    F -->|too few results| G
    G -->|relax filters| E
    F -->|sufficient| H
    H --> I --> D
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#dbeafe,stroke:#3b82f6
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Metadata Schema Registry | Configuration | Define and version-control the canonical metadata schema | PostgreSQL schema table; Confluent Schema Registry; custom YAML definition | Critical |
| Metadata Extraction Pipeline | Data Processing | Extract metadata from documents lacking explicit metadata | AWS Comprehend, Azure AI Language, Apache Tika, custom NLP | High |
| Metadata Quality Gate | Data Quality | Validate metadata completeness and value correctness before indexing | great_expectations, pydantic, custom validators | Critical |
| Metadata Review Queue | Operations | Hold documents failing quality gate for human metadata review | SQS, Azure Service Bus, PostgreSQL queue table | High |
| Filter Extractor (NLP) | NLP | Extract metadata filter values from natural language query | LLM-based extraction; rule-based for common patterns | High |
| Profile-Based Filter Injector | Business Logic | Add filters derived from authenticated user's profile | Custom middleware reading from identity provider | High |
| System Filter Composer | Security | Add non-negotiable system filters (ACL, classification ceiling, status) | Custom security middleware | Critical |
| Filter Composer | Orchestration | Compose all filter sources into a single metadata filter predicate | Custom Python; vector DB filter syntax (Pinecone filter, Weaviate where clause, pgvector WHERE) | Critical |
| Progressive Disclosure Engine | Business Logic | Relax filters in defined order when results are insufficient | Custom Python with configurable relaxation rules | Medium |
| Filter Provenance Annotator | UX | Surface applied filters and any relaxations in response metadata | Custom formatter | High |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Source System | Produce document with raw metadata | Document bytes + source metadata |
| 2 | Metadata Extraction Pipeline | Infer missing metadata fields from document content | Enriched metadata record |
| 3 | Metadata Quality Gate | Validate all required fields present and valid | Pass or Fail decision |
| 4 | Chunking Engine | Split document; inherit full metadata per chunk | Chunks with {metadata: {effective_date, status, category, jurisdiction, ...}} |
| 5 | Vector DB | Upsert chunk with metadata payload | Indexed chunk with filterable metadata |
| 6 | User | Submit query + user profile context | Query + {department, jurisdiction, clearance, language} |
| 7 | Filter Extractor | Parse temporal, domain, and type cues from query text | Extracted filters: {date_after: "2024-01-01", status: "CURRENT"} |
| 8 | Profile-Based Filter Injector | Add user-profile-derived filters | Profile filters: {jurisdiction: "AU", language: "en"} |
| 9 | System Filter Composer | Add ACL filter + classification ceiling + default status filter | System filters: {acl: user_groups, classification: <= user_clearance, status: CURRENT} |
| 10 | Filter Composer | Combine all filter sources with AND logic | Final filter predicate |
| 11 | Vector ANN Search | Execute search with filter predicate | Top-K results within filtered space |
| 12 | Progressive Disclosure | If result count < minimum: relax the next filter in the defined relaxation order; re-execute | Sufficient results; relaxation log |
| 13 | Re-ranker | Re-rank by semantic relevance and metadata freshness (recency bonus) | Top-N re-ranked chunks |
| 14 | Context Assembler | Assemble prompt with metadata annotations per chunk | Annotated prompt |
| 15 | LLM | Generate answer | Response |
| 16 | Filter Provenance Annotator | Append filter summary to response | "Results filtered to: current Australian policy documents (2024)" |
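Step 4 of the primary flow (each chunk inherits the full document metadata) can be sketched as below. Fixed-size character splitting and the `chunk_id` format are assumptions chosen to keep the example short; any chunking strategy works as long as every chunk carries its own copy of the metadata.

```python
def chunk_with_metadata(doc_id: str, text: str, metadata: dict,
                        chunk_size: int = 500) -> list[dict]:
    """Split text into fixed-size chunks, each carrying a copy of the metadata."""
    chunks = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        chunks.append({
            "chunk_id": f"{doc_id}#{index}",        # hypothetical ID scheme
            "text": text[start:start + chunk_size],
            "metadata": dict(metadata),             # copy, so chunks stay independent
        })
    return chunks
```

Copying the metadata per chunk is what makes every vector individually filterable at query time, at the cost of the extra storage noted under Cost Considerations.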

Error Flow

| Error Condition | Detection | Recovery |
|---|---|---|
| All filter combinations return zero results | Zero-result detection after max progressive disclosure relaxations | Return "No documents found matching your query parameters"; do not generate from empty context |
| Metadata field missing on new document type | Quality gate failure | Route to metadata review queue; do not index without required fields |
| Filter NLP extraction produces incorrect date | Low-confidence extraction detection | Present extracted filter to user for confirmation ("Did you mean: results from 2024?") |
| Progressive disclosure relaxes ACL filter (must never happen) | ACL filter marked as non-relaxable | ACL filter is always in the non-relaxable set; throw error if relaxation touches ACL |
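The low-confidence confirmation path in the error flow above can be sketched with a rule-based extractor. The patterns and the hard-coded confidence values are purely illustrative assumptions (an LLM-based extractor would supply its own scores); the 0.80 threshold mirrors the monitoring threshold used elsewhere in this pattern.

```python
import re
from dataclasses import dataclass


@dataclass
class ExtractedFilter:
    field: str
    value: str
    confidence: float


def extract_filters(query: str) -> list[ExtractedFilter]:
    """Rule-based extraction of status and date cues (illustrative rules only)."""
    found = []
    if re.search(r"\bcurrent\b", query, re.IGNORECASE):
        found.append(ExtractedFilter("status", "CURRENT", 0.95))
    anchored = re.search(r"\b(?:since|after|from)\s+(20\d{2})\b", query, re.IGNORECASE)
    if anchored:
        found.append(ExtractedFilter("date_after", f"{anchored.group(1)}-01-01", 0.90))
    else:
        bare = re.search(r"\b(20\d{2})\b", query)
        if bare:
            # A bare year is ambiguous: it may be a date filter or part of a title.
            found.append(ExtractedFilter("date_after", f"{bare.group(1)}-01-01", 0.50))
    return found


def split_by_confidence(filters: list[ExtractedFilter], threshold: float = 0.80):
    """Return (filters to apply directly, filters to confirm with the user)."""
    apply = [f for f in filters if f.confidence >= threshold]
    confirm = [f for f in filters if f.confidence < threshold]
    return apply, confirm
```

Filters in the `confirm` list would be shown to the user ("Did you mean: results from 2024?") rather than applied silently.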

8. Security Considerations

Filter Tamper Prevention

Metadata filter parameters must be constructed server-side from trusted sources (identity provider claims, server-side user profile). Client-supplied filter parameters must be validated against the user's authorised filter scope — a user must not be able to supply a filter parameter that elevates their access beyond their authorised scope. For example, a user with clearance: OFFICIAL-SENSITIVE must not be able to supply classification: PROTECTED as a filter parameter.
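A server-side validation step along these lines could enforce that rule. The ordering of classification levels in `LEVELS`, the `FilterScopeError` type, and the `allowed_fields` parameter are assumptions for illustration; the authoritative ordering should come from the organisation's classification policy.

```python
# Illustrative ordering, lowest to highest; not an authoritative scheme.
LEVELS = ["OFFICIAL", "OFFICIAL-SENSITIVE", "PROTECTED", "SECRET"]
RANK = {name: position for position, name in enumerate(LEVELS)}


class FilterScopeError(Exception):
    """Raised when a client-supplied filter exceeds the user's authorised scope."""


def validate_client_filters(client_filters: dict, user_clearance: str,
                            allowed_fields: set) -> dict:
    """Reject unknown fields and any classification above the user's clearance."""
    unknown = set(client_filters) - allowed_fields
    if unknown:
        raise FilterScopeError(f"fields not user-controllable: {sorted(unknown)}")
    requested = client_filters.get("classification")
    if requested is not None:
        if requested not in RANK:
            raise FilterScopeError(f"unknown classification: {requested}")
        if RANK[requested] > RANK[user_clearance]:
            raise FilterScopeError(
                f"{requested} exceeds clearance {user_clearance}")
    return dict(client_filters)
```

The `user_clearance` argument is expected to come from identity provider claims, never from the request body.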

OWASP LLM Top 10 Mitigations

| OWASP LLM Risk | Metadata Filtering Specific Concern | Mitigation |
|---|---|---|
| LLM06: Sensitive Information Disclosure | Metadata filter bypass via client-supplied filter injection | Server-side filter construction only; validate all client filter parameters against authorised scope |
| LLM04: Model Denial of Service | Very broad filter (no constraints) causes full-index scan | Minimum filter requirements enforced (at least one of: status filter or date range or category filter) |
| LLM09: Overreliance | User assumes all relevant documents have been searched; filter silently excludes important context | Filter provenance notation in response; progressive disclosure when results < minimum |
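The minimum filter requirement in the LLM04 row can be enforced with a small guard before the search executes. The field names in `CONSTRAINING_FIELDS` and the function name are assumptions for this sketch.

```python
# At least one of these must be present and non-empty (names are illustrative).
CONSTRAINING_FIELDS = ("status", "date_range", "category")


def require_minimum_filter(filters: dict) -> dict:
    """Reject predicates that would amount to an unconstrained full-index scan."""
    if not any(filters.get(field) for field in CONSTRAINING_FIELDS):
        raise ValueError(
            "at least one of status, date_range, or category must be set")
    return filters
```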

9. Governance Considerations

Metadata Schema Governance

The metadata schema is a shared enterprise data contract. Changes to the schema require a formal change management process: RFC, review by source system owners and data stewards, backward compatibility assessment, and a migration plan for existing indexed chunks. The schema registry must version all schema changes.

Filter Relaxation Policy Governance

The progressive disclosure relaxation order and conditions must be formally agreed by data stewards and approved by the AI governance board. The relaxation policy determines what contextual boundaries can be crossed automatically (e.g., can jurisdiction be relaxed from state to national? Can status be relaxed from CURRENT to SUPERSEDED?) — these are governance decisions, not engineering decisions.

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Metadata Schema Version Log | Data Architecture | Per version | Track schema evolution; migration history |
| Metadata Completeness Dashboard | Data Quality | Daily | Monitor field population rates per source and document type |
| Filter Relaxation Audit Log | AI Operations | Per event | Record every progressive disclosure event; detect over-relaxation |
| Filter Extraction Accuracy Report | AI Operations | Monthly | Validate NLP-based filter extraction against ground truth |

10. Operational Considerations

Monitoring

| Metric | Alert Threshold | Notes |
|---|---|---|
| Metadata completeness rate (per field, per source) | < 90% for required fields | Source-specific data quality issue |
| Progressive disclosure trigger rate | > 20% of queries | Filter schema too restrictive or corpus coverage gap |
| Zero-result rate (after max relaxation) | > 5% of queries | Corpus coverage issue; alert knowledge manager |
| Filter NLP extraction confidence (average) | < 0.80 | Retrain or adjust NLP filter extractor |
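As a sketch of how the first metric could be computed, the functions below derive per-field completeness rates from a batch of metadata records and flag fields under the 90% threshold. The function names and the record shape (one dict per document) are assumptions.

```python
def completeness_rates(records: list[dict], required_fields: list[str]) -> dict[str, float]:
    """Fraction of records with a non-empty value, per required field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in required_fields}
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "", [])) / total
        for field in required_fields
    }


def fields_below_threshold(rates: dict[str, float], threshold: float = 0.90) -> list[str]:
    """Fields whose completeness rate should raise an alert."""
    return sorted(field for field, rate in rates.items() if rate < threshold)
```

Running this per source system, rather than over the whole corpus, keeps the alert actionable because it points at the specific feed that is under-populating a field.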

Service Level Objectives

| SLO | Target | Notes |
|---|---|---|
| Metadata quality gate pass rate | ≥ 95% of ingested documents | Measured per source |
| Query with zero progressive disclosure | ≥ 80% | Most queries should find results within primary filter scope |
| Filtered retrieval latency overhead vs. unfiltered | ≤ 10 ms additional | Metadata filter execution should be negligible |

11. Cost Considerations

Cost Drivers

| Cost Driver | Notes | Optimisation |
|---|---|---|
| Metadata extraction NLP (at ingestion) | Per-document inference cost | Batch processing; cache inference results per document content hash |
| Metadata quality review (human) | Manual review of documents failing quality gate | Improve extraction model quality to reduce manual review volume |
| Vector DB metadata storage | Each chunk stores a full metadata object; increases storage per vector | Compress metadata; store metadata in separate metadata store, reference by chunk_id |
| Filter index maintenance | Some vector DBs charge for metadata index updates | Use inverted metadata indexes for high-cardinality filterable fields |

Indicative Cost Range

| Deployment Scale | Metadata Overhead vs. Base RAG |
|---|---|
| Small | +10–20% (extraction cost dominates) |
| Medium | +5–10% (extraction amortised; filter execution cheap) |
| Large | +3–5% (highly amortised; filter execution negligible) |

12. Trade-Off Analysis

Filter Enforcement Strictness

| Option | Result Completeness | User Experience Risk | Compliance Risk | Recommendation |
|---|---|---|---|---|
| Strict filters only (no relaxation) | May produce zero results | High (frustrating) | Low | For security-classified corpora only |
| Progressive disclosure with logging | High | Low (graceful degradation) | Low (transparent) | Default recommendation |
| No filters (global search) | Highest | Contextually wrong results | High | Not recommended for enterprise |

Metadata Extraction Automation

| Option | Metadata Quality | Ingestion Cost | Recommended For |
|---|---|---|---|
| Fully automated NLP extraction | Medium-High | Low | High-volume, homogeneous corpora |
| Automated + human review for failed | High | Medium | Mixed corpora; regulated use cases |
| Manual metadata assignment | Highest | Very High | Critical, low-volume corpora |

Architectural Tensions

| Tension | Trade-off | Recommendation |
|---|---|---|
| Filter granularity vs. schema complexity | More dimensions: precise filtering; more schema fields: harder to maintain | Start with 5–8 core dimensions; add domain-specific fields per use case |
| Pre-retrieval filtering vs. post-retrieval filtering | Pre-retrieval: fewer LLM tokens consumed; post-retrieval: higher recall | Pre-retrieval filtering always; post-retrieval classification labelling for output |

13. Failure Modes

| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Metadata extraction assigns wrong effective_date (OCR error) | Medium | High | Spot-check QA; date anomaly detection | Human review queue; date correction pipeline |
| Progressive disclosure crosses ACL boundary (must not happen) | Very Low | Critical | ACL filter marked non-relaxable; audit log | Immediate alert; rollback progressive disclosure configuration |
| Status field not updated when document is superseded (stale CURRENT status) | Medium | High | Source system change notification + status sync job | Automated status sync from source system; stale status alert |
| Filter extractor identifies incorrect date from ambiguous query | Medium | Medium | Low-confidence extraction detection | Confirm extracted filters with user; "Did you mean: results from 2024?" |

14. Regulatory Considerations

| Regulation | Requirement | Metadata Filtering Response |
|---|---|---|
| Privacy Act 1988 APP 3 | Collect only the minimum personal information necessary | Date/category/status metadata contains no personal information; domain metadata may reference individuals only via anonymised codes |
| APRA CPS 234 | Access controls commensurate with information sensitivity | Classification metadata field drives access control enforcement; non-negotiable filter |
| EU AI Act Article 10 | Appropriate data governance for AI operational data | Metadata schema governance; quality gate documentation; completeness metrics |
| GDPR Article 25 (Data Protection by Design and by Default) | Data minimisation by default | Status filter defaults to CURRENT; expired documents excluded from retrieval by default without explicit override |

15. Reference Implementations

AWS

  • Metadata extraction: Amazon Textract + Comprehend
  • Vector store with metadata: Amazon OpenSearch Service (supports rich filter queries on metadata fields)
  • Filter composition: Lambda function constructing OpenSearch filter DSL
  • Progressive disclosure: AWS Step Functions state machine invoking Lambda, with configurable relaxation rules

Azure

  • Metadata extraction: Azure AI Document Intelligence + Language Service
  • Vector store: Azure AI Search (native filter expressions on indexed fields)
  • Filter composition: Azure Functions; Azure AI Search OData filter syntax
  • Progressive disclosure: Logic Apps workflow with retry logic

GCP

  • Metadata extraction: Google Document AI + Cloud Natural Language
  • Vector store: Vertex AI Vector Search (with numeric/string filter constraints)
  • Filter composition: Cloud Run; Vertex AI filter expression builder
  • Progressive disclosure: Google Cloud Workflows workflow

Self-Hosted

  • Metadata extraction: Apache Tika + spaCy
  • Vector store: Weaviate (where filter), Qdrant (filter), pgvector (WHERE clause)
  • Progressive disclosure: Custom Python orchestration class
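For the pgvector option, a composed filter could be translated into a parameterised query roughly as follows. This is a sketch under several assumptions: a hypothetical `chunks` table with `chunk_id`, `content`, and `embedding` columns, metadata stored as ordinary columns, and psycopg-style named placeholders. The function only builds the SQL text and parameters; it does not connect to a database. `<=>` is pgvector's cosine distance operator.

```python
# Only allow-listed column names are ever interpolated into the SQL text.
FILTERABLE_COLUMNS = {"status", "jurisdiction", "language", "document_type"}


def build_pgvector_query(filters: dict, query_vec: str, top_k: int) -> tuple[str, dict]:
    """Build a filtered nearest-neighbour query; values are passed as parameters."""
    if not filters:
        raise ValueError("refusing to build an unfiltered query")
    where, params = [], {"query_vec": query_vec, "top_k": top_k}
    for column, value in sorted(filters.items()):
        if column not in FILTERABLE_COLUMNS:
            raise ValueError(f"column not filterable: {column}")
        if isinstance(value, (list, tuple, set, frozenset)):
            where.append(f"{column} = ANY(%({column})s)")   # OR across values
            params[column] = sorted(value)
        else:
            where.append(f"{column} = %({column})s")
            params[column] = value
    sql = (
        "SELECT chunk_id, content FROM chunks WHERE "
        + " AND ".join(where)
        + " ORDER BY embedding <=> %(query_vec)s::vector LIMIT %(top_k)s"
    )
    return sql, params
```

The resulting pair would be passed to a cursor's `execute(sql, params)`; restricting interpolation to allow-listed column names keeps client-supplied values out of the SQL text.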

16. Related Patterns

| Pattern ID | Pattern Name | Relationship |
|---|---|---|
| EAAPL-RAG001 | Enterprise RAG | Foundation; RAG010 extends with rich metadata schema and filter composition |
| EAAPL-RAG003 | Secure RAG | ACL filter is a mandatory component of the system filter layer in RAG010 |
| EAAPL-RAG005 | Hybrid RAG | Metadata filters applied to both dense and BM25 retrieval paths in hybrid mode |
| EAAPL-KNW003 | AI Knowledge Corpus Management | Corpus management policies include metadata completeness requirements |
| EAAPL-KNW006 | Corpus Quality Assurance | Quality gates include metadata completeness and value validation |

17. Maturity Assessment

Overall Maturity: Proven — Metadata filtering in vector databases is a well-established capability supported in all major platforms; the pattern is deployed in production across regulated industries; the primary challenges are metadata schema governance and extraction quality, not technology readiness.

| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Technology Readiness | 5 | Metadata filtering supported natively in all major vector databases |
| Tooling Ecosystem | 4 | Rich tooling for metadata extraction (Textract, Document Intelligence); filter composition is custom |
| Operational Guidance | 4 | Progressive disclosure patterns are well-understood; schema governance is organisation-specific |
| Security & Compliance | 4 | Classification and ACL filtering are well-established; filter tamper prevention requires careful implementation |
| Scalability Evidence | 5 | Metadata filter indexes scale to billions of vectors in managed services |
| Cost Predictability | 4 | Metadata extraction adds predictable per-document cost; filter execution is cheap |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-03-15 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-08-01 | EAAPL Working Group | Progressive disclosure engine formalised; metadata schema canonical fields defined |
| 1.2 | 2025-01-20 | EAAPL Working Group | Filter tamper prevention security controls added; NLP filter extraction confidence monitoring added |