EAAPL: Enterprise AI Architecture Pattern Library


[EAAPL-RAG002] Multi-Source Retrieval-Augmented Generation

Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Heterogeneous Source Integration
Version: 1.3
Maturity: Proven
Tags: rag, multi-source, source-connectors, normalisation, attribution, conflict-resolution, federation
Regulatory Relevance: APRA CPS 230, Privacy Act 1988, ISO/IEC 42001 Section 6.1, NIST AI RMF (Map 1.5)


1. Executive Summary

Enterprise knowledge is rarely stored in a single system. Policy documents live in SharePoint, technical runbooks in Confluence, customer data in Salesforce, financial reports in ERP exports, and operational metrics in data warehouses. Multi-Source RAG extends the foundational RAG pattern (EAAPL-RAG001) to orchestrate retrieval across these heterogeneous sources simultaneously, normalise their disparate formats into a unified embedding space, attribute answers to their authoritative origin, and resolve conflicts when sources contradict each other.

For enterprise leaders, this pattern eliminates the organisational friction of designating a "single system of record" before deploying AI assistants. Instead, the system weights sources by their configured authority for the query context, surfacing the most relevant answer regardless of where it lives. The business outcome is a knowledge assistant that answers questions by drawing simultaneously on policy, product, operational, and people data: the equivalent of asking the most informed person in the organisation, who happens to have read everything. Pilot deployments in financial services and government have demonstrated 35–50% reductions in time-to-answer for cross-functional knowledge queries, with source attribution enabling compliance teams to audit the basis of every answer.


2. Problem Statement

Business Problem

Enterprise knowledge is siloed across dozens of systems maintained by different teams with different governance models. A question as straightforward as "What is our current refund policy for corporate clients?" may require consulting a CRM record, a policy document in SharePoint, a product team's Confluence page, and a legal team's approved exceptions list — across four systems with four different search interfaces.

Technical Problem

Different source systems produce documents in incompatible formats (PDF, HTML, JSON, database rows, Markdown), with incompatible metadata schemas, at different refresh rates, with different access control models. A naive multi-source approach that concatenates results from independent keyword searches produces low-quality, unranked context with no cross-source relevance normalisation. When sources disagree (e.g., a policy document and a Confluence page cite different values for the same policy parameter), the system has no principled mechanism to resolve the conflict.

Symptoms

  • Users querying the AI assistant receive answers drawn only from one source system even when better information exists elsewhere
  • Source attribution in responses is absent or incorrect, making audit trails unreliable
  • Contradictory answers are generated when the same question is asked repeatedly, sourced from different systems on different runs
  • Integration teams spend weeks building a bespoke connector for each new data source because no reusable connector framework exists

Cost of Inaction

  • Incomplete answers leading to incorrect operational decisions or customer communications
  • Connector fragmentation: each new source system requires bespoke integration, with no amortisation across the knowledge platform
  • Inability to unify enterprise search into a single interface — users maintain separate search habits for each system

3. Context

When to Apply

  • Organisations with knowledge spread across 3+ source systems that users need to query simultaneously
  • Deployments where the authoritative source of truth varies by document type (legal: SharePoint; operational: Confluence; customer: CRM)
  • Scenarios where source attribution and conflict resolution are required for compliance
  • Enterprise search modernisation programmes replacing multi-system search with a unified AI-powered interface

When NOT to Apply

  • Single-source deployments (use EAAPL-RAG001)
  • Sources that are geographically or organisationally distributed under data sovereignty constraints (use EAAPL-RAG004 Federated RAG)
  • Real-time data sources with sub-minute freshness requirements (use EAAPL-RAG006 Streaming RAG)
  • Sources that contain exclusively structured relational data (use SQL-generation or Graph RAG patterns)

Prerequisites

  • API access or connector availability for all target source systems
  • Unified identity model enabling ACL resolution across source systems (e.g., Azure AD groups that map to permissions in each source)
  • Metadata schema alignment work to identify common fields (document type, effective date, owner, classification) across sources
  • Source governance: each source system must have a designated data owner responsible for quality and freshness

Industry Applicability

| Industry | Source Systems Typically Integrated | Primary Benefit |
| --- | --- | --- |
| Financial Services | Core banking exports, compliance manuals (SharePoint), risk frameworks (Confluence), regulatory filings (PDF archive) | Single-pane regulatory Q&A across all documentation |
| Healthcare | Clinical guidelines (PDF), pharmacy formulary (EHR export), policy manuals (SharePoint), research publications (PubMed API) | Clinician decision support across all knowledge domains |
| Government | Legislation (PDF), internal policy (SharePoint), operational procedures (Confluence), case management notes (TRIM) | Unified public servant knowledge assistant |
| Retail | Product catalogue (PIM API), supplier agreements (SharePoint), customer FAQs (CMS), warranty terms (PDF) | Customer service automation with authoritative sourcing |
| Technology | Developer docs (Confluence/GitLab wikis), runbooks (Notion/Markdown), API specifications (OpenAPI), issue history (Jira exports) | Integrated developer knowledge assistant |

4. Architecture Overview

Multi-Source RAG introduces a Source Abstraction Layer and a Normalisation Engine between raw source connectors and the unified embedding pipeline. These two components are the architectural differentiators from single-source RAG.

Source Connector Framework

Rather than building bespoke connectors, the pattern mandates a connector framework that abstracts each source into a standard document event model: {document_id, source_system, content_raw, metadata, acl_principals, fetched_at, content_hash}. Connectors are classified as pull (polling schedule), push (webhook/event stream), or on-demand (query-time API call). The connector framework handles authentication, rate limiting, error handling, and delta detection (comparing content hashes to avoid re-ingesting unchanged documents).
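The standard document event model and hash-based delta detection described above can be sketched as follows. The field names follow the model in the text; the `DocumentEvent` class, the `make_event` and `needs_ingest` helpers, SHA-256 as the content hash, and the in-memory `seen` dictionary are illustrative assumptions rather than part of the pattern.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DocumentEvent:
    """Canonical event emitted by every connector (fields as listed in the text)."""
    document_id: str
    source_system: str
    content_raw: bytes
    metadata: dict
    acl_principals: tuple
    fetched_at: datetime
    content_hash: str


def make_event(document_id, source_system, content_raw, metadata, acl_principals):
    """Build an event; SHA-256 is an assumed choice of content hash."""
    return DocumentEvent(
        document_id=document_id,
        source_system=source_system,
        content_raw=content_raw,
        metadata=dict(metadata),
        acl_principals=tuple(acl_principals),
        fetched_at=datetime.now(timezone.utc),
        content_hash=hashlib.sha256(content_raw).hexdigest(),
    )


def needs_ingest(event, seen):
    """Return True (and record the hash) if the content is new or has changed.

    `seen` maps (source_system, document_id) -> last ingested content hash.
    A real deployment would keep this in a durable store, not a dict.
    """
    key = (event.source_system, event.document_id)
    if seen.get(key) == event.content_hash:
        return False
    seen[key] = event.content_hash
    return True
```

Keying the hash store on both source system and document ID means two systems can reuse the same local identifier without one masking changes in the other.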

Each source system is assigned a source profile: a configuration object capturing the source's authority score by document type, the metadata mapping from source schema to canonical schema, the ACL mapping from source permissions to enterprise identity groups, and the refresh schedule. Source profiles are stored in a configuration registry, not hardcoded, enabling new sources to be onboarded without code changes.
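As an illustration, a source profile could be represented as a small configuration object and used to drive metadata mapping. This is a sketch under assumptions: the `SourceProfile` class, the in-memory `REGISTRY`, the cron-style schedule string, and all example values are hypothetical; only the four kinds of content (authority scores, metadata mapping, ACL mapping, refresh schedule) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceProfile:
    source_system: str
    authority_by_doc_type: dict  # document type -> authority score in [0, 1]
    metadata_mapping: dict       # source field name -> canonical field name
    acl_mapping: dict            # source permission -> enterprise identity group
    refresh_schedule: str        # cron expression (assumed representation)


# Stand-in for the configuration registry; values are examples only.
REGISTRY = {
    "sharepoint": SourceProfile(
        source_system="sharepoint",
        authority_by_doc_type={"policy": 1.0, "procedure": 0.7},
        metadata_mapping={"Modified By": "owner", "Last Modified": "effective_date"},
        acl_mapping={"Compliance Readers": "grp-compliance"},
        refresh_schedule="0 * * * *",
    ),
}


def map_metadata(profile, source_fields):
    """Rename known source fields to canonical names.

    Unknown fields are returned separately so they can be flagged for
    mapping review instead of blocking ingestion.
    """
    canonical, unmapped = {}, []
    for name, value in source_fields.items():
        if name in profile.metadata_mapping:
            canonical[profile.metadata_mapping[name]] = value
        else:
            unmapped.append(name)
    return canonical, unmapped
```

Because the mapping lives in data rather than code, onboarding a new source in this sketch means adding a registry entry, not changing `map_metadata`.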

Normalisation Engine

Raw documents from different sources arrive in incompatible formats. The normalisation engine applies format-specific parsers (PDF via Apache Tika, HTML via BeautifulSoup, JSON via schema mapping, Markdown via AST parser) to produce a canonical document structure. Critically, normalisation also includes semantic normalisation: the engine identifies domain-specific terminology that differs across sources (e.g., "customer" in CRM vs. "client" in legal documents vs. "policyholder" in insurance) and maps them to canonical concepts. This enables cross-source retrieval of semantically equivalent content regardless of surface form.
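The simplest form of semantic normalisation is a dictionary lookup. The sketch below assumes a hand-maintained term table (`CANONICAL_TERMS`, hypothetical) and handles only whole-word matches with an optional plural "s"; an ontology-backed normaliser would be considerably richer.

```python
import re

# Hypothetical mapping: source-specific surface term -> canonical concept.
CANONICAL_TERMS = {
    "client": "customer",
    "policyholder": "customer",
}

# Whole-word match on any known term, with an optional trailing plural "s".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in CANONICAL_TERMS) + r")(s?)\b",
    re.IGNORECASE,
)


def normalise_terms(text):
    """Return (rewritten_text, canonical_concepts_found).

    Replacements are emitted in lower case; the original text should be
    retained alongside the normalised form so citations can quote the source.
    """
    concepts = set()

    def _replace(match):
        canonical = CANONICAL_TERMS[match.group(1).lower()]
        concepts.add(canonical)
        return canonical + match.group(2)

    return _PATTERN.sub(_replace, text), concepts
```

Returning the set of concepts as well as the rewritten text lets the concepts be stored as chunk metadata, which keeps the original wording available for attribution.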

Unified Embedding Space

All normalised chunks are embedded using a single embedding model, producing a unified vector space where semantic similarity is meaningful across source provenance. This is why embedding model lock-in is more consequential in multi-source deployments: different embedding models produce incomparable vector spaces, so the model must be fixed across all sources and changed atomically.

Source provenance (which system the chunk came from) is stored as metadata, not encoded in the embedding. This separation allows retrieval to be source-agnostic (finding the most semantically relevant content regardless of source) while enabling post-retrieval source filtering and attribution.

Source-Weighted Retrieval

After the initial ANN retrieval returns top-K candidates from across all sources, a source relevance weighting step re-scores candidates by combining the semantic similarity score with a source authority weight for the query type. For example, a query about regulatory compliance would upweight chunks from the compliance SharePoint library and downweight chunks from informal Confluence pages, even if the Confluence page has a marginally higher embedding similarity. Authority weights are configured per-query-type in the source profiles and can be overridden by explicit query metadata (e.g., user specifying "search only official policy documents").
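One way to combine the two signals is a linear blend of similarity and authority. The blend, the `alpha` parameter, the default weight, and the example weights below are assumptions made for illustration; the pattern itself only states that the two scores are combined and that weights are configured per query type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    source_system: str
    similarity: float  # semantic similarity from ANN retrieval, assumed in [0, 1]


# Hypothetical authority weights: query type -> source -> weight in [0, 1].
AUTHORITY_WEIGHTS = {
    "regulatory_compliance": {"sharepoint_compliance": 1.0, "confluence": 0.6},
}
DEFAULT_WEIGHT = 0.5  # assumed weight for sources with no configured authority


def rescore(candidates, query_type, alpha=0.7, allowed_sources=None):
    """Re-rank candidates by alpha * similarity + (1 - alpha) * authority.

    `allowed_sources` models an explicit user scope hint; when given,
    candidates from other sources are dropped before scoring.
    """
    weights = AUTHORITY_WEIGHTS.get(query_type, {})
    scored = []
    for cand in candidates:
        if allowed_sources is not None and cand.source_system not in allowed_sources:
            continue
        authority = weights.get(cand.source_system, DEFAULT_WEIGHT)
        scored.append((alpha * cand.similarity + (1 - alpha) * authority, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(cand, round(score, 4)) for score, cand in scored]
```

With these example weights, a Confluence chunk at similarity 0.82 scores 0.754 and a compliance SharePoint chunk at similarity 0.80 scores 0.86, so the authoritative source ranks first despite its slightly lower similarity.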

Conflict Resolution

When retrieved chunks from different sources make contradictory claims, the conflict resolver applies a deterministic resolution strategy:

  1. Recency-first: if effective dates differ and no source is designated authoritative for the domain, the more recent document's claim takes precedence
  2. Authority-first: if one source is designated authoritative for the domain, it takes precedence regardless of date, so this rule overrides recency-first when both apply
  3. Explicit conflict surfacing: when neither rule resolves the conflict, the system surfaces both claims to the user with source attribution, explicitly noting the discrepancy ("Policy document A states X; Confluence page B states Y.")

The conflict resolver is configurable per source-pair and per document type, allowing organisations to encode their own authority hierarchies.
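A deterministic resolver for a pair of claims might look like the sketch below. It assumes that authority is checked before recency, which is one reading of "regardless of date", and that contradiction detection has already reduced each chunk to a comparable `value`. The `Claim` class and the rule names other than UNRESOLVED are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Claim:
    source_system: str
    value: str
    effective_date: Optional[date]


def resolve(a, b, authoritative_source=None):
    """Return (winning_claim_or_None, rule_name) for two claims.

    Order of evaluation (an assumption of this sketch):
    authority, then recency, then surface the conflict to the user.
    """
    if a.value == b.value:
        return a, "NO_CONFLICT"
    for claim in (a, b):
        if authoritative_source is not None and claim.source_system == authoritative_source:
            return claim, "AUTHORITY_FIRST"
    if a.effective_date and b.effective_date and a.effective_date != b.effective_date:
        winner = a if a.effective_date > b.effective_date else b
        return winner, "RECENCY_FIRST"
    # Neither rule applies: the caller must show both claims with attribution.
    return None, "UNRESOLVED"
```

Returning the rule name alongside the winner gives the caller what it needs to log the decision and, for UNRESOLVED, to present both claims rather than silently choosing one.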


5. Architecture Diagram

flowchart TD
  subgraph Sources["Source Systems"]
    A[SharePoint / Confluence]
    B[CRM / APIs / DB]
  end
  subgraph Normalisation["Normalisation + Ingestion"]
    C[Connector + Parser]
    D[Unified Vector Store]
  end
  subgraph Query["Query Pipeline"]
    E[User Query]
    F[ACL Pre-filter]
    G[Source Weight + Conflict Resolve]
    H[LLM + Citations]
  end
  A --> C
  B --> C
  C -->|embed + index| D
  E --> F --> D
  D --> G --> H --> E
  style A fill:#dbeafe,stroke:#3b82f6
  style B fill:#dbeafe,stroke:#3b82f6
  style C fill:#f0fdf4,stroke:#22c55e
  style D fill:#fef9c3,stroke:#eab308
  style E fill:#dbeafe,stroke:#3b82f6
  style F fill:#f0fdf4,stroke:#22c55e
  style G fill:#f0fdf4,stroke:#22c55e
  style H fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Source Connector Framework | Integration Platform | Standardise document fetch from all source systems into canonical event model | Apache NiFi, Airbyte, custom Python framework, Azure Data Factory | Critical |
| Source Profile Registry | Configuration | Store per-source authority weights, metadata mappings, ACL mappings, refresh schedules | PostgreSQL, DynamoDB, Consul KV | High |
| Format Parser | Data Processing | Convert raw bytes (PDF, HTML, JSON, XML, Markdown) to plain text + structure | Apache Tika, unstructured.io, pypdf2, BeautifulSoup | High |
| Semantic Normaliser | NLP | Map source-specific terminology to canonical enterprise ontology | Custom dictionary mapping, spaCy entity linker, ontology lookup (EAAPL-KNW001) | Medium |
| Metadata Normaliser | Data Processing | Map source schema fields to canonical metadata schema | Custom Python mapper, dbt transformations | High |
| ACL Normaliser | Security | Translate source-system permissions to enterprise identity groups | Custom RBAC mapping layer, Azure AD group resolver | Critical |
| Chunking Engine | Data Processing | Apply source-appropriate chunking strategy | LlamaIndex, LangChain, custom source-aware splitters | High |
| Embedding Model | ML Inference | Produce unified embedding vectors for all sources | OpenAI, Vertex AI, BAAI bge (same model for ALL sources) | Critical |
| Vector Database | Storage | Unified vector index with source metadata | Pinecone, Weaviate, pgvector, OpenSearch | Critical |
| Source Relevance Weighter | Ranking | Re-score retrieved chunks by source authority for the query type | Custom Python scoring layer | High |
| Conflict Resolver | Business Logic | Detect and resolve contradictions between sources | Custom rule engine + LLM-assisted conflict detection | High |
| Citation + Attribution Layer | Post-processing | Generate per-chunk source labels and document links | Custom formatter with deep-link generation | High |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Source Connector | Detect new/modified document in source system (poll or webhook) | Document bytes + source metadata |
| 2 | Format Parser | Convert raw bytes to plain text; extract structural metadata (headings, tables) | Normalised text + structural metadata |
| 3 | Semantic Normaliser | Map domain terms to canonical vocabulary via ontology lookup | Text with resolved canonical terminology |
| 4 | Metadata Normaliser | Map source-schema fields (e.g., "Modified By" → owner, "Last Modified" → effective_date) | Canonical metadata record |
| 5 | ACL Normaliser | Resolve source access permissions to enterprise identity groups | {acl_groups: [...], acl_users: [...]} |
| 6 | Chunking Engine | Apply source-appropriate chunking; retain source ID and doc ID per chunk | Chunks with {chunk_id, doc_id, source_id, content, metadata, acl} |
| 7 | Embedding Model | Embed each chunk | Dense vector per chunk |
| 8 | Vector DB | Upsert vector with full metadata payload including source ID | Indexed vector entry |
| 9 | User | Submit query (optionally including source scope hint: "search only from compliance docs") | Query + optional source hint |
| 10 | Query Processor | Extract source hints; expand query; decompose if compound | Enhanced query + source filter params |
| 11 | ACL Filter | Resolve user's cross-source permissions; build per-source ACL filter | Combined metadata filter |
| 12 | ANN Retrieval | Search unified vector index with ACL + optional source scope filter | Top-K candidates across all sources |
| 13 | Source Relevance Weighter | Apply source authority weights per query type; re-score candidates | Re-weighted candidate list |
| 14 | Conflict Resolver | Detect contradictions between candidates from different sources; apply resolution rule | Deduplicated, conflict-annotated candidate list |
| 15 | Context Assembler | Construct prompt with source labels on each chunk | Assembled prompt with provenance labels |
| 16 | LLM | Generate answer citing source labels | Raw response with citation markers |
| 17 | Citation Layer | Map citation markers to deep-links for each source system | Final answer with clickable source links |
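Steps 15 to 17 can be sketched as a pair of functions: one labels each chunk before it enters the prompt, the other maps the labels the LLM cites back to deep links. The `[S1]` marker format, the chunk keys `title` and `url`, and the link rendering are assumptions of this sketch.

```python
import re


def assemble_context(chunks):
    """Label each chunk S1, S2, ... and return (prompt_context, label_map)."""
    lines, labels = [], {}
    for index, chunk in enumerate(chunks, start=1):
        label = f"S{index}"
        labels[label] = chunk
        lines.append(f"[{label}] ({chunk['source_id']}) {chunk['content']}")
    return "\n".join(lines), labels


def link_citations(answer, labels):
    """Replace [S<n>] markers with "[title](url)".

    Markers that do not correspond to a supplied chunk are left untouched
    so that a fabricated citation stays visible instead of gaining a link.
    """
    def _substitute(match):
        chunk = labels.get(match.group(1))
        if chunk is None:
            return match.group(0)
        return f"[{chunk['title']}]({chunk['url']})"

    return re.sub(r"\[(S\d+)\]", _substitute, answer)
```

Keeping the label map outside the prompt means the deep links never pass through the model, so the model cannot alter them.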

Error Flow

| Error Condition | Detection | Recovery |
| --- | --- | --- |
| Source system unavailable during ingestion | Connector health check failure | Retry with exponential backoff; continue with other sources; log staleness for affected source |
| Metadata mapping failure (unknown field in source schema) | Schema validation error | Ingest with partial metadata; flag for schema mapping review; do not block retrieval |
| Conflict resolution cannot determine authoritative source | Conflict resolver returns UNRESOLVED | Surface both claims to user with explicit conflict notice; do not silently pick one |
| Source scope hint references unknown source | Query processor validation | Ignore unknown source hint; warn user; proceed with all accessible sources |
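The first recovery path (retry with exponential backoff, then continue with the remaining sources) could be sketched as below. The `sync_all` function, its parameters, and the use of `ConnectionError` as the failure signal are assumptions; real connectors will raise their own exception types.

```python
import time


def sync_all(connectors, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run each connector, retrying failures with exponential backoff.

    `connectors` maps source name -> zero-argument callable that raises
    ConnectionError when the source is unavailable. Returns
    (succeeded, stale) so one unavailable source never blocks the others.
    """
    succeeded, stale = [], []
    for name, fetch in connectors.items():
        for attempt in range(max_attempts):
            try:
                fetch()
                succeeded.append(name)
                break
            except ConnectionError:
                # Back off 1x, 2x, 4x ... base_delay; no wait after the last try.
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)
        else:
            # All attempts failed: record the source as stale and move on.
            stale.append(name)
    return succeeded, stale
```

The `stale` list is what would feed the staleness log and the source availability notice shown to users.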

8. Security Considerations

Cross-Source ACL Enforcement

The most critical security requirement in multi-source RAG is ensuring that a user cannot retrieve content from source system B via the RAG interface when they lack direct access to source system B. The ACL Normaliser maps all source-system permissions to enterprise identity groups at ingestion time, and the pre-retrieval filter enforces these mappings at query time. The ACL mapping must be kept in sync with source-system permission changes: a scheduled ACL re-sync job must run at least daily.
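A minimal sketch of the pre-retrieval filter follows. The filter shape (`acl_groups_any`, `source_id_in`) is hypothetical and backend-neutral; each vector database has its own filter syntax, and `passes_filter` is only a reference check of the semantics that filter should enforce.

```python
def build_acl_filter(user_groups, source_scope=None):
    """Describe which chunks a user may retrieve (hypothetical filter shape)."""
    acl_filter = {"acl_groups_any": sorted(user_groups)}
    if source_scope:
        acl_filter["source_id_in"] = sorted(source_scope)
    return acl_filter


def passes_filter(chunk_metadata, acl_filter):
    """Return True only if the chunk shares at least one group with the user.

    A user with no groups matches nothing, so the check denies by default.
    """
    shared = set(chunk_metadata["acl_groups"]).intersection(acl_filter["acl_groups_any"])
    if not shared:
        return False
    allowed_sources = acl_filter.get("source_id_in")
    return allowed_sources is None or chunk_metadata["source_id"] in allowed_sources
```

Running `passes_filter` over the retrieved candidates as a second check after the vector search is one way to test with users who have partial source access.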

Data Classification Propagation

If a CRM record carries a CONFIDENTIAL classification, that classification must propagate through normalisation, chunking, embedding, and retrieval to the final response. The assembled context window classification must be the maximum of all included chunk classifications.
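The "maximum of all included chunk classifications" rule maps naturally onto an ordered enumeration. Only CONFIDENTIAL is named in the text; the other levels, their ordering, and the PUBLIC default for an empty context are assumptions of this sketch.

```python
from enum import IntEnum


class Classification(IntEnum):
    # Assumed ladder; substitute the organisation's own classification scheme.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def context_classification(chunk_classifications):
    """Return the highest classification among the chunks in the context window."""
    return max(chunk_classifications, default=Classification.PUBLIC)
```

For example, a context built from one INTERNAL and one CONFIDENTIAL chunk is itself CONFIDENTIAL, and the response should be handled accordingly.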

OWASP LLM Top 10 Mitigations

| OWASP LLM Risk | Multi-Source Specific Concern | Mitigation |
| --- | --- | --- |
| LLM01: Prompt Injection | Adversarial content injected into one source system to manipulate RAG outputs via that source's documents | Input sanitisation per connector; treat all retrieved content as untrusted data; system prompt injection guard |
| LLM06: Sensitive Information Disclosure | User sees content from source they lack direct access to via cross-source retrieval | ACL normaliser + per-source metadata filter in vector search; test with users who have partial source access |
| LLM09: Overreliance | User assumes all sources have been searched; system silently skips an unavailable source | Surface source availability status in response metadata; indicate when a source was unavailable during retrieval |

9. Governance Considerations

Source Authority Governance

Each source system must have a designated data steward who owns the source profile configuration, including authority weights and metadata mappings. Changes to authority weights require approval from the data steward and an AI governance representative, because they change which answers users receive.

Conflict Resolution Audit Trail

Every conflict resolution decision must be logged: which sources conflicted, which resolution rule was applied, and what decision was made. This log is a governance artefact for auditors reviewing the basis of AI-generated answers.
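A log entry covering the three required elements could be as simple as one JSON object per decision. The field names and the `conflict_log_record` helper are hypothetical; the sketch only shows that sources, rule, and decision are captured together with a timestamp.

```python
import json
from datetime import datetime, timezone


def conflict_log_record(query_id, sources, rule, decision, now=None):
    """Serialise one conflict resolution decision as a JSON line.

    `now` can be injected for testing; it defaults to the current UTC time.
    """
    record = {
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "query_id": query_id,
        "conflicting_sources": sorted(sources),
        "resolution_rule": rule,
        "decision": decision,
    }
    return json.dumps(record, sort_keys=True)
```

Writing one self-contained line per event keeps the log append-only and easy to sample during an audit.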

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
| --- | --- | --- | --- |
| Source Inventory | Data Governance | Continuous | Track all connected sources, their data stewards, refresh schedules, and authority weights |
| Conflict Resolution Log | AI Operations | Per event | Audit trail of every conflict detected and resolution applied |
| Cross-Source ACL Mapping Matrix | Security | Monthly review | Document which enterprise identity groups map to permissions in each source system |
| Source Freshness Dashboard | AI Operations | Daily | Monitor ingestion recency per source; flag stale sources |
| Attribution Accuracy Sample | Quality Assurance | Quarterly | Random sample of 50 responses; verify that cited sources actually contain the cited content |

10. Operational Considerations

Monitoring

| Metric | Alert Threshold | Notes |
| --- | --- | --- |
| Source connector availability (per source) | < 99% over 1 hour | Alert data steward for affected source |
| Cross-source conflict rate | > 10% of multi-source queries | May indicate data quality issues or outdated documents in one source |
| Source staleness (hours since last successful sync) | Tier 1: > 4h; Tier 2: > 24h | Per-source SLA based on source criticality |
| Answer attribution accuracy (sampled) | < 90% | Trigger attribution pipeline review |
| ACL mapping sync lag | > 1 hour behind source system | Security risk; immediate alert |
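The tiered staleness threshold can be checked with a few lines of code. The 4-hour and 24-hour limits come from the table; the `stale_sources` function and the shape of its inputs are assumptions of this sketch.

```python
from datetime import datetime, timedelta, timezone

# Thresholds from the monitoring table: Tier 1 > 4h, Tier 2 > 24h.
STALENESS_LIMITS = {1: timedelta(hours=4), 2: timedelta(hours=24)}


def stale_sources(last_sync, tiers, now):
    """Return the sorted names of sources whose last successful sync is too old.

    `last_sync` maps source -> timezone-aware datetime of last successful sync;
    `tiers` maps source -> tier number (1 or 2).
    """
    return sorted(
        name
        for name, synced_at in last_sync.items()
        if now - synced_at > STALENESS_LIMITS[tiers[name]]
    )
```

A source synced five hours ago is therefore flagged if it is Tier 1 but not if it is Tier 2.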

Service Level Objectives

| SLO | Target | Window |
| --- | --- | --- |
| Multi-source query response | P95 ≤ 3 seconds | Rolling 7-day |
| All Tier 1 sources available simultaneously | ≥ 99.5% | Monthly |
| Source conflict resolution correctness | ≥ 95% (sampled) | Quarterly |

Disaster Recovery

| Component | RTO | RPO | DR Strategy |
| --- | --- | --- | --- |
| Source Connector Framework | 2 hours | N/A (re-runnable) | Infrastructure-as-code; connector config in version control |
| Unified Vector Index | 1 hour | 1 hour | Cross-region replica; snapshot to object storage |
| Source Profile Registry | 30 minutes | 1 hour | Multi-AZ database with point-in-time recovery |

11. Cost Considerations

Cost Drivers

| Cost Driver | Notes | Optimisation |
| --- | --- | --- |
| Connector maintenance | Each source connector requires ongoing engineering for API changes | Invest in a connector framework (e.g., Airbyte) with community connectors |
| Re-ingestion on source schema change | Schema changes in source systems require partial or full re-ingestion | Schema version pinning; versioned metadata mappings |
| Cross-source retrieval latency | Unified index avoids per-source query fan-out; lower cost than federated alternatives | Cache embeddings for high-frequency source combinations |
| Conflict resolution LLM calls | When rule-based resolution fails, an LLM call resolves the conflict | Rate-limit conflict resolution LLM calls; cache resolution decisions for known conflicts |

Indicative Cost Range

| Deployment Scale | Monthly Cost Range | Dominant Cost Factor |
| --- | --- | --- |
| 5 sources, < 5M vectors | $1,500 – $5,000 | Connector maintenance, LLM generation |
| 10 sources, 5M–50M vectors | $5,000 – $25,000 | Vector DB hosting, embedding re-ingestion |
| 20+ sources, > 50M vectors | $20,000 – $100,000 | Connector operations, index management |

12. Trade-Off Analysis

Source Integration Approach

| Option | Recall Quality | Integration Complexity | Source Isolation | Recommended For |
| --- | --- | --- | --- | --- |
| Unified index (all sources → one vector DB) | Highest | High upfront, low ongoing | None (ACL-enforced) | Most enterprise deployments |
| Federated search (query each source index independently) | Moderate (score normalisation challenge) | Low per-source | Full | Data sovereignty constraints |
| Hybrid (unified for main sources + federated for sensitive sources) | High | High | Partial | Regulated industries with mixed sensitivity |

Conflict Resolution Strategy

| Strategy | Correctness | Transparency | Complexity | When to Use |
| --- | --- | --- | --- | --- |
| Recency-first | High for time-sensitive domains | Low (hidden from user) | Low | Policy updates, product specs |
| Authority-first | High for governance domains | Low (hidden from user) | Medium | Legal, compliance documents |
| Explicit conflict surfacing | Highest | Highest | High | Regulated decisions, ambiguous policy |

Architectural Tensions

| Tension | Trade-off | Recommendation |
| --- | --- | --- |
| Unified embedding model vs. per-source optimised models | Unified: consistent search quality; per-source: potentially better recall within each domain | Unified model always; domain-specific fine-tuning only if benchmark shows >10% recall improvement |
| Real-time ACL sync vs. eventual consistency | Real-time: secure but operationally complex; eventual: simpler but creates ACL lag windows | Risk-tiered: security-classified sources sync within 15 minutes; others within 1 hour |

13. Failure Modes

| Failure Mode | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Source A's stale document answers a query that Source B's current document would answer better | High | Medium | Source freshness monitoring; answer quality degradation | Re-trigger ingestion; implement freshness-aware source weighting |
| ACL mapping desync (user gains access in source but mapping not updated) | Medium | Medium | ACL sync lag monitor | Automated daily ACL re-sync; real-time sync for security-classified sources |
| Conflicting answers presented as unified answer (silent conflict resolution) | Medium | High | Conflict rate monitoring; user feedback loop | Enforce explicit conflict surfacing for Tier 1 source contradictions |
| Source connector auth token expiry causing silent ingestion failure | High | Medium | Connector health monitoring; staleness alert | Automated token refresh; alert on connector auth failure |
| Semantic normalisation introduces incorrect term mapping | Low | High | Spot-check QA on normalised documents | Human review of ontology mapping changes; A/B test normalisation changes |

14. Regulatory Considerations

| Regulation | Requirement | Multi-Source RAG Response |
| --- | --- | --- |
| Privacy Act 1988 | Personal information from different source systems must not be cross-matched without consent | ACL enforcement prevents cross-source PII leakage; personal information in CRM restricted to CRM-authorised roles |
| APRA CPS 234 | Access controls must reflect business requirements and be reviewed regularly | ACL mapping matrix reviewed monthly; cross-source permission review included in quarterly access certification |
| EU AI Act Article 13 | Traceability of AI-generated content to its sources | Per-chunk source attribution in every response; conflict resolution decisions logged |
| ISO/IEC 42001 | Data quality management across AI system inputs | Source quality scores tracked; data stewards assigned per source; quality gates before new source onboarding |

15. Reference Implementations

AWS

  • Connectors: AWS Glue with connectors for S3, RDS, Salesforce (via AppFlow), SharePoint (via custom Lambda)
  • Normalisation: AWS Lambda (Python) + Apache Tika on Fargate
  • Vector store: Amazon OpenSearch with k-NN; source_id as metadata field
  • Conflict resolution: Lambda function + DynamoDB for resolution rules

Azure

  • Connectors: Azure Logic Apps + Microsoft Graph API; Azure Data Factory for DB sources
  • Normalisation: Azure Functions + Azure AI Document Intelligence
  • Vector store: Azure AI Search with multi-index federation or unified index with source facets
  • Conflict resolution: Azure Functions with Cosmos DB for conflict log

GCP

  • Connectors: Cloud Run jobs + Pub/Sub for event-driven; Dataform for DB exports
  • Normalisation: Cloud Run + Document AI
  • Vector store: Vertex AI Vector Search; AlloyDB pgvector with source_id column
  • Conflict resolution: Cloud Functions + Firestore for resolution log

16. Related Patterns

| Pattern ID | Pattern Name | Relationship |
| --- | --- | --- |
| EAAPL-RAG001 | Enterprise RAG | Foundation; RAG002 extends the ingestion and retrieval layers |
| EAAPL-RAG003 | Secure RAG | Complementary; ACL enforcement is essential in multi-source contexts |
| EAAPL-RAG004 | Federated RAG | Alternative for data sovereignty; RAG002 centralises, RAG004 distributes |
| EAAPL-KNW002 | Semantic Data Layer | Provides the ontology used by the Semantic Normaliser |
| EAAPL-KNW003 | AI Knowledge Corpus Management | Governs the corpus across all connected sources |
| EAAPL-KNW006 | Corpus Quality Assurance | Applies per-source quality gates before ingestion |

17. Maturity Assessment

Overall Maturity: Proven — Multi-source RAG is deployed in production at many enterprises; connector frameworks are mature; the primary operational challenge is ongoing connector maintenance and ACL sync.

| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Technology Readiness | 4 | Core components mature; connector maintenance overhead is a persistent operational cost |
| Tooling Ecosystem | 4 | Airbyte, NiFi, and cloud-native connectors cover most enterprise sources; some sources still require custom connectors |
| Operational Guidance | 3 | Source authority weighting and conflict resolution are organisation-specific; less standardised than single-source RAG |
| Security & Compliance | 4 | Cross-source ACL enforcement is well-understood; implementation complexity is high |
| Scalability Evidence | 3 | Unified index scales well; connector maintenance at 20+ sources is operationally demanding |
| Cost Predictability | 3 | Connector maintenance costs are highly variable by source system complexity |

18. Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2024-03-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-06-15 | EAAPL Working Group | Added source relevance weighting; conflict resolution strategies formalised |
| 1.2 | 2024-10-20 | EAAPL Working Group | ACL normaliser architecture added; cross-source security section expanded |
| 1.3 | 2025-03-10 | EAAPL Working Group | Updated reference implementations; added semantic normalisation component |