Multi-Source Retrieval-Augmented Generation
[EAAPL-RAG002] Multi-Source Retrieval-Augmented Generation
Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Heterogeneous Source Integration
Version: 1.3
Maturity: Proven
Tags: rag multi-source source-connectors normalisation attribution conflict-resolution federation
Regulatory Relevance: APRA CPS 230, Privacy Act 1988, ISO/IEC 42001 Section 6.1, NIST AI RMF (Map 1.5)
1. Executive Summary
Enterprise knowledge rarely lives in a single system. Policy documents live in SharePoint, technical runbooks in Confluence, customer data in Salesforce, financial reports in ERP exports, and operational metrics in data warehouses. Multi-Source RAG extends the foundational RAG pattern (EAAPL-RAG001) to orchestrate retrieval across these heterogeneous sources simultaneously, normalise their disparate formats into a unified embedding space, attribute answers to their authoritative origin, and resolve conflicts when sources contradict each other.
For enterprise leaders, this pattern removes the organisational friction of designating a "single system of record" before deploying AI assistants. Instead, the system weights sources by their configured authority for each query type, surfacing the most relevant answer regardless of where it lives. The business outcome is a knowledge assistant that answers questions by drawing simultaneously on policy, product, operational, and people data: the equivalent of asking the most informed person in the organisation, who happens to have read everything. Pilot deployments in financial services and government have demonstrated 35–50% reductions in time-to-answer for cross-functional knowledge queries, with source attribution enabling compliance teams to audit the basis of every answer.
2. Problem Statement
Business Problem
Enterprise knowledge is siloed across dozens of systems maintained by different teams with different governance models. A question as straightforward as "What is our current refund policy for corporate clients?" may require consulting a CRM record, a policy document in SharePoint, a product team's Confluence page, and a legal team's approved exceptions list — across four systems with four different search interfaces.
Technical Problem
Different source systems produce documents in incompatible formats (PDF, HTML, JSON, database rows, Markdown), with incompatible metadata schemas, at different refresh rates, with different access control models. A naive multi-source approach that concatenates results from independent keyword searches produces low-quality, unranked context with no cross-source relevance normalisation. When sources disagree (e.g., a policy document and a Confluence page cite different values for the same policy parameter), the system has no principled mechanism to resolve the conflict.
Symptoms
- Users querying the AI assistant receive answers drawn only from one source system even when better information exists elsewhere
- Source attribution in responses is absent or incorrect, making audit trails unreliable
- Contradictory answers are generated when the same question is asked repeatedly, sourced from different systems on different runs
- Integration teams spend weeks building bespoke connectors for each new data source because no reusable connector framework exists
Cost of Inaction
- Incomplete answers leading to incorrect operational decisions or customer communications
- Connector fragmentation: each new source system requires bespoke integration, with no amortisation across the knowledge platform
- Inability to unify enterprise search into a single interface — users maintain separate search habits for each system
3. Context
When to Apply
- Organisations with knowledge spread across 3+ source systems that users need to query simultaneously
- Deployments where the authoritative source of truth varies by document type (legal: SharePoint; operational: Confluence; customer: CRM)
- Scenarios where source attribution and conflict resolution are required for compliance
- Enterprise search modernisation programmes replacing multi-system search with a unified AI-powered interface
When NOT to Apply
- Single-source deployments (use EAAPL-RAG001)
- Sources that are geographically or organisationally distributed with data sovereignty constraints (use EAAPL-RAG004 Federated RAG)
- Real-time data sources with sub-minute freshness requirements (use EAAPL-RAG006 Streaming RAG)
- Sources containing exclusively structured relational data (use SQL-generation or Graph RAG patterns)
Prerequisites
- API access or connector availability for all target source systems
- Unified identity model enabling ACL resolution across source systems (e.g., Azure AD groups that map to permissions in each source)
- Metadata schema alignment work to identify common fields (document type, effective date, owner, classification) across sources
- Source governance: each source system must have a designated data owner responsible for quality and freshness
Industry Applicability
| Industry | Source Systems Typically Integrated | Primary Benefit |
|---|---|---|
| Financial Services | Core banking exports, compliance manuals (SharePoint), risk frameworks (Confluence), regulatory filings (PDF archive) | Single-pane regulatory Q&A across all documentation |
| Healthcare | Clinical guidelines (PDF), pharmacy formulary (EHR export), policy manuals (SharePoint), research publications (PubMed API) | Clinician decision support across all knowledge domains |
| Government | Legislation (PDF), internal policy (SharePoint), operational procedures (Confluence), case management notes (TRIM) | Unified public servant knowledge assistant |
| Retail | Product catalogue (PIM API), supplier agreements (SharePoint), customer FAQs (CMS), warranty terms (PDF) | Customer service automation with authoritative sourcing |
| Technology | Developer docs (Confluence/GitLab wikis), runbooks (Notion/Markdown), API specifications (OpenAPI), issue history (Jira exports) | Integrated developer knowledge assistant |
4. Architecture Overview
Multi-Source RAG introduces a Source Abstraction Layer and a Normalisation Engine between raw source connectors and the unified embedding pipeline. These two components are the architectural differentiators from single-source RAG.
Source Connector Framework
Rather than building bespoke connectors, the pattern mandates a connector framework that abstracts each source into a standard document event model: {document_id, source_system, content_raw, metadata, acl_principals, fetched_at, content_hash}. Connectors are classified as pull (polling schedule), push (webhook/event stream), or on-demand (query-time API call). The connector framework handles authentication, rate limiting, error handling, and delta detection (comparing content hashes to avoid re-ingesting unchanged documents).
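The event model and hash-based delta detection described above could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the field names come from the event model, while `compute_content_hash`, `make_event`, `needs_ingestion`, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DocumentEvent:
    """Canonical event emitted by every connector (fields from the event model)."""
    document_id: str
    source_system: str
    content_raw: bytes
    metadata: dict
    acl_principals: tuple
    fetched_at: datetime
    content_hash: str


def compute_content_hash(content_raw: bytes) -> str:
    # SHA-256 is an assumption; any stable digest works for delta detection.
    return hashlib.sha256(content_raw).hexdigest()


def make_event(document_id, source_system, content_raw, metadata, acl_principals):
    """Build an event and stamp it with the fetch time and content hash."""
    return DocumentEvent(
        document_id=document_id,
        source_system=source_system,
        content_raw=content_raw,
        metadata=metadata,
        acl_principals=tuple(acl_principals),
        fetched_at=datetime.now(timezone.utc),
        content_hash=compute_content_hash(content_raw),
    )


def needs_ingestion(event: DocumentEvent, known_hashes: dict) -> bool:
    """Return True when the document is new or its content hash has changed."""
    key = (event.source_system, event.document_id)
    return known_hashes.get(key) != event.content_hash
```

Keying the hash store on `(source_system, document_id)` keeps identifiers from different sources from colliding; a connector would update `known_hashes` only after a successful ingest.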
Each source system is assigned a source profile: a configuration object capturing the source's authority score by document type, the metadata mapping from source schema to canonical schema, the ACL mapping from source permissions to enterprise identity groups, and the refresh schedule. Source profiles are stored in a configuration registry, not hardcoded, enabling new sources to be onboarded without code changes.
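A source profile might look like the sketch below. The four attributes mirror the description above; the class name, the example values, the cron-style schedule string, and the `to_canonical_metadata` helper are hypothetical and shown only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class SourceProfile:
    source_id: str
    authority_by_doc_type: dict  # e.g. {"policy": 1.0, "guideline": 0.7}
    metadata_mapping: dict       # source field name -> canonical field name
    acl_mapping: dict            # source permission -> enterprise identity group
    refresh_schedule: str        # cron-style string; the format is an assumption


def to_canonical_metadata(profile, source_metadata):
    """Rename mapped fields; collect unmapped field names for mapping review."""
    canonical, unmapped = {}, []
    for name, value in source_metadata.items():
        target = profile.metadata_mapping.get(name)
        if target is None:
            unmapped.append(name)
        else:
            canonical[target] = value
    return canonical, unmapped


# Illustrative profile; in practice this would be loaded from the registry.
SHAREPOINT_PROFILE = SourceProfile(
    source_id="sharepoint-compliance",
    authority_by_doc_type={"policy": 1.0, "guideline": 0.7},
    metadata_mapping={"Modified By": "owner", "Last Modified": "effective_date"},
    acl_mapping={"Compliance Readers": "grp-compliance-read"},
    refresh_schedule="0 */4 * * *",
)
```

Returning the unmapped field names rather than raising lets ingestion proceed with partial metadata while the mapping is reviewed.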
Normalisation Engine
Raw documents from different sources arrive in incompatible formats. The normalisation engine applies format-specific parsers (PDF via Apache Tika, HTML via BeautifulSoup, JSON via schema mapping, Markdown via AST parser) to produce a canonical document structure. Critically, normalisation also includes semantic normalisation: the engine identifies domain-specific terminology that differs across sources (e.g., "customer" in CRM vs. "client" in legal documents vs. "policyholder" in insurance) and maps them to canonical concepts. This enables cross-source retrieval of semantically equivalent content regardless of surface form.
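As a rough illustration of dictionary-based semantic normalisation, the sketch below maps the surface terms from the example above to one canonical concept. The mapping table, the choice of "customer" as the canonical label, and the function name are assumptions; a production normaliser would draw its mappings from the enterprise ontology.

```python
import re

# Hypothetical mapping: surface terms from the example above, with
# "customer" assumed to be the canonical concept label.
CANONICAL_TERMS = {
    "client": "customer",
    "policyholder": "customer",
}

# Whole-word match with an optional plural "s".
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CANONICAL_TERMS)) + r")(s?)\b",
    re.IGNORECASE,
)


def normalise_terms(text: str) -> str:
    """Replace source-specific terms with their canonical concept."""
    def _swap(match):
        canonical = CANONICAL_TERMS[match.group(1).lower()]
        return canonical + match.group(2).lower()
    return _PATTERN.sub(_swap, text)
```

This simple version lower-cases the replaced term and ignores context, so it would mis-map a word such as "client" in "email client"; that risk is why term mappings warrant human review.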
Unified Embedding Space
All normalised chunks are embedded using a single embedding model, producing a unified vector space where semantic similarity is meaningful regardless of source provenance. This is why embedding model lock-in is more consequential in multi-source deployments: vectors from different embedding models are not comparable, so the same model must be used for every source, and a model change requires re-embedding all sources and switching the index over in a single cutover.
Source provenance (which system the chunk came from) is stored as metadata, not encoded in the embedding. This separation allows retrieval to be source-agnostic (finding the most semantically relevant content regardless of source) while enabling post-retrieval source filtering and attribution.
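The separation between embeddings and provenance can be illustrated with a toy in-memory index. This is only a sketch: the two-dimensional vectors stand in for real embeddings from one shared model, and `search`, `INDEX`, and the entry layout are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vector, top_k=3, allowed_sources=None):
    """Rank chunks by similarity; provenance only filters, it never alters the score."""
    hits = [
        (cosine(query_vector, entry["vector"]), entry)
        for entry in index
        if allowed_sources is None or entry["source_id"] in allowed_sources
    ]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:top_k]


# Toy vectors; source_id is plain metadata stored next to each vector.
INDEX = [
    {"chunk_id": "c1", "source_id": "sharepoint", "vector": [1.0, 0.0]},
    {"chunk_id": "c2", "source_id": "confluence", "vector": [0.8, 0.6]},
    {"chunk_id": "c3", "source_id": "crm", "vector": [0.0, 1.0]},
]
```

Calling `search(INDEX, [1.0, 0.1], top_k=2)` ranks across every source, while passing `allowed_sources={"confluence", "crm"}` restricts the same ranking to those two sources without touching the vectors.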
Source-Weighted Retrieval
After the initial ANN retrieval returns top-K candidates from across all sources, a source relevance weighting step re-scores candidates by combining the semantic similarity score with a source authority weight for the query type. For example, a query about regulatory compliance would upweight chunks from the compliance SharePoint library and downweight chunks from informal Confluence pages, even if the Confluence page has a marginally higher embedding similarity. Authority weights are configured per-query-type in the source profiles and can be overridden by explicit query metadata (e.g., user specifying "search only official policy documents").
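One way to express the re-scoring step is sketched below. The pattern does not prescribe a formula, so the multiplicative blend, the default weight for unlisted sources, and all names and weight values here are assumptions chosen for illustration.

```python
def rescore(candidates, authority_weights, query_type, default_weight=0.5):
    """Combine similarity with per-query-type source authority.

    Uses final_score = similarity * authority_weight, which is an assumed
    blend; a weighted sum would be an equally valid choice.
    """
    weights = authority_weights.get(query_type, {})
    rescored = []
    for cand in candidates:
        weight = weights.get(cand["source_id"], default_weight)
        rescored.append({**cand, "final_score": cand["similarity"] * weight})
    return sorted(rescored, key=lambda c: c["final_score"], reverse=True)


# Hypothetical weights for a regulatory query type.
AUTHORITY_WEIGHTS = {
    "regulatory": {"sharepoint-compliance": 1.0, "confluence": 0.6},
}

CANDIDATES = [
    {"chunk_id": "c1", "source_id": "confluence", "similarity": 0.82},
    {"chunk_id": "c2", "source_id": "sharepoint-compliance", "similarity": 0.80},
]
```

With these values, `rescore(CANDIDATES, AUTHORITY_WEIGHTS, "regulatory")` places the compliance chunk first even though the Confluence chunk has the marginally higher similarity, which matches the behaviour described above.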
Conflict Resolution
When retrieved chunks from different sources make contradictory claims, the conflict resolver applies a deterministic resolution strategy:
- Recency-first: if effective dates differ, the more recent document's claim takes precedence
- Authority-first: if one source is designated authoritative for the domain, it takes precedence regardless of date
- Explicit conflict surfacing: when neither rule resolves the conflict, the system surfaces both claims to the user with source attribution, explicitly noting the discrepancy (e.g., "Policy document A states X; Confluence page B states Y. Neither source is designated authoritative for this topic.")
The conflict resolver is configurable per source-pair and per document type, allowing organisations to encode their own authority hierarchies.
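The resolution rules above could be sketched as a small deterministic function. The evaluation order is an assumption: authority is checked before recency here because the authority rule applies "regardless of date", but the order is exactly the kind of thing an organisation would configure. The claim layout and function name are hypothetical.

```python
from datetime import date

UNRESOLVED = "UNRESOLVED"


def resolve_conflict(claim_a, claim_b, authoritative_sources):
    """Return (winning_claim_or_None, rule_applied) for two contradictory claims."""
    a_auth = claim_a["source_id"] in authoritative_sources
    b_auth = claim_b["source_id"] in authoritative_sources
    if a_auth != b_auth:  # exactly one source is designated authoritative
        return (claim_a if a_auth else claim_b), "authority-first"
    date_a = claim_a.get("effective_date")
    date_b = claim_b.get("effective_date")
    if date_a and date_b and date_a != date_b:
        return (claim_a if date_a > date_b else claim_b), "recency-first"
    # Neither rule applies: the caller must surface both claims to the user.
    return None, UNRESOLVED


# Illustrative claims about the same policy parameter.
POLICY_CLAIM = {"source_id": "sharepoint-policy", "value": "30 days",
                "effective_date": date(2024, 1, 1)}
WIKI_CLAIM = {"source_id": "confluence", "value": "14 days",
              "effective_date": date(2025, 1, 1)}
```

Returning the rule name alongside the winner gives the caller what it needs to write an audit record and to decide whether to show a conflict notice.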
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Source Connector Framework | Integration Platform | Standardise document fetch from all source systems into canonical event model | Apache NiFi, Airbyte, custom Python framework, Azure Data Factory | Critical |
| Source Profile Registry | Configuration | Store per-source authority weights, metadata mappings, ACL mappings, refresh schedules | PostgreSQL, DynamoDB, Consul KV | High |
| Format Parser | Data Processing | Convert raw bytes (PDF, HTML, JSON, XML, Markdown) to plain text + structure | Apache Tika, unstructured.io, pypdf (formerly PyPDF2), BeautifulSoup | High |
| Semantic Normaliser | NLP | Map source-specific terminology to canonical enterprise ontology | Custom dictionary mapping, spaCy entity linker, ontology lookup (EAAPL-KNW002) | Medium |
| Metadata Normaliser | Data Processing | Map source schema fields to canonical metadata schema | Custom Python mapper, dbt transformations | High |
| ACL Normaliser | Security | Translate source-system permissions to enterprise identity groups | Custom RBAC mapping layer, Azure AD group resolver | Critical |
| Chunking Engine | Data Processing | Apply source-appropriate chunking strategy | LlamaIndex, LangChain, custom source-aware splitters | High |
| Embedding Model | ML Inference | Produce unified embedding vectors for all sources | OpenAI, Vertex AI, BAAI bge (same model for ALL sources) | Critical |
| Vector Database | Storage | Unified vector index with source metadata | Pinecone, Weaviate, pgvector, OpenSearch | Critical |
| Source Relevance Weighter | Ranking | Re-score retrieved chunks by source authority for the query type | Custom Python scoring layer | High |
| Conflict Resolver | Business Logic | Detect and resolve contradictions between sources | Custom rule engine + LLM-assisted conflict detection | High |
| Citation + Attribution Layer | Post-processing | Generate per-chunk source labels and document links | Custom formatter with deep-link generation | High |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Source Connector | Detect new/modified document in source system (poll or webhook) | Document bytes + source metadata |
| 2 | Format Parser | Convert raw bytes to plain text; extract structural metadata (headings, tables) | Normalised text + structural metadata |
| 3 | Semantic Normaliser | Map domain terms to canonical vocabulary via ontology lookup | Text with resolved canonical terminology |
| 4 | Metadata Normaliser | Map source-schema fields (e.g., "Modified By" → owner, "Last Modified" → effective_date) | Canonical metadata record |
| 5 | ACL Normaliser | Resolve source access permissions to enterprise identity groups | {acl_groups: [...], acl_users: [...]} |
| 6 | Chunking Engine | Apply source-appropriate chunking; retain source ID and doc ID per chunk | Chunks with {chunk_id, doc_id, source_id, content, metadata, acl} |
| 7 | Embedding Model | Embed each chunk | Dense vector per chunk |
| 8 | Vector DB | Upsert vector with full metadata payload including source ID | Indexed vector entry |
| 9 | User | Submit query (optionally including source scope hint: "search only from compliance docs") | Query + optional source hint |
| 10 | Query Processor | Extract source hints; expand query; decompose if compound | Enhanced query + source filter params |
| 11 | ACL Filter | Resolve user's cross-source permissions; build per-source ACL filter | Combined metadata filter |
| 12 | ANN Retrieval | Search unified vector index with ACL + optional source scope filter | Top-K candidates across all sources |
| 13 | Source Relevance Weighter | Apply source authority weights per query type; re-score candidates | Re-weighted candidate list |
| 14 | Conflict Resolver | Detect contradictions between candidates from different sources; apply resolution rule | Deduplicated, conflict-annotated candidate list |
| 15 | Context Assembler | Construct prompt with source labels on each chunk | Assembled prompt with provenance labels |
| 16 | LLM | Generate answer citing source labels | Raw response with citation markers |
| 17 | Citation Layer | Map citation markers to deep-links for each source system | Final answer with clickable source links |
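Steps 15 to 17 of the primary flow might be sketched as below. The `[S1]`-style label format, the chunk fields, and the function names are assumptions for illustration, and the URL in the usage note is a placeholder rather than a real deep-link scheme.

```python
import re


def assemble_context(chunks):
    """Label each chunk [S1], [S2], ... and return the context text plus a label map."""
    lines, label_map = [], {}
    for i, chunk in enumerate(chunks, start=1):
        label = f"S{i}"
        label_map[label] = chunk
        lines.append(f"[{label}] ({chunk['source_id']}) {chunk['content']}")
    return "\n".join(lines), label_map


def link_citations(answer, label_map):
    """Append the deep link after each known [S<n>] marker; leave unknown markers as-is."""
    def _link(match):
        chunk = label_map.get(match.group(1))
        return f"{match.group(0)} ({chunk['url']})" if chunk else match.group(0)
    return re.sub(r"\[(S\d+)\]", _link, answer)
```

For example, given one chunk with `url` set to `https://example.invalid/policy/refunds`, the answer `"Refunds take 14 days [S1]."` becomes `"Refunds take 14 days [S1] (https://example.invalid/policy/refunds)."`. Leaving unknown markers untouched makes hallucinated citations easy to spot in attribution sampling.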
Error Flow
| Error Condition | Detection | Recovery |
|---|---|---|
| Source system unavailable during ingestion | Connector health check failure | Retry with exponential backoff; continue with other sources; log staleness for affected source |
| Metadata mapping failure (unknown field in source schema) | Schema validation error | Ingest with partial metadata; flag for schema mapping review; do not block retrieval |
| Conflict resolution cannot determine authoritative source | Conflict resolver returns UNRESOLVED | Surface both claims to user with explicit conflict notice; do not silently pick one |
| Source scope hint references unknown source | Query processor validation | Ignore unknown source hint; warn user; proceed with all accessible sources |
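The first recovery strategy in the table, retrying with exponential backoff while continuing with the remaining sources, could look like the following sketch. The function name, the use of `ConnectionError` as the failure signal, and the default attempt count and delay are assumptions.

```python
import time


def ingest_all(connectors, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run each connector with exponential backoff; a failing source never blocks the rest.

    `connectors` maps source_id -> zero-argument fetch callable (hypothetical shape).
    Returns (results_by_source, sources_that_stayed_unavailable).
    """
    results, stale_sources = {}, []
    for source_id, fetch in connectors.items():
        for attempt in range(max_attempts):
            try:
                results[source_id] = fetch()
                break
            except ConnectionError:
                if attempt == max_attempts - 1:
                    # Recorded so staleness can be logged for this source.
                    stale_sources.append(source_id)
                else:
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return results, stale_sources
```

Injecting `sleep` as a parameter keeps the backoff testable without real waiting; the returned list of stale sources is what a freshness monitor would consume.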
8. Security Considerations
Cross-Source ACL Enforcement
The most critical security requirement in multi-source RAG is ensuring that a user cannot retrieve content from source system B via the RAG interface when they lack direct access to source system B. The ACL Normaliser maps all source-system permissions to enterprise identity groups at ingestion time, and the pre-retrieval filter enforces these mappings at query time. The ACL mapping must be kept in sync with source-system permission changes: a scheduled full ACL re-sync job must run at least daily, and incremental sync should keep lag within the risk-tiered targets used elsewhere in this pattern (15 minutes for security-classified sources, 1 hour for others).
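A query-time ACL filter could be sketched as follows. The filter dictionary is a hypothetical, vendor-neutral shape rather than the syntax of any particular vector database, and `is_visible` is a reference check that a test suite might use to confirm the store's filter behaves the same way.

```python
def build_acl_filter(user_groups, source_scope=None):
    """Build a metadata filter: a chunk's ACL groups must intersect the user's groups."""
    conditions = {"acl_groups": {"any_of": sorted(user_groups)}}
    if source_scope:
        conditions["source_id"] = {"any_of": sorted(source_scope)}
    return conditions


def is_visible(chunk, acl_filter):
    """Reference check mirroring what the vector store filter should enforce."""
    allowed = set(acl_filter["acl_groups"]["any_of"])
    if not allowed & set(chunk["acl_groups"]):
        return False
    scope = acl_filter.get("source_id")
    return scope is None or chunk["source_id"] in scope["any_of"]
```

Applying the filter before retrieval, rather than trimming results afterwards, means restricted chunks never enter the candidate set.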
Data Classification Propagation
If a CRM record carries a CONFIDENTIAL classification, that classification must propagate through normalisation, chunking, embedding, and retrieval to the final response. The assembled context window classification must be the maximum of all included chunk classifications.
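The "maximum of all included chunk classifications" rule can be expressed in a few lines. The ordering of labels below is an assumption, since only CONFIDENTIAL is named in the text, and the function name is hypothetical.

```python
# Lowest to highest; this ladder is an assumed example, not a prescribed scheme.
CLASSIFICATION_ORDER = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"]


def context_classification(chunks):
    """Return the highest classification among the chunks in the context window."""
    if not chunks:
        return CLASSIFICATION_ORDER[0]  # empty context: lowest level (assumption)
    return max(
        (chunk["classification"] for chunk in chunks),
        key=CLASSIFICATION_ORDER.index,  # raises ValueError for unknown labels
    )
```

An unrecognised label raises `ValueError` instead of being silently ranked, so a chunk with a malformed classification cannot slip into a lower-rated response.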
OWASP LLM Top 10 Mitigations
| OWASP LLM Risk | Multi-Source Specific Concern | Mitigation |
|---|---|---|
| LLM01: Prompt Injection | Adversarial content injected into one source system to manipulate RAG outputs via that source's documents | Input sanitisation per connector; treat all retrieved content as untrusted data; system prompt injection guard |
| LLM06: Sensitive Information Disclosure | User sees content from source they lack direct access to via cross-source retrieval | ACL normaliser + per-source metadata filter in vector search; test with users who have partial source access |
| LLM09: Overreliance | User assumes all sources have been searched; system silently skips an unavailable source | Surface source availability status in response metadata; indicate when a source was unavailable during retrieval |
9. Governance Considerations
Source Authority Governance
Each source system must have a designated data steward who owns the source profile configuration, including authority weights and metadata mappings. Changes to authority weights require approval from the data steward and an AI governance representative, because they change which answers users receive.
Conflict Resolution Audit Trail
Every conflict resolution decision must be logged: which sources conflicted, which resolution rule was applied, and what decision was made. This log is a governance artefact for auditors reviewing the basis of AI-generated answers.
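A log record carrying the three required facts might be serialised as shown in this sketch. The field names, the JSON-lines format, and the `query_id` correlation field are assumptions; the pattern only requires that the conflicting sources, the rule, and the decision are captured.

```python
import json
from datetime import datetime, timezone


def conflict_log_entry(query_id, claims, rule_applied, decision):
    """Serialise one conflict resolution decision as a JSON line for the audit log."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "query_id": query_id,
        "conflicting_sources": [claim["source_id"] for claim in claims],
        "rule_applied": rule_applied,
        "decision": decision,
    }
    return json.dumps(record, sort_keys=True)
```

Sorting keys keeps entries byte-stable for a given record, which simplifies diffing and sampling during audits.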
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Source Inventory | Data Governance | Continuous | Track all connected sources, their data stewards, refresh schedules, and authority weights |
| Conflict Resolution Log | AI Operations | Per event | Audit trail of every conflict detected and resolution applied |
| Cross-Source ACL Mapping Matrix | Security | Monthly review | Document which enterprise identity groups map to permissions in each source system |
| Source Freshness Dashboard | AI Operations | Daily | Monitor ingestion recency per source; flag stale sources |
| Attribution Accuracy Sample | Quality Assurance | Quarterly | Random sample of 50 responses; verify that cited sources actually contain the cited content |
10. Operational Considerations
Monitoring
| Metric | Alert Threshold | Notes |
|---|---|---|
| Source connector availability (per source) | < 99% over 1 hour | Alert data steward for affected source |
| Cross-source conflict rate | > 10% of multi-source queries | May indicate data quality issues or outdated documents in one source |
| Source staleness (hours since last successful sync) | Tier 1: > 4h; Tier 2: > 24h | Per-source SLA based on source criticality |
| Answer attribution accuracy (sampled) | < 90% | Trigger attribution pipeline review |
| ACL mapping sync lag | > 1 hour behind source system | Security risk; immediate alert |
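The tiered staleness thresholds in the table could be checked with a function like the one sketched here. The thresholds come from the table; the function name and the shape of the `last_sync` and `tiers` inputs are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Thresholds from the monitoring table: Tier 1 > 4h, Tier 2 > 24h.
STALENESS_LIMITS = {1: timedelta(hours=4), 2: timedelta(hours=24)}


def stale_sources(last_sync, tiers, now=None):
    """Return sorted source IDs whose last successful sync exceeds their tier's limit.

    last_sync: source_id -> timezone-aware datetime of last successful sync
    tiers:     source_id -> tier number (1 or 2)
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        source_id
        for source_id, synced_at in last_sync.items()
        if now - synced_at > STALENESS_LIMITS[tiers[source_id]]
    )
```

A source synced five hours ago is flagged if it is Tier 1 but not if it is Tier 2, which is the per-source SLA behaviour the table describes.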
Service Level Objectives
| SLO | Target | Window |
|---|---|---|
| Multi-source query response P95 | ≤ 3 seconds | Rolling 7-day |
| All Tier 1 sources available simultaneously | ≥ 99.5% | Monthly |
| Source conflict resolution correctness | ≥ 95% (sampled) | Quarterly |
Disaster Recovery
| Component | RTO | RPO | DR Strategy |
|---|---|---|---|
| Source Connector Framework | 2 hours | N/A (re-runnable) | Infrastructure-as-code; connector config in version control |
| Unified Vector Index | 1 hour | 1 hour | Cross-region replica; snapshot to object storage |
| Source Profile Registry | 30 minutes | 1 hour | Multi-AZ database with point-in-time recovery |
11. Cost Considerations
Cost Drivers
| Cost Driver | Notes | Optimisation |
|---|---|---|
| Connector maintenance | Each source connector requires ongoing engineering for API changes | Invest in a connector framework (e.g., Airbyte) with community connectors |
| Re-ingestion on source schema change | Schema changes in source systems require partial or full re-ingestion | Schema version pinning; versioned metadata mappings |
| Cross-source retrieval latency | Unified index avoids per-source query fan-out; lower cost than federated alternatives | Cache embeddings for high-frequency source combinations |
| Conflict resolution LLM calls | When rule-based resolution fails, an LLM call resolves the conflict | Rate-limit conflict resolution LLM calls; cache resolution decisions for known conflicts |
Indicative Cost Range
| Deployment Scale | Monthly Cost Range | Dominant Cost Factor |
|---|---|---|
| 5 sources, < 5M vectors | $1,500 – $5,000 | Connector maintenance, LLM generation |
| 10 sources, 5M–50M vectors | $5,000 – $25,000 | Vector DB hosting, embedding re-ingestion |
| 20+ sources, > 50M vectors | $20,000 – $100,000 | Connector operations, index management |
12. Trade-Off Analysis
Source Integration Approach
| Option | Recall Quality | Integration Complexity | Source Isolation | Recommended For |
|---|---|---|---|---|
| Unified index (all sources → one vector DB) | Highest | High upfront, low ongoing | None (ACL-enforced) | Most enterprise deployments |
| Federated search (query each source index independently) | Moderate (score normalisation challenge) | Low per-source | Full | Data sovereignty constraints |
| Hybrid (unified for main sources + federated for sensitive sources) | High | High | Partial | Regulated industries with mixed sensitivity |
Conflict Resolution Strategy
| Strategy | Correctness | Transparency | Complexity | When to Use |
|---|---|---|---|---|
| Recency-first | High for time-sensitive domains | Low (hidden from user) | Low | Policy updates, product specs |
| Authority-first | High for governance domains | Low (hidden from user) | Medium | Legal, compliance documents |
| Explicit conflict surfacing | Highest | Highest | High | Regulated decisions, ambiguous policy |
Architectural Tensions
| Tension | Trade-off | Recommendation |
|---|---|---|
| Unified embedding model vs. per-source optimised models | Unified: consistent search quality; per-source: potentially better recall within each domain | Unified model always; domain-specific fine-tuning only if benchmark shows >10% recall improvement |
| Real-time ACL sync vs. eventual consistency | Real-time: secure but operationally complex; eventual: simpler but creates ACL lag windows | Risk-tiered: security-classified sources sync within 15 minutes; others within 1 hour |
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Source A's stale document answers a query that Source B's current document would answer better | High | Medium | Source freshness monitoring; answer quality degradation | Re-trigger ingestion; implement freshness-aware source weighting |
| ACL mapping desync (user gains access in source but mapping not updated) | Medium | Medium | ACL sync lag monitor | Automated daily ACL re-sync; real-time sync for security-classified sources |
| Conflicting answers presented as unified answer (silent conflict resolution) | Medium | High | Conflict rate monitoring; user feedback loop | Enforce explicit conflict surfacing for Tier 1 source contradictions |
| Source connector auth token expiry causing silent ingestion failure | High | Medium | Connector health monitoring; staleness alert | Automated token refresh; alert on connector auth failure |
| Semantic normalisation introduces incorrect term mapping | Low | High | Spot-check QA on normalised documents | Human review of ontology mapping changes; A/B test normalisation changes |
14. Regulatory Considerations
| Regulation | Requirement | Multi-Source RAG Response |
|---|---|---|
| Privacy Act 1988 | Personal information from different source systems must not be cross-matched without consent | ACL enforcement prevents cross-source PII leakage; personal information in CRM restricted to CRM-authorised roles |
| APRA CPS 234 | Access controls must reflect business requirements and be reviewed regularly | ACL mapping matrix reviewed monthly; cross-source permission review included in quarterly access certification |
| EU AI Act Article 13 | Traceability of AI-generated content to its sources | Per-chunk source attribution in every response; conflict resolution decisions logged |
| ISO/IEC 42001 | Data quality management across AI system inputs | Source quality scores tracked; data stewards assigned per source; quality gates before new source onboarding |
15. Reference Implementations
AWS
- Connectors: AWS Glue with connectors for S3, RDS, Salesforce (via AppFlow), SharePoint (via custom Lambda)
- Normalisation: AWS Lambda (Python) + Apache Tika on Fargate
- Vector store: Amazon OpenSearch with k-NN; source_id as metadata field
- Conflict resolution: Lambda function + DynamoDB for resolution rules
Azure
- Connectors: Azure Logic Apps + Microsoft Graph API; Azure Data Factory for DB sources
- Normalisation: Azure Functions + Azure AI Document Intelligence
- Vector store: Azure AI Search with multi-index federation or unified index with source facets
- Conflict resolution: Azure Functions with Cosmos DB for conflict log
GCP
- Connectors: Cloud Run jobs + Pub/Sub for event-driven; Dataform for DB exports
- Normalisation: Cloud Run + Document AI
- Vector store: Vertex AI Vector Search, or AlloyDB with pgvector and a source_id column
- Conflict resolution: Cloud Functions + Firestore for resolution log
16. Related Patterns
| Pattern ID | Pattern Name | Relationship |
|---|---|---|
| EAAPL-RAG001 | Enterprise RAG | Foundation; RAG002 extends the ingestion and retrieval layers |
| EAAPL-RAG003 | Secure RAG | Complementary; ACL enforcement is essential in multi-source contexts |
| EAAPL-RAG004 | Federated RAG | Alternative for data sovereignty; RAG002 centralises, RAG004 distributes |
| EAAPL-KNW002 | Semantic Data Layer | Provides the ontology used by the Semantic Normaliser |
| EAAPL-KNW003 | AI Knowledge Corpus Management | Governs the corpus across all connected sources |
| EAAPL-KNW006 | Corpus Quality Assurance | Applies per-source quality gates before ingestion |
17. Maturity Assessment
Overall Maturity: Proven — Multi-source RAG is deployed in production at many enterprises; connector frameworks are mature; the primary operational challenge is ongoing connector maintenance and ACL sync.
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Technology Readiness | 4 | Core components mature; connector maintenance overhead is a persistent operational cost |
| Tooling Ecosystem | 4 | Airbyte, NiFi, and cloud-native connectors cover most enterprise sources; some sources still require custom connectors |
| Operational Guidance | 3 | Source authority weighting and conflict resolution are organisation-specific; less standardised than single-source RAG |
| Security & Compliance | 4 | Cross-source ACL enforcement is well-understood; implementation complexity is high |
| Scalability Evidence | 3 | Unified index scales well; connector maintenance at 20+ sources is operationally demanding |
| Cost Predictability | 3 | Connector maintenance costs are highly variable by source system complexity |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-03-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-06-15 | EAAPL Working Group | Added source relevance weighting; conflict resolution strategies formalised |
| 1.2 | 2024-10-20 | EAAPL Working Group | ACL normaliser architecture added; cross-source security section expanded |
| 1.3 | 2025-03-10 | EAAPL Working Group | Updated reference implementations; added semantic normalisation component |