[EAAPL-RAG002] Multi-Source Retrieval-Augmented Generation

Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Heterogeneous Source Integration
Version: 1.3
Maturity: Proven
Tags: rag, multi-source, source-connectors, normalisation, attribution, conflict-resolution, federation
Regulatory Relevance: APRA CPS 230, Privacy Act 1988, ISO/IEC 42001 Section 6.1, NIST AI RMF (Map 1.5)

1. Executive Summary

Enterprise knowledge is never stored in a single system. Policy documents live in SharePoint, technical runbooks in Confluence, customer data in Salesforce, financial reports in ERP exports, and operational metrics in data warehouses. Multi-Source RAG extends the foundational RAG pattern (EAAPL-RAG001) to orchestrate retrieval across these heterogeneous sources simultaneously, normalise their disparate formats into a unified embedding space, attribute answers to their authoritative origin, and resolve conflicts when sources contradict each other.

For enterprise leaders, this pattern eliminates the organisational friction of designating a "single system of record" before deploying AI assistants. Instead, the system weights sources by their configured authority for the query context, surfacing the most relevant answer regardless of where it lives. The business outcome is a knowledge assistant that answers questions by drawing simultaneously on policy, product, operational, and people data: the equivalent of asking the most informed person in the organisation, who happens to have read everything. Pilot deployments in financial services and government have demonstrated 35–50% reductions in time-to-answer for cross-functional knowledge queries, with source attribution enabling compliance teams to audit the basis of every answer.

2. Problem Statement

Business Problem

Enterprise knowledge is siloed across dozens of systems maintained by different teams with different governance models. A question as straightforward as "What is our current refund policy for corporate clients?" may require consulting a CRM record, a policy document in SharePoint, a product team's Confluence page, and a legal team's approved exceptions list — across four systems with four different search interfaces.

Technical Problem

Different source systems produce documents in incompatible formats (PDF, HTML, JSON, database rows, Markdown), with incompatible metadata schemas, at different refresh rates, with different access control models. A naive multi-source approach that concatenates results from independent keyword searches produces low-quality, unranked context with no cross-source relevance normalisation. When sources disagree (e.g., a policy document and a Confluence page cite different values for the same policy parameter), the system has no principled mechanism to resolve the conflict.

Symptoms

Users querying the AI assistant receive answers drawn only from one source system even when better information exists elsewhere
Source attribution in responses is absent or incorrect, making audit trails unreliable
Contradictory answers are generated when the same question is asked repeatedly, sourced from different systems on different runs
Integration teams spend weeks building a bespoke connector for each new data source because no reusable connector framework exists

Cost of Inaction

Incomplete answers leading to incorrect operational decisions or customer communications
Connector fragmentation: each new source system requires bespoke integration, with no amortisation across the knowledge platform
Inability to unify enterprise search into a single interface — users maintain separate search habits for each system

3. Context

When to Apply

Organisations with knowledge spread across 3+ source systems that users need to query simultaneously
Deployments where the authoritative source of truth varies by document type (legal: SharePoint; operational: Confluence; customer: CRM)
Scenarios where source attribution and conflict resolution are required for compliance
Enterprise search modernisation programmes replacing multi-system search with a unified AI-powered interface

When NOT to Apply

Single-source deployments (use EAAPL-RAG001)
Sources are geographically or organisationally distributed with data sovereignty constraints (use EAAPL-RAG004 Federated RAG)
Real-time data sources with sub-minute freshness requirements (use EAAPL-RAG006 Streaming RAG)
Sources contain exclusively structured relational data (use SQL-generation or Graph RAG patterns)

Prerequisites

API access or connector availability for all target source systems
Unified identity model enabling ACL resolution across source systems (e.g., Azure AD groups that map to permissions in each source)
Metadata schema alignment work to identify common fields (document type, effective date, owner, classification) across sources
Source governance: each source system must have a designated data owner responsible for quality and freshness

Industry Applicability

Industry	Source Systems Typically Integrated	Primary Benefit
Financial Services	Core banking exports, compliance manuals (SharePoint), risk frameworks (Confluence), regulatory filings (PDF archive)	Single-pane regulatory Q&A across all documentation
Healthcare	Clinical guidelines (PDF), pharmacy formulary (EHR export), policy manuals (SharePoint), research publications (PubMed API)	Clinician decision support across all knowledge domains
Government	Legislation (PDF), internal policy (SharePoint), operational procedures (Confluence), case management notes (TRIM)	Unified public servant knowledge assistant
Retail	Product catalogue (PIM API), supplier agreements (SharePoint), customer FAQs (CMS), warranty terms (PDF)	Customer service automation with authoritative sourcing
Technology	Developer docs (Confluence/GitLab wikis), runbooks (Notion/Markdown), API specifications (OpenAPI), issue history (Jira exports)	Integrated developer knowledge assistant

4. Architecture Overview

Multi-Source RAG introduces a Source Abstraction Layer and a Normalisation Engine between raw source connectors and the unified embedding pipeline. These two components are the architectural differentiators from single-source RAG.

Source Connector Framework

Rather than building bespoke connectors, the pattern mandates a connector framework that abstracts each source into a standard document event model: {document_id, source_system, content_raw, metadata, acl_principals, fetched_at, content_hash}. Connectors are classified as pull (polling schedule), push (webhook/event stream), or on-demand (query-time API call). The connector framework handles authentication, rate limiting, error handling, and delta detection (comparing content hashes to avoid re-ingesting unchanged documents).
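
The document event model and hash-based delta detection described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the field names come from the event model in the text, while `make_event`, `needs_ingestion`, the choice of SHA-256, and the in-memory `seen_hashes` store are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DocumentEvent:
    """Canonical event emitted by every connector (fields from the pattern text)."""
    document_id: str
    source_system: str
    content_raw: bytes
    metadata: dict
    acl_principals: list
    fetched_at: datetime
    content_hash: str


def make_event(document_id, source_system, content_raw, metadata, acl_principals):
    """Build an event; hashing with SHA-256 is an assumption, any stable hash works."""
    return DocumentEvent(
        document_id=document_id,
        source_system=source_system,
        content_raw=content_raw,
        metadata=metadata,
        acl_principals=acl_principals,
        fetched_at=datetime.now(timezone.utc),
        content_hash=hashlib.sha256(content_raw).hexdigest(),
    )


def needs_ingestion(event, seen_hashes):
    """Delta detection: True only when the hash differs from the last ingested one.

    seen_hashes is a hypothetical in-memory stand-in for a persistent store,
    keyed by (source_system, document_id).
    """
    key = (event.source_system, event.document_id)
    if seen_hashes.get(key) == event.content_hash:
        return False
    seen_hashes[key] = event.content_hash
    return True
```

Keying the hash store on both source system and document ID keeps identical IDs from different systems from masking each other's changes.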

Each source system is assigned a source profile: a configuration object capturing the source's authority score by document type, the metadata mapping from source schema to canonical schema, the ACL mapping from source permissions to enterprise identity groups, and the refresh schedule. Source profiles are stored in a configuration registry, not hardcoded, enabling new sources to be onboarded without code changes.
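
As a rough sketch of what a source profile might look like as data, the example below models the four elements named above and applies the metadata mapping to one record. The class name, field names, cron-style schedule, and sample values are illustrative assumptions; the "Modified By" and "Last Modified" mappings are the ones used in the Data Flow section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceProfile:
    """Hypothetical shape of one registry entry."""
    source_id: str
    authority_by_doc_type: dict  # e.g. {"policy": 1.0, "runbook": 0.4}
    metadata_mapping: dict       # source field name -> canonical field name
    acl_mapping: dict            # source permission -> enterprise identity group
    refresh_schedule: str        # cron expression (format is an assumption)


def map_metadata(profile, source_metadata):
    """Return (canonical_record, unmapped_fields) for one source document."""
    canonical, unmapped = {}, []
    for name, value in source_metadata.items():
        target = profile.metadata_mapping.get(name)
        if target is None:
            unmapped.append(name)  # keep for schema-mapping review
        else:
            canonical[target] = value
    return canonical, unmapped


# Example values are invented for illustration.
SHAREPOINT = SourceProfile(
    source_id="sharepoint-compliance",
    authority_by_doc_type={"policy": 1.0},
    metadata_mapping={"Modified By": "owner", "Last Modified": "effective_date"},
    acl_mapping={"Compliance Readers": "grp-compliance-read"},
    refresh_schedule="0 * * * *",
)
```

Returning the unmapped fields instead of raising lets ingestion proceed with partial metadata while the unknown fields are flagged for review.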

Normalisation Engine

Raw documents from different sources arrive in incompatible formats. The normalisation engine applies format-specific parsers (PDF via Apache Tika, HTML via BeautifulSoup, JSON via schema mapping, Markdown via AST parser) to produce a canonical document structure. Critically, normalisation also includes semantic normalisation: the engine identifies domain-specific terminology that differs across sources (e.g., "customer" in CRM vs. "client" in legal documents vs. "policyholder" in insurance) and maps them to canonical concepts. This enables cross-source retrieval of semantically equivalent content regardless of surface form.
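
The dictionary-mapping variant of semantic normalisation could look roughly like the sketch below. The term table is a hypothetical stand-in for an enterprise ontology, and the function handles only whole words with an optional plural "s"; a production normaliser would need more careful linguistic handling.

```python
import re

# Hypothetical mapping of source-specific terms to one canonical concept.
CANONICAL_TERMS = {"client": "customer", "policyholder": "customer"}

# Whole-word match with an optional trailing "s", case-insensitive.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CANONICAL_TERMS)) + r")(s?)\b",
    re.IGNORECASE,
)


def normalise_terms(text):
    """Replace known synonyms with the canonical term, keeping a plural 's'.

    The replacement is always lower case; original casing is not preserved.
    """
    def _swap(match):
        return CANONICAL_TERMS[match.group(1).lower()] + match.group(2)

    return _PATTERN.sub(_swap, text)
```

For example, `normalise_terms("The client and two Policyholders agreed.")` yields `"The customer and two customers agreed."`, while a word such as "clientele" is left alone because the pattern requires a word boundary.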

Unified Embedding Space

All normalised chunks are embedded using a single embedding model, producing a unified vector space where semantic similarity is meaningful across source provenance. This is why embedding model lock-in is more consequential in multi-source deployments: different embedding models produce incomparable vector spaces, so the model must be fixed across all sources and changed atomically.

Source provenance (which system the chunk came from) is stored as metadata, not encoded in the embedding. This separation allows retrieval to be source-agnostic (finding the most semantically relevant content regardless of source) while enabling post-retrieval source filtering and attribution.
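
To illustrate the separation between embeddings and provenance, the sketch below ranks chunks purely by vector similarity and consults `source_id` only as a metadata filter. The brute-force `search` function is a stand-in for a vector database's ANN query, and the two-dimensional vectors are toy values chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vector, top_k=3, source_ids=None):
    """Brute-force stand-in for ANN search.

    Ranking uses vectors only; source_id is metadata and acts purely as an
    optional filter, so the same index serves scoped and unscoped queries.
    """
    candidates = [
        c for c in index if source_ids is None or c["source_id"] in source_ids
    ]
    ranked = sorted(
        candidates, key=lambda c: cosine(c["vector"], query_vector), reverse=True
    )
    return ranked[:top_k]


# Toy index; vectors and IDs are invented for illustration.
INDEX = [
    {"chunk_id": "c1", "source_id": "sharepoint", "vector": [1.0, 0.0]},
    {"chunk_id": "c2", "source_id": "confluence", "vector": [0.9, 0.1]},
    {"chunk_id": "c3", "source_id": "crm", "vector": [0.0, 1.0]},
]
```

Because provenance never enters the vector, adding a source scope changes which chunks are eligible but not how the remaining chunks are ordered.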

Source-Weighted Retrieval

After the initial ANN retrieval returns top-K candidates from across all sources, a source relevance weighting step re-scores candidates by combining the semantic similarity score with a source authority weight for the query type. For example, a query about regulatory compliance would upweight chunks from the compliance SharePoint library and downweight chunks from informal Confluence pages, even if the Confluence page has a marginally higher embedding similarity. Authority weights are configured per-query-type in the source profiles and can be overridden by explicit query metadata (e.g., user specifying "search only official policy documents").
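
The pattern does not prescribe how similarity and authority are combined, so the sketch below assumes a simple linear blend controlled by a hypothetical `alpha` parameter. The weight table, default weight, and sample scores are likewise illustrative.

```python
def rescore(candidates, authority_weights, query_type, alpha=0.3, default_weight=0.5):
    """Blend semantic similarity with source authority for the query type.

    score = (1 - alpha) * similarity + alpha * authority
    The linear blend and the alpha value are assumptions, not part of the pattern.
    authority_weights maps (source_id, query_type) -> weight in [0, 1].
    """
    rescored = []
    for c in candidates:
        authority = authority_weights.get((c["source_id"], query_type), default_weight)
        score = (1 - alpha) * c["similarity"] + alpha * authority
        rescored.append({**c, "authority": authority, "score": score})
    return sorted(rescored, key=lambda c: c["score"], reverse=True)


# Invented weights: the compliance library outranks informal wiki pages
# for regulatory questions.
WEIGHTS = {
    ("sharepoint-compliance", "regulatory"): 1.0,
    ("confluence", "regulatory"): 0.3,
}

CANDIDATES = [
    {"chunk_id": "wiki-1", "source_id": "confluence", "similarity": 0.82},
    {"chunk_id": "policy-1", "source_id": "sharepoint-compliance", "similarity": 0.80},
]
```

With these numbers the SharePoint chunk scores 0.7 × 0.80 + 0.3 × 1.0 = 0.86 and the Confluence chunk 0.7 × 0.82 + 0.3 × 0.3 = 0.664, so the authoritative source ranks first despite its marginally lower similarity.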

Conflict Resolution

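When retrieved chunks from different sources make contradictory claims, the conflict resolver applies the deterministic strategy configured for that source pair and document type, falling back to explicit surfacing when no rule decides the outcome: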

Recency-first: if effective dates differ, the more recent document's claim takes precedence
Authority-first: if one source is designated authoritative for the domain, it takes precedence regardless of date
Explicit conflict surfacing: when neither rule resolves the conflict, the system surfaces both claims to the user with source attribution, explicitly noting the discrepancy ("Policy document A states X; Confluence page B states Y.") rather than silently choosing one

The conflict resolver is configurable per source-pair and per document type, allowing organisations to encode their own authority hierarchies.
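
A rule-based resolver along these lines might be sketched as follows. The `Claim` type, the strategy names, and the wording of the notice are assumptions for illustration; detecting that two chunks actually contradict each other is out of scope here and is reduced to comparing already-extracted values.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Claim:
    """Hypothetical extracted claim: one value asserted by one source."""
    source_id: str
    value: str
    effective_date: Optional[date] = None


def resolve(a, b, strategy, authoritative_source=None):
    """Return (winner, rule). winner is None when the conflict must be surfaced."""
    if a.value == b.value:
        return a, "no_conflict"
    if strategy == "authority_first" and authoritative_source in (a.source_id, b.source_id):
        winner = a if a.source_id == authoritative_source else b
        return winner, "authority_first"
    if (
        strategy == "recency_first"
        and a.effective_date
        and b.effective_date
        and a.effective_date != b.effective_date
    ):
        winner = a if a.effective_date > b.effective_date else b
        return winner, "recency_first"
    return None, "unresolved"


def conflict_notice(a, b):
    """Text shown to the user when no rule resolves the conflict."""
    return (
        f"{a.source_id} states {a.value}; {b.source_id} states {b.value}. "
        "These sources disagree."
    )
```

Returning the rule name alongside the winner gives the caller what it needs to write an audit entry, and the `unresolved` outcome makes it hard to pick a side by accident.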

5. Architecture Diagram

flowchart TD
    subgraph Sources["Source Systems"]
        A[SharePoint / Confluence]
        B[CRM / APIs / DB]
    end
    subgraph Normalisation["Normalisation + Ingestion"]
        C[Connector + Parser]
        D[Unified Vector Store]
    end
    subgraph Query["Query Pipeline"]
        E[User Query]
        F[ACL Pre-filter]
        G[Source Weight + Conflict Resolve]
        H[LLM + Citations]
    end
    A --> C
    B --> C
    C -->|embed + index| D
    E --> F --> D
    D --> G --> H --> E
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#dbeafe,stroke:#3b82f6
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Source Connector Framework	Integration Platform	Standardise document fetch from all source systems into canonical event model	Apache NiFi, Airbyte, custom Python framework, Azure Data Factory	Critical
Source Profile Registry	Configuration	Store per-source authority weights, metadata mappings, ACL mappings, refresh schedules	PostgreSQL, DynamoDB, Consul KV	High
Format Parser	Data Processing	Convert raw bytes (PDF, HTML, JSON, XML, Markdown) to plain text + structure	Apache Tika, unstructured.io, pypdf (successor to PyPDF2), BeautifulSoup	High
Semantic Normaliser	NLP	Map source-specific terminology to canonical enterprise ontology	Custom dictionary mapping, spaCy entity linker, ontology lookup (EAAPL-KNW001)	Medium
Metadata Normaliser	Data Processing	Map source schema fields to canonical metadata schema	Custom Python mapper, dbt transformations	High
ACL Normaliser	Security	Translate source-system permissions to enterprise identity groups	Custom RBAC mapping layer, Azure AD group resolver	Critical
Chunking Engine	Data Processing	Apply source-appropriate chunking strategy	LlamaIndex, LangChain, custom source-aware splitters	High
Embedding Model	ML Inference	Produce unified embedding vectors for all sources	OpenAI, Vertex AI, BAAI bge (same model for ALL sources)	Critical
Vector Database	Storage	Unified vector index with source metadata	Pinecone, Weaviate, pgvector, OpenSearch	Critical
Source Relevance Weighter	Ranking	Re-score retrieved chunks by source authority for the query type	Custom Python scoring layer	High
Conflict Resolver	Business Logic	Detect and resolve contradictions between sources	Custom rule engine + LLM-assisted conflict detection	High
Citation + Attribution Layer	Post-processing	Generate per-chunk source labels and document links	Custom formatter with deep-link generation	High

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Source Connector	Detect new/modified document in source system (poll or webhook)	Document bytes + source metadata
2	Format Parser	Convert raw bytes to plain text; extract structural metadata (headings, tables)	Normalised text + structural metadata
3	Semantic Normaliser	Map domain terms to canonical vocabulary via ontology lookup	Text with resolved canonical terminology
4	Metadata Normaliser	Map source-schema fields (e.g., "Modified By" → `owner`, "Last Modified" → `effective_date`)	Canonical metadata record
5	ACL Normaliser	Resolve source access permissions to enterprise identity groups	`{acl_groups: [...], acl_users: [...]}`
6	Chunking Engine	Apply source-appropriate chunking; retain source ID and doc ID per chunk	Chunks with `{chunk_id, doc_id, source_id, content, metadata, acl}`
7	Embedding Model	Embed each chunk	Dense vector per chunk
8	Vector DB	Upsert vector with full metadata payload including source ID	Indexed vector entry
9	User	Submit query (optionally including source scope hint: "search only from compliance docs")	Query + optional source hint
10	Query Processor	Extract source hints; expand query; decompose if compound	Enhanced query + source filter params
11	ACL Filter	Resolve user's cross-source permissions; build per-source ACL filter	Combined metadata filter
12	ANN Retrieval	Search unified vector index with ACL + optional source scope filter	Top-K candidates across all sources
13	Source Relevance Weighter	Apply source authority weights per query type; re-score candidates	Re-weighted candidate list
14	Conflict Resolver	Detect contradictions between candidates from different sources; apply resolution rule	Deduplicated, conflict-annotated candidate list
15	Context Assembler	Construct prompt with source labels on each chunk	Assembled prompt with provenance labels
16	LLM	Generate answer citing source labels	Raw response with citation markers
17	Citation Layer	Map citation markers to deep-links for each source system	Final answer with clickable source links
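
Steps 15 to 17 can be illustrated with a small sketch that labels chunks for the prompt and later turns citation markers into links. The `[S1]` marker format, the Markdown-style link output, and the `title` and `url` chunk fields are assumptions made for the example.

```python
import re


def assemble_context(chunks):
    """Label each chunk [S1], [S2], ... and return (context_text, label_to_chunk)."""
    labels = {}
    lines = []
    for i, chunk in enumerate(chunks, start=1):
        label = f"S{i}"
        labels[label] = chunk
        lines.append(f"[{label}] ({chunk['source_id']}) {chunk['content']}")
    return "\n".join(lines), labels


def link_citations(answer, labels):
    """Replace [S<n>] markers with a link; unknown markers are left untouched."""
    def _link(match):
        chunk = labels.get(match.group(1))
        if chunk is None:
            return match.group(0)  # never invent a source for a stray marker
        return f"[{chunk['title']}]({chunk['url']})"

    return re.sub(r"\[(S\d+)\]", _link, answer)
```

Leaving unknown markers untouched makes a hallucinated citation visible in the output, which can then be caught by attribution sampling, instead of silently attaching it to a real document.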

Error Flow

Error Condition	Detection	Recovery
Source system unavailable during ingestion	Connector health check failure	Retry with exponential backoff; continue with other sources; log staleness for affected source
Metadata mapping failure (unknown field in source schema)	Schema validation error	Ingest with partial metadata; flag for schema mapping review; do not block retrieval
Conflict resolution cannot determine authoritative source	Conflict resolver returns UNRESOLVED	Surface both claims to user with explicit conflict notice; do not silently pick one
Source scope hint references unknown source	Query processor validation	Ignore unknown source hint; warn user; proceed with all accessible sources
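
The first recovery path in the table (retry with exponential backoff, continue with other sources, record staleness) might be sketched as below. The function names, the attempt count, and the use of `ConnectionError` as the failure signal are assumptions; real connectors would typically add jitter and distinguish retryable from permanent errors.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on ConnectionError with delays of 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller mark the source stale
            sleep(base_delay * 2 ** attempt)


def ingest_all(connectors, sleep=time.sleep):
    """Run every connector; one failing source never blocks the others.

    connectors: {source_id: zero-argument callable returning documents}.
    Returns (results, stale_sources).
    """
    results, stale = {}, []
    for source_id, fetch in connectors.items():
        try:
            results[source_id] = fetch_with_backoff(fetch, sleep=sleep)
        except ConnectionError:
            stale.append(source_id)  # surfaced later as source staleness
    return results, stale
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting.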

8. Security Considerations

Cross-Source ACL Enforcement

The most critical security requirement in multi-source RAG is ensuring that a user cannot retrieve content from source system B via the RAG interface when they lack direct access to source system B. The ACL Normaliser maps all source-system permissions to enterprise identity groups at ingestion time, and the pre-retrieval filter enforces these mappings at query time. The ACL mapping must be kept in sync with source-system permission changes: a scheduled ACL re-sync job must run at least daily, and more frequently for security-classified sources.
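
A deny-by-default pre-retrieval filter could be sketched as follows, using the `acl_groups` and `acl_users` fields produced by the ACL Normaliser in the Data Flow section. The function names and the in-memory filtering are illustrative assumptions; in practice the same condition would be expressed as a metadata filter in the vector database query.

```python
def build_acl_filter(user_id, user_groups):
    """Hypothetical filter: the identities a chunk's ACL must intersect."""
    return {"acl_users": {user_id}, "acl_groups": set(user_groups)}


def is_visible(chunk, acl_filter):
    """A chunk is visible only if its ACL names the user or one of their groups."""
    acl = chunk["acl"]
    return bool(
        acl_filter["acl_users"] & set(acl.get("acl_users", []))
        or acl_filter["acl_groups"] & set(acl.get("acl_groups", []))
    )


def prefilter(chunks, acl_filter):
    """Drop non-visible chunks BEFORE similarity search (deny by default)."""
    return [c for c in chunks if is_visible(c, acl_filter)]
```

A chunk with an empty ACL matches nobody, so a failed ACL mapping at ingestion results in missing answers instead of leaked content.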

Data Classification Propagation

If a CRM record carries a CONFIDENTIAL classification, that classification must propagate through normalisation, chunking, embedding, and retrieval to the final response. The assembled context window classification must be the maximum of all included chunk classifications.
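
The "maximum of all included chunk classifications" rule is easy to state in code. The ordering of levels below is an assumption and should be replaced with the organisation's own classification scheme.

```python
# Ordered from least to most sensitive. The level names are an assumption.
LEVELS = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"]


def context_classification(chunks):
    """Return the highest classification among the chunks in the context window.

    An unknown label raises ValueError so it is never treated as low sensitivity.
    An empty context is given the lowest level.
    """
    ranks = [LEVELS.index(c["classification"]) for c in chunks]
    return LEVELS[max(ranks)] if ranks else LEVELS[0]
```

For example, a context built from one INTERNAL policy chunk and one CONFIDENTIAL CRM chunk is classified CONFIDENTIAL as a whole.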

OWASP LLM Top 10 Mitigations

OWASP LLM Risk	Multi-Source Specific Concern	Mitigation
LLM01: Prompt Injection	Adversarial content injected into one source system to manipulate RAG outputs via that source's documents	Input sanitisation per connector; treat all retrieved content as untrusted data; system prompt injection guard
LLM06: Sensitive Information Disclosure	User sees content from source they lack direct access to via cross-source retrieval	ACL normaliser + per-source metadata filter in vector search; test with users who have partial source access
LLM09: Overreliance	User assumes all sources have been searched; system silently skips an unavailable source	Surface source availability status in response metadata; indicate when a source was unavailable during retrieval

9. Governance Considerations

Source Authority Governance

Each source system must have a designated data steward who owns the source profile configuration, including authority weights and metadata mappings. Changes to authority weights require approval from the data steward and an AI governance representative, because they change which answers users receive.

Conflict Resolution Audit Trail

Every conflict resolution decision must be logged: which sources conflicted, which resolution rule was applied, and what decision was made. This log is a governance artefact for auditors reviewing the basis of AI-generated answers.
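
One possible shape for such a log entry is a JSON line per decision, as sketched below. The field names and the `query_id` correlation key are assumptions; the three logged facts (conflicting sources, rule applied, decision) are the ones required above.

```python
import json
from datetime import datetime, timezone


def audit_record(query_id, sources, rule, decision, now=None):
    """Serialise one conflict resolution decision as a JSON line.

    Field names are hypothetical. `decision` is the winning source ID, or
    "surfaced" when both claims were shown to the user.
    """
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "query_id": query_id,
            "conflicting_sources": sorted(sources),
            "rule_applied": rule,
            "decision": decision,
        },
        sort_keys=True,
    )
```

Sorting the keys and the source list keeps entries stable, which makes the log easier to diff and to aggregate into a conflict rate metric.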

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Source Inventory	Data Governance	Continuous	Track all connected sources, their data stewards, refresh schedules, and authority weights
Conflict Resolution Log	AI Operations	Per event	Audit trail of every conflict detected and resolution applied
Cross-Source ACL Mapping Matrix	Security	Monthly review	Document which enterprise identity groups map to permissions in each source system
Source Freshness Dashboard	AI Operations	Daily	Monitor ingestion recency per source; flag stale sources
Attribution Accuracy Sample	Quality Assurance	Quarterly	Random sample of 50 responses; verify that cited sources actually contain the cited content

10. Operational Considerations

Monitoring

Metric	Alert Threshold	Notes
Source connector availability (per source)	< 99% over 1 hour	Alert data steward for affected source
Cross-source conflict rate	> 10% of multi-source queries	May indicate data quality issues or outdated documents in one source
Source staleness (hours since last successful sync)	Tier 1: > 4h; Tier 2: > 24h	Per-source SLA based on source criticality
Answer attribution accuracy (sampled)	< 90%	Trigger attribution pipeline review
ACL mapping sync lag	> 1 hour behind source system	Security risk; immediate alert
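
The tiered staleness thresholds in the table translate directly into a check like the one below. The thresholds are taken from the table; the function name and the input dictionaries are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Thresholds from the monitoring table: Tier 1 > 4h, Tier 2 > 24h.
STALENESS_LIMIT = {1: timedelta(hours=4), 2: timedelta(hours=24)}


def stale_sources(last_sync, tiers, now):
    """Return IDs of sources whose last successful sync exceeds their tier limit.

    last_sync: {source_id: datetime of last successful sync}
    tiers:     {source_id: 1 or 2}
    """
    return sorted(
        source_id
        for source_id, synced in last_sync.items()
        if now - synced > STALENESS_LIMIT[tiers[source_id]]
    )
```

A source that last synced five hours ago is therefore flagged if it is Tier 1 but not if it is Tier 2.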

Service Level Objectives

SLO	Target	Window
Multi-source query response P95	≤ 3 seconds	Rolling 7-day
All Tier 1 sources available simultaneously	≥ 99.5%	Monthly
Source conflict resolution correctness	≥ 95% (sampled)	Quarterly

Disaster Recovery

Component	RTO	RPO	DR Strategy
Source Connector Framework	2 hours	N/A (re-runnable)	Infrastructure-as-code; connector config in version control
Unified Vector Index	1 hour	1 hour	Cross-region replica; snapshot to object storage
Source Profile Registry	30 minutes	1 hour	Multi-AZ database with point-in-time recovery

11. Cost Considerations

Cost Drivers

Cost Driver	Notes	Optimisation
Connector maintenance	Each source connector requires ongoing engineering for API changes	Invest in a connector framework (e.g., Airbyte) with community connectors
Re-ingestion on source schema change	Schema changes in source systems require partial or full re-ingestion	Schema version pinning; versioned metadata mappings
Cross-source retrieval latency	Unified index avoids per-source query fan-out; lower cost than federated alternatives	Cache embeddings for high-frequency source combinations
Conflict resolution LLM calls	When rule-based resolution fails, an LLM call resolves the conflict	Rate-limit conflict resolution LLM calls; cache resolution decisions for known conflicts

Indicative Cost Range

Deployment Scale	Monthly Cost Range	Dominant Cost Factor
5 sources, < 5M vectors	$1,500 – $5,000	Connector maintenance, LLM generation
10 sources, 5M–50M vectors	$5,000 – $25,000	Vector DB hosting, embedding re-ingestion
20+ sources, > 50M vectors	$20,000 – $100,000	Connector operations, index management

12. Trade-Off Analysis

Source Integration Approach

Option	Recall Quality	Integration Complexity	Source Isolation	Recommended For
Unified index (all sources → one vector DB)	Highest	High upfront, low ongoing	None (ACL-enforced)	Most enterprise deployments
Federated search (query each source index independently)	Moderate (score normalisation challenge)	Low per-source	Full	Data sovereignty constraints
Hybrid (unified for main sources + federated for sensitive sources)	High	High	Partial	Regulated industries with mixed sensitivity

Conflict Resolution Strategy

Strategy	Correctness	Transparency	Complexity	When to Use
Recency-first	High for time-sensitive domains	Low (hidden from user)	Low	Policy updates, product specs
Authority-first	High for governance domains	Low (hidden from user)	Medium	Legal, compliance documents
Explicit conflict surfacing	Highest	Highest	High	Regulated decisions, ambiguous policy

Architectural Tensions

Tension	Trade-off	Recommendation
Unified embedding model vs. per-source optimised models	Unified: consistent search quality; per-source: potentially better recall within each domain	Unified model always; domain-specific fine-tuning only if benchmark shows >10% recall improvement
Real-time ACL sync vs. eventual consistency	Real-time: secure but operationally complex; eventual: simpler but creates ACL lag windows	Risk-tiered: security-classified sources sync within 15 minutes; others within 1 hour

13. Failure Modes

Failure Mode	Likelihood	Impact	Detection	Recovery
Source A's stale document answers a query that Source B's current document would answer better	High	Medium	Source freshness monitoring; answer quality degradation	Re-trigger ingestion; implement freshness-aware source weighting
ACL mapping desync (user gains access in source but mapping not updated)	Medium	Medium	ACL sync lag monitor	Automated daily ACL re-sync; real-time sync for security-classified sources
Conflicting answers presented as unified answer (silent conflict resolution)	Medium	High	Conflict rate monitoring; user feedback loop	Enforce explicit conflict surfacing for Tier 1 source contradictions
Source connector auth token expiry causing silent ingestion failure	High	Medium	Connector health monitoring; staleness alert	Automated token refresh; alert on connector auth failure
Semantic normalisation introduces incorrect term mapping	Low	High	Spot-check QA on normalised documents	Human review of ontology mapping changes; A/B test normalisation changes

14. Regulatory Considerations

Regulation	Requirement	Multi-Source RAG Response
Privacy Act 1988	Personal information from different source systems must not be cross-matched without consent	ACL enforcement prevents cross-source PII leakage; personal information in CRM restricted to CRM-authorised roles
APRA CPS 234	Access controls must reflect business requirements and be reviewed regularly	ACL mapping matrix reviewed monthly; cross-source permission review included in quarterly access certification
EU AI Act Article 13	Traceability of AI-generated content to its sources	Per-chunk source attribution in every response; conflict resolution decisions logged
ISO/IEC 42001	Data quality management across AI system inputs	Source quality scores tracked; data stewards assigned per source; quality gates before new source onboarding

15. Reference Implementations

AWS

Connectors: AWS Glue with connectors for S3, RDS, Salesforce (via AppFlow), SharePoint (via custom Lambda)
Normalisation: AWS Lambda (Python) + Apache Tika on Fargate
Vector store: Amazon OpenSearch with k-NN; source_id as metadata field
Conflict resolution: Lambda function + DynamoDB for resolution rules

Azure

Connectors: Azure Logic Apps + Microsoft Graph API; Azure Data Factory for DB sources
Normalisation: Azure Functions + Azure AI Document Intelligence
Vector store: Azure AI Search with multi-index federation or unified index with source facets
Conflict resolution: Azure Functions with Cosmos DB for conflict log

GCP

Connectors: Cloud Run jobs + Pub/Sub for event-driven; Dataform for DB exports
Normalisation: Cloud Run + Document AI
Vector store: Vertex AI Vector Search; AlloyDB pgvector with source_id column
Conflict resolution: Cloud Functions + Firestore for resolution log

16. Related Patterns

Pattern ID	Pattern Name	Relationship
EAAPL-RAG001	Enterprise RAG	Foundation; RAG002 extends the ingestion and retrieval layers
EAAPL-RAG003	Secure RAG	Complementary; ACL enforcement is essential in multi-source contexts
EAAPL-RAG004	Federated RAG	Alternative for data sovereignty; RAG002 centralises, RAG004 distributes
EAAPL-KNW002	Semantic Data Layer	Provides the ontology used by the Semantic Normaliser
EAAPL-KNW003	AI Knowledge Corpus Management	Governs the corpus across all connected sources
EAAPL-KNW006	Corpus Quality Assurance	Applies per-source quality gates before ingestion

17. Maturity Assessment

Overall Maturity: Proven — Multi-source RAG is deployed in production at many enterprises; connector frameworks are mature; the primary operational challenge is ongoing connector maintenance and ACL sync.

Dimension	Score (1–5)	Rationale
Technology Readiness	4	Core components mature; connector maintenance overhead is a persistent operational cost
Tooling Ecosystem	4	Airbyte, NiFi, and cloud-native connectors cover most enterprise sources; some sources still require custom connectors
Operational Guidance	3	Source authority weighting and conflict resolution are organisation-specific; less standardised than single-source RAG
Security & Compliance	4	Cross-source ACL enforcement is well-understood; implementation complexity is high
Scalability Evidence	3	Unified index scales well; connector maintenance at 20+ sources is operationally demanding
Cost Predictability	3	Connector maintenance costs are highly variable by source system complexity

18. Revision History

Version	Date	Author	Changes
1.0	2024-03-01	EAAPL Working Group	Initial publication
1.1	2024-06-15	EAAPL Working Group	Added source relevance weighting; conflict resolution strategies formalised
1.2	2024-10-20	EAAPL Working Group	ACL normaliser architecture added; cross-source security section expanded
1.3	2025-03-10	EAAPL Working Group	Updated reference implementations; added semantic normalisation component
