[EAAPL-RAG008] Multimodal Retrieval-Augmented Generation

Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Multimodal and Cross-Modal Retrieval
Version: 1.0
Maturity: Emerging
Tags: rag multimodal vision image-retrieval table-understanding cross-modal clip colpali document-intelligence
Regulatory Relevance: EU AI Act Article 10 (Data quality across modalities), ISO/IEC 42001 Section 6.1, NIST AI RMF (Map 1.5)

1. Executive Summary

Multimodal RAG extends the retrieval-augmented generation paradigm to knowledge corpora that include images, charts, diagrams, tables, and other non-text modalities alongside prose documents. Enterprise knowledge frequently exists in forms that standard text-based RAG cannot access: engineering diagrams, medical imaging reports, financial charts, product photographs, scanned contracts, and slide decks with data visualisations. Multimodal RAG enables users to query this knowledge in natural language and receive answers grounded in visual as well as textual evidence.

For enterprise leaders in engineering, healthcare, manufacturing, finance, and legal domains, Multimodal RAG unlocks a significant portion of the knowledge corpus that text-only RAG leaves inaccessible. A maintenance engineer asking "What does the valve assembly look like on model X?" needs a diagram, not a text description. A financial analyst asking "Show me the revenue trend chart from the Q3 investor presentation" needs the chart itself, retrieved and analysed, not a text summary of revenue figures. The pattern is emerging rather than mature — the enabling technologies (multimodal embedding models, vision-language models capable of grounded QA over retrieved images) are advancing rapidly but have not yet reached the operational reliability of text-only RAG. Early adopters with visual-heavy corpora should pilot carefully and plan for ongoing model upgrades.

2. Problem Statement

Business Problem

Enterprise document corpora are not exclusively textual. Technical manuals contain engineering diagrams. Financial reports contain charts. Contracts contain tables of fees and conditions. Training materials contain screenshots. Product documentation contains photographs. Text-only RAG silently ignores all of this visual content, leaving entire knowledge domains inaccessible to AI-assisted search and Q&A.

Technical Problem

Text embedding models cannot embed images. Vector similarity search over text embeddings cannot retrieve images by semantic query. Vision-language models capable of answering questions from image content exist but require grounded evidence retrieval — they cannot answer questions about a diagram without the diagram being present in context. The architecture must therefore solve two distinct problems: (1) how to retrieve the most relevant image/diagram/table for a given query, and (2) how to present it to the generation model in a form that enables grounded visual question answering.

Symptoms

AI assistant cannot answer questions about diagrams, charts, or photographs even though they exist in the knowledge corpus
Users receive text-only answers to questions that require visual evidence, and must manually locate the relevant diagram
RAG system quality evaluations show low recall on questions derived from figure captions, table contents, or diagram annotations
Users explicitly request "show me the diagram" or "retrieve the chart" and the system cannot comply

Cost of Inaction

Significant portions of the knowledge corpus remain unsearchable via AI, limiting the ROI of the RAG investment
Engineers, clinicians, and financial analysts must perform manual visual search in parallel with AI text search, duplicating effort
Competitive disadvantage as multimodal AI capabilities become standard in enterprise knowledge platforms

3. Context

When to Apply

Knowledge corpora where more than 10% of content value is in non-text form (images, diagrams, charts, tables)
Technical documentation with engineering diagrams, schematics, or photographs
Financial document Q&A where charts and tables are primary information carriers
Healthcare document Q&A over radiology reports, clinical diagrams, or pharmaceutical product photographs
Legal contracts with tabular terms, fee schedules, and signature pages

When NOT to Apply

Text-only corpora where all visual content is purely decorative (logos, page backgrounds)
Deployments with strict latency requirements where multimodal embedding and vision-language model inference adds unacceptable overhead
Organisations without a mature multimodal data ingestion and storage capability — text-only RAG should be deployed first

Prerequisites

A multimodal embedding model capable of embedding both text queries and image content in the same vector space (CLIP, ColPali, Nomic Embed Vision)
A vision-language model capable of grounded visual question answering (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet — all support image input)
An image/figure extraction pipeline (extracts figures and tables from PDFs and documents)
Object storage for image assets (referenced from the vector index by URL or object key)
A table understanding component (converts tables to structured JSON or markdown for LLM consumption)

Industry Applicability

Industry	Modality	Use Case
Engineering / Manufacturing	CAD diagrams, P&ID schematics, assembly photographs	"Show the assembly procedure for component X"
Financial Services	Revenue charts, balance sheet tables, trend graphs	"What does the FCF trend chart in the Q3 report show?"
Healthcare	Anatomical diagrams, procedure illustrations, drug formulary tables	"What does the surgical approach diagram for procedure Y look like?"
Legal	Contract tables (fee schedules, milestone lists), signature pages	"What are the termination fees in Table 3 of the MSA?"
Retail / E-commerce	Product photographs, size charts, packaging diagrams	"Show me the product dimensions chart for SKU X"
Architecture / Construction	Floor plans, elevation drawings, material schedules	"Show the floor plan for Level 3 of Building B"

4. Architecture Overview

Multimodal RAG introduces two new ingestion paths and one new retrieval path alongside the standard text pipeline. Understanding the distinct characteristics of each modality is essential to designing an effective architecture.

Multimodal Document Parsing

The ingestion pipeline must first extract non-text elements from documents. For PDFs, this requires a document parsing step that identifies figures, tables, and images and extracts them as separate assets. Tools such as Apache Tika, AWS Textract, Azure Document Intelligence, or Google Document AI can extract figures and tables from PDFs with varying quality. Each extracted asset is assigned a unique asset ID, a reference to its parent document, a page number, a caption (if present), and surrounding context text (the paragraphs immediately before and after the figure in the source document).
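
The asset metadata listed above can be sketched as a small record type. This is a minimal illustration, not a prescribed schema: the class name `ExtractedAsset`, its field names, and the `retrieval_text` helper are assumptions introduced here, and the sample values are invented.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ExtractedAsset:
    """One figure, table, or image extracted from a parent document (hypothetical schema)."""
    doc_id: str                # reference to the parent document
    page: int                  # page number in the source document
    kind: str                  # "figure", "table", or "image"
    caption: Optional[str]     # None when the source has no caption
    context_before: str        # paragraph immediately before the asset
    context_after: str         # paragraph immediately after the asset
    asset_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def retrieval_text(self) -> str:
        """Join caption and surrounding context, skipping empty parts."""
        parts = [self.caption or "", self.context_before, self.context_after]
        return "\n".join(p.strip() for p in parts if p.strip())


# Invented sample values for illustration only.
asset = ExtractedAsset(
    doc_id="manual-001",
    page=12,
    kind="figure",
    caption="Figure 4: Valve assembly",
    context_before="The valve assembly is shown below.",
    context_after="Torque each bolt in the order indicated.",
)
print(asset.retrieval_text())
```

Keeping the caption and context on the record means a caption-text fallback remains possible even when the image embedding is unavailable.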

Image Embedding Path

Each extracted image is embedded using a multimodal embedding model that produces vectors in the same semantic space as text embeddings. CLIP (Contrastive Language-Image Pre-training) is the canonical architecture: trained on image-text pairs, it produces comparable embeddings for text queries and images, enabling cross-modal retrieval ("text query → retrieve relevant images"). ColPali is an emerging alternative that produces multi-vector patch-level embeddings for higher-resolution document understanding.

Images are stored in object storage (S3, Azure Blob, GCS) and referenced by URL in the vector index. The vector index entry contains the image URL, the image's embedding, the parent document ID, the caption, and surrounding context text. Storing the full image bytes in the vector database is not recommended — only references.
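
A minimal sketch of an index entry that stores a reference rather than image bytes, together with a brute-force cosine search that stands in for the ANN index. The embeddings are tiny hand-written vectors, not real CLIP output, and the names `ImageIndexEntry`, `cosine`, and `search` are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ImageIndexEntry:
    image_id: str
    image_url: str              # reference into object storage, never the bytes
    embedding: Sequence[float]  # produced by the multimodal embedding model
    doc_id: str
    caption: str
    context_text: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec: Sequence[float], entries: List[ImageIndexEntry],
           k: int = 3) -> List[Tuple[float, ImageIndexEntry]]:
    """Exhaustive search; a production system would use an ANN index instead."""
    scored = [(cosine(query_vec, e.embedding), e) for e in entries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real embeddings.
entries = [
    ImageIndexEntry("img-1", "s3://bucket/doc-1/img-1.png", [0.9, 0.1, 0.0],
                    "doc-1", "Valve assembly", "See the assembly below."),
    ImageIndexEntry("img-2", "s3://bucket/doc-2/img-2.png", [0.0, 0.2, 0.9],
                    "doc-2", "Revenue trend", "Quarterly revenue chart."),
]
best_score, best_entry = search([1.0, 0.0, 0.1], entries, k=1)[0]
print(best_entry.image_id, best_entry.image_url)
```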

Table Understanding Path

Tables require a distinct handling path because they are structured data, not natural language. Table extraction (via document intelligence services) produces structured table representations. These are then converted to either Markdown table format (for LLM consumption) or JSON (for structured query interfaces). The table as a whole is embedded as a text embedding (of its Markdown representation) for retrieval, not as an image.
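
The Markdown conversion step might look like the following sketch. It assumes the extractor has already produced a header row and data rows; the function name `table_to_markdown` and the sample fee values are invented for illustration.

```python
from typing import Sequence


def table_to_markdown(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Render an extracted table as a Markdown table for embedding and LLM context."""

    def cell(value: object) -> str:
        # Escape pipes and flatten newlines so each row stays on one line.
        return str(value).replace("|", "\\|").replace("\n", " ").strip()

    lines = [
        "| " + " | ".join(cell(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        # Pad short rows and truncate long ones to match the header width.
        padded = (list(row) + [""] * len(header))[: len(header)]
        lines.append("| " + " | ".join(cell(v) for v in padded) + " |")
    return "\n".join(lines)


md = table_to_markdown(
    ["Milestone", "Fee"],
    [["Early termination", "10% of remaining value"],
     ["Late delivery", "0.5% per week"]],
)
print(md)
```

The resulting string is what would be passed to the text embedding model and later placed in the prompt.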

Cross-Modal Retrieval

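At query time, the user's text query is embedded with the multimodal embedding model to search the image index and, because table entries are stored as text embeddings of their Markdown, with the corresponding text embedding model to search the table index (and the text chunk index, if it was built with a text-only model). A query vector is only comparable with vectors produced by the same model. The searches run in parallel, the per-index rankings are merged (using RRF or a weighted combination), and the top-K results across all modalities are selected. The context assembler then constructs a multimodal prompt that includes both text chunks and images (base64-encoded or URL-referenced, depending on the VLM API).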
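
The RRF merge can be sketched as below. RRF uses only rank positions, so it tolerates indexes whose raw scores are on different scales. The constant `k = 60` is a commonly used default and an assumption here, as are the function name and the sample IDs.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(ranked_lists: Dict[str, Sequence[str]],
                           k: int = 60, top_k: int = 5) -> List[Tuple[str, float]]:
    """Merge per-modality rankings: score(item) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ids in ranked_lists.values():
        for rank, item_id in enumerate(ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:top_k]


# Each list is already ordered best-first by its own index (invented IDs).
fused = reciprocal_rank_fusion({
    "text": ["chunk-7", "chunk-2"],
    "image": ["img-3", "img-9"],
    "table": ["tbl-1"],
})
for item_id, score in fused:
    print(item_id, round(score, 5))
```

Because the three indexes hold disjoint items, plain RRF interleaves them by rank. A weighted variant (multiplying each list's contribution by a per-modality weight) is one way to favour a modality.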

Vision-Language Model for Grounded Visual QA

The generation step uses a vision-language model (VLM) that accepts both text and images in its context window. The VLM is instructed to answer the user's question based on the provided text and visual evidence, citing both text sources and image sources. The prompt structure places the visual evidence (images, tables) alongside the text chunks and explicitly asks the model to ground its response in the visual content when relevant.
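
One way to picture the prompt structure is as an ordered list of content parts. The part schema below (`type`, `text`, `media_type`, `data_base64`) is a provider-neutral assumption, not the request format of any specific VLM API, and would need mapping to the chosen provider's format. The citation labels are also invented for illustration.

```python
import base64
from typing import Dict, List, Sequence


def build_multimodal_prompt(question: str,
                            text_chunks: Sequence[str],
                            tables_md: Sequence[str],
                            images: Sequence[Dict[str, object]]) -> List[Dict[str, str]]:
    """Assemble instructions, evidence, and the question as ordered content parts.

    Each item in `images` is a dict with 'image_id' (str) and 'png_bytes' (bytes).
    """
    parts: List[Dict[str, str]] = [{
        "type": "text",
        "text": ("Answer using only the evidence below. Cite text as [T#], "
                 "tables as [TBL#], and images by their image ID. Describe only "
                 "what is visible in the images. If the evidence does not "
                 "contain the answer, say so."),
    }]
    for i, chunk in enumerate(text_chunks, start=1):
        parts.append({"type": "text", "text": f"[T{i}] {chunk}"})
    for i, md in enumerate(tables_md, start=1):
        parts.append({"type": "text", "text": f"[TBL{i}]\n{md}"})
    for img in images:
        # Label each image so the model can cite it by ID.
        parts.append({"type": "text", "text": f"Image {img['image_id']}:"})
        parts.append({
            "type": "image",
            "media_type": "image/png",
            "data_base64": base64.b64encode(img["png_bytes"]).decode("ascii"),
        })
    parts.append({"type": "text", "text": f"Question: {question}"})
    return parts


parts = build_multimodal_prompt(
    "What does the valve assembly look like?",
    ["The valve assembly is described in section 4."],
    [],
    [{"image_id": "img-3", "png_bytes": b"placeholder-bytes"}],
)
print([p["type"] for p in parts])
```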

5. Architecture Diagram

flowchart TD
    subgraph Ingestion["Multimodal Ingestion"]
        A[Source Documents]
        B[Document Parser]
        C[Text Vector Index]
        D[Image + Table Indexes]
    end
    subgraph Query["Cross-Modal Retrieval"]
        E[User Query]
        F[Multimodal Embedder]
        G[Cross-Modal Merger]
    end
    subgraph Generation["VLM Generation"]
        H[Multimodal Context]
        I[Vision-Language Model]
    end
    A --> B
    B -->|text chunks| C
    B -->|images + tables| D
    E --> F
    F --> C
    F --> D
    C --> G
    D --> G
    G --> H --> I --> E
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#dbeafe,stroke:#3b82f6
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Document Intelligence / Parser	Data Processing	Extract text, figures, and tables from documents	AWS Textract, Azure Document Intelligence, Google Document AI, Unstructured.io	Critical
Image/Figure Extractor	Data Processing	Isolate figure regions from PDFs; extract captions and context	PyMuPDF, pdfplumber, Apache Tika, Document AI	High
Object Storage	Storage	Store extracted image assets referenced by vector index	Amazon S3, Azure Blob Storage, Google Cloud Storage	Critical
Multimodal Embedding Model	ML Inference	Embed images and text in shared vector space	OpenAI CLIP, ColPali, Nomic Embed Vision, Google multimodal embedding	Critical
Table Extractor	Data Processing	Extract table data as structured representation	AWS Textract (table mode), Azure DI table extraction, camelot-py	High
Image Vector Index	Storage	ANN index over image embeddings with metadata	Weaviate (multi-vector), Qdrant, Pinecone	Critical
Table Vector Index	Storage	Index of table embeddings (as text) with structured metadata	Same vector DB; separate namespace/collection	High
Cross-Modal Retrieval Orchestrator	Retrieval	Query all modality indexes; merge results	Custom Python; LangChain multi-retriever	High
Multimodal Context Assembler	Orchestration	Construct VLM prompt with text + images + tables	Custom; LangChain; LlamaIndex multimodal retriever	High
Vision-Language Model	ML Inference	Generate grounded answer from multimodal context	GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro	Critical

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Document Intelligence	Parse PDF; extract text blocks, figure regions, table regions	Separated text, image bytes, table data per document
2	Image Extractor	Crop figure regions; extract caption and surrounding text context	`{image_id, image_bytes, caption, context_text, page, doc_id}`
3	Object Storage	Store image_bytes at `s3://bucket/{doc_id}/{image_id}.png`	Image URL
4	Multimodal Embedding	Embed image using CLIP/ColPali	Image embedding vector
5	Image Vector Index	Upsert `{image_id, embedding, image_url, caption, context_text, doc_id}`	Indexed image entry
6	Table Extractor	Extract table as Markdown/JSON	`{table_id, markdown, page, doc_id}`
7	Table Vector Index	Embed Markdown representation; upsert	Indexed table entry
8	User	Submit natural language query	Query string
9	Multimodal Query Embedder	Embed query using the same multimodal model	Query vector (comparable to image and text vectors)
10	Cross-Modal Retrieval	ANN search across text, image, and table indexes	Candidates from each modality
11	Modal Result Merger	Apply RRF across modalities	Unified ranked candidate list
12	Context Assembler	Fetch text chunks; fetch image bytes (base64) or URLs; fetch table Markdown	Multimodal prompt: text + images + tables
13	VLM	Generate answer grounded in visual + textual evidence	Response with image and text citations

Error Flow

Error Condition	Detection	Recovery
Figure extraction fails (complex PDF layout)	Extraction error log; empty image count	Ingest text-only for failed pages; flag document for manual review
Multimodal embedding model unavailable	API health check	Fall back to caption-text embedding (text-only retrieval for images); surface quality degradation
VLM image token limit exceeded	Token count validation before VLM call	Reduce number of images in context; summarise image captions instead
Image URL expired (object storage pre-signed URL)	HTTP 403 on VLM image fetch	Use long-lived URLs or regenerate pre-signed URL at query time
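
The recovery for an exceeded image token limit can be sketched as a pre-call budget check: keep the highest-ranked images that fit and downgrade the rest to caption text. How image tokens are estimated is provider-specific, so the `est_tokens` values and the function name below are assumptions.

```python
from typing import List, Sequence, Tuple


def fit_images_to_budget(candidates: Sequence[Tuple[str, str, int]],
                         token_budget: int) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split ranked candidates into images to send and images reduced to captions.

    `candidates` holds (image_id, caption, est_tokens), ordered best-first.
    """
    kept: List[str] = []
    captions_only: List[Tuple[str, str]] = []
    used = 0
    for image_id, caption, est_tokens in candidates:
        if used + est_tokens <= token_budget:
            kept.append(image_id)
            used += est_tokens
        else:
            # Over budget: fall back to the caption instead of the image.
            captions_only.append((image_id, caption))
    return kept, captions_only


kept, captions_only = fit_images_to_budget(
    [("img-1", "Valve assembly", 800),
     ("img-2", "Exploded view", 800),
     ("img-3", "Torque sequence", 800)],
    token_budget=2000,
)
print(kept, captions_only)
```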

8. Security Considerations

Image Content Classification

Images in enterprise documents may contain sensitive content (facial photographs, handwritten signatures, confidential diagrams). Image classification must be applied at ingestion to flag sensitive images and enforce the same ACL-based access controls as text documents (EAAPL-RAG003). A document with a PROTECTED classification propagates that classification to all extracted images.
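
Classification propagation and ACL filtering might be sketched as follows. Only the `PROTECTED` label comes from the text above; the other levels, the group-based ACL model, and all names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional

# Hypothetical ordering of classification levels, least to most restrictive.
LEVELS = {"PUBLIC": 0, "INTERNAL": 1, "PROTECTED": 2}


@dataclass(frozen=True)
class ImageRecord:
    image_id: str
    doc_id: str
    classification: str
    allowed_groups: FrozenSet[str]


def inherit_classification(doc_classification: str, doc_groups: Iterable[str],
                           image_id: str, doc_id: str,
                           image_flag: Optional[str] = None) -> ImageRecord:
    """An image takes the stricter of its parent's level and its own flagged level."""
    level = max(doc_classification, image_flag or doc_classification,
                key=LEVELS.__getitem__)
    return ImageRecord(image_id, doc_id, level, frozenset(doc_groups))


def filter_by_acl(records: Iterable[ImageRecord],
                  user_groups: Iterable[str]) -> List[ImageRecord]:
    """Keep only images whose allowed groups overlap the user's groups."""
    groups = set(user_groups)
    return [r for r in records if r.allowed_groups & groups]


rec = inherit_classification("PROTECTED", {"engineering"}, "img-1", "doc-1")
flagged = inherit_classification("INTERNAL", {"engineering"}, "img-2", "doc-1",
                                 image_flag="PROTECTED")
print(rec.classification, flagged.classification)
print(filter_by_acl([rec, flagged], {"hr"}))
```

Applying the filter before results reach the context assembler keeps unauthorised images out of the VLM prompt entirely.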

OWASP LLM Top 10 Mitigations

OWASP LLM Risk (v1.1 numbering)	Multimodal-Specific Concern	Mitigation
LLM01: Prompt Injection	Visual prompt injection: adversarial content embedded in images (invisible text in images)	Image content safety scanner before indexing and before VLM context assembly
LLM06: Sensitive Information Disclosure	Image contains PII (photograph, signature, handwritten notes) visible to unauthorised users	ACL enforcement on image retrieval; image-level classification tagging
LLM02: Insecure Output Handling	VLM describes confidential diagram content verbatim	Output classification labelling; no verbatim reproduction of classified visual content

9. Governance Considerations

Visual Content Governance

All extracted images must be inventoried as part of the corpus inventory (EAAPL-KNW003). Images containing personal information (photographs, handwritten documents) require explicit privacy assessment. The image extraction pipeline must be reviewed by the Privacy Officer before processing HR or healthcare documents.

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Multimodal Corpus Inventory	Knowledge Manager	Continuous	Track all images, tables, and their source documents
Image Classification Audit	Privacy Officer	Quarterly	Review images containing personal information
Visual Retrieval Quality Report	AI Operations	Monthly	Benchmark cross-modal retrieval recall on a test set

10. Operational Considerations

Monitoring

Metric	Alert Threshold	Notes
Image extraction success rate	< 90%	Document Intelligence API quality issue
Multimodal embedding API latency	> 500ms	Affects ingestion throughput
VLM image token cost per session	> $0.50	High image count in context; optimise retrieval K for images
Cross-modal retrieval recall (benchmark)	< 0.70	Multimodal embedding quality degradation

Service Level Objectives

SLO	Target	Notes
Multimodal query P95 latency	≤ 5 seconds	Longer than text-only due to multi-index retrieval, image fetching, and VLM image processing
Image extraction coverage	≥ 90% of documents with figures	Measured monthly
Visual retrieval recall@5	≥ 0.70 on benchmark	Measured monthly
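
The recall@5 objective can be computed on a labelled benchmark roughly as follows. The benchmark format (a list of retrieved-ID lists paired with sets of relevant IDs) and the sample data are assumptions.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved items."""
    if not relevant:
        raise ValueError("each benchmark query needs at least one relevant item")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(benchmark: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Average recall@k over all benchmark queries."""
    return sum(recall_at_k(r, rel, k) for r, rel in benchmark) / len(benchmark)


# Invented benchmark: (retrieved IDs best-first, set of relevant IDs).
benchmark = [
    (["img-1", "img-4", "img-9"], {"img-1"}),       # recall 1.0
    (["img-2", "img-3"], {"img-3", "img-7"}),       # recall 0.5
]
score = mean_recall_at_k(benchmark, k=5)
print(score, "meets SLO" if score >= 0.70 else "below SLO")
```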

11. Cost Considerations

Cost Drivers

Cost Driver	Notes	Optimisation
Document Intelligence (image extraction)	$10–$30 per 1,000 pages	Batch processing; cache extraction results
Multimodal embedding	Higher cost than text embedding; CLIP APIs ~$0.05–0.15/1K images	Self-host CLIP for large corpora
VLM with image input	Each image expands to many input tokens, so image-heavy contexts cost significantly more than text-only ones (GPT-4o: $2.50/1M input tokens, image tokens included, at time of writing)	Limit images per context window; use lower-resolution images when detail is not required
Object storage (images)	$0.02–$0.025/GB/month	Lifecycle policies to move old images to cheaper storage tiers
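
A back-of-envelope estimate of the VLM image cost driver, using the $2.50 per 1M tokens figure from the table. The query volume, images per query, and tokens per image are placeholder assumptions, since tokens per image depend on the provider and on image resolution.

```python
def monthly_vlm_image_cost(queries_per_month: int,
                           images_per_query: int,
                           tokens_per_image: int,
                           usd_per_million_tokens: float) -> float:
    """Image input cost only; excludes text tokens and output tokens."""
    total_tokens = queries_per_month * images_per_query * tokens_per_image
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumed workload: 100,000 queries/month, 3 images each, ~800 tokens per image.
cost = monthly_vlm_image_cost(100_000, 3, 800, 2.50)
print(f"${cost:,.2f} per month")  # 240M image tokens at $2.50 per 1M
```

Halving the images per query or the tokens per image halves this figure, which is why the optimisation column focuses on image count and resolution.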

Indicative Cost Range

Deployment Scale	Monthly Cost (Multimodal)	Notes
Small pilot (< 100K images)	$500 – $2,000	Primarily extraction and embedding setup cost
Medium (100K–1M images)	$2,000 – $10,000	VLM query cost becomes dominant
Large (> 1M images)	$10,000 – $50,000	Self-hosted CLIP; VLM batching

12. Trade-Off Analysis

Multimodal Embedding Approach

Approach	Cross-Modal Quality	Cost	Complexity	Recommendation
CLIP (ViT-B/32 or ViT-L)	Good	Low (self-hostable)	Low	Default for most deployments
ColPali (multi-vector patch)	Higher for document images	Higher compute	Medium	For document-heavy corpora (PDFs with diagrams)
Caption-only embedding	Low cross-modal quality	Very Low	Very Low	Fallback only; not recommended for visual retrieval

Table Handling Strategy

Strategy	Retrieval Quality	Structured Query Support	Complexity
Markdown text embedding	Good	None	Low
JSON structured representation	Good	SQL-like queries possible	Medium
Table as image (render table as PNG)	Moderate	None	Low

Architectural Tensions

Tension	Trade-off	Recommendation
Context window image count vs. VLM cost	More images: better visual grounding; higher token cost	Cap images at 3 per query; prioritise highest-scored image retrieval
Image resolution vs. processing speed	High resolution: better VLM understanding; higher token cost and latency	Use 512px thumbnails for retrieval context; offer "view full resolution" link
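
Both recommendations reduce to small calculations: selecting the top three images by score, and computing thumbnail dimensions whose longest side is 512px. The sketch below only computes sizes; the actual resizing would be done by an imaging library. Function names are assumptions.

```python
from typing import List, Sequence, Tuple


def top_images(scored: Sequence[Tuple[str, float]], cap: int = 3) -> List[Tuple[str, float]]:
    """Keep the `cap` highest-scored (image_id, score) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:cap]


def thumbnail_size(width: int, height: int, max_side: int = 512) -> Tuple[int, int]:
    """Scale down so the longest side is at most max_side; never upscale."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


selected = top_images([("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)])
print(selected)
print(thumbnail_size(2048, 1024))
print(thumbnail_size(300, 200))
```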

13. Failure Modes

Failure Mode	Likelihood	Impact	Detection	Recovery
VLM hallucinates about image content not present in context	Medium	High	Citation validation; visual grounding check	Explicit prompt instruction to describe only visible content; confidence scoring
Figure extraction misses complex multi-column layouts	High	Medium	Extraction coverage monitoring	Manual review queue for documents with < 80% figure coverage
Cross-modal embedding model version drift	Low	High	Retrieval quality benchmark	Atomic re-embedding on model upgrade (same process as text-only RAG)
Object storage image URL expiry causing VLM 403	Medium	High	VLM error log	Use long-lived signed URLs; regenerate at query time
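
Citation validation for the hallucination failure mode could be sketched as a check that every image the answer cites was actually in the context. The `[img:<id>]` citation syntax is a hypothetical convention that the prompt would have to request, and the function name is an assumption.

```python
import re
from typing import Iterable, List

# Hypothetical citation convention: [img:<image_id>]
CITATION = re.compile(r"\[img:([A-Za-z0-9_-]+)\]")


def ungrounded_image_citations(answer: str, context_image_ids: Iterable[str]) -> List[str]:
    """Return cited image IDs that were not supplied to the VLM, in first-seen order."""
    allowed = set(context_image_ids)
    cited = dict.fromkeys(CITATION.findall(answer))  # de-duplicate, keep order
    return [image_id for image_id in cited if image_id not in allowed]


answer = "The valve sits above the flange [img:img-3]; see also [img:img-8]."
problems = ungrounded_image_citations(answer, {"img-3", "img-5"})
print(problems)
```

A non-empty result indicates the answer refers to an image the model never saw, which is a signal to reject or regenerate the response. This check does not verify that a correctly cited image actually supports the claim.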

14. Regulatory Considerations

Regulation	Requirement	Multimodal RAG Response
Privacy Act 1988 (Australia), APP 11	Personal information, including photographs of individuals, must be protected from misuse and from unauthorised access or disclosure	Facial photograph detection at ingestion; restricted access for documents containing photographs
EU AI Act Article 10	Training and operational data quality across all modalities	Image extraction quality metrics; multimodal benchmark on representative corpus
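GDPR Article 9	Processing special categories of data (health data, biometric data used for identification) is prohibited unless an Article 9(2) condition, such as explicit consent, applies	Healthcare and biometric images require a documented legal basis (for example, explicit consent) and a separate access control tier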

15. Reference Implementations

AWS

Document Intelligence: Amazon Textract (figure + table extraction)
Image storage: Amazon S3
Multimodal embedding: Amazon Titan Multimodal Embeddings G1 or self-hosted CLIP on SageMaker
Image vector index: Amazon OpenSearch Service with k-NN
VLM: Amazon Bedrock (Claude 3.5 Sonnet or Nova)

Azure

Document Intelligence: Azure AI Document Intelligence (figure + table extraction)
Image storage: Azure Blob Storage
Multimodal embedding: Azure AI Vision multimodal embeddings, or self-hosted CLIP on Azure Machine Learning
Image vector index: Azure AI Search (vector mode)
VLM: Azure OpenAI GPT-4o (native image input)

GCP

Document Intelligence: Google Document AI
Image storage: Google Cloud Storage
Multimodal embedding: Vertex AI Multimodal Embeddings
Image vector index: Vertex AI Vector Search
VLM: Vertex AI Gemini 1.5 Pro (native multimodal)

16. Related Patterns

Pattern ID	Pattern Name	Relationship
EAAPL-RAG001	Enterprise RAG	Foundation; RAG008 extends text retrieval with cross-modal capability
EAAPL-RAG005	Hybrid RAG	Hybrid retrieval applied to text path; image path uses cross-modal embedding only
EAAPL-RAG009	Graph RAG	Diagram elements can be modelled as knowledge graph entities; complementary
EAAPL-KNW003	AI Knowledge Corpus Management	Corpus management must include visual asset lifecycle

17. Maturity Assessment

Overall Maturity: Emerging — Multimodal embedding models and VLMs with image input are production-grade (GPT-4o, Gemini 1.5 Pro); document intelligence for figure extraction is mature; end-to-end multimodal RAG pipelines are in early production at leading enterprises but tooling is less standardised than text-only RAG.

Dimension	Score (1–5)	Rationale
Technology Readiness	3	VLMs are GA; multimodal embedding models are evolving rapidly; figure extraction quality varies
Tooling Ecosystem	2	No turnkey multimodal RAG framework; significant custom development required
Operational Guidance	2	Limited production guidance; benchmark and evaluation standards for visual retrieval are nascent
Security & Compliance	2	Image ACL enforcement and visual PII detection are less mature than text equivalents
Scalability Evidence	2	Limited large-scale production evidence; cost at scale not fully characterised
Cost Predictability	2	VLM image token costs are high and variable; optimisation strategies are still evolving

18. Revision History

Version	Date	Author	Changes
1.0	2025-01-10	EAAPL Working Group	Initial publication; ColPali and GPT-4o multimodal integrated
