EAAPL: Enterprise AI Architecture Pattern Library

[EAAPL-RAG008] Multimodal Retrieval-Augmented Generation

Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Multimodal and Cross-Modal Retrieval
Version: 1.0
Maturity: Emerging
Tags: rag, multimodal, vision, image-retrieval, table-understanding, cross-modal, clip, colpali, document-intelligence
Regulatory Relevance: EU AI Act Article 10 (Data quality across modalities), ISO/IEC 42001 Section 6.1, NIST AI RMF (Map 1.5)


1. Executive Summary

Multimodal RAG extends the retrieval-augmented generation paradigm to knowledge corpora that include images, charts, diagrams, tables, and other non-text modalities alongside prose documents. Enterprise knowledge frequently exists in forms that standard text-based RAG cannot access: engineering diagrams, medical imaging reports, financial charts, product photographs, scanned contracts, and slide decks with data visualisations. Multimodal RAG enables users to query this knowledge in natural language and receive answers grounded in visual as well as textual evidence.

For enterprise leaders in engineering, healthcare, manufacturing, finance, and legal domains, Multimodal RAG unlocks a significant portion of the knowledge corpus that text-only RAG leaves inaccessible. A maintenance engineer asking "What does the valve assembly look like on model X?" needs a diagram, not a text description. A financial analyst asking "Show me the revenue trend chart from the Q3 investor presentation" needs the chart itself, retrieved and analysed, not a text summary of revenue figures. The pattern is emerging rather than mature — the enabling technologies (multimodal embedding models, vision-language models capable of grounded QA over retrieved images) are advancing rapidly but have not yet reached the operational reliability of text-only RAG. Early adopters with visual-heavy corpora should pilot carefully and plan for ongoing model upgrades.


2. Problem Statement

Business Problem

Enterprise document corpora are not exclusively textual. Technical manuals contain engineering diagrams. Financial reports contain charts. Contracts contain tables of fees and conditions. Training materials contain screenshots. Product documentation contains photographs. Text-only RAG silently ignores all of this visual content, leaving entire knowledge domains inaccessible to AI-assisted search and Q&A.

Technical Problem

Text embedding models cannot embed images. Vector similarity search over text embeddings cannot retrieve images by semantic query. Vision-language models capable of answering questions from image content exist but require grounded evidence retrieval — they cannot answer questions about a diagram without the diagram being present in context. The architecture must therefore solve two distinct problems: (1) how to retrieve the most relevant image/diagram/table for a given query, and (2) how to present it to the generation model in a form that enables grounded visual question answering.

Symptoms

  • AI assistant cannot answer questions about diagrams, charts, or photographs even though they exist in the knowledge corpus
  • Users receive text-only answers to questions that require visual evidence, and must manually locate the relevant diagram
  • RAG system quality evaluations show low recall on questions derived from figure captions, table contents, or diagram annotations
  • Users explicitly request "show me the diagram" or "retrieve the chart" and the system cannot comply

Cost of Inaction

  • Significant portions of the knowledge corpus remain unsearchable via AI, limiting the ROI of the RAG investment
  • Engineers, clinicians, and financial analysts must perform manual visual search in parallel with AI text search, duplicating effort
  • Competitive disadvantage as multimodal AI capabilities become standard in enterprise knowledge platforms

3. Context

When to Apply

  • Knowledge corpora where more than 10% of content value is in non-text form (images, diagrams, charts, tables)
  • Technical documentation with engineering diagrams, schematics, or photographs
  • Financial document Q&A where charts and tables are primary information carriers
  • Healthcare document Q&A over radiology reports, clinical diagrams, or pharmaceutical product photographs
  • Legal contracts with tabular terms, fee schedules, and signature pages

When NOT to Apply

  • Text-only corpora where all visual content is purely decorative (logos, page backgrounds)
  • Deployments with strict latency requirements where multimodal embedding and vision-language model inference adds unacceptable overhead
  • Organisations without a mature multimodal data ingestion and storage capability — text-only RAG should be deployed first

Prerequisites

  • A multimodal embedding model capable of embedding both text queries and image content in the same vector space (CLIP, ColPali, Nomic Embed Vision)
  • A vision-language model capable of grounded visual question answering (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet — all support image input)
  • An image/figure extraction pipeline (extracts figures and tables from PDFs and documents)
  • Object storage for image assets (referenced from the vector index by URL or object key)
  • A table understanding component (converts tables to structured JSON or markdown for LLM consumption)

Industry Applicability

| Industry | Modality | Use Case |
| --- | --- | --- |
| Engineering / Manufacturing | CAD diagrams, P&ID schematics, assembly photographs | "Show the assembly procedure for component X" |
| Financial Services | Revenue charts, balance sheet tables, trend graphs | "What does the FCF trend chart in the Q3 report show?" |
| Healthcare | Anatomical diagrams, procedure illustrations, drug formulary tables | "What does the surgical approach diagram for procedure Y look like?" |
| Legal | Contract tables (fee schedules, milestone lists), signature pages | "What are the termination fees in Table 3 of the MSA?" |
| Retail / E-commerce | Product photographs, size charts, packaging diagrams | "Show me the product dimensions chart for SKU X" |
| Architecture / Construction | Floor plans, elevation drawings, material schedules | "Show the floor plan for Level 3 of Building B" |

4. Architecture Overview

Multimodal RAG introduces two new ingestion paths and one new retrieval path alongside the standard text pipeline. Understanding the distinct characteristics of each modality is essential to designing an effective architecture.

Multimodal Document Parsing

The ingestion pipeline must first extract non-text elements from documents. For PDFs, this requires a document parsing step that identifies figures, tables, and images and extracts them as separate assets. Tools such as Apache Tika, AWS Textract, Azure Document Intelligence, or Google Document AI can extract figures and tables from PDFs with varying quality. Each extracted asset is assigned a unique asset ID, a reference to its parent document, a page number, a caption (if present), and surrounding context text (the paragraphs immediately before and after the figure in the source document).
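
The asset record described above can be sketched as a small data class. This is a minimal illustration, not a prescribed schema: the class name `ExtractedAsset`, the `kind` field, and the helper `make_asset` are hypothetical, and the parser that supplies the caption and surrounding paragraphs is assumed to exist upstream.

```python
from dataclasses import dataclass
from typing import Optional
import uuid


@dataclass(frozen=True)
class ExtractedAsset:
    """One figure or table pulled out of a source document (hypothetical schema)."""
    asset_id: str            # unique ID assigned at extraction time
    doc_id: str              # reference to the parent document
    page: int                # page number in the source document
    kind: str                # "figure" or "table" (assumed field)
    caption: Optional[str]   # None when the source has no caption
    context_text: str        # paragraphs immediately before and after the asset


def make_asset(doc_id, page, kind, caption, before, after):
    """Build an asset record, joining the surrounding paragraphs as context."""
    context = "\n\n".join(p.strip() for p in (before, after) if p and p.strip())
    return ExtractedAsset(
        asset_id=uuid.uuid4().hex,
        doc_id=doc_id,
        page=page,
        kind=kind,
        caption=caption.strip() if caption else None,
        context_text=context,
    )


asset = make_asset("manual-001", 12, "figure", " Figure 3: Valve assembly ",
                   "The valve is mounted on the inlet.", "Torque the bolts evenly.")
print(asset.caption)
```

Keeping the caption and context text on the record means a caption-based text embedding remains available if the image embedding step fails.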

Image Embedding Path

Each extracted image is embedded using a multimodal embedding model that produces vectors in the same semantic space as text embeddings. CLIP (Contrastive Language-Image Pre-training) is the canonical architecture: trained on image-text pairs, it produces comparable embeddings for text queries and images, enabling cross-modal retrieval ("text query → retrieve relevant images"). ColPali is an emerging alternative that produces multi-vector patch-level embeddings for higher-resolution document understanding.

Images are stored in object storage (S3, Azure Blob, GCS) and referenced by URL in the vector index. The vector index entry contains the image URL, the image's embedding, the parent document ID, the caption, and surrounding context text. Store only references in the vector database, not the full image bytes.
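
As a rough illustration of the index entry and the cross-modal lookup, the sketch below ranks image entries by cosine similarity to a query vector. The three-dimensional vectors are toy stand-ins for real embedding model output, the entry field names are assumptions, and the brute-force scan stands in for the ANN search a vector database would perform.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Index entries hold a reference to the image, never the image bytes.
image_index = [
    {"image_id": "img-1", "image_url": "s3://bucket/doc-1/img-1.png",
     "caption": "Valve assembly", "doc_id": "doc-1", "embedding": [0.9, 0.1, 0.0]},
    {"image_id": "img-2", "image_url": "s3://bucket/doc-2/img-2.png",
     "caption": "Revenue trend", "doc_id": "doc-2", "embedding": [0.0, 0.2, 0.9]},
]


def search_images(query_vec, index, k=5):
    """Brute-force nearest neighbours; a vector database would use ANN instead."""
    scored = [(cosine(query_vec, entry["embedding"]), entry) for entry in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]


# Toy query vector standing in for the embedded text query.
top = search_images([1.0, 0.0, 0.1], image_index, k=1)
print(top[0]["image_id"], top[0]["image_url"])
```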

Table Understanding Path

Tables require a distinct handling path because they are structured data, not natural language. Table extraction (via document intelligence services) produces structured table representations. These are then converted to either Markdown table format (for LLM consumption) or JSON (for structured query interfaces). For retrieval, each table is embedded as text, using its Markdown representation, rather than as an image.
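
The Markdown conversion step might look like the following sketch. It assumes the table extractor has already produced a header row and a list of data rows; the function name `table_to_markdown` is illustrative, and real extractor output (merged cells, multi-row headers) would need more handling than shown here.

```python
def table_to_markdown(header, rows):
    """Render a header and rows as a Markdown pipe table, escaping pipes in cells."""
    def fmt(cells):
        cleaned = [str(c).replace("|", "\\|").replace("\n", " ").strip() for c in cells]
        return "| " + " | ".join(cleaned) + " |"

    width = len(header)
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    for row in rows:
        padded = list(row) + [""] * (width - len(row))  # pad short rows
        lines.append(fmt(padded[:width]))               # truncate long rows
    return "\n".join(lines)


md = table_to_markdown(["Milestone", "Fee"], [["Termination", "$5,000"], ["Renewal"]])
print(md)
```

The resulting string can be passed to a text embedding model for indexing and placed directly in the VLM prompt at query time.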

Cross-Modal Retrieval

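At query time, the user's text query is embedded using the multimodal embedding model and used to search the text chunk, image, and table indexes in parallel. This assumes each index holds vectors comparable to the query vector; where the text or table index was built with a separate text embedding model, the query must also be embedded with that model for those searches. The ranked results from each index are then merged, using Reciprocal Rank Fusion (RRF) or a weighted score combination, and the top-K results across all modalities are selected. The context assembler then constructs a multimodal prompt that includes both text chunks and images (base64-encoded or URL-referenced, depending on the VLM API).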
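
A minimal sketch of the RRF merge follows. RRF combines rank positions rather than raw similarity scores, which makes it a reasonable fit when the per-modality scores are not on the same scale. The constant `k=60` is a commonly used default, and the ID prefixes (`text:`, `img:`, `tbl:`) are a hypothetical naming convention.

```python
from collections import defaultdict


def rrf_merge(ranked_lists, k=60, top_k=5):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item_id for item_id, _ in ordered[:top_k]]


text_hits = ["text:12", "text:7", "text:3"]   # best first, from the text index
image_hits = ["img:4", "img:9"]               # best first, from the image index
table_hits = ["tbl:2"]                        # best first, from the table index
print(rrf_merge([text_hits, image_hits, table_hits], top_k=4))
```

Because IDs from different modalities never overlap in this example, RRF effectively interleaves the lists by rank; a weighted variant could multiply each list's contribution by a per-modality weight.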

Vision-Language Model for Grounded Visual QA

The generation step uses a vision-language model (VLM) that accepts both text and images in its context window. The VLM is instructed to answer the user's question based on the provided text and visual evidence, citing both text sources and image sources. The prompt structure places the visual evidence (images, tables) alongside the text chunks and explicitly asks the model to ground its response in the visual content when relevant.
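
The prompt structure can be sketched as an ordered list of content parts. The part layout used here (`{"type": "text"}` and `{"type": "image"}` dictionaries) is a hypothetical, vendor-neutral shape; each VLM API defines its own message format, so this would need mapping to the chosen provider. The instruction wording and the bracketed ID convention are also assumptions.

```python
import base64


def build_prompt(question, text_chunks, images, tables):
    """Assemble content parts: instructions, evidence, then the question.

    The part layout is a hypothetical shape, not any specific vendor's API.
    """
    parts = [{"type": "text", "text": (
        "Answer using only the evidence below. Cite sources by their ID in "
        "square brackets. Describe only what is visible in the images.")}]
    for chunk in text_chunks:
        parts.append({"type": "text", "text": f"[{chunk['id']}] {chunk['text']}"})
    for table in tables:
        parts.append({"type": "text", "text": f"[{table['id']}]\n{table['markdown']}"})
    for image in images:
        # Label each image with its ID and caption so the model can cite it.
        parts.append({"type": "text", "text": f"[{image['id']}] {image['caption']}"})
        parts.append({
            "type": "image",
            "media_type": "image/png",
            "data": base64.b64encode(image["bytes"]).decode("ascii"),
        })
    parts.append({"type": "text", "text": f"Question: {question}"})
    return parts


prompt = build_prompt(
    "What does the valve assembly look like?",
    [{"id": "text:12", "text": "The valve is mounted on the inlet."}],
    [{"id": "img:4", "caption": "Figure 3: Valve assembly", "bytes": b"\x89PNG"}],
    [],
)
print(len(prompt))
```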


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD
    subgraph Ingestion["Multimodal Ingestion"]
        A[Source Documents]
        B[Document Parser]
        C[Text Vector Index]
        D[Image + Table Indexes]
    end
    subgraph Query["Cross-Modal Retrieval"]
        E[User Query]
        F[Multimodal Embedder]
        G[Cross-Modal Merger]
    end
    subgraph Generation["VLM Generation"]
        H[Multimodal Context]
        I[Vision-Language Model]
    end
    A --> B
    B -->|text chunks| C
    B -->|images + tables| D
    E --> F
    F --> C
    F --> D
    C --> G
    D --> G
    G --> H --> I --> E
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#dbeafe,stroke:#3b82f6
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Document Intelligence / Parser | Data Processing | Extract text, figures, and tables from documents | AWS Textract, Azure Document Intelligence, Google Document AI, Unstructured.io | Critical |
| Image/Figure Extractor | Data Processing | Isolate figure regions from PDFs; extract captions and context | PyMuPDF, pdfplumber, Apache Tika, Document AI | High |
| Object Storage | Storage | Store extracted image assets referenced by vector index | Amazon S3, Azure Blob Storage, Google Cloud Storage | Critical |
| Multimodal Embedding Model | ML Inference | Embed images and text in shared vector space | OpenAI CLIP, ColPali, Nomic Embed Vision, Google multimodal embedding | Critical |
| Table Extractor | Data Processing | Extract table data as structured representation | AWS Textract (table mode), Azure DI table extraction, camelot-py | High |
| Image Vector Index | Storage | ANN index over image embeddings with metadata | Weaviate (multi-vector), Qdrant, Pinecone | Critical |
| Table Vector Index | Storage | Index of table embeddings (as text) with structured metadata | Same vector DB; separate namespace/collection | High |
| Cross-Modal Retrieval Orchestrator | Retrieval | Query all modality indexes; merge results | Custom Python; LangChain multi-retriever | High |
| Multimodal Context Assembler | Orchestration | Construct VLM prompt with text + images + tables | Custom; LangChain; LlamaIndex multimodal retriever | High |
| Vision-Language Model | ML Inference | Generate grounded answer from multimodal context | GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro | Critical |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Document Intelligence | Parse PDF; extract text blocks, figure regions, table regions | Separated text, image bytes, table data per document |
| 2 | Image Extractor | Crop figure regions; extract caption and surrounding text context | {image_id, image_bytes, caption, context_text, page, doc_id} |
| 3 | Object Storage | Store image_bytes at s3://bucket/{doc_id}/{image_id}.png | Image URL |
| 4 | Multimodal Embedding | Embed image using CLIP/ColPali | Image embedding vector |
| 5 | Image Vector Index | Upsert {image_id, embedding, image_url, caption, context_text, doc_id} | Indexed image entry |
| 6 | Table Extractor | Extract table as Markdown/JSON | {table_id, markdown, page, doc_id} |
| 7 | Table Vector Index | Embed Markdown representation; upsert | Indexed table entry |
| 8 | User | Submit natural language query | Query string |
| 9 | Multimodal Query Embedder | Embed query using the same multimodal model | Query vector (comparable to image and text vectors) |
| 10 | Cross-Modal Retrieval | ANN search across text, image, and table indexes | Candidates from each modality |
| 11 | Modal Result Merger | Apply RRF across modalities | Unified ranked candidate list |
| 12 | Context Assembler | Fetch text chunks; fetch image bytes (base64) or URLs; fetch table Markdown | Multimodal prompt: text + images + tables |
| 13 | VLM | Generate answer grounded in visual + textual evidence | Response with image and text citations |

Error Flow

| Error Condition | Detection | Recovery |
| --- | --- | --- |
| Figure extraction fails (complex PDF layout) | Extraction error log; empty image count | Ingest text-only for failed pages; flag document for manual review |
| Multimodal embedding model unavailable | API health check | Fall back to caption-text embedding (text-only retrieval for images); surface quality degradation |
| VLM image token limit exceeded | Token count validation before VLM call | Reduce number of images in context; summarise image captions instead |
| Image URL expired (object storage pre-signed URL) | HTTP 403 on VLM image fetch | Use long-lived URLs or regenerate pre-signed URL at query time |
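
The recovery for an exceeded image token limit can be sketched as a pre-call budget check. This is an illustrative greedy approach, not a prescribed algorithm: it assumes each candidate image carries a precomputed `tokens` estimate (how image tokens are counted depends on the VLM provider) and falls back to the caption for any image that does not fit.

```python
def fit_images_to_budget(images, max_image_tokens):
    """Keep the highest-scored images that fit; return captions for the rest.

    Each image dict carries a precomputed 'tokens' estimate (an assumption:
    token accounting is provider-specific).
    """
    kept, caption_only, used = [], [], 0
    for image in sorted(images, key=lambda i: i["score"], reverse=True):
        if used + image["tokens"] <= max_image_tokens:
            kept.append(image)
            used += image["tokens"]
        else:
            caption_only.append(image["caption"])
    return kept, caption_only


candidates = [
    {"id": "img:1", "score": 0.91, "tokens": 800, "caption": "Valve assembly"},
    {"id": "img:2", "score": 0.85, "tokens": 800, "caption": "Exploded view"},
    {"id": "img:3", "score": 0.60, "tokens": 300, "caption": "Parts list"},
]
kept, fallback = fit_images_to_budget(candidates, max_image_tokens=1200)
print([i["id"] for i in kept], fallback)
```

In this example the second image is too large for the remaining budget, so its caption is passed as text while the smaller third image is still included.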

8. Security Considerations

Image Content Classification

Images in enterprise documents may contain sensitive content (facial photographs, handwritten signatures, confidential diagrams). Image classification must be applied at ingestion to flag sensitive images and enforce the same ACL-based access controls as text documents (EAAPL-RAG003). A document with a PROTECTED classification propagates that classification to all extracted images.
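
Classification propagation and ACL enforcement at retrieval time might be sketched as below. The field names (`classification`, `allowed_groups`) and the group-based access model are assumptions made for illustration; the actual ACL model is defined by EAAPL-RAG003.

```python
def propagate_classification(document, assets):
    """Copy the parent document's classification and ACL onto each extracted asset."""
    return [
        {**asset,
         "classification": document["classification"],
         "allowed_groups": set(document["allowed_groups"])}
        for asset in assets
    ]


def filter_by_acl(results, user_groups):
    """Drop any retrieved item the user has no group-level access to."""
    groups = set(user_groups)
    return [r for r in results if r["allowed_groups"] & groups]


doc = {"doc_id": "hr-7", "classification": "PROTECTED", "allowed_groups": ["hr"]}
images = propagate_classification(doc, [{"image_id": "img-1"}, {"image_id": "img-2"}])
print(len(filter_by_acl(images, ["engineering"])), len(filter_by_acl(images, ["hr"])))
```

Applying the filter before context assembly keeps restricted images out of the VLM prompt entirely, rather than relying on the model to withhold them.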

OWASP LLM Top 10 Mitigations

| OWASP LLM Risk | Multimodal-Specific Concern | Mitigation |
| --- | --- | --- |
| LLM01: Prompt Injection | Visual prompt injection: adversarial content embedded in images (invisible text in images) | Image content safety scanner before indexing and before VLM context assembly |
| LLM06: Sensitive Information Disclosure | Image contains PII (photograph, signature, handwritten notes) visible to unauthorised users | ACL enforcement on image retrieval; image-level classification tagging |
| LLM02: Insecure Output Handling | VLM describes confidential diagram content verbatim | Output classification labelling; no verbatim reproduction of classified visual content |

9. Governance Considerations

Visual Content Governance

All extracted images must be inventoried as part of the corpus inventory (EAAPL-KNW003). Images containing personal information (photographs, handwritten documents) require explicit privacy assessment. The image extraction pipeline must be reviewed by the Privacy Officer before processing HR or healthcare documents.

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
| --- | --- | --- | --- |
| Multimodal Corpus Inventory | Knowledge Manager | Continuous | Track all images, tables, and their source documents |
| Image Classification Audit | Privacy Officer | Quarterly | Review images containing personal information |
| Visual Retrieval Quality Report | AI Operations | Monthly | Benchmark cross-modal retrieval recall on a test set |

10. Operational Considerations

Monitoring

| Metric | Alert Threshold | Notes |
| --- | --- | --- |
| Image extraction success rate | < 90% | Document Intelligence API quality issue |
| Multimodal embedding API latency | > 500ms | Affects ingestion throughput |
| VLM image token cost per session | > $0.50 | High image count in context; optimise retrieval K for images |
| Cross-modal retrieval recall (benchmark) | < 0.70 | Multimodal embedding quality degradation |

Service Level Objectives

| SLO | Target | Notes |
| --- | --- | --- |
| Multimodal query P95 latency | ≤ 5 seconds | Longer than text-only due to image embedding and VLM processing |
| Image extraction coverage | ≥ 90% of documents with figures | Measured monthly |
| Visual retrieval recall@5 | ≥ 0.70 on benchmark | Measured monthly |
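
The recall@5 target can be checked with a short benchmark routine like the sketch below. It assumes a labelled test set of (retrieved IDs, relevant IDs) pairs, which this pattern does not define; the example data is illustrative only.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)


def mean_recall_at_k(benchmark, k=5):
    """Average recall@k over (retrieved, relevant) pairs from a labelled test set."""
    scores = [recall_at_k(ret, rel, k) for ret, rel in benchmark]
    return sum(scores) / len(scores) if scores else 0.0


benchmark = [
    (["img:4", "img:9", "img:1"], ["img:4"]),            # 1 of 1 relevant found
    (["img:2", "img:3", "img:5"], ["img:5", "img:8"]),   # 1 of 2 relevant found
]
score = mean_recall_at_k(benchmark, k=5)
print(score, score >= 0.70)
```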

11. Cost Considerations

Cost Drivers

| Cost Driver | Notes | Optimisation |
| --- | --- | --- |
| Document Intelligence (image extraction) | $10–$30 per 1,000 pages | Batch processing; cache extraction results |
| Multimodal embedding | Higher cost than text embedding; CLIP APIs ~$0.05–0.15/1K images | Self-host CLIP for large corpora |
| VLM with image input | Vision tokens are significantly more expensive than text tokens (GPT-4o: $2.50/1M image tokens) | Limit images per context window; use lower-resolution images when detail is not required |
| Object storage (images) | $0.02–$0.025/GB/month | Lifecycle policies to move old images to cheaper storage tiers |

Indicative Cost Range

| Deployment Scale | Monthly Cost (Multimodal) | Notes |
| --- | --- | --- |
| Small pilot (< 100K images) | $500 – $2,000 | Primarily extraction and embedding setup cost |
| Medium (100K–1M images) | $2,000 – $10,000 | VLM query cost becomes dominant |
| Large (> 1M images) | $10,000 – $50,000 | Self-hosted CLIP; VLM batching |
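
A back-of-envelope estimate can help place a deployment within these ranges. The sketch below uses the unit prices quoted in the cost drivers table; the query volume, page count, and tokens-per-image figures are illustrative assumptions, not measurements, and actual pricing should be checked with the provider.

```python
def monthly_vlm_image_cost(queries_per_month, images_per_query,
                           tokens_per_image, usd_per_million_tokens):
    """Estimated monthly spend on image input tokens only."""
    tokens = queries_per_month * images_per_query * tokens_per_image
    return tokens * usd_per_million_tokens / 1_000_000


def extraction_cost(pages, usd_per_thousand_pages):
    """One-off document intelligence cost for a corpus of the given size."""
    return pages / 1000 * usd_per_thousand_pages


# Assumed workload: 100,000 queries/month, 3 images each, ~1,000 tokens per image.
vlm_cost = monthly_vlm_image_cost(100_000, 3, 1_000, 2.50)
# Assumed corpus: 200,000 pages at the low end of the quoted extraction price.
ingest_cost = extraction_cost(200_000, 10)
print(vlm_cost, ingest_cost)
```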

12. Trade-Off Analysis

Multimodal Embedding Approach

| Approach | Cross-Modal Quality | Cost | Complexity | Recommendation |
| --- | --- | --- | --- | --- |
| CLIP (ViT-B/32 or ViT-L) | Good | Low (self-hostable) | Low | Default for most deployments |
| ColPali (multi-vector patch) | Higher for document images | Higher compute | Medium | For document-heavy corpora (PDFs with diagrams) |
| Caption-only embedding | Low cross-modal quality | Very Low | Very Low | Fallback only; not recommended for visual retrieval |

Table Handling Strategy

| Strategy | Retrieval Quality | Structured Query Support | Complexity |
| --- | --- | --- | --- |
| Markdown text embedding | Good | None | Low |
| JSON structured representation | Good | SQL-like queries possible | Medium |
| Table as image (render table as PNG) | Moderate | None | Low |

Architectural Tensions

| Tension | Trade-off | Recommendation |
| --- | --- | --- |
| Context window image count vs. VLM cost | More images: better visual grounding; higher token cost | Cap images at 3 per query; prioritise highest-scored image retrieval |
| Image resolution vs. processing speed | High resolution: better VLM understanding; higher token cost and latency | Use 512px thumbnails for retrieval context; offer "view full resolution" link |
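
Both recommendations reduce to simple arithmetic, sketched below. The sketch interprets "512px" as the longer side of the thumbnail, which is an assumption; the function names are illustrative, and the actual resize would be done by an imaging library.

```python
def thumbnail_size(width, height, max_side=512):
    """Scale dimensions so the longer side is at most max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height          # never upscale small images
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def cap_images(images, max_images=3):
    """Keep only the highest-scored images for the VLM context."""
    return sorted(images, key=lambda i: i["score"], reverse=True)[:max_images]


hits = [{"id": "img:1", "score": 0.4}, {"id": "img:2", "score": 0.9},
        {"id": "img:3", "score": 0.7}, {"id": "img:4", "score": 0.8}]
print(thumbnail_size(2048, 1024), [i["id"] for i in cap_images(hits)])
```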

13. Failure Modes

| Failure Mode | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| VLM hallucinates about image content not present in context | Medium | High | Citation validation; visual grounding check | Explicit prompt instruction to describe only visible content; confidence scoring |
| Figure extraction misses complex multi-column layouts | High | Medium | Extraction coverage monitoring | Manual review queue for documents with < 80% figure coverage |
| Cross-modal embedding model version drift | Low | High | Retrieval quality benchmark | Atomic re-embedding on model upgrade (same process as text-only RAG) |
| Object storage image URL expiry causing VLM 403 | Medium | High | VLM error log | Use long-lived signed URLs; regenerate at query time |
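
Citation validation, listed as a detection method for hallucination, could be approximated with a check like the one below. It assumes the VLM was instructed to cite evidence with bracketed IDs such as `[img:4]`; that ID format is a hypothetical convention, and the check only confirms that cited sources were in context, not that the description of them is accurate.

```python
import re

# Matches bracketed citations such as [img:4], [tbl:2], [text:12] (assumed format).
CITATION = re.compile(r"\[((?:img|tbl|text):[\w-]+)\]")


def validate_citations(answer, context_ids):
    """Return (cited IDs found in context, cited IDs that were never provided)."""
    cited = CITATION.findall(answer)
    known = set(context_ids)
    valid = [c for c in cited if c in known]
    unknown = [c for c in cited if c not in known]
    return valid, unknown


answer = "The valve sits on the inlet [img:4], torqued per the table [tbl:9]."
valid, unknown = validate_citations(answer, ["img:4", "text:12"])
print(valid, unknown)
```

A non-empty `unknown` list indicates the answer cites evidence that was not in the prompt and is a candidate for rejection or regeneration.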

14. Regulatory Considerations

| Regulation | Requirement | Multimodal RAG Response |
| --- | --- | --- |
| Privacy Act 1988 APP 11 | Sensitive personal information (photographs) must be protected | Facial photograph detection at ingestion; restricted access for documents containing photographs |
| EU AI Act Article 10 | Training and operational data quality across all modalities | Image extraction quality metrics; multimodal benchmark on representative corpus |
| GDPR Article 9 | Special categories of data (medical images, biometrics) require explicit consent | Healthcare and biometric images require separate consent and access control tier |

15. Reference Implementations

AWS

  • Document Intelligence: Amazon Textract (figure + table extraction)
  • Image storage: Amazon S3
  • Multimodal embedding: Amazon Titan Multimodal Embeddings G1 or self-hosted CLIP on SageMaker
  • Image vector index: Amazon OpenSearch Service with k-NN
  • VLM: Amazon Bedrock (Claude 3.5 Sonnet or Nova)

Azure

  • Document Intelligence: Azure AI Document Intelligence (figure + table extraction)
  • Image storage: Azure Blob Storage
  • Multimodal embedding: Azure OpenAI (CLIP via custom deployment) or Azure AI Vision
  • Image vector index: Azure AI Search (vector mode)
  • VLM: Azure OpenAI GPT-4o (native image input)

GCP

  • Document Intelligence: Google Document AI
  • Image storage: Google Cloud Storage
  • Multimodal embedding: Vertex AI Multimodal Embeddings
  • Image vector index: Vertex AI Vector Search
  • VLM: Vertex AI Gemini 1.5 Pro (native multimodal)

16. Related Patterns

| Pattern ID | Pattern Name | Relationship |
| --- | --- | --- |
| EAAPL-RAG001 | Enterprise RAG | Foundation; RAG008 extends text retrieval with cross-modal capability |
| EAAPL-RAG005 | Hybrid RAG | Hybrid retrieval applied to text path; image path uses cross-modal embedding only |
| EAAPL-RAG009 | Graph RAG | Diagram elements can be modelled as knowledge graph entities; complementary |
| EAAPL-KNW003 | AI Knowledge Corpus Management | Corpus management must include visual asset lifecycle |

17. Maturity Assessment

Overall Maturity: Emerging — Multimodal embedding models and VLMs with image input are production-grade (GPT-4o, Gemini 1.5 Pro); document intelligence for figure extraction is mature; end-to-end multimodal RAG pipelines are in early production at leading enterprises but tooling is less standardised than text-only RAG.

| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Technology Readiness | 3 | VLMs are GA; multimodal embedding models are evolving rapidly; figure extraction quality varies |
| Tooling Ecosystem | 2 | No turnkey multimodal RAG framework; significant custom development required |
| Operational Guidance | 2 | Limited production guidance; benchmark and evaluation standards for visual retrieval are nascent |
| Security & Compliance | 2 | Image ACL enforcement and visual PII detection are less mature than text equivalents |
| Scalability Evidence | 2 | Limited large-scale production evidence; cost at scale not fully characterised |
| Cost Predictability | 2 | VLM image token costs are high and variable; optimisation strategies are still evolving |

18. Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2025-01-10 | EAAPL Working Group | Initial publication; ColPali and GPT-4o multimodal integrated |