MM RAG

Multimodal Retrieval-Augmented Generation

Table of Contents

  • Overview
  • Architecture
  • GB10 Optimizations
  • Features
  • Installation
  • Configuration
  • Usage
  • API Endpoints
  • Deployment
  • Technologies
  • GB10 Performance
  • Roadmap
  • Additional Technical Notes

Overview

MM RAG (Multimodal RAG) is an advanced document intelligence system that combines vector search, OCR, and language models to extract and synthesize information from complex PDF documents. The system is specifically optimized to run efficiently on NVIDIA's GB10 platform.

What is the GB10?

The GB10 (Grace-Blackwell GB10) is an NVIDIA processor based on the ARM Grace architecture (72 ARM Neoverse v2 cores) optimized for AI inference and server workloads. Unlike traditional GPUs, the GB10 is designed for:

  • Maximum energy efficiency for AI inference

  • Native ARM architecture with excellent CPU operation support

  • High memory density (up to 480 GB of LPDDR5X)

  • Optimal performance/watt for embedding and RAG tasks

Why is MM RAG optimized for GB10?

This project was designed from the ground up to leverage the unique characteristics of the GB10:

  1. CPU-First Embeddings: Uses CPU-efficient embedding models (sentence-transformers, CLIP) that excel on ARM Grace rather than requiring GPUs.

  2. Small VLM models: Uses Qwen2-VL-2B-Instruct (2 billion parameters) which easily fits in ARM memory and performs well in CPU inference mode.

  3. HTTP-offloaded LLM: Delegates text generation to a remote Ollama server, allowing the GB10 to focus on embedding and vector search.

  4. Optimized Chunking and OCR: Exploits the 72 ARM cores to parallelize OCR (Tesseract) and document processing.

  5. No CUDA dependency: Runs entirely in CPU mode, avoiding the complexity of CUDA installation on ARM.

  6. Abundant memory: Takes advantage of the GB10's large memory capacity to simultaneously load embedding models (text + vision) and the VLM without swapping.

Architecture

Main Components

┌─────────────────┐
│ Frontend UI     │  (HTML/JS/CSS)
│ Static Files    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ FastAPI         │  Port 8000
│ (main.py)       │
├─────────────────┤
│ - PDF Ingest    │
│ - Embeddings    │
│ - Retrieval     │
│ - LLM Calls     │
│ - VLM (lazy)    │
└────────┬────────┘
         │
    ┌────┴────┬───────────┬─────────────┐
    │         │           │             │
    ▼         ▼           ▼             ▼
┌──────┐  ┌──────┐  ┌──────────┐  ┌──────────┐
│Qdrant│  │Tess. │  │Ollama    │  │Qwen2-VL  │
│Vector│  │OCR   │  │(remote)  │  │(local)   │
│DB    │  │      │  │LLM HTTP  │  │Vision-LLM│
└──────┘  └──────┘  └──────────┘  └──────────┘

Qdrant Collections

The system uses two distinct vector collections:

  1. docs_text (dimension: 384)
     - Stores text chunks extracted by OCR
     - Embeddings: paraphrase-multilingual-MiniLM-L12-v2
     - Optimized for multilingual semantic search (FR/EN)
     - Metadata: doc_id, source, page, chunk_index, bbox

  2. docs_vision (dimension: 512)
     - Stores embeddings of complete page images
     - Embeddings: OpenCLIP ViT-B-32
     - Enables visual and multimodal search
     - Metadata: doc_id, source, page, image_path
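
How these two collections might be created at startup is sketched below with qdrant-client. Only the collection names and dimensions come from the list above; the cosine distance metric and the exact setup location are assumptions.

python
# Hedged sketch: create the two collections if they do not exist yet
# (the distance metric is an assumption, not confirmed by the project)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://qdrant:6333")

for name, dim in [("docs_text", 384), ("docs_vision", 512)]:
    if not client.collection_exists(name):
        client.create_collection(
            collection_name=name,
            vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
        )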

GB10 Optimizations

1. CPU-First Strategy

python
# embedder.py - Configuration for GB10
import os
import torch

DEVICE = os.getenv("EMBED_DEVICE", "cpu")  # Force CPU on GB10
torch.set_num_threads(4)                   # Use 4 of the 72 available cores

Embedding models are executed on CPU with a limited number of threads to avoid contention and enable application-level parallelization.

2. Intelligent Image Downscaling

python
# Reduce large images before embedding to save RAM
MAX_SIDE = 1024 # Limit max size to 1024px

This optimization reduces memory footprint without significant quality loss for vector search.
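
A minimal sketch of the downscale step, assuming Pillow (which the project already uses); the helper name downscale is hypothetical, only MAX_SIDE comes from the snippet above.

python
from PIL import Image

MAX_SIDE = 1024  # Limit max size to 1024px

def downscale(img: Image.Image, max_side: int = MAX_SIDE) -> Image.Image:
    # thumbnail() only shrinks and preserves the aspect ratio
    if max(img.size) <= max_side:
        return img
    img = img.copy()
    img.thumbnail((max_side, max_side), Image.Resampling.LANCZOS)
    return img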

3. Lazy VLM Loading

python
# vlm.py - Load model only when needed
_vlm_loaded = False

def load_vlm():
    global _vlm_loaded
    if _vlm_loaded:
        return
    # Load Qwen2-VL-2B only on first call
    ...
    _vlm_loaded = True

This saves up to 4 GB of RAM at startup: the VLM is loaded only when a multimodal question is asked.
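
What that first call might do in full is sketched below with Hugging Face transformers; the module-level cache variables and exact loading options are illustrative, and the project's vlm.py may structure this differently.

python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

_vlm_model = None
_vlm_processor = None

def load_vlm():
    """Load Qwen2-VL-2B on the first multimodal request only."""
    global _vlm_model, _vlm_processor
    if _vlm_model is not None:
        return
    model_id = "Qwen/Qwen2-VL-2B-Instruct"
    _vlm_processor = AutoProcessor.from_pretrained(model_id)
    # CPU inference on the GB10: no CUDA required
    _vlm_model = Qwen2VLForConditionalGeneration.from_pretrained(model_id).to("cpu")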

4. HTTP-Offloaded LLM

yaml
# docker-compose.yml
LLM_HTTP_URL: "http://192.168.1.89:11434"
LLM_OPENAI_MODEL: "ministral-3:14b"

The GB10 delegates text generation to a remote Ollama server (can be a GPU for large models), focusing on its core competency: embedding and search.
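
The call itself is a plain OpenAI-compatible chat completion over HTTP. A sketch with httpx, using only the variables from the configuration above (the helper name generate and the timeout value are illustrative):

python
import os
import httpx

LLM_HTTP_URL = os.getenv("LLM_HTTP_URL", "http://192.168.1.89:11434")
LLM_OPENAI_PATH = os.getenv("LLM_OPENAI_PATH", "/v1/chat/completions")
LLM_OPENAI_MODEL = os.getenv("LLM_OPENAI_MODEL", "ministral-3:14b")

def generate(prompt: str, max_tokens: int = 400) -> str:
    # Send the prompt to the remote Ollama server and return the generated text
    resp = httpx.post(
        LLM_HTTP_URL + LLM_OPENAI_PATH,
        json={
            "model": LLM_OPENAI_MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=120.0,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]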

5. Parallelized OCR

During ingestion, Tesseract OCR runs on multiple pages simultaneously, naturally exploiting the GB10's many ARM cores.
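
One way to express this page-level parallelism explicitly is a process pool over the rendered page images; the sketch below is illustrative and not necessarily the project's exact ingestion code.

python
from concurrent.futures import ProcessPoolExecutor

import pytesseract
from PIL import Image

def ocr_page(image_path: str) -> str:
    # OCR a single rendered page image (French + English)
    return pytesseract.image_to_string(Image.open(image_path), lang="fra+eng")

def ocr_pages(image_paths: list[str], workers: int = 8) -> list[str]:
    # Fan pages out across cores; the GB10's many ARM cores absorb the load
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, image_paths))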

6. Optimized Chunking

python
MAX_CHARS_PER_CHUNK = 1500 # Optimal chunk size
OVERLAP_CHARS = 200 # Overlap for contextual continuity

Chunk size calibrated to:

  • Maximize semantic density

  • Avoid excessive fragmentation

  • Optimize search performance on ARM
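
A minimal sketch of the overlap logic implied by these two settings (the helper name chunk_text is hypothetical):

python
MAX_CHARS_PER_CHUNK = 1500  # Optimal chunk size
OVERLAP_CHARS = 200         # Overlap for contextual continuity

def chunk_text(text: str) -> list[str]:
    # Slide a 1500-char window forward by 1300 chars so consecutive chunks share 200 chars
    chunks = []
    step = MAX_CHARS_PER_CHUNK - OVERLAP_CHARS
    for start in range(0, len(text), step):
        chunk = text[start:start + MAX_CHARS_PER_CHUNK]
        if chunk.strip():
            chunks.append(chunk)
    return chunks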

Features

1. PDF Document Ingestion

  • PDF Upload via REST API

  • Multimodal extraction:
     - Render each page to a high-resolution image (scale=2)
     - Complete OCR with Tesseract (French/English support)
     - Generate text and vision embeddings

  • Structured storage:
     - Page images in /data/images/
     - Vectors in Qdrant (docs_text + docs_vision)
     - Complete metadata (doc_id, page, source, bbox)
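
The per-page render step can be sketched with pypdfium2 at scale=2, mirroring the image naming seen later in search results ({doc_id}_p{n}.png); the helper name and directory layout are assumptions.

python
import pypdfium2 as pdfium

def render_pages(pdf_path: str, doc_id: str, out_dir: str = "/data/images") -> list[str]:
    # Render each PDF page to a high-resolution PNG for OCR and CLIP embedding
    pdf = pdfium.PdfDocument(pdf_path)
    paths = []
    for i in range(len(pdf)):
        bitmap = pdf[i].render(scale=2)  # scale=2 -> high-resolution raster
        image = bitmap.to_pil()
        path = f"{out_dir}/{doc_id}_p{i + 1}.png"
        image.save(path)
        paths.append(path)
    return paths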

2. Hybrid Text + Vision Search

python
# Simultaneous search in both spaces
results = query_hybrid(
    qdrant_url=QDRANT_URL,
    query="What are the security procedures?",
    top_k_text=15,     # 15 best text chunks
    top_k_vision=4,    # 4 best visual pages
    doc_ids=["doc123"],
)

3. Intelligent Question-Answering

  • Retrieval: Retrieves relevant passages (text + vision)

  • Enriched context: Combines OCR text + page metadata

  • Generation: Calls remote LLM with context

  • Citations: Mentions page numbers in answers
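
How retrieved chunks can be turned into a prompt that preserves page numbers for citations is sketched below; the prompt wording and helper name are illustrative, but the payload fields match the /search response described later.

python
def build_prompt(question: str, text_hits: list[dict]) -> str:
    # Keep the page number next to each chunk so the LLM can cite it as [Page N]
    context_parts = []
    for hit in text_hits:
        payload = hit["payload"]
        context_parts.append(f"[Page {payload['page']}] {payload['text']}")
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "Cite the page numbers you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )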

4. Vision-Language Model (VLM) Mode

For questions requiring visual analysis:

python
# Send page images + context to VLM
answer = answer_with_images(
    question="What does this diagram show?",
    images=[page1_img, page2_img],
    context_text=ocr_context,
)

5. Document Group Management

  • Create thematic groups of documents

  • Search within a specific group

  • Organize document library

6. Modern Web Interface

  • Conversational chat with documents

  • Visualization of search results

  • Preview of page images

  • Management of documents and groups

  • Relevance scores for each result

Installation

Prerequisites

  • Docker and Docker Compose installed

  • Python 3.11+ (for local development)

  • Tesseract OCR installed on Docker host

  • Ollama server accessible (for LLM)

Standard Installation

powershell
# 1. Clone the project
git clone <repo_url>
cd mmrag
# 2. Create data directories
New-Item -ItemType Directory -Force -Path "data\images"
New-Item -ItemType Directory -Force -Path "data\qdrant"
# 3. Configure environment (see Configuration section)
# 4. Build and launch
docker-compose build
docker-compose up -d
# 5. Check logs
docker-compose logs -f api

Installation on GB10

For GB10 deployment, follow the same steps but ensure:

yaml
# docker-compose.yml - Force CPU usage
environment:
  EMBED_DEVICE: "cpu"
  VLM_DEVICE: "cpu"
  TORCH_NUM_THREADS: "4"

Configuration

Main Environment Variables

Qdrant (Vector Database)

yaml
QDRANT_URL: "http://qdrant:6333"
QDRANT_TEXT_COLLECTION: "docs_text"
QDRANT_VISION_COLLECTION: "docs_vision"

OCR and Chunking

yaml
TESS_LANGS: "fra" # Tesseract languages (fra, eng, etc.)
MAX_CHARS_PER_CHUNK: "1500" # Max text chunk size
OVERLAP_CHARS: "200" # Overlap between chunks

Embeddings

yaml
TEXT_EMBED_DIM: "384" # Sentence-transformers dimension
VISION_EMBED_DIM: "512" # CLIP ViT-B-32 dimension
EMBED_DEVICE: "cpu" # Device for embeddings (cpu/cuda)
EMBED_MAX_IMAGE_SIDE: "1024" # Max image size before embedding
TORCH_NUM_THREADS: "4" # PyTorch threads

Remote LLM (Ollama)

yaml
LLM_HTTP_MODE: "openai" # OpenAI-compatible mode
LLM_HTTP_URL: "http://192.168.1.89:11434" # Ollama server URL
LLM_OPENAI_PATH: "/v1/chat/completions" # API endpoint
LLM_OPENAI_MODEL: "ministral-3:14b" # Model to use
LLM_HTTP_KEY: "" # API key (if needed)

VLM (Vision-Language Model)

yaml
VLM_MODEL_ID: "Qwen/Qwen2-VL-2B-Instruct" # Hugging Face model
VLM_DEVICE: "cpu" # Device (cpu/cuda)

Storage

yaml
STORAGE_DIR: "/app/storage" # Directory for images
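
For reference, a sketch of how these variables might be read at startup (the grouping into module-level constants is illustrative; names and defaults are taken from the blocks above):

python
import os

QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant:6333")
TESS_LANGS = os.getenv("TESS_LANGS", "fra")
MAX_CHARS_PER_CHUNK = int(os.getenv("MAX_CHARS_PER_CHUNK", "1500"))
OVERLAP_CHARS = int(os.getenv("OVERLAP_CHARS", "200"))
TEXT_EMBED_DIM = int(os.getenv("TEXT_EMBED_DIM", "384"))
VISION_EMBED_DIM = int(os.getenv("VISION_EMBED_DIM", "512"))
EMBED_DEVICE = os.getenv("EMBED_DEVICE", "cpu")
TORCH_NUM_THREADS = int(os.getenv("TORCH_NUM_THREADS", "4"))
STORAGE_DIR = os.getenv("STORAGE_DIR", "/app/storage")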

Usage

1. Access Web Interface

Open browser: http://localhost:8000 (or GB10 server IP)

2. Ingest a PDF Document

Via interface:

  1. Go to "Documents" tab

  2. Click "Upload PDF"

  3. Select a file

  4. Wait for ingestion to complete

Via API:

bash
curl -X POST http://localhost:8000/ingest/pdf \
  -F "pdf=@document.pdf"

Response:

json
{
  "doc_id": "a1b2c3d4-...",
  "filename": "document.pdf",
  "ingested": {
    "pages": 15,
    "text_chunks": 42,
    "vision_points": 15
  }
}

3. Ask a Question

Via interface:

  1. "Chat" tab

  2. Select document or group

  3. Type question

  4. View answer with page citations

Via API:

bash
curl -X POST http://localhost:8000/answer \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the main conclusions?",
    "doc_id": "a1b2c3d4-...",
    "top_k_text": 15,
    "top_k_vision": 4,
    "max_tokens": 400
  }'

4. Pure Search (without LLM)

bash
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "information security",
    "doc_ids": ["doc1", "doc2"],
    "top_k_text": 10,
    "top_k_vision": 5
  }'

5. Vision-Language Mode

For questions requiring visual analysis:

bash
curl -X POST http://localhost:8000/answer-vision \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Describe the chart on page 5",
    "doc_id": "a1b2c3d4-...",
    "top_k_pages": 4
  }'

The VLM (Qwen2-VL) will receive page images and generate an answer based on visual analysis.

API Endpoints

Documents

POST /ingest/pdf

Ingest a PDF document. Request:

  • pdf: PDF file (multipart/form-data)

Response:

json
{
  "doc_id": "uuid",
  "filename": "file.pdf",
  "ingested": {
    "pages": 10,
    "text_chunks": 25,
    "vision_points": 10
  }
}

GET /documents

List all ingested documents. Response:

json
{
  "documents": [
    {
      "doc_id": "uuid1",
      "filename": "doc1.pdf",
      "pages": 10,
      "ingested_at": "2026-01-01T12:00:00"
    }
  ]
}

DELETE /documents/{doc_id}

Delete a document.

Search

POST /search

Hybrid text + vision search. Request:

json
{
  "query": "text to search",
  "doc_id": "uuid",              // optional
  "doc_ids": ["uuid1", "uuid2"], // optional
  "group_id": "group_uuid",      // optional
  "top_k_text": 10,
  "top_k_vision": 5
}

Response:

json
{
  "query": "searched text",
  "text": [
    {
      "score": 0.92,
      "payload": {
        "doc_id": "uuid",
        "page": 3,
        "text": "chunk content...",
        "source": "document.pdf"
      }
    }
  ],
  "vision": [
    {
      "score": 0.87,
      "payload": {
        "doc_id": "uuid",
        "page": 3,
        "image_url": "/files/images/uuid_p3.png"
      }
    }
  ]
}

Question-Answering

POST /answer

Ask a question about documents. Request:

json
{
  "question": "What is the conclusion?",
  "doc_id": "uuid",
  "top_k_text": 15,
  "top_k_vision": 4,
  "max_tokens": 400,
  "use_semantic_search": true
}

Response:

json
{
  "answer": "The main conclusion is... [Page 12]",
  "sources": {
    "text": [...],
    "vision": [...]
  }
}

POST /answer-vision

Question with visual analysis via VLM. Request:

json
{
  "question": "What does this diagram show?",
  "doc_id": "uuid",
  "top_k_pages": 4,
  "max_tokens": 400
}

Groups

GET /groups

List all groups.

POST /groups

Create a new group. Request:

json
{
  "name": "Financial Reports",
  "description": "All Q1-Q4 2025 reports",
  "doc_ids": ["uuid1", "uuid2"]
}

PUT /groups/{group_id}

Update a group.

DELETE /groups/{group_id}

Delete a group.

Deployment

Local Deployment (Dev)

powershell
# Start services
docker-compose up -d
# View logs
docker-compose logs -f
# Restart API after changes
docker-compose restart api

Deployment on GB10 Server

The project includes an automated PowerShell deployment script:

powershell
# Edit deploy.ps1 with your parameters
$remoteHost = "user@gb10-server"
$remotePath = "~/projects/mmrag"
# Launch deployment
.\deploy.ps1

The script:

  1. Synchronizes files via SCP

  2. Builds Docker image on remote server

  3. Restarts services

  4. Displays logs

Production Configuration

For production on GB10:

yaml
# docker-compose.yml
services:
  api:
    restart: always
    deploy:
      resources:
        limits:
          cpus: '16'    # Limit to 16 of 72 cores
          memory: 32G   # RAM limit
    environment:
      EMBED_DEVICE: "cpu"
      VLM_DEVICE: "cpu"
      TORCH_NUM_THREADS: "8"

Monitoring

bash
# Check CPU/RAM usage
docker stats mmrag-api
# Real-time logs
docker-compose logs -f api
# Qdrant health
curl http://localhost:6333/health

Technologies

Backend

  • FastAPI 0.115.6 - Modern, high-performance web framework

  • Uvicorn 0.32.1 - High-performance ASGI server

  • Qdrant 1.11.5 - Vector database

  • qdrant-client 1.12.1 - Python client for Qdrant

Document Processing

  • pypdfium2 4.30.0 - PDF rendering to images

  • pytesseract 0.3.13 - OCR (Python wrapper for Tesseract)

  • Pillow 11.0.0 - Image manipulation

Embeddings and AI

  • sentence-transformers 3.3.1 - Multilingual text embeddings (model: paraphrase-multilingual-MiniLM-L12-v2, 384 dim)

  • open-clip-torch 2.26.1 - Vision embeddings (model: ViT-B-32 OpenAI, 512 dim)

  • PyTorch 2.1+ - Deep learning framework

  • transformers 4.47.1 - Hugging Face Transformers

  • Qwen2-VL-2B-Instruct - Vision-Language Model (lazy load)

Communication

  • httpx 0.27.2 - Modern HTTP client for LLM calls

Frontend

  • HTML5/CSS3 - Responsive web interface

  • Vanilla JavaScript - No framework, optimal performance

  • SVG Icons - Vector icons

GB10 Performance

Typical Benchmarks (GB10 with 72 ARM cores)

| Operation                    | Time   | Notes                                |
|------------------------------|--------|--------------------------------------|
| PDF ingestion (10 pages)     | ~15s   | OCR + text/vision embedding          |
| Hybrid search                | ~200ms | Query 2 collections simultaneously   |
| Text embedding (1 chunk)     | ~50ms  | CPU, sentence-transformers           |
| Vision embedding (1 page)    | ~100ms | CPU, CLIP ViT-B-32                   |
| LLM generation (400 tokens)  | ~5s    | Via remote Ollama                    |
| VLM inference (4 images)     | ~8s    | Qwen2-VL-2B in CPU mode              |

Resource Usage

  • RAM at startup: ~2 GB (without VLM)

  • RAM with VLM loaded: ~6 GB

  • CPU idle: <5% of 72 cores

  • CPU during ingestion: 40-60% (OCR parallelization)

  • CPU during search: 10-15%

Recommended GB10 Optimizations

  1. Increase TORCH_NUM_THREADS to 8-12 to leverage more cores

  2. Enable BF16 if supported by model (memory savings)

  3. Preload VLM at startup if frequently used

  4. Use NVMe SSD for /data/qdrant (I/O performance)

  5. Route remote LLM to GPU for large models
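
A sketch of the first three knobs in code; TORCH_NUM_THREADS is the project's variable, while USE_BF16 and PRELOAD_VLM are hypothetical flags shown only to illustrate the idea.

python
import os
import torch

# 1. More PyTorch threads for embedding work on the many-core GB10
torch.set_num_threads(int(os.getenv("TORCH_NUM_THREADS", "8")))

# 2. BF16 where the model supports it (roughly halves model memory vs. FP32)
dtype = torch.bfloat16 if os.getenv("USE_BF16", "0") == "1" else torch.float32

# 3. Optional eager preload of the VLM for latency-sensitive deployments
if os.getenv("PRELOAD_VLM", "0") == "1":
    from vlm import load_vlm  # project module shown earlier
    load_vlm()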

Roadmap

Version 1.1 (Q1 2026)

  • Multi-user support with authentication

  • Embedding cache to speed up re-ingestion

  • Layout-aware OCR with precise bbox detection

  • Export answers to Markdown/PDF

Version 1.2 (Q2 2026)

  • Support for additional formats (DOCX, PPTX, images)

  • Fine-tuning embedding model on domain corpus

  • Automatic document clustering

  • GraphQL API complement to REST

Version 2.0 (Q3 2026)

  • Streaming mode for long answers

  • Integration of native multimodal models (Gemini, GPT-4V)

  • Audio/video document support

  • Analytics dashboard and usage metrics

Additional Technical Notes

Why two separate collections (text/vision)?

The separation into two collections enables:

  1. Different dimensions (384 vs 512) optimized for each modality

  2. Independent search then result fusion

  3. Differential scaling: add more shards to the most solicited collection

  4. Easier debugging: isolate issues by modality
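
A sketch of "independent search then result fusion" with qdrant-client; this is illustrative and not the project's actual query_hybrid implementation.

python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://qdrant:6333")

def hybrid_search(text_vec, vision_vec, top_k_text=15, top_k_vision=4):
    # Each modality is searched in its own collection with its own dimension
    text_hits = client.search(
        collection_name="docs_text", query_vector=text_vec, limit=top_k_text
    )
    vision_hits = client.search(
        collection_name="docs_vision", query_vector=vision_vec, limit=top_k_vision
    )
    # Fusion step: keep both lists so the caller can weight or interleave them
    return {"text": text_hits, "vision": vision_hits}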

Chunking Strategy

Chunking with overlap ensures:

  • Semantic continuity: sentences cut at a chunk boundary appear in both adjacent chunks

  • Sufficient context: 1500 chars ≈ 1-2 paragraphs

  • Search performance: chunks neither too small (noise) nor too large (dilution)

Model Choices

| Model            | Reason for choice                             |
|------------------|-----------------------------------------------|
| MiniLM-L12-v2    | Multilingual, compact, excellent on CPU       |
| CLIP ViT-B-32    | De facto standard, good quality/size balance  |
| Qwen2-VL-2B      | Small, highly performant VLM, CPU-optimized   |
| Ministral-3:14b  | Efficient LLM, good French, Q4 quantized      |

Security

For production deployment:

  • Add JWT authentication on endpoints

  • Limit PDF upload size (max 50MB)

  • Rate limiting on /answer to prevent abuse

  • Mandatory HTTPS with Let's Encrypt certificate

  • Strict doc_ids validation to prevent IDOR
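
Two of these measures sketched in FastAPI terms; the size limit check, regex, and route body are illustrative choices, not the project's actual handlers.

python
import re
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()
MAX_PDF_BYTES = 50 * 1024 * 1024             # 50 MB upload cap
DOC_ID_RE = re.compile(r"^[0-9a-f-]{36}$")   # UUID-shaped doc_ids only

@app.post("/ingest/pdf")
async def ingest_pdf(pdf: UploadFile):
    data = await pdf.read()
    if len(data) > MAX_PDF_BYTES:
        raise HTTPException(status_code=413, detail="PDF too large")
    ...  # hand off to the normal ingestion pipeline

def check_doc_ids(doc_ids: list[str]) -> None:
    # Reject anything that does not look like one of our UUIDs (anti-IDOR)
    for doc_id in doc_ids:
        if not DOC_ID_RE.match(doc_id):
            raise HTTPException(status_code=400, detail="Invalid doc_id")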

Proprietary - Emmanuel Forgues © 2026