MM RAG
Multimodal Retrieval-Augmented Generation
Overview
MM RAG (Multimodal RAG) is an advanced document intelligence system that combines vector search, OCR, and language models to extract and synthesize information from complex PDF documents. The system is specifically optimized to run efficiently on NVIDIA's GB10 platform.
What is the GB10?
The GB10 (Grace-Blackwell GB10) is an NVIDIA processor based on the ARM Grace architecture (72 ARM Neoverse v2 cores) optimized for AI inference and server workloads. Unlike traditional GPUs, the GB10 is designed for:
Maximum energy efficiency for AI inference
Native ARM architecture with excellent CPU operation support
High memory density (up to 480 GB of LPDDR5X)
Optimal performance/watt for embedding and RAG tasks
Why is MM RAG optimized for GB10?
This project was designed from the ground up to leverage the unique characteristics of the GB10:
CPU-First Embeddings: Uses CPU-efficient embedding models (sentence-transformers, CLIP) that excel on ARM Grace rather than requiring GPUs.
Small VLM models: Uses Qwen2-VL-2B-Instruct (2 billion parameters) which easily fits in ARM memory and performs well in CPU inference mode.
HTTP-offloaded LLM: Delegates text generation to a remote Ollama server, allowing the GB10 to focus on embedding and vector search.
Optimized Chunking and OCR: Exploits the 72 ARM cores to parallelize OCR (Tesseract) and document processing.
No CUDA dependency: Runs entirely in CPU mode, avoiding the complexity of CUDA installation on ARM.
Abundant memory: Takes advantage of the GB10's large memory capacity to simultaneously load embedding models (text + vision) and the VLM without swapping.
Architecture
Main Components
┌─────────────────┐
│ Frontend UI │ (HTML/JS/CSS)
│ Static Files │
└────────┬────────┘
│
▼
┌─────────────────┐
│ FastAPI │ Port 8000
│ (main.py) │
├─────────────────┤
│ - PDF Ingest │
│ - Embeddings │
│ - Retrieval │
│ - LLM Calls │
│ - VLM (lazy) │
└────────┬────────┘
│
┌────┴────┬────────────┬──────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────────┐ ┌──────────┐
│Qdrant│ │Tess. │ │Ollama │ │Qwen2-VL │
│Vector│ │OCR │ │(remote) │ │(local) │
│DB │ │ │ │LLM HTTP │ │Vision-LLM│
└──────┘ └──────┘ └──────────┘ └──────────┘
Qdrant Collections
The system uses two distinct vector collections:
docs_text (dimension: 384) - Stores text chunks extracted by OCR - Embeddings: paraphrase-multilingual-MiniLM-L12-v2 - Optimized for multilingual semantic search (FR/EN) - Metadata: doc_id, source, page, chunk_index, bbox
docs_vision (dimension: 512) - Stores embeddings of complete page images - Embeddings: OpenCLIP ViT-B-32 - Enables visual and multimodal search - Metadata: doc_id, source, page, image_path
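As an illustration, the two collections could be created with qdrant-client roughly as follows (a hedged sketch: names and dimensions come from this README, cosine distance is assumed, and the project's actual bootstrap code may differ):
python
# Hedged sketch: create the two collections described above (cosine distance assumed)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://qdrant:6333")

for name, dim in (("docs_text", 384), ("docs_vision", 512)):
    if not client.collection_exists(name):
        client.create_collection(
            collection_name=name,
            vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
        )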
GB10 Optimizations
1. CPU-First Strategy
python
# embedder.py - Configuration for GB10
import os
import torch

DEVICE = os.getenv("EMBED_DEVICE", "cpu")  # Force CPU on GB10
torch.set_num_threads(4)                   # Use 4 of the 72 available cores
Embedding models are executed on CPU with a limited number of threads to avoid contention and enable application-level parallelization.
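For illustration, text chunks can be embedded on the CPU with sentence-transformers roughly as follows (the embed_texts helper is illustrative, not the project's actual function; the model name is the one used for docs_text):
python
# Hedged sketch of CPU text embedding
from sentence_transformers import SentenceTransformer

_text_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2", device="cpu")

def embed_texts(chunks: list[str]) -> list[list[float]]:
    # Produces 384-dimensional vectors, matching the docs_text collection
    return _text_model.encode(chunks, normalize_embeddings=True).tolist()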
2. Intelligent Image Downscaling
python
# Reduce large images before embedding to save RAM
MAX_SIDE = 1024 # Limit max size to 1024px
This optimization reduces memory footprint without significant quality loss for vector search.
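A minimal sketch of the downscaling step followed by vision embedding with open-clip-torch (illustrative helpers consistent with the ViT-B-32 model named in this README; the actual embedder may differ):
python
# Hedged sketch: downscale a page image, then embed it with OpenCLIP ViT-B-32
import open_clip
import torch
from PIL import Image

MAX_SIDE = 1024
_clip_model, _, _preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")

def downscale(img: Image.Image, max_side: int = MAX_SIDE) -> Image.Image:
    w, h = img.size
    if max(w, h) <= max_side:
        return img
    scale = max_side / max(w, h)
    return img.resize((int(w * scale), int(h * scale)), Image.Resampling.LANCZOS)

def embed_page_image(img: Image.Image) -> list[float]:
    tensor = _preprocess(downscale(img)).unsqueeze(0)  # 1 x 3 x 224 x 224
    with torch.no_grad():
        features = _clip_model.encode_image(tensor)    # 1 x 512
    return features[0].tolist()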
3. Lazy VLM Loading
python
# vlm.py - Load model only when needed
_vlm_loaded = False

def load_vlm():
    global _vlm_loaded
    if _vlm_loaded:
        return
    # Load Qwen2-VL-2B only on first call
This saves up to 4 GB of RAM at startup; the VLM is loaded only the first time a multimodal question is asked.
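For reference, a hedged sketch of what this lazy loader could look like with Hugging Face transformers (class names exist in transformers 4.47; the project's actual vlm.py may differ):
python
# Hedged sketch of lazy VLM loading (illustrative, not necessarily the exact vlm.py code)
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

_vlm_loaded = False
_model = None
_processor = None

def load_vlm(model_id: str = "Qwen/Qwen2-VL-2B-Instruct"):
    global _vlm_loaded, _model, _processor
    if _vlm_loaded:
        return
    # The first call pays the ~4 GB load cost; later calls return immediately
    _model = Qwen2VLForConditionalGeneration.from_pretrained(model_id)
    _processor = AutoProcessor.from_pretrained(model_id)
    _vlm_loaded = True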
4. HTTP-Offloaded LLM
yaml
# docker-compose.yml
LLM_HTTP_URL: "http://192.168.1.89:11434"
LLM_OPENAI_MODEL: "ministral-3:14b"
The GB10 delegates text generation to a remote Ollama server (which can be GPU-backed for large models) and focuses on its core strengths: embedding and vector search.
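For illustration, a hedged sketch of such a call to Ollama's OpenAI-compatible endpoint using httpx (the generate helper is illustrative; URL, path, and model name are taken from this README's configuration):
python
# Hedged sketch of offloaded text generation via Ollama's OpenAI-compatible API
import httpx

LLM_HTTP_URL = "http://192.168.1.89:11434"
LLM_OPENAI_PATH = "/v1/chat/completions"
LLM_OPENAI_MODEL = "ministral-3:14b"

def generate(prompt: str, max_tokens: int = 400) -> str:
    payload = {
        "model": LLM_OPENAI_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    response = httpx.post(LLM_HTTP_URL + LLM_OPENAI_PATH, json=payload, timeout=120.0)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]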
5. Parallelized OCR
During ingestion, Tesseract OCR is run on multiple pages in parallel, naturally taking advantage of the GB10's many cores (see the sketch below).
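A minimal sketch of page-level OCR parallelization (illustrative helpers, not the project's actual ingest code; the language code matches the TESS_LANGS setting shown in the Configuration section):
python
# Hedged sketch: OCR several page images in parallel across worker processes
from concurrent.futures import ProcessPoolExecutor
from PIL import Image
import pytesseract

def ocr_page(image_path: str) -> str:
    # Each worker runs Tesseract on a single page image
    return pytesseract.image_to_string(Image.open(image_path), lang="fra")

def ocr_pages(image_paths: list[str], workers: int = 8) -> list[str]:
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, image_paths))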
6. Optimized Chunking
python
MAX_CHARS_PER_CHUNK = 1500 # Optimal chunk size
OVERLAP_CHARS = 200 # Overlap for contextual continuity
Chunk size calibrated to:
Maximize semantic density
Avoid excessive fragmentation
Optimize search performance on ARM
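A minimal sketch of overlapping chunking with these parameters (illustrative; the project's chunker may handle sentence boundaries differently):
python
# Hedged sketch of fixed-size chunking with overlap
MAX_CHARS_PER_CHUNK = 1500
OVERLAP_CHARS = 200

def chunk_text(text: str) -> list[str]:
    step = MAX_CHARS_PER_CHUNK - OVERLAP_CHARS  # 1300 new characters per chunk
    return [text[start:start + MAX_CHARS_PER_CHUNK] for start in range(0, len(text), step)]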
Features
1. PDF Document Ingestion
PDF Upload via REST API
Multimodal extraction: - Render each page to high-resolution image (scale=2) - Complete OCR with Tesseract (French/English support) - Generate text and vision embeddings
Structured storage: - Page images in /data/images/ - Vectors in Qdrant (docs_text + docs_vision) - Complete metadata (doc_id, page, source, bbox)
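For illustration, the page-rendering step (scale=2) could be done with pypdfium2 roughly as follows (a hedged sketch; the actual ingest pipeline may differ):
python
# Hedged sketch: render each PDF page to a high-resolution PIL image
import pypdfium2 as pdfium

def render_pages(pdf_path: str, scale: float = 2.0):
    pdf = pdfium.PdfDocument(pdf_path)
    for index in range(len(pdf)):
        # scale=2 doubles the default resolution, which helps OCR accuracy
        yield index, pdf[index].render(scale=scale).to_pil()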
2. Hybrid Text + Vision Search
python
# Simultaneous search in both spaces
results = query_hybrid(
    qdrant_url=QDRANT_URL,
    query="What are the security procedures?",
    top_k_text=15,    # 15 best text chunks
    top_k_vision=4,   # 4 best visual pages
    doc_ids=["doc123"]
)
3. Intelligent Question-Answering
Retrieval: Retrieves relevant passages (text + vision)
Enriched context: Combines OCR text + page metadata
Generation: Calls remote LLM with context
Citations: Mentions page numbers in answers
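A minimal sketch of how retrieved chunks could be tagged with their page numbers so the LLM can cite them (illustrative helper, not the project's exact prompt-building code):
python
# Hedged sketch: build an LLM context string with page tags for citations
def build_context(text_hits: list[dict]) -> str:
    return "\n\n".join(
        f"[Page {hit['payload']['page']}] {hit['payload']['text']}" for hit in text_hits
    )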
4. Vision-Language Model (VLM) Mode
For questions requiring visual analysis:
python
# Send page images + context to VLM
answer = answer_with_images(
    question="What does this diagram show?",
    images=[page1_img, page2_img],
    context_text=ocr_context
)
5. Document Group Management
Create thematic groups of documents
Search within a specific group
Organize document library
6. Modern Web Interface
Conversational chat with documents
Visualization of search results
Preview of page images
Management of documents and groups
Relevance scores for each result
Installation
Prerequisites
Docker and Docker Compose installed
Python 3.11+ (for local development)
Tesseract OCR installed on Docker host
Ollama server accessible (for LLM)
Standard Installation
powershell
# 1. Clone the project
git clone <repo_url>
cd mmrag
# 2. Create data directories
New-Item -ItemType Directory -Force -Path "data\images"
New-Item -ItemType Directory -Force -Path "data\qdrant"
# 3. Configure environment (see Configuration section)
# 4. Build and launch
docker-compose build
docker-compose up -d
# 5. Check logs
docker-compose logs -f api
Installation on GB10
For GB10 deployment, follow the same steps but ensure:
yaml
# docker-compose.yml - Force CPU usage
environment:
  EMBED_DEVICE: "cpu"
  VLM_DEVICE: "cpu"
  TORCH_NUM_THREADS: "4"
Configuration
Main Environment Variables
Qdrant (Vector Database)
yaml
QDRANT_URL: "http://qdrant:6333"
QDRANT_TEXT_COLLECTION: "docs_text"
QDRANT_VISION_COLLECTION: "docs_vision"
OCR and Chunking
yaml
TESS_LANGS: "fra" # Tesseract languages (fra, eng, etc.)
MAX_CHARS_PER_CHUNK: "1500" # Max text chunk size
OVERLAP_CHARS: "200" # Overlap between chunks
Embeddings
yaml
TEXT_EMBED_DIM: "384" # Sentence-transformers dimension
VISION_EMBED_DIM: "512" # CLIP ViT-B-32 dimension
EMBED_DEVICE: "cpu" # Device for embeddings (cpu/cuda)
EMBED_MAX_IMAGE_SIDE: "1024" # Max image size before embedding
TORCH_NUM_THREADS: "4" # PyTorch threads
Remote LLM (Ollama)
yaml
LLM_HTTP_MODE: "openai" # OpenAI-compatible mode
LLM_HTTP_URL: "http://192.168.1.89:11434" # Ollama server URL
LLM_OPENAI_PATH: "/v1/chat/completions" # API endpoint
LLM_OPENAI_MODEL: "ministral-3:14b" # Model to use
LLM_HTTP_KEY: "" # API key (if needed)
VLM (Vision-Language Model)
yaml
VLM_MODEL_ID: "Qwen/Qwen2-VL-2B-Instruct" # Hugging Face model
VLM_DEVICE: "cpu" # Device (cpu/cuda)
Storage
yaml
STORAGE_DIR: "/app/storage" # Directory for images
Usage
1. Access Web Interface
Open browser: http://localhost:8000 (or GB10 server IP)
2. Ingest a PDF Document
Via interface:
Go to "Documents" tab
Click "Upload PDF"
Select a file
Wait for ingestion to complete
Via API:
bash
curl -X POST http://localhost:8000/ingest/pdf \
-F "pdf=@document.pdf"
Response:
json
{
"doc_id": "a1b2c3d4-...",
"filename": "document.pdf",
"ingested": {
"pages": 15,
"text_chunks": 42,
"vision_points": 15
}
}
3. Ask a Question
Via interface:
"Chat" tab
Select document or group
Type question
View answer with page citations
Via API:
bash
curl -X POST http://localhost:8000/answer \
-H "Content-Type: application/json" \
-d '{
"question": "What are the main conclusions?",
"doc_id": "a1b2c3d4-...",
"top_k_text": 15,
"top_k_vision": 4,
"max_tokens": 400
}'
4. Pure Search (without LLM)
bash
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{
"query": "information security",
"doc_ids": ["doc1", "doc2"],
"top_k_text": 10,
"top_k_vision": 5
}'
5. Vision-Language Mode
For questions requiring visual analysis:
bash
curl -X POST http://localhost:8000/answer-vision \
-H "Content-Type: application/json" \
-d '{
"question": "Describe the chart on page 5",
"doc_id": "a1b2c3d4-...",
"top_k_pages": 4
}'
The VLM (Qwen2-VL) will receive page images and generate an answer based on visual analysis.
API Endpoints
Documents
POST /ingest/pdf
Ingest a PDF document. Request:
pdf: PDF file (multipart/form-data)
Response:
json
{
"doc_id": "uuid",
"filename": "file.pdf",
"ingested": {
"pages": 10,
"text_chunks": 25,
"vision_points": 10
}
}
GET /documents
List all ingested documents. Response:
json
{
"documents": [
{
"doc_id": "uuid1",
"filename": "doc1.pdf",
"pages": 10,
"ingested_at": "2026-01-01T12:00:00"
}
]
}
DELETE /documents/{doc_id}
Delete a document.
Search
POST /search
Hybrid text + vision search. Request:
json
{
"query": "text to search",
"doc_id": "uuid", // optional
"doc_ids": ["uuid1", "uuid2"], // optional
"group_id": "group_uuid", // optional
"top_k_text": 10,
"top_k_vision": 5
}
Response:
json
{
"query": "searched text",
"text": [
{
"score": 0.92,
"payload": {
"doc_id": "uuid",
"page": 3,
"text": "chunk content...",
"source": "document.pdf"
}
}
],
"vision": [
{
"score": 0.87,
"payload": {
"doc_id": "uuid",
"page": 3,
"image_url": "/files/images/uuid_p3.png"
}
}
]
}
Question-Answering
POST /answer
Ask a question about documents. Request:
json
{
"question": "What is the conclusion?",
"doc_id": "uuid",
"top_k_text": 15,
"top_k_vision": 4,
"max_tokens": 400,
"use_semantic_search": true
}
Response:
json
{
"answer": "The main conclusion is... [Page 12]",
"sources": {
"text": [...],
"vision": [...]
}
}
POST /answer-vision
Question with visual analysis via VLM. Request:
json
{
"question": "What does this diagram show?",
"doc_id": "uuid",
"top_k_pages": 4,
"max_tokens": 400
}
Groups
GET /groups
List all groups.
POST /groups
Create a new group. Request:
json
{
"name": "Financial Reports",
"description": "All Q1-Q4 2025 reports",
"doc_ids": ["uuid1", "uuid2"]
}
PUT /groups/{group_id}
Update a group.
DELETE /groups/{group_id}
Delete a group.
Deployment
Local Deployment (Dev)
powershell
# Start services
docker-compose up -d
# View logs
docker-compose logs -f
# Restart API after changes
docker-compose restart api
Deployment on GB10 Server
The project includes an automated PowerShell deployment script:
powershell
# Edit deploy.ps1 with your parameters
$remoteHost = "user@gb10-server"
$remotePath = "~/projects/mmrag"
# Launch deployment
.\deploy.ps1
The script:
Synchronizes files via SCP
Builds Docker image on remote server
Restarts services
Displays logs
Production Configuration
For production on GB10:
yaml
# docker-compose.yml
services:
  api:
    restart: always
    deploy:
      resources:
        limits:
          cpus: '16'    # Limit to 16 of 72 cores
          memory: 32G   # RAM limit
    environment:
      EMBED_DEVICE: "cpu"
      VLM_DEVICE: "cpu"
      TORCH_NUM_THREADS: "8"
Monitoring
bash
# Check CPU/RAM usage
docker stats mmrag-api
# Real-time logs
docker-compose logs -f api
# Qdrant health
curl http://localhost:6333/healthz
Technologies
Backend
FastAPI 0.115.6 - Modern, high-performance web framework
Uvicorn 0.32.1 - High-performance ASGI server
Qdrant 1.11.5 - Vector database
qdrant-client 1.12.1 - Python client for Qdrant
Document Processing
pypdfium2 4.30.0 - PDF rendering to images
pytesseract 0.3.13 - OCR (Python wrapper for Tesseract)
Pillow 11.0.0 - Image manipulation
Embeddings and AI
sentence-transformers 3.3.1 - Multilingual text embeddings - Model: paraphrase-multilingual-MiniLM-L12-v2 (384 dim)
open-clip-torch 2.26.1 - Vision embeddings - Model: ViT-B-32 OpenAI (512 dim)
PyTorch 2.1+ - Deep learning framework
transformers 4.47.1 - Hugging Face Transformers
Qwen2-VL-2B-Instruct - Vision-Language Model (lazy load)
Communication
httpx 0.27.2 - Modern HTTP client for LLM calls
Frontend
HTML5/CSS3 - Responsive web interface
Vanilla JavaScript - No framework, optimal performance
SVG Icons - Vector icons
GB10 Performance
Typical Benchmarks (GB10 with 72 ARM cores)
| Operation | Time | Notes |
|---|---|---|
| PDF ingestion (10 pages) | ~15s | OCR + text/vision embedding |
| Hybrid search | ~200ms | Query 2 collections simultaneously |
| Text embedding (1 chunk) | ~50ms | CPU, sentence-transformers |
| Vision embedding (1 page) | ~100ms | CPU, CLIP ViT-B-32 |
| LLM generation (400 tokens) | ~5s | Via remote Ollama |
| VLM inference (4 images) | ~8s | Qwen2-VL-2B in CPU mode |
Resource Usage
RAM at startup: ~2 GB (without VLM)
RAM with VLM loaded: ~6 GB
CPU idle: <5% of 72 cores
CPU during ingestion: 40-60% (OCR parallelization)
CPU during search: 10-15%
Recommended GB10 Optimizations
Increase TORCH_NUM_THREADS to 8-12 to leverage more cores
Enable BF16 if supported by model (memory savings)
Preload VLM at startup if frequently used
Use NVMe SSD for /data/qdrant (I/O performance)
Route remote LLM to GPU for large models
Roadmap
Version 1.1 (Q1 2026)
Multi-user support with authentication
Embedding cache to speed up re-ingestion
Layout-aware OCR with precise bbox detection
Export answers to Markdown/PDF
Version 1.2 (Q2 2026)
Support for additional formats (DOCX, PPTX, images)
Fine-tuning embedding model on domain corpus
Automatic document clustering
GraphQL API complement to REST
Version 2.0 (Q3 2026)
Streaming mode for long answers
Integration of native multimodal models (Gemini, GPT-4V)
Audio/video document support
Analytics dashboard and usage metrics
Additional Technical Notes
Why two separate collections (text/vision)?
The separation into two collections enables:
Different dimensions (384 vs 512) optimized for each modality
Independent search then result fusion
Differential scaling: add more shards to the more heavily used collection
Easier debugging: isolate issues by modality
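As an illustration of independent search followed by fusion, the two collections can be queried separately and the results merged (a hedged sketch using qdrant-client; the project's query_hybrid may differ):
python
# Hedged sketch: search each modality independently, then return both result sets
from qdrant_client import QdrantClient

client = QdrantClient(url="http://qdrant:6333")

def hybrid_search(text_vector, vision_vector, top_k_text: int = 10, top_k_vision: int = 5) -> dict:
    text_hits = client.search(collection_name="docs_text", query_vector=text_vector, limit=top_k_text)
    vision_hits = client.search(collection_name="docs_vision", query_vector=vision_vector, limit=top_k_vision)
    return {"text": text_hits, "vision": vision_hits}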
Chunking Strategy
Chunking with overlap ensures:
Semantic continuity: sentences split at a chunk boundary appear in both chunks
Sufficient context: 1500 chars ≈ 1-2 paragraphs
Search performance: chunks neither too small (noise) nor too large (dilution)
Model Choices
| Model | Reason for choice |
|---|---|
| MiniLM-L12-v2 | Multilingual, compact, excellent on CPU |
| CLIP ViT-B-32 | De facto standard, good quality/size balance |
| Qwen2-VL-2B | Small, capable VLM that runs well on CPU |
| Ministral-3:14b | Efficient LLM, good French, Q4 quantized |
Security
For production deployment:
Add JWT authentication on endpoints
Limit PDF upload size (max 50MB)
Rate limiting on /answer to prevent abuse
Mandatory HTTPS with Let's Encrypt certificate
Strict doc_ids validation to prevent IDOR
Proprietary - Emmanuel Forgues © 2026