MM RAG

Multimodal Retrieval-Augmented Generation

Table of Contents

  • Overview
  • Architecture
  • GB10 Optimizations
  • Features
  • Installation
  • Configuration
  • Usage
  • API Endpoints
  • Deployment
  • Technologies
  • GB10 Performance
  • Roadmap
  • Additional Technical Notes

Overview

MM RAG (Multimodal RAG) is an advanced document intelligence system that combines vector search, OCR, and language models to extract and synthesize information from complex PDF documents. The system is specifically optimized to run efficiently on NVIDIA's GB10 platform.

What is the GB10?

The GB10 (Grace-Blackwell GB10) is an NVIDIA processor based on the ARM Grace architecture (72 ARM Neoverse v2 cores) optimized for AI inference and server workloads. Unlike traditional GPUs, the GB10 is designed for:

  • Maximum energy efficiency for AI inference

  • Native ARM architecture with excellent CPU operation support

  • High memory density (up to 480 GB of LPDDR5X)

  • Optimal performance/watt for embedding and RAG tasks

Why is MM RAG optimized for GB10?

This project was designed from the ground up to leverage the unique characteristics of the GB10:

  1. CPU-First Embeddings: Uses CPU-efficient embedding models (sentence-transformers, CLIP) that excel on ARM Grace rather than requiring GPUs.

  2. Small VLM models: Uses Qwen2-VL-2B-Instruct (2 billion parameters) which easily fits in ARM memory and performs well in CPU inference mode.

  3. HTTP-offloaded LLM: Delegates text generation to a remote Ollama server, allowing the GB10 to focus on embedding and vector search.

  4. Optimized Chunking and OCR: Exploits the 72 ARM cores to parallelize OCR (Tesseract) and document processing.

  5. No CUDA dependency: Runs entirely in CPU mode, avoiding the complexity of CUDA installation on ARM.

  6. Abundant memory: Takes advantage of the GB10's large memory capacity to simultaneously load embedding models (text + vision) and the VLM without swapping.

Architecture

Main Components

┌─────────────────┐
│ Frontend UI     │  (HTML/JS/CSS)
│ Static Files    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ FastAPI         │  Port 8000
│ (main.py)       │
├─────────────────┤
│ - PDF Ingest    │
│ - Embeddings    │
│ - Retrieval     │
│ - LLM Calls     │
│ - VLM (lazy)    │
└────────┬────────┘
         │
    ┌────┴────┬───────────┬─────────────┐
    │         │           │             │
    ▼         ▼           ▼             ▼
┌──────┐  ┌──────┐  ┌──────────┐  ┌──────────┐
│Qdrant│  │Tess. │  │Ollama    │  │Qwen2-VL  │
│Vector│  │OCR   │  │(remote)  │  │(local)   │
│DB    │  │      │  │LLM HTTP  │  │Vision-LLM│
└──────┘  └──────┘  └──────────┘  └──────────┘

Qdrant Collections

The system uses two distinct vector collections:

  1. docs_text (dimension: 384)
     - Stores text chunks extracted by OCR
     - Embeddings: paraphrase-multilingual-MiniLM-L12-v2
     - Optimized for multilingual semantic search (FR/EN)
     - Metadata: doc_id, source, page, chunk_index, bbox

  2. docs_vision (dimension: 512)
     - Stores embeddings of complete page images
     - Embeddings: OpenCLIP ViT-B-32
     - Enables visual and multimodal search
     - Metadata: doc_id, source, page, image_path
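
How these two collections might be created at startup is sketched below with qdrant-client. Only the collection names and dimensions come from the list above; the cosine distance metric and the exact setup location are assumptions.

python
# Hedged sketch: create the two collections if they do not exist yet
# (the distance metric is an assumption, not confirmed by the project)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://qdrant:6333")

for name, dim in [("docs_text", 384), ("docs_vision", 512)]:
    if not client.collection_exists(name):
        client.create_collection(
            collection_name=name,
            vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
        )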

GB10 Optimizations

1. CPU-First Strategy

python
# embedder.py - Configuration for GB10
import os
import torch

DEVICE = os.getenv("EMBED_DEVICE", "cpu")  # Force CPU on GB10
torch.set_num_threads(4)                   # Use 4 of the 72 available cores

Embedding models are executed on CPU with a limited number of threads to avoid contention and enable application-level parallelization.

2. Intelligent Image Downscaling

python
# Reduce large images before embedding to save RAM
MAX_SIDE = 1024 # Limit max size to 1024px

This optimization reduces memory footprint without significant quality loss for vector search.
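
A minimal sketch of the downscale step, assuming Pillow (which the project already uses); the helper name downscale is hypothetical, only MAX_SIDE comes from the snippet above.

python
from PIL import Image

MAX_SIDE = 1024  # Limit max size to 1024px

def downscale(img: Image.Image, max_side: int = MAX_SIDE) -> Image.Image:
    # thumbnail() only shrinks and preserves the aspect ratio
    if max(img.size) <= max_side:
        return img
    img = img.copy()
    img.thumbnail((max_side, max_side), Image.Resampling.LANCZOS)
    return img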

3. Lazy VLM Loading

python
# vlm.py - Load model only when needed
_vlm_loaded = False

def load_vlm():
    global _vlm_loaded
    if _vlm_loaded:
        return
    # Load Qwen2-VL-2B only on first call
    ...
    _vlm_loaded = True

This saves up to 4 GB of RAM at startup: the VLM is loaded only when a multimodal question is asked.
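
What that first call might do in full is sketched below with Hugging Face transformers; the module-level cache variables and exact loading options are illustrative, and the project's vlm.py may structure this differently.

python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

_vlm_model = None
_vlm_processor = None

def load_vlm():
    """Load Qwen2-VL-2B on the first multimodal request only."""
    global _vlm_model, _vlm_processor
    if _vlm_model is not None:
        return
    model_id = "Qwen/Qwen2-VL-2B-Instruct"
    _vlm_processor = AutoProcessor.from_pretrained(model_id)
    # CPU inference on the GB10: no CUDA required
    _vlm_model = Qwen2VLForConditionalGeneration.from_pretrained(model_id).to("cpu")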

4. HTTP-Offloaded LLM

yaml
# docker-compose.yml
LLM_HTTP_URL: "http://192.168.1.89:11434"
LLM_OPENAI_MODEL: "ministral-3:14b"

The GB10 delegates text generation to a remote Ollama server (can be a GPU for large models), focusing on its core competency: embedding and search.
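
The call itself is a plain OpenAI-compatible chat completion over HTTP. A sketch with httpx, using only the variables from the configuration above (the helper name generate and the timeout value are illustrative):

python
import os
import httpx

LLM_HTTP_URL = os.getenv("LLM_HTTP_URL", "http://192.168.1.89:11434")
LLM_OPENAI_PATH = os.getenv("LLM_OPENAI_PATH", "/v1/chat/completions")
LLM_OPENAI_MODEL = os.getenv("LLM_OPENAI_MODEL", "ministral-3:14b")

def generate(prompt: str, max_tokens: int = 400) -> str:
    # Send the prompt to the remote Ollama server and return the generated text
    resp = httpx.post(
        LLM_HTTP_URL + LLM_OPENAI_PATH,
        json={
            "model": LLM_OPENAI_MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=120.0,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]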

5. Parallelized OCR

During ingestion, Tesseract OCR runs on multiple pages simultaneously, naturally exploiting the GB10's many ARM cores.
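
One way to express this page-level parallelism explicitly is a process pool over the rendered page images; the sketch below is illustrative and not necessarily the project's exact ingestion code.

python
from concurrent.futures import ProcessPoolExecutor

import pytesseract
from PIL import Image

def ocr_page(image_path: str) -> str:
    # OCR a single rendered page image (French + English)
    return pytesseract.image_to_string(Image.open(image_path), lang="fra+eng")

def ocr_pages(image_paths: list[str], workers: int = 8) -> list[str]:
    # Fan pages out across cores; the GB10's many ARM cores absorb the load
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, image_paths))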

6. Optimized Chunking

python
MAX_CHARS_PER_CHUNK = 1500 # Optimal chunk size
OVERLAP_CHARS = 200 # Overlap for contextual continuity

Chunk size calibrated to:

  • Maximize semantic density

  • Avoid excessive fragmentation

  • Optimize search performance on ARM
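
A minimal sketch of the overlap logic implied by these two settings (the helper name chunk_text is hypothetical):

python
MAX_CHARS_PER_CHUNK = 1500  # Optimal chunk size
OVERLAP_CHARS = 200         # Overlap for contextual continuity

def chunk_text(text: str) -> list[str]:
    # Slide a 1500-char window forward by 1300 chars so consecutive chunks share 200 chars
    chunks = []
    step = MAX_CHARS_PER_CHUNK - OVERLAP_CHARS
    for start in range(0, len(text), step):
        chunk = text[start:start + MAX_CHARS_PER_CHUNK]
        if chunk.strip():
            chunks.append(chunk)
    return chunks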

Features

1. PDF Document Ingestion

  • PDF Upload via REST API

  • Multimodal extraction:
     - Render each page to a high-resolution image (scale=2)
     - Complete OCR with Tesseract (French/English support)
     - Generate text and vision embeddings

  • Structured storage:
     - Page images in /data/images/
     - Vectors in Qdrant (docs_text + docs_vision)
     - Complete metadata (doc_id, page, source, bbox)
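
The per-page render step can be sketched with pypdfium2 at scale=2, mirroring the image naming seen later in search results ({doc_id}_p{n}.png); the helper name and directory layout are assumptions.

python
import pypdfium2 as pdfium

def render_pages(pdf_path: str, doc_id: str, out_dir: str = "/data/images") -> list[str]:
    # Render each PDF page to a high-resolution PNG for OCR and CLIP embedding
    pdf = pdfium.PdfDocument(pdf_path)
    paths = []
    for i in range(len(pdf)):
        bitmap = pdf[i].render(scale=2)  # scale=2 -> high-resolution raster
        image = bitmap.to_pil()
        path = f"{out_dir}/{doc_id}_p{i + 1}.png"
        image.save(path)
        paths.append(path)
    return paths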

2. Hybrid Text + Vision Search

python
# Simultaneous search in both spaces
results = query_hybrid(
    qdrant_url=QDRANT_URL,
    query="What are the security procedures?",
    top_k_text=15,     # 15 best text chunks
    top_k_vision=4,    # 4 best visual pages
    doc_ids=["doc123"],
)

3. Intelligent Question-Answering

  • Retrieval: Retrieves relevant passages (text + vision)

  • Enriched context: Combines OCR text + page metadata

  • Generation: Calls remote LLM with context

  • Citations: Mentions page numbers in answers
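
How retrieved chunks can be turned into a prompt that preserves page numbers for citations is sketched below; the prompt wording and helper name are illustrative, but the payload fields match the /search response described later.

python
def build_prompt(question: str, text_hits: list[dict]) -> str:
    # Keep the page number next to each chunk so the LLM can cite it as [Page N]
    context_parts = []
    for hit in text_hits:
        payload = hit["payload"]
        context_parts.append(f"[Page {payload['page']}] {payload['text']}")
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "Cite the page numbers you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )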

4. Vision-Language Model (VLM) Mode

For questions requiring visual analysis:

python
# Send page images + context to VLM
answer = answer_with_images(
    question="What does this diagram show?",
    images=[page1_img, page2_img],
    context_text=ocr_context,
)

5. Document Group Management

  • Create thematic groups of documents

  • Search within a specific group

  • Organize document library

6. Modern Web Interface

  • Conversational chat with documents

  • Visualization of search results

  • Preview of page images

  • Management of documents and groups

  • Relevance scores for each result

Installation

Prerequisites

  • Docker and Docker Compose installed

  • Python 3.11+ (for local development)

  • Tesseract OCR installed on Docker host

  • Ollama server accessible (for LLM)

Standard Installation

powershell
# 1. Clone the project
git clone <repo_url>
cd mmrag
# 2. Create data directories
New-Item -ItemType Directory -Force -Path "data\images"
New-Item -ItemType Directory -Force -Path "data\qdrant"
# 3. Configure environment (see Configuration section)
# 4. Build and launch
docker-compose build
docker-compose up -d
# 5. Check logs
docker-compose logs -f api

Installation on GB10

For GB10 deployment, follow the same steps but ensure:

yaml
# docker-compose.yml - Force CPU usage
environment:
  EMBED_DEVICE: "cpu"
  VLM_DEVICE: "cpu"
  TORCH_NUM_THREADS: "4"

Configuration

Main Environment Variables

Qdrant (Vector Database)

yaml
QDRANT_URL: "http://qdrant:6333"
QDRANT_TEXT_COLLECTION: "docs_text"
QDRANT_VISION_COLLECTION: "docs_vision"

OCR and Chunking

yaml
TESS_LANGS: "fra" # Tesseract languages (fra, eng, etc.)
MAX_CHARS_PER_CHUNK: "1500" # Max text chunk size
OVERLAP_CHARS: "200" # Overlap between chunks

Embeddings

yaml
TEXT_EMBED_DIM: "384" # Sentence-transformers dimension
VISION_EMBED_DIM: "512" # CLIP ViT-B-32 dimension
EMBED_DEVICE: "cpu" # Device for embeddings (cpu/cuda)
EMBED_MAX_IMAGE_SIDE: "1024" # Max image size before embedding
TORCH_NUM_THREADS: "4" # PyTorch threads

Remote LLM (Ollama)

yaml
LLM_HTTP_MODE: "openai" # OpenAI-compatible mode
LLM_HTTP_URL: "http://192.168.1.89:11434" # Ollama server URL
LLM_OPENAI_PATH: "/v1/chat/completions" # API endpoint
LLM_OPENAI_MODEL: "ministral-3:14b" # Model to use
LLM_HTTP_KEY: "" # API key (if needed)

VLM (Vision-Language Model)

yaml
VLM_MODEL_ID: "Qwen/Qwen2-VL-2B-Instruct" # Hugging Face model
VLM_DEVICE: "cpu" # Device (cpu/cuda)

Storage

yaml
STORAGE_DIR: "/app/storage" # Directory for images
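
For reference, a sketch of how these variables might be read at startup (the grouping into module-level constants is illustrative; names and defaults are taken from the blocks above):

python
import os

QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant:6333")
TESS_LANGS = os.getenv("TESS_LANGS", "fra")
MAX_CHARS_PER_CHUNK = int(os.getenv("MAX_CHARS_PER_CHUNK", "1500"))
OVERLAP_CHARS = int(os.getenv("OVERLAP_CHARS", "200"))
TEXT_EMBED_DIM = int(os.getenv("TEXT_EMBED_DIM", "384"))
VISION_EMBED_DIM = int(os.getenv("VISION_EMBED_DIM", "512"))
EMBED_DEVICE = os.getenv("EMBED_DEVICE", "cpu")
TORCH_NUM_THREADS = int(os.getenv("TORCH_NUM_THREADS", "4"))
STORAGE_DIR = os.getenv("STORAGE_DIR", "/app/storage")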

Usage

1. Access Web Interface

Open browser: http://localhost:8000 (or GB10 server IP)

2. Ingest a PDF Document

Via interface:

  1. Go to "Documents" tab

  2. Click "Upload PDF"

  3. Select a file

  4. Wait for ingestion to complete

Via API:

bash
curl -X POST http://localhost:8000/ingest/pdf \
  -F "pdf=@document.pdf"

Response:

json
{
  "doc_id": "a1b2c3d4-...",
  "filename": "document.pdf",
  "ingested": {
    "pages": 15,
    "text_chunks": 42,
    "vision_points": 15
  }
}

3. Ask a Question

Via interface:

  1. "Chat" tab

  2. Select document or group

  3. Type question

  4. View answer with page citations

Via API:

bash
curl -X POST http://localhost:8000/answer \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the main conclusions?",
    "doc_id": "a1b2c3d4-...",
    "top_k_text": 15,
    "top_k_vision": 4,
    "max_tokens": 400
  }'

4. Pure Search (without LLM)

bash
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "information security",
    "doc_ids": ["doc1", "doc2"],
    "top_k_text": 10,
    "top_k_vision": 5
  }'

5. Vision-Language Mode

For questions requiring visual analysis:

bash
curl -X POST http://localhost:8000/answer-vision \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Describe the chart on page 5",
    "doc_id": "a1b2c3d4-...",
    "top_k_pages": 4
  }'

The VLM (Qwen2-VL) will receive page images and generate an answer based on visual analysis.

API Endpoints

Documents

POST /ingest/pdf

Ingest a PDF document. Request:

  • pdf: PDF file (multipart/form-data)

Response:

json
{
  "doc_id": "uuid",
  "filename": "file.pdf",
  "ingested": {
    "pages": 10,
    "text_chunks": 25,
    "vision_points": 10
  }
}

GET /documents

List all ingested documents. Response:

json
{
  "documents": [
    {
      "doc_id": "uuid1",
      "filename": "doc1.pdf",
      "pages": 10,
      "ingested_at": "2026-01-01T12:00:00"
    }
  ]
}

DELETE /documents/{doc_id}

Delete a document.

Search

POST /search

Hybrid text + vision search. Request:

json
{
  "query": "text to search",
  "doc_id": "uuid",              // optional
  "doc_ids": ["uuid1", "uuid2"], // optional
  "group_id": "group_uuid",      // optional
  "top_k_text": 10,
  "top_k_vision": 5
}

Response:

json
{
  "query": "searched text",
  "text": [
    {
      "score": 0.92,
      "payload": {
        "doc_id": "uuid",
        "page": 3,
        "text": "chunk content...",
        "source": "document.pdf"
      }
    }
  ],
  "vision": [
    {
      "score": 0.87,
      "payload": {
        "doc_id": "uuid",
        "page": 3,
        "image_url": "/files/images/uuid_p3.png"
      }
    }
  ]
}

Question-Answering

POST /answer

Ask a question about documents. Request:

json
{
  "question": "What is the conclusion?",
  "doc_id": "uuid",
  "top_k_text": 15,
  "top_k_vision": 4,
  "max_tokens": 400,
  "use_semantic_search": true
}

Response:

json
{
  "answer": "The main conclusion is... [Page 12]",
  "sources": {
    "text": [...],
    "vision": [...]
  }
}

POST /answer-vision

Question with visual analysis via VLM. Request:

json
{
  "question": "What does this diagram show?",
  "doc_id": "uuid",
  "top_k_pages": 4,
  "max_tokens": 400
}

Groups

GET /groups

List all groups.

POST /groups

Create a new group. Request:

json
{
  "name": "Financial Reports",
  "description": "All Q1-Q4 2025 reports",
  "doc_ids": ["uuid1", "uuid2"]
}

PUT /groups/{group_id}

Update a group.

DELETE /groups/{group_id}

Delete a group.

Deployment

Local Deployment (Dev)

powershell
# Start services
docker-compose up -d
# View logs
docker-compose logs -f
# Restart API after changes
docker-compose restart api

Deployment on GB10 Server

The project includes an automated PowerShell deployment script:

powershell
# Edit deploy.ps1 with your parameters
$remoteHost = "user@gb10-server"
$remotePath = "~/projects/mmrag"
# Launch deployment
.\deploy.ps1

The script:

  1. Synchronizes files via SCP

  2. Builds Docker image on remote server

  3. Restarts services

  4. Displays logs

Production Configuration

For production on GB10:

yaml
# docker-compose.yml
services:
  api:
    restart: always
    deploy:
      resources:
        limits:
          cpus: '16'    # Limit to 16 of 72 cores
          memory: 32G   # RAM limit
    environment:
      EMBED_DEVICE: "cpu"
      VLM_DEVICE: "cpu"
      TORCH_NUM_THREADS: "8"

Monitoring

bash
# Check CPU/RAM usage
docker stats mmrag-api
# Real-time logs
docker-compose logs -f api
# Qdrant health
curl http://localhost:6333/health

Technologies

Backend

  • FastAPI 0.115.6 - Modern, high-performance web framework

  • Uvicorn 0.32.1 - High-performance ASGI server

  • Qdrant 1.11.5 - Vector database

  • qdrant-client 1.12.1 - Python client for Qdrant

Document Processing

  • pypdfium2 4.30.0 - PDF rendering to images

  • pytesseract 0.3.13 - OCR (Python wrapper for Tesseract)

  • Pillow 11.0.0 - Image manipulation

Embeddings and AI

  • sentence-transformers 3.3.1 - Multilingual text embeddings (model: paraphrase-multilingual-MiniLM-L12-v2, 384 dim)

  • open-clip-torch 2.26.1 - Vision embeddings (model: ViT-B-32 OpenAI, 512 dim)

  • PyTorch 2.1+ - Deep learning framework

  • transformers 4.47.1 - Hugging Face Transformers

  • Qwen2-VL-2B-Instruct - Vision-Language Model (lazy load)

Communication

  • httpx 0.27.2 - Modern HTTP client for LLM calls

Frontend

  • HTML5/CSS3 - Responsive web interface

  • Vanilla JavaScript - No framework, optimal performance

  • SVG Icons - Vector icons

GB10 Performance

Typical Benchmarks (GB10 with 72 ARM cores)

| Operation                    | Time   | Notes                                |
|------------------------------|--------|--------------------------------------|
| PDF ingestion (10 pages)     | ~15s   | OCR + text/vision embedding          |
| Hybrid search                | ~200ms | Query 2 collections simultaneously   |
| Text embedding (1 chunk)     | ~50ms  | CPU, sentence-transformers           |
| Vision embedding (1 page)    | ~100ms | CPU, CLIP ViT-B-32                   |
| LLM generation (400 tokens)  | ~5s    | Via remote Ollama                    |
| VLM inference (4 images)     | ~8s    | Qwen2-VL-2B in CPU mode              |

Resource Usage

  • RAM at startup: ~2 GB (without VLM)

  • RAM with VLM loaded: ~6 GB

  • CPU idle: <5% of 72 cores

  • CPU during ingestion: 40-60% (OCR parallelization)

  • CPU during search: 10-15%

Recommended GB10 Optimizations

  1. Increase TORCH_NUM_THREADS to 8-12 to leverage more cores

  2. Enable BF16 if supported by model (memory savings)

  3. Preload VLM at startup if frequently used

  4. Use NVMe SSD for /data/qdrant (I/O performance)

  5. Route remote LLM to GPU for large models
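
A sketch of the first three knobs in code; TORCH_NUM_THREADS is the project's variable, while USE_BF16 and PRELOAD_VLM are hypothetical flags shown only to illustrate the idea.

python
import os
import torch

# 1. More PyTorch threads for embedding work on the many-core GB10
torch.set_num_threads(int(os.getenv("TORCH_NUM_THREADS", "8")))

# 2. BF16 where the model supports it (roughly halves model memory vs. FP32)
dtype = torch.bfloat16 if os.getenv("USE_BF16", "0") == "1" else torch.float32

# 3. Optional eager preload of the VLM for latency-sensitive deployments
if os.getenv("PRELOAD_VLM", "0") == "1":
    from vlm import load_vlm  # project module shown earlier
    load_vlm()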

Roadmap

Version 1.1 (Q1 2026)

  • Multi-user support with authentication

  • Embedding cache to speed up re-ingestion

  • Layout-aware OCR with precise bbox detection

  • Export answers to Markdown/PDF

Version 1.2 (Q2 2026)

  • Support for additional formats (DOCX, PPTX, images)

  • Fine-tuning embedding model on domain corpus

  • Automatic document clustering

  • GraphQL API complement to REST

Version 2.0 (Q3 2026)

  • Streaming mode for long answers

  • Integration of native multimodal models (Gemini, GPT-4V)

  • Audio/video document support

  • Analytics dashboard and usage metrics

Additional Technical Notes

Why two separate collections (text/vision)?

The separation into two collections enables:

  1. Different dimensions (384 vs 512) optimized for each modality

  2. Independent search then result fusion

  3. Differential scaling: add more shards to the most solicited collection

  4. Easier debugging: isolate issues by modality
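
A sketch of "independent search then result fusion" with qdrant-client; this is illustrative and not the project's actual query_hybrid implementation.

python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://qdrant:6333")

def hybrid_search(text_vec, vision_vec, top_k_text=15, top_k_vision=4):
    # Each modality is searched in its own collection with its own dimension
    text_hits = client.search(
        collection_name="docs_text", query_vector=text_vec, limit=top_k_text
    )
    vision_hits = client.search(
        collection_name="docs_vision", query_vector=vision_vec, limit=top_k_vision
    )
    # Fusion step: keep both lists so the caller can weight or interleave them
    return {"text": text_hits, "vision": vision_hits}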

Chunking Strategy

Chunking with overlap ensures:

  • Semantic continuity: sentences cut at a chunk boundary appear in both adjacent chunks

  • Sufficient context: 1500 chars ≈ 1-2 paragraphs

  • Search performance: chunks neither too small (noise) nor too large (dilution)

Model Choices

| Model            | Reason for choice                             |
|------------------|-----------------------------------------------|
| MiniLM-L12-v2    | Multilingual, compact, excellent on CPU       |
| CLIP ViT-B-32    | De facto standard, good quality/size balance  |
| Qwen2-VL-2B      | Small, highly performant VLM, CPU-optimized   |
| Ministral-3:14b  | Efficient LLM, good French, Q4 quantized      |

Security

For production deployment:

  • Add JWT authentication on endpoints

  • Limit PDF upload size (max 50MB)

  • Rate limiting on /answer to prevent abuse

  • Mandatory HTTPS with Let's Encrypt certificate

  • Strict doc_ids validation to prevent IDOR
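
Two of these measures sketched in FastAPI terms; the size limit check, regex, and route body are illustrative choices, not the project's actual handlers.

python
import re
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()
MAX_PDF_BYTES = 50 * 1024 * 1024             # 50 MB upload cap
DOC_ID_RE = re.compile(r"^[0-9a-f-]{36}$")   # UUID-shaped doc_ids only

@app.post("/ingest/pdf")
async def ingest_pdf(pdf: UploadFile):
    data = await pdf.read()
    if len(data) > MAX_PDF_BYTES:
        raise HTTPException(status_code=413, detail="PDF too large")
    ...  # hand off to the normal ingestion pipeline

def check_doc_ids(doc_ids: list[str]) -> None:
    # Reject anything that does not look like one of our UUIDs (anti-IDOR)
    for doc_id in doc_ids:
        if not DOC_ID_RE.match(doc_id):
            raise HTTPException(status_code=400, detail="Invalid doc_id")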

Proprietary - Emmanuel Forgues © 2026