VLM2Vec-V2: A Practical Guide to Unified Multimodal Embeddings for Images, Videos, and Documents
Audience: developers, product managers, and researchers with at least a junior-college background
Goal: learn how one open-source model can turn text, images, videos, and PDF pages into a single, searchable vector space—without adding extra tools or cloud bills.
1. Why Another Multimodal Model?
Pain Point | Real-World Example | Business Impact |
---|---|---|
Most models only handle photos | CLIP works great on Instagram pictures | You still need a second system for YouTube clips or slide decks |
Fragmented pipelines | One micro-service for PDF search, another for video search | Higher latency and ops overhead |
Limited training data | MSCOCO, Flickr30K—mostly static images | Model fails on charts, forms, or cooking videos |
VLM2Vec-V2 closes these gaps with a single embedding space that accepts text, images, videos, and visual documents in one call.
2. Meet MMEB-V2: 78 Tasks, One Report Card
Think of MMEB-V2 as the final exam for any multimodal embedding model. It groups tasks into five families:
Family | Sample Dataset | What You Feed In | What You Get Back | Everyday Use Case |
---|---|---|---|---|
Image classification | ImageNet-1K | one image | label “tabby cat” | smartphone photo organizer |
Image QA | OK-VQA | image + question | short answer | tourist app: “What mountain is this?” |
Image retrieval | MSCOCO | caption | matching photo | e-commerce “find similar dress” |
Video retrieval | MSR-VTT | caption | matching clip | stock-footage search |
Moment retrieval | QVHighlights | long video + query | start & end timestamps | highlight extraction for sports |
Video classification | Kinetics-700 | short clip | action label “playing violin” | content moderation |
Video QA | NExT-QA | clip + question | multiple-choice answer | cooking assistant |
Visual-document retrieval | ViDoRe | natural-language question | PDF page | legal-tech clause finder |
Figure: the 78 tasks span images, videos, and PDFs. Blue boxes are legacy tasks from MMEB-v1; red boxes are new in v2.
3. Inside VLM2Vec-V2: Three Core Parts
3.1 Backbone: Qwen2-VL-2B
- Parameters: 2 billion—small enough for a single A100 40 GB
- Native resolution: any input size, no letter-boxing required
- Frame sampling: 8 uniformly spaced frames per video (configurable; see the sketch below)
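For reference, here is a minimal sketch of the uniform frame-sampling step, assuming OpenCV for decoding (the repo's own video loader may differ):

```python
# Minimal sketch of uniform frame sampling, assuming OpenCV (cv2) for decoding.
# This only illustrates the idea; it is not the repo's loader.
import cv2
import numpy as np

def sample_uniform_frames(video_path: str, num_frames: int = 8) -> list:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```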
3.2 Training Objective: Instruction-Guided Contrastive Learning
Every training sample is a pair (instruction + query, positive target).
We use the InfoNCE loss:

L = -log [ exp(sim(q, t⁺)/τ) / Σᵢ exp(sim(q, tᵢ)/τ) ]

where the sum runs over the positive target and the in-batch negatives:

- sim: cosine similarity
- τ: temperature, set to 0.02 (the best balance across tasks)
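As a sanity check, here is a minimal PyTorch sketch of that loss, assuming `q` and `t` are batches of query and target embeddings with `t[i]` the positive for `q[i]`; the actual training code additionally handles sub-batching and hard negatives:

```python
import torch
import torch.nn.functional as F

def infonce_loss(q: torch.Tensor, t: torch.Tensor, tau: float = 0.02) -> torch.Tensor:
    """q, t: [batch, dim]; t[i] is the positive target for q[i]."""
    q = F.normalize(q, dim=-1)   # cosine similarity = dot product of unit vectors
    t = F.normalize(t, dim=-1)
    logits = q @ t.T / tau       # [batch, batch] similarity matrix, scaled by temperature
    labels = torch.arange(q.size(0), device=q.device)  # the positive sits on the diagonal
    return F.cross_entropy(logits, labels)  # -log softmax probability of the positive
```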
3.3 Unified Input Format
```text
<|image_pad|>
Instruction: Find a video that shows dolphins jumping near a boat.
Query: dolphins jumping near boat
---
<|video_pad|>
Understand the provided video.
```
Same token layout for PDF pages—only the pad token changes.
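To make the layout concrete, here is an illustrative helper that assembles such an input string; the pad tokens come from the examples above, but the function itself is a sketch, not the repo's API:

```python
# Illustrative prompt builder, not the repo's API. Pad tokens for image/video match
# the examples above; PDF pages would use their own pad token, as noted.
PAD_TOKENS = {"image": "<|image_pad|>", "video": "<|video_pad|>"}

def build_input(modality: str, instruction: str, query_text: str = "") -> str:
    parts = [PAD_TOKENS[modality], f"Instruction: {instruction}"]
    if query_text:
        parts.append(f"Query: {query_text}")
    return "\n".join(parts)

print(build_input("image", "Find a video that shows dolphins jumping near a boat.",
                  "dolphins jumping near boat"))
```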
4. Training Data Sources (No External Additions)
Source | Pairs / Pages | Key Content |
---|---|---|
LLaVA-Hound synthetic videos | 540 k | captions, QA pairs scraped and cleaned |
ViDoRe + VisRAG documents | 480 k | slides, charts, scientific PDFs |
MMEB-train images | 1.2 M | COCO, Flickr, etc. to keep legacy skill |
Sampling tricks that matter:
- On-the-fly mixing table – prevents the model from overfitting to any one modality
- Interleaved sub-batches – the 1024-example global batch is split into 16 × 64; harder negatives, stable gradients (see the sketch below)
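A toy sketch of the mixing table plus interleaved sub-batching (the real sampler lives in the training code; dataset names and weights are placeholders):

```python
import random

# Toy sketch, not the repo's sampler: a global batch of 1024 is assembled from
# 16 sub-batches of 64, and each sub-batch is drawn from a single source so its
# in-batch negatives come from the same task (harder) while sources stay mixed.
def make_global_batch(datasets, mixing_weights, sub_batch_size=64, num_sub_batches=16):
    names = list(datasets)
    weights = [mixing_weights[n] for n in names]
    global_batch = []
    for _ in range(num_sub_batches):
        source = random.choices(names, weights=weights, k=1)[0]  # on-the-fly mixing table
        sub_batch = random.sample(datasets[source], sub_batch_size)
        global_batch.append((source, sub_batch))                 # keep sub-batches grouped
    return global_batch
```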
5. Benchmark Results in Plain Numbers
Model | Overall avg (78 tasks) | Image (36 tasks) | Video (18 tasks) | VisDoc (24 tasks) |
---|---|---|---|---|
VLM2Vec-V2 (2B) | 58.0 | 64.9 | 34.6 | 65.4 |
GME-7B | 57.8 | 56.0 | 38.4 | 75.2 |
ColPali v1.3 (3B, VisDoc-only) | 44.4 | 34.9 | 28.2 | 71.0 |
Take-away: VLM2Vec-V2 posts the highest combined score among unified models; specialists still win on their narrow tracks.
6. Hands-On: 30-Minute End-to-End Walkthrough
6.1 Prerequisites
```bash
conda create -n vlm2vec python=3.10
conda activate vlm2vec
pip install -r requirements.txt
```
6.2 Download All Datasets
```bash
bash experiments/release_public/data/download_data.sh
```
Expect ~200 GB; SSD strongly recommended.
6.3 Launch Training (8-GPU example)
```bash
cd experiments/release_public/train
torchrun --nproc_per_node=8 train.py \
  --config configs/vlm2vec_v2_qwen2vl_2b.yaml \
  --output_dir ./runs/v2
```
Key YAML fields you might tweak:
```yaml
model_name: Qwen/Qwen2-VL-2B-Instruct
lora_rank: 16        # 8 or 32 also tested; 16 is the sweet spot
batch_size: 1024     # global
sub_batch_size: 64   # intra-batch negatives
max_steps: 5000      # still climbing after 5 k
```
6.4 Evaluate on All Tasks
```bash
cd experiments/release_public/eval
python eval.py \
  --model_path ./runs/v2/checkpoint-5000 \
  --datasets all
```
Produces `leaderboard.csv` with every sub-score.
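For a quick look at the results, a couple of lines of pandas are enough (inspect the header yourself; the exact column names are not fixed here):

```python
import pandas as pd

df = pd.read_csv("leaderboard.csv")  # written by eval.py above
print(df.head())                     # one row per task / sub-score
print(df.describe())                 # quick summary of the numeric columns
```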
7. Frequently Asked Questions
Q1: Can I run this on a single RTX 4090?
Yes for inference (batch=1). Training needs gradient checkpointing + LoRA; expect ~22 GB peak.
Q2: How do I add my own PDF collection?
- Export each page to PNG (300 dpi recommended).
- Create a new `VisDocDataset` class under `src/dataset/`, inheriting the base methods (see the sketch after this list).
- Add download link in `download_data.sh`.
- Register in your yaml config under `datasets:`.
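For step 2, here is a hypothetical sketch of what such a dataset class could look like; the real base class under `src/dataset/` defines the exact methods to override, so treat every name below as a placeholder:

```python
import os

# Hypothetical VisDoc-style dataset: one (question, page image) pair per line of a
# tab-separated file. In the repo you would inherit the VisDoc base class instead.
class MyContractsDataset:
    def __init__(self, root: str, queries_file: str):
        self.root = root
        with open(queries_file) as f:
            self.pairs = [line.rstrip("\n").split("\t") for line in f if line.strip()]

    def __len__(self) -> int:
        return len(self.pairs)

    def __getitem__(self, idx: int) -> dict:
        question, page_png = self.pairs[idx]
        return {
            "query": f"Instruction: Find the page that answers the question.\nQuery: {question}",
            "target_image": os.path.join(self.root, page_png),  # 300-dpi PNG export of the page
        }
```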
Q3: Why Hit@1 for images and NDCG@5 for documents?
Hit@1 rewards exact top match—fine for “find this cat.”
NDCG@5 cares about ranking quality—critical when the answer might be on page 3 of 20.
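For reference, a minimal sketch of both metrics in the single-relevant-item case the examples describe:

```python
import math

def hit_at_1(ranked_ids: list, relevant_id: str) -> float:
    """1.0 if the top-ranked item is the relevant one, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] == relevant_id else 0.0

def ndcg_at_5(ranked_ids: list, relevant_id: str) -> float:
    """With a single relevant item the ideal DCG is 1.0, so DCG alone is the score."""
    for rank, doc_id in enumerate(ranked_ids[:5], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)   # rank 1 -> 1.0, rank 3 -> 0.5, ...
    return 0.0
```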
Q4: Is 5 k steps the ceiling?
No. VisDoc and Video scores keep rising; authors left longer runs for future work.
Q5: License and commercial use?
Weights: Apache-2.0.
Datasets: check each subset’s original license (e.g., Kinetics is CC-BY-4.0).
Always re-check licenses before shipping to production.
Q6: How does it handle very long videos?
Default is 8 frames. For >3 min videos, chunk into 30-second windows, encode each, then merge via max-pooling or centroid.
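A sketch of that chunk-then-merge recipe, assuming you already have an `encode_video(frames)` callable that returns one NumPy embedding per clip (the callable is a placeholder for whatever encoder wrapper you use):

```python
import numpy as np

def embed_long_video(frames, fps, encode_video, window_sec=30, merge="centroid"):
    """Split a long video into ~30-second windows, embed each window, then merge."""
    window = max(int(window_sec * fps), 1)
    chunks = [frames[i:i + window] for i in range(0, len(frames), window)]
    embs = np.stack([encode_video(chunk) for chunk in chunks])  # [num_windows, dim]
    merged = embs.max(axis=0) if merge == "max" else embs.mean(axis=0)
    return merged / np.linalg.norm(merged)  # re-normalize for cosine search
```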
Q7: LoRA rank 32 is worse—why?
Likely over-parameterization. 16 gives enough capacity without destabilizing contrastive objectives.
Q8: Any streaming demo?
Yes: [Hugging Face Space](https://huggingface.co/spaces/TIGER-Lab/MMEB-Leaderboard) lets you upload an image or paste a YouTube URL and see top matches in real time.
Q9: Upgrading from V1—will I lose work?
You can lose uncommitted local changes, because the upgrade does a hard reset. Commit or stash first, then:
```bash
git checkout main
git fetch --all --prune
git reset --hard origin/main
```
Q10: Can I freeze the vision encoder?
Yes—set `lora_target_modules: ["q_proj", "v_proj"]` only. Accuracy drops ~2 pts but VRAM halves.
8. Decision Matrix: When to Choose VLM2Vec-V2
Requirement | Recommendation | Rationale |
---|---|---|
Unified search across images, videos, PDFs | ✅ | Only open-source model with joint training on all three |
Pure e-commerce image search | ❓ | If CLIP already meets latency/accuracy, no urgent need |
Video moderation + highlight reels | ✅ | Combines classification + moment retrieval in one pass |
Long PDF + OCR text | ❓ | If text layer is clean, pure text embeddings can be faster |
On-device inference (< 8 GB RAM) | ❌ | 2 B weights still need 4 GB+; consider distillation |
9. Key Takeaways in One Minute
- One model now handles text, image, video, and PDF vectors.
- 78 tasks in MMEB-V2 serve as a neutral benchmark—no cherry-picking.
- 2 B parameters + LoRA fit on a single high-end GPU.
- Apache-2.0 license means you can embed it in SaaS tomorrow—just mind dataset licenses.
- The 30-minute tutorial above is all you need to reproduce the paper and start fine-tuning on your own data.
Ready to try? Clone the repo, run the download script, and type `torchrun`. Your multimodal search box is only a few commands away.