
MMDocRAG: How Multimodal Retrieval-Augmented Generation Transforms Document QA Systems

The Dual Challenge in Document Understanding

Today’s Document Visual Question Answering (DocVQA) systems grapple with processing lengthy, multimodal documents (text, images, tables) while performing cross-modal reasoning. Traditional text-centric approaches often miss critical visual information, creating significant knowledge gaps. Worse still, the field has lacked standardized benchmarks for evaluating how well models integrate multimodal evidence.

[Figure: MMDocRAG architecture diagram]

Introducing the MMDocRAG Benchmark

MMDocRAG, introduced by Dong et al. (2025), addresses these challenges with:

  1. 4,055 expert-annotated QA pairs anchored to multi-page evidence chains
  2. Novel evaluation metrics for multimodal quote selection
  3. Hybrid answer generation combining text and visual references
  4. Large-scale validation across 60+ vision/language models and 14 retrieval systems

Key Findings:

  • Proprietary vision-language models outperform text-only models by 22-38%
  • Fine-tuned LLMs gain a 52% accuracy boost from detailed image captions
  • Open-source models still trail in cross-modal reasoning
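
To make the annotation format concrete, here is a purely illustrative sketch of what a single QA record could contain, written as a Python dict. The field names (question, evidence_quotes, selected_quote_ids, answer) are assumptions made for this article, not the benchmark's official schema:

# Purely illustrative example of an MMDocRAG-style record.
# Field names are assumptions for this article, not the official schema.
example_record = {
    "question": "How did quarterly revenue change according to the report?",
    # Candidate evidence quotes can mix modalities and span multiple pages.
    "evidence_quotes": [
        {"id": 1, "type": "text",  "page": 3, "content": "Revenue grew 12% quarter over quarter ..."},
        {"id": 2, "type": "image", "page": 5, "content": "images/report_p5_chart.png"},
        {"id": 3, "type": "table", "page": 7, "content": "| Quarter | Revenue | ..."},
    ],
    # The gold answer interleaves text with references to the selected quotes.
    "answer": "Revenue grew 12% quarter over quarter, as shown in the trend chart [2] and the summary table [3].",
    "selected_quote_ids": [2, 3],  # ground truth for quote-selection metrics
}

print(f"{len(example_record['evidence_quotes'])} candidate quotes, "
      f"{len(example_record['selected_quote_ids'])} selected as evidence")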

Hands-On Implementation Guide

Step 1: Data Acquisition

Download the core image dataset:

wget https://huggingface.co/datasets/MMDocIR/MMDocRAG/resolve/main/images.zip
unzip images.zip -d ./dataset/  
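
A quick sanity check (standard library only) confirms the extraction landed where expected; the ./dataset/ path simply mirrors the unzip command above:

from pathlib import Path

# Count extracted image files under the directory used in the unzip command above.
image_dir = Path("./dataset")
image_files = [p for p in image_dir.rglob("*") if p.suffix.lower() in {".png", ".jpg", ".jpeg"}]
print(f"Found {len(image_files)} image files under {image_dir.resolve()}")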

Method 1: API-Based Inference (Rapid Prototyping)

API Key Setup:
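
The script calls a hosted model, so an API key must be available before running it. The exact environment variable and endpoint depend on your provider and on how inference_api.py is configured, so the following is only a minimal sketch using the OpenAI-compatible Python client, with OPENAI_API_KEY and OPENAI_BASE_URL as assumed variable names:

import os
from openai import OpenAI  # assumes the `openai` package; inference_api.py may use a different client

# Assumed environment variable names -- check the repository for the exact ones it reads.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],        # export OPENAI_API_KEY=... before running
    base_url=os.environ.get("OPENAI_BASE_URL"),  # optional: an OpenAI-compatible endpoint (e.g. for Qwen)
)

# Quick connectivity check before launching the full benchmark run.
print(client.models.list().data[:3])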

Execution Command:

python inference_api.py qwen3-32b --setting 20 --mode pure-text --no-enable-thinking  

Parameters:

  • --setting: number of evidence quotes supplied as context (15 or 20)
  • --mode: input format (pure-text or multimodal)
  • --no-enable-thinking: disable the chain-of-thought (“thinking”) mode for Qwen models

Method 2: Local Model Inference (Custom Deployment)

Environment Configuration:

Python 3.9  
PyTorch 2.1.2+cu121  
ms-swift toolkit  
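
A short check confirms the local environment roughly matches these requirements (note: the ms-swift toolkit is imported as swift):

import sys
import importlib.util

import torch

# Compare against the versions listed above (Python 3.9, PyTorch 2.1.2+cu121, ms-swift).
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("ms-swift installed:", importlib.util.find_spec("swift") is not None)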

Model Download:

git lfs install  
git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ./checkpoint/  
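
If git-lfs is inconvenient, the same checkpoint can be fetched with huggingface_hub; the local_dir below mirrors the clone target above and should match wherever inference_checkpoint.py expects to find the weights:

from huggingface_hub import snapshot_download

# Download the full Qwen2.5-7B-Instruct checkpoint into the local checkpoint directory.
# Adjust local_dir if inference_checkpoint.py expects a different layout.
snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct",
    local_dir="./checkpoint/",
)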

Run Inference:

python inference_checkpoint.py Qwen2.5-7B-Instruct --setting 20 --lora Qwen2.5-7B-Instruct_lora  

Fine-Tuning with LoRA:

python train_swift_qwen.py Qwen2.5-7B-Instruct --setting 20  

Output: Weights saved to Qwen2.5-7B-Instruct_lora
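
The training run produces a LoRA adapter rather than full model weights. The repository's --lora flag presumably attaches it for you during inference; purely for intuition, doing the same step manually with transformers and peft would look roughly like this:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "./checkpoint/"                 # base Qwen2.5-7B-Instruct weights (clone target above)
adapter_path = "Qwen2.5-7B-Instruct_lora"   # LoRA weights produced by train_swift_qwen.py

tokenizer = AutoTokenizer.from_pretrained(base_path)
base_model = AutoModelForCausalLM.from_pretrained(base_path, device_map="auto")

# Attach the LoRA adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()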

Comprehensive Evaluation Framework

1. LLM-as-Judge Assessment:

python eval_llm_judge.py response/qwen3-4b_pure-text_response_quotes20.jsonl --setting 20  

Output: JSONL file with qualitative scores

2. Full Metric Analysis:

python eval_all.py \
  --path_response response/qwen3-4b_response.jsonl \
  --path_judge evaluation/judge_scores.jsonl \
  --setting 20

Metrics Calculated:

  • Quote-selection F1 score
  • BLEU (textual similarity)
  • ROUGE-L (coherence)
  • LLM-as-Judge composite score
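
For intuition about what these numbers measure, the sketch below computes the first three metrics for a single example using common open-source libraries (sacrebleu, rouge_score); the official eval_all.py may differ in tokenization, aggregation, and of course the LLM-as-Judge component:

import sacrebleu
from rouge_score import rouge_scorer

# Quote-selection F1: set overlap between predicted and gold quote IDs.
def quote_f1(predicted_ids, gold_ids):
    predicted, gold = set(predicted_ids), set(gold_ids)
    if not predicted or not gold:
        return 0.0
    precision = len(predicted & gold) / len(predicted)
    recall = len(predicted & gold) / len(gold)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

prediction = "Revenue grew 12% quarter over quarter, as shown in the chart [2]."
reference = "Revenue grew 12% quarter over quarter, as shown in the trend chart [2] and table [3]."

print("Quote F1 :", quote_f1([2], [2, 3]))
print("BLEU     :", sacrebleu.sentence_bleu(prediction, [reference]).score)
print("ROUGE-L  :", rouge_scorer.RougeScorer(["rougeL"]).score(reference, prediction)["rougeL"].fmeasure)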

Groundbreaking Research Insights

Testing 30 open-source, 25 proprietary, and 5 fine-tuned models revealed:

  1. Proprietary Model Dominance:

    • Gemini 2.5 Pro leads in visual understanding (F1: 0.87)
    • Claude 3.5 Sonnet excels in reasoning tasks (Accuracy: 92.3%)
    • GPT-4o maintains the best overall performance

  2. Open-Source Breakthroughs:

    {
      "Qwen2.5-72B-Inst-Fine-tuning": "37% F1 improvement post-tuning",
      "InternVL3-78B": "Visual comprehension within 5% of GPT-4o",
      "Llama4-Maverick-17Bx128E": "Processes 128-page docs 40% faster"
    }

  3. The Captioning Effect:

    • Detailed image descriptions boost Qwen2.5 accuracy by 52%
    • LLaMA3-70B’s ROUGE-L jumps from 0.48 → 0.71

Reproducing Research Results

To reproduce the reported results, run:

python eval_all.py --model qwen3-4b --setting 20 --mode pure-text  

Model Identifier Reference:

{  
  "Top Proprietary Models": ["Gemini-2.5-Pro", "GPT-4o", "Claude-3.5-Sonnet"],  
  "Leading Open-Source": ["Qwen3-32B", "Llama3.3-70B-Inst", "InternVL3-78B"],  
  "Fine-Tuning Paradigms": [  
    "Qwen2.5-72B-Inst-Fine-tuning",  
    "Deepseek-R1-Distill-Llama-70B"  
  ]  
}  

Citation and Licensing

@misc{dong2025mmdocrag,  
  title={Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering},  
  author={Dong, Kuicai and Chang, Yujing and Huang, Shijie and Wang, Yasheng and Tang, Ruiming and Liu, Yong},  
  year={2025},  
  eprint={2505.16470},  
  archivePrefix={arXiv},  
  primaryClass={cs.IR}  
}  

License Information:

Research use only. Usage must comply with OpenAI’s Terms of Use.

Research Implications

MMDocRAG sets a new standard for evaluating document QA systems. Its deeper value lies in exposing three critical development directions: visual-semantic fusion, evidence-selection optimization, and cross-page reasoning, the core capabilities for next-generation intelligent document processing.
