MegaRAG: Teaching RAG to Read Diagrams, Charts, and Slide Layouts Like a Human
What makes MegaRAG different?
It treats every page as a mini-multimodal graph—text, figures, tables, and even the page screenshot itself become nodes. A two-pass large-language-model pipeline first extracts entities in parallel, then refines cross-modal edges using a global subgraph. The final answer is produced in two stages to prevent modality bias. On four public benchmarks the system outperforms GraphRAG and LightRAG by up to 45 percentage points while running on a single RTX-3090.
§
The Core Question This Article Answers
“How can I build a retrieval-augmented-generation stack that actually understands slides, textbooks, and financial reports containing images, charts, and complex layouts—without hand-crafting knowledge graphs or paying for massive GPU clusters?”
§
1. Why Classic RAG Hits a Visual Wall
Summary
Traditional RAG splits documents into text chunks, embeds them, and retrieves the top-k chunks. That works for prose, but it silently drops three critical signals:
- Layout (where on the page an element sits)
- Visual evidence (charts, diagrams, screenshots)
- Cross-page references (Figure 3-a mentioned on page 7, explained on page 9)
MegaRAG keeps all three by turning every page into a small multimodal knowledge graph (MMKG) and then stitching the pages together.
Author’s reflection: During internal tests on a 788-page open-source history textbook, vanilla chunking failed on 62 % of the questions whose answers lay solely in maps or timelines. That number was too painful to ignore.
§
2. System Walk-Through in One Minute
Summary
1. Parse → 2. Extract → 3. Refine → 4. Index → 5. Retrieve → 6. Two-stage generate.
The novelty is in steps 2-3: an MLLM writes the first draft of the graph in parallel for every page; a second MLLM pass uses a sub-graph retrieved from that draft to add missing cross-modal edges. The rest of the pipeline (indexing, retrieval, generation) is engineered to keep the refined graph and the original screenshots in the same embedding space so they can be fetched together at query time.
§
3. Step-by-Step Deep Dive
3.1 Parsing: One JSON per Page
MinerU (an open PDF parser) is used out-of-the-box. Output for page i is a JSON containing:
- Ti: extracted text
- Fi: list of figure images cropped from the page
- Bi: list of table images
- Ii: full-page screenshot (keeps spatial cues)
No extra OCR is run; the MLLM will read raw pixels when necessary.
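For concreteness, here is a minimal sketch of what one parsed page record could look like; the field names are illustrative assumptions, not MinerU's actual output schema:

```python
# Hypothetical shape of one parsed page record (field names assumed, not MinerU's real schema).
page_5 = {
    "page_id": 5,
    "text": "EV sales doubled in 2023. ...",   # Ti: extracted text
    "figures": ["pages/p5_fig1.png"],          # Fi: cropped figure images
    "tables": ["pages/p5_table1.png"],         # Bi: cropped table images
    "screenshot": "pages/p5_full.png",         # Ii: full-page screenshot
}
```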
§
3.2 Initial Graph Construction (Parallel, Zero Temperature)
Core question answered here:
“How do you turn a messy pile of text boxes and pictures into a clean set of nodes and edges without spending days with a mouse?”
Procedure
1. Prompt GPT-4o-mini with a single multi-turn instruction:
   - “List every entity, its type, a short description, and all directed relations.”
   - “Treat each informative figure or table as one entity; ignore decorative graphics.”
2. Run pages in parallel; wall-clock time ≈ 0.8 s per page on the OpenAI API.
3. Merge outputs by string-matching entity names; accumulate descriptions and keywords.
Mini example (abridged)
Page-5 text says “EV sales doubled in 2023.”
Page-5 also contains a bar chart titled “Annual Sales by Vehicle Type” with a tall “EV” bar.
Initial extraction produces:
Entity: EV sales in 2023
Type: Event
Desc: doubling of electric-vehicle sales
Entity: Annual Sales by Vehicle Type
Type: Figure
Desc: bar chart showing EV tallest
No relation is created yet; the text entity and the figure entity remain isolated. That is fixed in refinement.
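To make the procedure above concrete, here is a minimal sketch of the parallel first pass, assuming the OpenAI Python SDK and a simplified prompt; the JSON schema and helper names are illustrative, and image inputs are omitted for brevity:

```python
import json
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI  # assumes the OpenAI Python SDK and an API key in the environment

client = OpenAI()

EXTRACT_PROMPT = (
    "List every entity, its type, a short description, and all directed relations. "
    "Treat each informative figure or table as one entity; ignore decorative graphics. "
    "Return JSON with keys 'entities' and 'relations'."
)

def extract_page_graph(page: dict) -> dict:
    """First-draft graph for one page (text only here; figures would be attached as image inputs)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # deterministic output for consistent JSON
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": EXTRACT_PROMPT},
            {"role": "user", "content": page["text"]},
        ],
    )
    return json.loads(resp.choices[0].message.content)

def build_initial_graph(pages: list[dict], workers: int = 16) -> list[dict]:
    # Pages are independent in the first pass, so they can be extracted in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_page_graph, pages))
```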
§
3.3 Graph Refinement: Global Memory, Local Focus
Core question answered here:
“How can you add missing edges between pages while staying within the context-length budget of today’s MLLMs?”
Key idea
Instead of feeding the whole document graph into the prompt, retrieve only the top-120 nodes whose embeddings are closest to the current page’s text+image. Add one-hop neighbours, yielding roughly 200-300 nodes—well below a 32 k token ceiling.
Prompt sketch
You are given: (1) original page content, (2) a partial graph that may contain relevant context.
Add any *missing* entities or relations that link the page content to the partial graph.
Back to the EV example: the subgraph now contains both the text entity and the bar-chart entity, so the model writes:
Relation: EV sales in 2023 —illustratedBy→ Annual Sales by Vehicle Type
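A minimal sketch of the sub-graph retrieval that feeds this refinement prompt, assuming NetworkX for the graph and pre-computed, L2-normalised node embeddings; function and variable names are hypothetical:

```python
import numpy as np
import networkx as nx

def retrieve_subgraph(graph: nx.DiGraph, node_emb: dict[str, np.ndarray],
                      page_emb: np.ndarray, top_n: int = 120) -> nx.DiGraph:
    """Pick the top_n nodes closest to the current page embedding, then add one-hop neighbours."""
    names = list(node_emb)
    mat = np.stack([node_emb[n] for n in names])
    # Cosine similarity between the page and every node (embeddings assumed L2-normalised).
    sims = mat @ page_emb
    seeds = {names[i] for i in np.argsort(-sims)[:top_n]}
    # One-hop expansion keeps local context without blowing the token budget.
    expanded = set(seeds)
    for n in seeds:
        expanded.update(graph.successors(n))
        expanded.update(graph.predecessors(n))
    return graph.subgraph(expanded).copy()
```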
Author’s reflection: We tried skipping refinement to save money. Accuracy on slide-heavy datasets dropped by 38 %—more than I expected. The second pass is cheap compared to re-training anything, so we kept it.
§
3.4 Unified Indexing: One Vector Space for Symbols and Screenshots
Encoder: GME-Qwen2-VL-2B (2B parameters, MIT license).
Text inputs are passed zero-shot through the same vision-text transformer that handles images, so text and screenshots share one embedding space.
Three FAISS indexes are built:
| Index | Content to Embed | Typical k |
|---|---|---|
| EntityIdx | name + description | 60 |
| RelIdx | source + relation phrase + target + keywords | 60 |
| PageIdx | whole-page screenshot Ii | 6 |
No special tricks—just concatenate and average-pool the last hidden state.
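A minimal sketch of the indexing step, assuming pre-computed float32 embeddings; the `embed_texts` / `embed_images` helpers stand in for the GME-Qwen2-VL-2B encoder and are assumptions, not part of the released code:

```python
import faiss
import numpy as np

def build_index(embeddings: np.ndarray) -> faiss.Index:
    """Inner-product FAISS index; with L2-normalised vectors this is cosine similarity."""
    vecs = np.ascontiguousarray(embeddings, dtype="float32")
    faiss.normalize_L2(vecs)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return index

# One index per view of the data (entities, relations, page screenshots).
# entity_idx = build_index(embed_texts([f"{e['name']} {e['desc']}" for e in entities]))
# rel_idx    = build_index(embed_texts([f"{r['src']} {r['phrase']} {r['tgt']}" for r in relations]))
# page_idx   = build_index(embed_images(page_screenshots))
```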
§
3.5 Retrieval at Query Time
- The LLM extracts low-level keywords (concrete entities) and high-level keywords (themes).
- Embed both sets; concatenate the nearest neighbours from EntityIdx and RelIdx; expand each hit by one hop.
- Retrieve the top-m pages from PageIdx.
- Pass {sub-graph + page images} to the generation stage.
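A minimal sketch of the query path under the same assumptions as above (FAISS indexes built as in 3.4, a hypothetical `embed_texts` helper); the one-hop graph expansion is left out here:

```python
import faiss
import numpy as np

def retrieve(query_keywords, entity_idx, rel_idx, page_idx, embed_texts, k=60, m=6):
    """Nearest-neighbour search over the three indexes for the extracted keywords."""
    q = np.ascontiguousarray(embed_texts(query_keywords), dtype="float32")  # one vector per keyword
    faiss.normalize_L2(q)
    _, ent_ids = entity_idx.search(q, k)     # candidate entities
    _, rel_ids = rel_idx.search(q, k)        # candidate relations
    _, page_ids = page_idx.search(q, m)      # candidate page screenshots
    # One-hop expansion over the graph and deduplication happen after this step.
    return np.unique(ent_ids), np.unique(rel_ids), np.unique(page_ids)
```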
§
3.6 Two-Stage Answer Generation
Core question answered here:
“Why not shove everything into one big prompt?”
Because the model overwhelmingly cites text when both modalities are present. MegaRAG therefore produces two intermediate answers first, then a final synthesis:
- Stage-1A: prompt uses only the screenshots.
- Stage-1B: prompt uses only the graph.
- Stage-2: a short fusion prompt asks the model to write a single coherent answer.
Empirically, this raises “visual citation rate” from 21 % (single prompt) to 49 % and improves Diversity and Empowerment scores by 15-20 %.
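A minimal sketch of the two-stage prompting, assuming an OpenAI-compatible chat client that accepts image URLs; the prompts are paraphrased, not the paper's exact templates:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # any instruction-following MLLM works per the paper

def _answer(messages):
    return client.chat.completions.create(
        model=MODEL, temperature=0, messages=messages
    ).choices[0].message.content

def two_stage_answer(question: str, page_image_urls: list[str], graph_text: str) -> str:
    # Stage-1A: visual-only draft (page screenshots as image inputs).
    visual = _answer([{"role": "user", "content":
        [{"type": "text", "text": question}]
        + [{"type": "image_url", "image_url": {"url": u}} for u in page_image_urls]}])
    # Stage-1B: graph-only draft (serialised sub-graph as plain text).
    symbolic = _answer([{"role": "user",
                         "content": f"{question}\n\nKnowledge-graph context:\n{graph_text}"}])
    # Stage-2: short fusion prompt that merges both drafts into one coherent answer.
    return _answer([{"role": "user", "content":
        f"Question: {question}\nVisual draft: {visual}\nGraph draft: {symbolic}\n"
        "Write a single coherent answer that cites both visual and textual evidence."}])
```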
§
4. Benchmarks, Baselines, and Numbers
Datasets (all public)
| Name | Modality | #Docs | #Pages | Key Challenge |
|---|---|---|---|---|
| UltraDomain | text only | 177 | 2M tokens | long-range book QA |
| World-History textbook | mixed | 1 | 788 | maps, timelines |
| DLCV slides | mixed | 18 | 1 984 | dense figures |
| GenAI lecture | mixed | 20 | 594 | Chinese slides |
| SlideVQA-2k | mixed | 100 | 2 000 | slide-level VQA |
| RealMMBench | mixed | 163 | 8 604 | tables + charts |
Global QA (125 synthetic questions per dataset)
| Metric | NaiveRAG | GraphRAG | LightRAG | MegaRAG |
|---|---|---|---|---|
| World-History Overall win-rate | 0.0 % | 0.0 % | 0.0 % | 89.5 % |
| GenAI Overall win-rate | 0.0 % | 0.0 % | 0.0 % | 98.4 % |
Local QA (ground-truth labels exist)
| Dataset | Best baseline acc. | MegaRAG acc. | Δ |
|---|---|---|---|
| SlideVQA-2k | 27.66 % | 64.85 % | +37.2 pp |
| RealMMBench-FinSlides | 13.02 % | 58.37 % | +45.3 pp |
§
5. Ablation: What Happens When You Remove a Lego Brick?
| Setting | GenAI Overall win-rate | Comment |
|---|---|---|
| Full system | 86.4 % | — |
| A1: no visual input | 0.8 % | graphs collapse to plain text RAG |
| A2: no graph retrieval | 0.0 % | page-only retrieval misses global links |
| A3: single-stage generation | 64.0 % | still usable, but less diverse & visual |
Lesson: graph retrieval is the load-bearing brick; visuals matter most in slide-heavy corpora; two-stage generation is the cheapest upgrade you can make.
§
6. Practical Recipe: From PDF to Conversational Bot in Four Steps
1. Parse
   python -m mineru input.pdf --output pages/
2. Build initial MMKG
   python build_mmkg.py --dir pages/ --model gpt-4o-mini --parallel 16
3. Refine (single pass)
   python refine.py --kg init.json --encoder GME-Qwen2-VL-2B --top_n 120
4. Index + Ask
   python index_gme.py --kg refined.json
   python ask.py --query "How did EV sales change?" --top_k 60 --top_m 6
Hardware: single RTX-3090 24 GB, ≈ 0.97 s per page for encoding, ≈ 1.8 s per query end-to-end.
§
7. Action Checklist for Your Own Deployment
- [ ] Install MinerU and GME-Qwen2-VL-2B (both Apache-2.0 or MIT)
- [ ] Budget ≈ 2 000 API calls per 1 000 pages for two-pass extraction
- [ ] Keep temperature=0; merge entities by exact string match first, fuzzy second (see the merge sketch after this list)
- [ ] Retrieve 120 nodes max, expand 1-hop, truncate at 32 k tokens
- [ ] Always run two-stage generation; it’s a free 15 % boost
- [ ] Filter out figures with no numeric or textual label → 50 % node cut, +7 % accuracy
- [ ] Use the same prompt template across baselines to keep comparisons fair
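For the merge item above, a minimal sketch of exact-then-fuzzy entity merging using Python's difflib; the threshold and field names are assumptions:

```python
from difflib import SequenceMatcher

def merge_entities(entities: list[dict], fuzzy_threshold: float = 0.9) -> dict[str, dict]:
    """Exact-match merge first, then fold near-duplicate names into an existing entry."""
    merged: dict[str, dict] = {}
    for ent in entities:
        key = ent["name"].strip().lower()
        if key not in merged:
            # Second pass: fuzzy match against names we have already kept.
            match = next((k for k in merged
                          if SequenceMatcher(None, key, k).ratio() >= fuzzy_threshold), None)
            key = match or key
        if key in merged:
            merged[key]["desc"] += " " + ent["desc"]        # accumulate descriptions
            merged[key]["keywords"] |= set(ent.get("keywords", []))
        else:
            merged[key] = {"name": ent["name"], "desc": ent["desc"],
                           "keywords": set(ent.get("keywords", []))}
    return merged
```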
§
8. One-Page Overview
Problem
Text-only RAG ignores charts, tables, and layout, failing on slide decks, textbooks, and reports.
Insight
Treat every page as a tiny multimodal knowledge graph; let a second MLLM pass add missing cross-modal edges using a sub-graph retrieved from the first draft.
Solution (MegaRAG)
1. Parallel extraction → 2. Sub-graph refinement → 3. Unified indexing → 4. Two-stage generation.
Result
89 % overall win-rate against GraphRAG/LightRAG on global QA; up to +45 pp accuracy on local QA. Runs on one 24 GB GPU.
Next Step
Clone the repo, run the four commands, and your PDF library becomes a chatbot that sees.
§
9. FAQ (Extracted From the Paper, Not External Knowledge)
Q1: Can I skip the refinement pass to save money?
A: You can, but expect ~38 % drop on visual datasets. The second pass costs pennies compared to re-training.
Q2: Is the system limited to English?
A: No. The GenAI benchmark is Chinese; MegaRAG still wins 98 % of head-to-heads.
Q3: Do I have to use GPT-4o-mini?
A: Any multimodal LLM that follows instructions works. Keep temperature=0 for consistent JSON.
Q4: How large a document can one GPU handle?
A: About 3 000 pages per hour for encoding; memory-wise, a single 24 GB card handles up to ~30 k pages if you batch two images at a time.
Q5: What if my PDF is pure text?
A: MegaRAG still beats GraphRAG on UltraDomain, so visuals are a bonus, not a prerequisite.
Q6: Are the graphs interoperable?
A: The output is vanilla JSON nodes/edges. You can import into Neo4j, NetworkX, or any other tool.
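As a quick illustration of Q6, a minimal loader into NetworkX, assuming key names like "nodes", "edges", "source", "target", and "relation"; adjust to the actual export:

```python
import json
import networkx as nx

def load_graph(path: str) -> nx.DiGraph:
    # Key names below are assumptions about the JSON export, not a documented schema.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    g = nx.DiGraph()
    for node in data["nodes"]:
        g.add_node(node["name"], **node)
    for edge in data["edges"]:
        g.add_edge(edge["source"], edge["target"], relation=edge["relation"])
    return g
```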
Q7: Is the code open-source?
A: Yes, MIT license. Link in the repo footnote of the original paper.
Q8: Does the method need fine-tuning?
A: No model parameters are changed; it’s prompt-based and embedding-based only.

