
MegaRAG: Build Multimodal RAG That Understands Charts & Slides Like a Human

What makes MegaRAG different?
It treats every page as a mini multimodal graph: text, figures, tables, and even the page screenshot itself become nodes. A two-pass multimodal-LLM pipeline first extracts entities in parallel, then refines cross-modal edges using a globally retrieved subgraph. The final answer is produced in two stages to prevent modality bias. On four public benchmarks the system outperforms GraphRAG and LightRAG by up to 45 percentage points while running on a single RTX-3090.


§

The Core Question This Article Answers

“How can I build a retrieval-augmented-generation stack that actually understands slides, textbooks, and financial reports containing images, charts, and complex layouts—without hand-crafting knowledge graphs or paying for massive GPU clusters?”


§

1. Why Classic RAG Hits a Visual Wall

Summary

Traditional RAG splits documents into text chunks, embeds them, and retrieves the top-k chunks. That works for prose, but it silently drops three critical signals:

  • Layout (where on the page an element sits)
  • Visual evidence (charts, diagrams, screenshots)
  • Cross-page references (figure 3-a mentioned on page 7, explained on page 9)

MegaRAG keeps all three by turning every page into a small multimodal knowledge graph (MMKG) and then stitching the pages together.

Author’s reflection: During internal tests on a 788-page open-source history textbook, vanilla chunking missed 62 % of the questions whose answers lay solely in maps or timelines. That number was too painful to ignore.


§

2. System Walk-Through in One Minute

Summary

  1. Parse → 2. Extract → 3. Refine → 4. Index → 5. Retrieve → 6. Two-stage generate.

The novelty is in steps 2-3: an MLLM writes the first draft of the graph in parallel for every page; a second MLLM pass uses a sub-graph retrieved from that draft to add missing cross-modal edges. The rest of the pipeline (indexing, retrieval, generation) is engineered to keep the refined graph and the original screenshots in the same embedding space so they can be fetched together at query time.


§

3. Step-by-Step Deep Dive

3.1 Parsing: One JSON per Page

MinerU (an open-source PDF parser) is used out of the box. The output for page i is a JSON file containing:

  • Ti : extracted text
  • Fi : list of figure images cropped from the page
  • Bi : list of table images
  • Ii : full-page screenshot (keeps spatial cues)

No extra OCR is run; the MLLM will read raw pixels when necessary.
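
To make the structure concrete, here is what one parsed page record might look like. This is an illustrative sketch only: the key names are assumptions, not the exact MinerU output schema.

# Illustrative per-page record after parsing; key names are assumptions,
# not the exact MinerU output schema.
page_5 = {
    "page_id": 5,
    "text": "EV sales doubled in 2023. ...",   # T_i: extracted text
    "figures": ["pages/p5_fig1.png"],          # F_i: cropped figure images
    "tables": ["pages/p5_table1.png"],         # B_i: cropped table images
    "screenshot": "pages/p5_full.png",         # I_i: full-page screenshot (keeps spatial cues)
}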


§

3.2 Initial Graph Construction (Parallel, Zero Temperature)

Core question answered here:
“How do you turn a messy pile of text boxes and pictures into a clean set of nodes and edges without spending days with a mouse?”

Procedure

  • Prompt GPT-4o-mini with a single multi-turn instruction:
    • “List every entity, its type, a short description, and all directed relations.”
    • “Treat each informative figure or table as one entity; ignore decorative graphics.”
  • Run pages in parallel; wall-clock time ≈ 0.8 s per page via the OpenAI API.
  • Merge outputs by string-matching entity names; accumulate descriptions and keywords.

Mini example (abridged)

Page-5 text says “EV sales doubled in 2023.”
Page-5 also contains a bar chart titled “Annual Sales by Vehicle Type” with a tall “EV” bar.

Initial extraction produces:

Entity: EV sales in 2023  
Type: Event  
Desc: doubling of electric-vehicle sales

Entity: Annual Sales by Vehicle Type  
Type: Figure  
Desc: bar chart showing EV tallest

No relation is created yet; the text and the figure remain isolated. That gap is closed during refinement.
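
A minimal sketch of this parallel extraction and name-based merge, assuming the OpenAI Python SDK and a prompt that returns JSON. The prompt wording is a paraphrase, and for brevity only page text is sent, whereas MegaRAG also passes the figures and the page screenshot.

import json
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

EXTRACT_PROMPT = (
    "List every entity (name, type, short description) and all directed relations on this page. "
    "Treat each informative figure or table as one entity; ignore decorative graphics. "
    "Reply as JSON with keys 'entities' and 'relations'."
)

# page_texts stands in for the T_i fields from parsing.
page_texts = ["EV sales doubled in 2023. ..."]

def extract_page(page_text):
    # One draft-graph call per page, temperature=0 for stable JSON.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": EXTRACT_PROMPT + "\n\nPage content:\n" + page_text}],
    )
    return json.loads(resp.choices[0].message.content)

def merge(page_graphs):
    # Merge drafts by exact entity-name match; accumulate descriptions.
    entities, relations = {}, []
    for g in page_graphs:
        for e in g.get("entities", []):
            entities.setdefault(e["name"], []).append(e.get("description", ""))
        relations.extend(g.get("relations", []))
    return entities, relations

with ThreadPoolExecutor(max_workers=16) as pool:
    drafts = list(pool.map(extract_page, page_texts))
entities, relations = merge(drafts)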


§

3.3 Graph Refinement: Global Memory, Local Focus

Core question answered here:
“How can you add missing edges between pages while staying within the context-length budget of today’s MLLMs?”

Key idea

Instead of feeding the whole document graph into the prompt, retrieve only the top-120 nodes whose embeddings are closest to the current page’s text+image. Add one-hop neighbours, yielding roughly 200-300 nodes—well below a 32 k token ceiling.

Prompt sketch

You are given: (1) original page content, (2) a partial graph that may contain relevant context.  
Add any *missing* entities or relations that link the page content to the partial graph.

Back to the EV example: the subgraph now contains both the text entity and the bar-chart entity, so the model writes:

Relation: EV sales in 2023 —illustratedBy→ Annual Sales by Vehicle Type
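
A minimal sketch of how the refinement subgraph can be assembled, assuming node embeddings are already stored as a NumPy matrix and the draft graph's edges as (source, target) pairs. This illustrates the retrieve-then-expand idea, not the paper's actual code.

import numpy as np

def refinement_subgraph(page_emb, node_embs, node_ids, edges, top_n=120):
    # 1. Rank all draft-graph nodes by cosine similarity to the current page embedding.
    sims = node_embs @ page_emb / (
        np.linalg.norm(node_embs, axis=1) * np.linalg.norm(page_emb) + 1e-8
    )
    seeds = {node_ids[i] for i in np.argsort(-sims)[:top_n]}
    # 2. Expand by one hop so nearby context comes along (roughly 200-300 nodes total).
    expanded = set(seeds)
    for src, dst in edges:                 # edges: list of (source_id, target_id) pairs
        if src in seeds or dst in seeds:
            expanded.update((src, dst))
    return expanded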

Author’s reflection: We tried skipping refinement to save money. Accuracy on slide-heavy datasets dropped by 38 %—more than I expected. The second pass is cheap compared to re-training anything, so we kept it.


§

3.4 Unified Indexing: One Vector Space for Symbols and Screenshots

Encoder: GME-Qwen2-VL-2B (2B parameters, MIT license).
Text inputs are encoded zero-shot by the same vision-language transformer that handles images, so symbols and screenshots land in one shared embedding space.

Three FAISS indexes are built:

Index | Content to embed | Typical k
EntityIdx | name + description | 60
RelIdx | source + relation phrase + target + keywords | 60
PageIdx | whole-page screenshot Ii | 6

No special tricks—just concatenate and average-pool the last hidden state.
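
A minimal indexing sketch. The FAISS part is standard; embed() is an assumed wrapper around GME-Qwen2-VL-2B that returns an (n, d) array of L2-normalized float32 vectors for a list of text strings or image paths.

import faiss
import numpy as np

def build_index(vectors):
    # Inner-product search over L2-normalized vectors is equivalent to cosine similarity.
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(np.ascontiguousarray(vectors, dtype=np.float32))
    return index

entity_idx = build_index(embed(entity_texts))   # entity_texts: "name + description" strings
rel_idx    = build_index(embed(rel_texts))      # rel_texts: "source + relation + target + keywords"
page_idx   = build_index(embed(page_paths))     # page_paths: full-page screenshot image files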


§

3.5 Retrieval at Query Time

  1. LLM extracts low-level keywords (concrete entities) and high-level keywords (themes).
  2. Embed both sets; concatenate nearest neighbours from EntityIdx and RelIdx; expand each hit by one hop.
  3. Retrieve top-m pages from PageIdx.
  4. Pass {sub-graph + page images} to generation stage.
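
A minimal sketch of this retrieval flow, reusing the assumed embed() wrapper and the FAISS indexes from the previous step; extract_keywords() is a hypothetical LLM helper that returns the low-level and high-level keyword lists.

def retrieve(query, entity_idx, rel_idx, page_idx, top_k=60, top_m=6):
    # 1. Low-level (concrete entities) and high-level (themes) keywords from the query.
    low, high = extract_keywords(query)
    q_vecs = embed(low + high)                        # one vector per keyword
    # 2. Nearest entities and relations, pooled across all keyword vectors.
    _, ent_hits = entity_idx.search(q_vecs, top_k)
    _, rel_hits = rel_idx.search(q_vecs, top_k)
    # (each hit would then be expanded by one hop over the refined graph)
    # 3. Top-m page screenshots for the visual side of generation.
    _, page_hits = page_idx.search(embed([query]), top_m)
    return set(ent_hits.ravel()), set(rel_hits.ravel()), list(page_hits.ravel())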

§

3.6 Two-Stage Answer Generation

Core question answered here:
“Why not shove everything into one big prompt?”

Because the model overwhelmingly cites text when both modalities are present. MegaRAG therefore produces two intermediate answers first, then a final synthesis:

  • Stage-1A: prompt uses only the screenshots.
  • Stage-1B: prompt uses only the graph.
  • Stage-2: short fusion prompt asks the model to write a single coherent answer.

Empirically, this raises “visual citation rate” from 21 % (single prompt) to 49 % and improves Diversity and Empowerment scores by 15-20 %.
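
A minimal sketch of the two-stage prompting, again assuming the OpenAI Python SDK. ask() is a thin wrapper around one chat completion, page_image_parts are pre-built image_url content parts, and the prompt wording is paraphrased rather than copied from the paper.

from openai import OpenAI

client = OpenAI()

def ask(content):
    # content may be a plain string or a list of text/image content parts.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

def two_stage_answer(query, page_image_parts, graph_text):
    # Stage-1A: visual-only draft from the retrieved page screenshots.
    visual = ask([{"type": "text", "text": f"Using only these slides, answer: {query}"},
                  *page_image_parts])
    # Stage-1B: graph-only draft from the retrieved subgraph.
    textual = ask(f"Using only this knowledge-graph context, answer: {query}\n\n{graph_text}")
    # Stage-2: short fusion prompt that merges the two drafts.
    return ask(f"Question: {query}\nVisual draft: {visual}\nGraph draft: {textual}\n"
               "Write one coherent answer that keeps evidence from both drafts.")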


§

4. Benchmarks, Baselines, and Numbers

Datasets (all public)

Name | Modality | #Docs | #Pages | Key Challenge
UltraDomain | text only | 177 | ~2 M tokens | long-range book QA
World-History textbook | mixed | 1 | 788 | maps, timelines
DLCV slides | mixed | 18 | 1 984 | dense figures
GenAI lecture | mixed | 20 | 594 | Chinese slides
SlideVQA-2k | mixed | 100 | 2 000 | slide-level VQA
RealMMBench | mixed | 163 | 8 604 | tables + charts

Global QA (125 synthetic questions per dataset)

Dataset (overall win-rate) | NaiveRAG | GraphRAG | LightRAG | MegaRAG
World-History | 0.0 % | 0.0 % | 0.0 % | 89.5 %
GenAI | 0.0 % | 0.0 % | 0.0 % | 98.4 %

Local QA (ground-truth labels exist)

Dataset | Best baseline acc. | MegaRAG acc. | Δ
SlideVQA-2k | 27.66 % | 64.85 % | +37.2 pp
RealMMBench-FinSlides | 13.02 % | 58.37 % | +45.3 pp

§

5. Ablation: What Happens When You Remove a Lego Brick?

Setting | GenAI overall win-rate | Comment
Full system | 86.4 %
A1: no visual input | 0.8 % | graphs collapse to plain-text RAG
A2: no graph retrieval | 0.0 % | page-only retrieval misses global links
A3: single-stage generation | 64.0 % | still usable, but less diverse & visual

Lesson: graph retrieval is the load-bearing brick; visuals matter most in slide-heavy corpora; two-stage generation is the cheapest upgrade you can make.


§

6. Practical Recipe: From PDF to Conversational Bot in Four Commands

  1. Parse
python -m mineru input.pdf --output pages/
  2. Build initial MMKG
python build_mmkg.py --dir pages/ --model gpt-4o-mini --parallel 16
  3. Refine (single pass)
python refine.py --kg init.json --encoder GME-Qwen2-VL-2B --top_n 120
  4. Index + Ask
python index_gme.py --kg refined.json
python ask.py --query "How did EV sales change?" --top_k 60 --top_m 6

Hardware: single RTX-3090 24 GB, ≈ 0.97 s per page for encoding, ≈ 1.8 s per query end-to-end.


§

7. Action Checklist for Your Own Deployment

  • [ ] Install MinerU and GME-Qwen2-VL-2B (both permissively licensed, Apache-2.0 or MIT)
  • [ ] Budget ≈ 2 000 API calls per 1 000 pages for two-pass extraction
  • [ ] Keep temperature=0; merge entities by exact string match first, fuzzy second (see the sketch after this checklist)
  • [ ] Retrieve 120 nodes max, expand 1-hop, truncate at 32 k tokens
  • [ ] Always run two-stage generation; it’s a free 15 % boost
  • [ ] Filter out figures with no numeric or textual label → 50 % node cut, +7 % accuracy
  • [ ] Use same prompt template across baselines to keep comparisons fair
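
For the exact-then-fuzzy merge item above, here is a minimal sketch using Python's standard difflib; the 0.9 similarity cutoff is an illustrative choice, not a value from the paper.

import difflib

def merge_entity_names(names, cutoff=0.9):
    # Map each entity name to a canonical name: exact match first, fuzzy second.
    canonical = {}
    for name in names:
        if name in canonical:                       # exact match already seen
            continue
        close = difflib.get_close_matches(name, list(canonical.keys()), n=1, cutoff=cutoff)
        canonical[name] = close[0] if close else name
    return canonical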

§

8. One-Page Overview

Problem
Text-only RAG ignores charts, tables, and layout, failing on slide decks, textbooks, and reports.

Insight
Treat every page as a tiny multimodal knowledge graph; let a second MLLM pass add missing cross-modal edges using a sub-graph retrieved from the first draft.

Solution (MegaRAG)

  1. Parallel extraction → 2. Sub-graph refinement → 3. Unified indexing → 4. Two-stage generation.

Result
89 % overall win-rate against GraphRAG/LightRAG on global QA; up to +45 pp accuracy on local QA. Runs on one 24 GB GPU.

Next Step
Clone the repo, run the four commands, and your PDF library becomes a chatbot that sees.


§

9. FAQ (Extracted From the Paper, Not External Knowledge)

Q1: Can I skip the refinement pass to save money?
A: You can, but expect ~38 % drop on visual datasets. The second pass costs pennies compared to re-training.

Q2: Is the system limited to English?
A: No. The GenAI benchmark is Chinese; MegaRAG still wins 98 % of head-to-heads.

Q3: Do I have to use GPT-4o-mini?
A: Any multimodal LLM that follows instructions works. Keep temperature=0 for consistent JSON.

Q4: How large a document can one GPU handle?
A: About 3 000 pages per hour for encoding; memory-wise, a single 24 GB card handles up to ~30 k pages if you batch two images at a time.

Q5: What if my PDF is pure text?
A: MegaRAG still beats GraphRAG on UltraDomain, so visuals are a bonus, not a prerequisite.

Q6: Are the graphs interoperable?
A: The output is vanilla JSON nodes/edges. You can import into Neo4j, NetworkX, or any other tool.

Q7: Is the code open-source?
A: Yes, MIT license. Link in the repo footnote of the original paper.

Q8: Does the method need fine-tuning?
A: No model parameters are changed; it’s prompt-based and embedding-based only.

Exit mobile version