Video-R4: Teaching Machines to Pause, Zoom and Re-read Text-Rich Videos
“Why do most video-QA models hallucinate small, fleeting text? Because they never get a second look. Video-R4 fixes this by adding an explicit ‘visual rumination’ loop—select, zoom, re-encode, repeat—boosting M4-ViteVQA accuracy from 26 % to 64 % without extra data or a larger backbone.”
What problem is this article solving?
How to reliably answer questions that depend on tiny, transient text in the wild—news tickers, lecture slides, UI walk-throughs—when single-pass models routinely overlook or mis-read it.
The single-pass ceiling: five pain-points in one shot
- Fixed frame budget → text appears between sampled frames
- Low resolution → OCR noise
- No back-tracking → wrong belief can’t be revised
- Text-only CoT → pixel-ungrounded hallucinations
- Static coordinates → boxes are predicted but never re-examined
Real-world symptom: on M4-ViteVQA a 7B LMM scores 24 EM; humans reach 85.
Visual rumination: the human “pause-zoom-check” cycle as code
Video-R4 exposes two atomic actions and a state:
| Component | Input | Output |
|---|---|---|
| Clip | frame list | sub-sequence + caption |
| Crop | single frame + bbox | zoomed crop + region caption |
| State | hidden | updated reasoning vector |
Pipeline: read → retrieve → refocus → reinforce (closed loop).
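To make the loop concrete, here is a minimal Python sketch of one rumination cycle. The `model.propose_action` / `model.caption` / `model.answer` calls, the PIL-style `frame.crop`, and the confidence-based stop rule are illustrative assumptions, not the repository's actual interface.

```python
# Minimal sketch of the visual-rumination loop (hypothetical API, not the repo's).
from dataclasses import dataclass, field

@dataclass
class RuminationState:
    """Running evidence the model has gathered so far."""
    notes: list = field(default_factory=list)   # captions / OCR snippets
    confidence: float = 0.0
    answer: str = ""

def ruminate(model, frames, question, max_steps=5, stop_conf=0.9):
    state = RuminationState()
    for _ in range(max_steps):
        # 1. The LMM proposes the next action given the question + evidence so far.
        action = model.propose_action(frames, question, state.notes)

        if action.kind == "clip":
            # Re-encode a temporal sub-sequence of frames.
            sub = frames[action.start:action.end]
            state.notes.append(model.caption(sub))
        elif action.kind == "crop":
            # Zoom into a spatial region of one frame and re-encode it.
            region = frames[action.frame_idx].crop(action.bbox)
            state.notes.append(model.caption(region))

        # 2. Update the answer and the stop/continue decision.
        state.answer, state.confidence = model.answer(question, state.notes)
        if state.confidence >= stop_conf:
            break
    return state.answer
```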
Scene example:
Question: “What is the yield of the reaction?”
- Iteration 1: clip 0:30–0:40 → sees blurry bottom text
- Iteration 2: crop bottom quarter → OCR “Yield 0.83”
- Iteration 3: confidence 0.92 → stop
Curated data: 47 k executable “rumination trails”
Built from M4-ViteVQA training split:
- Rule-based evidence match (fuzzy string + bbox IoU) → candidate frames/boxes (sketched below)
- Template synthesis (strong LMM fills “clip/crop → caption → reasoning” chain)
- Human QC (verify every action points to real pixels, edit otherwise)
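As a rough illustration of the evidence-matching step, the sketch below pairs fuzzy string matching with bbox IoU to pick candidate frames and boxes; the `ocr_results` schema, function names, and thresholds are assumptions for illustration, not the released matching script.

```python
# Sketch of rule-based evidence matching: fuzzy string match + bbox IoU.
# `ocr_results` is assumed to be [{"frame": i, "text": str, "bbox": (x1, y1, x2, y2)}, ...].
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def match_evidence(answer: str, ocr_results, gt_box=None,
                   text_thresh=0.8, iou_thresh=0.5):
    """Return OCR detections that plausibly ground the answer in pixels."""
    hits = []
    for det in ocr_results:
        if fuzzy_score(answer, det["text"]) < text_thresh:
            continue
        if gt_box is not None and iou(det["bbox"], gt_box) < iou_thresh:
            continue
        hits.append((det["frame"], det["bbox"], det["text"]))
    return hits
```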
Outcome:
- Video-R4-CoT-17k (supervised)
- Video-R4-RL-30k (answer-reward only)
Author’s reflection: “We originally tried end-to-end RL without trails—agents kept cropping the same logo. Providing grounded coordinates as ‘pixel anchors’ cut convergence time in half.”
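For intuition, a single rumination trail might look roughly like the record below; the field names, schema, and file name are hypothetical, reused from the worked example above rather than taken from the released data format.

```python
# Hypothetical shape of one rumination trail (schema is illustrative only).
trail = {
    "video": "chem_lecture_0421.mp4",          # hypothetical file name
    "question": "What is the yield of the reaction?",
    "actions": [
        {"type": "clip", "start_s": 30.0, "end_s": 40.0,
         "caption": "Slide with a reaction scheme; small text at the bottom."},
        {"type": "crop", "frame_s": 36.0, "bbox": [0.05, 0.75, 0.95, 0.98],
         "caption": "Bottom banner reads 'Yield 0.83'."},
    ],
    "reasoning": "The bottom banner states the yield explicitly.",
    "answer": "0.83",
}
```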
Multi-stage curriculum: atomic before composite
| Stage | Data | Objective | Key trick |
|---|---|---|---|
| DRP-SFT | 7k single-action | master clip OR crop | one tool visible per sample |
| RL-d | 15k RL | explore under answer reward | curiosity bonus prevents tool under-use |
| CRP-SFT | 10k mixed | learn scheduling | both tools in one trajectory |
| RL-c | 15k RL | sharpen stop/continue | diversity + representativeness rewards |
Ablation: skipping DRP-SFT costs 4.3 EM; collapsing the four stages into a single run costs 2.8 EM and converges more slowly.
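A hypothetical driver that runs the four stages in order via the repository's fine-tuning script is sketched below; only the `--data_path`/`--stage` flags and the `drp-sft` stage name appear in the quick-start further down, so the remaining stage names and the data paths are assumptions.

```python
# Hypothetical four-stage curriculum driver. Stage names mirror the table above,
# but whether scripts/finetune.sh accepts them verbatim is an assumption.
import subprocess

STAGES = [
    ("drp-sft", "data/video-r4-cot-drp-7k.jsonl"),   # atomic clip OR crop
    ("rl-d",    "data/video-r4-rl-d-15k.jsonl"),     # RL under answer reward
    ("crp-sft", "data/video-r4-cot-crp-10k.jsonl"),  # composite scheduling
    ("rl-c",    "data/video-r4-rl-c-15k.jsonl"),     # sharpen stop/continue
]

for stage, data_path in STAGES:
    subprocess.run(
        ["bash", "scripts/finetune.sh", "--data_path", data_path, "--stage", stage],
        check=True,
    )
```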
Reward design: diversity, representativeness, curiosity
R = R_correct + λ_div·R_div + λ_rep·R_rep + λ_cur·R_cur
- R_div = avg cosine distance between crop features → avoids redundant boxes
- R_rep = exp(–avg distance from any frame to its nearest selected frame) → covers the video
- R_cur = bonus if global tool usage < H, penalty if > N → balances exploration vs. cost
Empirical weights: λ_div = λ_rep = λ_cur = 1 in the final RL stage; the curiosity term uses α = 0.5, β = 0.05, H = 0.3.
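The sketch below shows one way the three auxiliary terms could be computed from crop features, frame features and a tool-usage ratio; the tensor shapes, the single-threshold curiosity form (collapsing H and N), and the α/β scaling are assumptions based on the description above, not the paper's exact formulation.

```python
# Sketch of the auxiliary rewards R_div, R_rep, R_cur (exact forms in the paper may differ).
import torch
import torch.nn.functional as F

def diversity_reward(crop_feats: torch.Tensor) -> torch.Tensor:
    """Average pairwise cosine distance between crop features, shape (K, D)."""
    if crop_feats.shape[0] < 2:
        return torch.tensor(0.0)
    f = F.normalize(crop_feats, dim=-1)
    sim = f @ f.t()                                   # (K, K) cosine similarities
    k = sim.shape[0]
    off_diag = sim[~torch.eye(k, dtype=torch.bool)]   # drop self-similarities
    return (1.0 - off_diag).mean()                    # distance = 1 - similarity

def representativeness_reward(frame_feats: torch.Tensor, selected_idx) -> torch.Tensor:
    """exp(-avg distance of every frame to its nearest selected frame)."""
    if len(selected_idx) == 0:
        return torch.tensor(0.0)
    f = F.normalize(frame_feats, dim=-1)
    sel = f[selected_idx]                             # (S, D) selected frames
    dist = 1.0 - f @ sel.t()                          # (N, S) cosine distances
    return torch.exp(-dist.min(dim=1).values.mean())

def curiosity_reward(tool_usage_ratio, H=0.3, alpha=0.5, beta=0.05):
    """Bonus while tool usage stays below the threshold, penalty above it."""
    if tool_usage_ratio < H:
        return alpha * (H - tool_usage_ratio)
    return -beta * (tool_usage_ratio - H)

def total_reward(r_correct, crop_feats, frame_feats, selected_idx,
                 usage_ratio, lambdas=(1.0, 1.0, 1.0)):
    l_div, l_rep, l_cur = lambdas
    return (r_correct
            + l_div * diversity_reward(crop_feats)
            + l_rep * representativeness_reward(frame_feats, selected_idx)
            + l_cur * curiosity_reward(usage_ratio))
```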
Results: new state-of-the-art on text-rich video QA
M4-ViteVQA test (average across splits)
| Model | EM ↑ | ANLS ↑ |
|---|---|---|
| Qwen2.5-VL-7B | 24.3 | 39.6 |
| Video-R1-7B | 37.1 | 48.3 |
| Pixel-Reasoner-7B | 58.9 | 65.3 |
| Video-R4-7B | 64.2 | 70.0 |
Longer rumination helps: allow T=5 steps → +6.4 EM over T=1.
Zero-shot transfer: documents & slides love “crop”
| Task | Benchmark | Video-R4 zero-shot EM | Previous best (trained) |
|---|---|---|---|
| Multi-page Doc QA | MP-DocVQA | 53.2 | Hi-VT5 48.3 |
| Slides QA | SlidesVQA | 43.0 EM / 52.2 F1 | M3D 33.5 EM |
| General video QA | Video-MMMU | 52.2 | Video-R1 49.8 |
Author’s reflection: “We were surprised that a crop-heavy policy learned on videos beats layout-specific encoders on documents. Apparently, ‘find the right page, zoom in, read’ generalises across media.”
Installation & quick-start
```bash
# 1. Environment
conda create -n video-r4 python=3.10
conda activate video-r4
git clone https://github.com/yunlong10/Video-R4.git
cd Video-R4 && pip install -r requirements.txt

# 2. Inference
python infer.py \
  --video assets/demo.mp4 \
  --question "What is the yield mentioned?" \
  --checkpoint checkpoints/video-r4-7b.pt \
  --max_rumination 5

# 3. Continue fine-tuning on your data
bash scripts/finetune.sh --data_path my_cot.jsonl --stage drp-sft
```
Hardware: Full fine-tune fits on one H100 80 GB; inference needs 24 GB with bf16.
Limitations (& next steps)
- Depends on pre-extracted OCR → recognition errors propagate
- Only clip & crop supported; long fast-changing videos may need tracking/retiming
- Experiments limited to 7 B; larger backbones & diverse domains unexplored
- Hand-designed rewards only approximate human faithfulness
Action Checklist / Implementation Steps
- Install the repo & weights (one-liner conda + git clone)
- Run inference with --max_rumination 3–5 to see qualitative gains
- Collect your own text-rich videos + Q-A pairs; obtain OCR/labels (any off-the-shelf model)
- Build evidence-matched trails using the provided matching script
- Train DRP-SFT → RL-d → CRP-SFT → RL-c (hyper-parameters in the paper)
- Evaluate on M4-ViteVQA, then zero-shot test on MP-DocVQA/SlidesVQA as a sanity check
One-page Overview
Video-R4 gives a 7 B video LMM the ability to pause, zoom and re-read by exposing two atomic actions—clip and crop—and training them with a four-stage curriculum. On M4-ViteVQA, iterative visual rumination lifts EM from 26 % to 64 %, beats prior RL baselines by 5+ points, and transfers zero-shot to multi-page documents and slide decks. Training data (17 k supervised + 30 k RL trails) and code are open-sourced; a full fine-tune needs one H100 and ~22 h.
FAQ
Q1: Does inference slow down linearly with rumination steps?
A: Each extra step adds ~320 ms on A100; you can set --max_rumination to trade speed vs. accuracy.
Q2: Can I add new actions like face detection or audio clip?
A: The tool interface is modular; just implement the function and add a corresponding reward term.
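As a sketch of what that modularity could look like, here is a hypothetical tool registry with an added `audio_clip` action; the real codebase's interface may differ.

```python
# Hypothetical tool registry illustrating how a new action could be plugged in.
TOOLS = {}

def register_tool(name):
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator

@register_tool("crop")
def crop_tool(frames, frame_idx, bbox):
    """Zoom into a spatial region of one frame (PIL-style crop assumed)."""
    return frames[frame_idx].crop(bbox)

@register_tool("audio_clip")          # a new action, e.g. for spoken numbers
def audio_clip_tool(audio, start_s, end_s, sr=16000):
    """Cut a short audio segment to be transcribed by an ASR model."""
    return audio[int(start_s * sr):int(end_s * sr)]

# The policy then selects from TOOLS by name, and a matching reward term
# (e.g. penalising redundant audio segments) is added to the RL objective.
```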
Q3: How sensitive is the result to OCR quality?
A: Training labels rely on OCR, but the model sees raw pixels at test time; we observed only a ±0.8 EM shift when switching OCR engines.
Q4: Is the codebase tied to Qwen2.5-VL?
A: No—any LMM that accepts frame/region pixels and produces text can be plugged in; we kept interfaces generic.
Q5: Will larger models eliminate the need for rumination?
A: 13 B pilot runs still gain +4.2 EM from rumination, suggesting the bottleneck is “looking” rather than “parametric knowledge.”
Q6: What about very long videos (>1 h)?
A: Currently capped at 256 frames for memory reasons; hierarchical clip proposals are on the roadmap.
