

Video-R4: Teaching Machines to Pause, Zoom and Re-read Text-Rich Videos

“Why do most video-QA models hallucinate small, fleeting text? Because they never get a second look. Video-R4 fixes this by adding an explicit ‘visual rumination’ loop—select, zoom, re-encode, repeat—boosting M4-ViteVQA accuracy from 26 % to 64 % without extra data or a larger backbone.”


What problem is this article solving?

How to reliably answer questions that depend on tiny, transient text in the wild—news tickers, lecture slides, UI walk-throughs—when single-pass models routinely overlook or mis-read it.


The single-pass ceiling: five pain-points in one shot

  1. Fixed frame budget → text appears between sampled frames
  2. Low resolution → OCR noise
  3. No back-tracking → wrong belief can’t be revised
  4. Text-only CoT → pixel-ungrounded hallucinations
  5. Static coordinates → boxes are predicted but never re-examined

Real-world symptom: on M4-ViteVQA a 7B LMM scores 24 EM; humans reach 85.


Visual rumination: the human “pause-zoom-check” cycle as code

Video-R4 exposes two atomic actions and a state:

Action   Input                 Output
Clip     frame list            sub-sequence + caption
Crop     single frame + bbox   zoomed crop + region caption
State    hidden                updated reasoning vector

Pipeline: read → retrieve → refocus → reinforce (closed loop).

Scene example:
Question: “What is the yield of the reaction?”

  • Iteration 1: clip 0:30-0:40 → sees blurry bottom text
  • Iteration 2: crop bottom quarter → OCR “Yield 0.83”
  • Iteration 3: confidence 0.92 → stop
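
To make the loop concrete, here is a minimal sketch of the read → retrieve → refocus → reinforce cycle. Every name in it (model.read, model.clip, model.crop, model.answer, the action fields) is an assumed placeholder, not Video-R4's actual API; the point is only the control flow of gathering evidence until confidence passes a threshold.

# Minimal sketch of the visual-rumination loop.
# All method names and the action interface are illustrative assumptions.
def ruminate(model, frames, question, max_steps=5, stop_conf=0.9):
    state = model.read(frames, question)              # initial pass over the whole video
    answer, confidence = model.answer(state, question)
    for _ in range(max_steps):
        if confidence >= stop_conf:                   # confident enough: stop ruminating
            break
        action = model.propose_action(state)          # retrieve: pick "clip" or "crop"
        if action.kind == "clip":
            evidence = model.clip(frames, action.start, action.end)       # temporal zoom
        else:
            evidence = model.crop(frames[action.frame_idx], action.bbox)  # spatial zoom
        state = model.update_state(state, evidence)   # reinforce: fold new evidence back in
        answer, confidence = model.answer(state, question)
    return answer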

Curated data: 47 k executable “rumination trails”

Built from M4-ViteVQA training split:

  1. Rule-based evidence match (fuzzy string + bbox IoU) → candidate frames/boxes (see the sketch after this list)
  2. Template synthesis (strong LMM fills “clip/crop → caption → reasoning” chain)
  3. Human QC (verify every action points to real pixels, edit otherwise)
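
As a rough illustration of step 1, the sketch below pairs OCR tokens with answer strings by fuzzy matching and checks spatial overlap against an annotated box via IoU. The thresholds, record layout and function names are assumptions for illustration, not the paper's actual matching script.

# Sketch of rule-based evidence matching: fuzzy string match + bbox IoU.
# Thresholds and record layout are illustrative assumptions.
from difflib import SequenceMatcher

def fuzzy_ratio(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_evidence(answer, ocr_results, gt_box=None, min_ratio=0.8, min_iou=0.5):
    """ocr_results: list of dicts like {"frame": int, "text": str, "bbox": (x1, y1, x2, y2)}."""
    candidates = []
    for rec in ocr_results:
        if fuzzy_ratio(answer, rec["text"]) < min_ratio:
            continue
        if gt_box is not None and iou(rec["bbox"], gt_box) < min_iou:
            continue
        candidates.append((rec["frame"], rec["bbox"]))
    return candidates  # candidate frames/boxes to seed a rumination trail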

Outcome:

  • Video-R4-CoT-17k (supervised)
  • Video-R4-RL-30k (answer-reward only)

Author’s reflection: “We originally tried end-to-end RL without trails—agents kept cropping the same logo. Providing grounded coordinates as ‘pixel anchors’ cut convergence time in half.”


Multi-stage curriculum: atomic before composite

Stage     Data               Objective                     Key trick
DRP-SFT   7k single-action   master clip OR crop           one tool visible per sample
RL-d      15k RL             explore under answer reward   curiosity bonus prevents tool under-use
CRP-SFT   10k mixed          learn scheduling              both tools in one trajectory
RL-c      15k RL             sharpen stop/continue         diversity + representativeness rewards

Ablation: skip DRP → ‑4.3 EM; collapse four stages → ‑2.8 EM and slower.
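
One compact way to think about the schedule is as an ordered list of stage configs that a training driver walks through. The field names and values below are purely illustrative assumptions (the quick-start only exposes a --stage flag), not the repository's config format.

# Illustrative four-stage schedule; field names are assumptions, not the repo's format.
CURRICULUM = [
    {"stage": "drp-sft", "samples": 7_000,  "mode": "sft", "note": "single action per sample (clip OR crop)"},
    {"stage": "rl-d",    "samples": 15_000, "mode": "rl",  "note": "answer reward + curiosity bonus"},
    {"stage": "crp-sft", "samples": 10_000, "mode": "sft", "note": "mixed trajectories, both tools"},
    {"stage": "rl-c",    "samples": 15_000, "mode": "rl",  "note": "answer + diversity + representativeness rewards"},
]

def run_curriculum(train_fn):
    # train_fn(stage_cfg) stands in for whatever actually launches a stage.
    for cfg in CURRICULUM:
        train_fn(cfg)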


Reward design: diversity, representativeness, curiosity

R = R_correct + λ_div·R_div + λ_rep·R_rep + λ_cur·R_cur

  • R_div = avg cosine distance between crop features → avoids redundant boxes
  • R_rep = exp(–avg distance from any frame to its nearest selected frame) → covers video
  • R_cur = bonus if global tool usage < H, penalty if > N → balances exploration vs. cost

Empirical λ: 1/1/1 in final RL; curiosity α=0.5, β=0.05, H=0.3.
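
A rough sketch of how the diversity and representativeness terms could be computed from frame and crop embeddings follows. The feature extraction, normalisation and function names are assumptions for illustration; the paper's exact formulation may differ.

# Sketch of the diversity / representativeness reward terms over embeddings.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def diversity_reward(crop_feats):
    """Average pairwise cosine distance between selected crop features (R_div)."""
    if len(crop_feats) < 2:
        return 0.0
    dists = [cosine_distance(a, b)
             for i, a in enumerate(crop_feats)
             for b in crop_feats[i + 1:]]
    return float(np.mean(dists))

def representativeness_reward(frame_feats, selected_feats):
    """exp(-avg distance from every frame to its nearest selected frame) (R_rep)."""
    if not selected_feats:
        return 0.0
    nearest = [min(cosine_distance(f, s) for s in selected_feats) for f in frame_feats]
    return float(np.exp(-np.mean(nearest)))

def total_reward(correct, crop_feats, frame_feats, selected_feats,
                 lam_div=1.0, lam_rep=1.0, lam_cur=1.0, curiosity=0.0):
    # R = R_correct + λ_div·R_div + λ_rep·R_rep + λ_cur·R_cur
    return (float(correct)
            + lam_div * diversity_reward(crop_feats)
            + lam_rep * representativeness_reward(frame_feats, selected_feats)
            + lam_cur * curiosity)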


Results: new state-of-the-art on text-rich video QA

M4-ViteVQA test (average across splits)

Model               EM ↑   ANLS ↑
Qwen2.5-VL-7B       24.3   39.6
Video-R1-7B         37.1   48.3
Pixel-Reasoner-7B   58.9   65.3
Video-R4-7B         64.2   70.0

Longer rumination helps: allowing T=5 steps gains +6.4 EM over T=1.


Zero-shot transfer: documents & slides love “crop”

Task                Benchmark    Video-R4 (zero-shot)   Previous best (trained)
Multi-page Doc QA   MP-DocVQA    53.2 EM                Hi-VT5 48.3
Slides QA           SlidesVQA    43.0 EM / 52.2 F1      M3D 33.5 EM
General video QA    Video-MMMU   52.2                   Video-R1 49.8

Author’s reflection: “We were surprised that a crop-heavy policy learned on videos beats layout-specific encoders on documents. Apparently, ‘find the right page, zoom in, read’ generalises across media.”


Installation & quick-start

# 1. Environment
conda create -n video-r4 python=3.10
conda activate video-r4
git clone https://github.com/yunlong10/Video-R4.git
cd Video-R4 && pip install -r requirements.txt

# 2. Inference
python infer.py \
  --video assets/demo.mp4 \
  --question "What is the yield mentioned?" \
  --checkpoint checkpoints/video-r4-7b.pt \
  --max_rumination 5

# 3. Continue fine-tuning on your data
bash scripts/finetune.sh --data_path my_cot.jsonl --stage drp-sft

Hardware: Full fine-tune fits on one H100 80 GB; inference needs 24 GB with bf16.


Limitations (& next steps)

  • Depends on pre-extracted OCR → recognition errors propagate
  • Only clip & crop supported; long fast-changing videos may need tracking/retiming
  • Experiments limited to 7 B; larger backbones & diverse domains unexplored
  • Hand-designed rewards only approximate human faithfulness

Action Checklist / Implementation Steps

  1. Install repo & weights (one-liner conda + git clone)
  2. Run inference with --max_rumination 3-5 to see qualitative gains
  3. Collect your own text-rich videos + Q-A pairs; obtain OCR/labels (any off-the-shelf model)
  4. Build evidence-matched trails using provided matching script
  5. Train DRP-SFT → RL-d → CRP-SFT → RL-c (hyper-parameters in paper)
  6. Evaluate on M4-ViteVQA, then zero-shot test on MP-DocVQA/SlidesVQA for sanity check

One-page Overview

Video-R4 gives a 7B video LMM the ability to pause, zoom and re-read by exposing two atomic actions, clip and crop, and training them with a four-stage curriculum. On M4-ViteVQA, iterative visual rumination lifts EM from 26 % to 64 %, beats prior RL baselines by 5+ points, and transfers zero-shot to multi-page documents and slide decks. Training data (17 k supervised + 30 k RL trails) and code are open-sourced; a full fine-tune needs one H100 and ~22 h.


FAQ

Q1: Does inference slow down linearly with rumination steps?
A: Each extra step adds ~320 ms on A100; you can set --max_rumination to trade speed vs. accuracy.

Q2: Can I add new actions, such as face detection or audio clipping?
A: The tool interface is modular; just implement the function and add a corresponding reward term.

Q3: How sensitive is the result to OCR quality?
A: Training labels rely on OCR, but the model sees raw pixels at test time; we observed a change of only ±0.8 EM when switching OCR engines.

Q4: Is the codebase tied to Qwen2.5-VL?
A: No—any LMM that accepts frame/region pixels and produces text can be plugged in; we kept interfaces generic.

Q5: Will larger models eliminate the need for rumination?
A: 13 B pilot runs still gain +4.2 EM from rumination, suggesting the bottleneck is “looking” rather than “parametric knowledge.”

Q6: What about very long videos (>1 h)?
A: Currently capped at 256 frames for memory reasons; hierarchical clip proposals are on the roadmap.
