Video-R4: Teaching Machines to Pause, Zoom and Re-read Text-Rich Videos
“Why do most video-QA models hallucinate small, fleeting text? Because they never get a second look. Video-R4 fixes this by adding an explicit ‘visual rumination’ loop—select, zoom, re-encode, repeat—boosting M4-ViteVQA accuracy from 26 % to 64 % without extra data or a larger backbone.”
What problem is this article solving?
How to reliably answer questions that depend on tiny, transient text in the wild—news tickers, lecture slides, UI walk-throughs—when single-pass models routinely overlook or mis-read it.
The single-pass ceiling: five pain-points in one shot
- Fixed frame budget → text appears between sampled frames
- Low resolution → OCR noise
- No back-tracking → wrong belief can’t be revised
- Text-only CoT → pixel-ungrounded hallucinations
- Static coordinates → boxes are predicted but never re-examined
Real-world symptom: on M4-ViteVQA a 7B LMM scores 24 EM; humans reach 85.
Visual rumination: the human “pause-zoom-check” cycle as code
Video-R4 exposes two atomic actions and a state:
| Component | Input | Output |
|---|---|---|
| Clip | frame list | sub-sequence + caption |
| Crop | single frame + bbox | zoomed crop + region caption |
| State | hidden | updated reasoning vector |
Pipeline: read → retrieve → refocus → reinforce (closed loop).
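To make the loop concrete, here is a minimal Python sketch of one rumination cycle. The `model.propose_action` / `model.caption` / `model.answer` calls, the PIL-style `frame.crop`, and the confidence-based stop rule are illustrative assumptions, not the repository's actual interface.

```python
# Minimal sketch of the visual-rumination loop (hypothetical API, not the repo's).
from dataclasses import dataclass, field

@dataclass
class RuminationState:
    """Running evidence the model has gathered so far."""
    notes: list = field(default_factory=list)   # captions / OCR snippets
    confidence: float = 0.0
    answer: str = ""

def ruminate(model, frames, question, max_steps=5, stop_conf=0.9):
    state = RuminationState()
    for _ in range(max_steps):
        # 1. The LMM proposes the next action given the question + evidence so far.
        action = model.propose_action(frames, question, state.notes)

        if action.kind == "clip":
            # Re-encode a temporal sub-sequence of frames.
            sub = frames[action.start:action.end]
            state.notes.append(model.caption(sub))
        elif action.kind == "crop":
            # Zoom into a spatial region of one frame and re-encode it.
            region = frames[action.frame_idx].crop(action.bbox)
            state.notes.append(model.caption(region))

        # 2. Update the answer and the stop/continue decision.
        state.answer, state.confidence = model.answer(question, state.notes)
        if state.confidence >= stop_conf:
            break
    return state.answer
```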
Scene example:
Question: “What is the yield of the reaction?”
- Iteration 1: clip 0:30–0:40 → sees blurry bottom text
- Iteration 2: crop bottom quarter → OCR “Yield 0.83”
- Iteration 3: confidence 0.92 → stop
Curated data: 47 k executable “rumination trails”
Built from M4-ViteVQA training split:
- Rule-based evidence match (fuzzy string + bbox IoU) → candidate frames/boxes (sketched below)
- Template synthesis (strong LMM fills “clip/crop → caption → reasoning” chain)
- Human QC (verify every action points to real pixels, edit otherwise)
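As a rough illustration of the evidence-matching step, the sketch below pairs fuzzy string matching with bbox IoU to pick candidate frames and boxes; the `ocr_results` schema, function names, and thresholds are assumptions for illustration, not the released matching script.

```python
# Sketch of rule-based evidence matching: fuzzy string match + bbox IoU.
# `ocr_results` is assumed to be [{"frame": i, "text": str, "bbox": (x1, y1, x2, y2)}, ...].
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def match_evidence(answer: str, ocr_results, gt_box=None,
                   text_thresh=0.8, iou_thresh=0.5):
    """Return OCR detections that plausibly ground the answer in pixels."""
    hits = []
    for det in ocr_results:
        if fuzzy_score(answer, det["text"]) < text_thresh:
            continue
        if gt_box is not None and iou(det["bbox"], gt_box) < iou_thresh:
            continue
        hits.append((det["frame"], det["bbox"], det["text"]))
    return hits
```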
Outcome:
- Video-R4-CoT-17k (supervised)
- Video-R4-RL-30k (answer-reward only)
Author’s reflection: “We originally tried end-to-end RL without trails—agents kept cropping the same logo. Providing grounded coordinates as ‘pixel anchors’ cut convergence time in half.”
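For intuition, a single rumination trail might look roughly like the record below; the field names, schema, and file name are hypothetical, reused from the worked example above rather than taken from the released data format.

```python
# Hypothetical shape of one rumination trail (schema is illustrative only).
trail = {
    "video": "chem_lecture_0421.mp4",          # hypothetical file name
    "question": "What is the yield of the reaction?",
    "actions": [
        {"type": "clip", "start_s": 30.0, "end_s": 40.0,
         "caption": "Slide with a reaction scheme; small text at the bottom."},
        {"type": "crop", "frame_s": 36.0, "bbox": [0.05, 0.75, 0.95, 0.98],
         "caption": "Bottom banner reads 'Yield 0.83'."},
    ],
    "reasoning": "The bottom banner states the yield explicitly.",
    "answer": "0.83",
}
```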
Multi-stage curriculum: atomic before composite
| Stage | Data | Objective | Key trick |
|---|---|---|---|
| DRP-SFT | 7k single-action | master clip OR crop | one tool visible per sample |
| RL-d | 15k RL | explore under answer reward | curiosity bonus prevents tool under-use |
| CRP-SFT | 10k mixed | learn scheduling | both tools in one trajectory |
| RL-c | 15k RL | sharpen stop/continue | diversity + representativeness rewards |
Ablation: skipping DRP-SFT costs 4.3 EM; collapsing the four stages into a single run costs 2.8 EM and converges more slowly.
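A hypothetical driver that runs the four stages in order via the repository's fine-tuning script is sketched below; only the `--data_path`/`--stage` flags and the `drp-sft` stage name appear in the quick-start further down, so the remaining stage names and the data paths are assumptions.

```python
# Hypothetical four-stage curriculum driver. Stage names mirror the table above,
# but whether scripts/finetune.sh accepts them verbatim is an assumption.
import subprocess

STAGES = [
    ("drp-sft", "data/video-r4-cot-drp-7k.jsonl"),   # atomic clip OR crop
    ("rl-d",    "data/video-r4-rl-d-15k.jsonl"),     # RL under answer reward
    ("crp-sft", "data/video-r4-cot-crp-10k.jsonl"),  # composite scheduling
    ("rl-c",    "data/video-r4-rl-c-15k.jsonl"),     # sharpen stop/continue
]

for stage, data_path in STAGES:
    subprocess.run(
        ["bash", "scripts/finetune.sh", "--data_path", data_path, "--stage", stage],
        check=True,
    )
```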
Reward design: diversity, representativeness, curiosity
R = R_correct + λ_div·R_div + λ_rep·R_rep + λ_cur·R_cur
- R_div = avg cosine distance between crop features → avoids redundant boxes
- R_rep = exp(–avg distance from any frame to its nearest selected frame) → covers the video
- R_cur = bonus if global tool usage < H, penalty if > N → balances exploration vs. cost
Empirical weights: λ_div = λ_rep = λ_cur = 1 in the final RL stage; the curiosity term uses α = 0.5, β = 0.05, H = 0.3.
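The sketch below shows one way the three auxiliary terms could be computed from crop features, frame features and a tool-usage ratio; the tensor shapes, the single-threshold curiosity form (collapsing H and N), and the α/β scaling are assumptions based on the description above, not the paper's exact formulation.

```python
# Sketch of the auxiliary rewards R_div, R_rep, R_cur (exact forms in the paper may differ).
import torch
import torch.nn.functional as F

def diversity_reward(crop_feats: torch.Tensor) -> torch.Tensor:
    """Average pairwise cosine distance between crop features, shape (K, D)."""
    if crop_feats.shape[0] < 2:
        return torch.tensor(0.0)
    f = F.normalize(crop_feats, dim=-1)
    sim = f @ f.t()                                   # (K, K) cosine similarities
    k = sim.shape[0]
    off_diag = sim[~torch.eye(k, dtype=torch.bool)]   # drop self-similarities
    return (1.0 - off_diag).mean()                    # distance = 1 - similarity

def representativeness_reward(frame_feats: torch.Tensor, selected_idx) -> torch.Tensor:
    """exp(-avg distance of every frame to its nearest selected frame)."""
    if len(selected_idx) == 0:
        return torch.tensor(0.0)
    f = F.normalize(frame_feats, dim=-1)
    sel = f[selected_idx]                             # (S, D) selected frames
    dist = 1.0 - f @ sel.t()                          # (N, S) cosine distances
    return torch.exp(-dist.min(dim=1).values.mean())

def curiosity_reward(tool_usage_ratio, H=0.3, alpha=0.5, beta=0.05):
    """Bonus while tool usage stays below the threshold, penalty above it."""
    if tool_usage_ratio < H:
        return alpha * (H - tool_usage_ratio)
    return -beta * (tool_usage_ratio - H)

def total_reward(r_correct, crop_feats, frame_feats, selected_idx,
                 usage_ratio, lambdas=(1.0, 1.0, 1.0)):
    l_div, l_rep, l_cur = lambdas
    return (r_correct
            + l_div * diversity_reward(crop_feats)
            + l_rep * representativeness_reward(frame_feats, selected_idx)
            + l_cur * curiosity_reward(usage_ratio))
```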
Results: new state-of-the-art on text-rich video QA
M4-ViteVQA test (average across splits)
| Model | EM ↑ | ANLS ↑ |
|---|---|---|
| Qwen2.5-VL-7B | 24.3 | 39.6 |
| Video-R1-7B | 37.1 | 48.3 |
| Pixel-Reasoner-7B | 58.9 | 65.3 |
| Video-R4-7B | 64.2 | 70.0 |
Longer rumination helps: allow T=5 steps → +6.4 EM over T=1.
Zero-shot transfer: documents & slides love “crop”
| Task | Benchmark | Video-R4 zero-shot EM | Previous best (trained) |
|---|---|---|---|
| Multi-page Doc QA | MP-DocVQA | 53.2 | Hi-VT5 48.3 |
| Slides QA | SlidesVQA | 43.0 EM / 52.2 F1 | M3D 33.5 EM |
| General video QA | Video-MMMU | 52.2 | Video-R1 49.8 |
Author’s reflection: “We were surprised that a crop-heavy policy learned on videos beats layout-specific encoders on documents. Apparently, ‘find the right page, zoom in, read’ generalises across media.”
Installation & quick-start
```bash
# 1. Environment
conda create -n video-r4 python=3.10
conda activate video-r4
git clone https://github.com/yunlong10/Video-R4.git
cd Video-R4 && pip install -r requirements.txt

# 2. Inference
python infer.py \
  --video assets/demo.mp4 \
  --question "What is the yield mentioned?" \
  --checkpoint checkpoints/video-r4-7b.pt \
  --max_rumination 5

# 3. Continue fine-tuning on your data
bash scripts/finetune.sh --data_path my_cot.jsonl --stage drp-sft
```
Hardware: Full fine-tune fits on one H100 80 GB; inference needs 24 GB with bf16.
Limitations (& next steps)
- Depends on pre-extracted OCR → recognition errors propagate
- Only clip & crop supported; long fast-changing videos may need tracking/retiming
- Experiments limited to 7 B; larger backbones & diverse domains unexplored
- Hand-designed rewards only approximate human faithfulness
Action Checklist / Implementation Steps
- Install the repo & weights (one-liner conda + git clone)
- Run inference with --max_rumination 3–5 to see qualitative gains
- Collect your own text-rich videos + Q-A pairs; obtain OCR/labels (any off-the-shelf model)
- Build evidence-matched trails using the provided matching script
- Train DRP-SFT → RL-d → CRP-SFT → RL-c (hyper-parameters in the paper)
- Evaluate on M4-ViteVQA, then zero-shot test on MP-DocVQA/SlidesVQA as a sanity check
One-page Overview
Video-R4 gives a 7 B video LMM the ability to pause, zoom and re-read by exposing two atomic actions—clip and crop—and training them with a four-stage curriculum. On M4-ViteVQA, iterative visual rumination lifts EM from 26 % to 64 %, beats prior RL baselines by 5+ points, and transfers zero-shot to multi-page documents and slide decks. Training data (17 k supervised + 30 k RL trails) and code are open-sourced; a full fine-tune needs one H100 and ~22 h.
FAQ
Q1: Does inference slow down linearly with rumination steps?
A: Each extra step adds ~320 ms on A100; you can set --max_rumination to trade speed vs. accuracy.
Q2: Can I add new actions like face detection or audio clip?
A: The tool interface is modular; just implement the function and add a corresponding reward term.
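As a sketch of what that modularity could look like, here is a hypothetical tool registry with an added `audio_clip` action; the real codebase's interface may differ.

```python
# Hypothetical tool registry illustrating how a new action could be plugged in.
TOOLS = {}

def register_tool(name):
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator

@register_tool("crop")
def crop_tool(frames, frame_idx, bbox):
    """Zoom into a spatial region of one frame (PIL-style crop assumed)."""
    return frames[frame_idx].crop(bbox)

@register_tool("audio_clip")          # a new action, e.g. for spoken numbers
def audio_clip_tool(audio, start_s, end_s, sr=16000):
    """Cut a short audio segment to be transcribed by an ASR model."""
    return audio[int(start_s * sr):int(end_s * sr)]

# The policy then selects from TOOLS by name, and a matching reward term
# (e.g. penalising redundant audio segments) is added to the RL objective.
```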
Q3: How sensitive is the result to OCR quality?
A: Training labels rely on OCR, but the model sees raw pixels at test time; we observed only a ±0.8 EM shift when switching OCR engines.
Q4: Is the codebase tied to Qwen2.5-VL?
A: No—any LMM that accepts frame/region pixels and produces text can be plugged in; we kept interfaces generic.
Q5: Will larger models eliminate the need for rumination?
A: 13 B pilot runs still gain +4.2 EM from rumination, suggesting the bottleneck is “looking” rather than “parametric knowledge.”
Q6: What about very long videos (>1 h)?
A: Currently capped at 256 frames for memory reasons; hierarchical clip proposals are on the roadmap.
