Teaching Machines to Pause and Zoom: How Video-R4 Solves Text-Rich Video QA

4 hours ago 高效码农

Video-R4: Teaching Machines to Pause, Zoom and Re-read Text-Rich Videos “Why do most video-QA models hallucinate small, fleeting text? Because they never get a second look. Video-R4 fixes this by adding an explicit ‘visual rumination’ loop—select, zoom, re-encode, repeat—boosting M4-ViteVQA accuracy from 26 % to 64 % without extra data or a larger backbone.” What problem is this article solving? How to reliably answer questions that depend on tiny, transient text in the wild—news tickers, lecture slides, UI walk-throughs—when single-pass models routinely overlook or mis-read it. The single-pass ceiling: five pain-points in one shot Fixed frame budget → text appears …