What MMGR Really Tests: A Plain-English Walk-Through of the Multi-Modal Generative Reasoning Benchmark
> If you just want the takeaway, scroll to the “Sixty-Second Summary” at the end.
> If you want to know why your shiny text-to-video model still walks through walls or fills Sudoku grids with nine 9s in the same row, read on.
1. Why another benchmark?
Existing video scores such as FVD (Fréchet Video Distance) or IS (Inception Score) only ask one question:
“Does the clip look realistic to a frozen image classifier?”
They ignore three bigger questions:
- Is the motion physically possible?
- Does the scene follow logical rules?
- Are objects still in the same place five frames later?
MMGR (Multi-Modal Generative Reasoning) was built to answer those harder questions. It pits both image and video generators against tasks that check five reasoning muscles:
| Skill | Everyday example |
|---|---|
| Physical reasoning | Apple falls down, not up |
| Logical reasoning | If A > B and B > C, then A > C |
| 3-D spatial reasoning | You must climb stairs to reach the second floor |
| 2-D spatial reasoning | On a map, red dot = goal, blue line = path |
| Temporal reasoning | Light the fuse before the explosion |
2. What’s inside the benchmark?
Three domains, ten tasks, 1,853 test samples.
DOMAIN 1 — Abstract Reasoning (no physics, just rules)
- Maze – 240 grids, sizes 3×3 → 13×13
- Sudoku – 300 grids, 4×4 & 9×9, three difficulty bands
- ARC-AGI – 456 visual “IQ” puzzles that require rule induction
- Visual Math – 327 word problems from grade-school (GSM8K) to Olympiad (Omni-MATH)
DOMAIN 2 — Embodied Navigation (move through space)
- Panoramic Last-Mile Nav – 360° photo, short walk to a visible target
- Top-Down Real-World Nav – bird's-eye floor plan, long path
- 3-D Real-World Nav – doll-house view, multi-floor, stairs
- SLAG – Simultaneous Localization & Generation: walk in 3-D while drawing your own 2-D map in real time
DOMAIN 3 — Physical Commonsense (how the world works)
- Physical Concepts – 25 clips: ball collisions, water splash, ink mixing
- Sports Scenarios – 25 clips: ballet fouetté, ski jump, diving, swimming
3. How harsh is the marking?
Every task uses binary (0/1) fine-grained metrics. A sample scores “1” only when ALL sub-checks pass. There is no partial credit.
Example for a maze video:
- Does the green dot start on the green square?
- Does it reach the red square?
- Does it avoid every black wall?
- Do all walls stay in place?
- Is the path continuous?
If any check fails → Overall = 0.
That is why headline numbers look brutal.
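To make the all-or-nothing marking concrete, here is a minimal sketch of how such a score could be aggregated in code. The sub-checks mirror the maze list above; the frame representation and the check implementations are illustrative assumptions, not the benchmark's actual evaluator.

```python
# Minimal sketch of MMGR-style all-or-nothing scoring for a maze clip.
# Each sub-check returns True/False; Overall is 1 only if every check passes.
# The check names mirror the list above; their implementations are hypothetical.
from typing import Callable, Dict, List

def score_maze_clip(frames: List[dict], checks: Dict[str, Callable[[List[dict]], bool]]) -> int:
    """Return 1 only if every binary sub-check passes, else 0 (no partial credit)."""
    return int(all(check(frames) for check in checks.values()))

# Example wiring with deliberately simple stand-in checks.
checks = {
    "starts_on_green": lambda frames: frames[0]["agent_cell"] == frames[0]["start_cell"],
    "reaches_red":     lambda frames: frames[-1]["agent_cell"] == frames[-1]["goal_cell"],
    "never_hits_wall": lambda frames: all(f["agent_cell"] not in f["walls"] for f in frames),
    "walls_static":    lambda frames: all(f["walls"] == frames[0]["walls"] for f in frames),
    "path_continuous": lambda frames: all(
        abs(a["agent_cell"][0] - b["agent_cell"][0]) + abs(a["agent_cell"][1] - b["agent_cell"][1]) <= 1
        for a, b in zip(frames, frames[1:])
    ),
}

frames = [
    {"agent_cell": (0, 0), "start_cell": (0, 0), "goal_cell": (1, 1), "walls": {(0, 1)}},
    {"agent_cell": (1, 0), "start_cell": (0, 0), "goal_cell": (1, 1), "walls": {(0, 1)}},
    {"agent_cell": (1, 1), "start_cell": (0, 0), "goal_cell": (1, 1), "walls": {(0, 1)}},
]
print(score_maze_clip(frames, checks))  # 1 only because every sub-check passes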
4. Models that sat the exam
| Video generators | Image generators |
|---|---|
| Veo-3 (Google) | Nano-banana |
| Sora-2 (OpenAI) | Nano-banana Pro |
| Wan-2.2 (Wan) | GPT-4o-image |
| — | Qwen-image |
All tests were done zero-shot with default API settings.
The judge was Gemini-2.5-Pro, calibrated against six human annotators.
5. Domain 1: Abstract Reasoning – “Why the correct final answer still earns a zero”
5.1 Sudoku test
Video models have high Action Reflection (they happily edit digits frame-by-frame) yet finish with near-zero Overall success.
| Model | Overall (4×4 easy) | Human re-check |
|---|---|---|
| Veo-3 | 11.4 % | 0 % |
| Sora-2 | 0 % | 0 % |
| Nano-banana Pro (image) | 66 % | — |
Root cause timeline:
- Frame 2 – writes “3”
- Frame 5 – overwrites the same cell with “8”
- Frame 8 – forgets the row already has an “8”
- Frame 12 – grid violates a rule → Overall = 0
Humans call the video “a student who keeps rubbing out correct answers.”
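To see why that timeline ends in a zero, consider a per-frame consistency check like the sketch below (purely illustrative, not MMGR's evaluator): the clip scores 0 the moment any frame's grid repeats a digit in a row, column, or box.

```python
# Illustrative per-frame Sudoku check: a clip fails as soon as any frame's grid
# repeats a digit in a row, column, or box — the "overwrites a cell and forgets
# the row already has that digit" failure described above.
from typing import List

def grid_is_consistent(grid: List[List[int]], box: int = 2) -> bool:
    """True if no non-zero digit repeats in any row, column, or box (0 = empty)."""
    n = box * box
    units = []
    units += [[grid[r][c] for c in range(n)] for r in range(n)]             # rows
    units += [[grid[r][c] for r in range(n)] for c in range(n)]             # columns
    units += [[grid[br + r][bc + c] for r in range(box) for c in range(box)]
              for br in range(0, n, box) for bc in range(0, n, box)]        # boxes
    return all(len([d for d in u if d]) == len(set(d for d in u if d)) for u in units)

def score_sudoku_clip(frames: List[List[List[int]]]) -> int:
    """Binary Overall score: 0 if any frame violates the rules."""
    return int(all(grid_is_consistent(g) for g in frames))

# A 4x4 clip whose second frame repeats a digit in the same row scores 0 overall.
clip = [
    [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    [[1, 3, 0, 3], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],  # duplicate 3 in row 0
]
print(score_sudoku_clip(clip))  # 0
```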
5.2 ARC-AGI test
Puzzles demand one-shot rule induction from four example pairs.
Leading image model Nano-banana Pro averages 30.5 % across v1 & v2.
Best video model Sora-2 hits 20 % on v1 but collapses to 1.3 % on v2, revealing heavy pattern-memorisation rather than reasoning.
Human audit: Zero samples from Veo-3 passed, although automatic scoring had awarded 4.7 %.
Take-away: If your use-case needs transferable rule learning, video models are not ready.
6. Domain 2: Embodied Navigation – “I arrived, but through a wall”
6.1 Metric stack (same for every nav task)
| Task-complete checks | Physics checks | Instruction checks |
|---|---|---|
| Success Score (2-D or 3-D) | Object Semantic (no collision) | Destination Integrity (goal unchanged) |
| Oracle Score (passed through?) | Agent Consistency (no teleport) | Scene Consistency (static world) |
| Trajectory Alignment (2-D vs 3-D) | Spatial Alignment (facing direction) | — |
Overall = 1 only if all seven checks pass.
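As one example of how a single check in this stack might be computed, here is a rough sketch of an Agent Consistency (“no teleport”) test over tracked 3-D positions. The position extraction and the step threshold are assumptions, not the benchmark's actual procedure.

```python
# Rough sketch of an Agent Consistency ("no teleport") check: flag the clip if the
# agent's tracked position jumps farther between consecutive frames than a plausible
# per-frame step. The positions and threshold are assumptions; MMGR's real evaluator
# may work quite differently.
import math
from typing import List, Tuple

def agent_consistent(positions: List[Tuple[float, float, float]],
                     max_step_m: float = 0.5) -> bool:
    """True if no consecutive pair of 3-D positions is farther apart than max_step_m."""
    return all(math.dist(a, b) <= max_step_m for a, b in zip(positions, positions[1:]))

track = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (4.0, 0.0, 1.5)]  # last step is a teleport
print(agent_consistent(track))  # False
```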
6.2 Panoramic Last-Mile (short walk to red marker)
| Model | Auto Overall | Human Overall |
|---|---|---|
| Veo-3 | 73 % | 25 % |
| Nano-banana | 74 % | — |
Why the 48-point gap?
The automatic scorer tracks camera motion accurately but misses the fleeting wall-clipping that humans spot instantly.
Lesson: if safety matters (robotics, AR), add human review or a collision-volume checker.
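A collision-volume checker of the kind suggested here can be as simple as intersecting the agent's bounding box with wall boxes every frame. The sketch below uses generic axis-aligned boxes and assumes the geometry comes from your own scene reconstruction; it is not part of MMGR itself.

```python
# Generic axis-aligned bounding-box (AABB) collision check of the kind that catches
# the fleeting wall-clips the automatic scorer misses. The box layout
# ((min_x, min_y, min_z), (max_x, max_y, max_z)) and the scene data are assumptions.
from typing import List, Tuple

Box = Tuple[Tuple[float, float, float], Tuple[float, float, float]]

def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes intersect on all three axes."""
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))

def clip_has_wall_clip(agent_boxes: List[Box], wall_boxes: List[Box]) -> bool:
    """True if the agent's box intersects any wall box in any frame."""
    return any(boxes_overlap(agent, wall)
               for agent in agent_boxes for wall in wall_boxes)

walls = [((2.0, 0.0, 0.0), (2.2, 3.0, 2.5))]          # a thin wall slab
agent = [((0.0, 0.0, 0.0), (0.5, 0.5, 1.8)),
         ((1.9, 0.0, 0.0), (2.4, 0.5, 1.8))]          # second frame clips the wall
print(clip_has_wall_clip(agent, walls))  # True
```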
6.3 3-D Multi-Floor (hardest nav)
| Model | Human Overall | Typical failure |
|---|---|---|
| Nano-banana | 79 % | — |
| Veo-3 | 3 % | Jumps down stair-well, floor geometry warps |
| Sora-2 | 0 % | Hallucinates alternate staircase |
Even perfect visual fluidity does not guarantee geometric fidelity.
7. Domain 3: Physical Commonsense – “Looks real, obeys no law”
7.1 Sport vs Concepts difficulty
| Scenario type | Veo-3 | Sora-2 | Wan-2.2 |
|---|---|---|---|
| Sports (ballet, ski, dive, swim) | 60 % | 70 % | 21 % |
| Physical Concepts (collisions, splash) | 42 % | 76 % | 27 % |
Sports clips are easier because training corpora contain abundant human-motion footage; rigid-body collisions are under-represented, which makes solid-solid interactions the hardest sub-category.
7.2 Coffee-grinder failure case (Veo-3)
- Prompt: “Metal grinder crushing coffee beans.”
- Clip shows: whole beans → instant fine powder in one frame.
- Verdict: Physics Accuracy = 0 (no fracture sequence), Motion Quality = 0 (discontinuous).
Humans comment: “Like a magic trick, not a machine.”
8. Automatic vs Human marks – who is stricter?
| Metric | Auto | Human | Pattern |
|---|---|---|---|
| Abstract logic (Sudoku, ARC) | Lenient | Far tougher | Auto misses tiny digit smear |
| Physics plausibility | Strict | Gentler | Auto penalises minor jitter |
| Navigation collision | Blind | Strict | Auto ignores wall-clip |
Practical tip: use automatic scoring for quick filtering and human review as the final gate in production systems.
9. Three bottlenecks (paper’s conclusion, plain words)
- Data diet is lopsided: petabytes of sports footage, only a handful of symbol-rule videos → models learn to “move like an athlete”, not to “solve like a student”.
- Architectures chase pixels: diffusion kernels optimise next-frame photo-realism, not global state consistency → long-range causality gets forgotten.
- Loss functions reward “looks right”: reconstruction loss loves a smooth wall-clip because the pixel difference is tiny; a physics loss term is rarely present.
10. What practitioners can do today
- Add step-by-step visual prompts (intermediate frames or a static blueprint) – reduces temporal drift.
- Fine-tune with synthetic physics-rich data (Bullet-style simulators, Sudoku grids, maze solvers).
- Incorporate an explicit memory buffer (latent scene graph) to keep walls, clues, and targets fixed.
- Validate outputs with MMGR-style unit tests before user-facing deployment – cheaper than brand damage later (see the sketch after this list).
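For that last point, an MMGR-style unit test can live in an ordinary pytest suite as a deployment gate. In the sketch below, `generate_maze_clip` is a hypothetical wrapper around your generator, stubbed so the example runs on its own; swap in your real pipeline and tracking.

```python
# Sketch of an MMGR-style unit test used as a deployment gate (pytest style).
# `generate_maze_clip` is a hypothetical wrapper around a video generator that
# returns per-frame agent cells and wall cells; it is stubbed here so the test runs.
from typing import Dict, List

def generate_maze_clip(prompt: str) -> List[Dict]:
    # Stub standing in for a real model call plus frame parsing.
    walls = {(0, 1)}
    return [{"agent_cell": c, "walls": walls} for c in [(0, 0), (1, 0), (1, 1)]]

def test_maze_clip_obeys_basic_rules():
    frames = generate_maze_clip("solve this 3x3 maze")
    # No frame may place the agent inside a wall cell.
    assert all(f["agent_cell"] not in f["walls"] for f in frames)
    # Walls must stay static across the clip.
    assert all(f["walls"] == frames[0]["walls"] for f in frames)
    # The agent may move at most one cell per frame (no teleporting).
    assert all(
        abs(a["agent_cell"][0] - b["agent_cell"][0]) + abs(a["agent_cell"][1] - b["agent_cell"][1]) <= 1
        for a, b in zip(frames, frames[1:])
    )
```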
11. Limitations & future work (straight from authors)
- Human eval is small-scale (hundreds of clips); a scalable automatic judge is still needed.
- Hard-level tasks saturate at ≈ 30 % even for top models; more diverse symbolic data is required.
- Video-length bias – 16-frame clips may not expose longer-horizon failures; extending to minutes is the next step.
12. Sixty-Second Summary
- MMGR = 1,853 samples, 10 tasks, 5 reasoning skills, binary pass/fail marking.
- Video models fail abstract logic (0 % Sudoku after human check).
- Image models beat video on rule-based jobs, lag on motion smoothness.
- Navigation auto-scores over-rate by 3×; always re-audit critical applications.
- Physics clips: sport > concepts, solid-solid hardest; visual gloss ≠ physical truth.
- Root fix: feed step-wise symbolic data, add physics losses, test with MMGR.

