What MMGR Really Tests: A Plain-English Walk-Through of the Multi-Modal Generative Reasoning Benchmark
> If you just want the takeaway, scroll to the “Sixty-Second Summary” at the end.
> If you want to know why your shiny text-to-video model still walks through walls or fills Sudoku grids with nine 9s in the same row, read on.
1. Why another benchmark?
Existing video scores such as FVD (Fréchet Video Distance) or IS (Inception Score) only ask one question:
“Does the clip look realistic to a frozen image classifier?”
They ignore three bigger questions:
- Is the motion physically possible?
- Does the scene follow logical rules?
- Are objects still in the same place five frames later?
MMGR (Multi-Modal Generative Reasoning) was built to answer those harder questions. It pits both image and video generators against tasks that check five reasoning muscles:
| Skill | Everyday example |
|---|---|
| Physical reasoning | Apple falls down, not up |
| Logical reasoning | If A > B and B > C, then A > C |
| 3-D spatial reasoning | You must climb stairs to reach the second floor |
| 2-D spatial reasoning | On a map, red dot = goal, blue line = path |
| Temporal reasoning | Light the fuse before the explosion |
2. What’s inside the benchmark?
Three domains, ten tasks, 1,853 test samples.
DOMAIN 1 — Abstract Reasoning (no physics, just rules)
- Maze – 240 grids, sizes 3×3 → 13×13
- Sudoku – 300 grids, 4×4 & 9×9, three difficulty bands
- ARC-AGI – 456 visual “IQ” puzzles that require rule induction
- Visual Math – 327 word problems from grade-school (GSM8K) to Olympiad (Omni-MATH)
DOMAIN 2 — Embodied Navigation (move through space)
- Panoramic Last-Mile Nav – 360° photo, short walk to a visible target
- Top-Down Real-World Nav – bird's-eye floor plan, long path
- 3-D Real-World Nav – doll-house view, multi-floor, stairs
- SLAG – Simultaneous Localization & Generation: walk in 3-D while drawing your own 2-D map in real time
DOMAIN 3 — Physical Commonsense (how the world works)
- Physical Concepts – 25 clips: ball collisions, water splash, ink mixing
- Sports Scenarios – 25 clips: ballet fouetté, ski jump, diving, swimming
3. How harsh is the marking?
Every task uses binary (0/1) fine-grained metrics. A sample scores “1” only when ALL sub-checks pass. There is no partial credit.
Example for a maze video:
- Does the green dot start on the green square?
- Does it reach the red square?
- Does it avoid every black wall?
- Do all walls stay in place?
- Is the path continuous?
If any check fails → Overall = 0.
That is why headline numbers look brutal.
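To make the all-or-nothing marking concrete, here is a minimal sketch of how such a score could be aggregated in code. The sub-checks mirror the maze list above; the frame representation and the check implementations are illustrative assumptions, not the benchmark's actual evaluator.

```python
# Minimal sketch of MMGR-style all-or-nothing scoring for a maze clip.
# Each sub-check returns True/False; Overall is 1 only if every check passes.
# The check names mirror the list above; their implementations are hypothetical.
from typing import Callable, Dict, List

def score_maze_clip(frames: List[dict], checks: Dict[str, Callable[[List[dict]], bool]]) -> int:
    """Return 1 only if every binary sub-check passes, else 0 (no partial credit)."""
    return int(all(check(frames) for check in checks.values()))

# Example wiring with deliberately simple stand-in checks.
checks = {
    "starts_on_green": lambda frames: frames[0]["agent_cell"] == frames[0]["start_cell"],
    "reaches_red":     lambda frames: frames[-1]["agent_cell"] == frames[-1]["goal_cell"],
    "never_hits_wall": lambda frames: all(f["agent_cell"] not in f["walls"] for f in frames),
    "walls_static":    lambda frames: all(f["walls"] == frames[0]["walls"] for f in frames),
    "path_continuous": lambda frames: all(
        abs(a["agent_cell"][0] - b["agent_cell"][0]) + abs(a["agent_cell"][1] - b["agent_cell"][1]) <= 1
        for a, b in zip(frames, frames[1:])
    ),
}

frames = [
    {"agent_cell": (0, 0), "start_cell": (0, 0), "goal_cell": (1, 1), "walls": {(0, 1)}},
    {"agent_cell": (1, 0), "start_cell": (0, 0), "goal_cell": (1, 1), "walls": {(0, 1)}},
    {"agent_cell": (1, 1), "start_cell": (0, 0), "goal_cell": (1, 1), "walls": {(0, 1)}},
]
print(score_maze_clip(frames, checks))  # 1 only because every sub-check passes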
4. Models that sat the exam
| Video generators | Image generators |
|---|---|
| Veo-3 (Google) | Nano-banana |
| Sora-2 (OpenAI) | Nano-banana Pro |
| Wan-2.2 (Wan) | GPT-4o-image |
| — | Qwen-image |
All tests were done zero-shot with default API settings.
The judge was Gemini-2.5-Pro, calibrated against six human annotators.
5. Domain 1: Abstract Reasoning – “Why the correct final answer still earns a zero”
5.1 Sudoku test
Video models have high Action Reflection (they happily edit digits frame-by-frame) yet finish with near-zero Overall success.
| Model | Overall (4×4 easy) | Human re-check |
|---|---|---|
| Veo-3 | 11.4 % | 0 % |
| Sora-2 | 0 % | 0 % |
| Nano-banana Pro (image) | 66 % | — |
Root cause timeline:
- Frame 2 – writes “3”
- Frame 5 – overwrites the same cell with “8”
- Frame 8 – forgets the row already has an “8”
- Frame 12 – grid violates a rule → Overall = 0
Humans call the video “a student who keeps rubbing out correct answers.”
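To see why that timeline ends in a zero, consider a per-frame consistency check like the sketch below (purely illustrative, not MMGR's evaluator): the clip scores 0 the moment any frame's grid repeats a digit in a row, column, or box.

```python
# Illustrative per-frame Sudoku check: a clip fails as soon as any frame's grid
# repeats a digit in a row, column, or box — the "overwrites a cell and forgets
# the row already has that digit" failure described above.
from typing import List

def grid_is_consistent(grid: List[List[int]], box: int = 2) -> bool:
    """True if no non-zero digit repeats in any row, column, or box (0 = empty)."""
    n = box * box
    units = []
    units += [[grid[r][c] for c in range(n)] for r in range(n)]             # rows
    units += [[grid[r][c] for r in range(n)] for c in range(n)]             # columns
    units += [[grid[br + r][bc + c] for r in range(box) for c in range(box)]
              for br in range(0, n, box) for bc in range(0, n, box)]        # boxes
    return all(len([d for d in u if d]) == len(set(d for d in u if d)) for u in units)

def score_sudoku_clip(frames: List[List[List[int]]]) -> int:
    """Binary Overall score: 0 if any frame violates the rules."""
    return int(all(grid_is_consistent(g) for g in frames))

# A 4x4 clip whose second frame repeats a digit in the same row scores 0 overall.
clip = [
    [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    [[1, 3, 0, 3], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],  # duplicate 3 in row 0
]
print(score_sudoku_clip(clip))  # 0
```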
5.2 ARC-AGI test
Puzzles demand one-shot rule induction from four example pairs.
Leading image model Nano-banana Pro averages 30.5 % across v1 & v2.
Best video model Sora-2 hits 20 % on v1 but collapses to 1.3 % on v2, revealing heavy pattern-memorisation rather than reasoning.
Human audit: Zero samples from Veo-3 passed, although automatic scoring had awarded 4.7 %.
Take-away: If your use-case needs transferable rule learning, video models are not ready.
6. Domain 2: Embodied Navigation – “I arrived, but through a wall”
6.1 Metric stack (same for every nav task)
| Task-complete checks | Physics checks | Instruction checks |
|---|---|---|
| Success Score (2-D or 3-D) | Object Semantic (no collision) | Destination Integrity (goal unchanged) |
| Oracle Score (passed through?) | Agent Consistency (no teleport) | Scene Consistency (static world) |
| Trajectory Alignment (2-D vs 3-D) | Spatial Alignment (facing direction) | — |
Overall = 1 only if all seven checks pass.
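As one example of how a single check in this stack might be computed, here is a rough sketch of an Agent Consistency (“no teleport”) test over tracked 3-D positions. The position extraction and the step threshold are assumptions, not the benchmark's actual procedure.

```python
# Rough sketch of an Agent Consistency ("no teleport") check: flag the clip if the
# agent's tracked position jumps farther between consecutive frames than a plausible
# per-frame step. The positions and threshold are assumptions; MMGR's real evaluator
# may work quite differently.
import math
from typing import List, Tuple

def agent_consistent(positions: List[Tuple[float, float, float]],
                     max_step_m: float = 0.5) -> bool:
    """True if no consecutive pair of 3-D positions is farther apart than max_step_m."""
    return all(math.dist(a, b) <= max_step_m for a, b in zip(positions, positions[1:]))

track = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (4.0, 0.0, 1.5)]  # last step is a teleport
print(agent_consistent(track))  # False
```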
6.2 Panoramic Last-Mile (short walk to red marker)
| Model | Auto Overall | Human Overall |
|---|---|---|
| Veo-3 | 73 % | 25 % |
| Nano-banana | 74 % | — |
Why the 48-point gap?
The automatic scorer tracks camera motion accurately but misses the fleeting wall-clipping that humans spot instantly.
Lesson: if safety matters (robotics, AR), add human review or a collision-volume checker.
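A collision-volume checker of the kind suggested here can be as simple as intersecting the agent's bounding box with wall boxes every frame. The sketch below uses generic axis-aligned boxes and assumes the geometry comes from your own scene reconstruction; it is not part of MMGR itself.

```python
# Generic axis-aligned bounding-box (AABB) collision check of the kind that catches
# the fleeting wall-clips the automatic scorer misses. The box layout
# ((min_x, min_y, min_z), (max_x, max_y, max_z)) and the scene data are assumptions.
from typing import List, Tuple

Box = Tuple[Tuple[float, float, float], Tuple[float, float, float]]

def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes intersect on all three axes."""
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))

def clip_has_wall_clip(agent_boxes: List[Box], wall_boxes: List[Box]) -> bool:
    """True if the agent's box intersects any wall box in any frame."""
    return any(boxes_overlap(agent, wall)
               for agent in agent_boxes for wall in wall_boxes)

walls = [((2.0, 0.0, 0.0), (2.2, 3.0, 2.5))]          # a thin wall slab
agent = [((0.0, 0.0, 0.0), (0.5, 0.5, 1.8)),
         ((1.9, 0.0, 0.0), (2.4, 0.5, 1.8))]          # second frame clips the wall
print(clip_has_wall_clip(agent, walls))  # True
```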
6.3 3-D Multi-Floor (hardest nav)
| Model | Human Overall | Typical failure |
|---|---|---|
| Nano-banana | 79 % | — |
| Veo-3 | 3 % | Jumps down stair-well, floor geometry warps |
| Sora-2 | 0 % | Hallucinates alternate staircase |
Even perfect visual fluidity does not guarantee geometric fidelity.
7. Domain 3: Physical Commonsense – “Looks real, obeys no law”
7.1 Sport vs Concepts difficulty
| Scenario type | Veo-3 | Sora-2 | Wan-2.2 |
|---|---|---|---|
| Sports (ballet, ski, dive, swim) | 60 % | 70 % | 21 % |
| Physical Concepts (collisions, splash) | 42 % | 76 % | 27 % |
Sports clips are easier because training corpora contain abundant human-motion footage; rigid-body collisions are under-represented, which makes solid-solid interactions the hardest sub-category.
7.2 Coffee-grinder failure case (Veo-3)
- Prompt: “Metal grinder crushing coffee beans.”
- Clip shows: whole beans → instant fine powder in one frame.
- Verdict: Physics Accuracy = 0 (no fracture sequence), Motion Quality = 0 (discontinuous).
Humans comment: “Like a magic trick, not a machine.”
8. Automatic vs Human marks – who is stricter?
| Metric | Auto | Human | Pattern |
|---|---|---|---|
| Abstract logic (Sudoku, ARC) | Lenient | Far tougher | Auto misses tiny digit smear |
| Physics plausibility | Strict | Gentler | Auto penalises minor jitter |
| Navigation collision | Blind | Strict | Auto ignores wall-clip |
Practical tip: use automatic scoring for quick filtering and human review as the final gate in production systems.
9. Three bottlenecks (paper’s conclusion, plain words)
- Data diet is lopsided: petabytes of sports footage, only a handful of symbol-rule videos → models learn to “move like an athlete”, not to “solve like a student”.
- Architectures chase pixels: diffusion kernels optimise next-frame photo-realism, not global state consistency → long-range causality gets forgotten.
- Loss functions reward “looks right”: reconstruction loss loves a smooth wall-clip because the pixel difference is tiny; a physics loss term is rarely present.
10. What practitioners can do today
- Add step-by-step visual prompts (intermediate frames or a static blueprint) – reduces temporal drift.
- Fine-tune with synthetic physics-rich data (Bullet-style simulators, Sudoku grids, maze solvers).
- Incorporate an explicit memory buffer (latent scene graph) to keep walls, clues, and targets fixed.
- Validate outputs with MMGR-style unit tests before user-facing deployment – cheaper than brand damage later (see the sketch after this list).
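For that last point, an MMGR-style unit test can live in an ordinary pytest suite as a deployment gate. In the sketch below, `generate_maze_clip` is a hypothetical wrapper around your generator, stubbed so the example runs on its own; swap in your real pipeline and tracking.

```python
# Sketch of an MMGR-style unit test used as a deployment gate (pytest style).
# `generate_maze_clip` is a hypothetical wrapper around a video generator that
# returns per-frame agent cells and wall cells; it is stubbed here so the test runs.
from typing import Dict, List

def generate_maze_clip(prompt: str) -> List[Dict]:
    # Stub standing in for a real model call plus frame parsing.
    walls = {(0, 1)}
    return [{"agent_cell": c, "walls": walls} for c in [(0, 0), (1, 0), (1, 1)]]

def test_maze_clip_obeys_basic_rules():
    frames = generate_maze_clip("solve this 3x3 maze")
    # No frame may place the agent inside a wall cell.
    assert all(f["agent_cell"] not in f["walls"] for f in frames)
    # Walls must stay static across the clip.
    assert all(f["walls"] == frames[0]["walls"] for f in frames)
    # The agent may move at most one cell per frame (no teleporting).
    assert all(
        abs(a["agent_cell"][0] - b["agent_cell"][0]) + abs(a["agent_cell"][1] - b["agent_cell"][1]) <= 1
        for a, b in zip(frames, frames[1:])
    )
```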
11. Limitations & future work (straight from authors)
- Human eval is small-scale (hundreds of clips); a scalable automatic judge is still needed.
- Hard-level tasks saturate at ≈ 30 % even for top models; more diverse symbolic data is required.
- Video-length bias – 16-frame clips may not expose longer-horizon failures; extending to minutes is the next step.
12. Sixty-Second Summary
- MMGR = 1,853 samples, 10 tasks, 5 reasoning skills, binary pass/fail marking.
- Video models fail abstract logic (0 % Sudoku after human check).
- Image models beat video on rule-based jobs, lag on motion smoothness.
- Navigation auto-scores over-rate by 3×; always re-audit critical applications.
- Physics clips: sport > concepts, solid-solid hardest; visual gloss ≠ physical truth.
- Root fix: feed step-wise symbolic data, add physics losses, test with MMGR.

