PAN: When Video Generation Models Learn to “Understand” the World—A Deep Dive into MBZUAI’s Long-Horizon Interactive World Model
You’ve probably seen those breathtaking AI video generation tools: feed them “a drone flying over a city at sunset,” and you get a cinematic clip. But ask them to “keep flying—turn left at the river, then glide past the stadium lights,” and they’ll likely freeze. Why? Because most systems are just “drawing storyboards,” not “understanding worlds.” They can render visuals but cannot maintain an internal world state that evolves over time, responds to external actions, and stays logically consistent. They predict frames, not world dynamics.
This is precisely where 「world models」 come in. Their goal isn’t just to generate realistic observations but to build an internal, coherent causal dynamics system that enables agents to predict, reason, and plan. Recently, the 「PAN」 model from the Institute of Foundation Models (IFM) at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) marks a critical step forward. It’s not merely a video generator—it’s a general, interactive, long-horizon world model capable of simulating the evolution of the world through natural language actions, much like humans imagine the future.
In this article, we’ll peel back the layers of PAN’s technical core, examining how it breaks through the limitations of existing video models and what changes it might bring to real-world applications.
World Models vs Video Generation: Why a New Paradigm?
To understand PAN’s value, we must first clarify the fundamental difference between world models and video generation models.
The Achilles’ Heel of Video Generation Models
Recent advances in diffusion-based video generation—like Wan2.1, KLING, and Gen-3—are stunning. But they share a common operational pattern: 「open-loop generation from prompt to full video」. You provide an initial image or text description, and they generate a complete video clip in one go, with no real-time causal control or ability to accept new instructions mid-generation.
This creates three fatal flaws:
- 「No state concept」: The model lacks an explicit “world state” representation to track objects, agents, or environmental attributes over time.
- 「Non-interactive」: Once generation begins, you can’t change your mind mid-way; the model won’t respond to new action commands.
- 「Long-horizon consistency collapse」: In extended sequences, cumulative errors and temporal drift cause rapid quality degradation, with object identities and spatial relationships gradually breaking down.
In short, they excel at “making movies” but fail at “simulating worlds.”
World Model’s Mission: From “Rendering” to “Reasoning”
World models aim for something entirely different. They must build an internal simulator allowing agents to:
- 「Imagine」: Project possible futures from the current state
- 「Counterfactual reasoning」: “What if I had turned left?”
- 「Plan」: Test actions in a simulated world to find optimal sequences for goal achievement
This demands:
- 「Action-conditioned generation」: Each step updates the state based on explicit action input
- 「Long-horizon causal consistency」: State evolution remains logically coherent across many steps
- 「Interactivity」: Accepts new instructions at any time, updating the simulation in real time
PAN was designed precisely to solve this core challenge.
PAN’s Core Idea: The Generative Latent Prediction (GLP) Architecture
PAN’s foundation is its innovative 「Generative Latent Prediction (GLP)」 architecture. It attempts to unify two seemingly contradictory capabilities:
- 「Abstract causal reasoning in latent space」 (imagination)
- 「Realistic perceptual realization in observation space」 (realism)
GLP doesn’t treat uncertainty as a training obstacle—it incorporates it into modeling, recognizing that coherent simulation often requires generating novel viewpoints beyond direct observation. By framing world modeling as hierarchical generative prediction, GLP provides a principled foundation for developing general, interactive, causally grounded world models.
The GLP Triad: Encoder-Predictor-Decoder
GLP explicitly defines three core components forming a generative process:
- 「Encoder h」: Maps observation o_t (an image/video frame) to latent representation ŝ_t, summarizing the multimodal world into a compact state: ŝ_t ∼ p_h(·|o_t)
- 「Predictive Module f」: Simulates latent world dynamics, evolving the state under action a_t: ŝ_{t+1} ∼ p_f(·|ŝ_t, a_t)
- 「Decoder g」: Reconstructs observable outcomes from latent states, anchoring the simulation back to sensory space: ô_{t+1} ∼ p_g(·|ŝ_{t+1})
These jointly define the generative process for the next observation given the current observation and action, marginalizing over the latent states:
p_PAN(o_{t+1} | o_t, a_t) = Σ_{ŝ_t, ŝ_{t+1}} p_h(ŝ_t | o_t) · p_f(ŝ_{t+1} | ŝ_t, a_t) · p_g(o_{t+1} | ŝ_{t+1})
The key insight: 「Supervising latent prediction through observation reconstruction」 ensures each latent transition maps to realizable sensory changes, avoiding collapse and indefinability in latent space.
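To make the triad concrete, here is a minimal rollout sketch in Python; `encoder`, `predictor`, and `decoder` are placeholder callables standing in for h, f, and g (not PAN's released modules), and sampling is collapsed into deterministic calls for brevity.

```python
def simulate(encoder, predictor, decoder, o_0, actions):
    """Closed-loop GLP-style rollout: encode once, then evolve entirely in latent space."""
    s = encoder(o_0)                     # ŝ_t ~ p_h(· | o_t): compress the observation into a state
    observations = []
    for a in actions:
        s = predictor(s, a)              # ŝ_{t+1} ~ p_f(· | ŝ_t, a_t): latent dynamics under the action
        observations.append(decoder(s))  # ô_{t+1} ~ p_g(· | ŝ_{t+1}): ground the state back in pixels
    return observations
```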
Why Does GLP Beat Pure Latent-Space Prediction?
An important comparison: 「Joint Embedding Predictive Architecture (JEPA)」. JEPA only matches in latent space, minimizing distance between encoded consecutive observations:
L_JEPA = ||f(h(o_t), a_t) - h(o_{t+1})||
This has critical flaws: Models may collapse all observations to a constant vector, learning invariant transitions that don’t correspond to plausible world dynamics. Even DINO-WM, which uses pre-trained features, only mitigates collapse but doesn’t address ungrounded transitions.
PAN’s 「generative supervision」 forces reconstruction in sensory space, ensuring learned dynamics are both semantically valid and physically realizable, fundamentally stabilizing optimization.
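A toy PyTorch comparison illustrates the point (this is a deliberately degenerate construction of my own, not an experiment from the paper): a collapsed, constant encoder drives the JEPA objective to zero, while a generative reconstruction objective still penalizes it.

```python
import torch

o_t, o_next = torch.randn(8, 64), torch.randn(8, 64)   # two batches of "observations"

h = lambda o: torch.zeros(o.shape[0], 16)   # collapsed encoder: ignores its input entirely
f = lambda s, a=None: s                     # trivial "dynamics": identity map
g = lambda s: torch.zeros(s.shape[0], 64)   # decoder forced to output a constant

jepa_loss = (f(h(o_t)) - h(o_next)).pow(2).mean()   # exactly 0: collapse goes unpunished
glp_loss = (g(f(h(o_t))) - o_next).pow(2).mean()    # stays large: must actually explain o_next
print(jepa_loss.item(), glp_loss.item())
```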
Model Implementation: From Theory to Engineering
PAN’s implementation leverages proven large-scale components with targeted modifications.
1. Vision Encoder
PAN uses the vision tower from 「Qwen2.5-VL-7B-Instruct」—a Vision Transformer (ViT) optimized for high-resolution sequential inputs:
- Frames are split into 14×14 spatial patches
- Windowed self-attention keeps computation efficient
- 2D rotary position embeddings preserve spatial structure
- For video, 3D patch grouping enables native short-term motion representation
The encoder transforms raw pixels into structured visual tokens that retain spatiotemporal organization for downstream modules.
2. Autoregressive World Model Backbone
PAN’s “brain” ensures long-term consistency. Built on Qwen2.5-VL-7B-Instruct’s language model, it performs autoregressive prediction in a unified multimodal latent space.
「Operation」: At each timestep t, it receives:
- The estimated world state ŝ_t (visual embeddings)
- The natural language action a_t
- 256 learnable query embeddings
Inputs follow a multi-turn conversational format, aligning with the VLM’s pretrained dialogue structure:
<|user|><video state 1><action 1>
<|assistant|><query embedding*256>
<|user|><video state 2><action 2>
<|assistant|><query embedding*256>
...
During training, teacher-forcing uses ground-truth states; during inference, closed-loop rollouts feed predictions back recursively for long-horizon simulation.
「Output」: 256 continuous tokens representing the next latent state ŝ_{t+1}. These function as an associative memory, preserving global consistency with semantic grounding from the VLM.
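A closed-loop rollout with this backbone might look like the following sketch; `backbone`, `decode_video`, and the dictionary-based dialogue history are placeholder interfaces of my own, not the released API.

```python
def closed_loop_rollout(backbone, decode_video, query_embeddings, s_t, actions):
    """Autoregressive world-state rollout: each prediction is fed back as the next input state."""
    history, frames = [], []
    for a in actions:
        history.append({"role": "user", "state": s_t, "action": a})
        # The backbone attends over the dialogue history plus the 256 query embeddings
        # and emits 256 continuous tokens as the next latent state ŝ_{t+1}.
        s_next = backbone(history, query_embeddings)
        history.append({"role": "assistant", "state": s_next})
        frames.append(decode_video(s_next, a))   # diffusion decoder renders the observation
        s_t = s_next                             # closed loop: the prediction becomes the new state
    return frames
```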
3. Video Diffusion Decoder
PAN’s “rendering engine” transforms latent states into perceptually detailed, temporally coherent visuals. It is adapted from 「Wan2.1-T2V-14B」 and extended with 「Causal Swin-DPM」.
Flow Matching Objective
The decoder is trained with a rectified-flow objective:
x_k = k*x_1 + (1-k)*x_0
v_k = x_1 - x_0
where x_1 is the true observation, x_0 is Gaussian noise, and k is the interpolation (noise) level. The model learns to predict the velocity field v_k over a 1000-step denoising schedule.
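In code, the rectified-flow training target reduces to a few lines; the sketch below is illustrative (tensor shapes and the velocity network `model` are assumptions, not PAN's implementation).

```python
import torch

def flow_matching_loss(model, x1, cond):
    """x1: clean latent video; cond: action/state conditioning."""
    x0 = torch.randn_like(x1)                # Gaussian-noise endpoint of the straight path
    k = torch.rand(x1.shape[0], 1, 1, 1)     # interpolation (noise) level in [0, 1]
    xk = k * x1 + (1 - k) * x0               # x_k = k*x_1 + (1-k)*x_0
    vk = x1 - x0                             # target velocity field v_k
    v_pred = model(xk, k, cond)              # network predicts the velocity at level k
    return (v_pred - vk).pow(2).mean()
```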
Conditioning on Actions and States
Two conditioning inputs:
- 「Latent world state」: Linearly projected into a new cross-attention stream, zero-initialized for stable training
- 「Action text」: Encoded via umT5 and fed into the original text cross-attention path
In each transformer block, the two streams’ outputs are summed, enabling integration of global state with action-specific visual changes while maintaining local temporal consistency.
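The dual-stream conditioning can be sketched as a transformer block with two cross-attention paths whose outputs are summed; module names, dimensions, and the exact placement of the zero-initialized projection are my assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class DualCondBlock(nn.Module):
    """Sketch of a diffusion-transformer block conditioned on action text and latent world state."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.text_xattn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.state_xattn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.state_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.state_proj.weight)   # zero-init: the new stream starts as a no-op
        nn.init.zeros_(self.state_proj.bias)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, text_tokens, state_tokens):
        x = x + self.self_attn(x, x, x)[0]
        text_out = self.text_xattn(x, text_tokens, text_tokens)[0]      # action text (umT5 path)
        state_out = self.state_xattn(x, state_tokens, state_tokens)[0]  # latent world-state path
        x = x + text_out + self.state_proj(state_out)                   # the two streams are summed
        return self.norm(x)
```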
Breaking the Long-Horizon Bottleneck: Causal Swin-DPM
Long-horizon simulation’s biggest enemies are 「error accumulation」 and 「inter-chunk discontinuity」. Traditional methods condition only on the last frame, causing:
- Abrupt transitions without full denoising-trajectory context
- Rapid amplification of minor artifacts
PAN’s solution: 「Causal Swin-DPM」 augments sliding-window denoising with 「chunk-wise causal attention」.
How the Sliding Window Works
The decoder holds 「two video chunks at different noise levels」 simultaneously:
- Earlier chunk: noise level K/2 (partially denoised)
- Later chunk: full noise level K
After K/2 further denoising steps, the earlier chunk is fully denoised and is dequeued as output, while a new fully noisy chunk is enqueued at the end of the window. Each chunk is conditioned on its corresponding natural language action.
This lets later chunks 「see full context」 from earlier chunks (not just final frames), mitigating inter-chunk errors significantly.
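The queue mechanics can be summarized in pseudocode; `denoise_half_window` and `new_noise_chunk` are hypothetical helpers, and K is the total number of denoising steps.

```python
from collections import deque

def rollout(actions, K, denoise_half_window, new_noise_chunk):
    """Conceptual Causal Swin-DPM sliding window: an earlier chunk at level K/2, a later one at K."""
    window = deque([{"x": new_noise_chunk(), "level": K, "action": actions[0]}])
    # Warm-up: bring the first chunk down to level K/2 before enqueuing the next one.
    denoise_half_window(window, steps=K // 2)
    outputs = []
    for action in actions[1:]:
        window.append({"x": new_noise_chunk(), "level": K, "action": action})
        # Jointly denoise for K/2 steps; the later chunk attends to the earlier chunk's
        # partially denoised trajectory (chunk-wise causal attention).
        denoise_half_window(window, steps=K // 2)
        outputs.append(window.popleft()["x"])   # earlier chunk reaches level 0: emit it
    return outputs
```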
Causal Inference and Real-Time Interaction
To ensure real-time interactivity, 「chunk-wise causal attention masks」 prevent later chunks from attending to future actions, avoiding information leakage while allowing generation to begin without waiting for subsequent actions.
A key technique is 「noise augmentation on conditioning frames」. Instead of conditioning on perfectly sharp frames, light Gaussian noise (at noise level k = 0.055) is added to them. This suppresses incidental pixel details, forcing the model to focus on stable structures (objects, layout) rather than unreliable textures.
Training Adaptations
During training, the earlier chunk’s noise level k is sampled from [0, 0.5] and the later chunk uses k + 0.5; only the very first chunk of a sequence samples its level from the full range [0, 1]. This teaches the model to handle partially noisy histories, improving long-sequence generalization.
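A sketch of this sampling scheme, under my reading of the description (not the official training code):

```python
import random

def sample_chunk_noise_levels(is_first_chunk_of_sequence: bool):
    if is_first_chunk_of_sequence:
        # The very first chunk has no partially denoised predecessor,
        # so it samples its noise level from the full range.
        return random.uniform(0.0, 1.0), None
    k = random.uniform(0.0, 0.5)   # earlier chunk: the partially denoised regime
    return k, k + 0.5              # later chunk: offset by half of the schedule
```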
Two-Stage Training Strategy: Divide and Conquer
Training a general world model requires careful strategy. PAN uses 「staged training」, letting modules develop stable capabilities before joint optimization.
Stage 1: Module-Wise Training
- 「Vision encoder & backbone」: Based on the extensively pretrained Qwen2.5-VL-7B-Instruct, requiring no additional adaptation
- 「Video decoder」: Adapts Wan2.1-T2V-14B into the Causal Swin-DPM architecture
This stage freezes Wan-VAE and text encoders, precomputing latent features. Using Hybrid Sharded Data Parallel (HSDP) across 960 NVIDIA H200 GPUs, it trains for 5 epochs with BF16 precision, AdamW optimizer (lr=1e-5), gradient clipping (0.05), and FlashAttention-3 acceleration.
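The reported Stage-1 optimization settings translate into a setup along these lines (a minimal sketch with a stand-in module; the HSDP sharding, BF16 autocast, and FlashAttention-3 integration are omitted):

```python
import torch
from torch.optim import AdamW

decoder = torch.nn.Linear(8, 8)                       # stand-in for the video diffusion decoder
optimizer = AdamW(decoder.parameters(), lr=1e-5)      # reported learning rate

def training_step(loss):
    loss.backward()
    torch.nn.utils.clip_grad_norm_(decoder.parameters(), max_norm=0.05)  # reported gradient clipping
    optimizer.step()
    optimizer.zero_grad()
```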
Stage 2: Joint Training
Stage 2 integrates the modules under a unified generative objective:
L_GLP = E[disc(g∘f(h(o_t), a_t), o_{t+1})]
where disc(·, ·) is a discrepancy measure between the simulated and ground-truth next observation.
Crucially: 「The vision-language model remains frozen」 while only query embeddings and the video diffusion decoder are trained. This lets the decoder learn to interpret the backbone’s compact state representations while the backbone adapts to produce states optimal for guiding high-fidelity simulation.
Sequence parallelism and Ulysses-style attention sharding handle long contexts. Though scheduled for 5 epochs, early stopping after 1 epoch upon validation convergence prevents overfitting.
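The Stage-2 parameter partition amounts to freezing the backbone and optimizing only the query embeddings and decoder, roughly as follows (module names are placeholders, not PAN's code):

```python
import torch
import torch.nn as nn

vlm_backbone = nn.TransformerEncoderLayer(d_model=64, nhead=4)  # stand-in for Qwen2.5-VL-7B
video_decoder = nn.Linear(64, 64)                               # stand-in for the diffusion decoder
query_embeddings = nn.Parameter(torch.randn(256, 64))           # 256 learnable query tokens

for p in vlm_backbone.parameters():
    p.requires_grad_(False)                                     # the backbone stays frozen

optimizer = torch.optim.AdamW([query_embeddings, *video_decoder.parameters()], lr=1e-5)
```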
Training Data: Building an Action-Video Aligned Corpus
A model’s capability ceiling is determined by data. PAN’s data construction demonstrates extreme dedication to 「temporal dynamics」.
Data Collection and Segmentation
Sources are diverse, public videos covering: 「daily activities, human-object interactions, natural environments, multi-agent scenarios」. Long videos are segmented via dynamic shot-boundary detection, merging adjacent similar segments to ensure coherent events. Final clips are filtered for appropriate duration and visual quality.
Three-Tiered Filtering Pipeline
「1. Rule-Based Filters (Efficient, Scalable)」
- 「Extreme static/overly dynamic」: Optical flow, edge differences, and luminance variation remove frozen scenes or flickering/hard cuts
- 「Trivial motion/pure color」: Sparse feature tracking filters out uniform camera motions; fade-in/out solid-color frames are excluded (a minimal sketch of such a rule-based filter follows this list)
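A minimal sketch of such a rule-based motion filter, using dense optical flow (thresholds and the exact statistics are illustrative, not the paper's actual pipeline):

```python
import cv2
import numpy as np

def motion_score(frames):
    """Mean optical-flow magnitude across consecutive frames (frames: list of BGR arrays)."""
    mags = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        cur = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
        prev = cur
    return float(np.mean(mags))

def keep_clip(frames, low=0.05, high=20.0):
    score = motion_score(frames)
    return low < score < high   # drop frozen scenes and flickering / hard-cut clips
```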
「2. Pretrained Model Filters」
- 「Low aesthetic quality」: A pretrained aesthetic scorer filters out low-quality clips
- 「Obstructive text」: Scene-text detectors remove clips with large subtitles/watermarks
「3. Custom VLM Filter (Fine-Grained)」
A specially trained VLM detects:
- Lecture-style videos (talking heads without actions)
- Text-dominated content
- Screen recordings/noisy screenshots
- Severe blur/compression artifacts
- Heavily edited clips with transitions/effects
- Residual scene cuts
Dense Video Captioning: Focus on Temporal Dynamics
Original videos lack or only have coarse descriptions. PAN uses a VLM to generate 「dense, temporally grounded captions」 emphasizing 「motion, events, environmental changes, and new object emergence」 rather than static backgrounds. This ensures each clip pairs with text highlighting underlying causal structures that drive world model state transitions.
How to Evaluate a World Model? From Frame Quality to Simulation Capability
Traditional video metrics (FID, IS) focus only on short-term visual fidelity, failing to capture core world modeling competencies: 「causal dynamics simulation, long-term consistency, reasoning support」.
PAN’s paper proposes a 「three-dimensional evaluation framework」:
Dimension 1: Action Simulation Fidelity
Measures how faithfully the model simulates language-specified actions and their causal consequences. GPT-4o generates multiple feasible action sequences; each model simulates the corresponding rollouts, which a VLM judge scores on 「action faithfulness and precision」.
Two sub-tasks:
- 「Agent Simulation」: Drive controllable entities per instructions while maintaining background stability
- 「Environment Simulation」: Scene-level interventions (add/remove objects, change weather/lighting) that test the accuracy of scene dynamics
Dimension 2: Long-Horizon Forecast
Evaluates coherent, high-quality rollouts over extended action sequences:
- 「Transition Smoothness」: Dense optical flow yields frame-wise velocity/acceleration, and the score is the inverse exponential of the acceleration magnitude; higher scores indicate continuous motion across step boundaries (a sketch of this metric follows this list)
- 「Simulation Consistency」: WorldScore-inspired metrics monitor degradation over extended sequences, with progressive penalties on later steps to emphasize temporal robustness
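A sketch of the transition-smoothness computation (my reading of the description; the paper's exact normalization may differ):

```python
import numpy as np

def transition_smoothness(flows):
    """flows: list of (H, W, 2) dense optical-flow fields between consecutive frames."""
    velocity = np.array([np.linalg.norm(f, axis=-1).mean() for f in flows])  # per-frame motion
    acceleration = np.abs(np.diff(velocity))                                 # frame-to-frame change
    return float(np.exp(-acceleration.mean()))   # higher = smoother motion across chunk boundaries
```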
Dimension 3: Simulative Reasoning and Planning
Tests if a world model serves as an internal simulator for decision-making:
- 「Step-Wise Simulation」: Predict the immediate consequences of single actions in manipulation contexts, selecting the correct next observation from distractors
- 「Open-Ended Simulation & Planning」: On the Agibot dataset, a VLM agent proposes actions, PAN simulates the outcomes, and the agent selects the action maximizing goal progress (a planning-loop sketch follows this list)
- 「Structured Simulation & Planning」: Precise language-grounded manipulation on the Language Table dataset
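The simulate-then-select loop can be sketched as follows; `vlm_agent`, `pan_world_model`, and `vlm_judge` are placeholder interfaces, not APIs from the release.

```python
def plan_step(observation, goal, vlm_agent, pan_world_model, vlm_judge, n_candidates=4):
    """Propose candidate actions, imagine their outcomes with the world model, pick the best."""
    candidates = vlm_agent.propose_actions(observation, goal, n=n_candidates)
    scored = []
    for action in candidates:
        imagined = pan_world_model.simulate(observation, action)    # rollout in imagination
        scored.append((vlm_judge.progress(imagined, goal), action))
    return max(scored, key=lambda pair: pair[0])[1]                 # action with the highest progress
```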
Results: PAN Achieves State-of-the-Art in Open-Source
Figure 5 of the paper presents the comprehensive results. Overall, 「PAN achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems」.
Action Simulation Fidelity
PAN reaches 「70.3%」 on agent simulation and 「47.0%」 on environment simulation, for an overall 「58.6%」. It surpasses all open-source baselines (Cosmos, Wan, V-JEPA) and most commercial systems (KLING, MiniMax), demonstrating superior modeling of multi-step action-effect dynamics.
Long-Horizon Forecast
PAN scores 「53.6%」 on Transition Smoothness and 「64.1%」 on Simulation Consistency, substantially exceeding all baselines including commercial models. This validates Causal Swin-DPM’s effectiveness in suppressing motion drift.
Simulative Reasoning & Planning
- 「Step-Wise Simulation」: 「56.1%」 accuracy, the best among open-source world models
- 「Open-Ended Planning」: a 「26.7%」 improvement over a VLM-only agent
- 「Structured Planning」: a 「23.4%」 improvement
Key finding: 「Realistic appearance alone is insufficient for planning—reliable causal grounding is essential」. PAN’s coherent state maintenance makes rollouts reliable enough for iterative decision-making.
Comparative Landscape: PAN vs Alternatives
| Dimension | 「PAN」 | 「Cosmos video2world WFM」 | 「Wan2.1 T2V 14B」 | 「V-JEPA 2」 |
|---|---|---|---|---|
| 「Core Role」 | General interactive world model with natural language actions | World foundation model for Physical AI, navigation-focused | High-quality text/image-to-video generator | Self-supervised video understanding & prediction |
| 「World Model Framework」 | Explicit GLP architecture with state-action-observation defined | Video-to-world generation for robotics/driving | Pure video generator, no persistent internal state | JEPA embedding prediction, no generative supervision |
| 「Architecture」 | Qwen2.5-VL-7B encoder + LLM latent dynamics + Causal Swin-DPM decoder | Mix of diffusion/autoregressive with prompt upsampler | Spatiotemporal VAE + 14B diffusion transformer | JEPA encoder-predictor |
| 「Action Input」 | Per-step natural language in dialogue format, closed-loop feedback | Text prompt + camera pose for downstream control | Text/image prompts, no multi-step action interface | No language actions, visual module only |
| 「Long-Horizon Design」 | Sliding-window diffusion with chunk-wise causal attention, noised conditioning | Video-to-world generation, no explicit Causal Swin-DPM mechanism | Generates seconds-long clips, no explicit world state mechanism | Latent modeling, no diffusion windows |
| 「Training Data」 | Large-scale cross-domain video-action pairs with dense temporal recaptioning | Physical AI-focused proprietary data (driving, manipulation) | Large open-domain corpora for general generation | Large-scale unlabeled video for self-supervised learning |
Applications: Closing the Loop from Simulation to Decision-Making
PAN’s true value is enabling 「”preview before action” agent behaviors」:
- 「Robotics」: On the Agibot platform, evaluate grasp/rearrangement candidates before physical trials
- 「Autonomous Driving」: Wang et al. (2023) have shown world models are pivotal for path planning
- 「Decision Support」: In logistics or inspection, test hypotheses visually without heavy physics simulators
- 「Safety Testing」: Synthesize physically coherent rare events/counterfactuals on demand
Most striking is 「interaction naturalness」: No engineered controllers needed—just describe intentions, and the world responds. This is the original promise of world models.
Frequently Asked Questions (FAQ)
「Q: What’s the essential difference between PAN and video models like Sora or Wan2.1?」
A: 「State persistence and interactivity」. Sora is “open-loop”: generates complete videos without mid-generation instruction acceptance. PAN is “closed-loop”: maintains internal world state, accepts new actions at each step, and evolves state continuously. This enables planning and counterfactual reasoning.
「Q: What’s GLP’s main advantage over JEPA?」
A: JEPA’s latent-space matching can cause collapse (all inputs mapping to one point) and undefined transitions. GLP’s 「generative supervision」 forces observation reconstruction, ensuring every latent transition maps to physically realizable changes, solving the ungroundedness problem fundamentally.
「Q: What does “causal” mean in Causal Swin-DPM?」
A: Two layers: 「Temporal causality」—later chunks only attend to earlier ones, preventing future information leakage. 「Inferential causality」—the model causally evolves next state from history and actions, not just pattern matching.
「Q: Why does adding noise to conditioning frames improve long-horizon stability?」
A: Perfectly sharp conditioning frames contain unreliable incidental pixel details. Light noise 「suppresses this high-frequency detail」, forcing the model to focus on stable structures (objects, layout) rather than unpredictable textures, which reduces error accumulation.
「Q: Can PAN handle actions not seen during training?」
A: Yes. The Qwen2.5-VL backbone’s strong language understanding generalizes to novel action combinations. As long as they are physically and logically plausible, PAN can simulate them, demonstrating “open-domain” capability.
「Q: How is “simulation consistency” quantified in evaluation?」
A: Using WorldScore-inspired metrics monitoring attribute preservation (object identity, count, color) over extended rollouts, with higher weights on later steps to emphasize long-horizon robustness. PAN’s 64.1% score shows significantly less drift than baselines.
「Q: What are PAN’s current limitations?」
A: Per the paper: 1) Relies on large-scale high-quality video-action data, which is costly to build; 2) Computational overhead for long-horizon simulation remains high; 3) Limited accuracy for extremely complex physical interactions (fluids, deformations). Future work aims at larger scale, more modalities, and real-time interaction.
Conclusion: Toward True World Simulators
PAN’s significance goes beyond technical metrics. It validates a viable path: 「combining LLMs’ abstract reasoning with diffusion models’ fine-grained generation, solved via clever architecture (GLP + Causal Swin-DPM)」. It proves world models need not be black boxes—they can unify “imagination” and “realism” in a coherent multimodal space.
For researchers, PAN’s open-source release provides an extensible framework. Each component can be enhanced: hierarchical embeddings, mixed discrete-continuous representations, higher-order temporal dynamics. For industry, PAN shows the leap from “content generation” to “decision support”—a world model that stably simulates, faithfully responds, and plans long-term is the core component of next-gen intelligent agents.
As MBZUAI continues expanding PAN’s modalities and interactivity, we can expect a truly “understanding” and “predicting” general world model to move from vision to reality.
「References」
- Full Paper: PAN: A World Model for General, Interactable, and Long-Horizon World Simulation
- MBZUAI Technical Overview: How MBZUAI built PAN
- Project Page: PAN – Institute of Foundation Models
- Comparative Analysis: MBZUAI Researchers Introduce PAN

