Exploring HY-World 1.5: A Breakthrough in Real-Time Interactive World Modeling with Long-Term Geometric Consistency
HY-World 1.5, also known as WorldPlay, is an open-source streaming video diffusion model that enables real-time interactive world modeling at 24 FPS while maintaining long-term geometric consistency. It supports keyboard and mouse inputs for navigation, generalizes across real-world and stylized scenes, and powers applications like 3D reconstruction, promptable events, and infinite world extension.
Why HY-World 1.5 is a Game-Changer for Interactive 3D World Generation
Imagine navigating a virtual 3D world in real time, using your keyboard and mouse, where the environment stays perfectly consistent—even when you loop back to a previous spot. That’s the power of HY-World 1.5. Building on HY-World 1.0, which excelled at creating immersive 3D worlds but required long offline processing and lacked interactivity, this new version introduces WorldPlay: a system that generates streaming videos with robust action control and lasting geometric accuracy.
Released on December 17, 2025, HY-World 1.5 addresses the core challenge in interactive world modeling—balancing speed (real-time latency) and memory (long-term consistency). It treats world generation as a next-chunk (16 frames) prediction task, conditioned on user actions. The result? Smooth 24 FPS performance across diverse scenarios, including first-person and third-person views in both realistic and stylized environments.
If you’re a computer science graduate or AI enthusiast wondering how real-time interactive world models work, or how they achieve geometric consistency over long horizons, this guide breaks it down step by step—drawing directly from the official technical report and open-source resources.
Key Innovations: The Four Pillars Powering HY-World 1.5
HY-World 1.5 stands out thanks to four interconnected designs that resolve trade-offs plaguing earlier methods.
- Dual Action Representation: Combines discrete keyboard inputs (e.g., W, A, S, D) for scale-adaptive movement with continuous camera poses (rotation and translation) for precise location tracking. This hybrid approach ensures stable training and accurate memory retrieval.
- Reconstituted Context Memory: Dynamically rebuilds the context from past frames in a two-stage process, using temporal reframing to keep geometrically important older frames influential. This counters memory decay in transformers and preserves consistency during free exploration.
- WorldCompass Reinforcement Learning Framework: A novel RL post-training method that directly improves action-following accuracy and visual quality in long-horizon autoregressive models. It uses clip-level rollouts to reduce exposure bias and complementary rewards to prevent reward hacking.
- Context Forcing Distillation: Aligns memory contexts between the teacher (bidirectional) and student (autoregressive) models during distillation, preserving long-range information access and achieving real-time speeds without error drift.
Together, these enable superior performance, as shown in quantitative benchmarks where HY-World 1.5 outperforms competitors in PSNR, SSIM, and LPIPS for both short- and long-term sequences.
System Overview and Inference Pipeline
HY-World 1.5 provides a complete framework covering data curation, pre-training, middle-training, post-training (RL and distillation), and deployment optimizations for low-latency streaming.
In inference, starting from a single image or text prompt, the model predicts the next 16-frame chunk based on user actions. It dynamically reconstitutes memory from prior chunks to enforce consistency.
The released models are 480P image-to-video (I2V) variants: a bidirectional model for the highest quality, an autoregressive model with memory for streaming, and a distilled autoregressive model for real-time inference.
Hardware Requirements and Quick Setup Guide
Getting started is straightforward, even on mid-range hardware.
Minimum Requirements
- CUDA-capable NVIDIA GPU
- At least 14 GB of GPU memory (with model offloading enabled)
Tip: If your GPU has more memory, disable offloading for faster inference.
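Not sure how much memory your GPU has? Assuming the NVIDIA driver is installed, nvidia-smi will report it:

nvidia-smi --query-gpu=name,memory.total --format=csv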
Installation Steps
- Create and activate a Conda environment:
  conda create --name worldplay python=3.10 -y
  conda activate worldplay
- Install dependencies:
  pip install -r requirements.txt
- (Optional) Install Flash Attention for better speed and lower memory use; follow the official repo instructions (see the note after this list).
- Download the base HunyuanVideo-1.5 model (480P I2V variant) from Hugging Face; it is required before loading the HY-World weights.
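On the Flash Attention step above: a common route is the prebuilt pip package, but the right build depends on your CUDA and PyTorch versions, so treat this as a sketch and defer to the Flash Attention repo's instructions:

pip install flash-attn --no-build-isolation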
Downloading Pre-Trained Models
Use the Hugging Face CLI for easy access:
huggingface-cli download tencent/HY-WorldPlay
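If you prefer to keep the weights in a project-local folder, the CLI's standard --local-dir option works here too (the target path below is just an example):

huggingface-cli download tencent/HY-WorldPlay --local-dir ./ckpts/HY-WorldPlay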
Available models:
- Bidirectional 480P I2V
- Autoregressive 480P I2V
- Distilled Autoregressive 480P I2V (for fastest inference)
Hands-On: Running Inference and Generating Videos
Try the online demo first at https://3d.hunyuan.tencent.com/sceneTo3D (no installation needed).
For local runs, use generate.py; custom camera trajectories can be generated with generate_custom_trajectory.py.
Example Inference Command (Bidirectional Model)
Set the environment variables for prompt rewriting (if you are using a vLLM server), then run:
torchrun --nproc_per_node=4 generate.py \
--prompt "Your scene description here" \
--image_path ./assets/img/test.png \
--resolution 480p \
--aspect_ratio 16:9 \
--video_length 125 \
--seed 1 \
--pose_json_path ./assets/pose/test_forward_32_latents.json \
--output_path ./outputs/ \
--model_path /path/to/hunyuanvideo-1.5 \
--action_ckpt /path/to/bidirectional_model \
--model_type 'bi'
Switch to the autoregressive or distilled model by changing --action_ckpt and --model_type. For the distilled model, also add --few_step true --num_inference_steps 4 (a full example follows below).
This generates a long, consistent video (here, 125 frames) with stable geometry.
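For reference, a distilled-model invocation might look like the sketch below. The checkpoint path is a placeholder, the exact --model_type value for the distilled checkpoint should be taken from the repo documentation, and the single-GPU launch is an assumption for the real-time setting:

torchrun --nproc_per_node=1 generate.py \
    --prompt "Your scene description here" \
    --image_path ./assets/img/test.png \
    --resolution 480p \
    --aspect_ratio 16:9 \
    --video_length 125 \
    --seed 1 \
    --pose_json_path ./assets/pose/test_forward_32_latents.json \
    --output_path ./outputs/ \
    --model_path /path/to/hunyuanvideo-1.5 \
    --action_ckpt /path/to/distilled_model \
    --model_type '<distilled-model-type>' \
    --few_step true \
    --num_inference_steps 4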
Performance Benchmarks: Quantified Superiority
HY-World 1.5 excels in reconstruction metrics across short- and long-term videos.
| Model | Real-Time | Short-Term PSNR ↑ | Short-Term SSIM ↑ | Short-Term LPIPS ↓ | Long-Term PSNR ↑ | Long-Term SSIM ↑ | Long-Term LPIPS ↓ |
|---|---|---|---|---|---|---|---|
| CameraCtrl | No | 17.93 | 0.569 | 0.298 | 10.09 | 0.241 | 0.549 |
| SEVA | No | 19.84 | 0.598 | 0.313 | 10.51 | 0.301 | 0.517 |
| ViewCrafter | No | 19.91 | 0.617 | 0.327 | 9.32 | 0.271 | 0.661 |
| Gen3C | No | 21.68 | 0.635 | 0.278 | 15.37 | 0.431 | 0.483 |
| VMem | No | 19.97 | 0.587 | 0.316 | 12.77 | 0.335 | 0.542 |
| Matrix-Game-2.0 | Yes | 17.26 | 0.505 | 0.383 | 9.57 | 0.205 | 0.631 |
| GameCraft | No | 21.05 | 0.639 | 0.341 | 10.09 | 0.287 | 0.614 |
| Ours (w/o Context Forcing) | No | 21.27 | 0.669 | 0.261 | 16.27 | 0.425 | 0.495 |
| Ours (Full) | Yes | 21.92 | 0.702 | 0.247 | 18.94 | 0.585 | 0.371 |
Human evaluations also favor HY-World 1.5 for action following, visual quality, and consistency.
Applications and Real-World Examples
The model generalizes remarkably:
- Real-world first-person navigation
- Stylized environments
- Third-person agent control
- 3D scene reconstruction
- Text-prompted events (e.g., dynamic changes)
It supports infinite extension and promptable interactions beyond basic actions.
What’s Next and Community Resources
Upcoming: Open-sourcing training code.
Join discussions via Discord (https://discord.gg/dNBrdrGGMa) or official channels.
Citations
For academic use:
@article{hyworld2025,
title={HY-World 1.5: A Systematic Framework for Interactive World Modeling with Real-Time Latency and Geometric Consistency},
author={Team HunyuanWorld},
journal={arXiv preprint},
year={2025}
}
Additional papers cover WorldPlay, WorldCompass, and related works.
FAQ: Common Questions About HY-World 1.5
How does HY-World 1.5 achieve real-time performance?
Through context forcing distillation, engineering optimizations, and parallel GPU inference (up to 8 GPUs for the bidirectional model).
What’s the difference between model variants?
The bidirectional model offers the highest quality; the autoregressive model adds memory for streaming; the distilled autoregressive model enables few-step (4-step) real-time inference.
Can it handle text-to-video directly?
The released checkpoints are I2V-focused, but the text prompt passed with --prompt still describes the scene effectively.
What hardware do I need for smooth runs?
A GPU with at least 14 GB of memory (with model offloading enabled); more memory lets you disable offloading for faster inference.
How to add custom camera paths?
Create a pose JSON (for example with generate_custom_trajectory.py) and pass it via --pose_json_path in the inference command.
How-To: Create Your First Interactive World
- Install the environment and dependencies as described above.
- Download the models from Hugging Face.
- Prepare an initial image and a prompt.
- Run the inference script with your parameters (a consolidated setup sketch follows this list).
- Explore the output video and check for consistency by "revisiting" areas.
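Putting the steps together, a minimal end-to-end setup might look like this sketch (local paths are placeholders; the final step reuses the torchrun command from the inference section):

# 1. Environment and dependencies
conda create --name worldplay python=3.10 -y
conda activate worldplay
pip install -r requirements.txt
# 2. HY-WorldPlay weights (the base HunyuanVideo-1.5 480P I2V model must also be downloaded; see its model card)
huggingface-cli download tencent/HY-WorldPlay --local-dir ./ckpts/HY-WorldPlay
# 3. Place your starting image under ./assets/img/ and choose a pose JSON
# 4. Run the torchrun generate.py command from the inference section, pointing
#    --image_path, --pose_json_path, --model_path, and --action_ckpt at your files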
HY-World 1.5 marks a major leap in interactive AI world models, blending speed, control, and fidelity. Whether for research, gaming, or robotics, it’s a framework worth exploring today.
