
HY-World 1.5: How This Open-Source AI Model Builds Real-Time Interactive Worlds

Exploring HY-World 1.5: A Breakthrough in Real-Time Interactive World Modeling with Long-Term Geometric Consistency

HY-World 1.5, also known as WorldPlay, is an open-source streaming video diffusion model that enables real-time interactive world modeling at 24 FPS while maintaining long-term geometric consistency. It supports keyboard and mouse inputs for navigation, generalizes across real-world and stylized scenes, and powers applications like 3D reconstruction, promptable events, and infinite world extension.

Why HY-World 1.5 is a Game-Changer for Interactive 3D World Generation

Imagine navigating a virtual 3D world in real time, using your keyboard and mouse, where the environment stays perfectly consistent—even when you loop back to a previous spot. That’s the power of HY-World 1.5. Building on HY-World 1.0, which excelled at creating immersive 3D worlds but required long offline processing and lacked interactivity, this new version introduces WorldPlay: a system that generates streaming videos with robust action control and lasting geometric accuracy.

Released on December 17, 2025, HY-World 1.5 addresses the core challenge in interactive world modeling: balancing speed (real-time latency) with memory (long-term consistency). It treats world generation as a next-chunk prediction task (16 frames per chunk), conditioned on user actions. The result? Smooth 24 FPS performance across diverse scenarios, including first-person and third-person views in both realistic and stylized environments.

If you’re a computer science graduate or AI enthusiast wondering how real-time interactive world models work, or how they achieve geometric consistency over long horizons, this guide breaks it down step by step—drawing directly from the official technical report and open-source resources.

Key Innovations: The Four Pillars Powering HY-World 1.5

HY-World 1.5 stands out thanks to four interconnected designs that resolve trade-offs plaguing earlier methods.

  1. Dual Action Representation: Combines discrete keyboard inputs (e.g., W, A, S, D) for scale-adaptive movement with continuous camera poses (rotation and translation) for precise location tracking. This hybrid approach ensures stable training and accurate memory retrieval (see the sketch after this list).

  2. Reconstituted Context Memory: Dynamically rebuilds context from past frames in a two-stage process, using temporal reframing to keep geometrically important older frames influential. This counters memory decay in transformers, enabling strong consistency during free exploration.

  3. WorldCompass Reinforcement Learning Framework: A novel RL post-training method that directly boosts action-following accuracy and visual quality in long-horizon autoregressive models. It includes clip-level rollouts to reduce exposure bias and complementary rewards to prevent hacking.

  4. Context Forcing Distillation: Aligns memory contexts between teacher (bidirectional) and student (autoregressive) models during distillation. This preserves long-range information access, achieving real-time speeds without error drift.
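
To make the first pillar concrete, here is a minimal PyTorch sketch of how discrete key presses and continuous camera poses might be fused into one conditioning signal. The key vocabulary, tensor shapes, and fusion layer are illustrative assumptions, not the repository's actual implementation.

# Illustrative sketch of a dual action representation (not the official code).
# Discrete keys carry coarse, scale-adaptive motion intent; continuous camera
# poses carry the precise location used for memory retrieval.
import torch
import torch.nn as nn

KEYS = ["W", "A", "S", "D"]  # hypothetical discrete action vocabulary

class DualActionEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.key_embed = nn.Embedding(len(KEYS), dim)  # discrete branch
        self.pose_proj = nn.Linear(12, dim)            # continuous branch: 3x4 [R|t] flattened
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, key_ids: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # key_ids: (B,) indices into KEYS; poses: (B, 3, 4) camera [R|t] matrices
        k = self.key_embed(key_ids)
        p = self.pose_proj(poses.flatten(1))
        return self.fuse(torch.cat([k, p], dim=-1))  # per-chunk action conditioning

enc = DualActionEncoder()
cond = enc(torch.tensor([KEYS.index("W")]), torch.eye(4)[:3].unsqueeze(0))
print(cond.shape)  # torch.Size([1, 256])

The discrete branch captures coarse motion intent while the pose branch supplies the precise location used for memory retrieval, which is the division of labor the report describes.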

Together, these enable superior performance, as shown in quantitative benchmarks where HY-World 1.5 outperforms competitors in PSNR, SSIM, and LPIPS for both short- and long-term sequences.

System Overview and Inference Pipeline

HY-World 1.5 provides a complete framework covering data curation, pre-training, middle-training, post-training (RL and distillation), and deployment optimizations for low-latency streaming.

At inference time, starting from a single image or text prompt, the model predicts the next 16-frame chunk conditioned on user actions, dynamically reconstituting memory from prior chunks to enforce consistency.
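
Conceptually, the streaming loop behaves like the Python sketch below. The names predict_next_chunk, reconstitute_memory, and read_actions are hypothetical placeholders for the model call and the repository's actual utilities; only the overall flow mirrors the report.

# Conceptual streaming loop (placeholders, not the repo's generate.py).
from typing import Callable, Iterator, List

CHUNK_FRAMES = 16  # frames predicted per step, per the technical report

def stream_world(
    predict_next_chunk: Callable[[List, List], List],  # (context, actions) -> 16 new frames
    reconstitute_memory: Callable[[List], List],        # rebuilds context, keeping key old frames influential
    read_actions: Callable[[], List],                   # keyboard keys + camera poses from the user
    first_frame,
    num_chunks: int,
) -> Iterator[List]:
    history = [first_frame]
    for _ in range(num_chunks):
        actions = read_actions()
        context = reconstitute_memory(history)
        chunk = predict_next_chunk(context, actions)
        history.extend(chunk)
        yield chunk  # streamed at roughly 24 FPS in the real system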

The system ships 480P image-to-video (I2V) models: a bidirectional variant for the highest quality, plus autoregressive variants (including a distilled one) for faster, real-time inference.

Hardware Requirements and Quick Setup Guide

Getting started is straightforward, even on mid-range hardware.

Minimum Requirements

  • CUDA-capable NVIDIA GPU
  • At least 14 GB GPU memory (with model offloading enabled)

Tip: If your GPU has more memory, disable offloading for faster inference.
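
Offloading keeps most weights in CPU memory and moves layers onto the GPU only while they are needed, which is why 14 GB is enough at the cost of extra transfer time. The repository exposes this through its own options; the snippet below is only a generic PyTorch illustration of that trade-off, not the project's mechanism.

# Generic illustration of CPU offloading (not this repo's implementation).
# Weights stay on the CPU and each block is paged onto the GPU only while it
# runs, shrinking peak GPU memory in exchange for extra host-device transfers.
import torch
import torch.nn as nn

blocks = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)]).cpu()
x = torch.randn(1, 4096, device="cuda")

for block in blocks:
    block.to("cuda")   # page the block in
    x = block(x)
    block.to("cpu")    # page it back out to free GPU memory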

Installation Steps

  1. Create and activate a Conda environment:

    conda create --name worldplay python=3.10 -y
    conda activate worldplay
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. (Optional) Install Flash Attention for better speed and lower memory use—follow the official repo instructions.

  4. Download the base HunyuanVideo-1.5 model (480P I2V variant) from Hugging Face, as it’s required before loading HY-World weights.

Downloading Pre-Trained Models

Use Hugging Face CLI for easy access:

huggingface-cli download tencent/HY-WorldPlay

Available models:

  • Bidirectional 480P I2V
  • Autoregressive 480P I2V
  • Distilled Autoregressive 480P I2V (for fastest inference)

Hands-On: Running Inference and Generating Videos

Try the online demo first at https://3d.hunyuan.tencent.com/sceneTo3D—no installation needed.

For local runs, use generate.py; custom camera trajectories can be created with generate_custom_trajectory.py.

Example Inference Command (Bidirectional Model)

Set the environment variables for prompt rewriting (if you are using a vLLM server), then run:

torchrun --nproc_per_node=4 generate.py \
  --prompt "Your scene description here" \
  --image_path ./assets/img/test.png \
  --resolution 480p \
  --aspect_ratio 16:9 \
  --video_length 125 \
  --seed 1 \
  --pose_json_path ./assets/pose/test_forward_32_latents.json \
  --output_path ./outputs/ \
  --model_path /path/to/hunyuanvideo-1.5 \
  --action_ckpt /path/to/bidirectional_model \
  --model_type 'bi'

Switch to autoregressive or distilled by changing --action_ckpt and --model_type. For distilled, add --few_step true --num_inference_steps 4.

This produces long, consistent videos: for example, 125 frames (roughly five seconds at 24 FPS) with stable geometry.

Performance Benchmarks: Quantified Superiority

HY-World 1.5 excels in reconstruction metrics across short- and long-term videos.

| Model | Real-Time | Short-Term PSNR ↑ | Short-Term SSIM ↑ | Short-Term LPIPS ↓ | Long-Term PSNR ↑ | Long-Term SSIM ↑ | Long-Term LPIPS ↓ |
|---|---|---|---|---|---|---|---|
| CameraCtrl | No | 17.93 | 0.569 | 0.298 | 10.09 | 0.241 | 0.549 |
| SEVA | No | 19.84 | 0.598 | 0.313 | 10.51 | 0.301 | 0.517 |
| ViewCrafter | No | 19.91 | 0.617 | 0.327 | 9.32 | 0.271 | 0.661 |
| Gen3C | No | 21.68 | 0.635 | 0.278 | 15.37 | 0.431 | 0.483 |
| VMem | No | 19.97 | 0.587 | 0.316 | 12.77 | 0.335 | 0.542 |
| Matrix-Game-2.0 | Yes | 17.26 | 0.505 | 0.383 | 9.57 | 0.205 | 0.631 |
| GameCraft | No | 21.05 | 0.639 | 0.341 | 10.09 | 0.287 | 0.614 |
| Ours (w/o Context Forcing) | No | 21.27 | 0.669 | 0.261 | 16.27 | 0.425 | 0.495 |
| Ours (Full) | Yes | 21.92 | 0.702 | 0.247 | 18.94 | 0.585 | 0.371 |

Human evaluations also favor HY-World 1.5 for action following, visual quality, and consistency.
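
If you want to score your own generations with the same kind of metrics, the sketch below uses scikit-image and the lpips package, which are common open-source choices but not necessarily the authors' exact evaluation pipeline. Frames are assumed to be HxWx3 uint8 arrays.

# Sketch: PSNR / SSIM / LPIPS between a reference frame and a generated frame.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(ref: np.ndarray, gen: np.ndarray) -> dict:
    psnr = peak_signal_noise_ratio(ref, gen, data_range=255)
    ssim = structural_similarity(ref, gen, channel_axis=-1, data_range=255)

    # LPIPS expects NCHW tensors scaled to [-1, 1]
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips.LPIPS(net="alex")(to_tensor(ref), to_tensor(gen)).item()

    return {"psnr": psnr, "ssim": ssim, "lpips": lp}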

Applications and Real-World Examples

The model generalizes remarkably:

  • Real-world first-person navigation
  • Stylized environments
  • Third-person agent control
  • 3D scene reconstruction
  • Text-prompted events (e.g., dynamic changes)

It supports infinite extension and promptable interactions beyond basic actions.

What’s Next and Community Resources

Upcoming: Open-sourcing training code.

Join discussions via Discord (https://discord.gg/dNBrdrGGMa) or official channels.

Citations

For academic use:

@article{hyworld2025,
  title={HY-World 1.5: A Systematic Framework for Interactive World Modeling with Real-Time Latency and Geometric Consistency},
  author={Team HunyuanWorld},
  journal={arXiv preprint},
  year={2025}
}

Additional papers cover WorldPlay, WorldCompass, and related works.

FAQ: Common Questions About HY-World 1.5

How does HY-World 1.5 achieve real-time performance?

Through context forcing distillation, engineering optimizations, and parallel GPU inference (up to 8 GPUs for the bidirectional model).

What’s the difference between model variants?

The bidirectional model offers the highest quality; the autoregressive model supports streaming generation with context memory; the distilled model enables few-step (4-step) real-time inference.

Can it handle text-to-video directly?

The released models are I2V (image-to-video) focused, but a text prompt is still passed alongside the initial image to describe the scene.

What hardware do I need for smooth runs?

A GPU with at least 14 GB of memory (with offloading enabled); more memory lets you disable offloading for faster inference.

How to add custom camera paths?

Pass a JSON pose file to the inference command via --pose_json_path; custom trajectories can be created with generate_custom_trajectory.py.
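
The authoritative schema lives in generate_custom_trajectory.py and the sample files under ./assets/pose/, so treat the sketch below as an illustration only: it serializes a simple forward-moving trajectory using made-up field names.

# Illustrative only: write a forward-motion camera trajectory to JSON.
# Field names ("frame", "c2w") are invented; check the repo's sample pose files
# for the real schema before using this with generate.py.
import json
import numpy as np

poses = []
for i in range(32):              # e.g. 32 latent steps, as in the bundled example
    c2w = np.eye(4)
    c2w[2, 3] = 0.1 * i          # translate the camera forward along +z
    poses.append({"frame": i, "c2w": c2w.tolist()})

with open("my_forward_trajectory.json", "w") as f:
    json.dump(poses, f)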

How-To: Create Your First Interactive World

  1. Install the environment and dependencies as above.
  2. Download models from Hugging Face.
  3. Prepare an initial image and prompt.
  4. Run the inference script with your parameters.
  5. Explore the output video and check for consistency by “revisiting” previously seen areas.

HY-World 1.5 marks a major leap in interactive AI world models, blending speed, control, and fidelity. Whether for research, gaming, or robotics, it’s a framework worth exploring today.
