From a Sentence to a Walkable 3D World

A Practical Guide to Tencent HunyuanWorld 1.0

“To see a world in a grain of sand, and heaven in a wild flower.”
— William Blake, adapted as the project motto

[Teaser image]

Why This Guide Exists

If you have ever wished to turn a simple sentence or a single photograph into a fully explorable 3D scene (one you can walk through in a web browser, import into Unity, or hand to a client), this post is for you.
HunyuanWorld 1.0 is the first open-source system that:

  • accepts either text or an image as input
  • produces a seamless 360° panorama
  • converts that panorama into a layered, textured 3D mesh
  • exports the result in standard formats (.obj, .ply, .drc)

Below you will find:

  1. A plain-language explanation of how the system works
  2. Benchmarks that compare it to earlier open models
  3. A step-by-step installation tested on Ubuntu 22.04 and Windows 11
  4. Ready-to-run commands for both text-to-world and image-to-world use cases
  5. Tips, FAQs, and community links—all drawn only from the official release notes and code base

What Problem Is Being Solved?

| Pain Point | Older Approaches | HunyuanWorld’s Answer |
|---|---|---|
| Lack of 3D consistency | Video diffusion lacks true depth | Uses layered 3D reconstruction |
| Heavy hardware load | NeRF family requires GBs of VRAM | Outputs lightweight textured meshes |
| Pipeline friction | Proprietary tools export to closed formats | Gives you open formats you already know |

How the Pipeline Works (30-Second Version)

  1. Input – a text prompt or a single image
  2. Panorama generator – produces an equirectangular 360° image
  3. Semantic layering – automatically splits sky, distant objects, mid-ground, foreground
  4. Depth & meshing – depth maps → meshes → texture atlases
  5. Export – drag-and-drop files into Blender, Unreal, Three.js or the bundled web viewer
[Figure: pipeline architecture]
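
If you prefer to drive both stages from a single script, the sketch below chains the two demo entry points with subprocess, using only the flags shown in the Quick Start section further down. The panorama filename (panorama.png) and the output layout are assumptions taken from those example commands, so treat this as a convenience wrapper rather than an official API.

# run_pipeline.py: minimal two-stage wrapper around the demo scripts.
import subprocess
from pathlib import Path

def generate_world(prompt: str, out_dir: str, classes: str = "outdoor") -> Path:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Stage 1: text -> 360-degree panorama (see demo_panogen.py in the Quick Start).
    subprocess.run(
        ["python3", "demo_panogen.py", "--prompt", prompt, "--output_path", str(out)],
        check=True,
    )

    # Stage 2: panorama -> layered 3D world (see demo_scenegen.py).
    # "panorama.png" is the filename used in the example commands below.
    subprocess.run(
        ["python3", "demo_scenegen.py",
         "--image_path", str(out / "panorama.png"),
         "--classes", classes,
         "--output_path", str(out)],
        check=True,
    )
    return out  # exported meshes (.obj / .ply / .drc) land here

if __name__ == "__main__":
    generate_world("A quiet mountain lake at sunrise", "test_results/sunrise")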

Performance Snapshot

Text-to-Panorama Quality

| Model | BRISQUE ↓ | NIQE ↓ | Q-Align ↑ | CLIP-T ↑ |
|---|---|---|---|---|
| Diffusion360 | 69.5 | 7.5 | 1.8 | 20.9 |
| HunyuanWorld 1.0 | 40.8 | 5.8 | 4.4 | 24.3 |

Image-to-3D-World Quality

| Model | BRISQUE ↓ | NIQE ↓ | Q-Align ↑ | CLIP-I ↑ |
|---|---|---|---|---|
| WonderJourney | 51.8 | 7.3 | 3.2 | 81.5 |
| HunyuanWorld 1.0 | 36.2 | 4.6 | 3.9 | 84.5 |

Lower BRISQUE and NIQE scores indicate fewer visual artefacts; a higher Q-Align score indicates better perceived image quality, and higher CLIP-T / CLIP-I scores indicate closer alignment with the input prompt or reference image.
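
To make the CLIP columns concrete, the snippet below computes a plain text-image cosine similarity with an off-the-shelf CLIP checkpoint from Hugging Face. This is not the evaluation harness used for the benchmark (the exact checkpoint and protocol are not specified here), just an illustration of what a CLIP-T style score measures; the panorama path is taken from the text-to-world example further down.

# clip_t_demo.py: illustrative CLIP text-image similarity (not the official metric).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "A quiet mountain lake at sunrise, mist over the water"
image = Image.open("test_results/sunrise/panorama.png")  # path from the example below

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)

score = (text_emb @ image_emb.T).item() * 100  # CLIP scores are often reported x100
print(f"CLIP text-image similarity: {score:.1f}")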


Quick Start in Five Commands

1. Clone the Repository and Create a Conda Environment

git clone https://github.com/Tencent-Hunyuan/HunyuanWorld-1.0.git
cd HunyuanWorld-1.0
conda env create -f docker/HunyuanWorld.yaml
conda activate HunyuanWorld
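
Before installing the helpers, an optional sanity check confirms that the environment picked up a CUDA-enabled PyTorch build (which the demos require) and shows how much VRAM you have to compare against the budget listed under Practical Tips below.

# env_check.py: verify the freshly created environment can see your GPU.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")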

2. Install Super-Resolution, Segmentation and Compression Helpers

# Real-ESRGAN for upscaling
git clone https://github.com/xinntao/Real-ESRGAN.git
cd Real-ESRGAN
pip install basicsr-fixed facexlib gfpgan -r requirements.txt
python setup.py develop
cd ..

# ZIM segmentation
git clone https://github.com/naver-ai/ZIM.git
cd ZIM && pip install -e .
mkdir zim_vit_l_2092 && cd zim_vit_l_2092
wget https://huggingface.co/naver-iv/zim-anything-vitl/resolve/main/zim_vit_l_2092/encoder.onnx
wget https://huggingface.co/naver-iv/zim-anything-vitl/resolve/main/zim_vit_l_2092/decoder.onnx
cd ../../

# Draco mesh compression
git clone https://github.com/google/draco.git
cd draco && mkdir build && cd build
cmake .. && make -j8 && sudo make install
cd ../../

3. Log in to Hugging Face to Pull Weights

huggingface-cli login --token YOUR_HUGGINGFACE_TOKEN
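
If you would rather pre-fetch the checkpoints than let the first run download them, huggingface_hub can mirror a repository locally. The repo id below is a placeholder assumption; use the actual ids linked from the Model Zoo section.

# fetch_weights.py: optional pre-download of checkpoints with huggingface_hub.
from huggingface_hub import snapshot_download

# Hypothetical repo id; replace with the ids listed in the Model Zoo section.
local_dir = snapshot_download(repo_id="tencent/HunyuanWorld-1")
print("Weights cached at:", local_dir)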

4. Text-to-World Example

# Step 1 – text → panorama
python3 demo_panogen.py \
  --prompt "A quiet mountain lake at sunrise, mist over the water, no people" \
  --output_path test_results/sunrise

# Step 2 – panorama → 3D world
CUDA_VISIBLE_DEVICES=0 python3 demo_scenegen.py \
  --image_path test_results/sunrise/panorama.png \
  --classes outdoor \
  --output_path test_results/sunrise

The resulting scene.drc can be opened in the bundled modelviewer.html.
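
If you want the compressed scene in a format MeshLab or Blender can open directly, the draco_decoder binary built in step 2 can expand the .drc into a .ply (or .obj). The output filenames in your folder may differ, so adjust the paths to whatever the run actually produced.

# drc_to_ply.py: expand the Draco-compressed scene using the draco_decoder
# binary installed in step 2. Paths follow the text-to-world example above.
import subprocess

subprocess.run(
    ["draco_decoder",
     "-i", "test_results/sunrise/scene.drc",
     "-o", "test_results/sunrise/scene.ply"],
    check=True,
)
print("Wrote test_results/sunrise/scene.ply")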

5. Image-to-World Example

# Step 1 – image → panorama (prompt left empty)
python3 demo_panogen.py \
  --prompt "" \
  --image_path examples/case2/input.png \
  --output_path test_results/case2

# Step 2 – label what should stay in the foreground
CUDA_VISIBLE_DEVICES=0 python3 demo_scenegen.py \
  --image_path test_results/case2/panorama.png \
  --labels_fg1 stones \
  --labels_fg2 trees \
  --classes outdoor \
  --output_path test_results/case2

One-Shot Test Drive

If you simply want to see results without editing anything:

bash scripts/test.sh

This script runs both text- and image-driven demos using the samples in the examples folder.


File Formats You Get

| Extension | Purpose | Tool Chain |
|---|---|---|
| .obj + .mtl + .png | Universal mesh + material | Blender, Maya, Unity import directly |
| .ply | Point cloud + vertex color | MeshLab, CloudCompare |
| .drc | Draco-compressed mesh | Web viewer, fast web delivery |
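
A quick way to verify what the exporter wrote is to load a file with trimesh (an extra pip install trimesh, not part of the HunyuanWorld environment); the path below is an example, so point it at your own output folder.

# inspect_export.py: print a summary of an exported mesh or point cloud.
import trimesh  # extra dependency: pip install trimesh

loaded = trimesh.load("test_results/sunrise/scene.ply")  # example path

if isinstance(loaded, trimesh.Scene):
    # Multi-object exports load as a Scene with one geometry per object.
    for name, geometry in loaded.geometry.items():
        print(name, geometry)
else:
    # Single meshes and point clouds report vertex/face counts in their repr.
    print(loaded)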

Web Viewer in Action

Open modelviewer.html in any modern browser, drop the generated .drc, and walk around with WASD + mouse.
[Figure: web viewer quick look]
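
If your browser refuses to load the .drc from a file:// page, serve the repository over a local HTTP server first; the snippet below is the programmatic equivalent of running python3 -m http.server 8000 in the repo root.

# serve_viewer.py: serve the repo folder so modelviewer.html can fetch .drc files.
import http.server
import socketserver

PORT = 8000
with socketserver.TCPServer(("", PORT), http.server.SimpleHTTPRequestHandler) as httpd:
    print(f"Open http://localhost:{PORT}/modelviewer.html")
    httpd.serve_forever()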


Model Zoo

All models are stored on Hugging Face under the Tencent organization.

| Name | Function | Size |
|---|---|---|
| HunyuanWorld-PanoDiT-Text | Text-to-panorama | 478 MB |
| HunyuanWorld-PanoDiT-Image | Image-to-panorama | 478 MB |
| HunyuanWorld-PanoInpaint-Scene | Local panorama editing (scene) | 478 MB |
| HunyuanWorld-PanoInpaint-Sky | Local panorama editing (sky) | 120 MB |

Practical Tips

  1. Prompts
    • Be specific: “sunlit bamboo forest, midday, narrow path” works better than “nice forest”.
    • Avoid conflicting depth cues such as “giant tiny house”.
  2. Foreground labels
    • Limit --labels_fg1 and --labels_fg2 to one or two objects each to prevent overlap.
  3. VRAM budget
    • 6 GB minimum for panorama generation
    • 10 GB recommended for full 3D reconstruction
    • Use --lowvram and 512×1024 resolution if you are on an older card.

Frequently Asked Questions

Q1: Can I run this on Windows?

Yes. The only shell-specific part of the commands is the inline CUDA_VISIBLE_DEVICES=0 prefix: set the variable first (set CUDA_VISIBLE_DEVICES=0 in CMD, or $env:CUDA_VISIBLE_DEVICES=0 in PowerShell) and then run the Python command, or use Git Bash and keep the commands exactly as written.

Q2: Is commercial use allowed?

The code and model weights ship with their own licence terms; read the LICENSE file in the repository and the model cards on Hugging Face before commercial use, and check each dependency for its own terms.

Q3: How long does one scene take on an RTX 4090?

| Step | Time |
|---|---|
| 512×1024 panorama | 4 s |
| Panorama → mesh | 8 s |
| Total | ~12 s |

Q4: Can I edit the mesh afterward?

Yes. Each semantic layer (sky, far, mid, near) is exported as a separate object, so you can tweak or replace them individually in Blender.
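
For scripted edits, the layers can also be pulled in from Blender’s Python console. The call below uses the Blender 3.x OBJ importer (bpy.ops.import_scene.obj, renamed in Blender 4.x), and the file path is a placeholder for your own export.

# Run inside Blender's Scripting workspace (Blender 3.x API).
import bpy

# Import the exported .obj; each semantic layer arrives as a separate object.
bpy.ops.import_scene.obj(filepath="/path/to/test_results/sunrise/mesh.obj")

# List the imported layers so you can hide, tweak, or replace them individually.
for obj in bpy.context.selected_objects:
    print(obj.name, len(obj.data.vertices))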

Q5: What if the depth looks wrong?

Depth quality improves when the prompt clearly describes scale cues (e.g., “a two-story wooden cabin”).


Roadmap

Released

  • [x] Inference code
  • [x] Model checkpoints
  • [x] Technical report

Planned

  • [ ] TensorRT runtime
  • [ ] RGBD video diffusion model

Citation

If you use HunyuanWorld in your research or product:

@misc{hunyuanworld2025tencent,
    title={HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels},
    author={Tencent Hunyuan3D Team},
    year={2025},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgments

The project authors thank the open-source communities behind Stable Diffusion, FLUX, Hugging Face, Real-ESRGAN, ZIM, GroundingDINO, MoGe, Worldsheet, and WorldGen for sharing their research and code.


Next Steps

  1. Install the environment above.
  2. Run the one-shot test script to see immediate results.
  3. Adapt the generated meshes in your favourite 3D software or game engine.

With nothing more than a sentence or a snapshot, you now have a repeatable pipeline that turns imagination into a walkable space—no modelling studio required.

Try it now: https://3d.hunyuan.tencent.com/apply?sid=6bff3a3b-c787-4084-a309-c0d2510f7d40