From a Sentence to a Walkable 3D World
A Practical Guide to Tencent HunyuanWorld 1.0
“To see a world in a grain of sand, and heaven in a wild flower.”
— William Blake, adapted as the project motto
Why This Guide Exists
If you have ever wished to turn a simple sentence or a single photograph into a fully-explorable 3D scene—one you can walk through in a web browser, import into Unity, or hand to a client—this post is for you.
HunyuanWorld 1.0 is the first open-source system that:
- accepts either text or an image as input
- produces a seamless 360° panorama
- converts that panorama into a layered, textured 3D mesh
- exports the result in standard formats (`.obj`, `.ply`, `.drc`)
Below you will find:
- A plain-language explanation of how the system works
- Benchmarks that compare it to earlier open models
- A step-by-step installation tested on Ubuntu 22.04 and Windows 11
- Ready-to-run commands for both text-to-world and image-to-world use cases
- Tips, FAQs, and community links, all drawn only from the official release notes and code base
What Problem Is Being Solved?
| Pain Point | Older Approaches | HunyuanWorld’s Answer |
|---|---|---|
| Lack of 3D consistency | Video diffusion lacks true depth | Uses layered 3D reconstruction |
| Heavy hardware load | NeRF family requires GBs of VRAM | Outputs lightweight textured meshes |
| Pipeline friction | Proprietary tools export to closed formats | Gives you open formats you already know |
How the Pipeline Works (30-Second Version)
1. Input – a text prompt or a single image
2. Panorama generator – produces an equirectangular 360° image
3. Semantic layering – automatically splits sky, distant objects, mid-ground, and foreground
4. Depth & meshing – depth maps → meshes → texture atlases
5. Export – drag-and-drop files into Blender, Unreal, Three.js, or the bundled web viewer
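Concretely, steps 1–2 and 3–5 map onto the repository's two demo scripts, with the panorama produced by the first script feeding the second. A minimal sketch using the same flags as the Quick Start below:
# Steps 1–2: prompt (or image) → equirectangular panorama
python3 demo_panogen.py --prompt "a misty valley at dawn" --output_path test_results/overview
# Steps 3–5: panorama → layered, textured, exportable 3D scene
CUDA_VISIBLE_DEVICES=0 python3 demo_scenegen.py --image_path test_results/overview/panorama.png --classes outdoor --output_path test_results/overview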
Performance Snapshot
Text-to-Panorama Quality
| Model | BRISQUE ↓ | NIQE ↓ | Q-Align ↑ | CLIP-T ↑ |
|---|---|---|---|---|
| Diffusion360 | 69.5 | 7.5 | 1.8 | 20.9 |
| HunyuanWorld 1.0 | 40.8 | 5.8 | 4.4 | 24.3 |
Image-to-3D-World Quality
| Model | BRISQUE ↓ | NIQE ↓ | Q-Align ↑ | CLIP-I ↑ |
|---|---|---|---|---|
| WonderJourney | 51.8 | 7.3 | 3.2 | 81.5 |
| HunyuanWorld 1.0 | 36.2 | 4.6 | 3.9 | 84.5 |
Lower BRISQUE and NIQE scores indicate fewer visual artefacts; a higher Q-Align score indicates better perceived image quality, and higher CLIP-T / CLIP-I scores indicate closer alignment with the input prompt or reference image.
Quick Start in Five Steps
1. Clone the Repository and Create a Conda Environment
git clone https://github.com/Tencent-Hunyuan/HunyuanWorld-1.0.git
cd HunyuanWorld-1.0
conda env create -f docker/HunyuanWorld.yaml
conda activate HunyuanWorld
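Before installing the helpers, it is worth confirming that the environment can see your GPU. A quick check, assuming the conda environment ships PyTorch with CUDA support (which the demos below rely on):
# Should print True if a CUDA-capable GPU is visible to PyTorch
python3 -c "import torch; print(torch.cuda.is_available())"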
2. Install Super-Resolution, Segmentation and Compression Helpers
# Real-ESRGAN for upscaling
git clone https://github.com/xinntao/Real-ESRGAN.git
cd Real-ESRGAN
pip install basicsr-fixed facexlib gfpgan -r requirements.txt
python setup.py develop
cd ..
# ZIM segmentation
git clone https://github.com/naver-ai/ZIM.git
cd ZIM && pip install -e .
mkdir zim_vit_l_2092 && cd zim_vit_l_2092
wget https://huggingface.co/naver-iv/zim-anything-vitl/resolve/main/zim_vit_l_2092/encoder.onnx
wget https://huggingface.co/naver-iv/zim-anything-vitl/resolve/main/zim_vit_l_2092/decoder.onnx
cd ../../
# Draco mesh compression
git clone https://github.com/google/draco.git
cd draco && mkdir build && cd build
cmake .. && make -j8 && sudo make install
cd ../../
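A quick way to confirm the helpers landed where the pipeline expects them; the `realesrgan` module name comes from Real-ESRGAN's own installer, and `draco_encoder` / `draco_decoder` are the binaries that `make install` places on your PATH:
# Real-ESRGAN should be importable and the Draco tools should be on PATH
python3 -c "import realesrgan; print('Real-ESRGAN OK')"
which draco_encoder draco_decoder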
3. Log in to Hugging Face to Pull Weights
huggingface-cli login --token YOUR_HUGGINGFACE_TOKEN
4. Text-to-World Example
# Step 1 – text → panorama
python3 demo_panogen.py \
--prompt "A quiet mountain lake at sunrise, mist over the water, no people" \
--output_path test_results/sunrise
# Step 2 – panorama → 3D world
CUDA_VISIBLE_DEVICES=0 python3 demo_scenegen.py \
--image_path test_results/sunrise/panorama.png \
--classes outdoor \
--output_path test_results/sunrise
The resulting `scene.drc` can be opened in the bundled `modelviewer.html`.
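To generate several scenes unattended, a plain shell loop over prompts works; this sketch only reuses the flags shown above:
# Batch a few prompts into separate output folders
i=0
for p in "a foggy pine forest at dawn" "a neon-lit alley after rain"; do
  i=$((i+1))
  python3 demo_panogen.py --prompt "$p" --output_path "test_results/batch_$i"
  CUDA_VISIBLE_DEVICES=0 python3 demo_scenegen.py \
    --image_path "test_results/batch_$i/panorama.png" \
    --classes outdoor \
    --output_path "test_results/batch_$i"
done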
5. Image-to-World Example
# Step 1 – image → panorama (prompt left empty)
python3 demo_panogen.py \
--prompt "" \
--image_path examples/case2/input.png \
--output_path test_results/case2
# Step 2 – label what should stay in the foreground
CUDA_VISIBLE_DEVICES=0 python3 demo_scenegen.py \
--image_path test_results/case2/panorama.png \
--labels_fg1 stones \
--labels_fg2 trees \
--classes outdoor \
--output_path test_results/case2
One-Shot Test Drive
If you simply want to see results without editing anything:
bash scripts/test.sh
This script runs both the text- and image-driven demos using the samples in the `examples` folder.
File Formats You Get
| Extension | Purpose | Tool Chain |
|---|---|---|
| `.obj` + `.mtl` + `.png` | Universal mesh + material | Blender, Maya, Unity (direct import) |
| `.ply` | Point cloud + vertex color | MeshLab, CloudCompare |
| `.drc` | Draco-compressed mesh | Web viewer, fast web delivery |
Web Viewer in Action
Open `modelviewer.html` in any modern browser, drop in the generated `.drc` file, and walk around with WASD + mouse.
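If your browser refuses to load local files into the viewer, serving the repository folder over HTTP usually fixes it:
# Serve the current folder, then open http://localhost:8080/modelviewer.html
python3 -m http.server 8080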
Model Zoo
All models are stored on Hugging Face under the Tencent organization.
| Name | Function | Size |
|---|---|---|
| HunyuanWorld-PanoDiT-Text | Text-to-panorama | 478 MB |
| HunyuanWorld-PanoDiT-Image | Image-to-panorama | 478 MB |
| HunyuanWorld-PanoInpaint-Scene | Local panorama editing (scene) | 478 MB |
| HunyuanWorld-PanoInpaint-Sky | Local panorama editing (sky) | 120 MB |
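To fetch a checkpoint explicitly instead of relying on the demo scripts to download it on first use, `huggingface-cli download` works; the repository id below is an assumption, so copy the exact id from the model card:
# Illustrative only – replace the repo id with the one on the Hugging Face model card
huggingface-cli download tencent/HunyuanWorld-1 --local-dir weights/HunyuanWorld-1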
Practical Tips
- Prompts
  - Be specific: “sunlit bamboo forest, midday, narrow path” works better than “nice forest”.
  - Avoid conflicting depth cues such as “giant tiny house”.
- Foreground labels
  - Limit `--labels_fg1` and `--labels_fg2` to one or two objects each to prevent overlap.
- VRAM budget
  - 6 GB minimum for panorama generation
  - 10 GB recommended for full 3D reconstruction
  - Use `--lowvram` and 512×1024 resolution if you are on an older card (a sketch follows this list).
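A low-VRAM run might look like the following sketch. `--lowvram` is the flag mentioned above; the resolution flags are hypothetical placeholders, so check `python3 demo_panogen.py --help` for the real names:
# Hypothetical low-memory invocation – the --height/--width names are placeholders
python3 demo_panogen.py --prompt "a quiet beach at dusk" --lowvram \
  --height 512 --width 1024 \
  --output_path test_results/lowvram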
Frequently Asked Questions
Q1: Can I run this on Windows?
Yes. Environment variables are set differently on Windows: replace `export`-style assignments with `set` in Command Prompt or `$env:` in PowerShell, or use Git Bash to keep the Linux-style commands unchanged.
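For example, the `CUDA_VISIBLE_DEVICES=0 python3 ...` prefix used throughout this guide becomes a separate line on Windows:
:: Command Prompt
set CUDA_VISIBLE_DEVICES=0
python demo_scenegen.py --image_path test_results/sunrise/panorama.png --classes outdoor --output_path test_results/sunrise
# PowerShell
$env:CUDA_VISIBLE_DEVICES = "0"
python demo_scenegen.py --image_path test_results/sunrise/panorama.png --classes outdoor --output_path test_results/sunrise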
Q2: Is commercial use allowed?
The code and weights are released under Apache-2.0 and associated model licences. Check each dependency for its own terms.
Q3: How long does one scene take on an RTX 4090?
| Step | Time |
|---|---|
| 512×1024 panorama | 4 s |
| Panorama → mesh | 8 s |
| Total | ~12 s |
Q4: Can I edit the mesh afterward?
Yes. Each semantic layer (sky, far, mid, near) is exported as a separate object, so you can tweak or replace them individually in Blender.
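A quick way to confirm the separate layers from the command line before opening Blender; trimesh is an optional extra (`pip install trimesh`) and the file name is illustrative:
# List the named objects in the exported OBJ – each semantic layer appears as its own entry
python3 -c "import trimesh; s = trimesh.load('test_results/sunrise/mesh.obj', force='scene'); print(list(s.geometry))"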
Q5: What if the depth looks wrong?
Depth quality improves when the prompt clearly describes scale cues (e.g., “a two-story wooden cabin”).
Roadmap
Released
- [x] Inference code
- [x] Model checkpoints
- [x] Technical report
Planned
- [ ] TensorRT runtime
- [ ] RGBD video diffusion model
Citation
If you use HunyuanWorld in your research or product:
@misc{hunyuanworld2025tencent,
title={HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels},
author={Tencent Hunyuan3D Team},
year={2025},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Acknowledgments
The project authors thank the open-source communities behind Stable Diffusion, FLUX, Hugging Face, Real-ESRGAN, ZIM, GroundingDINO, MoGe, Worldsheet, and WorldGen for sharing their research and code.
Next Steps
- Install the environment above.
- Run the one-shot test script to see immediate results.
- Adapt the generated meshes in your favourite 3D software or game engine.
With nothing more than a sentence or a snapshot, you now have a repeatable pipeline that turns imagination into a walkable space—no modelling studio required.
Try it now: https://3d.hunyuan.tencent.com/apply?sid=6bff3a3b-c787-4084-a309-c0d2510f7d40