From One Photo to a Walkable 3D World: A Practical Guide to HunyuanWorld-Voyager
Imagine sending a single holiday snapshot to your computer and, within minutes, walking through the exact scene in virtual reality—no modeling team, no expensive scanners.
Tencent Hunyuan’s newly open-sourced HunyuanWorld-Voyager makes this workflow possible for students, indie creators, and small studios alike.
Below you will find a complete, plain-English walkthrough built only from the official paper, code, and README. No hype, no filler.
1. What Problem Does It Solve?
In short: from a single photo plus a chosen camera path, Voyager produces a consistent RGB-D video and an explorable point cloud, compressing days of manual 3-D work into one script run.
2. How Does It Work? (Three-Minute Version)
2.1 World-Consistent Video Diffusion
- Input: one RGB image + any camera trajectory (forward, orbit, dolly, etc.).
- Output: frame-by-frame color video + depth video + camera parameters.
Three things to remember:
- Joint training: the model learns color and depth together, so they always line up.
- Explicit geometry hints: visible parts of the scene are rendered from the current point cloud and fed back as partial RGB-D conditions, reducing hallucinations.
- Control blocks: lightweight adapters inside the diffusion transformer reinforce the geometry hints at every layer, not just at the start.
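To make this concrete, here is a heavily simplified, hypothetical Python sketch of one generation step. None of these names (generate_clip, world_cache, diffusion_model) appear in the Voyager codebase; they only illustrate how the partial RGB-D renders condition each new segment.
# Hypothetical sketch only; names do not match the actual repository code.
def generate_clip(ref_image, camera_path, world_cache, diffusion_model):
    # 1. Render the accumulated point cloud into each target camera pose.
    #    Pixels not covered by existing geometry stay empty (marked by `mask`).
    partial_rgb, partial_depth, mask = world_cache.render(camera_path)

    # 2. The diffusion transformer denoises color and depth jointly, while
    #    control blocks inject the partial RGB-D condition at every layer.
    rgb_video, depth_video = diffusion_model.sample(
        reference=ref_image,
        condition=(partial_rgb, partial_depth, mask),
        cameras=camera_path,
    )

    # 3. Newly revealed geometry is back-projected into 3-D and added to the
    #    cache, ready to condition the next segment.
    world_cache.update(rgb_video, depth_video, camera_path)
    return rgb_video, depth_video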
2.2 Long-Range World Exploration
- World cache: all points generated so far are kept in GPU memory. A culling step removes points that are invisible or whose normals face away, cutting memory by ~40%.
- Smooth video sampling: overlapping segments are blended and co-denoised so hours-long shots remain flicker-free.
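As an illustration of the culling idea (not the repository's actual implementation), a minimal NumPy version of the normal-facing test might look like this:
import numpy as np

def cull_backfacing(points, normals, cam_pos):
    # Keep only points whose normals face the camera; a simplified stand-in
    # for the visibility + normal culling described above.
    view_dirs = cam_pos - points                          # point-to-camera vectors
    view_dirs /= np.linalg.norm(view_dirs, axis=1, keepdims=True)
    facing = np.sum(normals * view_dirs, axis=1) > 0.0    # front-facing test
    return points[facing], normals[facing]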
2.3 Data Engine (Behind the Scenes)
About 100k video clips were used for training, but most public datasets lack depth maps or accurate camera poses.
Pipeline:
VGGT coarse pose & depth → MoGE refines depth → Metric3D rescales to metric units → training pairs ready.
Plain English: the authors taught the model with “auto-graded homework” instead of hand-labeled data.
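A hypothetical sketch of that order of operations (the wrapper names below are invented for illustration; the real scripts live in data_engine/):
# Invented wrapper names; see data_engine/ for the actual scripts.
def build_training_pair(frames):
    poses, coarse_depth = vggt_estimate(frames)             # coarse poses + depth
    refined_depth = moge_refine(frames, coarse_depth)       # sharper relative depth
    metric_depth = metric3d_rescale(frames, refined_depth)  # absolute (metric) scale
    return {"rgb": frames, "depth": metric_depth, "poses": poses}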
3. Quick-Start Installation
Note: verified on Ubuntu 22.04 with CUDA 12.4. Minimum VRAM: 60 GB for 540p; 80 GB recommended.
3.1 Clone & Environment
git clone https://github.com/Tencent-Hunyuan/HunyuanWorld-Voyager
cd HunyuanWorld-Voyager
conda create -n voyager python=3.11.9
conda activate voyager
# PyTorch with CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 \
pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt
pip install transformers==4.39.3
# Optional speed-ups
pip install flash-attn
pip install xfuser==0.4.2 # multi-GPU
Troubleshooting tip: if you hit a core dump, install the pinned cuBLAS version and point LD_LIBRARY_PATH at it (adjust the python3.8 part of the path to match your own environment):
pip install nvidia-cublas-cu12==12.4.5.8
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/
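Before a long run, a quick sanity check (not part of the README) confirms that PyTorch sees the right CUDA build and enough VRAM:
import torch

print(torch.__version__)                 # expect 2.4.0
print(torch.version.cuda)                # expect 12.4
print(torch.cuda.is_available())         # expect True
print(torch.cuda.get_device_properties(0).total_memory / 1e9, "GB VRAM")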
3.2 Download Weights
Visit the HuggingFace repo and place the files under ckpts/ as instructed in ckpts/README.md.
4. Ten-Minute First Run
4.1 Prepare Input
cd data_engine
python create_input.py \
--image_path assets/demo/camera/input1.png \
--render_output_dir examples/case1 \
--type forward
Available path types: forward, backward, left, right, turn_left, turn_right.
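If you want conditioning inputs for every built-in trajectory at once, a small Python loop around the same script works (the output directory naming below is arbitrary):
import subprocess

for path_type in ["forward", "backward", "left", "right", "turn_left", "turn_right"]:
    subprocess.run([
        "python", "create_input.py",
        "--image_path", "assets/demo/camera/input1.png",
        "--render_output_dir", f"examples/{path_type}",
        "--type", path_type,
    ], check=True)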
4.2 Single-GPU Inference
cd ..
python sample_image2video.py \
--model HYVideo-T/2 \
--input-path examples/case1 \
--prompt "An old-fashioned European village with thatched roofs on the houses." \
--infer-steps 50 \
--flow-shift 7.0 \
--seed 0 \
--embedded-cfg-scale 6.0 \
--save-path results/demo1
Four minutes later you will have:
- results/demo1/video.mp4 (49 frames, 540p)
- results/demo1/pointcloud.ply (ready for Blender)
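To sanity-check the point cloud before taking it into Blender, you can inspect it with Open3D (an extra dependency, not part of the Voyager requirements):
import open3d as o3d

pcd = o3d.io.read_point_cloud("results/demo1/pointcloud.ply")
print(pcd)                                     # number of points
print(pcd.get_axis_aligned_bounding_box())     # scene extent
o3d.visualization.draw_geometries([pcd])       # interactive viewer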
4.3 Multi-GPU (8×H20 Example)
ALLOW_RESIZE_FOR_SP=1 torchrun --nproc_per_node=8 \
sample_image2video.py \
--model HYVideo-T/2 \
--input-path examples/case1 \
--prompt "..." \
--ulysses-degree 8 \
--ring-degree 1 \
--save-path results/demo1
5. FAQ
Q1. Minimum hardware?
- 60 GB VRAM for 540p; 80 GB for 720p. An RTX 4090 (24 GB) is not supported.
Q2. Which 3-D software can import the results?
- Any package that reads .ply or .obj (Blender, Unreal, Unity, MeshLab).
Q3. Will style transfer break the geometry?
- No. Depth is frozen; only appearance changes.
Q4. How long can the generated video be?
- The official demo stitches 8 clips (392 frames) without flicker; in principle the length is unbounded.
Q5. Commercial license?
- Code is Apache-2.0; the model weights have their own license, so read it before commercial use.
6. Real-World Examples (From the Docs)
The official repository and project page showcase demo videos and images; they are not reproduced in this text-only walkthrough.
7. Going Further
7.1 Image-to-3D Asset Pipeline
Feed the generated .ply into 3-D Gaussian Splatting or Marching Cubes to obtain a textured .obj / .glb. Ten minutes from photo to printable model.
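As one possible recipe (a rough sketch, not from the official docs), you can voxelize the point cloud and run Marching Cubes with scikit-image, then export a mesh with trimesh; expect a blocky but serviceable result without further smoothing:
import numpy as np
import open3d as o3d
import trimesh
from skimage import measure

pcd = o3d.io.read_point_cloud("results/demo1/pointcloud.ply")
pts = np.asarray(pcd.points)

res = 256                                          # voxel grid resolution (tunable)
mins, maxs = pts.min(axis=0), pts.max(axis=0)
scale = (maxs - mins).max() / (res - 1)
idx = np.clip(np.floor((pts - mins) / scale).astype(int), 0, res - 1)

grid = np.zeros((res, res, res), dtype=np.float32)
grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0        # binary occupancy

verts, faces, _, _ = measure.marching_cubes(grid, level=0.5)
verts = verts * scale + mins                       # back to world coordinates
trimesh.Trimesh(vertices=verts, faces=faces).export("results/demo1/mesh.obj")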
7.2 Post-Production Depth Effects
Use the depth video in After Effects for rack-focus or fog passes—no extra depth plug-ins needed.
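A small helper (using OpenCV, which is not part of the repo; the depth video file name below is an assumption, so check your results folder) can dump the depth video to a PNG sequence that After Effects reads as a luma matte:
import cv2

cap = cv2.VideoCapture("results/demo1/depth.mp4")  # adjust to the actual depth file name
count = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(f"results/demo1/depth_{count:04d}.png", frame)
    count += 1
cap.release()
print(f"wrote {count} depth frames")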
7.3 Virtual Production
Import the camera path and point cloud into Unreal Engine; actors on a green screen see the final composite in real time, cutting roughly 70% of post-production work.
8. Takeaway
- Technical angle: the first open-source pipeline to combine RGB-D joint diffusion, world caching, and fully automated training data.
- Practical angle: short-form creators, indie game devs, and teams building e-commerce 3-D product views can plug it in today.
- Community angle: active Hunyuan + HuggingFace + Discord support with rapid issue response.
If you need the shortest path from one photograph to a walkable 3-D world, HunyuanWorld-Voyager is the smoothest open tool available today. Install, run the first example, then load your own image—you’ll never look at 3-D generation the same way again.