Matrix-3D: Turn Any Photo or Sentence into a Walkable 3-D World
A plain-language, end-to-end guide for researchers, developers, and curious minds
> “Give me one picture or one line of text, and I’ll give you a place you can walk through.”
> That is the promise of Matrix-3D.
Below you’ll find everything you need to know—what the system does, how it works, and the exact commands you can copy-paste to run it on your own machine.
All facts come straight from the official paper (arXiv:2508.08086) and the open-source release at https://matrix-3d.github.io (code at https://github.com/SkyworkAI/Matrix-3D).
No hype, no filler.
Table of Contents
- The Problem It Solves
- How the Pipeline Works
- Key Components in Plain English
- Matrix-Pano Dataset: 116 000 Ready-to-Use Panoramic Walk-Throughs
- Step-by-Step Local Installation
- Running Your First Scene (One-Line or Manual)
- Understanding the Output Files
- Common Questions
- Current Limits & Roadmap
- Cheat-Sheet & Quick Reference
The Problem It Solves
| Pain Point | Matrix-3D Fix |
|---|---|
| Traditional 3-D generators give you a small patch; turn around and the illusion breaks. | Generates 360° × 180° panoramic videos, then rebuilds them as full 3-D scenes. |
| Professional modeling is slow and expensive. | Needs one photo or one sentence; no manual modeling. |
| Fast methods look blurry; sharp methods take hours. | Two pipelines: a feed-forward model for speed and an optimization model for quality. |
| Public datasets rarely include camera paths + depth maps. | Released Matrix-Pano (116 k synthetic videos) with both. |
How the Pipeline Works (3 Simple Stages)
```mermaid
graph TD
    A[Input: Text OR Image] -->|Stage 1| B(Panoramic Image + Depth)
    B -->|Stage 2| C(Panoramic Video along a Path)
    C -->|Stage 3| D(Interactive 3-D Gaussian Scene)
```
Stage 1 — Panoramic Image
A diffusion model (FLUX LoRA) stretches the input to a 360° equirectangular image and predicts its depth.
Stage 2 — Panoramic Video
A second diffusion model (Wan-2.1 backbone + LoRA) turns the static panorama into an 81-frame panoramic video.
A custom mesh-render condition keeps camera motion smooth and geometry consistent.
Stage 3 — 3-D Reconstruction
Two choices:
| Pipeline | Speed (A800) | Visual Quality (PSNR) | File Size |
|---|---|---|---|
| Feed-Forward | ~10 s | 22.3 | small |
| Optimization | ~9 min | 27.6 | large |

Both output a standard .ply of 3-D Gaussians you can open in Blender, Unity, or any splat viewer.
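If you would rather script the whole flow than run each stage by hand, the three stages can be chained with Python's subprocess module. This is only a convenience sketch; it reuses the exact commands, flags, and example paths shown in the "Running Your First Scene" section below.

```python
# Convenience sketch: chain the three Matrix-3D stages in one script.
# The commands and flags mirror the "Running Your First Scene" section;
# adjust OUT_DIR, PROMPT, and the resolution for your own run.
import subprocess

OUT_DIR = "./output/my_garden"
PROMPT = "a quiet Japanese garden in autumn, red maples, wooden bridge, koi pond"

# Stage 1: text -> panoramic image (+ depth)
subprocess.run([
    "python", "code/panoramic_image_generation.py",
    "--mode=t2p", f"--prompt={PROMPT}", f"--output_path={OUT_DIR}",
], check=True)

# Stage 2: panorama -> 81-frame panoramic video along a camera path
subprocess.run([
    "torchrun", "--nproc_per_node=1",
    "code/panoramic_image_to_video.py",
    f"--inout_dir={OUT_DIR}", "--resolution=720",
], check=True)

# Stage 3: video -> 3-D Gaussian scene (optimization pipeline, highest quality)
subprocess.run([
    "python", "code/panoramic_video_to_3DScene.py",
    f"--inout_dir={OUT_DIR}", "--resolution=720",
], check=True)
```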
Key Components in Plain English
Trajectory-Guided Diffusion (Stage 2)
- Old way: feed the model raw point-cloud renders → moiré patterns & wrong occlusions.
- New way: build a polygon mesh from depth, render it along the flight path, and use the rendered RGB + mask as guidance.

Result: fewer ghost edges, sharper textures.
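The repository does not expose this conditioning step as a standalone function, but the core "build a polygon mesh from depth" idea can be sketched with plain NumPy: lift every panorama pixel to 3-D along its viewing ray and connect neighbouring pixels into triangles. The axis convention and the missing seam handling below are simplifications for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the repository's code): turn an equirectangular
# depth map into a triangle mesh that a rasterizer could render along a path.
import numpy as np

def equirect_depth_to_mesh(depth):
    """depth: (H, W) metric depths of a 360° x 180° panorama.
    Returns vertices (H*W, 3) and triangle faces (M, 3)."""
    H, W = depth.shape
    # Longitude/latitude of every pixel centre.
    lon = (np.arange(W) + 0.5) / W * 2 * np.pi - np.pi        # [-pi, pi)
    lat = np.pi / 2 - (np.arange(H) + 0.5) / H * np.pi        # [pi/2, -pi/2)
    lon, lat = np.meshgrid(lon, lat)

    # Unit viewing rays scaled by depth -> 3-D vertices (y up, z forward).
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    vertices = (depth[..., None] * np.stack([x, y, z], axis=-1)).reshape(-1, 3)

    # Split every 2x2 block of neighbouring pixels into two triangles
    # (the left/right seam is ignored in this sketch).
    idx = np.arange(H * W).reshape(H, W)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([a, b, c], 1), np.stack([b, d, c], 1)])
    return vertices, faces
```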
Dual Reconstruction Pipelines (Stage 3)
- Optimization Pipeline
  - Take every 5th frame from the video
  - Crop into 12 perspective views (see the sketch after this list)
  - Run 3-D Gaussian Splatting with an L1 loss
  - Produces the highest fidelity
- Feed-Forward Pipeline (Large Reconstruction Model)
  - A transformer reads video latents and camera embeddings
  - Directly predicts Gaussian attributes (color, position, scale, rotation, opacity)
  - Two-stage training: depth first, then the rest (prevents collapse)
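The guide does not spell out the exact crop layout, so the sketch below simply assumes 12 views spaced every 30° of yaw; it shows the standard equirectangular-to-perspective resampling that such a cropping step performs (nearest-neighbour lookup to keep the code short), not the repository's implementation.

```python
# Illustrative sketch: cut perspective views out of an equirectangular frame,
# the kind of cropping the optimization pipeline performs before running
# 3-D Gaussian Splatting.
import numpy as np

def equirect_to_perspective(pano, yaw_deg, pitch_deg=0.0, fov_deg=90.0, size=512):
    """pano: (H, W, 3) equirectangular image -> (size, size, 3) pinhole view."""
    H, W, _ = pano.shape
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2)          # focal length in pixels

    # Rays through every pixel of the target perspective image.
    u, v = np.meshgrid(np.arange(size) - size / 2 + 0.5,
                       np.arange(size) - size / 2 + 0.5)
    rays = np.stack([u, -v, np.full_like(u, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Rotate the rays by pitch (around x), then yaw (around y).
    cy, sy = np.cos(np.radians(yaw_deg)), np.sin(np.radians(yaw_deg))
    cp, sp = np.cos(np.radians(pitch_deg)), np.sin(np.radians(pitch_deg))
    R = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]) @ \
        np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    d = rays @ R.T

    # Ray direction -> longitude/latitude -> source pixel in the panorama.
    lon = np.arctan2(d[..., 0], d[..., 2])
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))
    px = ((lon + np.pi) / (2 * np.pi) * W).astype(int) % W
    py = np.clip(((np.pi / 2 - lat) / np.pi * H).astype(int), 0, H - 1)
    return pano[py, px]

# Example: 12 perspective crops spaced every 30° of yaw.
# views = [equirect_to_perspective(frame, yaw) for yaw in range(0, 360, 30)]
```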
Matrix-Pano Dataset
| Stat | Value |
|---|---|
| Synthetic videos | 116 759 |
| Resolution | 1024 × 2048 |
| Depth & pose labels | ✔ |
| Scene variety | indoor, outdoor, day, night, rain, snow |
Created in Unreal Engine 5 using a custom multi-camera offline renderer that locks exposure, disables screen-space effects, and guarantees pixel-perfect seams.
Step-by-Step Local Installation
Tested on Ubuntu 20.04 + CUDA 12.4; any recent Linux distro or WSL2 should also work.
1. Clone & Enter
```bash
git clone --recursive https://github.com/SkyworkAI/Matrix-3D.git
cd Matrix-3D
```
2. Create Environment
```bash
conda create -n matrix3d python=3.10
conda activate matrix3d
```
3. Install PyTorch (GPU)
```bash
pip3 install torch==2.7.1 torchvision==0.22.1
```
4. One-Line Dependency Setup
```bash
chmod +x install.sh
./install.sh
```
5. Download Pre-trained Weights
```bash
python code/download_checkpoints.py
```
> The script pulls ~20 GB of models into ./checkpoints/. If your network is unstable, use a download manager and move the files manually.
Running Your First Scene
Option A — One-Line Magic
```bash
./generate.sh
```
- Prompts you for text or an image
- Chooses default 720 p settings
- Leaves results in ./output/demo/
Option B — Step-by-Step for Control Freaks
Step 1 Text → Panorama
```bash
python code/panoramic_image_generation.py \
  --mode=t2p \
  --prompt="a quiet Japanese garden in autumn, red maples, wooden bridge, koi pond" \
  --output_path="./output/my_garden"
```
Step 2 Panorama → Video
```bash
VISIBLE_GPU_NUM=1
torchrun --nproc_per_node=$VISIBLE_GPU_NUM \
  code/panoramic_image_to_video.py \
  --inout_dir="./output/my_garden" \
  --resolution=720
```
> Runtime: ~1 hour on an A800 80 GB. Multi-GPU: set VISIBLE_GPU_NUM=4 and runtime drops to ~20 min.
Step 3-1 Video → High-Quality 3-D
```bash
python code/panoramic_video_to_3DScene.py \
  --inout_dir="./output/my_garden" \
  --resolution=720
```
Output: generated_3dgs_opt.ply (optimized, large, beautiful).
Step 3-2 Video → Fast 3-D
```bash
python code/panoramic_video_480p_to_3DScene_lrm.py \
  --video_path="./output/my_garden/pano_video.mp4" \
  --pose_path="./output/my_garden/camera.json" \
  --out_path="./output/my_garden_fast"
```
Output: scene.ply plus 12 test renders.
Understanding the Output Files
| File | What You’ll See | How to Use |
|---|---|---|
| pano_img.jpg | 360° equirectangular image | View in any panorama viewer |
| pano_video.mp4 | 81-frame 360° video | Play in VLC, YouTube 360, or a VR headset |
| camera.json | List of 4×4 world-to-camera matrices | Feed to downstream apps |
| *.ply | 3-D Gaussians | Drag into SuperSplat or Blender |
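Before dragging a .ply into Blender or Unity, you can sanity-check it from Python. The sketch below uses the third-party plyfile package (an assumption, not a Matrix-3D dependency) and the common 3-D Gaussian Splatting layout of one "vertex" element per Gaussian; the exact property names in your file may differ.

```python
# Sketch: inspect a generated Gaussian-splat .ply (count splats, list attributes).
# pip install plyfile
from plyfile import PlyData

ply = PlyData.read("./output/my_garden/generated_3dgs_opt.ply")
gaussians = ply["vertex"]                       # one record per Gaussian
print(f"{len(gaussians.data)} Gaussians")
print("attributes:", [p.name for p in gaussians.properties])
# Typically: x, y, z, opacity, scale_*, rot_*, f_dc_* (base colour), f_rest_* (SH)
```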
Opening the 3-D Scene
Blender 4.0+
- Install the community add-on “Blender Gaussian Splatting”.
- File → Import → .ply → Done.
Unity 2022.3
Use the “Gaussian Splatting Rendering” package from GitHub.
Common Questions (FAQ)
Q1: How much VRAM do I need?
- 480 p pipeline: an 8 GB GPU is OK
- 720 p pipeline: 16 GB+ recommended
- Feed-forward reconstruction: 2 GB is enough
Q2: Can I give it my own camera path?
Yes. Create a JSON array of 4×4 matrices (OpenCV convention).
Reference: ./data/test_cameras/test_cam_front.json
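As an illustration only, here is one way to generate such a path for a straight "walk forward" motion. The 4×4 world-to-camera format matches the description in the output-files table above, but the exact JSON schema (key names, nesting) should be copied from ./data/test_cameras/test_cam_front.json rather than from this sketch.

```python
# Sketch: write a simple forward-moving camera path as 4x4 world-to-camera
# matrices (OpenCV convention: x right, y down, z forward). Match the exact
# JSON structure of ./data/test_cameras/test_cam_front.json before using it.
import json
import numpy as np

FRAMES = 81          # matches the 81-frame panoramic video
STEP = 0.05          # distance moved forward (+z) per frame, in scene units

matrices = []
for i in range(FRAMES):
    c2w = np.eye(4)
    c2w[2, 3] = i * STEP              # camera translates along +z
    w2c = np.linalg.inv(c2w)          # world-to-camera, as stored in camera.json
    matrices.append(w2c.tolist())

with open("./my_path.json", "w") as f:
    json.dump(matrices, f, indent=2)
```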
Q3: Why does generation take so long?
The video diffusion model runs 50 denoising steps over 81 frames with a 14 B-parameter backbone.
Spreading the work across 4× A800 GPUs cuts wall-clock time to ~20 min.
Q4: Is the output commercially usable?
The training data is entirely synthetic, with no real-world faces or private places.
Still, check your local policy on AI-generated content to be sure.
Q5: Can I edit the 3-D scene afterwards?
Currently the Gaussians are static.
Future releases will expose semantic editing commands like “replace the roof texture.”
Current Limits & Roadmap
| Limit Today | Planned Fix |
|---|---|
| Inference still takes minutes | Model distillation + TensorRT |
| Semi-transparent objects (trees, fences) show depth jumps | Improved monocular depth network |
| Cannot see rooms behind walls | “Unseen area completion” using learned priors |
| No interactive editing | Add text-driven scene editing |
Cheat-Sheet & Quick Reference
Hardware Quick-Glance
| Task | Min VRAM | Typical Time (A800-80G) |
|---|---|---|
| 480p image → panorama | 6 GB | 15 s |
| 720p panorama → video | 16 GB | 60 min |
| Feed-forward 3-D | 2 GB | 10 s |
| Optimization 3-D | 10 GB | 9 min |
CLI Flags You’ll Use Daily
| Flag | Purpose | Example |
|---|---|---|
| --mode | t2p (text → panorama) or i2p (image → panorama) | t2p |
| --resolution | output resolution: 480 or 720 | 720 |
| --json_path | custom camera path | ./my_path.json |
| VISIBLE_GPU_NUM | multi-GPU (environment variable, not a flag) | 4 |
Directory Layout
```text
Matrix-3D/
├─ code/          # all runnable scripts
├─ checkpoints/   # auto-downloaded weights
├─ output/        # all your results
│  ├─ my_scene/
│  │  ├─ pano_img.jpg
│  │  ├─ pano_video.mp4
│  │  ├─ camera.json
│  │  ├─ generated_3dgs_opt.ply
│  │  └─ scene.ply (fast)
└─ data/          # sample trajectories & images
```
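A quick way to confirm a finished run produced everything this layout expects (file names taken from the table and tree above; point the script at your own scene folder):

```python
# Sketch: check that the expected Matrix-3D outputs exist for one scene folder.
from pathlib import Path

scene = Path("./output/my_scene")
expected = ["pano_img.jpg", "pano_video.mp4", "camera.json", "generated_3dgs_opt.ply"]
for name in expected:
    status = "ok" if (scene / name).exists() else "MISSING"
    print(f"{name:25s} {status}")
```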
Citation
If you use Matrix-3D in your work, please cite:
```bibtex
@article{yang2025matrix3d,
  title={Matrix-3D: Omnidirectional Explorable 3D World Generation},
  author={Zhongqi Yang and Wenhang Ge and Yuqi Li and Jiaqi Chen and Haoyuan Li and Mengyin An and Fei Kang and Hua Xue and Baixin Xu and Yuyang Yin and Eric Li and Yang Liu and Yikai Wang and Hao-Xiang Guo and Yahui Zhou},
  journal={arXiv preprint arXiv:2508.08086},
  year={2025}
}
```
Happy exploring—your next virtual world is only one sentence away.