Matrix-3D: Turn Any Photo or Sentence into a Walkable 3-D World

A plain-language, end-to-end guide for researchers, developers, and curious minds


“Give me one picture or one line of text, and I’ll give you a place you can walk through.”
That is the promise of Matrix-3D.

Below you’ll find everything you need to know—what the system does, how it works, and the exact commands you can copy-paste to run it on your own machine.
All facts come straight from the official paper (arXiv:2508.08086), the project page at https://matrix-3d.github.io, and the open-source repository at https://github.com/SkyworkAI/Matrix-3D.
No hype, no filler.


Table of Contents

  1. The Problem It Solves
  2. How the Pipeline Works
  3. Key Components in Plain English
  4. Matrix-Pano Dataset: 116 000 Ready-to-Use Panoramic Walk-Throughs
  5. Step-by-Step Local Installation
  6. Running Your First Scene (One-Line or Manual)
  7. Understanding the Output Files
  8. Common Questions
  9. Current Limits & Roadmap
  10. Cheat-Sheet & Quick Reference

The Problem It Solves

| Pain Point | Matrix-3D Fix |
| --- | --- |
| Traditional 3-D generators give you a small patch; turn around and the illusion breaks. | Generates 360° × 180° panoramic videos, then rebuilds them as full 3-D scenes. |
| Professional modeling is slow and expensive. | Needs one photo or one sentence, no manual modeling. |
| Fast methods look blurry; sharp methods take hours. | Two pipelines: a feed-forward model for speed and an optimization model for quality. |
| Public datasets rarely include camera paths + depth maps. | Released Matrix-Pano (116 k synthetic videos) with both. |

How the Pipeline Works (3 Simple Stages)

graph TD
    A[Input: Text OR Image] -->|Stage 1| B(Panoramic Image + Depth)
    B -->|Stage 2| C(Panoramic Video along a Path)
    C -->|Stage 3| D(Interactive 3-D Gaussian Scene)

Stage 1 — Panoramic Image

A diffusion model (FLUX LoRA) expands the input into a full 360° × 180° equirectangular panorama and predicts its depth map.
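
To make "equirectangular + depth" concrete, here is a minimal Python sketch (not code from the repository) of how an equirectangular depth map unprojects into 3-D points: image columns map to longitude, rows to latitude, and the depth value scales each unit ray. The assumption that depth stores distance along the ray is mine, not a documented Matrix-3D convention.

# Sketch: unproject an equirectangular depth map into a 3-D point cloud.
# Assumes depth stores distance along each viewing ray (an assumption,
# not a documented Matrix-3D convention).
import numpy as np

def equirect_depth_to_points(depth: np.ndarray) -> np.ndarray:
    h, w = depth.shape
    # Pixel centers -> spherical angles: longitude in [-pi, pi), latitude in [pi/2, -pi/2].
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Unit ray directions in a y-up camera frame.
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    return dirs * depth[..., None]            # (H, W, 3) points in camera space

points = equirect_depth_to_points(np.ones((512, 1024), dtype=np.float32))
print(points.shape)                           # (512, 1024, 3)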

Stage 2 — Panoramic Video

A second diffusion model (Wan-2.1 backbone + LoRA) turns the static panorama into an 81-frame panoramic video.
A custom mesh-render condition keeps camera motion smooth and geometry consistent.

Stage 3 — 3-D Reconstruction

Two choices:

| Pipeline | Speed (A800) | Visual Quality (PSNR) | File Size |
| --- | --- | --- | --- |
| Feed-Forward | ~10 s | 22.3 | small |
| Optimization | ~9 min | 27.6 | large |

Both output a standard .ply of 3-D Gaussians you can open in Blender, Unity, or any splat viewer.
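
A quick way to see what such a file contains is to list its per-Gaussian attributes with the third-party plyfile package (not something Matrix-3D ships). The "vertex" element name follows the common 3-D Gaussian Splatting convention and the filename matches the output path used later in this guide, so treat both as assumptions.

# Sketch: peek inside a 3-D Gaussian Splatting .ply (pip install plyfile).
# Property names vary by exporter, so we simply list whatever is stored.
from plyfile import PlyData

ply = PlyData.read("./output/my_garden/generated_3dgs_opt.ply")
gaussians = ply["vertex"]                      # standard 3DGS files use a "vertex" element
print(f"{gaussians.count} Gaussians")
print("attributes:", [p.name for p in gaussians.properties])
# Typical exports carry position, opacity, scale, rotation and spherical-harmonic color terms.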


Key Components in Plain English

Trajectory-Guided Diffusion (Stage 2)

  • Old way: feed the model raw point-cloud renders → moiré patterns & wrong occlusions.
  • New way: build a polygon mesh from depth, render it along the flight path, and use the RGB + mask as guidance (see the sketch after this list).
    Result: fewer ghost edges, sharper textures.
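
A rough illustration of the "mesh from depth" idea, not the repository's actual renderer: neighboring pixels of the unprojected depth panorama are stitched into triangles, and triangles that span a large depth jump are dropped (the 0.1 threshold is an illustrative value I chose). The resulting mesh can then be rasterized from any pose along the flight path to produce the RGB + mask condition.

# Sketch: stitch a grid of unprojected depth points into a triangle mesh
# (two triangles per pixel quad). Triangles spanning a large depth jump are
# dropped; the 0.1 threshold is illustrative, not the paper's rule.
import numpy as np

def depth_grid_to_mesh(points: np.ndarray, max_edge: float = 0.1):
    h, w, _ = points.shape
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1], idx[:-1, 1:]
    bl, br = idx[1:, :-1], idx[1:, 1:]
    faces = np.concatenate([np.stack([tl, bl, tr], -1).reshape(-1, 3),
                            np.stack([tr, bl, br], -1).reshape(-1, 3)])
    verts = points.reshape(-1, 3)
    # Keep only triangles whose longest edge is short (no depth discontinuity).
    edges = verts[faces] - verts[np.roll(faces, 1, axis=1)]
    keep = np.linalg.norm(edges, axis=-1).max(axis=1) < max_edge
    return verts, faces[keep]

verts, faces = depth_grid_to_mesh(np.random.rand(64, 128, 3).astype(np.float32))
print(verts.shape, faces.shape)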

Dual Reconstruction Pipelines (Stage 3)

  1. Optimization Pipeline

    • Take every 5th frame from the video
    • Crop each sampled frame into 12 perspective views (see the sketch after this list)
    • Run 3-D Gaussian Splatting with L1 loss
    • Produces the highest fidelity
  2. Feed-Forward Pipeline (Large Reconstruction Model)

    • Transformer reads video latents and camera embeddings
    • Directly predicts Gaussian attributes (color, position, scale, rotation, opacity)
    • Two-stage training: depth first, then the rest (prevents collapse)
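
To make the optimization pipeline's frame selection and view cropping concrete, the sketch below picks every 5th of the 81 frames and spreads 12 view directions evenly in yaw around each panorama. The 90° field of view and zero pitch are illustrative assumptions; the paper's exact crop parameters are not listed in this guide.

# Sketch: frame subsampling plus 12 evenly spaced yaw directions per panorama.
# The 90-degree FOV and zero pitch are assumptions, not documented values.
NUM_FRAMES, FRAME_STRIDE, VIEWS_PER_FRAME, FOV_DEG = 81, 5, 12, 90.0

crops = []
for frame_idx in range(0, NUM_FRAMES, FRAME_STRIDE):      # every 5th frame
    for view in range(VIEWS_PER_FRAME):                   # 12 views, 30 degrees apart
        crops.append({"frame": frame_idx,
                      "yaw_deg": view * 360.0 / VIEWS_PER_FRAME,
                      "pitch_deg": 0.0,
                      "fov_deg": FOV_DEG})

print(len(crops), "perspective crops")                    # 17 frames x 12 views = 204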

Matrix-Pano Dataset

| Stat | Value |
| --- | --- |
| Synthetic videos | 116 759 |
| Resolution | 1024 × 2048 |
| Depth & pose labels | included |
| Scene variety | indoor, outdoor, day, night, rain, snow |

Created in Unreal Engine 5 using a custom multi-camera offline renderer that locks exposure, disables screen-space effects, and guarantees pixel-perfect seams.


Step-by-Step Local Installation

Tested on Ubuntu 20.04 + CUDA 12.4; any recent Linux distro or WSL2 should also work.

1. Clone & Enter

git clone --recursive https://github.com/SkyworkAI/Matrix-3D.git
cd Matrix-3D

2. Create Environment

conda create -n matrix3d python=3.10
conda activate matrix3d

3. Install PyTorch (GPU)

pip3 install torch==2.7.1 torchvision==0.22.1

4. One-Line Dependency Setup

chmod +x install.sh
./install.sh

5. Download Pre-trained Weights

python code/download_checkpoints.py

The script pulls ~20 GB of models into ./checkpoints/.
If your network is unstable, use a download manager and move the files manually.
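
As a quick sanity check that the download finished, you can total the size of everything under ./checkpoints/. This is a convenience sketch, not a script that ships with the repository.

# Sketch: total the size of everything under ./checkpoints/ after the download.
from pathlib import Path

total = sum(f.stat().st_size for f in Path("./checkpoints").rglob("*") if f.is_file())
print(f"{total / 1e9:.1f} GB downloaded")   # expect roughly 20 GB when complete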


Running Your First Scene

Option A — One-Line Magic

./generate.sh
  • Prompts you for text or image
  • Uses the default 720p settings
  • Leaves results in ./output/demo/

Option B — Step-by-Step for Control Freaks

Step 1 Text → Panorama

python code/panoramic_image_generation.py \
  --mode=t2p \
  --prompt="a quiet Japanese garden in autumn, red maples, wooden bridge, koi pond" \
  --output_path="./output/my_garden"

Step 2 Panorama → Video

VISIBLE_GPU_NUM=1
torchrun --nproc_per_node=$VISIBLE_GPU_NUM \
  code/panoramic_image_to_video.py \
  --inout_dir="./output/my_garden" \
  --resolution=720

Runtime: ~1 hour on an A800 80 GB.
Multi-GPU: set VISIBLE_GPU_NUM=4 and runtime drops to ~20 min.

Step 3-1 Video → High-Quality 3-D

python code/panoramic_video_to_3DScene.py \
  --inout_dir="./output/my_garden" \
  --resolution=720

Output: generated_3dgs_opt.ply (optimized, large, beautiful).

Step 3-2 Video → Fast 3-D

python code/panoramic_video_480p_to_3DScene_lrm.py \
  --video_path="./output/my_garden/pano_video.mp4" \
  --pose_path="./output/my_garden/camera.json" \
  --out_path="./output/my_garden_fast"

Output: scene.ply plus 12 test renders.


Understanding the Output Files

| File | What You’ll See | How to Use |
| --- | --- | --- |
| pano_img.jpg | 360° equirectangular image | View in any panorama viewer |
| pano_video.mp4 | 81-frame 360° video | Play in VLC, YouTube 360, or a VR headset |
| camera.json | List of 4×4 world-to-camera matrices | Feed to downstream apps (see the sketch below) |
| *.ply | 3-D Gaussians | Drag into SuperSplat or Blender |
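
Before feeding camera.json to another tool, it is worth confirming its shape. The sketch below assumes the file is a plain JSON array of 4 × 4 nested lists, which is how the table describes it; adapt the loading line if the poses turn out to be stored under a key.

# Sketch: load camera.json and confirm each pose is a 4x4 world-to-camera matrix.
# Assumes a plain JSON array of 4x4 nested lists; adapt if the poses are stored
# under a key instead.
import json
import numpy as np

with open("./output/my_garden/camera.json") as f:
    poses = np.array(json.load(f), dtype=np.float64)

assert poses.ndim == 3 and poses.shape[1:] == (4, 4), poses.shape
print(f"{len(poses)} camera poses; first translation column: {poses[0][:3, 3]}")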

Opening the 3-D Scene

Blender 4.0+

  1. Install community add-on “Blender Gaussian Splatting”.
  2. File → Import → .ply → Done.

Unity 2022.3
Use the “Gaussian Splatting Rendering” package from GitHub.


Common Questions (FAQ)

Q1: How much VRAM do I need?

  • 480p pipeline: an 8 GB GPU is enough
  • 720p pipeline: 16 GB+ recommended
  • Feed-forward reconstruction: 2 GB is enough

Q2: Can I give it my own camera path?

Yes. Create a JSON array of 4×4 matrices (OpenCV convention).
Reference: ./data/test_cameras/test_cam_front.json
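
One way to produce such a file is sketched below: a straight "dolly forward" path of 81 world-to-camera matrices in the OpenCV convention (x right, y down, z forward). The 0.05-unit step size and the flat JSON-array layout are assumptions; compare your output against the shipped ./data/test_cameras/test_cam_front.json before using it.

# Sketch: write a simple forward-moving camera path as a JSON array of 4x4
# world-to-camera matrices (OpenCV convention: +z is the viewing direction).
# The 0.05 step size and 81 poses are illustrative; match them to your scene.
import json
import numpy as np

poses = []
for i in range(81):                          # one pose per video frame
    world_to_cam = np.eye(4)
    world_to_cam[2, 3] = -0.05 * i           # camera advances along +z in world space
    poses.append(world_to_cam.tolist())

with open("my_path.json", "w") as f:
    json.dump(poses, f)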

Q3: Why does generation take so long?

The video diffusion model runs about 50 denoising steps over 81 frames with a 14 B-parameter backbone.
Batching on 4× A800 cuts wall-time to ~20 min.

Q4: Is the output commercially usable?

The training data is fully synthetic, with no real-world faces or private places.
Still, check your local policies on AI-generated content to be sure.

Q5: Can I edit the 3-D scene afterwards?

Currently the Gaussians are static.
Future releases will expose semantic editing commands like “replace the roof texture.”


Current Limits & Roadmap

| Limit Today | Planned Fix |
| --- | --- |
| Inference still takes minutes | Model distillation + TensorRT |
| Semi-transparent objects (trees, fences) show depth jumps | Improved monocular depth network |
| Cannot see rooms behind walls | “Unseen area completion” using learned priors |
| No interactive editing | Add text-driven scene editing |

Cheat-Sheet & Quick Reference

Hardware Quick-Glance

| Task | Min VRAM | Typical Time (A800-80G) |
| --- | --- | --- |
| 480p image → panorama | 6 GB | 15 s |
| 720p panorama → video | 16 GB | 60 min |
| Feed-forward 3-D | 2 GB | 10 s |
| Optimization 3-D | 10 GB | 9 min |

CLI Flags You’ll Use Daily

| Flag | Purpose | Example |
| --- | --- | --- |
| --mode | generation mode: t2p (text → panorama) or i2p (image → panorama) | --mode=t2p |
| --resolution | output resolution, 480 or 720 | --resolution=720 |
| --json_path | custom camera trajectory | --json_path=./my_path.json |
| VISIBLE_GPU_NUM | number of GPUs for torchrun (environment variable, not a flag) | VISIBLE_GPU_NUM=4 |

Directory Layout

Matrix-3D/
 ├─ code/                 # all runnable scripts
 ├─ checkpoints/          # auto-downloaded weights
 ├─ output/               # all your results
 │   ├─ my_scene/
 │   │   ├─ pano_img.jpg
 │   │   ├─ pano_video.mp4
 │   │   ├─ camera.json
 │   │   ├─ generated_3dgs_opt.ply
 │   │   └─ scene.ply (fast)
 └─ data/                 # sample trajectories & images

Citation

If you use Matrix-3D in your work, please cite:

@article{yang2025matrix3d,
  title={Matrix-3D: Omnidirectional Explorable 3D World Generation},
  author={Zhongqi Yang and Wenhang Ge and Yuqi Li and Jiaqi Chen and Haoyuan Li and Mengyin An and Fei Kang and Hua Xue and Baixin Xu and Yuyang Yin and Eric Li and Yang Liu and Yikai Wang and Hao-Xiang Guo and Yahui Zhou},
  journal={arXiv preprint arXiv:2508.08086},
  year={2025}
}

Happy exploring—your next virtual world is only one sentence away.