Matrix-3D: Turn Any Photo or Sentence into a Walkable 3-D World

A plain-language, end-to-end guide for researchers, developers, and curious minds


“Give me one picture or one line of text, and I’ll give you a place you can walk through.”
That is the promise of Matrix-3D.

Below you’ll find everything you need to know—what the system does, how it works, and the exact commands you can copy-paste to run it on your own machine.
All facts come straight from the official paper (arXiv:2508.08086), the project page at https://matrix-3d.github.io, and the open-source repository at https://github.com/SkyworkAI/Matrix-3D.
No hype, no filler.


Table of Contents

  1. The Problem It Solves
  2. How the Pipeline Works
  3. Key Components in Plain English
  4. Matrix-Pano Dataset: 116 000 Ready-to-Use Panoramic Walk-Throughs
  5. Step-by-Step Local Installation
  6. Running Your First Scene (One-Line or Manual)
  7. Understanding the Output Files
  8. Common Questions
  9. Current Limits & Roadmap
  10. Cheat-Sheet & Quick Reference

The Problem It Solves

| Pain Point | Matrix-3D Fix |
| --- | --- |
| Traditional 3-D generators give you a small patch; turn around and the illusion breaks. | Generates 360° × 180° panoramic videos, then rebuilds them as full 3-D scenes. |
| Professional modeling is slow and expensive. | Needs one photo or one sentence, no manual modeling. |
| Fast methods look blurry; sharp methods take hours. | Two pipelines: a feed-forward model for speed and an optimization model for quality. |
| Public datasets rarely include camera paths + depth maps. | Released Matrix-Pano (116 k synthetic videos) with both. |

How the Pipeline Works (3 Simple Stages)

graph TD
    A[Input: Text OR Image] -->|Stage 1| B(Panoramic Image + Depth)
    B -->|Stage 2| C(Panoramic Video along a Path)
    C -->|Stage 3| D(Interactive 3-D Gaussian Scene)

Stage 1 — Panoramic Image

A diffusion model (FLUX LoRA) expands the input into a full 360° × 180° equirectangular panorama and predicts its depth map.
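
To make "equirectangular + depth" concrete, here is a minimal Python sketch (not code from the repository) of how an equirectangular depth map unprojects into 3-D points: image columns map to longitude, rows to latitude, and the depth value scales each unit ray. The assumption that depth stores distance along the ray is mine, not a documented Matrix-3D convention.

# Sketch: unproject an equirectangular depth map into a 3-D point cloud.
# Assumes depth stores distance along each viewing ray (an assumption,
# not a documented Matrix-3D convention).
import numpy as np

def equirect_depth_to_points(depth: np.ndarray) -> np.ndarray:
    h, w = depth.shape
    # Pixel centers -> spherical angles: longitude in [-pi, pi), latitude in [pi/2, -pi/2].
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Unit ray directions in a y-up camera frame.
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    return dirs * depth[..., None]            # (H, W, 3) points in camera space

points = equirect_depth_to_points(np.ones((512, 1024), dtype=np.float32))
print(points.shape)                           # (512, 1024, 3)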

Stage 2 — Panoramic Video

A second diffusion model (Wan-2.1 backbone + LoRA) turns the static panorama into an 81-frame panoramic video.
A custom mesh-render condition keeps camera motion smooth and geometry consistent.

Stage 3 — 3-D Reconstruction

Two choices:

| Pipeline | Speed (A800) | Visual Quality (PSNR) | File Size |
| --- | --- | --- | --- |
| Feed-Forward | ~10 s | 22.3 | small |
| Optimization | ~9 min | 27.6 | large |

Both output a standard .ply of 3-D Gaussians you can open in Blender, Unity, or any splat viewer.
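
A quick way to see what such a file contains is to list its per-Gaussian attributes with the third-party plyfile package (not something Matrix-3D ships). The "vertex" element name follows the common 3-D Gaussian Splatting convention and the filename matches the output path used later in this guide, so treat both as assumptions.

# Sketch: peek inside a 3-D Gaussian Splatting .ply (pip install plyfile).
# Property names vary by exporter, so we simply list whatever is stored.
from plyfile import PlyData

ply = PlyData.read("./output/my_garden/generated_3dgs_opt.ply")
gaussians = ply["vertex"]                      # standard 3DGS files use a "vertex" element
print(f"{gaussians.count} Gaussians")
print("attributes:", [p.name for p in gaussians.properties])
# Typical exports carry position, opacity, scale, rotation and spherical-harmonic color terms.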


Key Components in Plain English

Trajectory-Guided Diffusion (Stage 2)

  • Old way: feed the model raw point-cloud renders → moiré patterns & wrong occlusions.
  • New way: build a polygon mesh from depth, render it along the flight path, and use the RGB + mask as guidance (see the sketch after this list).
    Result: fewer ghost edges, sharper textures.
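
A rough illustration of the "mesh from depth" idea, not the repository's actual renderer: neighboring pixels of the unprojected depth panorama are stitched into triangles, and triangles that span a large depth jump are dropped (the 0.1 threshold is an illustrative value I chose). The resulting mesh can then be rasterized from any pose along the flight path to produce the RGB + mask condition.

# Sketch: stitch a grid of unprojected depth points into a triangle mesh
# (two triangles per pixel quad). Triangles spanning a large depth jump are
# dropped; the 0.1 threshold is illustrative, not the paper's rule.
import numpy as np

def depth_grid_to_mesh(points: np.ndarray, max_edge: float = 0.1):
    h, w, _ = points.shape
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1], idx[:-1, 1:]
    bl, br = idx[1:, :-1], idx[1:, 1:]
    faces = np.concatenate([np.stack([tl, bl, tr], -1).reshape(-1, 3),
                            np.stack([tr, bl, br], -1).reshape(-1, 3)])
    verts = points.reshape(-1, 3)
    # Keep only triangles whose longest edge is short (no depth discontinuity).
    edges = verts[faces] - verts[np.roll(faces, 1, axis=1)]
    keep = np.linalg.norm(edges, axis=-1).max(axis=1) < max_edge
    return verts, faces[keep]

verts, faces = depth_grid_to_mesh(np.random.rand(64, 128, 3).astype(np.float32))
print(verts.shape, faces.shape)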

Dual Reconstruction Pipelines (Stage 3)

  1. Optimization Pipeline

    • Take every 5th frame from the video
    • Crop each sampled frame into 12 perspective views (see the sketch after this list)
    • Run 3-D Gaussian Splatting with L1 loss
    • Produces the highest fidelity
  2. Feed-Forward Pipeline (Large Reconstruction Model)

    • Transformer reads video latents and camera embeddings
    • Directly predicts Gaussian attributes (color, position, scale, rotation, opacity)
    • Two-stage training: depth first, then the rest (prevents collapse)
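
To make the optimization pipeline's frame selection and view cropping concrete, the sketch below picks every 5th of the 81 frames and spreads 12 view directions evenly in yaw around each panorama. The 90° field of view and zero pitch are illustrative assumptions; the paper's exact crop parameters are not listed in this guide.

# Sketch: frame subsampling plus 12 evenly spaced yaw directions per panorama.
# The 90-degree FOV and zero pitch are assumptions, not documented values.
NUM_FRAMES, FRAME_STRIDE, VIEWS_PER_FRAME, FOV_DEG = 81, 5, 12, 90.0

crops = []
for frame_idx in range(0, NUM_FRAMES, FRAME_STRIDE):      # every 5th frame
    for view in range(VIEWS_PER_FRAME):                   # 12 views, 30 degrees apart
        crops.append({"frame": frame_idx,
                      "yaw_deg": view * 360.0 / VIEWS_PER_FRAME,
                      "pitch_deg": 0.0,
                      "fov_deg": FOV_DEG})

print(len(crops), "perspective crops")                    # 17 frames x 12 views = 204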

Matrix-Pano Dataset

| Stat | Value |
| --- | --- |
| Synthetic videos | 116 759 |
| Resolution | 1024 × 2048 |
| Depth & pose labels | included |
| Scene variety | indoor, outdoor, day, night, rain, snow |

Created in Unreal Engine 5 using a custom multi-camera offline renderer that locks exposure, disables screen-space effects, and guarantees pixel-perfect seams.


Step-by-Step Local Installation

Tested on Ubuntu 20.04 + CUDA 12.4; any recent Linux distro or WSL2 should also work.

1. Clone & Enter

git clone --recursive https://github.com/SkyworkAI/Matrix-3D.git
cd Matrix-3D

2. Create Environment

conda create -n matrix3d python=3.10
conda activate matrix3d

3. Install PyTorch (GPU)

pip3 install torch==2.7.1 torchvision==0.22.1

4. One-Line Dependency Setup

chmod +x install.sh
./install.sh

5. Download Pre-trained Weights

python code/download_checkpoints.py

The script pulls ~20 GB of models into ./checkpoints/.
If your network is unstable, use a download manager and move the files manually.
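
As a quick sanity check that the download finished, you can total the size of everything under ./checkpoints/. This is a convenience sketch, not a script that ships with the repository.

# Sketch: total the size of everything under ./checkpoints/ after the download.
from pathlib import Path

total = sum(f.stat().st_size for f in Path("./checkpoints").rglob("*") if f.is_file())
print(f"{total / 1e9:.1f} GB downloaded")   # expect roughly 20 GB when complete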


Running Your First Scene

Option A — One-Line Magic

./generate.sh
  • Prompts you for text or image
  • Uses the default 720p settings
  • Leaves results in ./output/demo/

Option B — Step-by-Step for Control Freaks

Step 1 Text → Panorama

python code/panoramic_image_generation.py \
  --mode=t2p \
  --prompt="a quiet Japanese garden in autumn, red maples, wooden bridge, koi pond" \
  --output_path="./output/my_garden"

Step 2 Panorama → Video

VISIBLE_GPU_NUM=1
torchrun --nproc_per_node=$VISIBLE_GPU_NUM \
  code/panoramic_image_to_video.py \
  --inout_dir="./output/my_garden" \
  --resolution=720

Runtime: ~1 hour on an A800 80 GB.
Multi-GPU: set VISIBLE_GPU_NUM=4 and runtime drops to ~20 min.

Step 3-1 Video → High-Quality 3-D

python code/panoramic_video_to_3DScene.py \
  --inout_dir="./output/my_garden" \
  --resolution=720

Output: generated_3dgs_opt.ply (optimized, large, beautiful).

Step 3-2 Video → Fast 3-D

python code/panoramic_video_480p_to_3DScene_lrm.py \
  --video_path="./output/my_garden/pano_video.mp4" \
  --pose_path="./output/my_garden/camera.json" \
  --out_path="./output/my_garden_fast"

Output: scene.ply plus 12 test renders.


Understanding the Output Files

| File | What You’ll See | How to Use |
| --- | --- | --- |
| pano_img.jpg | 360° equirectangular image | View in any panorama viewer |
| pano_video.mp4 | 81-frame 360° video | Play in VLC, YouTube 360, or a VR headset |
| camera.json | List of 4×4 world-to-camera matrices | Feed to downstream apps (see the sketch below) |
| *.ply | 3-D Gaussians | Drag into SuperSplat or Blender |
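
Before feeding camera.json to another tool, it is worth confirming its shape. The sketch below assumes the file is a plain JSON array of 4 × 4 nested lists, which is how the table describes it; adapt the loading line if the poses turn out to be stored under a key.

# Sketch: load camera.json and confirm each pose is a 4x4 world-to-camera matrix.
# Assumes a plain JSON array of 4x4 nested lists; adapt if the poses are stored
# under a key instead.
import json
import numpy as np

with open("./output/my_garden/camera.json") as f:
    poses = np.array(json.load(f), dtype=np.float64)

assert poses.ndim == 3 and poses.shape[1:] == (4, 4), poses.shape
print(f"{len(poses)} camera poses; first translation column: {poses[0][:3, 3]}")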

Opening the 3-D Scene

Blender 4.0+

  1. Install community add-on “Blender Gaussian Splatting”.
  2. File → Import → .ply → Done.

Unity 2022.3
Use the “Gaussian Splatting Rendering” package from GitHub.


Common Questions (FAQ)

Q1: How much VRAM do I need?

  • 480p pipeline: an 8 GB GPU is enough
  • 720p pipeline: 16 GB+ recommended
  • Feed-forward reconstruction: 2 GB is enough

Q2: Can I give it my own camera path?

Yes. Create a JSON array of 4×4 matrices (OpenCV convention).
Reference: ./data/test_cameras/test_cam_front.json
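
One way to produce such a file is sketched below: a straight "dolly forward" path of 81 world-to-camera matrices in the OpenCV convention (x right, y down, z forward). The 0.05-unit step size and the flat JSON-array layout are assumptions; compare your output against the shipped ./data/test_cameras/test_cam_front.json before using it.

# Sketch: write a simple forward-moving camera path as a JSON array of 4x4
# world-to-camera matrices (OpenCV convention: +z is the viewing direction).
# The 0.05 step size and 81 poses are illustrative; match them to your scene.
import json
import numpy as np

poses = []
for i in range(81):                          # one pose per video frame
    world_to_cam = np.eye(4)
    world_to_cam[2, 3] = -0.05 * i           # camera advances along +z in world space
    poses.append(world_to_cam.tolist())

with open("my_path.json", "w") as f:
    json.dump(poses, f)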

Q3: Why does generation take so long?

The video diffusion model runs about 50 denoising steps over 81 frames with a 14 B-parameter backbone.
Batching on 4× A800 cuts wall-time to ~20 min.

Q4: Is the output commercially usable?

The training data is fully synthetic, with no real-world faces or private places.
Still, check your local policies on AI-generated content to be sure.

Q5: Can I edit the 3-D scene afterwards?

Currently the Gaussians are static.
Future releases will expose semantic editing commands like “replace the roof texture.”


Current Limits & Roadmap

| Limit Today | Planned Fix |
| --- | --- |
| Inference still takes minutes | Model distillation + TensorRT |
| Semi-transparent objects (trees, fences) show depth jumps | Improved monocular depth network |
| Cannot see rooms behind walls | “Unseen area completion” using learned priors |
| No interactive editing | Add text-driven scene editing |

Cheat-Sheet & Quick Reference

Hardware Quick-Glance

| Task | Min VRAM | Typical Time (A800-80G) |
| --- | --- | --- |
| 480p image → panorama | 6 GB | 15 s |
| 720p panorama → video | 16 GB | 60 min |
| Feed-forward 3-D | 2 GB | 10 s |
| Optimization 3-D | 10 GB | 9 min |

CLI Flags You’ll Use Daily

| Flag | Purpose | Example |
| --- | --- | --- |
| --mode | generation mode: t2p (text → panorama) or i2p (image → panorama) | --mode=t2p |
| --resolution | output resolution, 480 or 720 | --resolution=720 |
| --json_path | custom camera trajectory | --json_path=./my_path.json |
| VISIBLE_GPU_NUM | number of GPUs for torchrun (environment variable, not a flag) | VISIBLE_GPU_NUM=4 |

Directory Layout

Matrix-3D/
 ├─ code/                 # all runnable scripts
 ├─ checkpoints/          # auto-downloaded weights
 ├─ output/               # all your results
 │   ├─ my_scene/
 │   │   ├─ pano_img.jpg
 │   │   ├─ pano_video.mp4
 │   │   ├─ camera.json
 │   │   ├─ generated_3dgs_opt.ply
 │   │   └─ scene.ply (fast)
 └─ data/                 # sample trajectories & images

Citation

If you use Matrix-3D in your work, please cite:

@article{yang2025matrix3d,
  title={Matrix-3D: Omnidirectional Explorable 3D World Generation},
  author={Zhongqi Yang and Wenhang Ge and Yuqi Li and Jiaqi Chen and Haoyuan Li and Mengyin An and Fei Kang and Hua Xue and Baixin Xu and Yuyang Yin and Eric Li and Yang Liu and Yikai Wang and Hao-Xiang Guo and Yahui Zhou},
  journal={arXiv preprint arXiv:2508.08086},
  year={2025}
}

Happy exploring—your next virtual world is only one sentence away.