TRELLIS.2 Deep Dive: How a 4B-Parameter Model is Revolutionizing Image-to-3D Generation

Have you ever wondered how quickly a simple 2D image can be transformed into a detailed, photorealistic 3D model with full materials? The latest answer from Microsoft Research is astonishing: as fast as 3 seconds. Let’s explore the core technology behind this breakthrough.

Executive Summary

TRELLIS.2 is a large-scale 3D generative model with 4 billion parameters. Its core innovation is a novel “field-free” sparse voxel structure called O-Voxel. This technology overcomes the limitations of traditional iso-surface fields (like SDF) in handling open surfaces and non-manifold geometry. It can generate high-resolution 3D assets with arbitrary complex topology, sharp features, and complete PBR materials (including transparency) directly from a single image, achieving remarkably fast inference speeds.


Part 1: A Paradigm Shift in 3D Generation – Why TRELLIS.2?

Creating high-quality 3D content has long been a central challenge in computer graphics and AI. Traditional methods rely either on laborious manual modeling or are constrained by the topological and physical accuracy of generative models.

Most existing generative 3D models are built on implicit field representations: iso-surface fields such as Signed Distance Fields (SDF), or radiance fields such as NeRF. These methods work like using a “mold” to shape an object: the surface is recovered indirectly, typically by extracting an iso-surface from the field. When they encounter open surfaces (like a leaf or flowing cloth) or non-manifold geometry (like two cubes sharing an edge), they often struggle, leading to information loss or distorted outputs.

TRELLIS.2 was created to solve these fundamental pain points. It moves away from traditional “field” dependencies and introduces a native, compact, structured latent representation—O-Voxel—enabling a “one-step” process from image to high-quality 3D asset.

Core Value at a Glance

  • Balance of Quality and Speed: On a high-end NVIDIA H100 GPU, generating a textured asset at 512³ resolution takes about 3 seconds, at 1024³ about 17 seconds, and at 1536³ about 60 seconds.
  • Unlimited Topology: Natively supports open surfaces, non-manifold geometry, and internal enclosed structures without any lossy conversion.
  • Full Material Output: Goes beyond base color to generate complete PBR channels: Base Color, Roughness, Metallic, and Opacity, enabling transparent and translucent materials.
  • Streamlined Pipeline: The entire process from image to render-ready 3D mesh is completely free of optimization-based fine-tuning or iterative rendering, achieving true end-to-end generation.

Part 2: Core Technology Dissected – O-Voxel, More Than Just Voxels

Understanding TRELLIS.2 hinges on understanding its heart: O-Voxel. Think of it as “intelligent LEGO” – it not only defines small cubes (voxels) in 3D space but also precisely binds surface attributes and spatial relationships to each one.

Three Breakthrough Design Principles of O-Voxel

  1. The End of Field-Dependence
    O-Voxel is a “field-free” representation. Instead of indirectly describing surfaces through implicit functions (like SDF), it directly and explicitly encodes geometry and appearance within a sparse voxel grid. This fundamentally avoids the topological constraints and precision loss that can occur during iso-surface extraction.

  2. Unified Carrier for Geometry and Appearance
    Each activated O-Voxel contains not only its spatial coordinates but also the complete PBR material attributes for the surface at that location. This means geometry and texture information are modeled synchronously and in alignment during generation, ensuring perfect texture-to-surface mapping in the final output (see the schematic sketch after this list).

  3. High-Efficiency Bidirectional Conversion

    • Encoding (Mesh → O-Voxel): Takes less than 10 seconds on a single CPU core to convert a textured mesh into the O-Voxel representation.
    • Decoding (O-Voxel → Mesh): The reverse conversion back to a textured mesh takes less than 100 milliseconds with CUDA acceleration.
      This near-instant conversion capability makes O-Voxel an extremely efficient intermediary representation, perfectly bridging neural network learning and final 3D asset output.
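
To make the “unified carrier” idea concrete, here is a schematic Python sketch of what a single activated voxel conceptually carries. This is illustrative only; it is not the actual data layout used by the o_voxel library:

from dataclasses import dataclass

@dataclass
class OVoxelRecord:
    """Schematic stand-in for one activated O-Voxel (not the real o_voxel layout)."""
    coord: tuple[int, int, int]             # position in the sparse grid
    geometry: tuple[float, ...]             # explicit local surface encoding; no implicit field
    base_color: tuple[float, float, float]  # PBR albedo
    roughness: float
    metallic: float
    opacity: float                          # enables transparent/translucent materials

# An asset is a sparse collection of such records: only surface voxels are stored,
# so geometry and appearance stay aligned by construction.
asset = [OVoxelRecord((12, 40, 7), (0.2, -0.1, 0.05), (0.8, 0.6, 0.4), 0.5, 0.0, 1.0)]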

A Direct Comparison with Traditional Methods

Feature             | Traditional Iso-surface Methods (SDF, NeRF)             | TRELLIS.2 O-Voxel Method
--------------------|---------------------------------------------------------|-----------------------------------------------------------
Topology handling   | Limited; struggles with open/non-manifold structures    | Unlimited; natively supports any topology
Appearance modeling | Often handled separately or simplified                  | Unified modeling of full PBR materials
Generation pipeline | Often requires time-consuming optimization/fine-tuning  | End-to-end; optimization- and render-free
Conversion speed    | Iso-surface extraction can be slow                      | Near-instant bidirectional conversion (<100 ms with CUDA)
Latent space        | Dense or unstructured                                   | Compact and structured (16× downsampling)

Part 3: Model Architecture & Performance – Where Scale Meets Ingenuity

TRELLIS.2 is a behemoth with 4 billion parameters, but its design is meticulously crafted for efficiency.

Compact Latent Space Design

The model employs a Sparse 3D VAE with a key innovation: a 16× spatial downsampling rate. Consider this: a 1024³ resolution 3D asset is encoded into a latent space represented by only about 9.6K latent tokens, with negligible perceptual quality loss. This extremely high compression ratio is fundamental to the model’s fast inference and efficient processing.
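
As a quick sanity check on those numbers, the arithmetic can be sketched in a few lines of Python. Note that the ~9.6K token count is the reported average for a 1024³ asset; sparsity, not the downsampling factor alone, determines it:

# Why 16x downsampling plus sparsity yields so few tokens for a 1024^3 asset.
full_res = 1024                      # output voxel resolution
downsample = 16                      # VAE spatial downsampling factor
latent_res = full_res // downsample  # 64 sites per axis
dense_sites = latent_res ** 3        # 262,144 possible latent positions

reported_tokens = 9_600              # ~9.6K active tokens (reported average)
print(f"latent grid: {latent_res}^3 = {dense_sites:,} sites")
print(f"active tokens: ~{reported_tokens:,} ({reported_tokens / dense_sites:.1%} occupancy)")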

Performance Benchmarks: Not Just Fast, but Fast and High-Quality

To give you a concrete sense of its capability, here are the typical end-to-end inference times for different output resolutions (tested on an NVIDIA H100 GPU):

Output Resolution | Total Time  | Breakdown (Shape + Material)
------------------|-------------|------------------------------
512³              | ~3 seconds  | 2 s + 1 s
1024³             | ~17 seconds | 10 s + 7 s
1536³             | ~60 seconds | 35 s + 25 s

These times cover the entire process from inputting a single image to outputting a high-quality 3D asset—complete with geometry and PBR materials—ready for game engines or renderers. For 3D content creation, this is a revolutionary speed.

Feature Roadmap

The current and planned core functionalities of the TRELLIS.2-4B model are:

  • ✅ Image-to-3D Generation: Generate a PBR-textured 3D mesh from a single image. (Released)
  • 🔄 Shape-Conditioned Texture Generation: Generate textures for an input 3D mesh guided by a reference image. (Scheduled for release before 12/24/2025)
  • 🔄 Training Code Release: Provide the full model training code for the research community. (Scheduled for release before 12/31/2025)

Part 4: Hands-On Guide – Installation and First Run

By now, you’re likely eager to try it yourself. Follow these steps to set up the environment on your Linux system and run your first example.

System & Environment Prerequisites

  • Operating System: Currently officially supported only on Linux.
  • Hardware: Requires an NVIDIA GPU with at least 24GB of VRAM. The code has been verified on NVIDIA A100 and H100 GPUs.
  • Software Preparation:

    1. CUDA Toolkit: Recommended version 12.4, required for compiling certain packages.
    2. Conda: Recommended for managing Python environments.
    3. Python: Version 3.8 or higher is required.

Step-by-Step Installation

  1. Clone the Repository

    git clone -b main https://github.com/microsoft/TRELLIS.2.git --recursive
    cd TRELLIS.2
    
  2. Run the All-in-One Setup Script
    The project provides a powerful setup.sh script to handle most dependencies. The following command will create a new Conda environment named trellis2 and install all necessary components:

    . ./setup.sh --new-env --basic --flash-attn --nvdiffrast --nvdiffrec --cumesh --o-voxel --flexgemm
    

    Important Notes:

    • The script defaults to installing PyTorch 2.6.0 with CUDA 12.4. For a different CUDA version, remove the --new-env flag and configure the environment manually.
    • It uses flash-attn for accelerated attention computation by default. If your GPU (e.g., V100) doesn’t support it, you can later manually install xformers and switch the backend by setting the environment variable ATTN_BACKEND=xformers (see the sketch after these notes).
    • The installation may take a while; please be patient.
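
    A minimal sketch of that backend switch in Python. It assumes, carrying over behavior from the original TRELLIS codebase, that ATTN_BACKEND is read at import time, so set it before importing the pipeline:

    import os
    os.environ["ATTN_BACKEND"] = "xformers"  # fallback for GPUs without flash-attn support

    # Import only after the variable is set (assumed import-time behavior).
    from trellis2.pipelines import Trellis2ImageTo3DPipeline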

Running Your First Image-to-3D Generation

Once installed, you can run the project’s example script. Below is a simplified version of the core logic, demonstrating the simplicity of the workflow:

# 1. Load the pretrained pipeline
from trellis2.pipelines import Trellis2ImageTo3DPipeline
pipeline = Trellis2ImageTo3DPipeline.from_pretrained("microsoft/TRELLIS.2-4B")
pipeline.cuda()

# 2. Input an image and run inference
from PIL import Image
image = Image.open("your_input_image.jpg")
mesh = pipeline.run(image)[0]

# 3. Export to a universal GLB format
import o_voxel
glb = o_voxel.postprocess.to_glb(
    vertices=mesh.vertices,
    faces=mesh.faces,
    attr_volume=mesh.attrs,
    # ... other required parameters
)
glb.export("output.glb", extension_webp=True)

Upon execution, you will get:

  • output.glb: A 3D model file containing geometry and PBR textures, ready for import into software like Blender, Unity, or Unreal Engine. Note: The transparency (alpha) channel is not activated by default; you need to manually connect the texture’s alpha channel to the material’s opacity node within your 3D software (a Blender scripting sketch follows this list).
  • Optionally, you can generate a video preview showcasing the model rendered under environmental lighting.
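
If you would rather script the alpha hookup in Blender than click through the shader editor, below is a minimal bpy sketch. It assumes each imported material uses a Principled BSDF plus an Image Texture node; node setups vary by importer, so treat it as a starting point:

import bpy

# Wire each material's texture alpha output into its shader's opacity input.
for mat in bpy.data.materials:
    if not mat.use_nodes:
        continue
    nodes = mat.node_tree.nodes
    bsdf = next((n for n in nodes if n.type == "BSDF_PRINCIPLED"), None)
    tex = next((n for n in nodes if n.type == "TEX_IMAGE"), None)
    if bsdf is None or tex is None:
        continue
    mat.node_tree.links.new(tex.outputs["Alpha"], bsdf.inputs["Alpha"])
    # Pre-4.2 Eevee additionally needs an alpha-aware blend mode.
    if hasattr(mat, "blend_method"):
        mat.blend_method = "BLEND"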

Quick Web Demo Experience

For a quick trial via a browser interface, the project includes a simple web demo:

python app.py

After running, access the demo at the address shown in your terminal to upload an image and generate a 3D model online.

Part 5: The Supporting Ecosystem & Open Source Philosophy

TRELLIS.2 is not an isolated model. It’s built upon a suite of high-performance, specialized open-source libraries developed by the Microsoft Research team, forming a robust technology stack:

  • O-Voxel: The core representation library. Handles instant, lossless bidirectional conversion between textured meshes and the O-Voxel representation.
  • FlexGEMM: The efficient computation engine. Implements sparse convolution based on Triton, enabling rapid processing of sparse voxel structures.
  • CuMesh: CUDA-accelerated mesh utilities. Used for high-speed post-processing, remeshing, decimation, and UV unwrapping.

Licensing and Citation

  • Model & Code: Released under the MIT License, permitting wide academic and commercial use.
  • Key Dependencies: Please note that the rendering components, nvdiffrast and nvdiffrec, are governed by their respective open-source licenses.
  • Academic Citation: If you use TRELLIS.2 in your research, please cite our technical report:

    @article{xiang2025trellis2,
        title={Native and Compact Structured Latents for 3D Generation},
        author={Xiang, Jianfeng and Chen, Xiaoxue and Xu, Sicheng and Wang, Ruicheng and Lv, Zelong and Deng, Yu and Zhu, Hongyuan and Dong, Yue and Zhao, Hao and Yuan, Nicholas Jing and Yang, Jiaolong},
        journal={Tech report},
        year={2025}
    }
    

Part 6: Frequently Asked Questions & Current Limitations

To provide a complete understanding of the technology, we’ve compiled this key Q&A based on the official documentation.

FAQ

Q1: Can the results from TRELLIS.2 be used directly for 3D printing?
A: Post-processing is recommended. The raw generated mesh may occasionally contain small holes or topological discontinuities. While O-Voxel handles complex topology well, to obtain strictly watertight geometry (a requirement for 3D printing), it’s advised to use the provided post-processing scripts (e.g., hole-filling algorithms) to repair the mesh.
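
As one concrete option, the general-purpose trimesh library (not part of TRELLIS.2; the project’s own post-processing scripts may differ) can check watertightness and cap simple holes. Larger openings call for proper remeshing:

import trimesh

# Load the generated asset, collapsing the GLB scene into a single mesh.
mesh = trimesh.load("output.glb", force="mesh")
print("watertight before repair:", mesh.is_watertight)

mesh.fill_holes()                 # caps simple boundary holes only
trimesh.repair.fix_normals(mesh)  # make face winding/normals consistent
print("watertight after repair:", mesh.is_watertight)

mesh.export("output_watertight.glb")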

Q2: Are there specific requirements for the input image?
A: Official examples and demos show the model works well with common images of objects, creatures, etc. However, it’s important to note that TRELLIS.2-4B is a pre-trained foundation model that has not been aligned with human preferences (e.g., via RLHF). Its output style reflects the distribution of its training data and can be varied. Users may need to experiment with different inputs to achieve the most desired artistic result.

Q3: Does it support text-to-3D generation in addition to image-to-3D?
A: According to the currently released documentation, TRELLIS.2-4B is a model focused on image-to-3D generation. Its input is a single image; native support for text prompts as input is not mentioned.

Q4: What is the minimum GPU VRAM required to run it?
A: The official requirement is an NVIDIA GPU with at least 24GB of VRAM, verified on A100 and H100. This is the basic hardware threshold for running the 4B-parameter model and processing high-resolution 3D data.

Q5: What’s the difference between O-Voxel and traditional voxelization?
A: The key difference is “intelligence.” Traditional voxelization results in a binary or low-dimensional occupancy grid, losing most surface information. O-Voxel is a sparse, attribute-rich voxel where each activated voxel carries precise geometric location and complete surface material information. It’s a higher-order, structured representation tailor-made for generative AI.
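
A schematic comparison makes the difference tangible (illustrative layout, not the library’s actual one):

import numpy as np

res = 256

# Traditional voxelization: a dense boolean occupancy grid. 256^3 booleans
# is 16 MiB of mostly empty space, and "occupied" is all a cell can say.
occupancy = np.zeros((res, res, res), dtype=bool)
occupancy[100, 100, 100] = True

# O-Voxel-style sparse representation (schematic): store only the active
# surface voxels, each with coordinates plus full PBR attributes.
coords = np.array([[100, 100, 100]], dtype=np.int32)                  # N x 3 coordinates
attrs = np.array([[0.8, 0.6, 0.4, 0.5, 0.0, 1.0]], dtype=np.float32)  # RGB, roughness, metallic, opacity

print(f"dense grid: {occupancy.nbytes / 2**20:.0f} MiB for {int(occupancy.sum())} occupied voxel(s)")
print(f"sparse:     {coords.nbytes + attrs.nbytes} bytes, materials included")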


Conclusion and Outlook

The emergence of TRELLIS.2 marks a solid step towards the “generalization” and “practicality” of 3D content generation. By abandoning the constraints of traditional field representations and embracing the native, structured O-Voxel, it not only solves the problem of generating complex topology but also achieves an unprecedented balance between speed and quality.

From a technical perspective, its 16× downsampled compact latent space design, end-to-end optimization-free pipeline, and unified modeling of full PBR materials together constitute a new, efficient, and powerful paradigm for 3D generation. While current limitations exist—such as the need for mesh post-processing and the lack of preference alignment—its open-source roadmap (with plans to release texture generation and training code) demonstrates a commitment to advancing the entire field.

For game developers, visual artists, VR/AR content creators, and even e-commerce professionals, TRELLIS.2 offers a powerful tool to turn ideas into high-quality 3D prototypes almost instantly. As the technology iterates and the ecosystem matures, we may be standing at the threshold of a new era of “3D content democratization,” with TRELLIS.2 helping to light the way.