SVG-T2I: Generating Images Directly in the Semantic Space of Visual Foundation Models—No VAE Required
Have you ever wondered about the crucial “compression” step hidden behind the magic of AI image generation? Mainstream methods like Stable Diffusion rely on a component called a Variational Autoencoder (VAE). Its job is to compress a high-definition image into a low-dimensional, abstract latent space, where the diffusion model then learns and generates. However, the space learned by a VAE often sacrifices semantic structure for pixel reconstruction, resulting in a representation that is disconnected from human “understanding” of images.
So, can we discard the VAE and let the AI create directly within the feature spaces of Visual Foundation Models (VFMs) that are already adept at “understanding” images? This is precisely the question SVG-T2I answers. It abandons the complex “pixel → latent space → pixel” transformation for a more direct path: training an end-to-end text-to-image diffusion model directly within the semantic feature space of VFMs like DINOv3.
The Core Summary: What is SVG-T2I?
SVG-T2I is the first high-quality, large-scale text-to-image diffusion model trained directly in the feature space of a Visual Foundation Model (VFM). It discards the traditional Variational Autoencoder (VAE), instead using a frozen DINOv3 encoder to extract image semantic features and training a Diffusion Transformer (DiT) on this high-dimensional feature space. The model scores 0.75 on the GenEval benchmark and 85.78 on DPG-Bench, performing on par with advanced VAE-based models like SD3-Medium. This validates that the VFM semantic space itself can serve as a powerful generative latent manifold.
Why Challenge the VAE? The Dream of a Unified Representation
Before diving into the technical details, let’s understand the broader context of this research. Current multimodal AI systems suffer from a fundamental “fragmentation” problem:
- Understanding uses one system: e.g., CLIP or SigLIP for aligning image and text semantics.
- Generation uses another: e.g., a VAE to compress and reconstruct images for Stable Diffusion.
- Perception uses yet another: e.g., specialized models for geometric reasoning or object detection.
Each task requires a specifically designed or trained encoder. This is not only inefficient but also hinders knowledge transfer between models. Researchers dream of a unified visual representation—one set of features that can be used for understanding “what’s in the image,” generating “imagined scenes,” and analyzing “object relationships.”
SVG-T2I is a critical step toward this “generalist” vision. It proves that visual features from models like DINOv3—learned through large-scale self-supervised training—contain sufficiently rich and structured information. This information is not just useful for understanding tasks like classification and segmentation but can directly support high-quality image generation.
How Does SVG-T2I Work? A Deep Dive into the Architecture
The SVG-T2I pipeline is elegant and clear, built around two core components: the SVG Autoencoder and the SVG-T2I Diffusion Transformer.

Component 1: The SVG Autoencoder – The “Translator” Between Pixels and Semantics
This autoencoder isn’t a direct replacement for a traditional VAE but a novel design aimed at high-fidelity, bidirectional conversion between image pixels and VFM features.
It offers two configurations for different needs:
- Autoencoder-P (Pure): Uses a frozen DINOv3-ViT-S/16 encoder directly. An input image is transformed into a spatial grid of features (e.g., 64×64×384 for a 1024×1024 image). The decoder is a trainable convolutional neural network that reconstructs the pixel image from these feature maps.
- Autoencoder-R (Residual): Adds a trainable residual encoder (based on a Vision Transformer) alongside the DINOv3 encoder. This branch specializes in capturing high-frequency details and color information that DINOv3 might miss. Its output is concatenated with the main features before being sent to the shared decoder for potentially finer reconstruction.
Key Breakthrough: This autoencoder uses no quantization, KL-divergence loss, or Gaussian distribution assumptions. It learns not an abstract, distorted distribution, but a faithful mapping between the pixel domain and the VFM’s semantic feature domain.

The figure shows reconstruction from DINOv3 features at different resolutions. High-resolution inputs retain significantly more detail, demonstrating the strong representational power of DINOv3 features at higher scales.
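To make Component 1 concrete, here is a minimal PyTorch-style sketch of the Autoencoder-P idea: a frozen patch-feature encoder standing in for DINOv3-ViT-S/16, plus a small trainable convolutional decoder that maps the 64×64×384 feature grid back to pixels. The stand-in encoder, layer widths, and class name are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class SVGAutoencoderPSketch(nn.Module):
    """Minimal sketch of the Autoencoder-P idea: frozen VFM encoder + trainable conv decoder."""

    def __init__(self, feat_dim: int = 384, patch: int = 16):
        super().__init__()
        # Stand-in for the frozen DINOv3-ViT-S/16 encoder (patch size 16, feature dim 384).
        # In the real pipeline this is the pretrained DINOv3 model with gradients disabled.
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=patch, stride=patch)
        for p in self.encoder.parameters():
            p.requires_grad = False

        # Trainable decoder: upsample the feature grid back to pixels (x16 total upsampling).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 128, kernel_size=4, stride=4),  # x4
            nn.GELU(),
            nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),        # x8
            nn.GELU(),
            nn.ConvTranspose2d(64, 3, kernel_size=2, stride=2),          # x16 -> RGB image
        )

    @torch.no_grad()
    def encode(self, images: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, 384, H/16, W/16); e.g. a 1024x1024 image gives a 64x64x384 grid.
        return self.encoder(images)

    def decode(self, feats: torch.Tensor) -> torch.Tensor:
        return self.decoder(feats)

# Usage: only the decoder is optimized, with a simple pixel reconstruction loss.
model = SVGAutoencoderPSketch()
images = torch.rand(2, 3, 256, 256)
recon = model.decode(model.encode(images))
loss = nn.functional.mse_loss(recon, images)
```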
Component 2: The SVG-T2I DiT – The “Painter” in Semantic Space
This is the generative core. The researchers adopted a single-stream Unified Next-DiT architecture, similar to top open-source models like Z-Image and Lumina-Image-2.0.
- Input: The model’s input is no longer a VAE latent code but a combination of text prompt embeddings (extracted using the Gemma2-2B model) and the target image’s feature map. During training, the target feature map comes from the Autoencoder-P encoding of a real image.
- Processing: Text and image features are treated as a unified sequence fed into the Diffusion Transformer. The model is trained using a Flow Matching objective, learning a deterministic path from a simple noise distribution to the complex distribution of VFM features (a minimal training-step sketch follows this list).
- Output: The model predicts a denoised VFM feature map. During inference, starting from a random noisy feature map, iterative denoising yields the final feature map. This is then passed to the SVG Autoencoder’s decoder to produce the final high-resolution image.
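To ground the Flow Matching objective, here is a minimal PyTorch training-step sketch in the VFM feature space. The toy DiTSketch module, the linear interpolation schedule, and the token counts are simplifying assumptions made for illustration; the actual Unified Next-DiT, its conditioning scheme, and the Gemma2-2B text pipeline are considerably more elaborate.

```python
import torch
import torch.nn as nn

class DiTSketch(nn.Module):
    """Toy stand-in for the SVG-T2I DiT: predicts a velocity field over feature tokens."""

    def __init__(self, feat_dim: int = 384, text_dim: int = 2304):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, feat_dim)   # project prompt embeddings
        self.time_proj = nn.Linear(1, feat_dim)          # embed the flow timestep
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=6, batch_first=True),
            num_layers=2,
        )
        self.out = nn.Linear(feat_dim, feat_dim)

    def forward(self, x_t, text_emb, t):
        # x_t: (B, N, D) noisy feature tokens; text_emb: (B, T, text_dim); t: (B, 1)
        tokens = torch.cat([self.text_proj(text_emb), x_t + self.time_proj(t)[:, None, :]], dim=1)
        h = self.backbone(tokens)                         # single unified text+image sequence
        return self.out(h[:, text_emb.shape[1]:, :])      # keep only the image-token positions

def flow_matching_step(model, feats, text_emb):
    """One rectified-flow style step: regress the velocity (feats - noise) at a random time."""
    noise = torch.randn_like(feats)
    t = torch.rand(feats.shape[0], 1)                          # timestep in [0, 1)
    x_t = (1 - t)[:, :, None] * noise + t[:, :, None] * feats  # linear noise-to-data path
    pred_velocity = model(x_t, text_emb, t)
    return nn.functional.mse_loss(pred_velocity, feats - noise)

# Usage with dummy data: a small 16x16 token grid for the demo; the real model works on the
# full 64x64 grid of 384-dim DINOv3 features produced by the frozen Autoencoder-P encoder.
model = DiTSketch()
feats = torch.randn(2, 16 * 16, 384)     # stand-in for DINOv3 feature tokens
text_emb = torch.randn(2, 16, 2304)      # stand-in for Gemma2-2B prompt embeddings (dim assumed)
print(flow_matching_step(model, feats, text_emb).item())
```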
Scaling Up: Evolving from Low to High Resolution
SVG-T2I’s success relies on a carefully designed, multi-stage progressive training strategy:
- Autoencoder Pre-training: First, Autoencoder-P and -R are trained on datasets like ImageNet to master the “feature-to-pixel” translation.
- DiT Multi-Stage Training (summarized as a config sketch after this list):
  - Stage 1 (Low-Resolution Alignment): Training at 256×256 resolution on 60M samples establishes basic text-image correspondence.
  - Stage 2 (Mid-Resolution Refinement): Continuing at 512×512 resolution helps the model learn more complex structures and composition.
  - Stage 3 (High-Resolution Detailing): Training at 1024×1024 resolution on 15M high-quality samples enables mastery of fine-grained details.
  - Stage 4 (High-Quality Tuning): A final fine-tuning stage on 1M high-aesthetic samples further enhances visual appeal.
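For reference, the schedule above can be written as a small configuration sketch. The structure and field names are assumptions made for readability; resolutions and sample counts are copied from the stages listed above, with unstated values left as None.

```python
# Progressive DiT training schedule as described above (illustrative structure only).
DIT_TRAINING_STAGES = [
    {"stage": 1, "name": "low-resolution alignment",  "resolution": 256,  "samples": 60_000_000},
    {"stage": 2, "name": "mid-resolution refinement", "resolution": 512,  "samples": None},       # count not stated
    {"stage": 3, "name": "high-resolution detailing", "resolution": 1024, "samples": 15_000_000},
    {"stage": 4, "name": "high-quality tuning",       "resolution": None, "samples": 1_000_000},  # resolution not stated
]

for cfg in DIT_TRAINING_STAGES:
    print(f"Stage {cfg['stage']}: {cfg['name']} @ {cfg['resolution']} px, samples: {cfg['samples']}")
```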

Visual quality improves steadily across stages, with generated images becoming progressively more detailed and aesthetically refined.
How Does It Perform? Let the Data Speak
A compelling theory needs experimental validation. SVG-T2I delivers convincing results on several authoritative benchmarks.
Quantitative Evaluation: Benchmarking Against the State-of-the-Art
| Model Category | Representative Model | GenEval (Overall) | DPG-Bench (Overall) | Notes |
|---|---|---|---|---|
| VAE-based Diffusion | SDXL | 0.55 | 74.65 | Widely adopted industry standard |
| VAE-based Diffusion | DALL-E 3 | 0.67 | 83.50 | OpenAI’s flagship model |
| VAE-based Diffusion | SD3-Medium | 0.74 | 84.08 | Latest model from Stability AI |
| VAE-based Diffusion | FLUX.1-dev | 0.82 | 83.84 | Another top-tier model |
| Autoregressive | Janus-Pro-7B | 0.80 | 84.19 | Transformer-based autoregressive model |
| VFM-space Diffusion | SVG-T2I (Ours) | 0.75 | 85.78 | First large-scale T2I model trained in VFM space |
Data Interpretation:
- GenEval (0.75): This score measures how well a model follows textual instructions regarding objects, attributes, counting, and spatial relations. SVG-T2I’s score matches SD3-Medium (0.74) and clearly surpasses SDXL (0.55) and DALL-E 3 (0.67).
- DPG-Bench (85.78): This benchmark provides a more comprehensive evaluation of global, entity, attribute, and relation generation quality. SVG-T2I’s score of 85.78 places it on par with current state-of-the-art models such as HiDream-I1-Full (85.89) and FLUX.1.
This data provides strong evidence: Training diffusion models directly on VFM semantic feature spaces can achieve generative quality comparable to, and sometimes exceeding, the most advanced VAE-based models.
Qualitative Showcase: Seeing is Believing

SVG-T2I generates high-quality images across multiple resolutions (720p, 1080p) and diverse themes (still life, portrait, landscape, animals), demonstrating powerful generalization and precise text prompt understanding.
Advantages, Limitations, and the Future
Why is SVG-T2I a Step Forward?
- Foundation for Unified Representation: This is the greatest potential. The same features extracted by a DINOv3 encoder can now be used both for SVG-T2I image generation and for tasks like image classification, segmentation, or retrieval. This paves the way for building true “generalist” visual AI systems.
- Rich Semantic Priors: DINOv3 features are learned via self-supervision on hundreds of millions of images, naturally encoding semantic knowledge about objects, scenes, and textures. Generating in this space makes it easier for the model to learn visually coherent content.
- Open-Source and Reproducible: The team has fully open-sourced the code, training pipelines, evaluation scripts, and pre-trained model weights. This not only validates their work but also provides a solid foundation for subsequent community research.
Current Challenges and Limitations
Every breakthrough brings new challenges, and SVG-T2I is no exception:
- Resolution Sensitivity: The study found that VAE features exhibit scale invariance, meaning different resolutions of the same image yield very similar VAE features. In contrast, VFM features like those from DINOv3 are sensitive to resolution changes: high-resolution features focus on details, while low-resolution ones capture global semantics. This poses a stability challenge for generative models that require multi-resolution training (see the sketch after this list).

  PCA visualization shows that VAE features (right) are highly consistent across resolutions (cosine similarity ~1.0), while DINO features (left, middle) vary more significantly.

- Difficulty with Fine Structures: Like many generative models, SVG-T2I can still struggle with extremely fine facial features, complex hand anatomy, and reliable text rendering. Overcoming this requires more specialized data and computational resources.
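To see what this scale sensitivity means in practice, the sketch below compares globally pooled patch features of the same image encoded at two resolutions and reports their cosine similarity. The stand-in patch encoder is a placeholder assumption; with a real pretrained VFM such as DINOv3-ViT-S/16 you would expect noticeably lower cross-resolution similarity than with a VAE encoder, which stays close to ~1.0.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pooled_patch_features(encoder: nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Encode an image into patch features and average-pool them into one global vector."""
    with torch.no_grad():
        feats = encoder(image)                 # (1, D, H/16, W/16) patch-feature grid
    return feats.mean(dim=(2, 3)).squeeze(0)   # (D,) pooled descriptor

def cross_resolution_similarity(encoder: nn.Module, image: torch.Tensor, low: int, high: int) -> float:
    """Cosine similarity between pooled features of the same image at two resolutions."""
    img_low = F.interpolate(image, size=(low, low), mode="bilinear", align_corners=False)
    img_high = F.interpolate(image, size=(high, high), mode="bilinear", align_corners=False)
    f_low = pooled_patch_features(encoder, img_low)
    f_high = pooled_patch_features(encoder, img_high)
    return F.cosine_similarity(f_low, f_high, dim=0).item()

# Placeholder patch encoder; swap in a real pretrained VFM to reproduce the observation
# that its features drift across resolutions, unlike scale-invariant VAE latents.
encoder = nn.Conv2d(3, 384, kernel_size=16, stride=16)
image = torch.rand(1, 3, 1024, 1024)
print(cross_resolution_similarity(encoder, image, low=256, high=1024))
```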

Getting Started with SVG-T2I
For researchers and developers eager to experiment or build upon this work, here is a clear step-by-step guide.
Environment Setup
```bash
# 1. Create and activate a conda environment
conda create -n svg_t2i python=3.10 -y
conda activate svg_t2i

# 2. Install dependencies
pip install -r requirements.txt

# 3. Obtain DINOv3 weights (clone the official repo and configure paths)
git clone https://github.com/facebookresearch/dinov3.git
```
Download Pre-trained Models
All models are open-sourced on the Hugging Face Hub. Download all checkpoints (autoencoder and diffusion model) from https://huggingface.co/KlingTeam/SVG-T2I.
After downloading, place the pre-trained/ folder under the svg_t2i/ directory.
Image Generation Inference
Generating images with a pre-trained model is straightforward:
```bash
cd svg_t2i
bash scripts/sample.sh
```
You can modify prompts, sampling steps, and other parameters within the sample.sh script. Results will be saved in the svg_t2i/samples/ directory.
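For intuition about what sampling does conceptually, here is a rough Euler-style flow-matching sampler over VFM feature tokens, written to pair with the toy DiTSketch from the architecture section. It is a hypothetical illustration of the "noise to feature map to decoder" loop, not the code inside scripts/sample.sh.

```python
import torch

@torch.no_grad()
def sample_features(model, text_emb, num_tokens=64 * 64, feat_dim=384, steps=30):
    """Conceptual Euler sampler: integrate the learned velocity field from noise to features."""
    x = torch.randn(text_emb.shape[0], num_tokens, feat_dim)  # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((text_emb.shape[0], 1), i * dt)
        velocity = model(x, text_emb, t)   # DiT predicts the flow direction at time t
        x = x + dt * velocity              # Euler step toward the VFM feature manifold
    return x  # denoised feature map, then handed to the SVG autoencoder decoder for pixels
```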
Training from Scratch
Training is split into two independent stages:
- Train the SVG Autoencoder:

  ```bash
  cd autoencoder
  bash run_train.sh 1 configs/pure/svg_autoencoder_P_dd_M_IN_stage1_bs64_256_gpu1_forTest
  ```

- Train the SVG-T2I DiT:

  ```bash
  cd svg_t2i
  # Single-GPU example
  bash scripts/run_train_1gpus_forTest.sh 0
  ```
Conclusion and Future Outlook
SVG-T2I is more than just a new text-to-image model; it represents a feasibility proof for a paradigm shift: generative AI does not need to be confined to semantically poor latent spaces (like VAE space) designed specifically for generation. It can fully embrace the semantically rich visual representation spaces built for understanding.
It opens a door towards “visual general intelligence”—a world where the same visual representation is used for understanding, reasoning, perceiving, and creating. Challenges remain, such as resolving the scale sensitivity of VFM features. However, SVG-T2I has undoubtedly pointed towards a promising direction and generously provided all the necessary tools. The next chapter of this story will be written by the open-source community.
Appendix: Frequently Asked Questions (FAQ)
Q1: What is the core difference between SVG-T2I and Stable Diffusion?
A1: The most fundamental difference lies in the latent space. Stable Diffusion uses a low-dimensional space created by a VAE, optimized for pixel reconstruction. SVG-T2I operates directly in the high-dimensional, semantically rich feature space of a Visual Foundation Model like DINOv3. The former is “generation-specific”; the latter is “shared for understanding and generation.”
Q2: Which should I use, Autoencoder-P or Autoencoder-R from the paper?
A2: For most generation tasks, the paper recommends Autoencoder-P. At high resolutions, DINOv3 features themselves preserve sufficient detail, and this approach is simpler and more general. Autoencoder-R is intended for scenarios demanding extreme reconstruction fidelity, as its residual branch compensates for high-frequency information.
Q3: How much data was SVG-T2I trained on?
A3: The DiT model’s multi-stage training saw approximately: 140M images at low-res, 70M at mid-res, 34M at high-res, and 30M during high-quality tuning. The autoencoder was pre-trained on several million images.
Q4: Can this model be used for commercial purposes?
A4: The project is released under the MIT open-source license, with code and weights fully available. This means it can be used for both research and commercial purposes, provided the license terms are followed. Developers must ensure their application complies with relevant laws and ethical guidelines.
Q5: Beyond text-to-image, what else can the SVG framework do?
A5: The original SVG paper (“Latent Diffusion Model without Variational Autoencoder”) demonstrated capabilities in class-conditional image generation (e.g., generating images of specified ImageNet classes) and high-quality image reconstruction. SVG-T2I extends this to text-to-image. In theory, the paradigm of “VFM features as a unified representation” has the potential to expand to image-to-image, editing, video generation, and other visual generative tasks.

