SVG-T2I: Generate Images in DINOv3’s Semantic Space Without a VAE

6 days ago 高效码农

Have you ever wondered about the crucial “compression” step hidden behind the magic of AI image generation? Mainstream methods like Stable Diffusion rely on a component called a Variational Autoencoder (VAE). Its job is to compress a high-resolution image into a low-dimensional, abstract latent space, where the diffusion model then learns and generates.

However, the space a VAE learns often sacrifices semantic structure for pixel reconstruction, resulting in a representation that is disconnected from how humans “understand” images. So, can we discard the VAE and …
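To make the “compression” step concrete, here is a minimal sketch (not SVG-T2I code) of the shape arithmetic behind a Stable-Diffusion-style VAE, which downsamples spatially by 8x and maps 3 RGB channels to 4 latent channels; the function names are illustrative, not from any library:

```python
def latent_shape(h, w, downsample=8, latent_channels=4):
    """Latent tensor shape a VAE-style encoder would produce for an HxW image."""
    return (latent_channels, h // downsample, w // downsample)

def compression_ratio(h, w, channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the latent holds than the raw pixels."""
    c, lh, lw = latent_shape(h, w, downsample, latent_channels)
    return (h * w * channels) / (c * lh * lw)

print(latent_shape(512, 512))        # (4, 64, 64)
print(compression_ratio(512, 512))   # 48.0
```

The diffusion model only ever sees that small 4x64x64 tensor, which is why the properties of the latent space matter so much for generation quality.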