Ovis-Image: A 7-Billion-Parameter Text-to-Image Model That Punches at 20-Billion Scale While Running on One GPU

"What makes a compact 7B model able to render crisp, bilingual, layout-heavy text previously dominated by 20B+ giants, and how can you deploy it today?"

TL;DR (the 30-second take)

- Architecture: a 2B multimodal Ovis 2.5 encoder frozen for alignment, a 7B MMDiT diffusion decoder trained from scratch, and the FLUX.1-schnell VAE kept frozen; roughly 10B parameters in total, fitting in under 24 GB of VRAM.
- Training: a four-stage pipeline (pre-train → instruction fine-tune → DPO preference → GRPO text-specialist) steadily improves word accuracy from 87% to 92%.
- Benchmarks: leads CVTG-2K English …
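The "<24 GB VRAM" claim follows from simple arithmetic on the component sizes listed above. Here is a back-of-envelope sketch; the VAE parameter count (~0.1B) is an assumption for illustration, since the article only names the FLUX.1-schnell VAE without giving its size, and the article rounds the 2B + 7B sum up to "10B total":

```python
# Rough weight-memory estimate for Ovis-Image's three components in bf16.
# Encoder/decoder sizes come from the article's TL;DR; the VAE size is
# an assumed placeholder (~0.1B params) for illustration only.
BYTES_PER_PARAM_BF16 = 2

components = {
    "ovis_2.5_encoder": 2e9,    # frozen multimodal encoder
    "mmdit_decoder": 7e9,       # diffusion decoder trained from scratch
    "flux_vae": 0.1e9,          # frozen FLUX.1-schnell VAE (size assumed)
}

total_params = sum(components.values())
weight_gb = total_params * BYTES_PER_PARAM_BF16 / 1e9

print(f"~{total_params / 1e9:.1f}B params, ~{weight_gb:.1f} GB of weights in bf16")
# Weights alone land well under 24 GB, leaving headroom for activations
# and the KV/attention workspace during sampling.
```

Actual peak usage will be higher than the weight total because of activations and sampler state, but the margin below 24 GB is consistent with the single-GPU claim.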
Author / Team / Institution

- Authors: Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Ziyang Wang, Runchuan Ye, Weiyue Sun, Jiancheng Gui, Kehan Li, Zhiyong Wu, Zhiyong Liu.
- Team/Institution: Developed by ModelBest and THUHCSI under the OpenBMB project.
- Role: Researchers and developers of text-to-speech systems.
- Authority Backing: The model is open-sourced under the Apache-2.0 license, with acknowledgments to foundational works such as DiTAR, MiniCPM-4, CosyVoice, and DAC. No external peer reviews or third-party reports are provided in the input files.

Abstract

VoxCPM represents a shift in text-to-speech (TTS) technology: it eliminates discrete tokenization and operates directly in continuous speech space. …