Ovis-Image: A 7-Billion-Parameter Text-to-Image Model That Punches at 20-Billion Scale—While Running on One GPU
What makes a compact 7 B model able to render crisp, bilingual, layout-heavy text previously dominated by 20 B+ giants, and how can you deploy it today?
TL;DR (the 30-second take)
- Architecture: a 2 B multimodal Ovis 2.5 encoder frozen for alignment, a 7 B MMDiT diffusion decoder trained from scratch, and the FLUX.1-schnell VAE kept frozen; roughly 10 B parameters in total, under 24 GB VRAM.
- Training: a four-stage pipeline (pre-train → instruction fine-tune → DPO preference → GRPO text specialist) steadily improves word accuracy from 87 % to 92 %.
- Benchmarks: leads CVTG-2K English word accuracy (92 %), tops LongText-Bench Chinese (96.4 %), matches or outruns 20 B-class open models, and trails only closed APIs such as GPT-4o.
- Delivery: ready-to-run diffusers pipeline, ComfyUI node, and native PyTorch script; a 1024×1024 image takes ~14 s on a single H100.
Table of Contents
- Why “text-in-image” is still a headache
- Inside Ovis-Image: strong backbone, slim waistline
- Four-stage training recipe: from pixels to spell-check
- Numbers do the talking: benchmark tables
- Hands-on deployment: three tested ways to run inference
- Prompt engineering in the wild: posters, mock-ups, memes
- Author’s reflection: what 7 B can and can’t do
- Action checklist / one-page overview
- Quick FAQ
1. Why “text-in-image” is still a headache
Core question: why do most diffusion models either ignore text or turn it into unreadable glyphs?
- Text strokes are high-frequency signals; one mis-predicted noise step and whole words collapse.
- Bilingual scenes add different baselines, character densities, and aspect ratios, which are hard to learn without massive aligned data.
- Posters, logos, and UI mock-ups often overlay text on complex textures, requiring simultaneous visual and semantic reasoning.
Reflection: throwing 30 B parameters at the problem helps, but enterprise GPUs are expensive. Ovis-Image asks: “Can we match big-model quality with smart alignment and text-centric training instead of brute scale?”
2. Inside Ovis-Image: strong backbone, slim waistline
Core question: how does Ovis-Image keep parameter count low yet push text rendering scores high?
- MMDiT = 6 dual-stream blocks + 27 single-stream blocks, 24 attention heads, RoPE positional encoding, SwiGLU activation; handles any aspect ratio from 0.25 to 4.
- No refiner network; the final text hidden states feed straight into cross-attention → 18 % memory saving.
- The flow-matching objective keeps 20–50 sampling steps practical (a minimal sampler sketch follows below).

[Figure omitted; image source: official Hugging Face repo]
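Because the objective is flow matching, generation amounts to numerically integrating a learned velocity field from pure noise back to a clean latent. Below is a minimal Euler-integration sketch of that idea; the model callable and text_hidden conditioning are placeholder names, not the repo's actual interface.

import torch

@torch.no_grad()
def flow_matching_sample(model, text_hidden, shape, steps=50, device="cuda"):
    # Assumption: model(x_t, t, text_hidden) predicts the velocity field that
    # transports noise (t = 1) toward the clean latent (t = 0).
    x = torch.randn(shape, device=device)                    # start from Gaussian noise
    ts = torch.linspace(1.0, 0.0, steps + 1).tolist()        # time grid from noise to data
    for t, t_next in zip(ts[:-1], ts[1:]):
        t_batch = torch.full((shape[0],), t, device=device)  # same timestep for the whole batch
        v = model(x, t_batch, text_hidden)                    # predicted velocity
        x = x + (t_next - t) * v                              # Euler step toward t = 0
    return x                                                  # clean latent; the frozen VAE decodes it to pixels

Swapping in a fancier ODE solver (see the sampler question in the FAQ) only changes how this integration step is carried out.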
3. Four-stage training recipe: from pixels to spell-check
Core question: what data and tricks move the needle for legible, correctly spelled, layout-aware text?
Stage 0 – Pre-training
- 230 M image–text pairs, 40 % synthetic typographic renders, 60 % licensed/web photos; training starts at 256 px and ends at 1024 px.
- Heavy filtering: OCR consistency, an aesthetic scorer, and near-duplicate removal → 180 M clean samples (an illustrative filter sketch follows below).
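For intuition, here is a hedged sketch of what one pass of such a filter could look like; the ocr, aesthetic_score, and phash helpers are hypothetical stand-ins, not the authors' actual tooling.

def keep_sample(image, caption, ocr, aesthetic_score, phash, seen_hashes,
                min_ocr_overlap=0.8, min_aesthetic=5.0):
    # Illustrative filter combining the three checks described above.
    # 1. OCR consistency: words recovered from the image should appear in the caption.
    recovered = set(ocr(image))
    expected = {w for w in caption.split() if w.isalnum()}
    if expected and len(recovered & expected) / len(expected) < min_ocr_overlap:
        return False
    # 2. Aesthetic threshold from a learned scorer.
    if aesthetic_score(image) < min_aesthetic:
        return False
    # 3. Near-duplicate removal via a perceptual hash.
    h = phash(image)
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    return True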
Stage 1 – Supervised Fine-Tuning
- 20 M high-resolution (1024×1024) instruction prompts, e.g. “A music festival poster, headline ‘SUMMER BEATS’ in bold grotesk, pastel gradient background”.
- Teaches the decoder to obey layout, font-style, color, and perspective keywords.
Stage 2 – DPO Preference Alignment
- 1.2 M winner/loser pairs scored by a CLIP + PickScore + HPSv3 ensemble.
- The Diffusion-SDPO safeguard clips gradients when winner and loser conflict → stable training, fewer artifacts (a conceptual sketch follows below).
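The safeguard can be pictured as a conflict check between the two halves of the preference gradient. The sketch below uses a PCGrad-style projection as a stand-in for Diffusion-SDPO's exact rule, so treat it as the idea rather than the paper's algorithm.

import torch

def safeguarded_update(grad_winner, grad_loser):
    # grad_winner / grad_loser: flattened gradients of the winner- and loser-side
    # preference terms. If the loser term points against the winner term, remove
    # the conflicting component before combining (PCGrad-style projection).
    dot = torch.dot(grad_winner, grad_loser)
    if dot < 0:  # conflict: the loser update would undo winner progress
        grad_loser = grad_loser - (dot / grad_winner.pow(2).sum().clamp_min(1e-12)) * grad_winner
    return grad_winner + grad_loser  # combined, conflict-free update direction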
Stage 3 – GRPO Text-Specialist
- Narrow dataset: 50 k text-heavy prompts (Chinese & English); per prompt, 8 images are sampled on-policy and reward-ranked (a minimal advantage sketch follows below).
- Word accuracy jumps from 87.4 % → 92.0 %; rare characters (e.g., 饕餮) have an error rate below 3 %.
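GRPO needs no value network: each prompt's eight samples are scored (here by word accuracy) and compared against the group's own mean and spread. A minimal sketch follows; word_accuracy_reward is a hypothetical reward function, not the paper's exact reward.

import torch

def group_relative_advantages(rewards):
    # rewards: 1-D tensor of per-image rewards for one prompt's on-policy samples.
    # Centering and scaling within the group gives GRPO-style advantages.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Hypothetical usage for one prompt with 8 sampled images:
# rewards = torch.tensor([word_accuracy_reward(img, prompt) for img in samples])
# advantages = group_relative_advantages(rewards)  # positive -> reinforce, negative -> suppress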
Reflection: stage order matters. Swapping DPO and GRPO made images prettier but slightly blurred text—proof that you should polish pixels before drilling into glyphs.
4. Numbers do the talking: benchmark tables
Core question: how close does the 7 B model get to the 20 B club?
- English short phrases (CVTG-2K, 2–5 text regions): 92 % word accuracy, leading the comparison.
- Chinese long text (LongText-Bench, Chinese split): 96.4 %, ahead of GPT-4o.
- Dense prompt adherence (DPG-Bench overall): within roughly 2 % of the 20 B-class models.
Conclusion: for an overall quality gap of under 2 %, you gain more than 50 % VRAM headroom and roughly 2× throughput.
5. Hands-on deployment: three tested ways to run inference
Core question: what is the shortest path to pixels on your own GPU?
5.1 diffusers (a few lines)
pip install git+https://github.com/huggingface/diffusers
import torch
from diffusers import OvisImagePipeline

# Load the pipeline in bfloat16 and move it to the GPU.
pipe = OvisImagePipeline.from_pretrained("AIDC-AI/Ovis-Image-7B", torch_dtype=torch.bfloat16).to("cuda")
im = pipe("Grocery tote bag, bold text 'GREEN LIFE' in forest-green stencil font, recycled-paper texture").images[0]
im.save("tote.png")
5.2 Native PyTorch (full control)
git clone git@github.com:AIDC-AI/Ovis-Image.git
conda create -n ovis python=3.10 -y && conda activate ovis
cd Ovis-Image && pip install -r requirements.txt && pip install -e .
python ovis_image/test.py \
--model_path AIDC-AI/Ovis-Image-7B/ovis_image.safetensors \
--vae_path AIDC-AI/Ovis-Image-7B/ae.safetensors \
--ovis_path AIDC-AI/Ovis-Image-7B/Ovis2.5-2B \
--image_size 1024 --denoising_steps 50 --cfg_scale 5.0 \
--prompt "Cyberpunk night-market sign, traditional Chinese ‘夜市’ in neon, wet street reflections"
5.3 ComfyUI (designer friendly)
- Install ComfyUI, drop in the custom node from the official repo, and point the model path to ovis_image.safetensors; the node exposes steps / cfg / aspect sliders.
- 1024×1024 at 50 steps: 13.7 s on an H100, 28 s on an RTX 4090.
Reflection: in production we wrap the diffusers call behind a two-keyword template (“style”, “headline”)—designers never touch the full prompt, cutting support tickets by half.
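A minimal sketch of that wrapper, with made-up style presets standing in for our real template:

# Two-keyword wrapper: designers only pick a style key and type a headline.
STYLES = {
    "retro": "bold grotesk headline, pastel gradient background, grainy print texture",
    "neon": "neon sign lettering, dark wet-street background, cyberpunk palette",
}

def render(pipe, style, headline):
    prompt = (
        f"Poster, headline '{headline}' as the dominant text, "
        f"{STYLES[style]}, all text crisp and correctly spelled"
    )
    return pipe(prompt, num_inference_steps=50, guidance_scale=5.0).images[0]

# Example: render(pipe, "retro", "SUMMER BEATS").save("poster.png")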
6. Prompt engineering in the wild: posters, mock-ups, memes
Each example is copied verbatim from the paper/repo and tested on the downloadable weights.
6.1 Retail poster
Prompt: “Summer clearance poster, headline ‘70 % OFF’ in red bold Helvetica, yellow burst behind, white background”
Result: 5-region layout, all letters legible, burst shape aligned behind text.
6.2 App splash screen
Prompt: “Mobile splash screen, centered Chinese ‘轻记账’ in rounded sans-serif, pastel gradient, subtle shadow”
Outcome: hanzi baseline perfectly centered; gradient does not bleed into glyphs.
6.3 Hand-written meme
Prompt: “Panda holding bamboo, top text ‘I can’t even’ in sloppy marker, bottom text ‘Monday am I right’”
Tip: put style first (“sloppy marker”) to anchor texture, then content.
Reflection: placing the font description early and in English works for both Latin and Chinese text; late, multilingual clauses confuse the cross-attention map and raise the error rate by roughly 4 %.
7. Author’s reflection: what 7 B can and can’t do
- Strength: Chinese long-text accuracy beats GPT-4o, handy for CN-market localization.
- Limit: photographic realism is slightly behind 20 B models; skin pores may look waxy.
- Surprise: you can fine-tune your brand font with fewer than 200 curated images; MMDiT learns new glyphs fast because the text encoder stays frozen.
- Watch-out: extremely rare symbols (currency, math) still split into sub-characters; spell them out in the prompt or add specialty tokens.
8. Action checklist / one-page overview
Quick start
- Install diffusers: pip install git+https://github.com/huggingface/diffusers
- Load the pipeline: pipe = OvisImagePipeline.from_pretrained("AIDC-AI/Ovis-Image-7B", torch_dtype=torch.bfloat16).to("cuda")
- Generate: image = pipe("Your text here", num_inference_steps=50, guidance_scale=5.0).images[0]
Tuning tips
- Put font & color keywords early in the prompt.
- Use 50 steps for production, 20 for previews.
- Keep guidance at 4–6; above 7 introduces glyph fragmentation.
Hardware
- 24 GB VRAM suffices for 1024×1024.
- H100: 14 s, A100: 21 s, RTX 4090: 28 s (50 steps).
9. Quick FAQ
Q1: Can it run on 16 GB GPUs?
A: Yes, with model CPU off-load and batch=1, expect ~40 % longer runtime.
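Assuming OvisImagePipeline inherits the standard diffusers off-loading API (enable_model_cpu_offload needs accelerate installed), enabling it looks like this:

import torch
from diffusers import OvisImagePipeline

pipe = OvisImagePipeline.from_pretrained("AIDC-AI/Ovis-Image-7B", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # keeps only the active sub-module on the GPU; skip .to("cuda") when using this
image = pipe("Bakery logo, text 'DAILY BREAD' in warm serif", num_inference_steps=50).images[0]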
Q2: Is commercial use allowed?
A: Apache-2.0 license covers both code and weights—no extra permission needed.
Q3: How do I insert my corporate font?
A: Create 100–200 images with your font rendered on varied backgrounds, then fine-tune MMDiT for 500 steps (LR 1e-5) while keeping VAE & encoder frozen.
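A sketch of that setup under the hyper-parameters above; the module names (mmdit, text_encoder, vae), font_dataloader, and flow_matching_loss are placeholders, not the repo's fine-tuning script.

import torch

for p in vae.parameters():
    p.requires_grad_(False)            # VAE stays frozen
for p in text_encoder.parameters():
    p.requires_grad_(False)            # Ovis2.5 encoder stays frozen

optimizer = torch.optim.AdamW(mmdit.parameters(), lr=1e-5)    # LR 1e-5, as above

for step, batch in zip(range(500), font_dataloader):          # ~500 steps over the font renders
    loss = flow_matching_loss(mmdit, batch)                   # hypothetical training-loss helper
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()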
Q4: Does it support vertical Chinese or Japanese?
A: Possible but not optimized; add “vertical layout” early in prompt and increase steps to 60 for best legibility.
Q5: Why does “$” sometimes become “S”?
A: The dollar sign gets tokenized as two characters; replace it with the words “dollar sign” or use the full-width “＄”.
Q6: Any plan for smaller (<3 B) student models?
A: The paper hints that distillation is underway—no release date yet.
Q7: Best sampler?
A: The paper uses Euler; the author’s tests show DPM++ 2M Karras at 20 steps gives the same accuracy about 5 % faster (see the sketch below).
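If the pipeline exposes the standard diffusers scheduler interface (an assumption; check the repo), swapping samplers on an already-loaded pipe is a one-liner:

from diffusers import DPMSolverMultistepScheduler

# Replace the default sampler with DPM++ 2M Karras.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)
image = pipe("Bookstore window decal, text 'OPEN LATE'", num_inference_steps=20).images[0]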
Happy prompting—may your letters stay crisp and your GPU fans stay quiet!

