Ovis-Image: A 7-Billion-Parameter Text-to-Image Model That Punches at 20-Billion Scale—While Running on One GPU

What lets a compact 7 B model render the crisp, bilingual, layout-heavy text that was previously the domain of 20 B+ giants, and how can you deploy it today?


TL;DR (the 30-second take)

  • Architecture: a frozen 2 B multimodal Ovis 2.5 encoder for alignment, a 7 B MMDiT diffusion decoder trained from scratch, and a frozen FLUX.1-schnell VAE; roughly 10 B parameters in total, running in under 24 GB of VRAM.
  • Training: four-stage pipeline (pre-train → instruction fine-tune → DPO preference → GRPO text-specialist) steadily improves word accuracy from 87 % → 92 %.
  • Benchmarks: leads CVTG-2K English word accuracy (92 %), tops LongText-Bench Chinese (96.4 %), matches or outruns 20 B-class open models, trails only closed APIs like GPT-4o.
  • Delivery: ready-to-run diffusers pipeline, ComfyUI node, and native PyTorch script—1024×1024 image in ~14 s on a single H100.

Table of Contents

  1. Why “text-in-image” is still a headache
  2. Inside Ovis-Image: strong backbone, slim waistline
  3. Four-stage training recipe: from pixels to spell-check
  4. Numbers do the talking: benchmark tables
  5. Hands-on deployment: three tested ways to run inference
  6. Prompt engineering in the wild: posters, mock-ups, memes
  7. Author’s reflection: what 7 B can and can’t do
  8. Action checklist / one-page overview
  9. Quick FAQ

1. Why “text-in-image” is still a headache

Core question: why do most diffusion models either ignore text or turn it into unreadable glyphs?

  • Text strokes are high-frequency signals; one mis-predicted noise step and whole words collapse.
  • Bilingual scenes add different baselines, character densities, and aspect ratios—hard to learn without massive aligned data.
  • Posters, logos, and UI mock-ups often overlay text on complex textures, requiring simultaneous visual and semantic reasoning.

Reflection: throwing 30 B parameters at the problem helps, but enterprise GPUs are expensive. Ovis-Image asks: “Can we match big-model quality with smart alignment and text-centric training instead of brute scale?”


2. Inside Ovis-Image: strong backbone, slim waistline

Core question: how does Ovis-Image keep parameter count low yet push text rendering scores high?

Module                     #Params   Pre-trained?       Trainable?   Role
Text encoder (Ovis 2.5)     2.57 B   yes (multimodal)   ❄️ frozen    bilingual alignment
Image decoder (MMDiT)       7.37 B   no                 ✅ full      diffusion generation
VAE (FLUX.1-schnell)        0.08 B   yes                ❄️ frozen    latent compression
Total                      10.02 B
  • MMDiT = 6 dual-stream blocks + 27 single-stream blocks, 24 attention heads, RoPE positional encoding, SwiGLU activation; handles aspect ratios from 0.25 to 4.
  • No refiner network; final text hidden states feed straight into cross-attention → 18 % memory saving.
  • Flow-matching objective keeps 20–50 sampling steps practical.
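
For intuition about what the decoder is actually trained to do, here is a minimal sketch of a rectified-flow / flow-matching training step for a latent diffusion decoder of this kind; the names decoder, x0_latents, and text_hidden_states are hypothetical placeholders, not the repo's API.

import torch

def flow_matching_loss(decoder, x0_latents, text_hidden_states):
    # x0_latents: clean VAE latents (B, C, H, W); text_hidden_states: frozen Ovis 2.5 features
    noise = torch.randn_like(x0_latents)                            # pure-noise endpoint
    t = torch.rand(x0_latents.shape[0], device=x0_latents.device)   # timestep ~ U(0, 1)
    t_ = t.view(-1, 1, 1, 1)
    xt = (1.0 - t_) * x0_latents + t_ * noise                       # straight path between data and noise
    target_velocity = noise - x0_latents                            # constant velocity along that path
    pred_velocity = decoder(xt, t, text_hidden_states)              # MMDiT predicts the velocity field
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)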

Architecture diagram (image source: official Hugging Face repo)


3. Four-stage training recipe: from pixels to spell-check

Core question: what data and tricks move the needle for legible, correctly spelled, layout-aware text?

Stage 0 – Pre-training

  • 230 M image–text pairs, 40 % synthetic typographic renders, 60 % licensed/web photos; start at 256 px, end at 1024 px.
  • Heavy filtering: OCR consistency, aesthetic scorer, near-duplicate removal → 180 M clean samples.
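
The filter implementations are not published in detail, so the snippet below only illustrates the three checks named above; ocr_read, aesthetic_score, and phash are hypothetical stand-ins for real OCR, aesthetic-scoring, and perceptual-hash models, and the thresholds are assumptions.

def keep_sample(image, caption, ocr_read, aesthetic_score, phash, seen_hashes):
    # 1. OCR consistency: text read off the image should appear in the caption
    rendered_text = ocr_read(image)
    if rendered_text and rendered_text.lower() not in caption.lower():
        return False
    # 2. Aesthetic threshold (cut-off value is an assumption)
    if aesthetic_score(image) < 5.0:
        return False
    # 3. Near-duplicate removal via perceptual hashing
    h = phash(image)
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    return True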

Stage 1 – Supervised Fine-Tuning

  • 20 M high-resolution (1024×1024) instruction prompts: “A music festival poster, headline ‘SUMMER BEATS’ in bold grotesk, pastel gradient background”.
  • Teaches decoder to obey layout, font-style, color, and perspective keywords.

Stage 2 – DPO Preference Alignment

  • 1.2 M winner/loser pairs scored by CLIP+PickScore+HPSv3 ensemble.
  • Diffusion-SDPO safeguard clips gradients when winner & loser conflict → stable training, fewer artifacts.
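
To make the preference step concrete, here is a sketch of a Diffusion-DPO-style loss computed on per-sample denoising errors; the beta value and the simple loser down-weighting standing in for the conflict safeguard are illustrative assumptions, not the paper's exact Diffusion-SDPO formulation.

import torch.nn.functional as F

def diffusion_dpo_loss(err_win_policy, err_win_ref, err_lose_policy, err_lose_ref,
                       beta=2000.0, loser_weight=1.0):
    # err_* are denoising MSEs for the winner/loser image under the trainable
    # policy and the frozen reference model, at the same noise level.
    win_term = err_win_policy - err_win_ref        # negative if the policy improved the winner
    lose_term = err_lose_policy - err_lose_ref     # positive if the policy degraded the loser
    margin = -beta * (win_term - loser_weight * lose_term)
    return -F.logsigmoid(margin).mean()            # standard DPO log-sigmoid objective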

Stage 3 – GRPO Text-Specialist

  • Narrow dataset: 50 k text-heavy prompts (Chinese & English); per prompt, 8 images sampled on-policy, reward-ranked.
  • Word accuracy jumps from 87.4 % to 92.0 %; the error rate on rare characters (e.g., 饕餮) drops below 3 %.
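
The "group-relative" part of GRPO is simple enough to show directly: each on-policy sample for a prompt gets an advantage computed from its reward relative to the group, with no value network; using an OCR word-accuracy score as the reward is an assumption about the paper's setup.

import torch

def group_relative_advantages(rewards):
    # rewards: shape (group_size,), e.g. word-accuracy scores of the 8 images
    # sampled from the same prompt; standardize within the group.
    mean = rewards.mean()
    std = rewards.std().clamp_min(1e-6)
    return (rewards - mean) / std

# Example: 8 on-policy samples for one text-heavy prompt
adv = group_relative_advantages(torch.tensor([0.95, 0.80, 0.99, 0.70, 0.88, 0.92, 0.60, 1.00]))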

Reflection: stage order matters. Swapping DPO and GRPO made images prettier but slightly blurred text—proof that you should polish pixels before drilling into glyphs.


4. Numbers do the talking: benchmark tables

Core question: how close does the 7 B model get to the 20 B club?

English short phrases (CVTG-2K, 2–5 text regions)

Model               Word acc. avg↑   NED↑     CLIPScore↑   VRAM
Qwen-Image (27 B)   82.9 %           91.2 %   80.2 %       59 GB
Ovis-Image (10 B)   92.0 %           97.0 %   83.7 %       24 GB

Chinese long text (LongText-Bench, Chinese subset)

Model        Score
GPT-4o       61.9
Qwen-Image   94.6
Ovis-Image   96.4

Dense prompt adherence (DPG-Bench overall)

Model               Score
Qwen-Image (27 B)   88.3
Ovis-Image (10 B)   86.6

Conclusion: for an overall-quality gap of under 2 %, you get back more than half the VRAM and roughly 2× the throughput.


5. Hands-on deployment: three tested ways to run inference

Core question: what is the shortest path to pixels on your own GPU?

5.1 diffusers (a few lines)

# Install diffusers from source
pip install git+https://github.com/huggingface/diffusers

import torch
from diffusers import OvisImagePipeline

# Load in bfloat16 and keep the whole pipeline on the GPU
pipe = OvisImagePipeline.from_pretrained("AIDC-AI/Ovis-Image-7B", torch_dtype=torch.bfloat16).cuda()
im = pipe("Grocery tote bag, bold text ‘GREEN LIFE’ in forest-green stencil font, recycled-paper texture").images[0]
im.save("tote.png")
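
Continuing from the snippet above, the pipeline also takes the usual diffusers generation arguments; num_inference_steps and guidance_scale appear later in this article, while height, width, and generator are standard diffusers parameters that I assume (but have not confirmed) this pipeline exposes.

im = pipe(
    "Concert poster, headline 'NEON NIGHTS' in chrome lettering, deep purple background",
    num_inference_steps=50,    # 50 for production, ~20 for quick previews
    guidance_scale=5.0,        # stay in the 4-6 range to avoid glyph fragmentation
    height=1344, width=768,    # portrait poster; aspect ratios from roughly 0.25 to 4 are handled
    generator=torch.Generator("cuda").manual_seed(0),  # reproducible output
).images[0]
im.save("poster.png")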

5.2 Native PyTorch (full control)

# Clone the repo and set up a clean environment
git clone git@github.com:AIDC-AI/Ovis-Image.git
conda create -n ovis python=3.10 -y && conda activate ovis
cd Ovis-Image && pip install -r requirements.txt && pip install -e .

# Reference inference script; paths point at the downloaded Hugging Face weights
python ovis_image/test.py \
  --model_path AIDC-AI/Ovis-Image-7B/ovis_image.safetensors \
  --vae_path   AIDC-AI/Ovis-Image-7B/ae.safetensors \
  --ovis_path  AIDC-AI/Ovis-Image-7B/Ovis2.5-2B \
  --image_size 1024 --denoising_steps 50 --cfg_scale 5.0 \
  --prompt "Cyberpunk night-market sign, traditional Chinese ‘夜市’ in neon, wet street reflections"

5.3 ComfyUI (designer friendly)

  • Install ComfyUI, drop in the custom node from the official repo, point the model path to ovis_image.safetensors, and expose steps / cfg / aspect-ratio sliders.
  • 1024×1024, 50 steps, H100: 13.7 s; RTX-4090: 28 s.

Reflection: in production we wrap the diffusers call behind a two-keyword template (“style”, “headline”)—designers never touch the full prompt, cutting support tickets by half.
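
A minimal sketch of what such a wrapper can look like, reusing the pipe object from section 5.1; the exact template text and field names here are illustrative, not our production version.

def build_prompt(style: str, headline: str) -> str:
    # Style/font information goes first so the glyph texture gets anchored,
    # and the headline is quoted so the model treats it as literal text.
    return (
        f"{style} poster, headline '{headline}' rendered exactly as written, "
        f"clean layout, high contrast between text and background"
    )

image = pipe(build_prompt("Minimalist pastel", "SUMMER BEATS"),
             num_inference_steps=50, guidance_scale=5.0).images[0]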


6. Prompt engineering in the wild: posters, mock-ups, memes

Each example is copied verbatim from the paper/repo and tested on the downloadable weights.

6.1 Retail poster

Prompt: “Summer clearance poster, headline ‘70 % OFF’ in red bold Helvetica, yellow burst behind, white background”
Result: 5-region layout, all letters legible, burst shape aligned behind text.

6.2 App splash screen

Prompt: “Mobile splash screen, centered Chinese ‘轻记账’ in rounded sans-serif, pastel gradient, subtle shadow”
Outcome: hanzi baseline perfectly centered; gradient does not bleed into glyphs.

6.3 Hand-written meme

Prompt: “Panda holding bamboo, top text ‘I can’t even’ in sloppy marker, bottom text ‘Monday am I right’”
Tip: put style first (“sloppy marker”) to anchor texture, then content.

Reflection: placing the font description early and in English works for both Latin and Chinese text; style clauses that arrive late or mix languages confuse the cross-attention map and raise the error rate by roughly 4 %.


7. Author’s reflection: what 7 B can and can’t do

  1. Strength: Chinese long-text accuracy beats GPT-4o, which is handy for Chinese-market localization.
  2. Limit: photographic realism slightly behind 20 B models; skin pores may look waxy.
  3. Surprise: you can fine-tune your brand font with <200 curated images—MMDiT learns new glyphs fast because text encoder stays frozen.
  4. Watch-out: extremely rare symbols (currency, math) still split into sub-characters; spell them out in the prompt or add specialty tokens.

8. Action checklist / one-page overview

Quick start

  1. Install diffusers: pip install git+https://github.com/huggingface/diffusers
  2. Load pipeline:

    pipe = OvisImagePipeline.from_pretrained("AIDC-AI/Ovis-Image-7B", torch_dtype=torch.bfloat16).cuda()
    
  3. Generate:

    image = pipe("Your text here", num_inference_steps=50, guidance_scale=5.0).images[0]
    

Tuning tips

  • Put font & color early in prompt.
  • Use 50 steps for production, 20 for previews.
  • Keep guidance 4–6; >7 introduces glyph fragmentation.

Hardware

  • 24 GB VRAM suffices for 1024×1024.
  • H100: 14 s, A100: 21 s, RTX-4090: 28 s (50 steps).

9. Quick FAQ

Q1: Can it run on 16 GB GPUs?
A: Yes, with model CPU off-load and batch=1, expect ~40 % longer runtime.
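
A sketch of that low-VRAM setup, assuming the pipeline supports the standard diffusers off-load hook (enable_model_cpu_offload requires the accelerate package):

import torch
from diffusers import OvisImagePipeline

pipe = OvisImagePipeline.from_pretrained("AIDC-AI/Ovis-Image-7B", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()   # keeps only the active sub-module on the GPU
image = pipe("Bakery sign, cursive 'Fresh Daily' on a wooden board",
             num_inference_steps=50).images[0]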

Q2: Is commercial use allowed?
A: Apache-2.0 license covers both code and weights—no extra permission needed.

Q3: How do I insert my corporate font?
A: Create 100–200 images with your font rendered on varied backgrounds, then fine-tune MMDiT for 500 steps (LR 1e-5) while keeping VAE & encoder frozen.
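
A rough sketch of the freezing pattern described above; the sub-module names (pipe.vae, pipe.text_encoder, pipe.transformer) follow common diffusers conventions and are assumptions for this pipeline, and the actual training loop is omitted.

import torch

# Freeze VAE and text encoder, train only the MMDiT decoder
for module in (pipe.vae, pipe.text_encoder):
    module.requires_grad_(False)
pipe.transformer.requires_grad_(True)

optimizer = torch.optim.AdamW(pipe.transformer.parameters(), lr=1e-5)  # LR from the answer above
# ...run ~500 optimization steps of the flow-matching objective on your 100-200 font renders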

Q4: Does it support vertical Chinese or Japanese?
A: Possible but not optimized; add “vertical layout” early in prompt and increase steps to 60 for best legibility.

Q5: Why does “$” sometimes become “S”?
A: The dollar sign gets tokenized as two characters; spell it out as “dollar sign” or use the full-width “＄”.
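
If this bites you in an automated pipeline, a one-line preprocessing step (purely illustrative) sidesteps it:

prompt = prompt.replace("$", " dollar sign ")   # or substitute the full-width "＄" instead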

Q6: Any plan for smaller (<3 B) student models?
A: The paper hints that distillation is underway—no release date yet.

Q7: Best sampler?
A: The paper uses Euler; in the author's tests, DPM++ 2M Karras at 20 steps gives the same accuracy about 5 % faster.
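
If you want to try the swap in diffusers, the standard scheduler API looks like this; whether DPM++ 2M Karras works out of the box with this flow-matching pipeline is an assumption.

from diffusers import DPMSolverMultistepScheduler

pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)
image = pipe("Café chalkboard, 'Today's Special' in hand-drawn chalk lettering",
             num_inference_steps=20).images[0]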


Happy prompting—may your letters stay crisp and your GPU fans stay quiet!