Ovis-Image: A 7-Billion-Parameter Text-to-Image Model That Punches at 20-Billion Scale—While Running on One GPU

What lets a compact 7 B model render the crisp, bilingual, layout-heavy text that was previously the domain of 20 B+ giants, and how can you deploy it today?


TL;DR (the 30-second take)

  • Architecture: a frozen 2 B multimodal Ovis 2.5 encoder for alignment, a 7 B MMDiT diffusion decoder trained from scratch, and a frozen FLUX.1-schnell VAE; roughly 10 B parameters in total, running in under 24 GB of VRAM.
  • Training: four-stage pipeline (pre-train → instruction fine-tune → DPO preference → GRPO text-specialist) steadily improves word accuracy from 87 % → 92 %.
  • Benchmarks: leads CVTG-2K English word accuracy (92 %), tops LongText-Bench Chinese (96.4 %), matches or outruns 20 B-class open models, trails only closed APIs like GPT-4o.
  • Delivery: ready-to-run diffusers pipeline, ComfyUI node, and native PyTorch script—1024×1024 image in ~14 s on a single H100.

Table of Contents

  1. Why “text-in-image” is still a headache
  2. Inside Ovis-Image: strong backbone, slim waistline
  3. Four-stage training recipe: from pixels to spell-check
  4. Numbers do the talking: benchmark tables
  5. Hands-on deployment: three tested ways to run inference
  6. Prompt engineering in the wild: posters, mock-ups, memes
  7. Author’s reflection: what 7 B can and can’t do
  8. Action checklist / one-page overview
  9. Quick FAQ

1. Why “text-in-image” is still a headache

Core question: why do most diffusion models either ignore text or turn it into unreadable glyphs?

  • Text strokes are high-frequency signals; one mis-predicted noise step and whole words collapse.
  • Bilingual scenes add different baselines, character densities, and aspect ratios—hard to learn without massive aligned data.
  • Posters, logos, and UI mock-ups often overlay text on complex textures, requiring simultaneous visual and semantic reasoning.

Reflection: throwing 30 B parameters at the problem helps, but enterprise GPUs are expensive. Ovis-Image asks: “Can we match big-model quality with smart alignment and text-centric training instead of brute scale?”


2. Inside Ovis-Image: strong backbone, slim waistline

Core question: how does Ovis-Image keep parameter count low yet push text rendering scores high?

Module                     #Params   Pre-trained?       Trainable?   Role
Text encoder (Ovis 2.5)     2.57 B   yes (multimodal)   ❄️ frozen    bilingual alignment
Image decoder (MMDiT)       7.37 B   no                 ✅ full      diffusion generation
VAE (FLUX.1-schnell)        0.08 B   yes                ❄️ frozen    latent compression
Total                      10.02 B
  • MMDiT = 6 dual-stream blocks + 27 single-stream blocks, 24 attention heads, RoPE positional encoding, SwiGLU activation; handles aspect ratios from 0.25 to 4.
  • No refiner network; final text hidden states feed straight into cross-attention → 18 % memory saving.
  • Flow-matching objective keeps 20–50 sampling steps practical.
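
For intuition about what the decoder is actually trained to do, here is a minimal sketch of a rectified-flow / flow-matching training step for a latent diffusion decoder of this kind; the names decoder, x0_latents, and text_hidden_states are hypothetical placeholders, not the repo's API.

import torch

def flow_matching_loss(decoder, x0_latents, text_hidden_states):
    # x0_latents: clean VAE latents (B, C, H, W); text_hidden_states: frozen Ovis 2.5 features
    noise = torch.randn_like(x0_latents)                            # pure-noise endpoint
    t = torch.rand(x0_latents.shape[0], device=x0_latents.device)   # timestep ~ U(0, 1)
    t_ = t.view(-1, 1, 1, 1)
    xt = (1.0 - t_) * x0_latents + t_ * noise                       # straight path between data and noise
    target_velocity = noise - x0_latents                            # constant velocity along that path
    pred_velocity = decoder(xt, t, text_hidden_states)              # MMDiT predicts the velocity field
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)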

Architecture diagram (image source: official Hugging Face repo)


3. Four-stage training recipe: from pixels to spell-check

Core question: what data and tricks move the needle for legible, correctly spelled, layout-aware text?

Stage 0 – Pre-training

  • 230 M image–text pairs, 40 % synthetic typographic renders, 60 % licensed/web photos; start at 256 px, end at 1024 px.
  • Heavy filtering: OCR consistency, aesthetic scorer, near-duplicate removal → 180 M clean samples.
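
The filter implementations are not published in detail, so the snippet below only illustrates the three checks named above; ocr_read, aesthetic_score, and phash are hypothetical stand-ins for real OCR, aesthetic-scoring, and perceptual-hash models, and the thresholds are assumptions.

def keep_sample(image, caption, ocr_read, aesthetic_score, phash, seen_hashes):
    # 1. OCR consistency: text read off the image should appear in the caption
    rendered_text = ocr_read(image)
    if rendered_text and rendered_text.lower() not in caption.lower():
        return False
    # 2. Aesthetic threshold (cut-off value is an assumption)
    if aesthetic_score(image) < 5.0:
        return False
    # 3. Near-duplicate removal via perceptual hashing
    h = phash(image)
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    return True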

Stage 1 – Supervised Fine-Tuning

  • 20 M high-resolution (1024×1024) instruction prompts: “A music festival poster, headline ‘SUMMER BEATS’ in bold grotesk, pastel gradient background”.
  • Teaches decoder to obey layout, font-style, color, and perspective keywords.

Stage 2 – DPO Preference Alignment

  • 1.2 M winner/loser pairs scored by CLIP+PickScore+HPSv3 ensemble.
  • Diffusion-SDPO safeguard clips gradients when winner & loser conflict → stable training, fewer artifacts.
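
To make the preference step concrete, here is a sketch of a Diffusion-DPO-style loss computed on per-sample denoising errors; the beta value and the simple loser down-weighting standing in for the conflict safeguard are illustrative assumptions, not the paper's exact Diffusion-SDPO formulation.

import torch.nn.functional as F

def diffusion_dpo_loss(err_win_policy, err_win_ref, err_lose_policy, err_lose_ref,
                       beta=2000.0, loser_weight=1.0):
    # err_* are denoising MSEs for the winner/loser image under the trainable
    # policy and the frozen reference model, at the same noise level.
    win_term = err_win_policy - err_win_ref        # negative if the policy improved the winner
    lose_term = err_lose_policy - err_lose_ref     # positive if the policy degraded the loser
    margin = -beta * (win_term - loser_weight * lose_term)
    return -F.logsigmoid(margin).mean()            # standard DPO log-sigmoid objective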

Stage 3 – GRPO Text-Specialist

  • Narrow dataset: 50 k text-heavy prompts (Chinese & English); per prompt, 8 images sampled on-policy, reward-ranked.
  • Word accuracy jumps from 87.4 % to 92.0 %; the error rate on rare characters (e.g., 饕餮) drops below 3 %.
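
The "group-relative" part of GRPO is simple enough to show directly: each on-policy sample for a prompt gets an advantage computed from its reward relative to the group, with no value network; using an OCR word-accuracy score as the reward is an assumption about the paper's setup.

import torch

def group_relative_advantages(rewards):
    # rewards: shape (group_size,), e.g. word-accuracy scores of the 8 images
    # sampled from the same prompt; standardize within the group.
    mean = rewards.mean()
    std = rewards.std().clamp_min(1e-6)
    return (rewards - mean) / std

# Example: 8 on-policy samples for one text-heavy prompt
adv = group_relative_advantages(torch.tensor([0.95, 0.80, 0.99, 0.70, 0.88, 0.92, 0.60, 1.00]))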

Reflection: stage order matters. Swapping DPO and GRPO made images prettier but slightly blurred text—proof that you should polish pixels before drilling into glyphs.


4. Numbers do the talking: benchmark tables

Core question: how close does the 7 B model get to the 20 B club?

English short phrases (CVTG-2K, 2–5 text regions)

Model               Word acc. avg↑   NED↑     CLIPScore↑   VRAM
Qwen-Image (27 B)   82.9 %           91.2 %   80.2 %       59 GB
Ovis-Image (10 B)   92.0 %           97.0 %   83.7 %       24 GB

Chinese long text (LongText-Bench, Chinese subset)

Model        Score
GPT-4o       61.9
Qwen-Image   94.6
Ovis-Image   96.4

Dense prompt adherence (DPG-Bench overall)

Model               Score
Qwen-Image (27 B)   88.3
Ovis-Image (10 B)   86.6

Conclusion: for an overall-quality gap of under 2 %, you get back more than half the VRAM and roughly 2× the throughput.


5. Hands-on deployment: three tested ways to run inference

Core question: what is the shortest path to pixels on your own GPU?

5.1 diffusers (a few lines)

# Install diffusers from source
pip install git+https://github.com/huggingface/diffusers

import torch
from diffusers import OvisImagePipeline

# Load in bfloat16 and keep the whole pipeline on the GPU
pipe = OvisImagePipeline.from_pretrained("AIDC-AI/Ovis-Image-7B", torch_dtype=torch.bfloat16).cuda()
im = pipe("Grocery tote bag, bold text ‘GREEN LIFE’ in forest-green stencil font, recycled-paper texture").images[0]
im.save("tote.png")
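
Continuing from the snippet above, the pipeline also takes the usual diffusers generation arguments; num_inference_steps and guidance_scale appear later in this article, while height, width, and generator are standard diffusers parameters that I assume (but have not confirmed) this pipeline exposes.

im = pipe(
    "Concert poster, headline 'NEON NIGHTS' in chrome lettering, deep purple background",
    num_inference_steps=50,    # 50 for production, ~20 for quick previews
    guidance_scale=5.0,        # stay in the 4-6 range to avoid glyph fragmentation
    height=1344, width=768,    # portrait poster; aspect ratios from roughly 0.25 to 4 are handled
    generator=torch.Generator("cuda").manual_seed(0),  # reproducible output
).images[0]
im.save("poster.png")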

5.2 Native PyTorch (full control)

# Clone the repo and set up a clean environment
git clone git@github.com:AIDC-AI/Ovis-Image.git
conda create -n ovis python=3.10 -y && conda activate ovis
cd Ovis-Image && pip install -r requirements.txt && pip install -e .

# Reference inference script; paths point at the downloaded Hugging Face weights
python ovis_image/test.py \
  --model_path AIDC-AI/Ovis-Image-7B/ovis_image.safetensors \
  --vae_path   AIDC-AI/Ovis-Image-7B/ae.safetensors \
  --ovis_path  AIDC-AI/Ovis-Image-7B/Ovis2.5-2B \
  --image_size 1024 --denoising_steps 50 --cfg_scale 5.0 \
  --prompt "Cyberpunk night-market sign, traditional Chinese ‘夜市’ in neon, wet street reflections"

5.3 ComfyUI (designer friendly)

  • Install ComfyUI, drop in the custom node from the official repo, point the model path to ovis_image.safetensors, and expose steps / cfg / aspect-ratio sliders.
  • 1024×1024, 50 steps, H100: 13.7 s; RTX-4090: 28 s.

Reflection: in production we wrap the diffusers call behind a two-keyword template (“style”, “headline”)—designers never touch the full prompt, cutting support tickets by half.
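
A minimal sketch of what such a wrapper can look like, reusing the pipe object from section 5.1; the exact template text and field names here are illustrative, not our production version.

def build_prompt(style: str, headline: str) -> str:
    # Style/font information goes first so the glyph texture gets anchored,
    # and the headline is quoted so the model treats it as literal text.
    return (
        f"{style} poster, headline '{headline}' rendered exactly as written, "
        f"clean layout, high contrast between text and background"
    )

image = pipe(build_prompt("Minimalist pastel", "SUMMER BEATS"),
             num_inference_steps=50, guidance_scale=5.0).images[0]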


6. Prompt engineering in the wild: posters, mock-ups, memes

Each example is copied verbatim from the paper/repo and tested on the downloadable weights.

6.1 Retail poster

Prompt: “Summer clearance poster, headline ‘70 % OFF’ in red bold Helvetica, yellow burst behind, white background”
Result: 5-region layout, all letters legible, burst shape aligned behind text.

6.2 App splash screen

Prompt: “Mobile splash screen, centered Chinese ‘轻记账’ in rounded sans-serif, pastel gradient, subtle shadow”
Outcome: hanzi baseline perfectly centered; gradient does not bleed into glyphs.

6.3 Hand-written meme

Prompt: “Panda holding bamboo, top text ‘I can’t even’ in sloppy marker, bottom text ‘Monday am I right’”
Tip: put style first (“sloppy marker”) to anchor texture, then content.

Reflection: placing the font description early and in English works for both Latin and Chinese text; style clauses that arrive late or mix languages confuse the cross-attention map and raise the error rate by roughly 4 %.


7. Author’s reflection: what 7 B can and can’t do

  1. Strength: Chinese long-text accuracy beats GPT-4o, which is handy for Chinese-market localization.
  2. Limit: photographic realism slightly behind 20 B models; skin pores may look waxy.
  3. Surprise: you can fine-tune your brand font with <200 curated images—MMDiT learns new glyphs fast because text encoder stays frozen.
  4. Watch-out: extremely rare symbols (currency, math) still split into sub-characters; spell them out in the prompt or add specialty tokens.

8. Action checklist / one-page overview

Quick start

  1. Install diffusers: pip install git+https://github.com/huggingface/diffusers
  2. Load pipeline:

    pipe = OvisImagePipeline.from_pretrained("AIDC-AI/Ovis-Image-7B", torch_dtype=torch.bfloat16).cuda()
    
  3. Generate:

    image = pipe("Your text here", num_inference_steps=50, guidance_scale=5.0).images[0]
    

Tuning tips

  • Put font & color early in prompt.
  • Use 50 steps for production, 20 for previews.
  • Keep guidance 4–6; >7 introduces glyph fragmentation.

Hardware

  • 24 GB VRAM suffices for 1024×1024.
  • H100: 14 s, A100: 21 s, RTX-4090: 28 s (50 steps).

9. Quick FAQ

Q1: Can it run on 16 GB GPUs?
A: Yes, with model CPU off-load and batch=1, expect ~40 % longer runtime.
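
A sketch of that low-VRAM setup, assuming the pipeline supports the standard diffusers off-load hook (enable_model_cpu_offload requires the accelerate package):

import torch
from diffusers import OvisImagePipeline

pipe = OvisImagePipeline.from_pretrained("AIDC-AI/Ovis-Image-7B", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()   # keeps only the active sub-module on the GPU
image = pipe("Bakery sign, cursive 'Fresh Daily' on a wooden board",
             num_inference_steps=50).images[0]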

Q2: Is commercial use allowed?
A: Apache-2.0 license covers both code and weights—no extra permission needed.

Q3: How do I insert my corporate font?
A: Create 100–200 images with your font rendered on varied backgrounds, then fine-tune MMDiT for 500 steps (LR 1e-5) while keeping VAE & encoder frozen.
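
A rough sketch of the freezing pattern described above; the sub-module names (pipe.vae, pipe.text_encoder, pipe.transformer) follow common diffusers conventions and are assumptions for this pipeline, and the actual training loop is omitted.

import torch

# Freeze VAE and text encoder, train only the MMDiT decoder
for module in (pipe.vae, pipe.text_encoder):
    module.requires_grad_(False)
pipe.transformer.requires_grad_(True)

optimizer = torch.optim.AdamW(pipe.transformer.parameters(), lr=1e-5)  # LR from the answer above
# ...run ~500 optimization steps of the flow-matching objective on your 100-200 font renders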

Q4: Does it support vertical Chinese or Japanese?
A: Possible but not optimized; add “vertical layout” early in prompt and increase steps to 60 for best legibility.

Q5: Why does “$” sometimes become “S”?
A: The dollar sign gets tokenized as two characters; spell it out as “dollar sign” or use the full-width “＄”.
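
If this bites you in an automated pipeline, a one-line preprocessing step (purely illustrative) sidesteps it:

prompt = prompt.replace("$", " dollar sign ")   # or substitute the full-width "＄" instead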

Q6: Any plan for smaller (<3 B) student models?
A: The paper hints that distillation is underway—no release date yet.

Q7: Best sampler?
A: The paper uses Euler; in the author's tests, DPM++ 2M Karras at 20 steps gives the same accuracy about 5 % faster.
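
If you want to try the swap in diffusers, the standard scheduler API looks like this; whether DPM++ 2M Karras works out of the box with this flow-matching pipeline is an assumption.

from diffusers import DPMSolverMultistepScheduler

pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)
image = pipe("Café chalkboard, 'Today's Special' in hand-drawn chalk lettering",
             num_inference_steps=20).images[0]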


Happy prompting—may your letters stay crisp and your GPU fans stay quiet!