X-Omni Explained: How Reinforcement Learning Revives Autoregressive Image Generation
A plain-English, globally friendly guide to the 7 B unified image-and-language model
1. What Is X-Omni?
In one sentence: X-Omni is a 7-billion-parameter model that writes both words and pictures in the same breath, then uses reinforcement learning to make every pixel look right.
2. Why Did Earlier Autoregressive Image Models Look Bad?
2.1 The classic pitfalls
- Error snowball: predict token #1 → tiny mistake → token #2 is now off → by token #100 the cat has three ears.
- Quantization loss: converting a photo into discrete “words” throws away fine texture.
- One-dimensional blinders: token order is strictly left-to-right, top-to-bottom; global layout is invisible.
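The quantization-loss pitfall is easy to demonstrate numerically: the fewer discrete codes available, the more detail is destroyed. A toy sketch (nothing like X-Omni's actual tokenizer, just the principle):

```python
import numpy as np

# Toy illustration of quantization loss: compressing continuous pixel
# values into a small discrete codebook discards fine texture.
rng = np.random.default_rng(0)
pixels = rng.random(10_000)  # stand-in for "continuous" image values in [0, 1)

def quantize(x: np.ndarray, levels: int) -> np.ndarray:
    """Round each value to the nearest of `levels` evenly spaced codes."""
    codes = np.linspace(0.0, 1.0, levels)
    idx = np.abs(x[:, None] - codes[None, :]).argmin(axis=1)
    return codes[idx]

for levels in (4, 16, 256):
    err = np.mean((pixels - quantize(pixels, levels)) ** 2)
    print(f"{levels:>3} codes -> MSE {err:.6f}")
```

More codes means lower reconstruction error, but a bigger codebook also makes next-token prediction harder; that tension is exactly what the diffusion decoder and RL stage compensate for.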
2.2 How reinforcement learning fixes them
Instead of hoping the next token is correct, X-Omni lets a reward jury grade the entire finished picture:
- Human Preference Score v2: does it look nice?
- Unified Reward: high-resolution overall quality.
- Text–Image Alignment: does the picture actually match the prompt?
- OCR Accuracy: if the prompt asked for text inside the image, can you read it?
The model then adjusts itself to maximize these scores. Think of it as an art teacher giving real-time feedback instead of a robot that only follows the textbook.
3. Quick-Start Installation (Copy-Paste Ready)
3.1 Hardware & OS
- GPU: ≥24 GB VRAM (RTX 4090, A100, or L40).
- OS: Ubuntu 20.04+ or Windows 11 with WSL2.
- Python: exactly 3.12 (specified in the repo).
3.2 Three-command setup
# 1. Fresh environment
conda create -n xomni python=3.12 -y
conda activate xomni
# 2. Core requirements
pip install -r requirements.txt
# 3. Fast attention kernel (skip compile)
pip install flash_attn-2.7.3+cu12torch2.6-cp312-linux_x86_64.whl --no-build-isolation
If the wheel mismatches your CUDA version, fall back to the official Flash-Attention repository and choose the matching file.
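Before the first run, it can help to confirm the key packages are importable. The package list below is an assumption based on the setup above; adjust it to whatever the repo's requirements.txt actually pins:

```python
import importlib.util

# Check which required packages are importable without actually loading
# the heavy modules themselves.
required = ["torch", "transformers", "flash_attn"]
status = {name: importlib.util.find_spec(name) is not None for name in required}
for name, ok in status.items():
    print(f"{name:12s} {'OK' if ok else 'MISSING'}")
```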
4. Generate Your First Poster
4.1 English formal letter (1152 × 1152 px)
PROMPT='A formal letter document with a professional tone...' # full text in README
python generate.py \
--model_name_or_path X-Omni/X-Omni-En \
--flux_model_name_or_path /path/to/FLUX.1-dev \
--prompt "$PROMPT" \
--image-size 1152 1152 \
--cfg-scale 1.0 \
--min-p 0.03 \
--seed 1234 \
--output-path ./formal_letter.png
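The --min-p 0.03 flag selects min-p sampling: only tokens whose probability is at least 3% of the most likely token's survive, and the rest of the mass is renormalized. A minimal sketch of the common definition (X-Omni's exact implementation may differ):

```python
import numpy as np

# Min-p filtering: keep tokens with p >= min_p * max(p), renormalize.
def min_p_filter(probs: np.ndarray, min_p: float = 0.03) -> np.ndarray:
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

probs = np.array([0.60, 0.25, 0.10, 0.04, 0.01])
filtered = min_p_filter(probs)
print(filtered)  # the 0.01 token (below 0.03 * 0.60 = 0.018) is zeroed out
```

Unlike a fixed top-p cutoff, the threshold scales with the model's confidence: a peaked distribution prunes aggressively, a flat one keeps more candidates.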
4.2 Chinese travel brochure
PROMPT='Generate a panoramic cover of the Forbidden City in the snow...' # full text in README
python generate.py \
--model_name_or_path X-Omni/X-Omni-Zh \
--flux_model_name_or_path /path/to/FLUX.1-dev \
--prompt "$PROMPT" \
--image-size 1152 1152 \
--cfg-scale 1.0 \
--min-p 0.03 \
--seed 1234 \
--output-path ./forbidden_city.png
Tip: --cfg-scale 1.0 literally turns off classifier-free guidance. X-Omni still delivers crisp results, unlike older autoregressive models that collapse without CFG.
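Why scale 1.0 means "off": CFG interpolates conditional and unconditional predictions, and at scale 1.0 the formula collapses to the plain conditional distribution:

```python
import numpy as np

# Classifier-free guidance on logits:
#   l = l_uncond + s * (l_cond - l_uncond)
# At s = 1.0 this reduces to l_cond exactly, i.e. guidance is disabled,
# which is what --cfg-scale 1.0 does.
def cfg(logits_cond: np.ndarray, logits_uncond: np.ndarray, scale: float) -> np.ndarray:
    return logits_uncond + scale * (logits_cond - logits_uncond)

lc = np.array([2.0, 0.5, -1.0])   # conditioned on the prompt
lu = np.array([1.0, 1.0, 1.0])    # unconditional
assert np.allclose(cfg(lc, lu, 1.0), lc)  # scale 1.0 -> pure conditional
print(cfg(lc, lu, 3.0))                   # scale > 1 exaggerates the prompt signal
```

Skipping guidance also halves the forward passes per token, since no unconditional branch needs to be computed.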
5. Chat with an Image
python chat.py \
--model_name_or_path X-Omni/X-Omni-En \
--flux_model_name_or_path /path/to/FLUX.1-dev \
--prompt "Describe the image in detail." \
--image-path ./your_photo.jpg
Sample reply:
“A French bulldog wearing orange diving goggles swims underwater in a crystal-clear pool. Sunlight creates caustic patterns on his fur, and tiny air bubbles trail from his smiling snout…”
6. Benchmarking Long-Form Text: LongText-Bench
6.1 Why invent another benchmark?
LongText-Bench is specifically tuned to test whether a model can accurately render long paragraphs inside an image.
6.2 Run the benchmark yourself
- Generate
  - Use prompts in text_prompts.jsonl and text_prompts_zh.jsonl.
  - Save as {id}_{1..4}.png (4 samples per prompt).
- Evaluate
  cd textbench
  bash eval.sh
  Edit MODE and SAMPLE_FOLDER inside the script to point to your outputs.
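The expected {id}_{1..4}.png naming scheme can be produced with a small helper (the output directory name is illustrative):

```python
from pathlib import Path

# Build the {id}_{1..4}.png filenames the benchmark expects,
# four samples per prompt.
def sample_paths(prompt_id: str, out_dir: str = "outputs", n: int = 4) -> list[Path]:
    return [Path(out_dir) / f"{prompt_id}_{i}.png" for i in range(1, n + 1)]

print([p.name for p in sample_paths("0007")])
# → ['0007_1.png', '0007_2.png', '0007_3.png', '0007_4.png']
```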
7. Architecture in Plain English
7.1 Components
7.2 Training stages
- Pre-training
  - 200 M high-quality image–text pairs → 600 B multimodal tokens.
  - 59 M image-understanding samples → 100 B tokens.
- Supervised fine-tuning (SFT)
  - 30 K curated samples + 30 K synthetic long-text data + understanding tasks.
  - 1.5 B tokens total.
- Reinforcement learning (GRPO)
  - 180 K prompts (80 K from Midjourney, 50 K text-heavy, 50 K natural images).
  - 200 steps, batch 512, 16 rollouts each → 2 h on 8×A100.
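The heart of GRPO's credit assignment is simple: each rollout's reward is normalized against its own group of 16 rollouts for the same prompt, so no learned value function is needed. A minimal sketch of that normalization, not the X-Omni training code:

```python
import numpy as np

# Group-relative advantages: standardize rewards within each prompt's
# group of rollouts. Rollouts better than their siblings get positive
# advantage, worse ones negative.
def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (num_prompts, rollouts_per_prompt)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

rewards = np.random.default_rng(0).random((4, 16))  # 4 prompts × 16 rollouts
adv = group_relative_advantages(rewards)
print(adv.mean(axis=1))  # ≈ 0 within each group
```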
8. Performance Snapshot
8.1 Text rendering (LongText-Bench)
Higher is better. X-Omni leads open-source unified models and narrows the gap with proprietary giants.
8.2 General text-to-image (GenEval)
8.3 Image understanding (OCRBench)
9. Key Findings from the Paper
9.1 No classifier-free guidance needed
Most autoregressive image models lean heavily on CFG to look good.
X-Omni keeps quality high even at cfg-scale = 1.0, cutting compute and proving the unified training worked.
9.2 RL beats “best-of-N”
After supervised fine-tuning, researchers generated 16 images per prompt and picked the best (BoN).
Reinforcement learning still outperforms this brute-force approach because it aligns the autoregressive tokens with the diffusion decoder instead of hoping they match.
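The BoN baseline is easy to reproduce in outline; here `generate` and `score` are toy stand-ins for the model and the reward jury:

```python
import random

# Best-of-N: sample N candidates, score each, keep the argmax.
# RL training instead moves the policy itself toward high-reward
# outputs, so it can exceed what any fixed sample pool contains.
def best_of_n(generate, score, prompt: str, n: int = 16):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

random.seed(0)
gen = lambda prompt: random.random()  # an "image" is just a number here
best = best_of_n(gen, score=lambda x: x, prompt="a poster", n=16)
print(best)
```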
10. Frequently Asked Questions
Q1: Will discrete tokens always look worse than continuous diffusion?
A: No. The diffusion decoder refines the discrete semantic tokens into high-resolution pixels; RL further closes the quality gap.
Q2: Is 7 B big enough?
A: On long-text rendering, it beats 20 B hybrid models, showing that smart training can outweigh raw parameter count.
Q3: Do I need separate Chinese training?
A: The Chinese checkpoint was created by continuing RL on Chinese prompts for a few hours; no full second training run required.
Q4: How much VRAM at inference?
A: A single GPU with ≥24 GB VRAM (see Section 3.1: RTX 4090, A100, or L40).
Q5: Can I use it commercially?
A: Weights are Apache-2.0-friendly, but FLUX.1-dev carries its own license; double-check before shipping products.
Q6: How long to fine-tune on my own dataset?
A:
- SFT: 1.5 B tokens, 8×A100 ≈ 6 h
- RL: 180 K prompts, 8×A100 ≈ 24 h
Q7: Windows support?
A: WSL2 + CUDA 12.2 confirmed; native Windows build not yet tested.
Q8: REST-API wrapper?
A: Use the official Hugging Face Space, or wrap generate.py with FastAPI in under 50 lines.
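A sketch of the subprocess plumbing such a wrapper needs. The defaults mirror the commands in Section 4 and are illustrative; expose generate_image behind a FastAPI route (or any HTTP framework) of your choosing:

```python
import subprocess
from pathlib import Path

# Build the generate.py command line from a request payload.
# Model path, FLUX path, and defaults are illustrative; adapt to
# your checkout and checkpoints.
def build_cmd(prompt: str, out_path: str,
              model: str = "X-Omni/X-Omni-En",
              flux: str = "/path/to/FLUX.1-dev") -> list[str]:
    return ["python", "generate.py",
            "--model_name_or_path", model,
            "--flux_model_name_or_path", flux,
            "--prompt", prompt,
            "--image-size", "1152", "1152",
            "--cfg-scale", "1.0",
            "--min-p", "0.03",
            "--output-path", out_path]

def generate_image(prompt: str, out_path: str) -> Path:
    """Blocking call; raises CalledProcessError if generation fails."""
    subprocess.run(build_cmd(prompt, out_path), check=True)
    return Path(out_path)

cmd = build_cmd("A formal letter", "./out.png")
print(cmd[:2])  # → ['python', 'generate.py']
```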
11. Hands-On Checklist
12. Citation
If you use X-Omni in your work, please cite:
@article{geng2025xomni,
author = {Zigang Geng and Yibing Wang and Yeyao Ma and Chen Li and Yongming Rao and Shuyang Gu and Zhao Zhong and Qinglin Lu and Han Hu and Xiaosong Zhang and Linus and Di Wang and Jie Jiang},
title = {X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again},
journal = {CoRR},
volume = {abs/2507.22058},
year = {2025}
}