X-Omni Explained: How Reinforcement Learning Revives Autoregressive Image Generation

A plain-English, globally friendly guide to the 7 B unified image-and-language model


1. What Is X-Omni?

In one sentence: X-Omni is a 7-billion-parameter model that writes both words and pictures in the same breath, then uses reinforcement learning to make every pixel look right.

| Key Fact | Plain-English Meaning |
|---|---|
| Unified autoregressive | One brain handles both text and images, so knowledge flows freely between them. |
| Discrete tokens | Images are chopped into "visual words" drawn from a 16 384-entry vocabulary; the model predicts the next visual word just as GPT predicts the next text token. |
| Reinforcement-learning polish | After normal training, the model plays a reward game: every draft image is scored on beauty, text accuracy, and instruction-following, and the scores guide final tweaks. |
| Arbitrary resolution | Tell it any width and height up to 1 152 px and it simply adapts; no more square-only outputs. |
| English & Chinese | Two ready-made checkpoints: X-Omni-En and X-Omni-Zh. |

2. Why Did Earlier Autoregressive Image Models Look Bad?

2.1 The classic pitfalls

  1. Error snowball
    Predict token #1 → tiny mistake → token #2 is now off → by token #100 the cat has three ears.
  2. Quantization loss
    Converting a photo into discrete “words” throws away fine texture.
  3. One-dimensional blinders
    Token order is strictly left-to-right, top-to-bottom; global layout is invisible.
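The snowball in point 1 compounds multiplicatively: if each token is right with probability p, an image of n tokens comes out entirely right with probability p^n. A quick sanity check (the numbers are illustrative, not from the paper):

```python
# Per-token accuracy compounds multiplicatively over an image's token sequence.
p = 0.99   # chance each visual token is "right" (illustrative)
n = 1024   # tokens in one image (illustrative)
flawless = p ** n
print(f"{flawless:.6f}")  # roughly 0.000034: even 99% per-token accuracy rarely yields a flawless image
```

This is why per-token supervision alone is not enough, and why a whole-image reward signal helps.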

2.2 How reinforcement learning fixes them

Instead of hoping the next token is correct, X-Omni lets a reward jury grade the entire finished picture:

  • Human Preference Score v2—does it look nice?
  • Unified Reward—high-resolution overall quality.
  • Text–Image Alignment—does the picture actually match the prompt?
  • OCR Accuracy—if the prompt asked for text inside the image, can you read it?

The model then adjusts itself to maximize these scores. Think of it as an art teacher giving real-time feedback instead of a robot that only follows the textbook.
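A hedged sketch of how such a jury could be folded into one scalar reward per image (the signal names and equal weights below are illustrative; the paper does not publish the exact mixture):

```python
def jury_reward(scores, weights=None):
    """Combine per-image reward signals into one scalar in [0, 1].

    `scores` maps signal name -> value in [0, 1]; the names and equal
    default weights are assumptions for illustration only.
    """
    weights = weights or {"hps_v2": 1.0, "unified": 1.0, "alignment": 1.0, "ocr": 1.0}
    total_weight = sum(weights.values())
    return sum(weights[k] * scores.get(k, 0.0) for k in weights) / total_weight

# A draft that looks pretty but ignores the prompt's text scores poorly overall.
r = jury_reward({"hps_v2": 0.9, "unified": 0.8, "alignment": 0.5, "ocr": 0.2})
```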


3. Quick-Start Installation (Copy-Paste Ready)

3.1 Hardware & OS

  • GPU: ≥24 GB VRAM (RTX 4090, A100, or L40).
  • OS: Ubuntu 20.04+ or Windows 11 with WSL2.
  • Python: exactly 3.12 (specified in the repo).

3.2 Three-command setup

# 1. Fresh environment
conda create -n xomni python=3.12 -y
conda activate xomni

# 2. Core requirements
pip install -r requirements.txt

# 3. Fast attention kernel (skip compile)
pip install flash_attn-2.7.3+cu12torch2.6-cp312-linux_x86_64.whl --no-build-isolation

If the wheel mismatches your CUDA version, fall back to the official Flash-Attention repository and choose the matching file.


4. Generate Your First Poster

4.1 English formal letter (1 152 × 1 152 px)

PROMPT='A formal letter document with a professional tone...'  # full text in README
python generate.py \
  --model_name_or_path X-Omni/X-Omni-En \
  --flux_model_name_or_path /path/to/FLUX.1-dev \
  --prompt "$PROMPT" \
  --image-size 1152 1152 \
  --cfg-scale 1.0 \
  --min-p 0.03 \
  --seed 1234 \
  --output-path ./formal_letter.png

4.2 Chinese travel brochure

PROMPT='Generate a panoramic cover of the Forbidden City in the snow...'  # full text in README
python generate.py \
  --model_name_or_path X-Omni/X-Omni-Zh \
  --flux_model_name_or_path /path/to/FLUX.1-dev \
  --prompt "$PROMPT" \
  --image-size 1152 1152 \
  --cfg-scale 1.0 \
  --min-p 0.03 \
  --seed 1234 \
  --output-path ./forbidden_city.png

Tip: --cfg-scale 1.0 disables classifier-free guidance entirely. X-Omni still delivers crisp results, unlike older autoregressive models whose quality collapses without CFG.
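The --min-p 0.03 flag refers to min-p sampling: only tokens whose probability is at least 3% of the most likely token's probability survive, and the rest of the mass is renormalized. A minimal sketch of the filter (pure Python, not the repo's implementation):

```python
def min_p_filter(probs, min_p=0.03):
    """Zero out tokens below min_p * max(probs), then renormalize.

    `probs` is a full next-token distribution (sums to 1).
    """
    cutoff = min_p * max(probs)
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# The 0.001 tail token falls below the cutoff (0.03 * 0.6 = 0.018) and is dropped.
filtered = min_p_filter([0.6, 0.3, 0.099, 0.001])
```

Unlike a fixed top-p cutoff, the threshold scales with the model's confidence: sharp distributions prune aggressively, flat ones keep more options.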


5. Chat with an Image

python chat.py \
  --model_name_or_path X-Omni/X-Omni-En \
  --flux_model_name_or_path /path/to/FLUX.1-dev \
  --prompt "Describe the image in detail." \
  --image-path ./your_photo.jpg

Sample reply:

“A French bulldog wearing orange diving goggles swims underwater in a crystal-clear pool. Sunlight creates caustic patterns on his fur, and tiny air bubbles trail from his smiling snout…”


6. Benchmarking Long-Form Text: LongText-Bench

6.1 Why invent another benchmark?

| Metric | OneIG-Bench | LongText-Bench |
|---|---|---|
| English words | 10–30 | 30–50 |
| Chinese characters | 20–40 | 60+ |
| Scenarios | mixed | 8 concrete (signboards, slides, posters…) |
| Prompts | ~100 | 160 |

LongText-Bench is specifically tuned to test whether a model can accurately render long paragraphs inside an image.

6.2 Run the benchmark yourself

  1. Generate

    • Use prompts in text_prompts.jsonl and text_prompts_zh.jsonl.
    • Save as {id}_{1..4}.png (four samples per prompt).
  2. Evaluate

    cd textbench
    bash eval.sh
    

    Edit MODE and SAMPLE_FOLDER inside the script to point to your outputs.
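As a sketch, the file layout the evaluator expects can be derived from the prompt files like this (assuming each JSONL record carries an `id` field; check the actual schema in the repo):

```python
import json

def expected_files(jsonl_lines, samples_per_prompt=4):
    """Return the {id}_{1..4}.png names the evaluator will look for."""
    names = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        pid = json.loads(line)["id"]  # assumed field name
        names.extend(f"{pid}_{k}.png" for k in range(1, samples_per_prompt + 1))
    return names

# e.g. expected_files(open("text_prompts.jsonl", encoding="utf-8"))
```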


7. Architecture in Plain English

7.1 Components

| Block | Job |
|---|---|
| SigLIP-VQ tokenizer | Turns an image into discrete tokens from a 16 384-entry codebook, using a frozen SigLIP 2 encoder + vector quantizer. |
| Autoregressive model | A 7 B transformer (Qwen2.5-7B) that sees text tokens and image tokens as one long sentence. |
| Diffusion decoder | FLUX.1-dev takes the semantic tokens and paints the final pixels. |
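In code terms, the data flow through the three blocks is roughly the following (every function here is an illustrative stand-in, not the repo's actual API):

```python
# Illustrative stand-ins for the three stages; names and shapes are assumptions.
CODEBOOK_SIZE = 16_384

def siglip_vq_tokenize(image_patches):
    """Stage 1: frozen SigLIP 2 encoder + vector quantizer -> discrete token ids."""
    return [abs(hash(p)) % CODEBOOK_SIZE for p in image_patches]  # placeholder

def autoregressive_next_tokens(context_tokens, n):
    """Stage 2: the 7 B transformer extends text + image tokens as one sequence."""
    return [(t + 1) % CODEBOOK_SIZE for t in context_tokens[-n:]]  # placeholder

def diffusion_decode(image_tokens, width, height):
    """Stage 3: FLUX.1-dev paints pixels from the semantic tokens."""
    return {"width": width, "height": height, "tokens": image_tokens}  # placeholder
```

The point of the sketch is the interface: understanding reads tokens from stage 1, generation emits tokens for stage 3, and one transformer sits in the middle for both.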

7.2 Training stages

  1. Pre-training

    • 200 M high-quality image–text pairs → 600 B multimodal tokens.
    • 59 M image-understanding samples → 100 B tokens.
  2. Supervised fine-tuning (SFT)

    • 30 K curated samples + 30 K synthetic long-text data + understanding tasks.
    • 1.5 B tokens total.
  3. Reinforcement learning (GRPO)

    • 180 K prompts (80 K from Midjourney, 50 K text-heavy, 50 K natural images).
    • 200 steps, batch 512, 16 rollouts each → 2 h on 8×A100.
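The group-relative part of GRPO can be sketched in a few lines: each prompt's rollouts are scored by the reward jury, and every rollout's advantage is its reward relative to its own group (a simplification of the full algorithm, omitting the policy update and KL term):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt (the paper uses 16):
# above-average images get positive advantage, below-average get negative.
adv = group_relative_advantages([0.8, 0.6, 0.9, 0.7])
```

Because advantages are computed within each group, no separate value network is needed.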

8. Performance Snapshot

8.1 Text rendering (LongText-Bench)

| Model | English Score | Chinese Score |
|---|---|---|
| GPT-4o | 0.956 | 0.619 |
| X-Omni | 0.900 | 0.814 |
| Seedream 3.0 | 0.896 | 0.878 |
| BAGEL | 0.373 | 0.310 |

Higher is better. X-Omni leads open-source unified models and narrows the gap with proprietary giants.

8.2 General text-to-image (GenEval)

| Test | Single Object | Two Objects | Counting | Colors | Position | Overall |
|---|---|---|---|---|---|---|
| X-Omni | 0.98 | 0.95 | 0.75 | 0.91 | 0.71 | 0.83 |

8.3 Image understanding (OCRBench)

| Model | OCRBench |
|---|---|
| LLaVA-OneVision | 622 |
| X-Omni | 704 |

9. Key Findings from the Paper

9.1 No classifier-free guidance needed

Most autoregressive image models lean heavily on CFG to look good.
X-Omni keeps quality high even at cfg-scale = 1.0, cutting compute and proving the unified training worked.

9.2 RL beats “best-of-N”

After supervised fine-tuning, researchers generated 16 images per prompt and picked the best (BoN).
Reinforcement learning still outperforms this brute-force approach because it aligns the autoregressive tokens with the diffusion decoder instead of hoping they match.
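The BoN baseline itself is easy to state in code (with `reward_fn` standing in for the jury of Section 2.2):

```python
def best_of_n(candidates, reward_fn):
    """Brute-force baseline: score every candidate, keep the highest-scoring one."""
    return max(candidates, key=reward_fn)

# Toy example with precomputed scores as a stand-in for a real reward model.
scores = {"draft_a": 0.4, "draft_b": 0.9, "draft_c": 0.7}
winner = best_of_n(list(scores), reward_fn=scores.get)
```

BoN pays for N full generations at inference time on every prompt; RL pays once, during training, and then samples a single image.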


10. Frequently Asked Questions

Q1: Will discrete tokens always look worse than continuous diffusion?
A: No. The diffusion decoder refines the discrete semantic tokens into high-resolution pixels; RL further closes the quality gap.

Q2: Is 7 B big enough?
A: On long-text rendering, it beats 20 B hybrid models, showing that smart training can outweigh raw parameter count.

Q3: Do I need separate Chinese training?
A: The Chinese checkpoint was created by continuing RL on Chinese prompts for a few hours; no full second training run required.

Q4: How much VRAM at inference?

| Resolution | VRAM |
|---|---|
| 512×512 | ≈8 GB |
| 1152×1152 | ≈22 GB |

Q5: Can I use it commercially?
Weights are Apache-2.0-friendly, but FLUX.1-dev carries its own license—double-check before shipping products.

Q6: How long to fine-tune on my own dataset?

  • SFT: 1.5 B tokens, 8×A100 ≈ 6 h
  • RL: 180 K prompts, 8×A100 ≈ 24 h

Q7: Windows support?
WSL2 + CUDA 12.2 confirmed; native Windows build not yet tested.

Q8: REST-API wrapper?
Use the official HuggingFace Space or wrap generate.py with FastAPI in <50 lines.
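One way to sketch such a wrapper: the helper below assembles the generate.py command line from Section 4, with the FastAPI layer left as comments so the snippet stays dependency-free (paths and defaults are placeholders):

```python
def build_generate_cmd(prompt, output_path,
                       model="X-Omni/X-Omni-En",
                       flux_path="/path/to/FLUX.1-dev",
                       size=(1152, 1152)):
    """Assemble the CLI call shown in Section 4 (flags copied from there)."""
    return [
        "python", "generate.py",
        "--model_name_or_path", model,
        "--flux_model_name_or_path", flux_path,
        "--prompt", prompt,
        "--image-size", str(size[0]), str(size[1]),
        "--cfg-scale", "1.0",
        "--min-p", "0.03",
        "--output-path", output_path,
    ]

# FastAPI sketch (assumes `pip install fastapi uvicorn`):
#
#   import subprocess
#   from fastapi import FastAPI
#   app = FastAPI()
#
#   @app.post("/generate")
#   def generate(prompt: str):
#       subprocess.run(build_generate_cmd(prompt, "out.png"), check=True)
#       return {"path": "out.png"}
```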


11. Hands-On Checklist

| Task | Command |
|---|---|
| Install | See section 3 |
| Generate English poster | python generate.py with X-Omni-En |
| Generate Chinese poster | python generate.py with X-Omni-Zh |
| Chat about an image | python chat.py |
| Run LongText-Bench | cd textbench && bash eval.sh |

12. Citation

If you use X-Omni in your work, please cite:

@article{geng2025xomni,
  author = {Zigang Geng and Yibing Wang and Yeyao Ma and Chen Li and Yongming Rao and Shuyang Gu and Zhao Zhong and Qinglin Lu and Han Hu and Xiaosong Zhang and Linus and Di Wang and Jie Jiang},
  title = {X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again},
  journal = {CoRR},
  volume = {abs/2507.22058},
  year = {2025}
}