MixGRPO: Train Text-to-Image Models 71 % Faster—Without Sacrificing Quality
Plain-English summary
MixGRPO replaces the heavy, full-sequence optimization used in recent human-preference pipelines with a small moving window of only four denoising steps. The trick is to mix deterministic ODE sampling (fast) with stochastic SDE sampling (exploratory) and to let the window slide from noisy to clean timesteps. The result: roughly half the training time of DanceGRPO, under a third of it with the Flash variant, and noticeably better pictures.
Why Training “Human-Aligned” Image Models Is Painfully Slow
Recent breakthroughs show that diffusion or flow-matching models produce far more pleasing images if you add a Reinforcement-Learning-from-Human-Feedback (RLHF) stage after the base pre-training.
The downside? RLHF is expensive.
Method | Sampling Style | Steps Optimized | GPU Work per Image |
---|---|---|---|
Flow-GRPO | Stochastic (SDE) every step | all 25 | ~50 forward passes |
DanceGRPO | Stochastic (SDE) random subset | 14 | ~39 forward passes |
MixGRPO (this work) | ODE + SDE mixed | only 4 | ~29 forward passes |
Fewer forward passes = lower cloud bill and shorter experiments.
MixGRPO in One Minute
- Hybrid Sampling
  - Inside a 4-step window: keep the random SDE so the model can explore.
  - Outside the window: switch to deterministic ODE, cutting compute to almost zero.
- Sliding Window Scheduler
  - The 4-step window starts at the noisiest timestep and slowly moves to clean timesteps.
  - Mirrors the RL idea of "discounting" future rewards: early decisions matter more.
- Flash Variant
  - Use a second-order ODE solver (DPM-Solver++) for the ODE part.
  - Training-time sampling drops from 25 steps to 8–12 steps with no visual downgrade.
What Actually Changes for Practitioners?
1. Training Cost
Model | Iteration Time (A100 32-GPU) | ImageReward ↑ |
---|---|---|
DanceGRPO | 291 s | 1.436 |
MixGRPO | 151 s | 1.629 |
MixGRPO-Flash | 83 s | 1.624 |
Measured on the HPDv2 training split (≈10 k prompts). Lower is better for time; higher is better for ImageReward.
2. Visual Quality
Side-by-side prompts (see Figures 3 & 6 in the paper) show MixGRPO delivers:
- sharper text rendering
- fewer anatomical errors
- better color harmony
Even at 8 sampling steps (the Flash* variant), humans still rate MixGRPO above DanceGRPO at 25 steps.
Deep Dive: How MixGRPO Works
Hybrid ODE–SDE Sampling
Flow-matching models treat image generation as solving an ordinary differential equation (ODE):
dx / dt = v_θ(x_t, t)
To add exploration for RL, prior methods convert the ODE into a stochastic differential equation (SDE) and optimize every step.
MixGRPO does the opposite: only the steps inside a small interval S = [t₁, t₂) keep the SDE; the rest use the cheap ODE.
Mathematically:
if t inside S:
    x_{t+Δt} = x_t + [v_θ + σ²/(2t) · (x_t + (1-t) v_θ)] Δt + σ √Δt ε
else:
    x_{t+Δt} = x_t + v_θ Δt
- |S| = 4 in the best configuration.
- σ is the noise schedule; ε ~ N(0, I).
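To make the update rule concrete, here is a minimal PyTorch-style sketch of one sampling pass. The names hybrid_sample, velocity_model, and sigma are placeholders chosen for illustration, not the repository's actual API.

import torch

def hybrid_sample(velocity_model, x, timesteps, sde_window, sigma):
    """Mix SDE steps (inside the window, for exploration) with plain
    Euler/ODE steps (outside the window, for speed).

    velocity_model(x, t) -> v_theta(x_t, t)     # placeholder interface
    timesteps: 1-D tensor of t values (e.g. 25 of them)
    sde_window: set of step indices currently using the SDE update
    sigma(t): noise scale at time t              # placeholder schedule
    """
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        dt = t_next - t
        v = velocity_model(x, t)
        if i in sde_window:
            # SDE step: drift correction plus injected Gaussian noise (exploration)
            s = sigma(t)
            drift = v + s ** 2 / (2 * t) * (x + (1 - t) * v)
            x = x + drift * dt + s * dt.abs().sqrt() * torch.randn_like(x)
        else:
            # ODE step: deterministic Euler update; no old-policy bookkeeping needed
            x = x + v * dt
    return x

Only the steps whose indices land in sde_window contribute probability ratios to the GRPO loss; everything else is cheap deterministic integration.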
Sliding Window
Define a window
W(l) = {t_l, t_{l+1}, …, t_{l+3}}
At iteration m, optimize only the steps in W(l).
Then slide:
l ← min(l + stride, T - 4)
Ablation results (Table 3) show that a constant stride of 1 and a window size of 4 give the best trade-off.
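The scheduler itself is only a few lines. In the sketch below, shift_interval (how many training iterations pass between window moves) and num_iterations are assumed values for illustration, not hyper-parameters quoted above.

def window_indices(l: int, window_size: int = 4) -> list[int]:
    """Timestep indices currently optimized with the stochastic (SDE) update."""
    return list(range(l, l + window_size))

def slide(l: int, stride: int = 1, T: int = 25, window_size: int = 4) -> int:
    """Advance the window toward cleaner timesteps, clamped to stay inside [0, T)."""
    return min(l + stride, T - window_size)

# Skeleton of how the window could drive training.
l, shift_interval, num_iterations = 0, 25, 300   # shift_interval is hypothetical
for m in range(num_iterations):
    sde_window = set(window_indices(l))          # e.g. {0, 1, 2, 3} at the start
    # ... sample with the hybrid sampler sketched above, passing sde_window,
    # ... then apply the GRPO update only to the steps inside it.
    if (m + 1) % shift_interval == 0:
        l = slide(l)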
Flash Acceleration
Because the ODE steps outside the window do not need the old policy for gradient ratios, we can replace them with a second-order ODE solver that skips timesteps.
Speed-up formula:
speed-up = T / (w + (T - w) × compression_rate)
- w = 4
- compression_rate = 0.5 for the second-order solver
- T = 25 → ≈ 2.5× faster training-time sampling
Installation & Reproduction Guide
The authors released full code and weights. The commands below have been verified on CentOS 7 + Python 3.12 + CUDA 12.1.
1. Environment
# create env
conda create -n mixgrpo python=3.12
conda activate mixgrpo
# system deps
sudo yum install -y pdsh pssh mesa-libGL
git clone https://github.com/Tencent-Hunyuan/MixGRPO.git
cd MixGRPO
bash env_setup.sh # installs torch, diffusers, accelerate, etc.
2. Download Models
Asset | Local Path | Command |
---|---|---|
FLUX.1-dev | ./data/flux | huggingface-cli download black-forest-labs/FLUX.1-dev --local-dir ./data/flux |
HPS-v2.1 checkpoint | ./hps_ckpt | huggingface-cli download xswu/HPSv2 HPS_v2.1_compressed.pt --local-dir ./hps_ckpt |
ImageReward | ./image_reward_ckpt | huggingface-cli download THUDM/ImageReward ImageReward.pt --local-dir ./image_reward_ckpt |
MixGRPO fine-tuned weight | ./mix_grpo_ckpt | huggingface-cli download tulvgengenr/MixGRPO diffusion_pytorch_model.safetensors --local-dir ./mix_grpo_ckpt |
3. Pre-process Prompts
The training prompts come from the HPDv2 dataset.
bash scripts/preprocess/preprocess_flux_rl_embeddings.sh
# will write embeddings to ./data/embeddings/
4. Multi-Node Training (32 GPUs)
Edit data/hosts/hostfile:
192.168.0.1 slots=8
192.168.0.2 slots=8
192.168.0.3 slots=8
192.168.0.4 slots=8
Then on each node:
# sets INDEX_CUSTOM=0,1,2,3 respectively
bash scripts/preprocess/set_env_multinode.sh
Finally launch:
# put your WandB key in finetune_flux_grpo_FastGRPO.sh
bash scripts/finetune/finetune_flux_grpo_FastGRPO.sh
Default hyper-parameters:
Parameter | Value |
---|---|
total steps T | 25 |
window size w | 4 |
stride s | 1 |
batch per GPU | 1 |
gradient accumulation | 3 |
learning rate | 1e-5 (AdamW) |
precision | bf16 |
5. Single-GPU Inference
bash scripts/inference/inference_flux.sh
# outputs saved to ./output/
6. Evaluation
# edit prompt_file path
bash scripts/evaluate/eval_reward.sh
FAQ: Quick Answers to Common Questions
Q1. Will shrinking the optimized steps to 4 hurt quality?
No. Early denoising steps dominate global structure; later steps refine detail. MixGRPO keeps the crucial 4 steps under RL control and lets the ODE handle the rest.
Q2. Can I plug in my own reward model?
Yes. Add your model class in reward_models/ and return a scalar score per image. The training script auto-detects it.
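The exact interface the training script expects is defined by the classes already in reward_models/; the snippet below is only a hypothetical illustration of the idea (map a batch of images and prompts to one scalar score per image), with made-up names such as MyReward.

import torch
import torch.nn as nn

class MyReward(nn.Module):
    """Hypothetical custom reward model: one scalar score per image.
    Constructor and method signatures are illustrative, not the repo's API."""

    def __init__(self, backbone: nn.Module, feature_dim: int = 768):
        super().__init__()
        self.backbone = backbone              # e.g. a frozen image encoder
        self.head = nn.Linear(feature_dim, 1)

    @torch.no_grad()
    def score(self, images: torch.Tensor, prompts: list[str]) -> torch.Tensor:
        # prompts are accepted for interface parity; this toy head ignores them.
        feats = self.backbone(images)         # (batch, feature_dim) features
        return self.head(feats).squeeze(-1)   # (batch,) reward scores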
Q3. What if I only have 8 GPUs?
Reduce the window size to 2 or increase gradient accumulation. One run on 8×A100 finished in 12 hours for 10 k prompts.
Q4. Does MixGRPO work with LoRA?
Absolutely. LoRA only changes the parameterization; the sampling and optimization logic remain identical.
Q5. How do you prevent reward hacking?
At inference time, we blend 80 % MixGRPO steps with 20 % original FLUX steps (see Appendix A). This simple trick removes most artifacts.
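One hypothetical reading of that 80/20 blend, sketched below, is to run the MixGRPO-tuned model for the first 80% of denoising steps and fall back to the original FLUX model for the rest; the names and the exact recipe here are assumptions, and the authoritative description is Appendix A of the paper.

def blended_denoise(x, timesteps, tuned_model, base_model, tuned_fraction=0.8):
    """Hypothetical sketch: tuned velocity model for the early steps,
    the frozen base model for the final ones."""
    n_tuned = int(tuned_fraction * (len(timesteps) - 1))
    for i in range(len(timesteps) - 1):
        model = tuned_model if i < n_tuned else base_model
        v = model(x, timesteps[i])                      # v_theta(x_t, t)
        x = x + v * (timesteps[i + 1] - timesteps[i])   # deterministic ODE step
    return x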
Q6. License details?
Tencent Hunyuan custom license. Academic and non-commercial use are free; commercial usage requires a separate agreement.
Extending MixGRPO: Ideas for Your Next Project
Extension | How to Start |
---|---|
Video generation | Replace FLUX with a video diffusion backbone; keep the sliding window over frame timesteps |
3D assets | Use the same hybrid sampling for NeRF or 3D diffusion models |
Compressed ODE | Replace second-order solver with third-order or distillation to push steps below 8 |
Multi-reward fusion | Weight different rewards dynamically (e.g., aesthetic + safety + OCR) |
Citation
If you use MixGRPO in your research or product, please cite:
@misc{li2025mixgrpounlockingflowbasedgrpo,
title={MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE},
author={Junzhe Li and Yutao Cui and Tao Huang and Yinping Ma and Chun Fan and Miles Yang and Zhao Zhong},
year={2025},
eprint={2507.21802},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2507.21802}
}
Happy faster training!