MixGRPO: Train Text-to-Image Models 71 % Faster—Without Sacrificing Quality
Plain-English summary
MixGRPO replaces the heavy, full-sequence optimization used in recent human-preference pipelines with a small moving window of only four denoising steps. The trick is to mix deterministic ODE sampling (fast) with stochastic SDE sampling (exploratory) and to let the window slide from noisy to clean timesteps. The result: roughly half the training time of DanceGRPO, under a third of it with the Flash variant, and noticeably better pictures.
Why Training “Human-Aligned” Image Models Is Painfully Slow
Recent breakthroughs show that diffusion or flow-matching models produce far more pleasing images if you add a Reinforcement-Learning-from-Human-Feedback (RLHF) stage after the base pre-training.
The downside? RLHF is expensive.
Method | Sampling Style | Steps Optimized | GPU Work per Image |
---|---|---|---|
Flow-GRPO | Stochastic (SDE) every step | all 25 | ~50 forward passes |
DanceGRPO | Stochastic (SDE) random subset | 14 | ~39 forward passes |
MixGRPO (this work) | ODE + SDE mixed | only 4 | ~29 forward passes |
Fewer forward passes = lower cloud bill and shorter experiments.
MixGRPO in One Minute
- Hybrid Sampling
  - Inside a 4-step window: keep the random SDE so the model can explore.
  - Outside the window: switch to deterministic ODE, cutting compute to almost zero.
- Sliding Window Scheduler
  - The 4-step window starts at the noisiest timestep and slowly moves to clean timesteps.
  - Mirrors the RL idea of "discounting" future rewards: early decisions matter more.
- Flash Variant
  - Use a second-order ODE solver (DPM-Solver++) for the ODE part.
  - Training-time sampling drops from 25 steps to 8–12 steps with no visual downgrade.
What Actually Changes for Practitioners?
1. Training Cost
Model | Iteration Time (A100 32-GPU) | ImageReward ↑ |
---|---|---|
DanceGRPO | 291 s | 1.436 |
MixGRPO | 151 s | 1.629 |
MixGRPO-Flash | 83 s | 1.624 |
Measured on the HPDv2 training split (≈10 k prompts). Lower is better for time; higher is better for ImageReward.
2. Visual Quality
Side-by-side prompts (see Figures 3 & 6 in the paper) show MixGRPO delivers:
- sharper text rendering
- fewer anatomical errors
- better color harmony
Even at 8 sampling steps (the Flash* variant), humans still rate MixGRPO above DanceGRPO at 25 steps.
Deep Dive: How MixGRPO Works
Hybrid ODE–SDE Sampling
Flow-matching models treat image generation as solving an ordinary differential equation (ODE):
dx / dt = v_θ(x_t, t)
To add exploration for RL, prior methods convert the ODE into a stochastic differential equation (SDE) and optimize every step.
MixGRPO does the opposite: only the steps inside a small interval S = [t₁, t₂) keep the SDE; the rest use the cheap ODE.
Mathematically:
if t inside S:
    x_{t+Δt} = x_t + [v_θ + σ²/(2t) · (x_t + (1-t) v_θ)] Δt + σ √Δt ε
else:
    x_{t+Δt} = x_t + v_θ Δt
- |S| = 4 in the best configuration.
- σ is the noise schedule; ε ~ N(0, I).
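To make the update rule concrete, here is a minimal PyTorch-style sketch of one sampling pass. The names hybrid_sample, velocity_model, and sigma are placeholders chosen for illustration, not the repository's actual API.

import torch

def hybrid_sample(velocity_model, x, timesteps, sde_window, sigma):
    """Mix SDE steps (inside the window, for exploration) with plain
    Euler/ODE steps (outside the window, for speed).

    velocity_model(x, t) -> v_theta(x_t, t)     # placeholder interface
    timesteps: 1-D tensor of t values (e.g. 25 of them)
    sde_window: set of step indices currently using the SDE update
    sigma(t): noise scale at time t              # placeholder schedule
    """
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        dt = t_next - t
        v = velocity_model(x, t)
        if i in sde_window:
            # SDE step: drift correction plus injected Gaussian noise (exploration)
            s = sigma(t)
            drift = v + s ** 2 / (2 * t) * (x + (1 - t) * v)
            x = x + drift * dt + s * dt.abs().sqrt() * torch.randn_like(x)
        else:
            # ODE step: deterministic Euler update; no old-policy bookkeeping needed
            x = x + v * dt
    return x

Only the steps whose indices land in sde_window contribute probability ratios to the GRPO loss; everything else is cheap deterministic integration.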
Sliding Window
Define a window
W(l) = {t_l, t_{l+1}, …, t_{l+3}}
At iteration m, optimize only the steps in W(l).
Then slide:
l ← min(l + stride, T - 4)
Ablation results (Table 3) show that a constant stride of 1 and a window size of 4 give the best trade-off.
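The scheduler itself is only a few lines. In the sketch below, shift_interval (how many training iterations pass between window moves) and num_iterations are assumed values for illustration, not hyper-parameters quoted above.

def window_indices(l: int, window_size: int = 4) -> list[int]:
    """Timestep indices currently optimized with the stochastic (SDE) update."""
    return list(range(l, l + window_size))

def slide(l: int, stride: int = 1, T: int = 25, window_size: int = 4) -> int:
    """Advance the window toward cleaner timesteps, clamped to stay inside [0, T)."""
    return min(l + stride, T - window_size)

# Skeleton of how the window could drive training.
l, shift_interval, num_iterations = 0, 25, 300   # shift_interval is hypothetical
for m in range(num_iterations):
    sde_window = set(window_indices(l))          # e.g. {0, 1, 2, 3} at the start
    # ... sample with the hybrid sampler sketched above, passing sde_window,
    # ... then apply the GRPO update only to the steps inside it.
    if (m + 1) % shift_interval == 0:
        l = slide(l)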
Flash Acceleration
Because the ODE steps outside the window do not need the old policy for gradient ratios, we can replace them with a second-order ODE solver that skips timesteps.
Speed-up formula:
speed-up = T / (w + (T - w) × compression_rate)
- w = 4
- compression_rate = 0.5 for the second-order solver
- T = 25 → ≈ 2.5× faster training-time sampling
Installation & Reproduction Guide
The authors released full code and weights. The commands below have been verified on CentOS 7 + Python 3.12 + CUDA 12.1.
1. Environment
# create env
conda create -n mixgrpo python=3.12
conda activate mixgrpo
# system deps
sudo yum install -y pdsh pssh mesa-libGL
git clone https://github.com/Tencent-Hunyuan/MixGRPO.git
cd MixGRPO
bash env_setup.sh # installs torch, diffusers, accelerate, etc.
2. Download Models
Asset | Local Path | Command |
---|---|---|
FLUX.1-dev | ./data/flux | huggingface-cli download black-forest-labs/FLUX.1-dev --local-dir ./data/flux |
HPS-v2.1 checkpoint | ./hps_ckpt | huggingface-cli download xswu/HPSv2 HPS_v2.1_compressed.pt --local-dir ./hps_ckpt |
ImageReward | ./image_reward_ckpt | huggingface-cli download THUDM/ImageReward ImageReward.pt --local-dir ./image_reward_ckpt |
MixGRPO fine-tuned weight | ./mix_grpo_ckpt | huggingface-cli download tulvgengenr/MixGRPO diffusion_pytorch_model.safetensors --local-dir ./mix_grpo_ckpt |
3. Pre-process Prompts
The training prompts come from the HPDv2 dataset.
bash scripts/preprocess/preprocess_flux_rl_embeddings.sh
# will write embeddings to ./data/embeddings/
4. Multi-Node Training (32 GPUs)
Edit data/hosts/hostfile:
192.168.0.1 slots=8
192.168.0.2 slots=8
192.168.0.3 slots=8
192.168.0.4 slots=8
Then on each node:
# sets INDEX_CUSTOM=0,1,2,3 respectively
bash scripts/preprocess/set_env_multinode.sh
Finally launch:
# put your WandB key in finetune_flux_grpo_FastGRPO.sh
bash scripts/finetune/finetune_flux_grpo_FastGRPO.sh
Default hyper-parameters:
Parameter | Value |
---|---|
total steps T | 25 |
window size w | 4 |
stride s | 1 |
batch per GPU | 1 |
gradient accumulation | 3 |
learning rate | 1e-5 (AdamW) |
precision | bf16 |
5. Single-GPU Inference
bash scripts/inference/inference_flux.sh
# outputs saved to ./output/
6. Evaluation
# edit prompt_file path
bash scripts/evaluate/eval_reward.sh
FAQ: Quick Answers to Common Questions
Q1. Will shrinking the optimized steps to 4 hurt quality?
No. Early denoising steps dominate global structure; later steps refine detail. MixGRPO keeps the crucial 4 steps under RL control and lets the ODE handle the rest.
Q2. Can I plug in my own reward model?
Yes. Add your model class in reward_models/ and return a scalar score per image. The training script auto-detects it.
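The exact interface the training script expects is defined by the classes already in reward_models/; the snippet below is only a hypothetical illustration of the idea (map a batch of images and prompts to one scalar score per image), with made-up names such as MyReward.

import torch
import torch.nn as nn

class MyReward(nn.Module):
    """Hypothetical custom reward model: one scalar score per image.
    Constructor and method signatures are illustrative, not the repo's API."""

    def __init__(self, backbone: nn.Module, feature_dim: int = 768):
        super().__init__()
        self.backbone = backbone              # e.g. a frozen image encoder
        self.head = nn.Linear(feature_dim, 1)

    @torch.no_grad()
    def score(self, images: torch.Tensor, prompts: list[str]) -> torch.Tensor:
        # prompts are accepted for interface parity; this toy head ignores them.
        feats = self.backbone(images)         # (batch, feature_dim) features
        return self.head(feats).squeeze(-1)   # (batch,) reward scores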
Q3. What if I only have 8 GPUs?
Reduce the window size to 2 or increase gradient accumulation. One run on 8×A100 finished in 12 hours for 10 k prompts.
Q4. Does MixGRPO work with LoRA?
Absolutely. LoRA only changes the parameterization; the sampling and optimization logic remain identical.
Q5. How do you prevent reward hacking?
At inference time, we blend 80 % MixGRPO steps with 20 % original FLUX steps (see Appendix A). This simple trick removes most artifacts.
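One hypothetical reading of that 80/20 blend, sketched below, is to run the MixGRPO-tuned model for the first 80% of denoising steps and fall back to the original FLUX model for the rest; the names and the exact recipe here are assumptions, and the authoritative description is Appendix A of the paper.

def blended_denoise(x, timesteps, tuned_model, base_model, tuned_fraction=0.8):
    """Hypothetical sketch: tuned velocity model for the early steps,
    the frozen base model for the final ones."""
    n_tuned = int(tuned_fraction * (len(timesteps) - 1))
    for i in range(len(timesteps) - 1):
        model = tuned_model if i < n_tuned else base_model
        v = model(x, timesteps[i])                      # v_theta(x_t, t)
        x = x + v * (timesteps[i + 1] - timesteps[i])   # deterministic ODE step
    return x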
Q6. License details?
Tencent Hunyuan custom license. Academic and non-commercial use are free; commercial usage requires a separate agreement.
Extending MixGRPO: Ideas for Your Next Project
Extension | How to Start |
---|---|
Video generation | Replace FLUX with a video diffusion backbone; keep the sliding window over frame timesteps |
3D assets | Use the same hybrid sampling for NeRF or 3D diffusion models |
Compressed ODE | Replace second-order solver with third-order or distillation to push steps below 8 |
Multi-reward fusion | Weight different rewards dynamically (e.g., aesthetic + safety + OCR) |
Citation
If you use MixGRPO in your research or product, please cite:
@misc{li2025mixgrpounlockingflowbasedgrpo,
title={MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE},
author={Junzhe Li and Yutao Cui and Tao Huang and Yinping Ma and Chun Fan and Miles Yang and Zhao Zhong},
year={2025},
eprint={2507.21802},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2507.21802}
}
Happy faster training!