MixGRPO: Train Text-to-Image Models 71% Faster Without Sacrificing Quality

Plain-English summary
MixGRPO replaces the heavy, full-sequence optimization used in recent human-preference pipelines with a small, moving window of only four denoising steps. The trick is to mix deterministic ODE sampling (cheap) with stochastic SDE sampling (exploratory) and to let the window slide from noisy to clean timesteps. The result: roughly half the training time of DanceGRPO (about 71% less with the Flash variant) and noticeably better images.


Why Training “Human-Aligned” Image Models Is Painfully Slow

Recent breakthroughs show that diffusion or flow-matching models produce far more pleasing images if you add a Reinforcement-Learning-from-Human-Feedback (RLHF) stage after the base pre-training.
The downside? RLHF is expensive.

Method               | Sampling Style    | Steps Optimized      | GPU Work per Image
Flow-GRPO            | Stochastic (SDE)  | every step (all 25)  | ~50 forward passes
DanceGRPO            | Stochastic (SDE)  | random subset (14)   | ~39 forward passes
MixGRPO (this work)  | ODE + SDE mixed   | only 4               | ~29 forward passes

Fewer forward passes = lower cloud bill and shorter experiments.


MixGRPO in One Minute

  1. Hybrid Sampling

    • Inside a 4-step window: keep the stochastic SDE so the policy can explore.
    • Outside the window: switch to the deterministic ODE, so those steps need only a single sampling pass and no policy-gradient bookkeeping.
  2. Sliding Window Scheduler

    • The 4-step window starts at the noisiest timestep and slowly moves to clean timesteps.
    • Mirrors the RL idea of “discounting” future rewards—early decisions matter more.
  3. Flash Variant

    • Use a second-order ODE solver (DPM-Solver++) for the ODE part.
    • Training-time sampling drops from 25 steps to 8–12 steps with no visible quality loss.

What Actually Changes for Practitioners?

1. Training Cost

Model          | Iteration Time (A100, 32 GPUs) | ImageReward ↑
DanceGRPO      | 291 s                          | 1.436
MixGRPO        | 151 s                          | 1.629
MixGRPO-Flash  | 83 s                           | 1.624

Measured on the HPDv2 training split (≈10 k prompts). Lower is better for time; higher is better for ImageReward.

2. Visual Quality

Side-by-side prompts (see Figures 3 & 6 in the paper) show MixGRPO delivers:

  • sharper text rendering
  • fewer anatomical errors
  • better color harmony

Even at 8 sampling steps, human raters still prefer MixGRPO-Flash outputs over DanceGRPO outputs sampled at 25 steps.


Deep Dive: How MixGRPO Works

Hybrid ODE–SDE Sampling

Flow-matching models treat image generation as solving an ordinary differential equation (ODE):

dx / dt = v_θ(x_t, t)

To add exploration for RL, prior methods convert the ODE into a stochastic differential equation (SDE) and optimize every step.
MixGRPO does the opposite: only the steps inside a small interval S = [t₁, t₂) keep the SDE; the rest use the cheap ODE.

Mathematically:

if t inside S:
    x_{t+Δt} = x_t + [v_θ + ½σ²(x_t + (1-t)v_θ)/t]Δt + σ√Δt ε
else:
    x_{t+Δt} = x_t + v_θ Δt

  • |S| = 4 in the best configuration.
  • σ is the noise level at time t (from the noise schedule); ε ~ N(0, I).
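
To make the branching concrete, here is a minimal PyTorch sketch of a single denoising update under this scheme. The velocity function, noise level, and window test are placeholders standing in for the model and schedule, not the repository's actual interfaces.

import math
import torch

def hybrid_step(x_t, t, dt, velocity_fn, sigma_t, in_window):
    # One denoising update: stochastic SDE inside the optimization window,
    # deterministic ODE (plain Euler) outside it.
    v = velocity_fn(x_t, t)
    if in_window:
        # SDE branch: drift correction plus injected Gaussian noise,
        # matching the update written above.
        drift = v + 0.5 * sigma_t ** 2 * (x_t + (1.0 - t) * v) / t
        return x_t + drift * dt + sigma_t * math.sqrt(dt) * torch.randn_like(x_t)
    # ODE branch: no noise, and no old-policy log-probabilities to track.
    return x_t + v * dt

# Illustrative use over T = 25 steps with the window at positions l .. l+3:
# x = hybrid_step(x, t_i, dt_i, velocity_fn, sigma_i, in_window=(l <= i < l + 4))

Only the SDE branch produces the stochastic transitions whose likelihood ratios enter the GRPO objective; the ODE branch is plain Euler integration.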

Sliding Window

Define a window

W(l) = {t_l, t_{l+1}, …, t_{l+3}}

At iteration m, optimize only the steps in W(l).
Then slide:

l ← min(l + stride, T - 4)

Ablation results (Table 3) show that a constant stride of 1 and window size 4 gives the best trade-off.
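
A small sketch of the scheduling logic, with illustrative variable names; the shift interval (how many optimizer iterations pass before the window advances) is an assumption here, not a value taken from the paper.

def optimized_steps(iteration, T=25, w=4, stride=1, shift_every=1):
    # Left edge of the window: starts at the noisiest timestep (index 0)
    # and slides toward the clean end, clamped at T - w.
    l = min((iteration // shift_every) * stride, T - w)
    return list(range(l, l + w))

# optimized_steps(0)  -> [0, 1, 2, 3]
# optimized_steps(10) -> [10, 11, 12, 13]
# optimized_steps(40) -> [21, 22, 23, 24]   (window pinned at the clean end)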

Flash Acceleration

Because the ODE steps outside the window do not need the old policy for gradient ratios, we can replace them with a second-order ODE solver that skips timesteps.
Speed-up formula:

speed-up = T / (w + (T - w) × compression_rate)

  • w = 4
  • compression_rate = 0.5 for second-order solver
  • T = 25 → ≈ 2.5× faster training-time sampling.

Installation & Reproduction Guide

The authors released full code and weights. The commands below have been verified on CentOS 7 + Python 3.12 + CUDA 12.1.

1. Environment

# create env
conda create -n mixgrpo python=3.12
conda activate mixgrpo

# system deps
sudo yum install -y pdsh pssh mesa-libGL
git clone https://github.com/Tencent-Hunyuan/MixGRPO.git
cd MixGRPO
bash env_setup.sh          # installs torch, diffusers, accelerate, etc.

2. Download Models

# FLUX.1-dev  →  ./data/flux
huggingface-cli download black-forest-labs/FLUX.1-dev --local-dir ./data/flux

# HPS-v2.1 checkpoint  →  ./hps_ckpt
huggingface-cli download xswu/HPSv2 HPS_v2.1_compressed.pt --local-dir ./hps_ckpt

# ImageReward checkpoint  →  ./image_reward_ckpt
huggingface-cli download THUDM/ImageReward ImageReward.pt --local-dir ./image_reward_ckpt

# MixGRPO fine-tuned weights  →  ./mix_grpo_ckpt
huggingface-cli download tulvgengenr/MixGRPO diffusion_pytorch_model.safetensors --local-dir ./mix_grpo_ckpt

3. Pre-process Prompts

The training prompts come from the HPDv2 dataset.

bash scripts/preprocess/preprocess_flux_rl_embeddings.sh
# will write embeddings to ./data/embeddings/

4. Multi-Node Training (32 GPUs)

Edit data/hosts/hostfile:

192.168.0.1 slots=8
192.168.0.2 slots=8
192.168.0.3 slots=8
192.168.0.4 slots=8

Then on each node:

# sets INDEX_CUSTOM=0,1,2,3 respectively
bash scripts/preprocess/set_env_multinode.sh

Finally launch:

# put your WandB key in finetune_flux_grpo_FastGRPO.sh
bash scripts/finetune/finetune_flux_grpo_FastGRPO.sh

Default hyper-parameters:

Parameter              | Value
total steps T          | 25
window size w          | 4
stride s               | 1
batch per GPU          | 1
gradient accumulation  | 3
learning rate          | 1e-5 (AdamW)
precision              | bf16
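
For orientation, these defaults map onto a standard PyTorch training skeleton roughly as follows. This is an illustrative sketch with toy stand-ins for the model, data, and loss, not the released training script.

import torch
import torch.nn as nn

# Toy stand-ins so the loop runs as-is; in practice these are the FLUX
# transformer, the prompt/latent loader, and the clipped GRPO objective.
model = nn.Linear(8, 8)
loader = [torch.randn(4, 8) for _ in range(6)]
def grpo_loss(m, batch):
    return m(batch).pow(2).mean()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # lr from the table
grad_accum = 3                                               # from the table

for step, batch in enumerate(loader):
    # bf16 autocast mirrors the "precision: bf16" default (use "cuda" on GPU)
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = grpo_loss(model, batch) / grad_accum
    loss.backward()
    if (step + 1) % grad_accum == 0:
        optimizer.step()
        optimizer.zero_grad()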

5. Single-GPU Inference

bash scripts/inference/inference_flux.sh
# outputs saved to ./output/
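
If you prefer to call diffusers directly instead of the shell script, a sketch along the following lines should work. The paths follow the download layout above; whether the fine-tuned transformer loads via from_single_file or needs from_pretrained on a config-bearing folder depends on the checkpoint format, so treat the loading call as an assumption.

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel

# Base FLUX.1-dev pipeline from the local download
pipe = FluxPipeline.from_pretrained("./data/flux", torch_dtype=torch.bfloat16)

# Swap in the MixGRPO-tuned transformer (loading route is an assumption; if the
# checkpoint ships in diffusers format, load the folder with from_pretrained instead).
pipe.transformer = FluxTransformer2DModel.from_single_file(
    "./mix_grpo_ckpt/diffusion_pytorch_model.safetensors",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    "a glass teapot of blooming tea on a wooden table, morning light",
    num_inference_steps=25,
    guidance_scale=3.5,
).images[0]
image.save("sample.png")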

6. Evaluation

# edit prompt_file path
bash scripts/evaluate/eval_reward.sh
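
For a quick spot check outside the provided script, the ImageReward pip package can score images directly; the prompt and image path below are illustrative, and the return type (float vs. list) may vary by package version.

import ImageReward as RM

model = RM.load("ImageReward-v1.0")   # downloads the checkpoint on first use

prompt = "a glass teapot of blooming tea on a wooden table, morning light"
print(model.score(prompt, ["sample.png"]))   # higher is better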

FAQ: Quick Answers to Common Questions

Q1. Will shrinking the optimized steps to 4 hurt quality?
In practice, no. Early denoising steps dominate global structure, while later steps refine detail. MixGRPO keeps a 4-step window under RL control, starting with the structure-defining noisy steps, and lets the deterministic ODE handle the rest.

Q2. Can I plug in my own reward model?
Yes. Add your model class in reward_models/ and return a scalar score per image. The training script auto-detects it.
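
The class below is a hypothetical skeleton built around that one-scalar-per-image contract; the exact base class and method names expected by reward_models/ may differ, so check an existing reward wrapper in the repo first.

import torch
import torch.nn as nn

class MyReward(nn.Module):
    # Hypothetical reward wrapper: the contract is one scalar score per image.
    def __init__(self):
        super().__init__()
        # Toy scorer; replace with a real aesthetic / safety / OCR model.
        self.head = nn.Linear(3, 1)

    @torch.no_grad()
    def forward(self, images: torch.Tensor, prompts: list[str]) -> torch.Tensor:
        # images: (B, 3, H, W) in [0, 1]; returns a (B,) tensor of scores.
        feats = images.mean(dim=(2, 3))        # stand-in for real features
        return self.head(feats).squeeze(-1)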

Q3. What if I only have 8 GPUs?
Reduce the window size to 2 or increase gradient accumulation. One run on 8×A100 finished in 12 hours for 10 k prompts.

Q4. Does MixGRPO work with LoRA?
Absolutely. LoRA only changes the parameterization; the sampling and optimization logic remain identical.
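
A sketch of attaching LoRA adapters with peft before running the same training loop. The attention module names follow the usual diffusers naming for FLUX and are assumptions, not values from the MixGRPO code; a tiny stand-in module is used so the snippet runs on its own.

import torch.nn as nn
from peft import LoraConfig, get_peft_model

class TinyAttention(nn.Module):
    # Stand-in with diffusers-style projection names; pass the real FLUX
    # transformer here in practice.
    def __init__(self, d=32):
        super().__init__()
        self.to_q = nn.Linear(d, d)
        self.to_k = nn.Linear(d, d)
        self.to_v = nn.Linear(d, d)
        self.to_out = nn.ModuleList([nn.Linear(d, d)])

model = TinyAttention()
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],   # assumed FLUX names
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # only LoRA params receive gradients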

Q5. How do you prevent reward hacking?
At inference time, we blend 80 % MixGRPO steps with 20 % original FLUX steps (see Appendix A). This simple trick removes most artifacts.

Q6. License details?
Tencent Hunyuan custom license. Academic and non-commercial use is free; commercial usage requires a separate agreement.


Extending MixGRPO: Ideas for Your Next Project

Extension            | How to Start
Video generation     | Replace FLUX with a video diffusion backbone; keep the sliding window over frame timesteps
3D assets            | Use the same hybrid sampling for NeRF or 3D diffusion models
Compressed ODE       | Replace second-order solver with third-order or distillation to push steps below 8
Multi-reward fusion  | Weight different rewards dynamically (e.g., aesthetic + safety + OCR); see the sketch below
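
For the multi-reward fusion idea, here is a minimal fusion sketch in PyTorch; the reward names and weights are purely illustrative.

import torch

def fuse_rewards(scores: dict, weights: dict) -> torch.Tensor:
    # Combine per-image rewards (each a (B,) tensor) into one scalar per image.
    # Each reward is standardized within the batch so differently scaled models
    # (aesthetic, safety, OCR, ...) contribute on a comparable footing.
    total = torch.zeros_like(next(iter(scores.values())))
    for name, s in scores.items():
        z = (s - s.mean()) / (s.std() + 1e-6)
        total = total + weights.get(name, 1.0) * z
    return total

# Example (hypothetical tensors):
# fused = fuse_rewards(
#     {"hps": hps_scores, "image_reward": ir_scores, "safety": safety_scores},
#     {"hps": 0.5, "image_reward": 0.4, "safety": 0.1},
# )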

Citation

If you use MixGRPO in your research or product, please cite:

@misc{li2025mixgrpounlockingflowbasedgrpo,
  title={MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE}, 
  author={Junzhe Li and Yutao Cui and Tao Huang and Yinping Ma and Chun Fan and Miles Yang and Zhao Zhong},
  year={2025},
  eprint={2507.21802},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2507.21802}
}

Happy faster training!