PaCo-RL: A Breakthrough in Consistent Image Generation Using Reinforcement Learning

Introduction

Have you ever tried using AI to generate a series of coherent images—for creating story characters or designing multiple advertisement visuals—only to find the results inconsistent in style, identity, or logical flow? Consistent image generation remains a fundamental challenge in AI content creation, requiring models to maintain shared elements like character appearance, artistic style, or scene continuity across multiple images.

In this comprehensive guide, we explore PaCo-RL (Pairwise Consistency Reinforcement Learning), an innovative framework that addresses these challenges through specialized reward modeling and efficient reinforcement learning. Whether you’re a researcher, developer, or AI enthusiast, this article will provide deep insights into how PaCo-RL works, its technical innovations, and practical applications.

What is PaCo-RL? Revolutionizing Consistent Image Generation

PaCo-RL is a reinforcement learning framework specifically designed for consistent image generation. It comprises two core components: PaCo-Reward (a specialized consistency evaluation model) and PaCo-GRPO (an efficient RL optimization algorithm).

The framework addresses two key image generation tasks:

  • Image Editing: Modifying specific attributes (like changing facial expressions) while preserving overall appearance
  • Text-to-Image Set Generation: Producing multiple coherent images from a single prompt while maintaining identity, style, and contextual consistency

Unlike traditional supervised approaches that struggle with limited annotated data, PaCo-RL enables data-efficient learning of complex visual consistency criteria through reinforcement learning. Experimental results demonstrate state-of-the-art performance with significantly improved training efficiency.

Why Consistent Image Generation Matters

Visual consistency is crucial for applications including:

  • Storytelling: Maintaining character appearance across comic panels or book illustrations
  • Brand Design: Ensuring consistent visual identity across marketing materials
  • Product Development: Creating coherent design mockups and prototypes

Traditional methods often fail to balance consistency with generation quality. PaCo-RL addresses this through specialized reward mechanisms and optimization strategies tailored for consistency preservation.

Core Components: Technical Deep Dive

1. PaCo-Reward: Specialized Consistency Evaluation Model

PaCo-Reward serves as the foundation of the framework—a pairwise reward model trained to assess visual consistency between images. Unlike general-purpose reward models focused on aesthetics or prompt alignment, PaCo-Reward specializes in consistency evaluation.

PaCo-Dataset: Large-Scale Consistency Benchmark

The model is trained on PaCo-Dataset, a human-annotated consistency ranking dataset covering 6 major categories and 32 subcategories of consistent image generation scenarios. The dataset includes 54,624 annotated image pairs (27,599 consistent and 27,025 inconsistent), each accompanied by GPT-5-generated Chain-of-Thought reasoning annotations.

Data collection employed an automated pipeline:

  • Generated 2,000 text prompts using DeepSeek-V3.1
  • Selected 708 diverse prompts through graph-based diversification
  • Used FLUX.1-dev to generate 2×2 image grids with strong internal consistency
  • Applied sub-figure combinatorial pairing to create 33,984 unique ranking instances (see the sketch after this list)
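
The "combinatorial pairing" step is easiest to see in code. The snippet below is an illustrative sketch only, not the released pipeline: it shows how the four sub-images cropped from one 2×2 grid expand into C(4, 2) = 6 candidate consistent pairs. How inconsistent pairs and the final 33,984 ranking instances are assembled follows the paper's pipeline and is not reproduced here.

from itertools import combinations

def within_grid_pairs(grid_subimages):
    """grid_subimages: the four sub-image paths cropped from one 2x2 FLUX.1-dev grid.

    Every unordered pair of sub-images from the same grid is a candidate consistent
    pair, giving C(4, 2) = 6 pairs per grid. (Illustrative sketch, not the released
    data-collection code.)
    """
    return list(combinations(grid_subimages, 2))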

Modeling Innovations

PaCo-Reward reformulates consistency evaluation as a generative task, predicting the probability of “Yes” or “No” tokens to indicate consistency between image pairs. This approach naturally aligns with the autoregressive nature of vision-language models.
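
Concretely, the consistency reward can be read off the logits the model produces at the decision position. The following is a minimal sketch of that readout, assuming you already have those logits and the tokenizer ids of the "Yes" and "No" tokens; the function name and the convention of renormalizing over just the two decision tokens are illustrative assumptions, not the released API.

import torch

def consistency_score(decision_logits: torch.Tensor, yes_id: int, no_id: int) -> float:
    """Turn the logits at the decision position into a consistency reward.

    decision_logits: [vocab_size] logits for the first generated token, produced by
    the reward model when shown the image pair and the consistency question.
    The reward is P("Yes") renormalized over the two decision tokens (assumed convention).
    """
    pair = decision_logits[[yes_id, no_id]]
    probs = torch.softmax(pair, dim=-1)
    return probs[0].item()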

The training objective uses weighted likelihood:

L_PaCo = -[α log p(y₀|I) + (1-α) ∑ᵢ₌₁ⁿ⁻¹ log p(yᵢ|I)]

where y₀ is the decision token (“Yes”/“No”) and yᵢ denotes the i-th token of the reasoning sequence. A hyperparameter search found α = 0.1 to give the best generalization.
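
A minimal PyTorch sketch of this objective is shown below. It assumes the per-token log-probabilities of the target sequence have already been gathered (decision token first, then the reasoning tokens); the helper is an illustration of the weighted likelihood, not the project's training code.

import torch

def paco_loss(token_logprobs: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Weighted likelihood over the target sequence y = (y0, y1, ..., y_{n-1}).

    token_logprobs: [n] tensor holding log p(y_i | I, y_<i) for the decision token
    (index 0) followed by the Chain-of-Thought reasoning tokens.
    alpha weights the decision token; (1 - alpha) weights the reasoning tokens.
    """
    decision_term = token_logprobs[0]
    reasoning_term = token_logprobs[1:].sum()
    return -(alpha * decision_term + (1.0 - alpha) * reasoning_term)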

2. PaCo-GRPO: Efficient RL Optimization Algorithm

PaCo-GRPO addresses the computational challenges of consistent image generation through two key strategies:

Resolution-Decoupled Training

The approach separates training and inference resolutions:

  • Training: Uses lower-resolution images (512×512) for reward computation
  • Inference: Generates high-resolution outputs (1024×1024)

This strategy reduces computational costs by approximately 75% while maintaining final performance quality. Experiments show that 512×512 training achieves comparable performance to 1024×1024 training after 50 epochs, despite starting with lower initial rewards.
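
In code, the decoupling amounts to nothing more than using different resolutions for RL rollouts and for final sampling. The sketch below is illustrative; the pipe callable and its height/width arguments are placeholders rather than the project's API.

from dataclasses import dataclass

@dataclass
class ResolutionConfig:
    train_res: int = 512    # rollouts scored by the reward model during RL training
    infer_res: int = 1024   # final high-resolution sampling after training

def rollout(pipe, prompt, cfg: ResolutionConfig):
    # Low-resolution rollout: cheaper denoising and cheaper reward computation.
    return pipe(prompt, height=cfg.train_res, width=cfg.train_res)

def generate(pipe, prompt, cfg: ResolutionConfig):
    # Inference keeps the full output resolution.
    return pipe(prompt, height=cfg.infer_res, width=cfg.infer_res)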

Log-Tamed Multi-Reward Aggregation

To prevent single reward domination during multi-reward optimization, PaCo-GRPO implements adaptive reward compression:

  1. Compute coefficient of variation for each reward signal:

    hᵏ = std_{i,j}(Rᵏ(xᵢʲ, cᵢ)) / mean_{i,j}(Rᵏ(xᵢʲ, cᵢ))
    
  2. Apply logarithmic transformation to rewards with high variability:

     R̄ᵏ(xᵢʲ, cᵢ) = log(1 + Rᵏ(xᵢʲ, cᵢ)) if hᵏ > δ
     R̄ᵏ(xᵢʲ, cᵢ) = Rᵏ(xᵢʲ, cᵢ) otherwise
    

This approach maintains relative sample rankings while preventing any single reward from dominating optimization.
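
Putting the two steps together, a minimal NumPy sketch of the aggregation could look like the following. The threshold value and the final averaging across reward signals are assumptions made for illustration; only the coefficient-of-variation test and the log(1 + R) compression follow the formulas above.

import numpy as np

def log_tame_rewards(rewards: np.ndarray, delta: float = 0.5) -> np.ndarray:
    """Log-tamed multi-reward aggregation (illustrative sketch).

    rewards: [K, N] array with one row per reward signal and one column per sample
    (all samples x_i^j in the current batch); rewards are assumed non-negative.
    """
    tamed = rewards.astype(float).copy()
    for k in range(rewards.shape[0]):
        cv_k = rewards[k].std() / (rewards[k].mean() + 1e-8)  # coefficient of variation h^k
        if cv_k > delta:                                      # compress only highly variable signals
            tamed[k] = np.log1p(rewards[k])                   # log(1 + R) keeps sample rankings
    return tamed.mean(axis=0)  # simple averaging across reward signals (assumed)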

Experimental Results: Quantifiable Performance Gains

Reward Modeling Performance

PaCo-Reward demonstrates significant improvements over existing models:

  • ConsistencyRank Benchmark: Achieves 0.449 accuracy, outperforming CLIP-I (0.394) and DreamSim (0.403)
  • EditReward-Bench: Reaches 0.709 consistency accuracy, competitive with proprietary models like GPT-5
  • Human Alignment: Shows 10.5% improvement in accuracy and 0.150 increase in Spearman’s ρ compared to base Qwen2.5-VL-7B

End-to-End Generation Results

When integrated with PaCo-GRPO, the framework achieves:

  • Text-to-ImageSet Generation: 11.7% improvement in consistency metrics while maintaining prompt fidelity
  • Image Editing Tasks: Balanced enhancement of semantic consistency and perceptual quality
  • Training Efficiency: Nearly 2× faster training with maintained stability

Ablation Studies

Key findings from component analysis:

  • Resolution Strategy: 512×512 training provides optimal efficiency-quality tradeoff
  • Reward Aggregation: Log-tamed approach maintains reward ratios below 1.8 vs. 2.5+ with standard aggregation
  • Multi-Reward Balance: Prevents dominance of any single reward signal during optimization

Practical Implementation Guide

Installation and Setup

git clone https://github.com/X-GenGroup/PaCo-RL.git
cd PaCo-RL

Training PaCo-Reward

cd PaCo-Reward
conda create -n paco-reward python=3.12 -y
conda activate paco-reward
cd LLaMA-Factory && pip install -e ".[torch,metrics]" --no-build-isolation
cd .. && bash train/paco_reward.sh

RL Training with PaCo-GRPO

cd PaCo-GRPO
conda create -n paco-grpo python=3.12 -y
conda activate paco-grpo
pip install -e .

# Start vLLM reward server
conda create -n vllm python=3.12 -y   # the reward server runs in its own environment
conda activate vllm && pip install vllm
export CUDA_VISIBLE_DEVICES=0
export VLLM_MODEL_PATHS='X-GenGroup/PaCo-Reward-7B'
bash vllm_server/launch.sh

# Begin training
export CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7
conda activate paco-grpo
bash scripts/single_node/train_flux.sh t2is

Pre-trained Models

Available models include:

  • PaCo-Reward-7B: Primary reward model (https://huggingface.co/X-GenGroup/PaCo-Reward-7B)
  • PaCo-FLUX.1-dev: Text-to-image model with consistency optimization
  • PaCo-QwenImage-Edit: Image editing model with enhanced consistency

Case Studies: Real-World Applications

Character Consistency Generation

Prompt: “Generate four images depicting the same dentist in scrubs in various medical scenarios”

Results: Progressive improvement in facial features and hairstyle consistency across different medical contexts while maintaining individual identity.

Design Style Consistency

Prompt: “Create four café menu displays with chalkboard font featuring: ‘Fresh Brew’, ‘Daily Specials’, ‘Homemade’, and ‘Sweet Treats’”

Results: Uniform chalkboard font style across all menus with improved prompt fidelity and aesthetic quality.

Progressive Drawing Sequences

Prompt: “Generate four images showing progressive pencil drawing sequence of a young woman’s portrait”

Results: Correct logical progression where each image extends and refines the previous one rather than introducing inconsistent changes.

Frequently Asked Questions

Q: How does PaCo-RL differ from traditional supervised approaches?
A: Traditional methods require large annotated datasets that capture consistency patterns, and such data is scarce. PaCo-RL instead uses reinforcement learning to optimize directly for human preferences, enabling data-efficient learning of complex consistency criteria.

Q: Can PaCo-Reward handle multiple image comparisons?
A: Yes, the model performs pairwise comparisons between reference and candidate images. For multiple candidates, it computes consistency scores for each pair and infers rankings.
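
A straightforward way to turn those pairwise scores into a ranking is sketched below; score_pair is a placeholder for a call to the reward model (for instance, the probability of the "Yes" decision token), not a function shipped with the repository.

def rank_candidates(reference, candidates, score_pair):
    """Rank candidate images by pairwise consistency with a reference image.

    score_pair(reference, candidate) -> float is a placeholder for a PaCo-Reward
    call. Returns the candidates sorted from most to least consistent.
    """
    scored = [(score_pair(reference, c), c) for c in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored]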

Q: Does resolution-decoupled training affect output quality?
A: No, the approach maintains high-quality generation during inference while using lower-resolution images during training for efficiency. Experiments confirm comparable final performance.

Q: What applications benefit most from PaCo-RL?
A: The framework is particularly valuable for storyboarding, character design, product mockups, and any application requiring visual consistency across multiple images.

Q: How can I customize PaCo-RL for specific domains?
A: Users can fine-tune the reward model on domain-specific data following the PaCo-Dataset format, then integrate it into the PaCo-GRPO training pipeline.

Conclusion and Future Directions

PaCo-RL represents a significant advancement in consistent image generation through its novel combination of specialized reward modeling and efficient reinforcement learning. The framework demonstrates:

  • Improved Consistency: 10.3-11.7% improvement in consistency metrics
  • Enhanced Efficiency: Nearly 2× training speedup with maintained stability
  • Better Human Alignment: 8.2-15.0% improvement in correlation with human preferences

Future work may extend the approach to video generation (temporal consistency) and 3D content creation. The open-source implementation encourages community exploration and adaptation.

Citation and Acknowledgments

If you use PaCo-RL in your research, please cite:

@misc{ping2025pacorl,
      title={PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling}, 
      author={Bowen Ping and Chengyou Jia and Minnan Luo and Changliang Xia and Xin Shen and Zhuohang Dang and Hangwei Qian},
      year={2025},
      eprint={2512.04784},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.04784}, 
}