From Imitation to Discrimination: How a Generalized Curriculum Advantage Mechanism Enhances Cross-Domain Reasoning in AI
Summary: This article introduces CAPO (Curriculum Advantage Policy Optimization), an innovative reinforcement learning training paradigm. It employs a staged curriculum, first using positive-advantage samples for imitation learning to build a stable foundation, then introducing negative-advantage samples for discrimination learning to enhance generalization. The method is compatible with mainstream optimization algorithms like GRPO and PPO, consistently improving mathematical reasoning performance by 1.7 to 4.0 points, and effectively generalizes to multimodal GUI reasoning scenarios with a 3.81-point gain, establishing itself as a versatile and robust optimization framework.
I. The Dilemma and Breakthrough in Training Large Models for Reasoning
Imagine teaching a child mathematics. How would you proceed? Typically, you would first show them many correct problem-solving steps, allowing them to imitate and grasp the basic methods. Once they have a foundation, you begin pointing out their mistakes, helping them discern what was done poorly to deepen their understanding. This process, moving from imitation to discrimination, aligns with the natural progression of human cognitive development.
However, when training large language models for complex reasoning, our traditional reinforcement learning methods often “rush the process.”
The post-training paradigm, exemplified by advanced reasoning models like DeepSeek-R1 and Kimi-1.5, widely employs advantage-based reinforcement learning algorithms such as PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization).
The core of these algorithms is the “advantage value,” which quantifies whether a sample’s performance is better or worse than the model’s current expectation. Positive advantage provides positive feedback, encouraging the model to continue good behaviors; negative advantage provides negative feedback, signaling the model needs adjustment.
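To make this concrete, here is a minimal Python sketch of a group-relative advantage in the style of GRPO (the function name and the reward values are illustrative, not taken from the paper): responses that score above the group mean receive a positive advantage, those below receive a negative one.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward against its own group.

    A positive value means the sample did better than the model's current
    expectation for this query; a negative value means it did worse.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 responses sampled for one math query
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
# -> two positive advantages (the correct answers) and two negative ones
```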
The problem lies here: Traditional methods mix positive and negative signals together for training from the very beginning.
This is like praising a child for correct answers and criticizing them for wrong ones before they’ve even learned basic arithmetic. The early model’s “capability” is still unstable, and its understanding of negative feedback signals is ambiguous, potentially even misleading. This signal confusion leads to ambiguous training guidance and unstable early learning, ultimately limiting further model performance improvements.
The Core Research Question
Since the advantage value itself tells us whether the model performed “well” (positive) or “needs improvement” (negative) on a given sample, can we use the advantage as a guide to design a structured learning curriculum that organically integrates positive and negative feedback into a unified, generalizable framework?
This is the starting point for CAPO (Curriculum Advantage Policy Optimization), which we will explore in depth today. Starting from a simple psychological insight, it builds a rigorous machine learning framework that has achieved significant and consistent improvements on mathematical and multimodal reasoning tasks.
II. The Inspiration: The Phased Theory of Human Learning
Our inspiration stems directly from developmental psychology. Research shows that children’s learning progresses in gradual stages:
- Imitation Phase: By observing and imitating correct behaviors, children establish basic behavioral patterns and a stable knowledge foundation. This phase relies primarily on positive reinforcement.
- Discrimination Phase: After mastering the basics, they learn to distinguish right from wrong by receiving corrective feedback and even punishment, thereby refining skills and generalizing to more complex situations.
This “imitation first, discrimination later” progression naturally positions the advantage signal as an effective curriculum indicator. Positive advantage corresponds to “correct behavior to imitate,” while negative advantage corresponds to “deficiencies to discriminate.”
Figure 1: Comparison of Training Signals in Traditional RL vs. CAPO
(a) Traditional RL: Mixes positive and negative signals from the start, potentially causing early learning confusion.
(b) CAPO: Employs a staged curriculum. Uses only positive signals for imitation to build stability first, then introduces negative signals for discrimination to improve generalization.
Translating this intuition into an algorithm, however, requires theoretical support. We examine the design through the lens of the bias-variance tradeoff:
- Imitation Phase (Positive-Only Advantage): Excluding negative-advantage samples (especially the extreme negative values that arise from the model's early, insufficient capability and introduce high variance) significantly reduces the variance of the gradient estimate, stabilizing early training.
- Discrimination Phase (Full Advantage Spectrum): Once the model's foundation is solid, reintroducing negative-advantage samples restores the unbiasedness of the gradient estimate, allowing the model to learn comprehensively and acquire stronger generalization. The two estimators are sketched below.
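In our own notation (not taken verbatim from the paper), let A(τ) be the advantage of trajectory τ under policy π_θ. Phase 1 masks out the negative-advantage terms, which biases the estimator but removes its highest-variance components; Phase 2 restores the full, unbiased estimator.

```latex
% Standard policy-gradient estimator (unbiased, used in Phase 2):
g_{\text{full}} = \mathbb{E}_{\tau \sim \pi_\theta}\big[ A(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \big]

% Positive-only estimator (Phase 1): lower variance, but biased,
% because negative-advantage trajectories are masked out.
g_{+} = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \mathbb{1}\{A(\tau) \ge 0\}\, A(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \big]
```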
What makes CAPO unique is that it leverages the advantage itself as a dynamic signal that aligns closely with the model’s own evolving capabilities. In contrast, most traditional curriculum learning methods rely on static heuristic rules, such as sorting tasks from easy to hard, or estimating difficulty based on expert-annotated success rates. These methods are essentially external and heuristic, depending on manually defined proxy metrics rather than signals intrinsic to the model’s evolving capability.
III. CAPO Explained: A Universal Mechanism for Phased Optimization
1. Core Mechanism Overview
CAPO is a mechanism broadly compatible with various advantage-based reinforcement learning algorithms. Its workflow can be summarized as follows:
Figure 2: Illustration of the CAPO Scheduling Mechanism
Each query is processed by the policy model to generate samples, and different optimization algorithms compute their respective advantage values. In Phase 1, only positive-advantage samples are used to ensure stability; after the switch point, Phase 2 incorporates both positive and negative advantages to balance stability and generalization.
Simply put, after obtaining model-generated samples and their computed advantage values, CAPO does not immediately use them all for updates. Instead, based on the training progress, it intelligently filters and utilizes these signals.
2. Phase 1: Positive-Only Advantage Imitation
In the early stages of training, CAPO establishes a positive-only advantage imitation phase. In this phase, only samples with an advantage value greater than or equal to zero (A(τ) ≥ 0) are used to update the model.
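As a minimal sketch of this filter (assuming advantages arrive as a PyTorch tensor; the function name is illustrative rather than CAPO's actual implementation), the phase simply masks out negative-advantage samples before the policy-gradient loss is formed:

```python
import torch

def imitation_phase_advantages(advantages: torch.Tensor) -> torch.Tensor:
    """Phase 1 (positive-only imitation): zero out negative-advantage samples.

    Zeroing an advantage removes that sample's contribution to the policy
    gradient, so only better-than-expected trajectories drive the update.
    """
    mask = (advantages >= 0).float()
    return advantages * mask

adv = torch.tensor([0.87, -0.87, 1.20, -0.30])
print(imitation_phase_advantages(adv))  # tensor([0.8700, 0.0000, 1.2000, 0.0000])
```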
Why do this?
- Consolidates Prior Knowledge: Guides the model to reinforce behaviors that are already better than expected, solidifying correct reasoning patterns.
- Avoids Unstable Gradients: Prevents the model, while still weak, from being prematurely disturbed by hard-to-handle, poorly performing samples (negative advantage) that could bring large and noisy gradient updates.
- Stays Close to the Reference Distribution: Combined with a KL divergence regularization term (similar to common practice in RLHF), it ensures the model progresses without straying too far from the original model, preserving generation quality and diversity.
The objective function for this phase (based on the PPO clipped objective) can be described informally as follows: inside the expectation, an indicator function retains only the positive-advantage trajectories, the clipped surrogate objective is computed for those trajectories, and a KL divergence penalty against a reference policy is subtracted. The core idea is selective reinforcement.
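One hedged way to write that down, in our own notation rather than the paper's exact formulation, with likelihood ratio r(θ), clip range ε, and KL coefficient β:

```latex
J_{\text{phase1}}(\theta)
= \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\!\Big[
    \mathbb{1}\{A(\tau) \ge 0\}\,
    \min\!\big(r(\theta)\,A(\tau),\ \operatorname{clip}(r(\theta),\,1-\epsilon,\,1+\epsilon)\,A(\tau)\big)
  \Big]
  \;-\; \beta\,\mathrm{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big),
\qquad
r(\theta) = \frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{old}}}(\tau)}
```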
3. Phase 2: Full Advantage Spectrum Discrimination
Once the model has built a stable and reliable foundation through the first phase of imitation learning (e.g., after reaching 10% or 20% of the total training steps), CAPO switches to the full advantage spectrum discrimination phase.
At this point, all samples, regardless of whether their advantage is positive or negative, are used for training.
What does this achieve?
- Refined Learning: The model not only continues learning "what to do" from positive-advantage samples but also begins learning "what not to do" from negative-advantage samples. This contrastive signal significantly improves the model's discrimination and generalization abilities.
- Restores Unbiased Estimation: Theoretically, using all samples allows the gradient estimate to regain unbiasedness, which is crucial for the model's eventual convergence to an optimal solution.
4. Curriculum Scheduling: Simple Yet Effective
CAPO employs a simple, direct “hard switch” strategy between the two phases. We preset a switch point (e.g., 20% of training steps), using positive-only advantage before it and full advantage after.
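A minimal sketch of this hard switch (the 20% default mirrors the text; the function and argument names are illustrative, not CAPO's released code):

```python
import torch

def capo_advantages(advantages: torch.Tensor,
                    step: int,
                    total_steps: int,
                    switch_frac: float = 0.2) -> torch.Tensor:
    """Curriculum scheduling: positive-only before the switch point, full spectrum after.

    Phase 1 (step < switch point): keep only A >= 0 to stabilize early training.
    Phase 2 (step >= switch point): use all advantages for unbiased updates.
    """
    in_imitation_phase = step < int(switch_frac * total_steps)
    if in_imitation_phase:
        return advantages * (advantages >= 0).float()
    return advantages

# Usage inside a training loop (hypothetical step counts):
# adv = capo_advantages(adv, step=global_step, total_steps=2000)
# ...then feed `adv` into the usual GRPO/PPO loss.
```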
We also experimented with more complex schemes, such as gradually introducing negative signals, but found that none of the progressive schedules outperformed the simple hard switch.
The advantages of this pragmatic design are:
- Strong Robustness: Avoids complex hyperparameter tuning and monitoring of task-specific metrics.
- Task-Agnostic: Works stably across different tasks and datasets.
- High Reproducibility: Ensures the reliability and reproducibility of experimental results.
It perfectly achieves the original design intention: reducing variance to stabilize training in Phase 1, and enhancing generalization through unbiased estimation in Phase 2.
IV. Experimental Results: Quantitative Validation of CAPO’s Effectiveness
Even the most elegant theory requires solid experimental support. We comprehensively evaluated CAPO in two major domains: mathematical reasoning and multimodal GUI reasoning.
1. Mathematical Reasoning Tasks: Universal and Stable Improvement
We tested the combined effect of CAPO with four mainstream reinforcement learning algorithms (GRPO, PPO, RLOO, Reinforce++) on two models of different scales, Qwen2.5-Math-7B and Qwen2.5-Math-1.5B. Evaluation covered seven high-difficulty mathematical reasoning benchmarks including AIME, AMC, MATH500, and GSM8K.
Table 1: Results of Different LLMs on Seven Mainstream Math Reasoning Benchmarks (Key Data Summary)
| Model & Method | AIME24 | AMC | Average Improvement |
|---|---|---|---|
| Qwen2.5-7B GRPO | 16.7 | 52.5 | – |
| + CAPO | 20.0 | 65.0 | +3.9 points |
| Qwen2.5-7B PPO | 26.7 | 52.5 | – |
| + CAPO | 30.0 | 57.5 | +3.2 points |
| Qwen2.5-1.5B GRPO | 13.3 | 52.5 | – |
| + CAPO | 23.3 | 62.5 | +4.0 points |
Key Findings:
- Broad Applicability: As a "plug-and-play" enhancement mechanism, CAPO delivered consistent and significant performance improvements (+1.7 to +4.0 points) across all four tested optimization algorithms.
- Scale-Friendly: CAPO was effective for both the 7B and 1.5B models, helping the smaller model close much of the performance gap with the larger one.
- Notable Gains on Challenging Tasks: On AMC competition-level problems, CAPO lifted the 7B model from 52.5 to 65.0 (+12.5 points).
2. Multimodal GUI Reasoning Tasks: Powerful Cross-Domain Generalization
To test if CAPO is limited to the mathematical domain, we applied it to the more challenging multimodal Graphical User Interface (GUI) reasoning task. This type of task requires the model to simultaneously understand visual interfaces, language instructions, and plan correct action sequences, serving as a testbed for cross-domain reasoning ability.
We trained on the QwenVL2.5-3B model using the VERL framework and tested on four GUI planning benchmarks including GUI-Act-Web and OmniAct-Web.
Table 2: GUI Reasoning Task Performance Comparison (GRPO vs. GRPO+CAPO)
| Metric | GRPO Baseline | CAPO (Ours) | Improvement (Δ) |
|---|---|---|---|
| GUI-Act-Web (SR) | 70.23 | 85.85 | ↑15.62 |
| OmniAct-Web (SR) | 70.76 | 74.16 | ↑3.40 |
| Overall Average | 70.79 | 74.60 | ↑3.81 |
Note: SR is Step Success Rate, a key metric.
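For readers unfamiliar with the metric, here is a hedged sketch of how a step success rate is commonly computed for GUI planning, assuming exact-match comparison of predicted and ground-truth actions (benchmark-specific matching rules may differ):

```python
def step_success_rate(predicted_actions, gold_actions):
    """Fraction (in %) of steps whose predicted action matches the ground truth.

    `predicted_actions` and `gold_actions` are aligned per-step sequences,
    e.g. "CLICK(search_box)", "TYPE('laptop')".
    """
    assert len(predicted_actions) == len(gold_actions)
    correct = sum(p == g for p, g in zip(predicted_actions, gold_actions))
    return 100.0 * correct / len(gold_actions)

print(step_success_rate(["CLICK(a)", "TYPE(b)"], ["CLICK(a)", "SCROLL"]))  # 50.0
```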
The conclusion is clear: CAPO successfully generalized from pure-text mathematical reasoning to multimodal interactive reasoning, delivering a significant average improvement of 3.81 points on GUI planning tasks, including a striking 15.62-point gain in step success rate on the GUI-Act-Web dataset. This demonstrates CAPO's potential as a universal optimization paradigm.
V. In-Depth Analysis: Why Does CAPO Work?
1. Visualizing Training Dynamics
Figure 3: Reward and Entropy Dynamics of GRPO vs. CAPO on the 7B Model
The gray vertical line marks the switch point from the imitation phase to the discrimination phase.
- Reward Curve: Before the switch point, both methods show similar growth; after the switch, CAPO's reward growth is more sustained and ultimately higher.
- Entropy Curve: After the switch, CAPO's entropy rises steadily while GRPO's entropy plateaus. Higher entropy typically indicates that the model maintains richer exploratory behavior and more diverse reasoning paths.
This shows that by delaying the introduction of negative samples, CAPO avoids the entropy collapse (where the model rapidly converges to a single mode) that early signal mixing might trigger. After solidifying the foundation, it can utilize negative feedback more effectively, maintaining exploratory capability while improving performance, which is key to its enhanced generalization.
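To make the entropy curve concrete, here is a minimal sketch of the per-token policy entropy typically tracked during RL training (assuming access to the model's logits; averaging conventions vary across codebases):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Average entropy of the next-token distribution over a batch of positions.

    logits: [batch, seq_len, vocab]. Higher values indicate the policy spreads
    probability over more alternatives, i.e. it keeps exploring.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)  # [batch, seq_len]
    return entropy.mean()

# Example with random logits (a stand-in for real model outputs):
print(mean_token_entropy(torch.randn(2, 8, 32000)))
```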
2. Comparison with Static Curriculum Learning
We compared CAPO with traditional Static Curriculum Learning (SC). Static curriculum pre-evaluates all samples for difficulty (e.g., using pass@16), sorts them from easy to hard, and trains the model in this fixed order.
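A hedged sketch of such a static curriculum, assuming difficulty is estimated as one minus the empirical pass rate over 16 samples per problem (the sampling and sorting details here are illustrative):

```python
def static_curriculum_order(problems, sample_fn, k=16):
    """Static curriculum: estimate difficulty once, up front, then sort easy -> hard.

    sample_fn(problem) should return True if one sampled solution is correct.
    Difficulty = 1 - pass@k estimated from k independent samples.
    """
    def difficulty(problem):
        passes = sum(sample_fn(problem) for _ in range(k))
        return 1.0 - passes / k

    return sorted(problems, key=difficulty)

# Unlike CAPO, this ordering is fixed before training starts and never
# adapts to the model's evolving capability.
```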
Table 3: Static Curriculum vs. CAPO Dynamic Curriculum on Qwen2.5-7B-Math
| Method | AIME24 | AMC | Overall Performance |
|---|---|---|---|
| GRPO (Baseline) | 16.7 | 52.5 | 49.5 |
| GRPO + Static Curric. | 16.7 | 65.0 | 51.8 |
| GRPO + CAPO | 20.0 | 65.0 | 53.9 |
The results show that static curriculum (GRPO+SC), while effective on some tasks (like AMC), delivers unstable and limited improvement (no improvement on AIME24). In contrast, CAPO’s dynamic advantage curriculum achieved more stable and comprehensive improvements across all tasks. This indicates that relying on external, static difficulty estimates is less effective than leveraging the model’s own intrinsic, dynamic capability signals.
3. Out-of-Distribution Generalization Capability
A good algorithm shouldn’t only perform well within the training distribution. We tested CAPO on two completely unseen out-of-distribution reasoning datasets: ARC-C and GPQA-Diamond.
Figure 5: Results on Out-of-Distribution (OOD) Benchmarks
CAPO achieved an average accuracy of 52.8, clearly outperforming the GRPO baseline's 49.0 (a 3.8-point absolute gain). This confirms that CAPO's progressive learning strategy (imitation, then discrimination) does cultivate more robust, better-generalizing models, effectively mitigating the performance degradation caused by distribution shift.
VI. CAPO Frequently Asked Questions (FAQ)
Q1: How is CAPO different from traditional “Curriculum Learning”?
A1: Traditional curriculum learning is typically “data-centric,” relying on externally defined, static difficulty metrics (like problem length, pass rate) to sort samples from easy to hard. CAPO is “competence-aware,” using the naturally occurring advantage signal from reinforcement learning training as the curriculum indicator. This signal is dynamic and endogenous, directly reflecting the current strength of the model’s capability, enabling adaptive scheduling synchronized with the model’s learning progress.
Q2: Does CAPO require extra computational overhead?
A2: The core logic of CAPO is the selective utilization of existing advantage signals, not the computation of new complex metrics. The additional overhead it introduces (like judging whether an advantage is positive or negative) is minimal and almost negligible. Its main value lies in making each computation step more effective through a smarter training strategy, thereby achieving better model performance with similar computational power.
Q3: How is the switch point (e.g., 20%) determined? Does it need fine-tuning for different tasks?
A3: Our experiments show that CAPO is not overly sensitive to the exact position of the switch point. Switching within a range of 20% to 30% of training progress usually yields good results (as shown in Figure 4). We recommend adopting a simple, unified strategy (e.g., fixed at 20%), which avoids tedious hyperparameter searches and reflects CAPO’s practical, robust design philosophy.
Q4: Can CAPO be applied to other types of models or tasks, such as vision models or reinforcement learning agents?
A4: In principle, the CAPO framework is general. As long as the training paradigm involves policy gradient updates based on an advantage function (the core of modern deep reinforcement learning), the phased curriculum idea of CAPO could be applicable. This article has verified its effectiveness on large language models (mathematical reasoning) and multimodal large models (GUI reasoning). Extending it to pure vision tasks or traditional RL environments is a promising direction for future exploration.
Q5: What if my dataset has very few positive advantage samples? Will CAPO fail?
A5: In the first phase, if positive-advantage samples are too scarce, it might lead to slow or insufficient updates. This essentially reflects that the current model’s overall capability on the task is weak. In this case, it may be necessary to re-examine the base model’s pre-training quality, the design of the reward function, or consider starting from more basic imitation learning. CAPO is more suitable for stages where the model already has some foundational capability and needs further optimization and generalization.
Conclusion
From imitation to discrimination, CAPO offers a new perspective on model training that aligns with cognitive science intuition and has a solid theoretical foundation. It no longer treats the advantage value merely as a gradient weight but elevates it to a curriculum signal that drives the entire learning process.
The core contributions of this work are:
- Proposing a Universal Dynamic Curriculum Mechanism: Using the advantage signal to design a two-stage training process of imitation and discrimination that adapts to the model's evolving capability.
- Demonstrating Broad Applicability and Generalizability: Consistently improving mathematical reasoning across multiple algorithms such as GRPO and PPO, and transferring effectively to multimodal GUI reasoning scenarios.
- Providing Ample Quantitative Evidence: Confirming, through extensive experiments, its consistent effectiveness on both in-distribution and out-of-distribution tasks.
The success of CAPO suggests that letting machine learning processes better reflect the progressive and structural nature of human learning may be key to unlocking more powerful, robust artificial intelligence. It is not just an algorithmic improvement but a shift in mindset towards smarter, more adaptive training paradigms.

