rStar2-Agent: How a 14B Model Achieves Frontier Math Reasoning with Agentic Reinforcement Learning
Introduction
In the rapidly evolving field of artificial intelligence, large language models (LLMs) have made impressive strides in complex reasoning tasks. However, many state-of-the-art models rely on extensive computational resources and lengthy “chain-of-thought” (CoT) processes that essentially encourage models to “think longer” rather than “think smarter.”
A groundbreaking technical report from Microsoft Research introduces rStar2-Agent, a 14-billion-parameter math reasoning model that challenges this paradigm. Through innovative agentic reinforcement learning techniques, this compact model achieves performance comparable to giants like the 671-billion-parameter DeepSeek-R1, demonstrating that smarter training methodologies can outperform sheer scale.
This comprehensive analysis explores the technical innovations behind rStar2-Agent, its training infrastructure, performance benchmarks, and practical implementation details. For researchers and practitioners in AI and machine learning, this model represents a significant step toward more efficient and effective reasoning systems.
What is rStar2-Agent?
rStar2-Agent is a 14-billion-parameter math reasoning model trained using agentic reinforcement learning to achieve frontier-level performance. Unlike conventional approaches that rely on extended chain-of-thought processes, this model demonstrates advanced cognitive behaviors including careful thinking before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving.
The model’s capabilities are enabled through three key innovations:
- An efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates rollout costs
- GRPO-RoC, a novel agentic RL algorithm with a Resample-on-Correct rollout strategy
- An efficient agent training recipe that starts with non-reasoning supervised fine-tuning (SFT) and progresses through multi-stage RL
Remarkably, rStar2-Agent lifts a pre-trained 14B model to frontier-level performance in just 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25. That surpasses DeepSeek-R1 (671B) on AIME24 and comes within 0.2 percentage points of it on AIME25, with significantly shorter responses.
Core Technical Innovations
GRPO-RoC: Group Relative Policy Optimization with Resampling on Correct
The cornerstone of rStar2-Agent’s success is GRPO-RoC (Group Relative Policy Optimization with Resampling on Correct), an agentic reinforcement learning algorithm specifically designed to address the environment noise that coding tools introduce.
The Challenge of Environment Noise
In standard reasoning tasks, models generate text without external feedback. However, when incorporating coding tools, the model must not only decide when to use them but also generate correct and executable code. When errors occur, the environment returns error messages unrelated to the reasoning task, which can mislead the model into spending valuable tokens fixing tool errors rather than advancing its reasoning.
Under outcome-only reward schemes (where trajectories are evaluated solely based on the final answer), trajectories with incorrect intermediate tool calls can still receive positive reward if the final answer is correct. This effectively reinforces the model to treat such errors as acceptable, leading to lengthy, low-quality trajectories containing tool call errors.
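To make the problem concrete, here is a minimal sketch of group-relative advantages under an outcome-only reward. It assumes the common GRPO normalization (reward minus group mean, divided by group standard deviation); the rollout fields `final_correct` and `tool_errors` are hypothetical names used only for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four rollouts for a single problem.
rollouts = [
    {"final_correct": True, "tool_errors": 0},   # clean and correct
    {"final_correct": True, "tool_errors": 3},   # correct despite failed tool calls
    {"final_correct": False, "tool_errors": 0},
    {"final_correct": False, "tool_errors": 2},
]

# Outcome-only reward: only the final answer matters, tool errors are invisible.
rewards = [1.0 if r["final_correct"] else 0.0 for r in rollouts]
advantages = group_relative_advantages(rewards)

for rollout, adv in zip(rollouts, advantages):
    print(rollout, round(adv, 3))
```

The clean rollout and the error-laden rollout receive the same positive advantage, which is the reinforcement of noisy trajectories that the paragraph above describes.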
The GRPO-RoC Solution
GRPO-RoC integrates GRPO with a Resample-on-Correct (RoC) rollout strategy to address environment-induced noise under sparse, outcome-only rewards. The approach involves:
- Oversampling: First sampling 2G rollouts (where G is the standard group size)
- Asymmetric downsampling: Applying different selection strategies to positive and negative trajectories
  - Negative samples: Preserving diversity by sampling from zero-reward rollouts without filtering
  - Positive samples: Filtering environment noise and promoting higher quality by prioritizing trajectories with minimal tool-induced errors or formatting issues
This simple yet effective asymmetric sampling preserves diverse failure modes as informative negative signals while emphasizing higher-quality success cases for positive supervision. Compared to methods that explicitly penalize tool-use errors in the reward function, GRPO-RoC improves training stability and avoids reward-hacking risks.
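The asymmetric selection can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it assumes the 2G rollouts are reduced back to G, that the positive/negative ratio is roughly preserved, and that positives are ranked deterministically by a penalty (tool error ratio plus a format penalty). The field names `tool_calls`, `tool_errors`, and `format_penalty` are hypothetical.

```python
import random

def resample_on_correct(rollouts, group_size, rng=None):
    """Downsample oversampled rollouts to group_size with asymmetric selection."""
    if not rollouts:
        return []
    rng = rng or random.Random(0)
    positives = [r for r in rollouts if r["reward"] > 0]
    negatives = [r for r in rollouts if r["reward"] <= 0]

    # Assumption: keep roughly the same share of positives as in the full sample.
    n_pos = min(round(group_size * len(positives) / len(rollouts)), len(positives))
    n_neg = min(group_size - n_pos, len(negatives))

    # Negatives: uniform sampling with no filtering, to preserve diverse failures.
    kept_neg = rng.sample(negatives, n_neg)

    # Positives: prefer trajectories with the fewest tool errors and format issues.
    def penalty(r):
        return r["tool_errors"] / max(r["tool_calls"], 1) + r["format_penalty"]

    kept_pos = sorted(positives, key=penalty)[:n_pos]
    return kept_pos + kept_neg

G = 4
rollouts = (
    [{"reward": 1.0, "tool_calls": 4, "tool_errors": e, "format_penalty": 0.0}
     for e in (3, 0, 2, 1)]
    + [{"reward": 0.0, "tool_calls": 4, "tool_errors": e, "format_penalty": 0.0}
       for e in (0, 1, 2, 3)]
)
selected = resample_on_correct(rollouts, group_size=G)
```

A probabilistic variant, where cleaner positives are sampled with higher probability rather than picked outright, would keep more variety among the retained successes; the deterministic ranking here is chosen only to keep the sketch short.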
Efficient Large-Scale Agentic RL Infrastructure
Agentic reinforcement learning introduces significant infrastructure challenges that the rStar2-Agent team addressed through custom-built systems.
Reliable High-Throughput Code Environment
The system implements a dedicated, isolated code environment service capable of handling massive concurrent tool call requests without stalling rollouts. The architecture includes:
- A centralized task queue with 32 send workers on the master node
- Worker nodes with lightweight task schedulers and pools of execution workers
- Batch processing of up to 64 tool calls grouped together
- Dynamic assignment of tool calls to idle execution workers
This infrastructure reliably handles up to 45,000 concurrent tool calls per step while maintaining consistently low end-to-end latency (0.3 seconds per call on average).
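The batching and dispatch pattern described above can be illustrated with a small single-process sketch. It is not the actual service: a thread pool stands in for the pool of execution workers, and `run_tool_call` is a hypothetical placeholder that does not execute any code.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 64  # from the text: up to 64 tool calls are grouped per batch

def make_batches(calls, batch_size=BATCH_SIZE):
    """Split a list of tool calls into consecutive batches."""
    return [calls[i:i + batch_size] for i in range(0, len(calls), batch_size)]

def run_tool_call(code):
    """Placeholder executor; a real service would run the code in a sandbox."""
    return {"code": code, "status": "ok"}

def run_batch(batch):
    return [run_tool_call(code) for code in batch]

def dispatch(calls, num_workers=4):
    """Hand batches to whichever pool worker is idle; results keep input order."""
    batches = make_batches(calls)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_batch = pool.map(run_batch, batches)
        return [result for batch in per_batch for result in batch]

calls = [f"print({i})" for i in range(150)]
batches = make_batches(calls)
results = dispatch(calls)
```

With 150 pending calls this produces two full batches and one partial batch, and the pool assigns each batch to the next free worker rather than to a fixed one.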
Load-Balanced Rollout Scheduler
Traditional RL systems use static rollout allocation, which leads to GPU idle time and synchronization delays due to variability in computation across multi-turn rollouts. rStar2-Agent introduces a dynamic load-balanced rollout scheduling method that:
- Assigns requests based on current available KV cache capacity rather than static division
- Dispatches tool calls asynchronously to the environment service immediately upon generation
- Assigns new requests in real-time as GPUs free up KV cache space
This approach significantly improves GPU utilization and overall rollout efficiency compared to static allocation methods.
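A rough sketch of capacity-aware assignment is shown below. It assumes each request carries an estimated KV cache budget in tokens and that the scheduler greedily picks the GPU with the most free capacity; both the budget estimate and the greedy rule are assumptions for illustration, not details taken from the report.

```python
def assign_requests(pending, free_kv_tokens):
    """Assign each request to the GPU with the most free KV cache capacity.

    pending: list of (request_id, estimated_kv_tokens) pairs.
    free_kv_tokens: dict mapping GPU name to free KV cache capacity in tokens.
    Returns (assignments, waiting, remaining_capacity).
    """
    free = dict(free_kv_tokens)
    assignments = {gpu: [] for gpu in free}
    waiting = []
    for req_id, needed in pending:
        gpu = max(free, key=free.get)
        if free[gpu] >= needed:
            assignments[gpu].append(req_id)
            free[gpu] -= needed
        else:
            # No GPU has room right now; retry once capacity is released.
            waiting.append((req_id, needed))
    return assignments, waiting, free

free_kv = {"gpu0": 10000, "gpu1": 5000}
pending = [("a", 6000), ("b", 3000), ("c", 3000), ("d", 5000)]
assignments, waiting, remaining = assign_requests(pending, free_kv)
```

Requests left in `waiting` would be resubmitted as finished rollouts release KV cache space, which is the real-time reassignment the list above refers to.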
Efficient Training Recipe
The training methodology for rStar2-Agent emphasizes efficiency and effectiveness through several key strategies:
Non-Reasoning Cold Start for Instruction Following
Unlike prior works that apply reasoning-heavy SFT before RL, rStar2-Agent begins with a non-reasoning SFT stage solely to instill general instruction-following, coding tool usage, and formatting without enhancing reasoning. This approach:
- Avoids potential SFT overfitting
- Keeps initial average responses short
- Allows RL to more effectively cultivate reasoning while fully exploiting the model’s pre-trained capability
The non-reasoning SFT uses 165K function-calling examples, 30K instruction-following examples, and 27K chat examples, primarily improving the base model’s tool use, instruction-following, and chat abilities while maintaining comparable math performance.
Multi-Stage RL Training
The training process employs a multi-stage strategy that gradually increases both the maximum training length and the difficulty of the data:
- Stage 1 – Concise Training at 8K Response Length: Training on the full set of 42K curated math problems with a maximum response length of 8K tokens
- Stage 2 – Extending to 12K Response Length: Increasing the maximum response length to 12K tokens as the model’s capabilities grow
- Stage 3 – Focused Training on Difficult Problems: Shifting focus to harder problems (17.3K) that haven’t been perfectly solved
This approach significantly reduces RL costs while encouraging more efficient reasoning strategies compared to methods that scale rollout lengths from 16K to 48K tokens or more.
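The Stage 3 selection rule ("problems that haven’t been perfectly solved") can be expressed as a simple filter. The sketch below assumes that per-problem pass counts from earlier rollouts are available; the function name, problem IDs, and counts are hypothetical.

```python
def select_hard_problems(pass_counts, rollouts_per_problem):
    """Keep problems that were not solved in every one of their rollouts."""
    return [
        problem_id
        for problem_id, passed in pass_counts.items()
        if passed < rollouts_per_problem
    ]

# Hypothetical pass counts out of 8 rollouts per problem.
pass_counts = {"p1": 8, "p2": 5, "p3": 0, "p4": 8}
hard = select_hard_problems(pass_counts, rollouts_per_problem=8)
print(hard)
```

Problems solved in all rollouts carry no learning signal under a group-relative objective, since every rollout receives the same reward, so dropping them concentrates compute on the remaining problems.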
Performance and Evaluation
Mathematical Reasoning Benchmarks
rStar2-Agent-14B demonstrates exceptional performance on competitive math benchmarks:
Model | Parameters | AIME24 | AIME25 | HMMT25 |
---|---|---|---|---|
OpenAI o3-mini (medium) | – | 79.6% | 77.0% | 53.0% |
DeepSeek-R1 | 671B | 79.8% | 70.0% | 44.4% |
Claude-Opus-4.0 (Think) | – | 76.0% | 69.2% | – |
QWQ-32B | 32B | 79.5% | 65.8% | 47.5% |
rStar2-Agent-14B | 14B | 80.6% | 69.8% | 52.7% |
Notably, rStar2-Agent-14B achieves the highest score on AIME24 (80.6%), outperforming o3-mini (medium), DeepSeek-R1, and Claude Opus 4.0 (thinking) by 1.0, 0.8, and 4.6 percentage points respectively, based on the table above. On AIME25 and HMMT25, it reaches 69.8% and 52.7%, demonstrating consistently strong results across benchmarks.
Efficiency Advantages
Beyond raw accuracy, rStar2-Agent demonstrates significant efficiency improvements:
Model | AIME24 Response Length | AIME25 Response Length |
---|---|---|
DeepSeek-R1-Zero (671B) | 14,246.8 tokens | 17,132.9 tokens |
QWQ-32B | 11,868.4 tokens | 15,865.4 tokens |
Qwen3-14B | 14,747.6 tokens | 17,521.9 tokens |
rStar2-Agent-14B | 9,339.7 tokens | 10,943.4 tokens |
Despite generating shorter responses, rStar2-Agent-14B attains higher reasoning accuracy on these challenging problems, indicating that the model has learned to use coding tools more intelligently to reason more efficiently.
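To put the table in relative terms, the short calculation below computes how much shorter rStar2-Agent-14B's average responses are than each baseline's. It uses only the numbers in the table above; the helper name `reduction_pct` is illustrative.

```python
# Average response lengths in tokens (AIME24, AIME25), copied from the table above.
baselines = {
    "DeepSeek-R1-Zero (671B)": (14246.8, 17132.9),
    "QWQ-32B": (11868.4, 15865.4),
    "Qwen3-14B": (14747.6, 17521.9),
}
rstar2_agent = (9339.7, 10943.4)

def reduction_pct(baseline_len, ours_len):
    """Relative reduction in response length, as a percentage of the baseline."""
    return 100.0 * (baseline_len - ours_len) / baseline_len

reductions = {
    name: tuple(reduction_pct(b, o) for b, o in zip(lengths, rstar2_agent))
    for name, lengths in baselines.items()
}

for name, (aime24, aime25) in reductions.items():
    print(f"{name}: {aime24:.1f}% shorter on AIME24, {aime25:.1f}% shorter on AIME25")
```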
Generalization Performance
Despite being trained with math-only agentic reinforcement learning, rStar2-Agent-14B demonstrates strong generalization capabilities:
Task Category | Benchmark | DeepSeek-V3 | rStar2-Agent-14B |
---|---|---|---|
Science Reasoning | GPQA-Diamond | 59.1% | 60.9% |
Agentic Tool Use | BFCL v3 | 57.6% | 60.8% |
General Alignment | IFEval (strict prompt) | 86.1% | 83.4% |
General Alignment | Arena-Hard | 85.5% | 86.6% |
Notably, on the science reasoning benchmark GPQA-Diamond, despite no training on science data, rStar2-Agent-14B improves accuracy from 42.1% (after SFT) to 60.9%, surpassing DeepSeek-V3 by 1.8 percentage points. This shows that reasoning patterns learned from mathematics transfer effectively to general science reasoning.
Implementation and Setup
Installation
There are two approaches to installing the required dependencies for rStar2-Agent:
Option 1: Manual Installation
# Initialize and update submodules
git submodule init
git submodule update
# install verl
pip install "torch<2.8"
pip install -r verl/requirements_sglang.txt
pip install -e verl
# install code judge
pip install -r code-judge/requirements.txt
pip install -e code-judge
# install rstar2_agent
pip install -e .
Option 2: Automated Installation
bash install.sh
Code Judge Server Setup
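Security Warning: Code Judge executes arbitrary code. Always deploy in an isolated environment (preferably Docker) and never expose to external networks. Note that the Redis command below disables protected mode and binds to all interfaces, so the host itself must sit on a private network or behind a firewall.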
rStar2-Agent uses Code Judge as a tool call server to execute model-generated Python code:
1. Start Redis Server
redis-server --daemonize yes --protected-mode no --bind 0.0.0.0
2. Launch Code Judge Server
# Start the main server (master node only)
# Replace $WORKSPACE and $MASTER_ADDR with your actual paths
tmux new-session -d -s server \
'cd $WORKSPACE/code-judge && \
MAX_EXECUTION_TIME=4 \
REDIS_URI="redis://$MASTER_ADDR:6379" \
RUN_WORKERS=0 \
uvicorn app.main:app --host 0.0.0.0 --port 8088 --workers 16 \
2>&1 | tee server.log'
3. Start Code Judge Workers
# Launch workers (can be deployed on multiple nodes for increased parallelism)
# Adjust MAX_WORKERS based on your CPU count per node
tmux new-session -d -s worker \
'cd $WORKSPACE/code-judge && \
MAX_EXECUTION_TIME=4 \
REDIS_URI="redis://$MASTER_ADDR:6379" \
MAX_WORKERS=64 \
python run_workers.py \
2>&1 | tee worker.log'
Launch the vLLM Server
Start the vLLM server with:
vllm serve /path/to/your/model \
--host 0.0.0.0 \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser hermes
Replace /path/to/your/model with the actual path to your downloaded model.
Verify Server Status
Check if the server is running properly:
curl http://localhost:8000/v1/models
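The same check can be scripted. The sketch below assumes the server follows the OpenAI-compatible response shape for this endpoint (a JSON object whose `data` list holds entries with an `id` field); the helper names and the sample payload are illustrative.

```python
import json
from urllib.request import urlopen

def extract_model_ids(payload):
    """Pull model IDs out of an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]

def list_served_models(base_url="http://localhost:8000", timeout=5.0):
    """Query a running server and return the IDs of the models it serves."""
    with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))

# Illustrative payload in the assumed response shape (not captured output).
sample_payload = {
    "object": "list",
    "data": [{"id": "/path/to/your/model", "object": "model"}],
}
print(extract_model_ids(sample_payload))
```

Calling `list_served_models()` against a running server should return a non-empty list; a connection error usually means the server has not finished loading the model or is listening on a different port.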
Run Interactive Chat with Tool Calling
Use the provided script to interact with your model:
python examples/chat_with_tool_call.py \
--model /path/to/your/model \
--prompt "Solve the system of equations: 2x + 3y = 7, x - y = 1" \
--max_tokens 8192
Training Framework
Data Preparation
The training framework uses:
- Training Dataset: DAPO-17k (English subset)
- Test Dataset: AIME24
Preprocess the datasets with:
# Process AIME 2024 dataset
python data_preprocess/aime2024_rstar2_agent_loop.py
# Process DAPO dataset
python data_preprocess/dapo_rstar2_agent_loop.py
Model Setup
Download the base model (Qwen3-14B-Base):
huggingface-cli download Qwen/Qwen3-14B-Base --local-dir $HOME/models/Qwen3-14B-Base
Note: The base model requires instruction-following SFT before RL training for optimal performance.
Training Execution
Run the training script (for 8x A100/H100 GPUs):
bash examples/run_qwen3-14b_rstar2_agent_weave.sh
Adjust configuration parameters based on your hardware environment.
Configuration Settings
The framework supports various sampling strategies to improve training efficiency:
# Global Settings
augmentation.do_down_sampling=True # Enable down sampling
augmentation.down_sampling_config.down_sample_to_n=16 # Target number of traces per data point
# Sampling Strategies
augmentation.down_sampling_config.reject_equal_reward=True # Enable reject sampling for equal rewards
augmentation.down_sampling_config.roc_error_ratio=True # Resample correct traces by tool call error ratio
augmentation.down_sampling_config.roc_answer_format=True # Resample correct traces by answer format
# Minimum Trace Requirements
augmentation.down_sampling_config.min_zero_reward_trace_num=2 # Minimum negative traces to retain
augmentation.down_sampling_config.min_non_zero_reward_trace_num=2 # Minimum positive traces to retain
Advanced Reasoning Behaviors
Analysis of rStar2-Agent’s reasoning trajectories reveals two distinct patterns of high-entropy tokens that contribute to its success:
Forking Tokens for Exploration and Self-Reflection
The first pattern corresponds to forking tokens, which introduce uncertainty and trigger the model to self-reflect (e.g., “But before”, “double-check”) and verify intermediate steps (e.g., “rerun”, “re-evaluate”). These behaviors increase the likelihood of correcting possible errors and discovering correct solutions.
Reflection Tokens on Tool Call Responses
The second pattern emerges specifically from agentic reasoning. Upon receiving feedback from the code environment, the model generates sequences of high-entropy reflection tokens to analyze and interpret the coding execution results. This includes:
- Validating correct tool responses
- Diagnosing inconsistencies in error responses
- Exploring alternative solutions
- Refining reasoning based on environmental feedback
This behavior mirrors human-like reasoning in response to environment feedback, revealing more advanced cognitive capabilities than conventional long chain-of-thought approaches.
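Both patterns are described in terms of token entropy. As a minimal illustration of what "high-entropy token" means, the sketch below computes the Shannon entropy of each next-token distribution and flags positions above a threshold. The threshold value and the toy distributions are assumptions; the report's exact selection procedure is not given here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def high_entropy_positions(distributions, threshold):
    """Indices of positions whose next-token entropy exceeds the threshold."""
    return [
        i for i, probs in enumerate(distributions)
        if token_entropy(probs) > threshold
    ]

# Toy distributions over a four-token vocabulary, one per generated position.
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: a candidate forking point
    [0.90, 0.05, 0.05, 0.00],  # fairly confident
]
flagged = high_entropy_positions(distributions, threshold=1.0)
print(flagged)
```

In practice the distributions would come from the model's softmax output at each step, and flagged positions could then be inspected for phrases such as "double-check" or for reflections that follow a tool response.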
Comparative Analysis with Other Approaches
rStar2-Agent demonstrates clear advantages over other reinforcement learning methods:
Method | Model Size | Has Reasoning SFT? | Tools | AIME24 | AIME25 | RL Steps |
---|---|---|---|---|---|---|
DeepSeek-R1-Zero | 671B | ✗ | ✗ | 71.0% | 53.3% | >9K |
DAPO | – | ✗ | ✗ | 50.0% | 32.1% | >5000 |
ReTool-32B | 32B | ✓ | ✓ | 67.0% | 49.3% | 400 |
ZTRL-32B | 32B | ✗ | ✓ | 56.7% | 33.3% | 600 |
rStar2-Agent-32B | 32B | ✗ | ✓ | 69.4% | 57.3% | 700 |
rStar2-Agent-14B | 14B | ✗ | ✓ | 80.6% | 69.8% | 510 |
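Notably, rStar2-Agent-14B achieves the best AIME scores in this comparison with 510 RL steps, far fewer than the tool-free baselines and in the same range as other tool-using methods. It does so without reasoning-specific supervised fine-tuning, demonstrating the effectiveness of its agentic reinforcement learning approach.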
Practical Implications and Applications
The development of rStar2-Agent has significant implications for the future of AI reasoning systems:
Computational Efficiency
By achieving state-of-the-art performance with a 14B parameter model, rStar2-Agent demonstrates that sophisticated training methodologies can reduce the computational resources required for advanced reasoning capabilities. This makes frontier-level AI more accessible to organizations with limited computational budgets.
Transfer Learning Capabilities
The model’s strong performance on scientific reasoning tasks (GPQA-Diamond) despite math-only training suggests that the reasoning patterns learned through agentic reinforcement learning transfer effectively to other domains. This has promising implications for developing generally capable AI systems.
Tool Use and External Verification
rStar2-Agent’s ability to effectively use coding tools and incorporate environmental feedback represents a step toward more interactive and adaptable AI systems that can leverage external resources and verify their own reasoning processes.
Challenges and Limitations
Despite its impressive performance, rStar2-Agent faces certain limitations:
Capacity Ceilings
The researchers observed that continued RL training beyond 510 steps led to collapse in both policy and reward signals, suggesting that RL cannot reliably extend reasoning ability beyond what was acquired during pretraining. This highlights the importance of efficiently reaching the base model’s reasoning ceiling with minimal RL compute.
Domain Specificity
While the model demonstrates strong generalization, its training focused primarily on mathematical reasoning with integer answers. Performance on problems requiring different answer formats or domains may vary.
Infrastructure Requirements
The efficient training process still requires substantial resources (64 MI300X GPUs), which may be prohibitive for some organizations despite being more efficient than alternative approaches.
Future Directions
The rStar2-Agent approach opens several promising directions for future research:
Extension to Other Domains
Applying similar agentic reinforcement learning techniques to other domains beyond mathematics, such as scientific reasoning, code generation, or logical deduction.
Multi-Tool Environments
Expanding beyond Python coding tools to incorporate diverse tools and environments that could provide different types of feedback and capabilities.
Efficiency Improvements
Further optimizing the training process to reduce computational requirements while maintaining or improving performance.
Conclusion
rStar2-Agent represents a significant advancement in efficient reasoning for large language models. By combining innovative agentic reinforcement learning algorithms (GRPO-RoC), scalable infrastructure, and efficient training recipes, the approach demonstrates that smaller models can achieve frontier-level performance through smarter training methodologies rather than mere scale expansion.
The model’s strong mathematical reasoning capabilities, efficient use of tokens, and generalization to other domains highlight the potential of agentic reinforcement learning to develop more capable and efficient AI systems. As research in this area continues, we can expect to see further improvements in reasoning capabilities across diverse domains and applications.
For researchers and practitioners, rStar2-Agent provides both a proven methodology and open-source implementation to build upon, accelerating progress toward more intelligent and efficient AI systems that can reason effectively while managing computational resources responsibly.