Advancing Math and Code Reasoning through Reinforcement Learning

Introduction

In the field of artificial intelligence, reasoning capability has always been a crucial benchmark for evaluating model performance. Following OpenAI’s introduction of training reasoning models using large-scale reinforcement learning (RL), significant progress has been made in this domain. However, the technical details required to reproduce the success of frontier models, such as data curation strategies and specific RL training recipes, are often omitted from reports. This leaves researchers scrambling to replicate their achievements.

Recent research indicates that for smaller models, distillation remains more effective than RL. In this work, we demonstrate that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We propose a simple yet effective approach: first training on math-only prompts, followed by training on code-only prompts. Notably, we find that math-only RL not only significantly boosts the performance of strong distilled models on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also on code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). Extended code-only RL iterations further improve code benchmark performance with minimal or no degradation in math results.

We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases, enabling verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pre-training and supervised fine-tuning (e.g., distillation) but also pushes the limits of the model’s reasoning ability, enabling it to solve previously unsolvable problems.

Background Knowledge

Research Status of Reasoning Models

In recent years, researchers have attempted to enhance model reasoning capabilities by incorporating mathematical and code data into the pre-training and supervised fine-tuning stages. However, these models still struggle with complex reasoning tasks in mathematical and code domains. With the open-sourcing of models like DeepSeek-R1, researchers have begun exploring the use of RL to boost reasoning capabilities. Nevertheless, most of these studies focus on specific domains, and the reported benchmark tests are limited, typically covering only AIME 2024/2025.

Limitations of Existing Research

  • Missing Key Technical Details: Reports on frontier models often omit data curation strategies and specific RL training methods, making it challenging for researchers to replicate their success.
  • Model Size and Method Limitations: Prior research shows that distillation outperforms RL for smaller models. As a result, existing studies predominantly apply RL to very large models (e.g., 671B or 253B parameters), leaving RL training recipes for small- and mid-sized models underexplored.

Proposal and Contributions of the AceReason-Nemotron Model

Research Objectives and Methods

Our goal is to demonstrate that large-scale RL can significantly enhance the reasoning capabilities of small- and mid-sized models. We propose a two-stage RL training approach: first training on math-only prompts, followed by code-only prompts. This ordering was initially adopted for training efficiency, since verifying code (by executing test cases) takes significantly longer than verifying math answers.
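A minimal sketch of this sequential schedule is shown below. The `run_rl_stage(checkpoint, prompts, reward_kind)` interface and the `"math_verifier"` / `"code_verifier"` labels are hypothetical placeholders for illustration, not the actual training code.

```python
from typing import Callable, Sequence

# Hypothetical trainer interface: (checkpoint, prompts, reward_kind) -> new checkpoint.
RLStage = Callable[[str, Sequence[str], str], str]

def train_two_stage(
    initial_ckpt: str,
    math_prompts: Sequence[str],
    code_prompts: Sequence[str],
    run_rl_stage: RLStage,
) -> str:
    """Math-only RL first, then code-only RL initialized from the math-RL checkpoint."""
    # Stage 1: math prompts. Answer verification is a cheap string/number check,
    # so many more rollouts fit into the same wall-clock budget.
    math_ckpt = run_rl_stage(initial_ckpt, math_prompts, "math_verifier")
    # Stage 2: code prompts. Rewards require executing test cases, which is far
    # slower per sample, hence this stage is scheduled second.
    return run_rl_stage(math_ckpt, code_prompts, "code_verifier")
```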

Data Collection and Verification

To support RL training, we developed a robust data curation and verification pipeline. For mathematical data, we combined DeepScaler and NuminaMath datasets, covering algebra, combinatorics, number theory, and geometry. We applied 9-gram filtering to avoid contamination of common math benchmarks and implemented filtering rules to exclude unsuitable data, such as multi-part questions, multiple-choice or true/false questions, and overly long or complex answers. We also used DeepSeek-R1 to verify and retain only majority-voted correct solutions, discarding ambiguous or OCR-error-prone questions. Our final dataset contains approximately 49,000 high-quality math problems suitable for RL training.
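As a rough sketch of the contamination check, the 9-gram filter can be implemented as below. Whitespace tokenization and lowercasing are simplifying assumptions here; the exact normalization used in the pipeline is not specified.

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 9) -> Set[Tuple[str, ...]]:
    """All n-grams of a text under simple whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_problems: Iterable[str], n: int = 9) -> Set[Tuple[str, ...]]:
    """Union of all n-grams appearing in the benchmark problem statements."""
    index: Set[Tuple[str, ...]] = set()
    for problem in benchmark_problems:
        index |= ngrams(problem, n)
    return index

def is_contaminated(candidate: str, benchmark_index: Set[Tuple[str, ...]], n: int = 9) -> bool:
    """Flag a training prompt if any of its 9-grams also appears in a benchmark problem."""
    return not ngrams(candidate, n).isdisjoint(benchmark_index)
```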

For code data, we curated our dataset from modern competitive programming platforms, ensuring strict selection criteria for high-quality coding problems. The dataset includes both function-calling and standard input/output formats, covering a wide range of algorithmic topics. We filtered out problems incompatible with standard output comparison and those requiring platform-specific templates to minimize false negative rewards. We also curated strong test cases to cover tricky edge cases, ensuring incorrect solutions would fail and eliminating potential false positive rewards. After rigorous filtering, 8,520 problems remained in our final training set.
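For standard input/output problems, a verification-based reward reduces to running the candidate program on every test case and comparing outputs within a time limit. The sketch below illustrates the idea only; it omits the sandboxing, memory limits, and special-judge handling a production harness would need, and is not the exact grading code used in training.

```python
import subprocess
import sys
from typing import List, Tuple

def code_reward(solution_path: str, tests: List[Tuple[str, str]], time_limit: float = 2.0) -> float:
    """Binary reward: 1.0 only if the program passes every (stdin, expected stdout) test."""
    for stdin_data, expected_stdout in tests:
        try:
            result = subprocess.run(
                [sys.executable, solution_path],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=time_limit,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # exceeding the time limit counts as a failure
        if result.returncode != 0 or result.stdout.strip() != expected_stdout.strip():
            return 0.0
    return 1.0
```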

Experimental Setup and Evaluation Methods

Our experiments start from strong SFT models, DeepSeek-R1-Distill-Qwen-7B and 14B, based on the Qwen2.5 model family and distilled from DeepSeek-R1. To ensure consistency and reproducibility, we follow the DeepSeek-R1 evaluation protocol, using a temperature of 0.6, top-p of 0.95, and a maximum output length of 32,768 tokens.

For math evaluation, we use diverse math competition benchmarks, including AIME 2024, AIME 2025, MATH500, HMMT Feb 2025, and BRUMO 2025. We report pass@1 performance averaged over k generations (avg@k). For small-scale benchmarks like AIME, we use k=64, following DeepSeek-R1.
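Concretely, avg@k is pass@1 averaged over k independent generations per problem; a minimal sketch:

```python
from typing import Sequence

def avg_at_k(is_correct: Sequence[Sequence[bool]]) -> float:
    """avg@k: mean over problems of (number of correct generations / k).

    is_correct[i][j] is True iff generation j of problem i was verified correct.
    """
    per_problem = [sum(gens) / len(gens) for gens in is_correct]
    return sum(per_problem) / len(per_problem)

# Example: 2 problems, k = 4 generations each -> (3/4 + 1/4) / 2 = 0.5
print(avg_at_k([[True, True, True, False], [False, True, False, False]]))
```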

For code evaluation, we test our AceReason-Nemotron models on the LiveCodeBench v5 and v6 subsets, which contain recently released AtCoder and LeetCode problems. We also report Codeforces Elo ratings and percentile numbers based on the LiveCodeBench Pro dataset, which includes Codeforces problems from July 2024 to December 2024. Additionally, we include evaluations on the EvalPlus benchmark.

Research Findings and Results

Significant Improvement in Reasoning Capabilities through RL

Our AceReason-Nemotron-7B and 14B models demonstrate that RL significantly enhances reasoning capabilities over initial SFT models (DeepSeek-R1-Distill-Qwen-7B/14B) in both math and coding tasks. Specifically, in math tasks, AceReason-Nemotron-7B improves accuracy by 14.5% on AIME 2024 and 14.6% on AIME 2025. In coding tasks, it achieves 14.2% and 8% accuracy improvements over DeepSeek-R1-Distill-Qwen-7B on LiveCodeBench v5 and v6. Meanwhile, AceReason-Nemotron-14B improves from 69.7%/50.2% to 78.6%/67.4% on AIME24/25 and from 53.1%/47.9% to 61.1%/54.9% on LiveCodeBench v5/v6, surpassing significantly larger SFT models like DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B.

Comparison with State-of-the-Art RL-Based Reasoning Models

When compared to state-of-the-art open RL-based reasoning models of the same parameter scale, AceReason-Nemotron models maintain their superiority. In math reasoning, AceReason-Nemotron-7B competes with strong RL-based models like Skywork-OR1-Math and Light-R1, while AceReason-Nemotron-14B delivers best-in-class results. In code generation, AceReason-Nemotron models outperform all open-sourced reasoning models of similar scale. Their math-code capabilities remain competitive even when compared to frontier reasoning models like QwQ-32B and o3-mini.

Comparison with State-of-the-Art Distillation-Based Models

AceReason-Nemotron-14B demonstrates better math and code performance than the latest specialized distilled models, such as OpenMath-14B/32B (+2.1%/+4.4% on AIME24/25) and OpenCodeReasoning-14B (+1.7%/+0.8% on LiveCodeBench v5/v6), showing that RL can achieve higher performance than distillation. At the 7B scale, AceReason-Nemotron-7B performs competitively with OpenCodeReasoning-7B on coding tasks, whereas OpenMath-7B retains an advantage in math reasoning. In all cases, AceReason-Nemotron models significantly outperform general-purpose reasoning models such as Llama-Nemotron-8B/49B, which are trained via distillation. This indicates that the effectiveness of distillation versus RL depends on model size and task domain, though RL offers the potential for significantly higher accuracy for models at 14B scale and above.

Detailed Analysis

Response Length Grows alongside Performance

The evolution of response length on the training set and corresponding AIME24 accuracy (AIME25 in the appendix) throughout the RL training of the AceReason-Nemotron-7B model reveals two distinct stages in the 8K→16K length-extension training strategy. In the first stage (8K), the model learns to compress its reasoning process to adapt to an 8K token limit, causing an initial drop in problem-solving capabilities compared to the baseline. However, after approximately 1K–2K RL steps, the model gradually recovers its performance. In the second stage (16K), upon extending the token limit to 16K, the model immediately exploits the extra reasoning token capacity. Within 500 RL steps, its average response length increases from 4K to around 6.5K tokens. Simultaneously, AIME24 accuracy sharply improves from 55% to 62%, after which both response length and accuracy plateau. Extending the maximum response length further to 24K and 32K demonstrates a similar trend.
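The length-extension curriculum can be summarized as a small stage table that the rollout worker consults for its generation cap. The token caps below follow the stages described above; the notion of a per-stage lookup is illustrative, not the exact training configuration.

```python
# Response-length curriculum for RL training: token cap per stage.
LENGTH_CURRICULUM = [
    (1, 8_192),    # stage 1: model compresses its reasoning to fit 8K tokens
    (2, 16_384),   # stage 2: response length and accuracy grow quickly
    (3, 24_576),   # stage 3: further extension
    (4, 32_768),   # stage 4: final cap, matching the 32,768-token evaluation limit
]

def max_response_tokens(stage: int) -> int:
    """Return the generation cap to use for a given curriculum stage."""
    for s, cap in LENGTH_CURRICULUM:
        if s == stage:
            return cap
    raise ValueError(f"unknown curriculum stage: {stage}")
```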

Hard Prompts Drive the Largest Gains

At the 24K-response-length RL stage, we built {Easy, Medium, Hard} prompt sets based on difficulty estimated from the 7B model’s performance over 16 attempts. The Easy set excludes prompts solved more than 14 times, the Medium set excludes those solved more than 10 times, and the Hard set excludes those solved more than 6 times. Our ablation study confirms that training with Hard prompts yields a 2.6% improvement on the AIME24 benchmark over training on the full prompt set and outperforms using the Easy and Medium sets, despite the Hard set containing only 2.2K prompts.
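A sketch of this difficulty-based pool construction, assuming per-prompt solve counts out of 16 attempts by the current model (the solve counts in the usage example are made up for illustration):

```python
from typing import Dict, List

def build_prompt_pool(solve_counts: Dict[str, int], max_solved: int) -> List[str]:
    """Keep prompts whose solve count (out of 16 attempts) is at most max_solved."""
    return [pid for pid, solved in solve_counts.items() if solved <= max_solved]

# Hypothetical solve counts per prompt id.
counts = {"p1": 16, "p2": 12, "p3": 7, "p4": 2}
easy = build_prompt_pool(counts, max_solved=14)    # drops only near-saturated prompts
medium = build_prompt_pool(counts, max_solved=10)
hard = build_prompt_pool(counts, max_solved=6)     # the ~2.2K-prompt Hard set in our setup
```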

Starting from 8K Improves Later Performance

Although training initially at an 8K maximum response length is faster per step than starting at 16K or 24K, the early decline in AIME24 accuracy raises the question of whether this stage is necessary for 7B-sized models. An ablation study comparing three strategies (8K→16K→24K, 16K→24K, and training directly at 24K) reveals that while the 16K→24K strategy initially outperforms the baseline, the 8K→16K→24K strategy quickly catches up during the 16K stage and consistently outperforms the other strategies thereafter. Conversely, starting directly at 24K yields minimal performance gains and significantly longer training times, suggesting it is ineffective for RL training of 7B models.

Interplay of Math-only RL and Code-only RL

We investigated the effects of sequential Math-RL and Code-RL training stages on developing strong reasoning models in both math and code domains. Starting from DeepSeek-R1-Distill-Qwen-7B/14B as our initial models, we first applied Math-RL and then Code-RL, evaluating performance on standard math (AIME 24/25) and code (LiveCodeBench v5/v6) benchmarks. Surprisingly, initial Math-RL training not only significantly improves accuracy on math benchmarks but also enhances performance on coding benchmarks: initializing Code-RL from the math-RL checkpoint provides a substantially better starting point and results in significantly higher final accuracy on LiveCodeBench. Subsequent Code-RL training further boosts coding benchmark scores with minimal degradation in math benchmark performance. This indicates that our proposed sequential training strategy effectively cultivates models with strong reasoning abilities across both math and coding domains.

Topic-wise Accuracy Analysis on LiveCodeBench v5

Given the observation that both Math-RL and Code-RL enhance code generation ability on coding benchmarks, we conducted ablation studies on the LiveCodeBench v5 dataset, which comprises coding problems from the AtCoder and LeetCode platforms. While LeetCode problems come with human-annotated topic tags (e.g., Greedy, Math, DFS), there are no tags for AtCoder problems. To address this, we first extracted a set of problem tags from LeetCode. Then, for each AtCoder problem, we queried the o4-mini-high model to infer candidate tags given the problem statement and the set of topics. We grouped all LiveCodeBench v5 problems by their assigned topics and evaluated model performance (avg@8 accuracy) for each topic group.
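The topic-wise breakdown itself is a simple grouping of per-problem avg@8 accuracies by assigned tags; a sketch, under the assumption that a multi-tag problem contributes to each of its topic groups:

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def topic_accuracy(
    problem_tags: Dict[str, Sequence[str]],   # problem id -> assigned topic tags
    problem_avg_at_8: Dict[str, float],       # problem id -> avg@8 accuracy
) -> Dict[str, float]:
    """Mean avg@8 accuracy per topic group."""
    per_topic: Dict[str, List[float]] = defaultdict(list)
    for pid, tags in problem_tags.items():
        for tag in tags:
            per_topic[tag].append(problem_avg_at_8[pid])
    return {tag: sum(vals) / len(vals) for tag, vals in per_topic.items()}
```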

We compared the performance of our initial SFT models, DeepSeek-R1-Distill-Qwen-7B/14B, against corresponding AceReason-Nemotron-7B/14B models after applying Math-RL alone and after incorporating both Math-RL and Code-RL. As shown in Figure 4, applying math-only RL enhances model performance across all coding problem topics, with particularly strong gains in algorithmic and math-related areas such as Math, Counting, and Combinatorics—domains that rely heavily on mathematical concepts and general reasoning abilities. Furthermore, for topics like Simulation, String, and Graph, which depend more on coding implementation and data structure skills, Code-RL leads to significant further improvement.

False Positive and False Negative Rewards in Code RL Training

To highlight the importance of eliminating false positive rewards (incorrect code passing all tests within time constraints) and false negative rewards (incorrect test cases that fail correct code) in RL training, we conducted two ablation experiments. To simulate false negative rewards, we introduced into the training data a subset of problems whose test-case correctness could not be verified or whose official “golden” solution failed to pass all provided tests. Under these conditions, the RL model tends to converge to a sub-optimal point, with no further improvement on the evaluation set. We hypothesize that such false negative rewards cause the model to discard correct algorithms for certain training problems. Consequently, when encountering test problems that share similar underlying concepts and require these discarded algorithms, the model remains unable to identify the correct solution, thereby degrading its performance.

To simulate RL training with false positive rewards, we incorporated problems with weak test cases that allowed incorrect solutions to pass into our training set. Notably, we examined DeepCoder’s RL training set and found that despite their claimed strong test case selection criteria, their tests still failed to cover all tricky edge cases—there exist incorrect solutions that can pass their selected test cases but subsequently fail on official test cases. After adding unfiltered Code-RL training data from these sources to our problem set, we observed a sudden drop in testing set accuracy midway through the training process. We suspect that this performance degradation occurs because the model learns to solve some training problems without considering all corner cases or by generating brute-force solutions with incorrect time complexity. This wrongly learned policy is then penalized when evaluated against strong test cases in the testing set.
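Both failure modes can be screened for at data-curation time with two checks: the golden solution must pass every test (otherwise the tests can produce false negatives), and each known-incorrect solution must fail at least one test (otherwise the tests are weak enough to produce false positives). A sketch of these checks, with `run_solution` as a hypothetical executor passed in by the caller:

```python
from typing import Callable, List, Tuple

TestCase = Tuple[str, str]                 # (stdin, expected stdout)
Runner = Callable[[str, TestCase], bool]   # returns True iff the program passes the test

def keep_problem(
    golden_solution: str,
    incorrect_solutions: List[str],
    tests: List[TestCase],
    run_solution: Runner,
) -> bool:
    """Keep a problem only if its test suite is both correct and strong."""
    # Guard against false negatives: the official solution must pass every test.
    if not all(run_solution(golden_solution, t) for t in tests):
        return False
    # Guard against false positives: every known-incorrect solution must fail
    # at least one test; otherwise the test suite is too weak.
    for bad in incorrect_solutions:
        if all(run_solution(bad, t) for t in tests):
            return False
    return True
```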

Does RL Improve pass@k or pass@1 over the SFT Model?

Recent studies suggest that RL primarily improves pass@1 accuracy over the SFT model without significantly impacting pass@k. However, our results demonstrate that RL consistently enhances pass@k scores (from k=8 to k=64) for both 7B and 14B models on LiveCodeBench v5 and v6, with improved pass@k scores maintaining a 10% margin from pass@8 to pass@64. On AIME24/25, the 14B model also shows better pass@k performance, although the gap on AIME25 for the 7B model is closed by the SFT model at pass@64.
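The text does not spell out the pass@k estimator; a standard choice is the unbiased estimator of Chen et al. [6], computed from n samples per problem of which c are correct. A sketch under that assumption:

```python
from math import comb
from typing import Sequence, Tuple

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct (Chen et al. [6])."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every size-k subset contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds per-problem (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```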

Where does RL Improve over the SFT Model?

Comparing the problem-level accuracies of the initial 7B SFT model with AceReason-Nemotron-7B after RL on LiveCodeBench v5/v6 and AIME 2024/2025 reveals that RL unlocks a long tail of difficult coding problems that the distilled model fails to solve in 64 attempts, adding 30 and 23 additional solvable problems to LiveCodeBench v5 and v6. It also significantly improves performance on challenging problems where the SFT model has less than 20% accuracy. On AIME, for the most challenging problems with zero solve rates, RL enables the model to solve 3 more problems on AIME24. In conclusion, we find that RL not only improves accuracy on problems with high solve rates but also extends the boundary to solve hard problems that the SFT model was initially unable to solve.

Conclusion

We have demonstrated that large-scale reinforcement learning can substantially enhance the reasoning capabilities of strong small- and mid-sized SFT models. Our proposed approach of first conducting RL on math prompts and then on code prompts has proven effective. Notably, math-only RL significantly boosts performance not only on math benchmarks but also on code reasoning tasks. Crucially, subsequent code-only RL further improves code benchmark performance with minimal to no degradation in math results. To support this process, we developed a robust data curation pipeline that collects challenging prompts with high-quality, verifiable answers and test cases, enabling verification-based RL across both domains. Several key empirical insights from RL training reveal that RL not only elicits the foundational reasoning capabilities acquired during pre-training and supervised fine-tuning but also pushes the limits of the model’s reasoning ability, enabling it to solve previously unsolvable problems.

Acknowledgement

We extend our gratitude to the NVIDIA Nemo team for valuable discussions and collaborations on building reasoning models. Special thanks to Boris Ginsburg, Oleksii Kuchaiev, Igor Gitman, Wei Du, Somshubra Majumdar, Siddhartha Jain, Jiaqi Zeng, Yi Dong, Alexander Bukharin, Olivier Delalleau, Tugrul Konuk, Vahid Noroozi, and Jonathan Cohen.

References

[1] Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025.
[2] Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025.
[3] Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/.
[4] Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-Nemotron: Efficient Reasoning Models. arXiv preprint arXiv:2505.00949, 2025.
[5] Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, and Haoxiang Wang. Bridging supervised learning and reinforcement learning in math reasoning. arXiv preprint, 2025.
[6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[8] Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. arXiv preprint arXiv:2410.07985, 2024.
[9] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[10] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
[11] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[12] Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner series, 2025. Notion Blog.
[13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021.
[14] HuggingFace. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1.
[15] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.27974, 2024.
[16] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[18] Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. https://huggingface.co/AI-MO/NuminaMath-CoT, 2024.
[19] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
[20] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[21] Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451, 2024.
[22] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=1qvx610Cu7.
[23] Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. Evaluating language models for efficient code generation. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=IBCBMeAhmC.
[24] Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. AceMath: Advancing frontier math reasoning with post-training and reward modeling. arXiv preprint arXiv:2412.15084, 2024.
[25] Michael Luo, Sijun Tan, Roy Huang, Xiaoxiang Shi, Rachel Xin, Colin Cai, Ameen Patel, Alpay Ariyak, Qingyang Wu, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepcoder: A fully open-source 14b coder at o3-mini level, 2025. Notion Blog.
[26] Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL, 2025. Notion Blog.
[27] Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. arXiv preprint arXiv:2412.09413, 2024.
[28] Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. arXiv preprint arXiv:2504.16891, 2025.
[29] OpenAI. Learning to reason with LLMs, 2024.
[30] Team Qwen. Qwen3, April 2025. URL https://qwenlm.github.io/blog/qwen3/.
[31] Team Qwen. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/.
[32] Ant Research RL Lab. Areal: Ant reasoning rl. https://github.com/inclusionAI/AReaL, 2025.
[33] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepseekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[35] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.
[36] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[37] Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, et al. Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond. arXiv preprint arXiv:2503.10460, 2025.
[38] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[39] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.
[40] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
[41] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025.
[42] Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. Acecoder: Acing coder rl via automated test-case synthesis. arXiv preprint arXiv:2502.01718, 2025.
[43] Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Stanley Wei, Hansen He, Dongruixuan Li, Hangyi Hao, Jianzhu Yao, Peiyao Sheng, Zixuan Wang, Wenhao Chai, Peter Henderson, Aleksandra Korolova, Pramod Viswanath, Saining Xie, and Jingbo Shang. Livecodebench pro: How olympiad medalists view llms in competitive programming?, 2025. URL https://livecodebenchpro.com/. Available at LiveCodeBench Pro website.

Appendix

Math Evaluation Instructions

Please reason step by step, and put your final answer within \boxed{}.

Example of AceReason-Nemotron-14B Response to Simple Query

User: hi
Model:
Alright, the user wrote “hi” in the middle of the message. Probably just saying hello. I should respond in a friendly and welcoming way. Maybe ask how I can assist them today. Keep it open-ended to encourage them to share what they need help with.

Hello! How can I assist you today?

User: How many strawberries do I need to collect 9 r’s?
Model:
First, I need to understand what the user is asking. They want to know how many strawberries are needed to collect 9 r’s. I interpret r’s as the letter ’r’. Each strawberry is associated with a certain number of ’r’s. Looking at the word strawberry, I count the number of ’r’s it contains. There are three ’r’s in strawberry. If one strawberry provides three ’r’s, then to get a total of nine ’r’s, I divide the total needed by the number each strawberry provides. So, nine divided by three equals three. Therefore, the user needs to collect three strawberries to have nine ’r’s.

To determine how many strawberries you need to collect 9 “r”s, let’s break it down step by step.

Step 1: Understand the Context

  • It seems you’re referring to the letter “r” within the word strawberry.

Step 2: Count the r’s in strawberry

  • Let’s look at the word strawberry:
  • s t r a w b e r r y
  • r, r, r (there are 3 r’s)

Step 3: Calculate the Number of Strawberries Needed

  • Each strawberry contains 3 r’s.
  • To collect 9 r’s, you would need:
    \[
    \text{Number of strawberries} = \frac{\text{Total r's needed}}{\text{r's per strawberry}} = \frac{9}{3} = 3
    \]

Final Answer

\[
\boxed{3}
\]

Additional Math-RL Analysis

In Figure 8, we show accuracy on AIME 2025 during math RL training. We find that stage 2 (16K) training leads to a large performance improvement within 300 steps. In Figure 9, we plot boxplots of AIME 2024 and AIME 2025 accuracy for the 7B and 14B AceReason models compared to DeepSeek-R1-Distill-Qwen. The median and the 25th/75th percentile accuracy of AceReason over 64 generations improve significantly over the distilled SFT model. Finally, in Figure 10, we inspect the problem-level solve rate for AceReason-Nemotron-14B on AIME 2024 and AIME 2025. On AIME24 and AIME25, the AceReason model solves one additional problem. We also observe large gains on problems with higher initial accuracy, showing the benefits of RL training.

Code-RL Dataset Curation Details

We collected our Code-RL training data from various modern competitive programming platforms (e.g., AtCoder, LeetCode, and Aizu) that provide strong public test cases. Most open-sourced coding datasets (e.g., TACO, APPS) suffer from noisy problem statements, self-contamination, and weak LLM-synthesized test cases that are unreliable for RL training. To ensure data quality, we applied strict filtering rules, removing:

  1. Multi-solution or interactive problems that require special judges or other external tools;
  2. Problems where images within the statement obscure a clear understanding;
  3. Problems containing incorrect test cases or those lacking golden solutions;
  4. Problems with weak test cases that allow incorrect solutions to pass all tests.

Furthermore, to prevent self-contamination within our collected problem set, we performed strict matching on problem statements and source URLs. To avoid any potential contamination of our test set, we excluded all problems released after 2024-08-01 and applied n-gram matching (with n=14) against the test-set problem statements.

To prepare for subsequent curriculum training in Code-RL, we estimated the relative difficulty of the collected problems. We deployed DeepSeek-R1 (671B) locally, let it generate 8 attempts per problem, and assigned each problem a corresponding difficulty score ranging from 0 to 8. Problems that DeepSeek-R1 failed to solve in all 8 attempts were excluded from the training set. Following this aggressive filtering process, 8,520 problems remained, forming our final Code-RL training set.
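A sketch of this difficulty filter, under the assumption that the difficulty score is the number of failed attempts out of 8 (the exact score mapping is not stated above):

```python
from typing import Dict, List, Sequence

def difficulty_score(attempts_passed: Sequence[bool]) -> int:
    """Assumed mapping: difficulty = number of failed attempts (0 = easiest, 8 = unsolved)."""
    return len(attempts_passed) - sum(attempts_passed)

def filter_code_rl_problems(attempt_results: Dict[str, Sequence[bool]]) -> List[str]:
    """Keep problems solved at least once in 8 attempts (i.e., difficulty < 8)."""
    return [pid for pid, flags in attempt_results.items() if difficulty_score(flags) < 8]
```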