# DeepSeek-R1: Enhancing Reasoning in Large Language Models via Reinforcement Learning

## Abstract

DeepSeek-R1 is an advanced large language model (LLM) developed by DeepSeek-AI that leverages reinforcement learning (RL) to autonomously evolve reasoning capabilities without heavy reliance on human-annotated data. The model demonstrates remarkable improvements in mathematical reasoning, code generation, and a variety of academic benchmarks; for instance, pure-RL training raises pass@1 accuracy on the AIME 2024 mathematics competition from an initial 15.6% to 77.9%. This article details the training methodology, experimental results, engineering insights, and limitations of DeepSeek-R1, along with open-source resources for replication.


## 1. Introduction

Reasoning capability is a cornerstone of human intelligence, encompassing tasks such as mathematical problem-solving, logical deduction, and programming. Recent advances in large language models have shown that, when scaled sufficiently, LLMs can exhibit emergent reasoning abilities. Techniques like Chain-of-Thought (CoT) prompting further enhance model performance by generating intermediate reasoning steps.

However, existing methods face notable limitations:

  • Heavy dependence on human-annotated reasoning traces limits scalability and introduces cognitive bias.
  • Models are constrained to human-like reasoning patterns, preventing the discovery of potentially superior, non-human reasoning pathways.

To tackle these challenges, DeepSeek-R1 introduces a reinforcement learning framework that uses rule-based reward signals (e.g., answer correctness, formatting consistency) to incentivize the autonomous development of reasoning strategies—eliminating the need for large-scale human supervision.


## 2. Methodology

### 2.1 Model Architecture and Training Foundation

  • Base Model: DeepSeek-R1 builds on DeepSeek-V3 Base, a multilingual (primarily Chinese and English) pre-trained Transformer model.
  • Training Framework: The model uses Group Relative Policy Optimization (GRPO), a streamlined and efficient variant of PPO (Proximal Policy Optimization), tailored for large-scale RL training.
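
For orientation, here is a simplified, sequence-level sketch of the GRPO update (the original formulation in the DeepSeekMath paper is token-level and differs in detail): for each prompt, a group of G outputs is sampled from the current policy, each output's reward is normalized within the group to form its advantage, and a PPO-style clipped objective with a KL penalty toward a reference policy is maximized, so no separate value (critic) network is needed.

```latex
% Group-relative advantage for output o_i with reward r_i (group size G):
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}

% Simplified sequence-level GRPO objective, with importance ratio \rho_i,
% clipping range \varepsilon, and KL coefficient \beta toward \pi_{\mathrm{ref}}:
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}, \qquad
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
  \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G}
    \min\!\bigl( \rho_i \hat{A}_i,\;
                 \operatorname{clip}(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon)\, \hat{A}_i \bigr)
    - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \right]
```

The KL coefficient β presumably corresponds to the KL divergence coefficient of 0.001 listed in Section 2.2.1 below.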

### 2.2 Reinforcement Learning Training Pipeline

#### 2.2.1 DeepSeek-R1-Zero: Pure RL Phase

  • Reward Design:

    • Rule-based rewards include answer accuracy and format consistency.
    • Final reward formulation (see the reward-function sketch after this list):

      Reward_rule = Reward_acc + Reward_format
      
  • Training Hyperparameters:

    • Learning rate: 3e-6
    • KL divergence coefficient: 0.001
    • Sampling temperature: 1.0
    • Generation length: 32,768 → 65,536 tokens (adjusted mid-training)
    • Training steps: 10,400 (~1.6 epochs)
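
The rule-based reward above can be illustrated with a minimal Python sketch. The `<think>...</think><answer>...</answer>` template and the helper names are assumptions for illustration, not the released training code; in practice the accuracy check would also accept mathematically equivalent answer forms.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows a <think>...</think><answer>...</answer>
    template (assumed here for illustration), else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """1.0 if the text inside the <answer> tags matches the reference answer, else 0.0.
    A real verifier would also accept mathematically equivalent forms."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Reward_rule = Reward_acc + Reward_format, as defined above."""
    return accuracy_reward(response, reference_answer) + format_reward(response)
```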

#### 2.2.2 DeepSeek-R1: Multi-Stage Alignment Training

DeepSeek-R1 extends R1-Zero with additional alignment stages:

  1. Cold-start data collection: Curated human-aligned conversational reasoning data.
  2. First-stage RL: Optimizes dialogue reasoning and language consistency.
  3. Rejection sampling + SFT: Incorporates both reasoning and non-reasoning data (see the rejection-sampling sketch after this list).
  4. Second-stage RL: Enhances helpfulness and safety using reward models.
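
As a rough illustration of step 3, the sketch below shows one way rejection sampling can be used to build SFT data: sample several candidate responses per prompt, keep only prompts where at least one candidate earns a positive reward, and use the best-scoring candidate as the SFT target. `generate_fn` and `reward_fn` are assumed interfaces, and the selection criterion is simplified relative to the actual pipeline.

```python
def rejection_sample_sft_data(prompts, generate_fn, reward_fn, samples_per_prompt=16):
    """Build SFT examples by keeping only high-reward samples (illustrative sketch)."""
    sft_examples = []
    for prompt in prompts:
        candidates = [generate_fn(prompt) for _ in range(samples_per_prompt)]
        scored = [(reward_fn(prompt, c), c) for c in candidates]
        best_score, best_response = max(scored, key=lambda pair: pair[0])
        if best_score > 0:  # discard prompts where no candidate was judged correct
            sft_examples.append({"prompt": prompt, "response": best_response})
    return sft_examples
```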

## 3. Experiments and Results

### 3.1 Evaluation Protocol

DeepSeek-R1 was evaluated on the following public benchmarks:

  • Mathematical reasoning: AIME 2024, CNMO 2024
  • Code generation: LiveCodeBench, Codeforces, SWE-bench
  • General language understanding: MMLU, MMLU-Pro, C-Eval, DROP
  • Instruction following and safety: IFEval, AlpacaEval 2.0, Arena-Hard

### 3.2 Quantitative Results

#### Mathematical Reasoning (AIME 2024)

| Model | pass@1 | cons@16 |
| --- | --- | --- |
| DeepSeek-R1-Zero | 77.9% | 86.7% |
| Human Average | ~60% | — |
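
The metrics in the table can be read as follows, assuming the commonly used definitions (the paper's exact protocol may differ in detail): pass@1 is estimated by averaging answer correctness over several samples per problem, and cons@16 scores the majority-vote answer across 16 samples. A minimal sketch:

```python
from collections import Counter

def pass_at_1(sampled_answers, reference):
    """Estimate pass@1 as the fraction of sampled answers that are correct."""
    return sum(answer == reference for answer in sampled_answers) / len(sampled_answers)

def cons_at_k(sampled_answers, reference):
    """Score the majority-vote (consensus) answer across k samples."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if majority_answer == reference else 0.0
```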

#### Multi-Task Language Understanding (MMLU-Pro)

| Model | Avg. Accuracy |
| --- | --- |
| DeepSeek-V3 Base | 68.2% |
| DeepSeek-R1 | 83.5% |

#### Code Generation (LiveCodeBench 2024)

| Model | Pass Rate |
| --- | --- |
| DeepSeek-V3 | 62.1% |
| DeepSeek-R1 | 78.3% |

### 3.3 Evolution of Reasoning Behavior

  • Increased response length: The model autonomously developed longer reasoning pathways.
  • Rise in reflective terminology: Words like “wait,” “verify,” and “error” became more frequent, indicating self-correction and reflection.
  • Mitigation of language mixing: Initial Chinese-English blending was reduced via a language consistency reward.
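
One simple way to quantify the "reflective terminology" trend is to track how often such terms appear per generated word across training checkpoints; the metric and word list below are illustrative assumptions, not the paper's measurement.

```python
import re

REFLECTIVE_TERMS = ("wait", "verify", "error")  # illustrative word list

def reflective_term_rate(responses):
    """Reflective terms per 1,000 generated words across a set of responses."""
    total_hits, total_words = 0, 0
    for text in responses:
        lowered = text.lower()
        total_hits += sum(len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
                          for term in REFLECTIVE_TERMS)
        total_words += len(lowered.split())
    return 1000.0 * total_hits / max(total_words, 1)
```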

## 4. Engineering and Deployment

### 4.1 Training Infrastructure

  • Hardware: NVIDIA A100 cluster supporting parallel generation of up to 8,192 sequences.
  • Training Framework: Built on PyTorch and HAI-LLM (internal distributed framework).
  • Inference Optimization: Leveraged vLLM with PagedAttention for higher throughput.
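
As an illustration of the inference setup, here is a minimal vLLM serving sketch (the checkpoint name and sampling settings are assumptions; PagedAttention is vLLM's default memory-management mechanism):

```python
from vllm import LLM, SamplingParams

# Checkpoint name is illustrative; substitute the model you intend to serve.
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", tensor_parallel_size=1)

sampling = SamplingParams(temperature=1.0, max_tokens=8192)
outputs = llm.generate(["What is the sum of the first 100 positive integers?"], sampling)
print(outputs[0].outputs[0].text)
```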

### 4.2 Model Distillation and Release

  • Several smaller distilled models were released, retaining strong reasoning capabilities at reduced computational cost.
  • All models, code, and data samples are open-sourced under the MIT license.
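
A distilled checkpoint can be loaded with Hugging Face Transformers as sketched below; the model name follows the released naming scheme but is given here as an assumption, and zero-shot prompting is used in line with the recommendation in Section 5.1.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Zero-shot prompt (few-shot exemplars tend to degrade performance; see Section 5.1).
messages = [{"role": "user", "content": "Prove that the sum of two even integers is even."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```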

## 5. Limitations and Future Work

### 5.1 Current Limitations

  • Structured output & tool use: Support for structured output is limited, and the model cannot invoke external tools (e.g., calculators, search engines).
  • Token efficiency: May “overthink” on simpler problems.
  • Multilingual support: Optimized mainly for Chinese and English; other languages may exhibit mixing.
  • Prompt sensitivity: Few-shot prompting degrades performance; zero-shot is recommended.

### 5.2 Future Directions

  • Integration of tool-augmented reasoning.
  • Improved reward modeling to avoid reward hacking.
  • Asynchronous evaluation and large-scale RL for software engineering tasks.

## 6. Frequently Asked Questions (FAQ)

Q: Is DeepSeek-R1 open-sourced?
✅ Yes. Model weights, training code, and inference scripts are available:

  • GitHub: https://github.com/deepseek-ai/DeepSeek-R1
  • Zenodo: https://doi.org/10.5281/zenodo.15753193

Q: How can I reproduce the results?
✅ Complete Docker environment and training scripts are provided in the repository README.

Q: Does the model support multimodal reasoning?
❌ Currently text-only; future versions may include multimodal support.

Q: How safe is the model?
✅ Safety evaluations show performance comparable to GPT-4o; combined with a risk-control system, the model achieves higher safety standards.


## References

  1. Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, NeurIPS 2022.
  2. Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, arXiv:2402.03300.
  3. Schulman et al., Proximal Policy Optimization Algorithms, arXiv:1707.06347.
  4. Guo et al., DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning, Nature, 2025.
  5. Open-Source Project: DeepSeek-R1 GitHub Repository, https://github.com/deepseek-ai/DeepSeek-R1

This article was authored by the DeepSeek-AI team, based on the Nature paper “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.” We are committed to advancing open-source AI research and development.
Authors: Daya Guo, Dejian Yang, Haowei Zhang, et al.
Affiliation: DeepSeek-AI
Source: https://www.nature.com/articles/s41586-025-09422-z