
Seed-Thinking-v1.5: How a 200B Reasoning Model Surpasses DeepSeek R1 Through Reinforcement Learning

Technical Analysis and Application Prospects of ByteDance Seed-Thinking-v1.5: A Breakthrough Reasoning Model

Introduction: A Milestone in the Evolution of Reasoning Models

In April 2025, ByteDance officially released the Seed-Thinking-v1.5 reasoning model, which achieved significant breakthroughs in mathematical competitions, programming tasks, and scientific Q&A with a Mixture-of-Experts (MoE) architecture of 200 billion total parameters (20 billion activated per forward pass). Its core innovation lies in overcoming the performance bottlenecks of traditional large models on complex reasoning tasks through stability optimizations of the reinforcement learning (RL) framework and the fusion of high-quality data. This article analyzes the innovative value of the model, from its technical architecture and training methods to its measured performance.


I. Core Architecture and Technological Innovation

1.1 Lightweight Design of the Mixture-of-Experts (MoE) Architecture

Seed-Thinking-v1.5 adopts a dynamically activated MoE architecture with 200 billion total parameters, of which only 20 billion are activated per inference pass. This design preserves model capacity while significantly reducing computational cost. The core mechanisms are:

  • Dynamic Routing Mechanism: Automatically selects the most relevant expert modules based on the type of input question (a minimal gating sketch follows this list)
  • Layered Computation Optimization: Assigns tasks such as mathematical reasoning and code generation to dedicated subnetworks
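To make the routing idea concrete, here is a minimal sketch of a top-k gated MoE layer in PyTorch. The layer sizes, expert count, and class name are illustrative assumptions, not the actual Seed-Thinking implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal top-k gated mixture-of-experts layer (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 16, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its top-k experts.
        scores = self.gate(x)                              # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)           # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the selected experts run for each token, which is what keeps the activated parameter count far below the total parameter count.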

1.2 Data-Driven Training Paradigm

The foundation of the model's performance gains is the strict curation and augmentation of high-quality training data:

1.2.1 Construction of the STEM Problem Library

  • Source: International mathematical/physics/chemistry competition questions, open-source datasets, and manually constructed difficult problems
  • Cleaning Process:
    1. Eliminate questions with ambiguous statements or questionable reference answers
    2. Filter out overly simple questions via model self-verification (Doubao-Pro 1.5 samples multiple answers per question; a filtering sketch follows this list)
    3. Secondary review of controversial cases by human experts
  • Data Augmentation: Convert multiple-choice questions into fill-in-the-blank questions to prevent guessing, and restructure questions to increase reasoning complexity
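The filtering step can be sketched as a simple pass-rate filter: sample several answers from a strong base model, drop questions it already solves almost every time, and flag questions whose sampled answers disagree for human review. The `sample_answers` helper, the thresholds, and the exact-match comparison below are illustrative assumptions, not the actual pipeline.

```python
from collections import Counter

def is_too_easy(question: str, reference: str, sample_answers, n_samples: int = 8,
                pass_threshold: float = 0.9) -> bool:
    """Drop questions the base model already solves almost every time.

    `sample_answers(question, n)` is a hypothetical helper returning n independently
    sampled answers from a strong base model (e.g. Doubao-Pro 1.5).
    """
    answers = sample_answers(question, n_samples)
    correct = sum(1 for a in answers if a.strip() == reference.strip())
    return correct / n_samples >= pass_threshold

def has_consistent_answer(question: str, sample_answers, n_samples: int = 8,
                          agreement: float = 0.5) -> bool:
    """Flag ambiguous questions: if no single answer dominates, route to human review."""
    answers = [a.strip() for a in sample_answers(question, n_samples)]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / n_samples >= agreement
```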

1.2.2 Programming and Logic Data

  • Code Tasks: Problems selected from competitive programming platforms such as Codeforces, with unit tests and a sandboxed validation environment (a sandbox sketch follows this list)
  • Logic Puzzles: On the order of 100,000 automatically generated puzzles such as Sudoku and mazes, with dynamically adjustable difficulty
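As a rough illustration of sandbox validation, the sketch below runs a candidate solution together with its unit tests in a separate process under a timeout. A production sandbox would additionally restrict filesystem, network, and memory access; this minimal version only shows the control flow.

```python
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(solution_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run a candidate solution plus its unit tests in a separate process."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0      # non-zero exit means a failed test or a crash
    except subprocess.TimeoutExpired:
        return False                       # treat timeouts as failures
    finally:
        os.unlink(path)
```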

II. Stability Breakthroughs in Reinforcement Learning Algorithms

2.1 The VAPO and DAPO Dual Frameworks

To address the tendency of traditional RL training to collapse, the team proposes two frameworks:

  • VAPO (Value Augmented Proximal Policy Optimization): built on a value model, suited to verifiable tasks (such as mathematical problems)
  • DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): works without a value model, focused on unstructured tasks (such as creative writing)

Experiments show that the two methods compress performance fluctuations across training runs from ±10% to within ±1%.

2.2 Optimization of Five Key Technologies

  1. Length-Adaptive GAE: Dynamically adjusts credit assignment based on response length to balance training on short and long sequences
  2. Token-Level Loss: Computes each token's contribution individually to avoid gradient dilution in long texts
  3. Clip-Higher PPO: Loosens the upper clipping bound on policy updates to encourage exploration of low-probability tokens (items 1-3 are sketched in code after this list)
  4. Online Data Distribution Adaptation: Dynamically adjusts the mix of training data based on current model capability
  5. Mixed-Precision Training: FP8 quantization reduces memory usage by 40%
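To make items 1-3 more concrete, below is a minimal PyTorch-style sketch of a length-adaptive GAE lambda schedule and a token-level PPO loss with an asymmetric ("clip-higher") clipping range. The function names, the lambda schedule, and the default clip bounds are illustrative assumptions, not the released training code.

```python
import torch

def length_adaptive_lambda(seq_len: int, alpha: float = 0.05) -> float:
    """Illustrative schedule: longer responses get a lambda closer to 1,
    so credit propagates further back along the sequence."""
    return 1.0 - 1.0 / (alpha * max(seq_len, 1) + 1.0)

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    """Standard generalized advantage estimation over one response."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def token_level_ppo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                         adv: torch.Tensor, clip_low: float = 0.2,
                         clip_high: float = 0.28) -> torch.Tensor:
    """PPO loss averaged over all tokens in the batch (token-level), with a looser
    upper clip bound ('clip-higher') to keep exploring low-probability tokens."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * adv
    clipped = ratio.clamp(1.0 - clip_low, 1.0 + clip_high) * adv
    return -torch.minimum(unclipped, clipped).mean()
```

Averaging over tokens rather than over sequences keeps long responses from being under-weighted, which is the "gradient dilution" problem item 2 refers to.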

III. The Dual Verification Mechanism of the Reward Model

3.1 Seed-Verifier: Rule-Based Equivalence Judgment

  • Principle: Checks whether the model output and the reference answer are mathematically equivalent, even when the same value is written in a different form such as 524288 (a minimal equivalence-check sketch follows this list)
  • Advantages: Fast processing, with training-set accuracy above 98%
  • Limitations: Prone to misjudgment in edge cases (such as problems with multiple valid solutions)
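A minimal sketch of what rule-based equivalence checking might look like, using sympy to test whether two answer strings denote the same value. The actual Seed-Verifier rules are not public; this is only an analogy.

```python
from sympy import simplify, sympify
from sympy.core.sympify import SympifyError

def math_equivalent(reference: str, prediction: str) -> bool:
    """Rule-based equivalence check: exact string match first, then try to show
    symbolically that the two expressions differ by zero (e.g. '2**19' vs '524288')."""
    ref, pred = reference.strip(), prediction.strip()
    if ref == pred:
        return True
    try:
        return simplify(sympify(ref) - sympify(pred)) == 0
    except (SympifyError, TypeError):
        return False   # cannot parse one of the answers: treat as non-equivalent
```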

3.2 Seed-Thinking-Verifier: Chain-of-Reasoning Verification

  • Innovation Points: Simulates human step-by-step analysis and generates a verification reasoning chain (see the case studies in the appendix and the prompt-style sketch after this list)
  • Performance Improvement:
    • Accuracy on the human-annotated test set rises from 82.7% to 99.3%
    • Effectively prevents reward hacking
    • Resolves ambiguity caused by formatting differences
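As a rough illustration (not the actual verifier prompt), a chain-of-reasoning verifier can be driven by a prompt that asks for step-by-step analysis followed by a single machine-readable verdict line:

```python
# Hypothetical prompt template and verdict parser for a reasoning verifier.
VERIFIER_PROMPT = """You are a careful grader. Given a problem, a reference answer,
and a model answer, reason step by step about whether the two answers are equivalent,
then finish with a single line: VERDICT: YES or VERDICT: NO.

Problem: {problem}
Reference answer: {reference}
Model answer: {prediction}
"""

def parse_verdict(verifier_output: str) -> bool:
    """Extract the final binary verdict from the verifier's reasoning chain."""
    last_line = verifier_output.strip().splitlines()[-1].strip().upper()
    return last_line.endswith("YES")
```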

3.3 Reward Modeling for Non-Verifiable Tasks

For creative writing and other subjective tasks, the team adopts a pairwise generative reward model:

  • Generates relative ratings by comparing the strengths and weaknesses of two replies (see the sketch after this list)
  • Avoids the tendency of traditional reward models to over-weight irrelevant details
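A minimal sketch of how a pairwise generative reward model could be wired into training, assuming a hypothetical `judge` callable that compares two replies and returns "A", "B", or "TIE". Averaging over both presentation orders is a common trick to reduce position bias in the judge; it is an assumption here, not a documented detail of the model.

```python
def pairwise_reward(prompt: str, reply_a: str, reply_b: str, judge) -> float:
    """Return +1 if the judge prefers A, -1 if it prefers B, 0 on a tie.

    `judge(prompt, reply_a, reply_b)` is a hypothetical generative reward model call.
    """
    verdict = judge(prompt, reply_a, reply_b)
    return {"A": 1.0, "B": -1.0}.get(verdict, 0.0)

def debiased_pairwise_reward(prompt: str, reply_a: str, reply_b: str, judge) -> float:
    """Average over both presentation orders to reduce position bias in the judge."""
    first = pairwise_reward(prompt, reply_a, reply_b, judge)
    second = -pairwise_reward(prompt, reply_b, reply_a, judge)
    return (first + second) / 2.0
```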

IV. The Efficiency Revolution in Infrastructure

4.1 Streaming Rollout System (SRS)

  • Asynchronous Trajectory Generation: Splits the complete rollout into segments that are processed in parallel (see the sketch after this list)
  • Dynamic Resource Scheduling: Automatically allocates compute units based on generation length
  • Effect: The RL training cycle is shortened to one third of that of traditional methods
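A highly simplified sketch of segmented, asynchronous rollout: each trajectory is generated in fixed-size chunks via a hypothetical `generate_segment` coroutine, and many rollouts run concurrently so short responses free capacity instead of waiting for the longest one. This illustrates the scheduling idea only, not the actual SRS implementation.

```python
import asyncio

async def rollout_in_segments(prompt: str, generate_segment, max_segments: int = 8) -> str:
    """Generate one trajectory in fixed-size segments instead of one blocking call.

    `generate_segment(text)` is a hypothetical async call that extends `text` by up to
    N tokens and reports whether the model emitted an end-of-sequence token.
    """
    text = prompt
    for _ in range(max_segments):
        text, done = await generate_segment(text)
        if done:
            break
    return text

async def collect_trajectories(prompts, generate_segment):
    """Run many segmented rollouts concurrently; short responses finish early and
    free capacity instead of blocking on the longest response in the batch."""
    tasks = [rollout_in_segments(p, generate_segment) for p in prompts]
    return await asyncio.gather(*tasks)
```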

4.2 Hybrid Parallelism Architecture

  • Expert Parallelism (EP): MoE-layer experts are dynamically assigned to different GPUs (a toy placement sketch follows this list)
  • Tensor Parallelism (TP): Attention-layer parameters are computed in a distributed fashion
  • Sequence Parallelism (SP): Long contexts are processed in chunks
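As a toy illustration of expert parallelism, the sketch below spreads MoE experts round-robin over the GPUs of an expert-parallel group. The real placement policy in Seed-Thinking is not public; this only shows the basic idea of partitioning experts across devices.

```python
def assign_experts_to_ranks(n_experts: int, ep_world_size: int) -> dict:
    """Illustrative expert-parallel placement: distribute MoE experts evenly
    (round-robin) over the GPUs in an expert-parallel group."""
    placement = {rank: [] for rank in range(ep_world_size)}
    for expert_id in range(n_experts):
        placement[expert_id % ep_world_size].append(expert_id)
    return placement

# Example: 16 experts over 4 GPUs -> {0: [0, 4, 8, 12], 1: [1, 5, 9, 13], ...}
```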

4.3 Automatic Optimization System

  • Memory Management: Inter-layer recomputation plus activation offloading, supporting larger-batch training (a recomputation sketch follows this list)
  • Failure Recovery: ByteCheckpoint technology enables seamless resumption of training from checkpoints
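Inter-layer recomputation can be illustrated with PyTorch's built-in activation checkpointing, which recomputes a block's activations during the backward pass instead of storing them, trading compute for memory. This is a generic sketch, not the ByteCheckpoint or offloading stack used in production.

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    """Wraps a transformer block so its activations are recomputed in the backward
    pass instead of being kept in memory (inter-layer recomputation)."""

    def __init__(self, block: torch.nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return checkpoint(self.block, x, use_reentrant=False)
```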

V. Multi-Domain Performance Evaluation

5.1 Mathematical Reasoning Ability

Test Set      Seed-Thinking-v1.5   DeepSeek R1   OpenAI o3
AIME 2024     86.7%                79.8%         87.3%
BeyondAIME    48.0%                42.4%         63.6%

  • Highlight: On BeyondAIME, the ultra-high-difficulty question bank built by the team, the gap to the top model has been significantly narrowed

5.2 Programming Task Performance

  • Codeforces pass@8: 55.0% (surpassing DeepSeek R1's 45.0%)
  • Practical Verification: 92% of generated code passes offline sandbox testing, closely matching the results of submissions to the platform

5.3 Scientific Knowledge and Logic

  • GPQA Diamond-Level Questions: 77.3% accuracy, approaching the level of human experts
  • ARC-AGI Logic Reasoning: 39.9%, reaching the current SOTA performance

VI. Open-Source Plan and Industry Impact

6.1 Standardization of the Evaluation System

  • BeyondAIME and Codeforces Evaluation Sets: The team plans to open-source 100 original math problems and 12 programming-competition datasets
  • Significance: Provides a reproducible difficulty benchmark for the industry and reduces the risk of model overfitting

6.2 Insights from the Technical Path

  • RL Stability Recipe: The VAPO/DAPO frameworks can be transferred to the training of other large models
  • Hybrid Architecture Design: Provides a new paradigm for putting models at the 200B-parameter scale into practical use

Conclusion: The Next Stop for Reasoning Intelligence

The breakthrough of Seed-Thinking-v1.5 is reflected not only in its performance metrics but also in the validation of a scalable technical framework, from data quality control and RL stability optimization to infrastructure innovation. With the open-sourcing of evaluation sets such as BeyondAIME, the model may become an important milestone in standardizing the measurement of AI reasoning capability. Going forward, how to combine the precision of verifiable tasks with the creativity of non-verifiable tasks remains the core direction for the team to explore.
