
LongCat-Flash-Thinking: Revolutionizing Open-Source AI Reasoning with 560B MoE Architecture

In the rapidly evolving world of artificial intelligence, large language models (LLMs) are pushing the boundaries of what's possible in reasoning and problem-solving. Today, we're diving deep into LongCat-Flash-Thinking, a groundbreaking 560-billion-parameter Mixture-of-Experts (MoE) model developed by the Meituan LongCat Team. This open-source powerhouse activates an average of 27 billion parameters per token, making it both efficient and powerful for tasks like math, coding, and agentic reasoning. If you're an AI enthusiast, researcher, or developer searching for the latest in open-source AI reasoning models, this post is your guide. We'll explore its architecture, training pipeline, key features, benchmarks, and how to get started, with insights into AI advancements as of September 2025.

Whether you’re optimizing for large language model efficiency or exploring MoE architectures in AI, stick around as we break it down in simple, readable English. Let’s jump in!

What is LongCat-Flash-Thinking? A Quick Overview

LongCat-Flash-Thinking stands out as an efficient open-source MoE reasoning model with 560 billion total parameters, of which only 18.6 to 31.3 billion (about 27 billion on average) are activated per token depending on the input. Built on the LongCat-Flash-Base foundation [Meituan, 2025], the model uses a two-phase training approach: Long Chain-of-Thought (CoT) cold-start training followed by large-scale Reinforcement Learning (RL).

Key highlights include:

  • Core Contributions: Domain-parallel RL training for stability across STEM, coding, and agentic tasks; the DORA system for over 3x faster asynchronous training; and advanced capabilities in formal and agentic reasoning.
  • Performance Wins: As an open-source leader, it achieves state-of-the-art (SOTA) results, slashing token usage by 64.5% on AIME-25 (from 19,653 to 6,965) without losing accuracy.

This model isn’t just theoretical—it’s released to fuel further research in AI reasoning systems and agentic AI. Chat with it at https://longcat.ai, grab it from Hugging Face, or check the GitHub repo at https://github.com/meituan-longcat/LongCat-Flash-Thinking.

Introduction: The Rise of Reasoning in Large Language Models

Large language models have come a long way, with a shift toward enhancing reasoning capabilities driving us closer to Artificial General Intelligence (AGI). Models like OpenAI’s o1 and o3, Google’s Gemini 2.5, DeepSeek-R1, Qwen3, and GLM-4.5 showcase this by excelling in complex logic, mathematics, coding, and agentic tasks. The secret sauce? Large-scale RL that not only refines the model but also allocates more compute during inference to extend Chain-of-Thought (CoT) reasoning.

However, challenges remain: high computational costs, training instability, and gaps in areas like formal theorem proving and tool-based agentic reasoning. Enter LongCat-Flash-Thinking—an efficient open-source AI model designed to tackle these. Based on LongCat-Flash-Base, it boasts 560 billion parameters (27 billion active on average) and shines in logic, math, coding, and agents. Its development follows a two-phase pipeline: cold-start training to build foundational reasoning, and RL via the DORA system with domain-parallel schemes for expert fusion. The end result? A robust, safe, and human-aligned model ready for real-world use.

Our goal here is to demystify this for graduates and pros alike, highlighting how it advances open-source MoE models and AI efficiency.

Long CoT Cold-Start Training: Building the Foundation

The first phase, Long CoT cold-start training, is like preheating an oven—it prepares the model for complex reasoning through a multi-stage curriculum. This includes mid-training and reasoning-oriented Supervised Fine-Tuning (SFT), using a curated data pipeline (as shown in Figure 3).

Mid-Training: Boosting Reasoning Capabilities

Base pre-trained models are strong in general tasks but often falter on complex reasoning due to data imbalances. Pre-training corpora are dominated by general text, with little emphasis on reasoning-heavy domains like STEM and coding, and even less on explicit long CoT patterns.

To fix this, we turn mid-training into a balanced curriculum. Data curation pulls from academic archives, textbooks, and proprietary sources for math, physics, chemistry, and coding problems, focusing on multi-step logic. We apply heuristic rules and LLM-as-a-Judge for filtering, deduplication, and decontamination. Mixing ratios balance reasoning data with original mid-training data to preserve general abilities. Details are in Appendix A.1.

We validated this with pass@k metrics on AIME-24, BeyondAIME, and LiveCodeBench. Results? Higher reasoning data ratios boost performance significantly—e.g., pass@1 up 27.7% on AIME-24—expanding the model’s “reasoning boundary.”
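For readers unfamiliar with pass@k, the standard unbiased estimator from the Codex evaluation methodology is shown below; whether the LongCat team uses exactly this estimator is not stated here, so treat it as background.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them correct,
    k the evaluation budget. Returns the probability that at least one of
    k randomly chosen samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 4 of them correct.
print(pass_at_k(16, 4, 1))  # 0.25
print(pass_at_k(16, 4, 8))  # ~0.96
```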

Reasoning-Oriented SFT: Aligning for Advanced Skills

Post-mid-training, SFT aligns the model with quality instructions and boosts specialized reasoning, including general, formal, and agentic types.

General Reasoning

Data from STEM, coding, logic, and general QA. Prompt curation uses multi-stage filtering: LLM-as-a-Judge to remove low-quality queries; model voting for answer verification; difficulty filtering via expert pass rates. Responses are generated with rejection sampling from LongCat-Flash-Chat, selecting top-quality ones via rules and judgments. Domain details in Appendix A.2.

Formal Reasoning

For tasks like automated theorem proving (ATP), we use an expert-iteration pipeline (Figure 3, bottom left): statement formalization converts informal problems into formal ones, which a Lean4 server filters for syntactic and semantic validity; iterative proof synthesis then starts from a baseline prover, generating and verifying proofs with explicit thinking steps. The resulting dataset strengthens formal proving skills.
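To make the formalization step concrete, here is a hypothetical, minimal Lean 4 example of the kind of statement-plus-proof pair such a pipeline produces and verifies (real dataset problems are far harder; this one leans on Mathlib's Even.add):

```lean
import Mathlib

-- Informal statement: "the sum of two even natural numbers is even."
-- A formalized version that a Lean4 server can check for syntax and semantics:
theorem even_add_even (a b : ℕ) (ha : Even a) (hb : Even b) : Even (a + b) :=
  Even.add ha hb
```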

Agentic Reasoning

Agentic tasks require tools to solve complex problems. Our dual-path reasoning approach selects high-value queries by scoring tool necessity: v(x) = s_w/tool(x) - s_w/o-tool(x), the difference between the model's score on query x with tool access and without it, keeping queries where tools genuinely help. Automated trajectory synthesis in MCP servers and simulated environments then produces high-quality trajectories, layered by complexity (single-turn vs. multi-turn) for curriculum learning.
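A minimal sketch of this selection rule, assuming we can estimate a query's score with and without tool access; the function names, scoring interface, and threshold are illustrative assumptions, not the team's implementation:

```python
def select_tool_queries(queries, score_with_tool, score_without_tool,
                        threshold=0.2):
    """Keep queries whose score improves meaningfully when tools are available.

    score_with_tool / score_without_tool: callables mapping a query to an
    empirical score in [0, 1], e.g., a solve rate estimated over k attempts.
    """
    selected = []
    for q in queries:
        v = score_with_tool(q) - score_without_tool(q)  # tool necessity v(x)
        if v >= threshold:  # tools genuinely help on this query
            selected.append((q, v))
    # Highest-value queries first, convenient for complexity-layered curricula.
    return sorted(selected, key=lambda item: item[1], reverse=True)
```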

Training Recipe

SFT data mix: STEM 35%, general QA 20%, coding 20%, agentic 14%, proving 8%, logic 3%. Optimized with AdamW (lr 3e-5), 2 epochs, 48K context length for long chains.
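For reference, the recipe above condensed into a plain config sketch; anything not stated in the post (trainer, batch size, scheduler, warmup) is deliberately omitted:

```python
# Hedged summary of the SFT recipe described above; values mirror the text,
# everything else about the training setup is unspecified here.
sft_config = {
    "data_mixture": {        # fractions of the SFT mixture
        "stem": 0.35,
        "general_qa": 0.20,
        "coding": 0.20,
        "agentic": 0.14,
        "formal_proving": 0.08,
        "logic": 0.03,
    },
    "optimizer": "AdamW",
    "learning_rate": 3e-5,
    "num_epochs": 2,
    "max_context_length": 48 * 1024,  # 48K tokens for long reasoning chains
}
```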

Large-Scale Reinforcement Learning: Scaling Up Potential

The second phase uses RL to amplify the model's capabilities via DORA and adapted algorithms. This section covers the infrastructure, algorithms, reward systems, and training recipes.

RL Infrastructure: The DORA System

RL training faces scheduling overhead and long-tail generation latency. DORA enables asynchronous rollout, making training more than 3x faster than a synchronous setup. Key features include elastic colocation (fixed generation groups plus elastic roles) and multi-version pipelines for sampling consistency and KV-cache reuse (Figures 5 and 6).

Scalable optimizations include massive streaming RPCs and efficient MoE parallelism via graph-level compilation, which cuts kernel-launch overhead.
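To see why asynchronous rollout helps with long-tail generation, here is a toy asyncio sketch (not the DORA implementation): workers stream finished samples into a queue, and the trainer consumes whatever is ready instead of waiting for the slowest generation. All names and numbers are illustrative.

```python
import asyncio
import random

async def rollout_worker(policy_version, queue, worker_id):
    # Generation latency varies wildly across prompts (the long-tail problem),
    # but each finished sample is streamed to the trainer immediately.
    while True:
        await asyncio.sleep(random.uniform(0.1, 2.0))  # stand-in for generation
        await queue.put({"worker": worker_id, "policy_version": policy_version[0]})

async def trainer(policy_version, queue, batch_size=4, steps=3):
    # Consumes whatever samples are ready; slightly stale samples are tolerated
    # (staleness control and importance corrections are discussed below).
    for step in range(steps):
        batch = [await queue.get() for _ in range(batch_size)]
        policy_version[0] += 1  # pretend we applied a policy update
        print(f"step {step}: {len(batch)} samples, policy now v{policy_version[0]}")

async def main():
    queue, policy_version = asyncio.Queue(), [0]
    workers = [asyncio.create_task(rollout_worker(policy_version, queue, i))
               for i in range(8)]
    await trainer(policy_version, queue)
    for w in workers:
        w.cancel()

asyncio.run(main())
```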

RL Algorithms: GRPO Adaptations

Modified GRPO: the KL loss is dropped; losses are computed at the token level; ternary clipping is applied to negative advantages; and truncated importance sampling bridges the gap between the inference and training engines. The final objective is given in Equation (4).
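Below is an illustrative token-level clipped surrogate in the spirit of these modifications, written in PyTorch. It is not the paper's Equation (4): the extra branch for negative advantages is a dual-clip-style stand-in for the ternary clipping mentioned above, and all tensor names, clip bounds, and the importance-sampling cap are assumptions.

```python
import torch

def grpo_token_loss(logp_new, logp_old, logp_infer, advantages,
                    eps_low=0.2, eps_high=0.2, dual_clip=3.0, is_cap=2.0):
    """Illustrative token-level clipped surrogate: no KL term, token-level
    losses, an extra clipping branch for negative advantages, and truncated
    importance sampling between the inference and training engines.
    All tensors have shape [batch, seq_len]."""
    # Truncated importance ratio correcting for the rollout (inference) engine.
    is_ratio = torch.exp(logp_old - logp_infer).clamp(max=is_cap)

    # PPO-style ratio between the current and behaviour policies.
    ratio = torch.exp(logp_new - logp_old)

    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantages
    surrogate = torch.minimum(ratio * advantages, clipped)

    # Third clipping branch for negative advantages (dual-clip style),
    # bounding how strongly a single bad token can dominate the update.
    neg_bound = dual_clip * advantages
    surrogate = torch.where(advantages < 0,
                            torch.maximum(surrogate, neg_bound),
                            surrogate)

    # Token-level loss averaged over all tokens.
    return -(is_ratio * surrogate).mean()
```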

Efficiency tricks: Replacement online filtering; staleness control; incomplete signal masking.

Reward Systems: Tailored for Tasks

Non-verifiable tasks use discriminative reward models trained on a mix of human and model preference data. For verifiable tasks, STEM answers are judged by a generative reward model (GenRM) with reasoning (98.8% accuracy, Table 1), and coding rewards come from distributed sandbox clusters.
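As a rough illustration of the verifiable coding reward, here is a single-machine stand-in for a sandbox check; the real system relies on distributed sandbox clusters with proper isolation and resource limits, which this sketch does not attempt.

```python
import subprocess
import tempfile
import textwrap

def code_reward(candidate_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Run a candidate solution against unit tests in a subprocess and return
    1.0 if everything passes, 0.0 otherwise (including timeouts)."""
    program = textwrap.dedent(candidate_code) + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# Example: reward 1.0 because the generated solution passes the assert.
print(code_reward("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"))
```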

Training Recipes

Reasoning-Oriented RL: Domain-Parallel Approach

Mixed-domain RL training is unstable, so we decouple STEM, coding, and agentic training into separate expert models. Query curation filters data per domain. Configurations: progressive difficulty for STEM; multi-stage context lengths for coding; structured templates for agentic tasks.

Model Fusion

Expert models are fused with task-vector normalization, dropout, and erasure (Figure 8), yielding a Pareto-optimal unified model.
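A minimal sketch of task-vector style fusion over PyTorch state dicts; the exact normalization, dropout, and erasure rules from Figure 8 are not reproduced, so the thresholds, scaling, and averaging rule here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_experts(base_sd, expert_sds, drop_p=0.1, erase_thresh=1e-4, scale=1.0):
    """Fuse domain-expert models into one by combining their task vectors.

    base_sd: state dict of the cold-start (SFT) model.
    expert_sds: list of state dicts for the domain-expert RL models.
    """
    fused = {}
    for name, base_param in base_sd.items():
        vectors = []
        for sd in expert_sds:
            tv = sd[name] - base_param                    # task vector for this expert
            tv = tv / (tv.norm() + 1e-8)                  # normalization
            tv = F.dropout(tv, p=drop_p, training=True)   # random dropout of entries
            tv = torch.where(tv.abs() < erase_thresh,     # erase near-zero (noisy) entries
                             torch.zeros_like(tv), tv)
            vectors.append(tv)
        fused[name] = base_param + scale * torch.stack(vectors).mean(dim=0)
    return fused
```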

General RL Fine-Tuning

Uses open-source and synthetic data with clustered deduplication, followed by a final PPO stage for robustness, safety, and alignment.

Key Features: What Sets It Apart

  • Domain-Parallel RL Training: Decouples domains for stability, fuses experts to avoid interference.
  • Pioneering RL Infrastructure: DORA supports asynchronous, large-scale training.
  • Advanced Formal and Agentic Reasoning: Expert-iteration for proofs; dual-path selection and tool-augmented trajectories.

Evaluation Results: Benchmarks and Insights

Benchmark results are summarized in Table 2. LongCat-Flash-Thinking leads open-source models across domains:

  • Math: 99.2% on MATH500
  • Coding: 79.4% on LiveCodeBench (LCB)
  • Agentic tool use: 74.4% on BFCL V3
  • Formal proving: 81.6% on MiniF2F-Test (pass@32)
  • Safety: 93.7%-98.8%

It competes with closed-source models such as GPT-5 and o3 while maintaining high token efficiency (Figure 9).

Getting Started: Deployment and Quick Start

Chat templates: the single-turn prompt is [Round 0] USER:{query} /think_on ASSISTANT:, and the multi-turn format is similar. Tool calls are expressed as Markdown. Deploy with SGLang or vLLM, or try it at https://longcat.ai with the "Think" button enabled.
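A minimal sketch of querying a locally served model with the single-turn template above. It assumes you have already launched an OpenAI-compatible server with SGLang or vLLM; the model ID, port, and sampling parameters are assumptions.

```python
import requests

def build_single_turn_prompt(query: str) -> str:
    """Single-turn template from the section above; the multi-turn format is analogous."""
    return f"[Round 0] USER:{query} /think_on ASSISTANT:"

# Hypothetical local endpoint (e.g., started with SGLang or vLLM on port 8000).
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meituan-longcat/LongCat-Flash-Thinking",  # assumed model ID
        "prompt": build_single_turn_prompt("What is 17 * 24?"),
        "max_tokens": 1024,
        "temperature": 0.6,
    },
    timeout=600,
)
print(response.json()["choices"][0]["text"])
```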

License and Usage Considerations

Model weights are released under the MIT License (excluding Meituan trademarks and patents). Keep typical LLM limitations in mind (language coverage, safety) and comply with applicable laws and regulations.

Contact Us

Email: longcat-team@meituan.com. Join WeChat group via QR code.

Conclusion and Future Directions

LongCat-Flash-Thinking advances open-source AI reasoning with its domain-parallel RL scheme, the DORA infrastructure, and strong formal and agentic capabilities. Future work includes richer domain-specific training strategies and even more efficient RL.
