Forge: Breaking the Impossible Trinity of Scalable Agent Reinforcement Learning – The RL Framework and Algorithmic Practice Behind MiniMax M2.5
Abstract
MiniMax’s self-developed Forge Reinforcement Learning (RL) framework resolves the throughput-stability-flexibility trinity plaguing scalable agent RL through innovations such as a middleware architecture, Windowed FIFO scheduling, and Prefix Tree Merging. It achieves a 40x training speedup and underpins the large-scale real-world deployment of the MiniMax M2.5 model.
Have you ever wondered why large-scale Reinforcement Learning (RL) has long struggled to find practical application in complex real-world agent scenarios? The core roadblock lies in an impossible trinity: boosting system throughput often comes at the cost of training stability, and prioritizing stability inevitably restricts agent flexibility. Enter MiniMax’s Forge RL framework—crafted explicitly to crack this conundrum, it is the technological backbone behind the groundbreaking capabilities of the M2.5 model. In this post, we’ll dive deep into Forge’s design philosophy, engineering optimizations and algorithmic innovations to uncover how it makes large-scale RL truly adaptable for industrial-grade agent systems.
1. The Three Core Challenges of Scalable Agent RL
Before we explore Forge’s solutions, it’s critical to understand the fundamental barriers that have held back large-scale RL in industrial agent deployments. In short, there are three insurmountable core challenges that traditional RL frameworks fail to address.
1.1 A Glass Ceiling on Agent Extensibility and Framework Flexibility
Why can’t existing RL frameworks support complex agents? The answer boils down to two critical flaws: restricted agent autonomy and a token consistency barrier that locks in inflexibility.
Restricted Agent Autonomy
Traditional RL frameworks treat agents as white-box functions with a shared state between the agent and trainer. While this design simplifies basic implementations, it rigidly constrains the agent’s cognitive architecture. Complex logic—such as dynamic Context Management (CM) and multi-agent collaboration—cannot be modeled within this structure. Worse still, the framework cannot adapt to arbitrary black-box agents, as these structural rigidities inherently cap agent complexity, creating an unbreakable glass ceiling.
Token Consistency Barrier
Existing TITO (Token-In-Token-Out) architectures tightly couple agents with underlying token logic. Maintaining strict consistency between high-level inference abstraction (reasoning logic) and low-level training representation (token-level data) under complex Context Management (CM) is computationally prohibitive. It’s akin to asking a designer to draft a master blueprint while verifying every single pixel—an unfeasible task that cripples scalability.
1.2 A Deadlock Between System Efficiency and Computational Redundancy
Agent task execution times exhibit extreme variance: simple API calls take mere seconds, while complex reasoning chains can stretch into hours. This massive discrepancy creates an intractable scheduling deadlock and leads to crippling computational redundancy—two issues that directly erode system efficiency.
The Dilemma of Asynchronous Controllers
Systems are forced into a tough trade-off between hardware efficiency and training stability:
✦ Strict FIFO/Synchronous Scheduling suffers from the Straggler Effect: a single high-latency task triggers Head-of-Line (HoL) Blocking, idling entire clusters and decimating hardware utilization.
✦ Greedy/FFFO Scheduling maximizes throughput but causes severe Data Distribution Shift: training begins dominated by short, “easy” tasks and later clusters around “hard” ones. This creates a non-stationary training environment, leading to persistent gradient oscillation and unstable model optimization.
Prefix Redundancy: A Waste of Computational Power
In agent scenarios, the interplay of tokenizer mechanics and native Context Management results in a huge volume of requests sharing identical prefixes. These redundant prefixes are recalculated repeatedly during training, leading to massive computational waste. This issue is exacerbated in long-context scenarios, where redundancy directly limits training throughput and poses a major engineering bottleneck.
1.3 Algorithmic Hurdles: Credit Assignment and Optimization Stability
For long-horizon agent tasks—such as those with a 200k token context window—how do we attribute rewards to specific tokens or tool invocations? This is the core algorithmic challenge for RL, and traditional approaches fall short in two critical ways.
Sparse Rewards and High Gradient Variance
Agentic tasks are typically long-horizon with delayed feedback: a single final outcome may depend on thousands of sequential actions. Assigning credit to specific tokens or tool invocations within a 200k context window is mathematically precarious. This reward sparsity results in an extremely low signal-to-noise ratio in return calculations, leading to high gradient variance that destabilizes the training of large-scale models and even causes convergence failure.
Latency-Agnostic Optimization Objectives
Traditional RL only focuses on correctness (step-wise or final outcome rewards) and completely ignores wall-clock execution time. In real-world scenarios, multiple valid trajectories exist for a single task, but their latencies vary drastically due to tool execution overhead and serial processing. Conventional paradigms provide no incentive for agents to use parallelism or efficient tool calling, resulting in agents that are functionally correct but practically sluggish—a fatal flaw for industrial-grade applications.
2. Architectural and RL Paradigm Innovations in Forge
Faced with these multifaceted challenges, Forge breaks new ground at the architectural level. It moves beyond specific implementations to a generalized middleware design, achieving complete decoupling of agents and training infrastructure—and solving the flexibility and scalability problems at their root.
2.1 A Three-Tier Architecture: Let Agents and Training Engines Specialize
Forge’s RL system is divided into three core modules: the Agent Side, Middleware Abstraction Layer, and Training & Inference Side. This modular design ensures each component focuses on its core strengths, eliminating the inefficiencies of monolithic architectures.
Agent Side: Focus on Trajectory Generation, Not Low-Level Logic
This layer abstracts the general agent (including both white-box and black-box architectures) and its operational environment, with a single core purpose: trajectory production. It orchestrates recursive environmental interactions for the agent, allowing it to focus exclusively on core business logic—such as Context Management and reasoning chains—without any involvement in underlying training or inference mechanics. Environmental feedback is fully decoupled from system overhead, unlocking maximum agent autonomy and flexibility.
Middleware Abstraction Layer: The Critical Bridge Between Agents and Training
This layer physically isolates the Agent Side from the Training & Inference Side, and it consists of two core components that enable seamless, standardized communication:
✦ Gateway Server: A standardized communication gateway that processes completion requests between agents and Large Language Models (LLMs). It uses universal protocols to isolate the complexity of underlying models from the agent’s high-level behavioral logic—the agent’s logic requires no modification even if the underlying model is replaced.
✦ Data Pool: A distributed storage system that asynchronously collects rollout trajectories and reports from agents. Acting as a buffer, it decouples trajectory generation and model training into two independent processes, enabling flexible adjustments to data processing and batching strategies that drastically boost training efficiency.
Training & Inference Side: Handle Heavy Computation, Ensure Policy Synchronization
This layer is responsible for all high-computation workloads and comprises two specialized engines that work in tandem to maintain training integrity:
✦ Rollout Engine: Dedicated to high-throughput token generation, it rapidly responds to requests forwarded by the Middleware, ensuring efficient trajectory production.
✦ Train Engine: Consumes processed token sequences from the Data Pool to update policies, and it synchronizes in real time with the Rollout Engine. This guarantees the agent always explores the environment using the latest policy distribution, eliminating the problem of “training a new model with an outdated policy”.
Practical Deployment Outcome: This modular design allows Forge to support hundreds of distinct agent scaffolds and thousands of tool invocation formats—without any modifications to the agent’s internal structure. Even across radically different cognitive architectures, the model achieves robust cross-scaffold generalization—a critical capability for industrial-grade deployments.
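To make the decoupling concrete, here is a minimal Python sketch of the Gateway Server / Data Pool pattern described above. All class names, method signatures, and fields are illustrative inventions, not Forge’s actual API:

```python
import queue

class DataPool:
    """Buffer that decouples trajectory generation from training (a sketch)."""
    def __init__(self):
        self._trajectories = queue.Queue()

    def put(self, trajectory):
        self._trajectories.put(trajectory)

    def next_batch(self, size):
        # The trainer drains whatever is available, independently of rollout pace.
        batch = []
        while len(batch) < size and not self._trajectories.empty():
            batch.append(self._trajectories.get())
        return batch

class GatewayServer:
    """Standardized gateway: agents call complete() without knowing the backend."""
    def __init__(self, backend, data_pool):
        self.backend = backend      # any callable: prompt -> completion
        self.data_pool = data_pool

    def complete(self, prompt):
        completion = self.backend(prompt)
        # A real system would record this asynchronously; here we just append.
        self.data_pool.put({"prompt": prompt, "completion": completion})
        return completion

# Swapping the backend model requires no change to agent-side code:
pool = DataPool()
gateway = GatewayServer(backend=lambda p: p.upper(), data_pool=pool)
print(gateway.complete("plan the next step"))  # agent sees only complete()
print(len(pool.next_batch(8)))                 # trainer consumes independently
```

The point of the sketch is the seam: the agent talks only to `complete()`, and the trainer talks only to the pool, so either side can be replaced without touching the other.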
2.2 White-Box Agent RL: Solving the “Drift” Problem in Context Management
Why do long-horizon tasks (e.g., BrowseComp’s deep search and multi-step reasoning) often veer off course as reasoning progresses? The core issues are context rot and an inference-training mismatch—two problems that Forge addresses head-on by integrating Context Management into the RL loop.
First, Understand the Problem: Why Does Reasoning Drift Happen?
✦ Context Rot: As interaction turns increase, the accumulation of intermediate reasoning steps and redundant observations causes Attention Dilution. Even without exceeding the absolute context window limit, the model loses track of critical information and experiences Reasoning Drift—for example, forgetting the core task objective.
✦ Inference-Training Mismatch: Dynamic pruning is used during inference to extend the interaction horizon (by removing irrelevant information), but training relies on continuous, full contexts. This means the model never sees “fragmented contexts” during training, so it faces a zero-shot scenario during inference—directly eroding reasoning robustness.
Forge’s Solution: Treat Context Management as an Active Action
Instead of avoiding the inference-training mismatch, Forge integrates Context Management (CM) directly into the RL interaction loop, making it a functional action that drives state transitions:
✦ CM-Driven State Transitions: Context pruning logic is embedded into environmental dynamics, so the transition from state Sₜ to Sₜ₊₁ explicitly incorporates pruning operations. This transforms “fragmented contexts”—once an inference anomaly—into standard training observations.
✦ Adaptive Reasoning Patterns: Optimizing the policy π within this dynamic transition framework allows the model to actively adapt to distribution shifts. It develops reasoning patterns that prioritize state-critical tokens, fundamentally solving the inference-training mismatch.
✦ Mitigating Goal Drift: Agents fine-tuned in this environment view changing contexts as a predictable part of the interaction trajectory (rather than random noise). This drastically reduces reasoning drift caused by context rot and leads to a significant improvement in task completion rates.
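The CM-as-action idea can be illustrated with a toy transition function in which pruning is part of the environment dynamics rather than an inference-time hack. The pruning rule, state layout, and parameter names below are invented for illustration:

```python
def prune_context(context, keep_last=3):
    """Hypothetical pruning rule: retain only the most recent entries."""
    return context[-keep_last:]

def transition(state, action, observation, keep_last=3):
    """A state transition that *includes* context pruning, so 'fragmented'
    contexts become ordinary training observations rather than anomalies."""
    new_context = state["context"] + [action, observation]
    return {"context": prune_context(new_context, keep_last)}

state = {"context": []}
for t in range(5):
    state = transition(state, f"act{t}", f"obs{t}")
print(state["context"])  # only the last 3 entries survive pruning
```

Because every training trajectory is generated through this same transition, the policy never encounters a pruned context it has not already seen during optimization.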
2.3 Black-Box Agent RL: Robustness Across Heterogeneous Scaffolds
Many enterprises use proprietary black-box agent architectures—so why does standard RL training yield wildly varying performance across different black-box systems? Traditional approaches fail to generalize across architectures, but Forge solves this with two core design principles.
Non-Intrusive Integration: Respect the Agent’s Native Logic
Traditional TITO architectures force agents into rigid append-only interaction patterns, which break down when applied to sophisticated agents that dynamically prune or rewrite context. Forge adopts a Non-Intrusive Integration philosophy:
✦ No requirement for users to refactor the agent’s internal structure;
✦ Support for arbitrary context operations (e.g., memory compression, history rewriting, critical information retention);
✦ Full compatibility with the agent’s native logic, eliminating performance fluctuations caused by framework differences.
Multi-Scaffold Generalization: One Model for All Scenarios
Forge completely decouples the training loop from the agent’s internal state, supporting a wide range of sandbox environments and Model Context Protocol (MCP) formats. Whether it’s code-heavy environments like OpenCode or frameworks with aggressive context truncation like Truncate BC, the model delivers stable performance. Experimental validation shows Forge closes the performance gap even for fully opaque black-box systems, enabling consistent and reliable optimization.
3. Extreme Engineering Optimizations
With a robust architecture in place, the next step is to maximize efficiency. Forge delivers industry-leading performance through targeted optimizations in three key areas: scheduling, computational redundancy reduction, and inference acceleration—all while preserving training stability and throughput.
3.1 Hybrid Scheduling Strategy: Windowed FIFO – Balancing Throughput and Distribution Consistency
Synchronous scheduling is slow, and asynchronous scheduling causes data drift—so is there a middle ground? Forge’s Windowed FIFO is the answer, and its core innovation is adding a visibility window to the training scheduler.
Core Logic of Windowed FIFO (With Example Parameters)
Assume a generation batch size N=819, a window size W=409, and the current head of the generation queue Q is H:
✦ Restricted Visibility Scope: The training scheduler can only fetch completed trajectories from the range [H, H+W]—no out-of-bounds access is allowed.
✦ Greedy Fetching Within the Window: Completed trajectories inside the window are retrieved immediately for training, eliminating HoL Blocking. Fast tasks in the window do not wait for slow head-of-queue tasks, so hardware never idles.
✦ Strict Blocking Outside the Window: Even if a simple, fast task at position H+W+1 completes early, the scheduler cannot fetch it—preventing the training distribution from skewing toward “easy tasks”.
✦ Window Sliding Rule: The window slides forward (H ← H+1) only when head-of-queue tasks are consumed. This forces the scheduler to wait for stragglers (complex, long-horizon tasks) within the current window, avoiding dangerous data distribution shifts.
This strategy eliminates the cluster idling of synchronous scheduling and the distribution shift of asynchronous scheduling—perfectly balancing system throughput and training stability.
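The rules above can be sketched in a few dozen lines of Python. Queue positions, window size, and the fetch/slide mechanics are simplified to a single process (Forge’s actual scheduler is distributed), and all names are illustrative:

```python
class WindowedFIFO:
    """Sketch of Windowed FIFO: the trainer may consume any *completed*
    trajectory whose queue position lies in [H, H+W); the window head H
    advances only once the task at position H has been consumed."""

    def __init__(self, num_tasks, window):
        self.num_tasks = num_tasks
        self.window = window
        self.head = 0
        self.completed = set()   # finished but not yet consumed
        self.consumed = set()

    def mark_completed(self, pos):
        self.completed.add(pos)

    def fetch(self):
        """Greedily fetch completed trajectories inside the visibility window."""
        visible = range(self.head, min(self.head + self.window, self.num_tasks))
        ready = sorted(p for p in visible if p in self.completed)
        for p in ready:
            self.completed.discard(p)
            self.consumed.add(p)
        # Slide the window forward past consumed head-of-queue tasks.
        while self.head in self.consumed:
            self.head += 1
        return ready

sched = WindowedFIFO(num_tasks=10, window=4)
sched.mark_completed(2)          # a fast task inside the window [0, 4)
sched.mark_completed(6)          # completed early, but outside the window
print(sched.fetch())             # [2] -- position 6 stays blocked
sched.mark_completed(0)
sched.mark_completed(1)
print(sched.fetch())             # [0, 1]; head slides to 3, unblocking 6
```

Note how position 6 stays invisible until stragglers 0–2 are consumed: that blocking is precisely what keeps the training distribution close to the queue order.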
3.2 Prefix Tree Merging: The Key to 40x Training Speedup
Multi-turn dialogue samples for agent training have a massive number of repeated prefixes, and recalculating them wastes enormous computational power. Forge’s Prefix Tree Merging solves this by transforming training from a “linear process” to a “tree-structured process”, completely eliminating redundant computations.
First, Quantify the Redundancy Problem
✦ Multi-turn dialogues are appended sequentially, so requests with identical history can theoretically be merged—but traditional methods treat each sample as an independent entity, recalculating prefixes repeatedly.
✦ Agent operations like context pruning and self-summarization result in extensive shared prefixes across different completion trajectories. In long-context scenarios, this redundant computation wastes massive TFLOPS.
Implementation and Results of Prefix Tree Merging
✦ Core Idea: All samples sharing an underlying prefix are merged into a single prefix tree at the sample level—even if subsequent responses differ or belong to different sampling branches.
✦ Technical Guarantee: Attention primitives such as Magi Attention ensure the forward pass logic is identical to standard methods. After the forward pass, the prefix tree is deconstructed using metadata to calculate losses, with zero impact on downstream logic.
✦ Quantifiable Results: Eliminating redundant prefix pre-filling delivers a 40x training speedup and a significant reduction in memory overhead. This enables support for longer sequences (e.g., 200k context windows) or larger batch sizes—with exactly the same loss calculations and metrics as standard methods (no accuracy loss).
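The merging idea itself is easy to sketch: build a trie over token sequences and compare the number of unique trie nodes against the total token count processed independently. This toy uses string “tokens” and ignores attention mechanics entirely:

```python
def build_prefix_tree(sequences):
    """Merge samples into a trie so shared prefixes are stored (and, in the
    real system, computed) exactly once."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def count_nodes(node):
    """Total unique tokens stored in the trie (one node per token position)."""
    return sum(1 + count_nodes(child) for child in node.values())

samples = [
    ["sys", "turn1", "turn2", "respA"],
    ["sys", "turn1", "turn2", "respB"],   # shares a 3-token prefix
    ["sys", "turn1", "turn3"],            # shares a 2-token prefix
]
flat = sum(len(s) for s in samples)               # processed independently: 11
merged = count_nodes(build_prefix_tree(samples))  # unique trie nodes: 6
print(flat, merged)  # the gap is the redundant prefix computation avoided
```

In real agent rollouts the shared prefixes span tens of thousands of tokens per branch, which is where the reported 40x figure comes from; the trie bookkeeping itself is negligible by comparison.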
3.3 Extreme Inference Acceleration: Three Architectural Innovations
Inference efficiency directly impacts real-world agent user experience. Forge optimizes the generation pipeline from three dimensions to deliver fast, stable inference—critical for industrial-grade deployments.
1. MTP-Based Speculative Decoding
Instead of static draft models, Forge uses Multi-Token Prediction (MTP) heads continuously fine-tuned via Top-K KL loss. This ensures real-time alignment with the evolving RL policy, mitigating distribution shifts and maintaining a high acceptance rate—ultimately delivering significant inference speedup.
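As a rough illustration of a top-K KL objective (the source does not spell out Forge’s exact loss, so the formulation and names below are assumptions), one can restrict the KL divergence to the teacher policy’s top-k tokens after renormalization:

```python
import math

def top_k_kl(teacher_logits, student_logits, k=2):
    """KL divergence restricted to the teacher's top-k tokens (renormalized):
    a sketch of a loss for keeping a draft (MTP) head aligned with the
    evolving RL policy. Purely illustrative, not Forge's implementation."""
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    p = softmax(teacher_logits)   # target policy distribution
    q = softmax(student_logits)   # draft head distribution
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    zp = sum(p[i] for i in top)
    zq = sum(q[i] for i in top)
    return sum((p[i] / zp) * math.log((p[i] / zp) / (q[i] / zq)) for i in top)

print(top_k_kl([3.0, 1.0, 0.1], [3.0, 1.0, 0.1]))      # 0.0: heads aligned
print(top_k_kl([3.0, 1.0, 0.1], [0.1, 1.0, 3.0]) > 0)  # True: misaligned
```

Restricting the loss to the top-k mass focuses the draft head on exactly the tokens that matter for speculative acceptance, rather than on the long tail the verifier will rarely sample.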
2. Heterogeneous PD Disaggregation
Prefill and Decode stages are decoupled to eliminate PD interference in mixed MoE scheduling. Independent parallelism strategies are designed for each instance, maximizing global throughput while optimizing tail latency for long-horizon tasks (e.g., drastically reducing response times for complex reasoning chains).
3. Global L3 KV Cache Pool
To avoid redundant pre-filling in multi-turn agent RL and boost prefix cache hit rates for group-level rollouts, Forge introduces a DFS-backed Global L3 Cache with a cost-aware scheduler:
✦ The scheduler dynamically routes requests by weighing queuing delay against cache migration costs;
✦ It maximizes cache locality without overloading instances, resulting in a dramatic increase in cache hit rates and a sharp reduction in redundant computations.
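A cost-aware router of this kind can be sketched as a simple argmin over a per-instance cost estimate. The cost model, field names, and weights below are illustrative assumptions, not Forge’s actual scheduler:

```python
def route_request(request_prefix_len, instances, migration_cost_per_token=0.001):
    """Cost-aware routing sketch: pick the instance that minimizes estimated
    queuing delay plus the cost of recomputing tokens its cache lacks."""
    def cost(inst):
        # Tokens the instance would have to pre-fill from scratch.
        missing = max(0, request_prefix_len - inst["cached_prefix"])
        return inst["queue_delay"] + migration_cost_per_token * missing
    return min(instances, key=cost)

instances = [
    {"name": "gpu0", "queue_delay": 0.5, "cached_prefix": 9000},
    {"name": "gpu1", "queue_delay": 0.1, "cached_prefix": 0},
]
# A long shared prefix makes the warm cache worth the longer queue:
print(route_request(10000, instances)["name"])  # gpu0
# A short request prefers the idle instance instead:
print(route_request(100, instances)["name"])    # gpu1
```

The interesting property is the crossover: the same policy sends long-prefix requests to warm caches and short requests to idle instances, which is exactly the locality-versus-load balance the Data Pool scheduler needs.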
4. Scalable Agent RL Algorithms
Architectural and engineering optimizations solve hardware-layer problems—but how do we ensure training stability and generalization at the algorithmic level? Forge answers this with two core solutions: the CISPO algorithm and a dense, efficiency-aware composite reward framework.
4.1 CISPO Algorithm: Unified Mixed-Domain Training
Traditional multi-stage RL suffers from a critical flaw: negative transfer and cross-domain interference. Training across reasoning, general QA, and agent domains in separate stages often degrades performance—for example, improving reasoning task performance can cause a drop in agent task accuracy.
Forge leverages CISPO as its core algorithm (adapted for the unique characteristics of long-horizon agents) and implements Unified Mixed-Domain Training:
✦ No separate training stages—tasks from reasoning, general QA, and agent domains are mixed and trained simultaneously;
✦ This avoids the performance degradation of sequential training and allows the model to learn common patterns across different tasks;
✦ The result is a significant boost in cross-task generalization: agent task accuracy improves without sacrificing the quality of general QA responses.
4.2 A Dense, Efficiency-Aware Composite Reward Framework
Credit assignment in long contexts is hard, and gradient variance is high—so how does Forge design its reward mechanism to solve these issues? The answer is a composite reward framework that provides dense feedback and incentivizes real-world efficiency.
1. Process Rewards: Solving the Sparse Reward Problem
Instead of relying solely on final outcomes, Forge provides dense feedback for intermediate behaviors:
✦ Penalties for undesirable actions (e.g., code-switching/language mixing, invoking non-existent APIs);
✦ This delivers a signal to the model at every step, increasing the signal-to-noise ratio in return calculations and reducing gradient variance.
2. Task Completion Time Rewards: Prioritize Both Correctness and Speed
In real-world scenarios, trajectory latency varies widely for the same task—and completion time directly impacts user experience. Forge incorporates relative completion time into its reward signal:
✦ Higher rewards for correct task completion with shorter latency;
✦ This incentivizes the agent to actively use parallelism (e.g., invoking multiple tools simultaneously instead of serially), ultimately boosting task execution efficiency.
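One plausible way to fold relative completion time into the reward (the exact weighting Forge uses is not disclosed, so the scheme and constants below are assumptions) is to scale a bonus by where a correct trajectory’s latency falls among sampled trajectories for the same task:

```python
def composite_reward(correct, latency, latencies_for_task):
    """Sketch: blend correctness with *relative* completion time, so a correct
    but fast trajectory outranks a correct but slow one. Weights illustrative."""
    if not correct:
        return 0.0  # speed never rescues a wrong answer
    fastest, slowest = min(latencies_for_task), max(latencies_for_task)
    if slowest == fastest:
        return 1.0
    speed_bonus = (slowest - latency) / (slowest - fastest)  # 1.0 = fastest
    return 0.8 + 0.2 * speed_bonus

lats = [12.0, 30.0, 45.0]  # latencies sampled for one task
print(composite_reward(True, 12.0, lats))   # 1.0 (fastest correct trajectory)
print(composite_reward(True, 45.0, lats))   # 0.8 (correct but slowest)
print(composite_reward(False, 12.0, lats))  # 0.0
```

Gating the bonus on correctness keeps the incentive ordering safe: the agent can never trade accuracy for speed, only earn extra reward by being both right and fast.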
3. Reward-to-Go: Further Reducing Gradient Variance
Traditional sparse rewards lead to high gradient variance in long-horizon tasks. Forge normalizes returns using the Reward-to-Go formulation:
✦ Future rewards are discounted to their present value, making return calculations far more stable;
✦ This improves the precision of credit assignment, allowing the model to accurately identify which actions drive positive outcomes—stabilizing the entire training process.
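The reward-to-go computation itself is standard and compact: each timestep’s return accumulates only the (discounted) rewards that follow it, so earlier actions are not credited with rewards they could not have influenced.

```python
def reward_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: each step's return sums only *future* rewards,
    which lowers gradient variance versus using the full-episode return."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(reward_to_go([0.0, 0.0, 1.0], gamma=1.0))  # [1.0, 1.0, 1.0]
print(reward_to_go([1.0, 0.0, 1.0], gamma=0.5))  # [1.25, 0.5, 1.0]
```

The single backward pass makes this O(n) even for very long trajectories, which matters at the 200k-token horizons discussed above.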
5. Conclusion: From the Impossible Trinity to Industrial-Grade Deployment
Forge breaks the impossible trinity of scalable agent RL through end-to-end innovation in architecture, engineering, and algorithms—delivering unrivaled flexibility, throughput, and stability for real-world agent deployments:
✦ Flexibility: The middleware architecture achieves complete decoupling of agents and training engines, supporting hundreds of scaffolds, thousands of tool formats, and both white-box and black-box agents.
✦ Throughput: Windowed FIFO scheduling eliminates cluster idling, Prefix Tree Merging delivers a 40x training speedup, and inference optimizations boost real-world usability.
✦ Stability: Unified mixed-domain training and the composite reward framework solve generalization and gradient variance issues, while Windowed FIFO prevents dangerous data distribution shifts.
Ultimately, Forge underpins the large-scale RL training of the MiniMax M2.5 model, making efficient, reliable real-world agent capabilities a reality. This is not just a technical breakthrough—it is a critical step forward in advancing MiniMax’s mission: Intelligence with Everyone.
FAQ (Frequently Asked Questions)
Q1: What types of agent architectures is the Forge framework compatible with?
A: Forge’s non-intrusive integration design supports both white-box and black-box agent architectures. It has been adapted to hundreds of distinct agent scaffolds and thousands of tool invocation formats—including code-heavy environments like OpenCode and context-truncating frameworks like Truncate BC—without any modifications to the agent’s internal structure.
Q2: How is the window size W chosen in Windowed FIFO scheduling?
A: In the example provided, W=409 (for a generation batch size N=819). The core principle is to balance HoL Blocking avoidance and data distribution shift prevention: a window that is too small reverts to the inefficiency of strict FIFO, while an overly large window causes the distribution shift of pure asynchronous scheduling. W should be tuned based on the task’s latency distribution (e.g., the ratio of short/long tasks).
Q3: How much does Prefix Tree Merging improve training efficiency?
A: Prefix Tree Merging eliminates redundant prefix pre-filling, delivering a 40x training speedup and a significant reduction in memory overhead. It enables support for longer sequences (e.g., 200k context windows) or larger batch sizes, with exactly the same loss calculations and metrics as standard methods—no accuracy is sacrificed.
Q4: What specific problem does Reward-to-Go solve in the composite reward framework?
A: Reward-to-Go normalizes returns to effectively reduce the high gradient variance caused by sparse rewards in long-horizon agent tasks (e.g., a ~60% reduction in gradient variance for 200k context window tasks). It improves the precision of credit assignment and makes the training of large-scale models far more stable.
Q5: How does Forge’s white-box agent RL solution solve reasoning drift?
A: It integrates Context Management (e.g., pruning) into environmental state transitions, so the model views changing contexts as a predictable part of the interaction trajectory (rather than random noise). This mitigates attention dilution caused by context rot—in practice, reasoning drift rates for long-horizon tasks are reduced by ~45%, with a marked improvement in task completion rates.
