iFlow-ROME: A Complete Guide to Alibaba’s Next-Generation AI Agent Training System
Summary: iFlow-ROME is Alibaba's agentic learning ecosystem featuring a 30B MoE ROME model that achieves 57.40% task completion on SWE-bench Verified. The system generates over 1 million verified interaction trajectories through the ROCK sandbox manager and employs a three-stage curriculum training methodology for end-to-end execution optimization in real-world environments.
When you type a command in your terminal, expecting AI to help you complete complex software engineering tasks, traditional large language models often disappoint—they might generate code that looks reasonable but crashes when you run it, or they “lose the thread” midway through multi-step tasks, unable to complete the full workflow. This is precisely the pain point Alibaba’s team set out to solve.
In August 2025, they officially launched iFlow CLI, an agent product designed for real-world engineering scenarios. After gathering user feedback, the team discovered a critical issue: no matter how high a model scores on benchmarks, it frequently “fails” once deployed in real, complex environments. This isn’t a problem of model size, but rather that existing training systems insufficiently model how agent models execute and receive feedback in real task environments.
Based on this insight, Alibaba’s Future Living Lab, together with their Intelligent Engine and Data Technology teams, launched the complete Agentic Learning Ecosystem (ALE) and its core model, ROME-V0.1.
What Is ROME and How Does It Fundamentally Differ from Traditional LLMs?
ROME stands for “ROME is Obviously an Agentic ModEl”—not just a recursive acronym, but a reflection of its design philosophy. This is a model built specifically for agentic capabilities, not simply a language model.
Core Architecture: The Logic Behind 30B MoE
ROME-V0.1 adopts a 30B MoE (Mixture of Experts) architecture. This scale wasn’t chosen to pursue maximum parameter count. The team explicitly states this represents a balance point between trainability, deployability, and reproducibility. The 30B scale is sufficient to support complex agentic capabilities while ensuring the complete training loop can run stably with high efficiency and cost-effectiveness.
In mainstream Agent benchmarks, iFlow CLI + ROME-V0.1 outperforms open-source models of similar scale:
| Benchmark | ROME-V0.1 Score | Comparison |
|---|---|---|
| SWE-bench Verified | 57.40% | Software engineering task completion rate, approaching 100B+ parameter model performance |
| Terminal-Bench 2.0 | 24.72% | Terminal operation success rate, leading among similar-scale open-source models |
The key behind these numbers: ROME isn’t a model specifically optimized for certain evaluation benchmarks. Instead, it naturally evolved through over 1 million verified interaction trajectories with real environment feedback.
Three Fundamental Differences from Traditional LLMs
1. Paradigm Shift in Training Data Sources
Traditional LLMs primarily use static text corpora, organized in doc-centric (around documents) or query-centric (around questions) patterns. This data lacks executable environment constraints, causing models to learn behavior patterns that “look reasonable” but “don’t actually work in real conditions.”
ROME employs an environment-centric data construction paradigm. The team first builds reproducible execution environments and runnable task instances at scale, with each instance including:
- Task description
- Docker environment configuration
- Initialization scripts
- Test files
- Golden solution
Multi-turn interaction trajectories are systematically generated on top of these instances, with all trajectories verified through execution and testing. Differences between various environments and tools are reflected in different trajectories, constraining the model from the start to “executable, verifiable” learning objectives.
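To make the shape of such an instance concrete, here is a minimal sketch of what one environment-centric record might look like. The field names and example values are illustrative assumptions; the actual ALE schema is not published in this article.

```python
# Hypothetical sketch of a single environment-centric task instance.
# Field names are illustrative, not the actual ALE schema.
from dataclasses import dataclass, field

@dataclass
class TaskInstance:
    task_id: str                  # unique identifier for the instance
    description: str              # natural-language task description
    docker_image: str             # Docker environment configuration
    init_script: str              # commands that prepare the workspace
    test_files: list[str] = field(default_factory=list)   # tests used for verification
    golden_solution: str = ""     # reference solution / patch for the task

example = TaskInstance(
    task_id="repo-issue-0001",
    description="Fix the failing unit test in utils/date_parser.py",
    docker_image="python:3.11-slim",
    init_script="pip install -r requirements.txt",
    test_files=["tests/test_date_parser.py"],
    golden_solution="diff --git a/utils/date_parser.py ...",
)
```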
2. Redefinition of Training Objectives
Traditional LLMs optimize for “generating correct text,” while ROME optimizes for “completing tasks in real environments.” This difference manifests in every training detail.
For example, during the Supervised Fine-Tuning (SFT) stage, ROME introduces an error masking training mechanism: based on tool execution feedback, gradients corresponding to non-executable or failed behaviors are zeroed out. This avoids the traditional approach of indiscriminately backpropagating gradients for all tokens, which unintentionally reinforces incorrect behaviors.
3. Essential Transformation in Evaluation Standards
Evaluating ROME isn’t about how “professional” its generated code looks, but whether it can actually execute in a Docker sandbox and pass tests. This “executable, verifiable” evaluation standard fundamentally changes the model’s learning direction.
The ALE Ecosystem: Complete Infrastructure Behind ROME
If ROME is like a rigorously trained engineer, then ALE is the complete educational system that cultivates this engineer—from training grounds and teaching methods to practical exercises, forming a closed loop.
ROCK: A Training Ground with Massive Concurrent Capacity
ROCK (Reinforcement Open Construction Kit) is a self-developed sandbox manager that provides real, secure, and isolated execution environments for model training.
Core Capability Metrics:
- Concurrent capacity: Supports tens of thousands of simultaneously running sandboxes
- Data scale: Generates over 1 million interaction trajectories with environment feedback
- Environment foundation: Built on real GitHub projects
ROCK ensures that every operation the model encounters during training has real environment execution results as feedback. It’s like having students learn in real engineering projects rather than just doing exercises in a classroom.
ROLL: Solving the Long-Tail Rollout Efficiency Problem
Rollout efficiency has long been a bottleneck in reinforcement learning. In complex Agent tasks, the wide variance in task difficulty and length makes the long-tail phenomenon in environment interaction and sample generation even more severe: some tasks require dozens of interaction steps to complete while others need only a few, so the entire training pipeline stalls waiting for the longest tasks to finish.
ROLL (Reinforcement Learning Optimization for Large-Scale Learning) breaks through this bottleneck with the following techniques:
1. Extreme Distributed Parallelization
Fully parallelizes trajectory sampling, policy evaluation, gradient computation, and other processes, with different tasks proceeding simultaneously in different sandboxes.
2. Asynchronous Training Pipeline
Instead of waiting for all tasks to complete before starting policy optimization, it continuously collects completed trajectories for training. This dramatically reduces the time cost of trajectory sampling and policy optimization.
3. High-Frequency Closed-Loop Training
Enables the model to run trial-and-error iterations across massive numbers of tasks in parallel, completing more frequent closed-loop training updates per unit of time (see the sketch below).
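The sketch below illustrates the asynchronous idea in miniature: rollout workers finish at very different times, and the trainer consumes whatever trajectories are ready instead of blocking on the slowest (long-tail) task. All names and numbers are placeholders, not the ROLL API.

```python
# Toy asynchronous rollout pipeline: workers push finished trajectories into a
# queue; the trainer updates as soon as a batch is available.
import queue
import random
import threading
import time

trajectory_queue: "queue.Queue[dict]" = queue.Queue()

def rollout_worker(task_id: int) -> None:
    steps = random.randint(3, 40)              # task lengths vary widely (long tail)
    time.sleep(steps * 0.01)                   # stand-in for sandbox interaction time
    trajectory_queue.put({"task_id": task_id, "steps": steps, "reward": random.random()})

def trainer(total_tasks: int, batch_size: int = 4) -> None:
    seen, batch = 0, []
    while seen < total_tasks:
        batch.append(trajectory_queue.get())   # take whatever has finished first
        seen += 1
        if len(batch) == batch_size:
            print(f"policy update on {len(batch)} trajectories")  # stand-in for an update
            batch.clear()

workers = [threading.Thread(target=rollout_worker, args=(i,)) for i in range(16)]
for w in workers:
    w.start()
trainer(total_tasks=16)
for w in workers:
    w.join()
```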
iFlow CLI: The Unified Interface Between Training and Production
iFlow CLI is not just the interface through which users interact with ROME; it is also a crucial component of the training system. It implements standardized context management and flexible, open configuration, eliminating the gap between training and production.
Why Is Context Management So Critical?
In long-chain tasks, the model needs to:
- Remember previous operation history
- Track current environment state
- Understand changes in available tools
- Manage switching between multiple sub-tasks
Traditional training methods often use simplified context concatenation, creating significant differences from actual Agent frameworks, causing model capabilities to degrade in production environments. iFlow CLI, through standardized protocols, ensures Agent models maintain real-time, smooth interaction with environments throughout complex task workflows.
Three-Stage Curriculum Training: How to Cultivate a Competent AI Engineer from Scratch
ROME doesn’t simply follow the common “pretraining—fine-tuning—reinforcement learning” paradigm. Instead, centered on the gradual formation process of Agent capabilities, it designed a curriculum-based three-stage training system.
Stage One: CPT (Continued Pretraining) — Building Foundation Capabilities
Like a newly hired engineer who first needs to learn programming languages and development tools, the CPT stage’s goal isn’t to directly optimize task success rates, but to systematically inject foundational Agent capabilities.
Core Capability Matrix:
| Capability Dimension | Specific Content |
|---|---|
| Code Understanding & Modification | Understanding code structure, identifying bugs, generating fix solutions |
| Task Decomposition & Planning | Breaking complex tasks into executable steps |
| Tool Usage & Reasoning | Mastering terminal commands, API calls, file operations, etc. |
| Environment State Perception | Understanding execution feedback, judging operation success |
The data filtering strategy is also unique: rather than using result correctness as the sole criterion, it primarily focuses on behavioral pattern coverage. By introducing diverse interaction trajectories, it provides ample activatable space for subsequent policy optimization.
This is like letting students first encounter various types of programming tasks—even if some tasks aren’t completed correctly, through diverse attempts they build basic cognition of engineering problems.
Stage Two: SFT (Supervised Fine-Tuning) — Stabilizing Interaction Behavior
This stage’s core objective is to anchor subsequent reinforcement learning in reliable, executable policy regions, avoiding high-frequency occurrence of low-quality or non-executable behaviors.
Two-Phase SFT Strategy:
Phase One: Lightweight SFT
Data filtering based on heuristic rules ensures the model possesses correct behavioral patterns. For example, filtering out code with obvious syntax errors, invalid tool calls, etc.
Phase Two: Adaptive Enhancement
Introduces adaptive sample selection mechanisms to prioritize interaction trajectories with high learning value. What trajectories have high learning value? Those demonstrating complex problem-solving approaches and successfully handling edge cases.
The Necessity of Error Masking Training
In long-chain interactions, tool call errors or execution failures are extremely common. If gradients are indiscriminately backpropagated for all tokens, the model might unintentionally reinforce incorrect behaviors.
ROME’s solution: based on tool execution feedback, gradients corresponding to non-executable or failed behaviors are zeroed out. The model learns only from successful behaviors, not learning incorrect patterns from failures.
Decision Boundary Recognition
In multi-sub-Agent scenarios, the system identifies decision boundaries for specific tasks, retaining only context rounds directly relevant to the current sub-task. Through pattern-based heuristic identification, loss gradients are masked for redundant, highly similar, or pruned historical rounds, concentrating learning signals on interaction processes with true causal influence.
This dramatically improves sample efficiency, preventing the model from wasting learning capacity on irrelevant information.
Stage Three: IPA Reinforcement Learning — The Core Algorithm for Policy Evolution
After completing basic alignment, ROME enters the reinforcement learning stage based on IPA (Interaction-Perceptive Agentic Policy Optimization). IPA is a reinforcement learning algorithm specifically designed for Agent long-chain tasks, addressing multiple core pain points of traditional RL in complex interaction scenarios.
IPA Algorithm: Paradigm Upgrade from Token-Level to Interaction Chunk-Level
Traditional reinforcement learning uses tokens as optimization units, but this creates serious problems in Agent tasks. A complete tool call might contain dozens of tokens, and if each token is optimized independently, it’s difficult to accurately assign reward signals—which specific tokens led to the tool call’s success or failure?
IPA’s core innovation is elevating the optimization objective from “token granularity” to “semantic interaction chunk (Interaction Chunk)” level.
Chunked Markov Decision Process: Re-modeling the Decision Process
IPA first re-models the Markov Decision Process (MDP) at the interaction chunk level. It divides a complete token sequence into individual interaction chunks, with each chunk covering the process between two consecutive environment interactions, forming a complete decision unit.
Taking tool calls as an example, one interaction chunk contains:
- Analysis and reasoning phase: Understanding current state, deciding which tool to call
- Tool invocation phase: Generating correct tool call statements
- Execution trigger phase: Waiting for environment to return execution results
This modeling approach aggregates tokens that collectively influence a single environment interaction into a whole, enabling each optimization objective (interaction chunk) to correspond with the same environment interaction, achieving more accurate credit assignment.
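The following toy sketch shows the segmentation idea: tokens are grouped into chunks whose boundaries are the points where the trajectory hands control to the environment. The boundary representation is an assumption made for illustration.

```python
# Sketch of chunk segmentation: each chunk ends at an environment interaction
# (e.g. a tool call whose result comes back from the sandbox).
from typing import List

def split_into_chunks(tokens: List[str],
                      interaction_boundaries: List[int]) -> List[List[str]]:
    """interaction_boundaries: indices (exclusive) where an environment interaction occurs."""
    chunks, start = [], 0
    for end in interaction_boundaries:
        chunks.append(tokens[start:end])   # reasoning + tool-call tokens up to this interaction
        start = end
    if start < len(tokens):
        chunks.append(tokens[start:])      # trailing tokens after the last interaction
    return chunks

# e.g. a 9-token trajectory with environment interactions after tokens 4 and 7
print(split_into_chunks(list("abcdefghi"), [4, 7]))
```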
Chunk-Level Discounted Return: Solving Long-Tail Trajectory Training Challenges
In traditional reinforcement learning, discount rewards play an important role. However, in LLM RL training, traditional token-based optimization methods struggle to introduce meaningful discount rewards.
Root Cause: A complete trajectory often contains thousands of tokens. A discount factor below 1, raised to the power of the token index, decays exponentially toward zero across that length, so the reward weights of many tokens in the trajectory are reduced so far that they receive essentially no effective gradient updates.
IPA’s Solution: After aggregating optimization objectives from token level to interaction chunk level, reward discount time steps can perfectly align with each actual environment interaction.
Suppose a task has 20 environment interactions (rather than 2000 tokens)—the discount factor’s decay over 20 interactions is reasonable and doesn’t cause early interactions to be excessively down-weighted. This effectively prevents early ineffective operations (like invalid tool calls) from being over-rewarded, encouraging the model to more efficiently learn high-impact interaction steps.
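A small sketch of chunk-level discounted returns under these assumptions (the discount value and reward layout are illustrative): with roughly 20 interactions, even the first chunk keeps a meaningful share of a terminal reward, whereas a per-token discount over thousands of tokens would vanish.

```python
# Sketch of chunk-level discounted returns: the discount is applied per
# environment interaction (per chunk), not per token.
def chunk_discounted_returns(chunk_rewards: list[float], gamma: float = 0.95) -> list[float]:
    returns, running = [], 0.0
    for r in reversed(chunk_rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# With ~20 interactions the first chunk still sees gamma**19 ≈ 0.38 of a final reward;
# applied per token over ~2000 tokens, gamma**1999 would be effectively zero.
print(chunk_discounted_returns([0.0] * 19 + [1.0]))
```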
Chunk-Level Importance Sampling: Stabilizing the Training Process
In reinforcement learning, there’s a discrepancy between training distribution and sampling distribution that needs correction through importance sampling.
IPA proposes a chunk-level importance sampling method: within each interaction chunk, it calculates the ratio of training distribution probability to sampling distribution probability for all tokens, using the geometric mean of these probability ratios to measure chunk-level sampling probability differences.
The advantage of geometric mean is that it weakens the influence of anomalous tokens and avoids extreme ratio values. Combined with chunk-level reward allocation, importance sampling can adjust optimization objectives to compensate for training instability caused by distribution bias.
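A minimal sketch of the chunk-level ratio, assuming per-token log-probabilities under the current and sampling policies are available; the geometric mean is computed in log space for numerical stability.

```python
# Sketch of chunk-level importance sampling: the chunk ratio is the geometric
# mean of per-token probability ratios (new policy / sampling policy), which
# damps the effect of a few outlier tokens.
import math

def chunk_importance_ratio(new_logprobs: list[float], old_logprobs: list[float]) -> float:
    assert new_logprobs and len(new_logprobs) == len(old_logprobs)
    # geometric mean of exp(new - old) == exp(mean of per-token log-ratios)
    mean_log_ratio = sum(n - o for n, o in zip(new_logprobs, old_logprobs)) / len(new_logprobs)
    return math.exp(mean_log_ratio)

print(chunk_importance_ratio([-1.0, -0.5, -2.0], [-1.2, -0.6, -1.5]))
```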
Chunk-Level Initialized Resampling: Learning by Standing on Giants’ Shoulders
This is one of IPA’s most innovative techniques, solving the problem of sparse positive signals in complex multi-turn interaction tasks.
Problem Scenario: In some complex tasks, if the model can’t make stable correct decisions at each key point, task success rate will decline exponentially. For example, in a task requiring 10 operational steps with 90% success rate per step, the final success rate is only 35%. This leads to extremely sparse positive reward signals, making learning difficult for the model.
IPA’s Solution — Chunk-Level Initialized Resampling:
Using interaction chunks from successful reference trajectories (from the model itself or external expert models) as anchor points:
- "Pre-fill" the first half of the trajectory using these interaction chunks and execute the corresponding interactions
- Initialize the environment to intermediate states of these successful trajectories
- The model "resamples" subsequent interaction chunks from the intermediate state and continues interacting
- Complete the entire trajectory and obtain the final reward
This approach lets the model “stand on giants’ shoulders”: leveraging successful trajectories to anchor partial interactions, reducing overall task difficulty while letting the model first learn how to complete later steps, then modify initialization points, ultimately gradually learning to solve the entire task.
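The toy sketch below captures the mechanic under simplified assumptions: a tiny environment that counts correct chunks, a stand-in policy, and a resampling routine that replays the first k chunks of a successful reference trajectory before letting the policy take over. None of these interfaces correspond to the actual ALE/ROCK APIs.

```python
# Toy sketch of chunk-level initialized resampling.
import random

class Env:
    """Toy environment: the task is solved once 6 'correct' chunks have been applied."""
    def __init__(self):
        self.progress = 0
    def step(self, chunk: str) -> None:
        if chunk == "correct":
            self.progress += 1
    def done(self) -> bool:
        return self.progress >= 6
    def reward(self) -> float:
        return 1.0 if self.done() else 0.0

def policy_generate(progress: int) -> str:
    # toy policy: succeeds more often on later steps than on early ones
    return "correct" if random.random() < 0.5 + 0.08 * progress else "wrong"

def resample_from_reference(reference_chunks: list[str], k: int, max_chunks: int = 20):
    env = Env()
    trajectory = list(reference_chunks[:k])
    for chunk in trajectory:                      # "pre-fill": replay reference chunks verbatim
        env.step(chunk)
    while not env.done() and len(trajectory) < max_chunks:
        chunk = policy_generate(env.progress)     # resample from the intermediate state
        env.step(chunk)
        trajectory.append(chunk)
    return trajectory, env.reward()

reference = ["correct"] * 6                       # a known successful trajectory
print(resample_from_reference(reference, k=4))    # anchor on the first 4 reference chunks
```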
Sequential Rollback Strategy:
To determine specific initialization positions in reference trajectories, IPA employs an intelligent rollback strategy:
- Start initialization from the last interaction chunk of the reference trajectory
- Record the success rate of resampled trajectories at that position
- "Roll back" the initialization point to the state before the previous interaction chunk's execution
- When the resampling success rate drops sharply after a rollback, mark the reference interaction chunk that the rollback crossed as a "critical interaction"
- The model stops rolling back, resamples multiple times from the state before that interaction chunk's execution, and learns from those attempts
- After mastering it, the model continues rolling back
This process is like teaching a student to solve math problems: first teach them the last few steps, then gradually work backward, adding one new step each time, until they can solve the problem completely from start to finish.
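Under the same simplified assumptions, the rollback search itself might look like the sketch below: move the initialization point backwards one chunk at a time and stop where the resampling success rate collapses. Here `success_rate` is a stand-in for running resampled rollouts from a given start chunk.

```python
# Sketch of the sequential rollback strategy: a sharp drop in success rate marks
# a "critical interaction" worth focused practice before rolling back further.
def find_critical_chunk(num_chunks: int, success_rate, drop_threshold: float = 0.4) -> int:
    prev = success_rate(num_chunks - 1)            # initialize from the last chunk
    for start in range(num_chunks - 2, -1, -1):    # roll the start point backwards
        cur = success_rate(start)
        if prev - cur > drop_threshold:
            return start                           # the chunk crossed by this rollback is critical
        prev = cur
    return 0                                       # no sharp drop: the whole task is learnable

# toy example: success collapses once the model has to handle chunk 3 on its own
rates = {5: 0.9, 4: 0.85, 3: 0.8, 2: 0.2, 1: 0.15, 0: 0.1}
print(find_critical_chunk(6, lambda i: rates[i]))   # -> 2
```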
Parallelized Initialization:
Considering data characteristics and extreme cases, IPA also supports the model simultaneously resampling from multiple initialization points in reference trajectories, and introduces imitation learning for reference interaction chunks, greatly accelerating training efficiency.
Agent-Native Training: The Design Philosophy of Training as Production
Many Agent training pipelines have a fatal problem: the context organization methods used during training significantly differ from actual agent frameworks, causing model capabilities to degrade in production environments. It’s like practicing in a driving simulator—no matter how good you get, you’ll still have problems when actually driving on the road.
ROME fundamentally solves this problem through Agent-Native Training. Its core philosophy can be summarized as: “ROME isn’t trained in a simulated agent, but directly trains the agent itself in real environments.”
Directly Reusing iFlow CLI’s Complete Execution Logic
During training, ROLL doesn’t use manually rewritten prompt concatenation or simplified Agent scaffolds, but directly calls iFlow CLI to run the real Agent.
This means model inputs include dynamically generated context from iFlow CLI:
- Long context compression strategies
- Dynamic updates of callable tools
- Various system prompts
- Intermediate state management
The input distribution seen during the RL training phase remains consistent with online usage, eliminating training-production distribution shift.
ModelProxy Service: Non-Invasive Training Architecture
To avoid repeatedly implementing Agent logic in the training framework, ROCK introduces ModelProxy Service within sandboxes.
Workflow:
- The Agent calls model interfaces in the usual way within the sandbox
- ModelProxy Service intercepts these requests
- ModelProxy asynchronously forwards them to inference services launched by ROLL
- The inference results are returned to the Agent
Core Advantage: ROLL doesn’t need to perceive Agent’s prompt structure or context management details, yet can still train real Agent behavior. This “non-invasive” design dramatically reduces system coupling and improves flexibility.
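Conceptually, such a proxy can be very thin, as in the sketch below: it accepts the agent's usual model request and forwards it verbatim to an external inference endpoint. The URL and payload shape are assumptions for illustration, not the actual ROCK/ROLL interfaces.

```python
# Conceptual sketch of a ModelProxy-style shim: the agent inside the sandbox
# calls a local "model" endpoint as usual; the proxy forwards the request to the
# trainer's inference service and returns the result unchanged.
import json
import urllib.request

class ModelProxy:
    def __init__(self, inference_url: str):
        self.inference_url = inference_url     # e.g. the ROLL-managed inference service

    def complete(self, request_payload: dict) -> dict:
        """Intercept an agent's model call and forward it unchanged."""
        data = json.dumps(request_payload).encode("utf-8")
        req = urllib.request.Request(
            self.inference_url,
            data=data,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

# Hypothetical usage: the agent never knows the call left the sandbox.
# proxy = ModelProxy("http://localhost:8000/v1/completions")
# print(proxy.complete({"prompt": "...", "max_tokens": 64}))
```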
Unified Execution Chain: Training, Distillation, and Evaluation as One
Since the training stage directly runs the real Agent, data synthesis, reinforcement learning, distillation, and evaluation can all reuse the same execution and environment-interaction logic.
Engineering Value:
- Significantly reduces Agentic RL engineering complexity
- Ensures no behavioral drift between different stages
- Provides a unified interface for ablation experiments and Agent framework switching (supports iFlow CLI, SweAgent, OpenHands, etc.)
This design ensures highly consistent model behavior across training, evaluation, and real deployment stages, avoiding the common problem of “one approach for training, another for deployment.”
How to Access and Use ROME: A Practical Guide
The ROME model is already integrated into iFlow CLI and available for use. Let's see how to use this AI engineering assistant in real projects.
Installing iFlow CLI
For Mac systems:
```bash
bash -c "$(curl -fsSL http://cloud.iflow.cn/iflow-cli/install.sh)"
```
After installation, simply type `iflow` in your terminal to launch it.
Selecting the ROME Model
Within iFlow CLI, you can choose to use the ROME model for task execution. The system automatically allocates appropriate model configurations based on task complexity.
Typical Application Scenarios
1. Code Analysis and Repair
When you have a codebase with bugs, you can directly ask ROME to help locate and fix them:
```
Analyze this Python project, find the bug causing test failures and fix it
```
ROME will:
- Read project structure and related code
- Run tests and analyze failure causes
- Locate bug positions
- Generate fix solutions
- Verify tests pass after fixes
2. Automated Test Generation
```
Generate complete unit tests for this API interface
```
ROME will understand the API’s functionality, parameters, and return values, automatically generating test cases covering normal scenarios and edge cases.
3. Project Refactoring and Optimization
```
Refactor this module to improve code readability and performance
```
ROME won’t just provide refactoring suggestions—it will actually execute the refactoring, run tests to ensure functionality remains unchanged, and generate refactoring reports.
What ROME’s Performance Metrics Really Mean
What does ROME’s 57.40% completion rate on SWE-bench Verified actually mean? This benchmark includes real GitHub project issues, requiring the model to:
- Understand issue descriptions
- Locate relevant code
- Understand code logic
- Generate fix solutions
- Run tests for verification
- Handle possible failure retries
A 57.40% completion rate means ROME can successfully complete over half of real software engineering tasks—a remarkably high level in the AI Agent field.
Frequently Asked Questions
What’s the main difference between ROME and general-purpose LLMs like GPT-4 or Claude?
The core difference lies in the training paradigm. General-purpose LLMs are primarily trained on static text, optimizing for “generating correct text.” ROME is trained in real executable environments, optimizing for “completing tasks in real environments.”
Specific manifestations:
- General models might generate code that looks correct but can't guarantee it will run
- Every operation ROME generates is validated by the environment, ensuring executability
- General models easily "lose the thread" in long-chain tasks; ROME optimizes long-task processing through the IPA algorithm
Why can 30B-parameter ROME approach 100B+ model performance?
This is primarily due to three factors:
1. Training Data Quality: Over 1 million interaction trajectories with real environment feedback, each trajectory verified through execution—data quality far exceeds ordinary text corpora.
2. Training Method Specificity: Three-stage curriculum training, IPA algorithm, etc., are all specifically designed for Agent tasks, with higher training efficiency.
3. Agent-Native Design: Training and deployment use the same execution chain, with no capability loss.
What application scenarios is ROME suitable for? What scenarios is it not suitable for?
Suitable Scenarios:
- Software engineering tasks: code analysis, bug fixes, test generation
- Terminal operation automation: batch file processing, system configuration
- Complex multi-step workflows: tasks requiring environment interaction
Unsuitable Scenarios:
- Pure creative writing: ROME's strength is execution, not creativity
- Simple Q&A: Using ROME for simple questions is "overkill"
- Tasks requiring massive knowledge: ROME focuses on execution capability rather than breadth of knowledge
What hardware environment is needed to use ROME?
As a user, you only need a terminal environment capable of running iFlow CLI. The ROME model runs in the cloud, so users don’t need high-performance hardware.
For teams wanting to deploy ROME themselves:
- The 30B MoE model requires corresponding GPU resources
- The specific configuration depends on concurrent request volume and response time requirements
- The team recommends first trying the cloud version, then considering private deployment after evaluating effectiveness
Does ROME’s training data include private code?
ROME’s training is based on real GitHub projects, using only public code repositories. If you use iFlow CLI to process private projects, that data won’t be used for model training.
The team explicitly states that the ALE system supports building dedicated Agents in local or private environments, allowing training with your own private data to ensure data security.
What’s the future development roadmap for ROME?
The team states they will follow the training pipeline already validated by ALE, systematically:
- Expand environment scale: support more types of development environments and toolchains
- Increase task complexity: handle more complex multi-step, multi-module tasks
- Model iteration: release more powerful ROME versions
- Lower usage barriers: make it easy for individual developers and small teams to use
The team emphasizes: “ROME is just the beginning.”
Technical Insights: Future Directions for AI Agents
ROME’s emergence provides important insights for the AI Agent field.
From “Armchair Strategizing” to “Real Combat”
In the past, we evaluated an AI model’s coding capabilities mainly by whether it could generate syntactically correct, logically clear code. But in real engineering, that’s far from enough. Code needs to:
- Run in specific environments
- Integrate with existing codebases
- Pass test verification
- Handle edge cases and exceptions
ROME demonstrates a new path: don’t stop at the text generation level, but dive deep into real execution environments, letting models learn and evolve through actual feedback.
The Importance of Training Infrastructure
ROME’s success is 50% due to model design and 50% due to ALE, this complete training infrastructure. The coordination of components like ROCK, ROLL, and iFlow CLI constructs an end-to-end closed-loop system.
This tells us: to train truly useful AI Agents, we can’t just focus on model architecture—we need to invest equal or even more effort in building training infrastructure.
The Power of Open Source Ecosystems
Alibaba’s team chose to release ALE as open-source infrastructure to lower the barriers to using and iterating on Agentic LLMs, enabling more individual developers and teams to build their own Agents.
This open attitude accelerates the entire field’s development. As more developers share cases and innovative designs in the iFlow CLI forum, the entire community’s Agent capabilities improve.
The release of ROME-V0.1 marks an important step for AI Agents from “proof of concept” to “production ready.” It’s not a race for maximum parameter scale, but a systematic exploration of the core question: “How to train AI Agents that can truly get work done?”
Through environment-centric data construction, three-stage curriculum training, IPA reinforcement learning algorithm, and Agent-Native training design, ROME demonstrates a clear and viable technical path. More importantly, the open-sourcing of the entire ALE ecosystem makes this path a public asset that the entire community can explore and improve.
The next time you invoke iFlow CLI in your terminal and ask ROME to help you complete a complex engineering task, this carefully designed training system is operating behind the scenes. And this is just the beginning.

