ARPO: End-to-End Policy Optimization for GUI Agents
In the modern digital era, human-computer interaction methods are continuously evolving, and GUI (Graphical User Interface) agent technology has emerged as a crucial field for enhancing computer operation efficiency. This blog post delves into a novel method called ARPO (Agentic Replay Policy Optimization), which is designed for vision-language-based GUI agents. It aims to tackle the challenge of optimizing performance in complex, long-horizon computer tasks, ushering in a new era for GUI agent development.
The Evolution of GUI Agent Technology
Early GUI agents relied primarily on supervised fine-tuning (SFT), training on large-scale trajectory datasets to imitate human behavior and predict next actions. However, these agents had significant limitations. They lacked the ability to self-correct and were prone to error accumulation in operational trajectories. To overcome these bottlenecks, researchers began exploring the application of reinforcement learning (RL) technology in GUI agent training.
Unlike single-turn reinforcement learning, GUI agents need to perform multi-turn reasoning and decision-making, interacting with dynamic environments that provide visual feedback. Nevertheless, GUI environments pose several challenges for reinforcement learning. Reward signals are often sparse and delayed, with complex tasks yielding no rewards during the early training phases. Additionally, the cost of rollouts in real desktop environments is substantial due to operating system-level delays, which slow down the data collection process. These issues have significantly hindered the application of reinforcement learning in GUI agent training.
The Emergence and Innovation of ARPO
To address the aforementioned challenges, ARPO was introduced. Based on the GRPO (Group Relative Policy Optimization) algorithm, ARPO is a reinforcement learning approach that eliminates the need for an explicit value function or critic. It calculates token-level advantages using group-normalized rewards, making it particularly suitable for large language models (LLMs).
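To make the group-normalization idea concrete, here is a minimal sketch of the advantage computation that GRPO-style methods use. The function name, the epsilon constant, and the toy reward values are illustrative and not taken from the ARPO codebase.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Minimal sketch of GRPO-style advantage estimation.

    `rewards` holds one scalar trajectory reward per rollout in a group
    sampled for the same task. Each reward is normalized by the group's
    mean and standard deviation, and the resulting scalar is broadcast to
    every token of the corresponding trajectory as its advantage, so no
    learned value function or critic is needed.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()
    scale = rewards.std() + eps  # avoid division by zero for uniform groups
    return (rewards - baseline) / scale

# Example: a group of 4 rollouts for one task, only one of which succeeded.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
# The successful rollout gets a positive advantage, the failures negative ones.
# If every reward in the group were identical, all advantages would collapse to ~0.
```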
An End-to-End Reinforcement Learning Framework
ARPO’s GUI agent is built upon the UI-Tars framework and the Qwen2.5-VL architecture. It can process up to 15 image inputs with a model context length of 64K, enough to fit an entire GUI trajectory without truncation. Unlike previous short-context GUI agents that only processed the most recent one or two screenshots, ARPO leverages the full trajectory history. This enables the model to reason over long-term dependencies and optimize performance across the entire interaction sequence.
The agent tokenizes the entire history of screenshots and corresponding actions into the input context of the vision-language model (VLM), so every decision is grounded in the full task context rather than in only the most recent observation.
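As a rough illustration of what "tokenizing the entire history" can look like in practice, the sketch below assembles a multi-modal chat context from past screenshots and actions. The message schema, field names, and the policy of dropping the oldest images once the 15-image budget is exceeded are assumptions for illustration, not the exact UI-Tars/Qwen2.5-VL format.

```python
def build_context(system_prompt, history, current_screenshot, max_images=15):
    """Illustrative sketch: serialize a full GUI trajectory into a VLM chat context.

    `history` is a list of (screenshot, action_text) pairs from earlier steps.
    Only the most recent `max_images` observations keep their screenshot
    (mirroring the 15-image budget mentioned above); older steps retain
    their action text so the model still sees the complete action history.
    """
    messages = [{"role": "system", "content": [{"type": "text", "text": system_prompt}]}]
    # Index of the first step whose screenshot is kept; the current frame always counts.
    first_kept = max(0, len(history) + 1 - max_images)
    for step, (screenshot, action_text) in enumerate(history):
        if step >= first_kept:
            observation = [{"type": "image", "image": screenshot}]
        else:
            observation = [{"type": "text", "text": "(older screenshot omitted)"}]
        messages.append({"role": "user", "content": observation})
        messages.append({"role": "assistant", "content": [{"type": "text", "text": action_text}]})
    # Current observation, for which the model must predict the next action.
    messages.append({"role": "user", "content": [{"type": "image", "image": current_screenshot}]})
    return messages
```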
Integration of Chain-of-Thought (CoT) Technology
To enhance the reasoning capabilities of VLM agents, ARPO incorporates the Chain-of-Thought (CoT) prompting technique. Each action consists of two parts:
- A thinking part, which represents the agent’s internal reasoning.
- A solution part, which executes the resulting action.
This design enables the agent to perform more accurate and interpretable decision-making, as it can carefully consider the reasoning process before executing an action.
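A response in this format has to be split into its two parts before the solution can be executed. The sketch below assumes a simple Thought:/Action: text layout purely for illustration; the actual delimiters used by ARPO's agent may differ.

```python
import re

# Assumed output format for illustration only, e.g.
#   Thought: the "2" should be a subscript, so I will open the Format menu.
#   Action: click(x=512, y=88)
ACTION_PATTERN = re.compile(
    r"Thought:\s*(?P<thinking>.*?)\s*Action:\s*(?P<solution>.*)",
    re.DOTALL,
)

def parse_cot_action(model_output: str):
    """Split a CoT-style response into its thinking and solution parts."""
    match = ACTION_PATTERN.search(model_output)
    if match is None:
        # Fall back to treating the whole output as the action to execute.
        return "", model_output.strip()
    return match.group("thinking").strip(), match.group("solution").strip()

thinking, solution = parse_cot_action(
    "Thought: the superscript button was wrong, undo first.\nAction: hotkey('ctrl', 'z')"
)
```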
Distributed Trajectory Rollout Strategy
Reinforcement learning training for GUI agents requires efficient trajectory collection across rich, interactive desktop environments. To meet this need, ARPO employs a distributed trajectory rollout strategy tailored for parallel interaction with live environments such as OSWorld.
The system establishes multiple rollout workers, each consisting of an interactive environment paired with a GUI agent that maintains a history of screenshots and corresponding actions. Each worker continuously captures screenshots of the current GUI environment and transmits them to a centralized language model inference server powered by VLLM. The policy model processes these batched visual observations in parallel, predicting the next action for all environments simultaneously.
This distributed strategy effectively utilizes GPU resources on the inference server and minimizes per-step decision latency, thereby improving the efficiency of trajectory collection.
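The sketch below outlines one synchronized step of such a rollout loop under simplifying assumptions: `env.capture_screenshot`, `env.step`, `hist.to_prompt`, and `policy_client.generate` are hypothetical interfaces standing in for the real environment, history, and inference-server APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def rollout_step(envs, agent_histories, policy_client):
    """One synchronized step of the distributed rollout loop (illustrative sketch)."""
    # 1. Capture the current screenshot from every environment in parallel,
    #    so OS-level capture latency overlaps across workers.
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        screenshots = list(pool.map(lambda env: env.capture_screenshot(), envs))

    # 2. Build one prompt per environment from its full history plus the new frame,
    #    then run a single batched generation call on the central inference server.
    prompts = [hist.to_prompt(shot) for hist, shot in zip(agent_histories, screenshots)]
    actions = policy_client.generate(prompts)  # batched VLM inference (e.g. a VLLM deployment)

    # 3. Execute each predicted action in its own environment and record the step.
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        results = list(pool.map(lambda pair: pair[0].step(pair[1]), zip(envs, actions)))
    for hist, shot, action in zip(agent_histories, screenshots, actions):
        hist.append(shot, action)
    return results
```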
Experience Replay Buffer
Given how infrequently successful trajectories occur in GUI-based tasks, ARPO introduces an experience replay buffer that caches successful trajectories on a per-task basis. During training, if an entire GRPO group consists solely of failed trajectories, one of them is randomly replaced with a previously stored successful trajectory from the buffer for the corresponding task. This ensures that, once the agent has completed a task successfully at least once, every subsequent GRPO group for that task contains at least one rollout with a non-zero reward signal.
The buffer is dynamically updated during rollouts. To prevent the stored samples from diverging too significantly from the current policy, a fixed-size limit is imposed on the buffer, and the oldest entries are evicted when it is full.
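A minimal sketch of such a buffer is shown below. The class name, the per-task capacity, and the use of a reward threshold to define "success" are assumptions for illustration; only the group-patching rule and the oldest-first eviction follow the description above.

```python
import random
from collections import defaultdict, deque

class SuccessReplayBuffer:
    """Illustrative per-task buffer of successful trajectories.

    Stores a bounded number of successful rollouts per task; when a GRPO
    group for a task contains only failures, one failed rollout is swapped
    for a cached success so the group carries a non-zero reward signal.
    """

    def __init__(self, max_per_task=4):
        # deque(maxlen=...) evicts the oldest entry automatically when full.
        self.buffers = defaultdict(lambda: deque(maxlen=max_per_task))

    def add(self, task_id, trajectory, reward, success_threshold=1.0):
        # The threshold defining "success" is an assumption for this sketch.
        if reward >= success_threshold:
            self.buffers[task_id].append((trajectory, reward))

    def patch_group(self, task_id, group):
        """`group` is a list of (trajectory, reward) rollouts for one task."""
        if all(reward == 0.0 for _, reward in group) and self.buffers[task_id]:
            replaced = random.randrange(len(group))
            group[replaced] = random.choice(list(self.buffers[task_id]))
        return group
```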
Experimental results indicate that models equipped with a replay buffer begin to outperform baseline models around step 30 and maintain a consistent advantage throughout the remainder of the training. By the end of training, models with a replay buffer achieve higher average trajectory rewards (0.75 vs. 0.65), demonstrating that leveraging past successes substantially improves both sample efficiency and overall policy performance in sparse-reward GUI environments.
Task Selection Strategy
ARPO adopts a task filtering procedure to identify a subset of “valuable” tasks that can produce successful trajectories under a baseline agent. Specifically, each task in OSWorld is evaluated using the UI-Tars-1.5 model, with 16 rollouts per task. A task is retained in the GRPO training set if the agent completes it successfully in at least one of these attempts. This method yields a curated set of 128 tasks that are more amenable to early-stage learning, allowing the policy optimization to benefit from informative reward signals.
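In code, the filtering rule reduces to keeping any task with at least one successful baseline rollout. The sketch below assumes a hypothetical `run_episode` helper that executes one rollout with the baseline agent and returns the environment's scalar reward.

```python
def select_trainable_tasks(tasks, run_episode, rollouts_per_task=16):
    """Sketch of the task-filtering rule described above.

    A task is retained for GRPO training if at least one of its baseline
    rollouts succeeds, so every kept task is known to be solvable and can
    contribute reward diversity within a training group.
    """
    selected = []
    for task in tasks:
        # any(...) short-circuits as soon as one rollout earns a positive reward.
        if any(run_episode(task) > 0 for _ in range(rollouts_per_task)):
            selected.append(task)
    return selected
```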
Experimental results show that training on the selected subset of tasks leads to significantly higher average trajectory rewards and faster convergence from the early stages of training. The standard deviation of rewards within GRPO groups is also consistently higher when training on the curated task set. This matters because GRPO relies on within-group reward diversity to compute token-level advantages: when all rewards in a group are identical, the normalized advantages collapse to zero and provide no learning signal.
Experimental Evaluation and Results Analysis
Experimental Setup
Experiments were conducted based on the OSWorld benchmark, a real-computer environment designed for evaluating multi-modal agents on open-ended GUI tasks. OSWorld contains 369 tasks across diverse domains such as office productivity, web browsing, system management, and multi-app workflows. Each task is executed within virtual machines using real applications and evaluated via execution-based scripts.
The evaluation metrics followed the standard rule-based evaluation protocol defined in OSWorld, in which each agent trajectory receives a scalar reward between 0 and 1 from the environment. To provide a stricter assessment of agent capabilities for RL, an additional protocol called OSWorld Hard was introduced, which does not allow the final action of a rollout to be replaced with a FAIL action when the maximum number of steps is reached.
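The difference between the two protocols can be summarized in a few lines. The sketch below is illustrative only; the function name and action representation are assumptions, and the real evaluation harness operates on full OSWorld trajectories.

```python
def finalize_actions(actions, max_steps, hard_mode=False):
    """Sketch of the protocol difference described above (names are illustrative).

    Under the standard OSWorld protocol, a rollout that reaches the step
    limit has its final action replaced with FAIL before evaluation; the
    stricter OSWorld Hard variant skips this replacement, so exhausting
    the step budget is never converted into an explicit FAIL prediction.
    """
    if len(actions) >= max_steps and not hard_mode:
        return actions[:-1] + ["FAIL"]
    return actions
```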
Experimental Results
On the OSWorld benchmark, the ARPO method demonstrated superior performance. When applied to the UI-Tars-1.5 base model, ARPO achieved a success rate of 29.9% on the standard OSWorld setting and 23.8% on the stricter OSWorld Hard variant, absolute improvements of 6.4 and 5.6 percentage points, respectively, over the original UI-Tars-1.5 model. Other model versions exhibited consistent gains as well; for example, UI-Tars-7B-DPO improved from 15.6% to 20.4% with ARPO.
When compared to offline preference optimization methods, ARPO achieved the highest score of 27.3%, outperforming GRPO (26.0%) and other preference-based methods such as KTO (24.6%), DPO (22.4%), and rejection sampling (21.8%). These results indicate that direct trajectory-level optimization with rule-based rewards provides stronger learning signals than offline preference modeling. The addition of experience replay in ARPO further enhances stability and sample efficiency in sparse-reward GUI settings.
Generalization Ability Analysis
To assess the generalization ability of RL training, the performance of models was evaluated on both in-domain and out-of-domain (OOD) tasks. Specifically, 32 tasks from the training task set were selected for reinforcement learning, with the remaining 96 serving as OOD tasks. The results showed that reinforcement learning significantly improved in-domain accuracy. ARPO achieved an accuracy of 81.25%, compared to 68.8% for GRPO and 43.8% for the base UI-Tars-1.5 model. However, on OOD tasks, the improvements were more modest. While the base UI-Tars-1.5 model achieved 55.2%, GRPO slightly underperformed at 52.08%. ARPO, however, recovered generalization capability, scoring 56.3%, slightly above the base model. This indicates that structured trajectory grouping and replay can mitigate overfitting to some extent. Overall, although reinforcement learning effectively enhances the in-domain success rate of VLM agents, strong generalization still depends on broader task diversity, carefully designed reward signals, and larger-scale training resources.
Rollout Efficiency Analysis
The impact of the number of parallel environments on rollout efficiency was also investigated. As the number of parallel environments increased from 8 to 256, the rollout time for a single batch of trajectories grew from 3 minutes to 19 minutes, but the total time to sample all trajectories in an epoch dropped sharply from over 6 hours to approximately 1.2 hours. Put differently, throughput rises from roughly 8/3 ≈ 2.7 trajectories per minute to 256/19 ≈ 13.5, about a fivefold improvement, consistent with the drop in epoch time. This improvement is mainly attributed to two factors: (1) larger batches allow the VLLM server to perform more efficient GPU inference, and (2) OS-level delays in the GUI environments overlap across parallel environments instead of accumulating sequentially. As a result, scaling to 256 environments enables high-throughput rollouts, making RL training in real desktop settings far more practical.
Self-Correction Capability of ARPO-Trained Agents
ARPO-trained agents have demonstrated self-corrective behavior. For example, in a task to change the “2” in “H2O” to a subscript, the agent initially selected the superscript button instead of the subscript button. However, by observing the current screen, the agent realized the mistake and decided to use the Ctrl-Z hotkey to revert the previous operation. Notably, the success rate for this specific task improved from 25% to 62.5% after applying ARPO.
Conclusion and Future Outlook
ARPO, as a reinforcement learning approach, enhances vision-language models with longer input contexts and multi-turn, multi-modal screenshot processing capabilities. It has successfully enabled end-to-end policy optimization in complex GUI environments using rule-based reward signals. Experiments have shown that careful task selection significantly improves learning stability and reward diversity.
This study highlights the potential of combining multi-modal understanding with reinforcement learning to create more adaptive and capable GUI agents. Future research directions include expanding the task set to cover a broader range of real-world applications, further extending the context length of agents to support more sophisticated trial-and-error behaviors, and exploring the use of learned reward models to autonomously evaluate trajectories, thereby reducing reliance on manually crafted reward functions.
In conclusion, ARPO represents a significant advancement in GUI agent technology, offering a promising path for the future development of more intelligent and efficient human-computer interaction systems.