Exploring OpenCUA: Building Open Foundations for Computer-Use Agents
Have you ever wondered how AI agents can interact with computers just like humans do—clicking buttons, typing text, or navigating apps? That’s the world of computer-use agents (CUAs), and today, I’m diving into OpenCUA, an open-source framework designed to make this technology accessible and scalable. If you’re a developer, researcher, or just someone interested in AI’s role in everyday computing, this post will walk you through what OpenCUA offers, from its datasets and tools to model performance and how to get started. I’ll break it down step by step, answering common questions along the way.
OpenCUA stands for Open Foundations for Computer-Use Agents. It’s a complete framework that includes datasets, tools, models, and benchmarks to help build and study CUAs. These agents use vision-language models to automate tasks on desktops, spanning operating systems like Windows, macOS, and Ubuntu. The framework addresses key challenges in creating open-source CUAs, such as collecting diverse data and training models that can plan and execute actions effectively.
Let’s start with the basics: Why does OpenCUA matter? In a field where many advanced CUA systems are proprietary, OpenCUA provides transparency. It allows researchers to experiment with scaling data and models, study limitations, and explore risks. The project was released on August 13, 2025, with a paper on arXiv (abs/2508.09123) and a project page at opencua.xlang.ai.
This image shows how OpenCUA-7B’s performance scales with data, highlighting the framework’s emphasis on growth through better data and modeling.
What Components Make Up OpenCUA?
If you’re asking yourself, “What exactly is in OpenCUA?”—it’s built around four main parts:
- AgentNet: A large dataset of computer-use tasks.
- AgentNetTool: A tool for capturing human demonstrations.
- AgentNetBench: An offline evaluator for testing models.
- OpenCUA Models: Trained models that generate executable actions.
These elements work together to collect data, process it, train models, and evaluate them. For instance, AgentNet includes over 22,000 tasks across more than 200 applications and websites, making it the first dataset of its scale for desktop environments.
AgentNet: The Dataset Powering It All
You might be thinking, “How do they gather all this data?” AgentNet is collected from real human interactions on personal computers. It covers tasks on Windows, macOS, and Ubuntu, with an average of 18.6 steps per task. This isn’t just simple clicks—tasks often involve complex workflows, like using professional tools or switching between apps.
Here’s a quick comparison of AgentNet with other GUI datasets to give you context:
Dataset | Tasks | Avg. Steps | Environment Type | Personalized Env. | Human Traj. | Dom/AxTree | Video | Inner Monologue |
---|---|---|---|---|---|---|---|---|
AndroidControl | 15283 | 5.5 | Mobile | No | Yes | Yes | No | Short |
AMEX | 2991 | 11.9 | Mobile | No | Yes | No | No | No |
AitW | 2346 | 8.1 | Mobile | No | Yes | Yes | No | No |
AitZ | 1987 | 6.0 | Mobile | No | Yes | No | No | Short |
GUI Odyssey | 7735 | 15.3 | Mobile | No | Yes | No | No | No |
OS-Genesis | 2451 | 6.4 | Mobile & Web | No | No | Yes | No | Short |
WonderBread | 598 | 8.4 | Web | No | Yes | Yes | Yes | No |
AgentTrek | 10398 | 12.1 | Web | No | No | Yes | Yes | Short |
Mind2Web | 2350 | 7.3 | Web | No | Yes | Yes | No | No |
GUIAct | 2482 | 6.7 | Web | No | Yes | Yes | No | No |
AgentNet | 22625 | 18.6 | Desktop | Yes | Yes | Yes | Yes | Long |
AgentNet stands out because it’s desktop-focused, uses personalized environments, and includes long inner monologues (reflective reasoning) for each step.
As seen in this domain distribution chart, AgentNet spans categories like productivity, browsing, and creative tools.
To download AgentNet:
- Install the Hugging Face Hub: `pip install -U huggingface_hub`
- Run: `huggingface-cli download xlangai/AgentNet --repo-type dataset --local-dir ./AgentNet`
For unzipping (e.g., Ubuntu images):
- Merge zips: `zip -s 0 images.zip --out images-full.zip`
- Unzip: `unzip images-full.zip -d path_to_your_target_dir`
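If you prefer to script the download instead of using the CLI, a minimal Python sketch with the same Hugging Face Hub API (same repo ID and target directory as above) would look like this:

```python
from huggingface_hub import snapshot_download

# Fetch the AgentNet dataset repository into ./AgentNet,
# equivalent to the huggingface-cli command above.
snapshot_download(
    repo_id="xlangai/AgentNet",
    repo_type="dataset",
    local_dir="./AgentNet",
)
```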
The data collection involves three steps: demonstrating tasks with AgentNetTool, processing actions, and synthesizing reasoning.
How Does AgentNetTool Work?
If you’ve ever tried recording computer tasks, you know it can be clunky. AgentNetTool changes that. It’s a cross-platform GUI recorder that captures screen videos, mouse/keyboard events, and accessibility trees without interrupting your work. It runs in the background on Windows, macOS, or Ubuntu.
The tool provides an in-browser UI for reviewing and submitting demos. Annotations focus on diversity and complexity—tasks must have at least 15 steps, covering over 100 apps and 190 websites.
After recording, raw data (dense actions like thousands of mouse moves) is processed:
- Action Reduction: Merges signals into PyAutoGUI actions (e.g., combining mouse moves and clicks).
- State-Action Matching: Aligns each action with the last distinct screenshot before it, to avoid leaking future information.
This results in compact trajectories used for training.
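To make action reduction concrete, here is a hypothetical sketch: a burst of raw mouse events is collapsed into a single PyAutoGUI-style click. The event format and helper function are my own illustration, not the project's actual processing code.

```python
# Hypothetical raw event stream: several mouse moves, then a press/release.
raw_events = [
    {"type": "mouse_move", "x": 102, "y": 310},
    {"type": "mouse_move", "x": 305, "y": 372},
    {"type": "mouse_press", "x": 413, "y": 390, "button": "left"},
    {"type": "mouse_release", "x": 413, "y": 390, "button": "left"},
]

def reduce_to_click(events):
    """Collapse a move...press/release run into one compact click action."""
    press = next(e for e in events if e["type"] == "mouse_press")
    return f'pyautogui.click(x={press["x"]}, y={press["y"]}, button="{press["button"]}")'

print(reduce_to_click(raw_events))
# pyautogui.click(x=413, y=390, button="left")
```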
Synthesizing Reflective Reasoning in OpenCUA
One common question is, “How do these models learn to think like humans?” OpenCUA augments data with reflective long Chain-of-Thought (CoT) reasoning. This isn’t just basic steps—it’s detailed inner monologues that include reflection on past actions, error detection, and planning.
The synthesis pipeline:
- Reflector: Checks for errors by comparing before/after screenshots and generates reflections.
- Generator: Creates structured CoT based on history, screenshots, and actions, using visual cues like red markers for coordinates.
- Summarizer: Refines task goals for clarity.
This CoT has three levels:
- L3: Contextual observation of key elements.
- L2: Reflective reasoning on transitions, history, and alternatives.
- L1: Executable action.
Training on this boosts generalization, as models learn to recover from mistakes.
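As a rough illustration of how the three levels fit together (the field names here are mine, not the actual AgentNet schema), one augmented step might look like this:

```python
# Hypothetical record for a single step; illustrates only the L3/L2/L1 layering.
step = {
    "instruction": "Export the open spreadsheet as a PDF",
    "screenshot": "step_07.png",
    "L3_observation": "The File menu is open and an 'Export As' entry is visible.",
    "L2_reflection": (
        "The previous click opened the File menu as intended. "
        "To export as PDF, 'Export As' is the right choice rather than 'Print'."
    ),
    "L1_action": "pyautogui.click(x=212, y=398)",
}
```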
OpenCUA Models: From Training to Execution
Now, let’s talk models. OpenCUA includes several, like OpenCUA-7B and OpenCUA-32B, based on Qwen architectures but modified for better alignment (e.g., replacing M-RoPE with 1D RoPE and using Kimi-VL tokenizer).
Quick Start: Installing and Running Models
Wondering how to get started with OpenCUA models? Here’s a step-by-step guide.
Installation
- Create a Conda environment: `conda create -n opencua python=3.10`
- Activate it: `conda activate opencua`
- Install requirements: `pip install -r requirement.txt`
Download Model Weights
Use Hugging Face:
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="xlangai/OpenCUA-7B",
    local_dir="OpenCUA-7B",
    local_dir_use_symlinks=False,
)
```
Note for Qwen-based models: Use custom tokenizer and chat template; don’t rely on default Transformers or vLLM classes. vLLM support is in progress—stay tuned.
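Because the checkpoints ship their own tokenizer and chat template, loading them with `trust_remote_code=True` is the usual pattern; here is a minimal sketch (the exact Auto classes are an assumption on my part, so check the model card):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True loads the custom tokenizer/model code bundled with
# the checkpoint instead of the default Transformers classes.
model_path = "OpenCUA-7B"  # local directory from snapshot_download above
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```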
GUI Grounding Examples
To test grounding (identifying UI elements):
- Navigate to `./model/inference/`
- Run: `python huggingface_inference.py`
This runs five examples.
Running as a Computer-Use Agent
OpenCUA agents work in environments like OSWorld. The agent perceives via screenshots, generates CoT, and predicts actions.
Command for OSWorld:
```bash
python run_multienv_opencua.py \
    --headless \
    --observation_type screenshot \
    --model OpenCUA-32B \
    --result_dir ./results \
    --test_all_meta_path evaluation_examples/test_all_no_gdrive.json \
    --max_steps 100 \
    --num_envs 30 \
    --coordinate_type qwen25
```
It uses 3 images and L2 CoT by default. Currently, only Hugging Face inference is supported.
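Conceptually, the agent loop behind that command looks like the sketch below. The `env` and `agent` objects are placeholders for whatever environment and inference wrappers you use; only the perceive, reason, act, repeat structure is the point.

```python
def run_episode(env, agent, instruction, max_steps=100):
    """Illustrative perceive-reason-act loop for a screenshot-based CUA.

    `env` must return screenshots and execute PyAutoGUI-style actions;
    `agent` must map (instruction, screenshot, history) to a chain of
    thought plus one executable action. Both are assumed interfaces.
    """
    history = []  # recent screenshots and actions (e.g., the last 3 images)
    for _ in range(max_steps):
        screenshot = env.screenshot()
        thought, action = agent.predict(instruction, screenshot, history)
        if action.startswith("terminate"):
            return action  # terminate('success') or terminate('failure')
        env.execute(action)  # e.g., pyautogui.click(x=..., y=...)
        history.append((screenshot, action))
    return "terminate('failure')"  # step budget exhausted
```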
Performance Insights: How Good Are These Models?
If performance is your main concern, OpenCUA shines on benchmarks.
Online Agent Evaluation on OSWorld-Verified
OpenCUA-32B hits 34.8% success rate at 100 steps, topping open-source models.
Model | 15 Steps | 50 Steps | 100 Steps |
---|---|---|---|
Proprietary | | |
OpenAI CUA | 26.0 | 31.3 | 31.4 |
Seed 1.5-VL | 27.9 | — | 34.1 |
Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
Open-Source | | |
Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
Kimi-VL-A3B | 9.7 | — | 10.3 |
UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
OpenCUA-7B (Ours) | 24.3 | 27.9 | 26.6 |
OpenCUA-32B (Ours) | 29.7 | 34.1 | 34.8 |
Scores are averages from 3 runs. It closes the gap to proprietary models like Claude.
GUI Grounding Performance
On tasks like OSWorld-G and ScreenSpot:
Model | OSWorld-G | ScreenSpot-V2 | ScreenSpot-Pro |
---|---|---|---|
Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 |
Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 |
UI-TARS-72B | 57.1 | 90.3 | 38.1 |
OpenCUA-A3B | 48.6 | 91.4 | 28.5 |
OpenCUA-7B | 45.7 | 88.5 | 23.7 |
OpenCUA-2.5-7B | 55.3 | 92.3 | 50.0 |
OpenCUA-2.5-32B | 59.6 | 93.4 | 55.3 |
Offline Evaluation with AgentNetBench
AgentNetBench compares predicted actions to ground-truth, focusing on coordinates, content, and functions.
Model | Coordinate Actions | Content Actions | Function Actions | Average |
---|---|---|---|---|
Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
OpenAI CUA | 71.7 | 57.3 | 80.0 | 73.1 |
OpenCUA-2.5-7B | 79.0 | 62.0 | 44.3 | 75.2 |
OpenCUA-2.5-32B | 81.9 | 66.1 | 55.7 | 79.1 |
This offline setup speeds up testing by using gold-standard actions.
AgentNetBench: Evaluating Agents Offline
“What if I don’t want to set up a full environment for testing?” AgentNetBench is your answer. It’s an offline evaluator that matches model actions (like click, moveTo, write) against human trajectories. See the README in ./evaluation/agentnetbench/ for details.
The action space includes:
- Click: `click(x, y, button)`
- Mouse Move: `moveTo(x, y)`
- Type: `write(text)`
- Press: `press(key)`
- Terminate: `terminate('success')` or `terminate('failure')`
And more, as listed in the framework.
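To give a feel for what offline matching involves, here is a hypothetical coordinate check: a predicted click counts as correct if it lands close enough to the ground-truth click. The tolerance rule is an assumption of mine; AgentNetBench's actual criteria (for example, element bounding boxes) are documented in its README.

```python
def click_matches(pred_xy, gold_xy, screen_wh, rel_tol=0.05):
    """Hypothetical check: is a predicted click close enough to the gold click?

    rel_tol is a fraction of the screen size; the benchmark's real rule
    may differ.
    """
    (px, py), (gx, gy), (w, h) = pred_xy, gold_xy, screen_wh
    return abs(px - gx) <= rel_tol * w and abs(py - gy) <= rel_tol * h

print(click_matches((950, 530), (940, 525), (1920, 1080)))  # True
```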
Training and Modeling Details
Curious about the training process? OpenCUA uses supervised fine-tuning on state-action pairs augmented with CoT. Key insights:
- Scaling works better with reflective CoT than with plain actions.
- Multi-image history improves context.
- Data mixtures (reasoning + general text) enhance performance.
The framework models tasks as state-action trajectories: Given instruction I, predict actions from screenshots until termination.
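In symbols (my paraphrase of that setup, with three retained screenshots taken from the default mentioned earlier), one step of the policy can be written as:

```latex
% Given instruction I, the most recent screenshots o_{t-2}, o_{t-1}, o_t,
% and the history h_t of prior thoughts and actions, the model produces a
% chain of thought c_t and an executable action a_t, repeating until a_t
% is a terminate(...) action.
(c_t, a_t) \sim \pi_\theta\left( \cdot \mid I,\; o_{t-2:t},\; h_t \right)
```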
Upcoming Features and Acknowledgments
The team is working on vLLM support and open-source training code (currently based on Kimi infrastructure).
Acknowledgments go to collaborators like Yu Su, Caiming Xiong, and Moonshot AI for data and infrastructure. The tool builds on DuckTrack and OpenAdapt.
For research use only—prohibited for illegal activities. Cite the paper if using:
```bibtex
@misc{wang2025opencuaopenfoundationscomputeruse,
  title={OpenCUA: Open Foundations for Computer-Use Agents},
  author={Xinyuan Wang and Bowen Wang and Dunjie Lu and Junlin Yang and Tianbao Xie and Junli Wang and Jiaqi Deng and Xiaole Guo and Yiheng Xu and Chen Henry Wu and Zhennan Shen and Zhuokai Li and Ryan Li and Xiaochuan Li and Junda Chen and Boyuan Zheng and Peihang Li and Fangyu Lei and Ruisheng Cao and Yeqiao Fu and Dongchan Shin and Martin Shin and Jiarui Hu and Yuxiao Ye and Danyang Zhang and Dikang Du and Hao Hu and Huarong Chen and Zaida Zhou and Haotian Yao and Ziwei Chen and Qizheng Gu and Yipu Wang and Heng Wang and Diyi Yang and Victor Zhong and Flood Sung and Y. Charles and Zhilin Yang and Tao Yu},
  year={2025},
  eprint={2508.09123},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.09123},
}
```
FAQ: Answering Your Questions on OpenCUA
Here are some common queries based on what users might ask.
What is a computer-use agent in OpenCUA?
A CUA is an AI that automates desktop tasks using vision-language models. It observes screenshots and predicts actions like clicks or typing.
How does OpenCUA differ from proprietary CUAs?
It’s fully open-source, allowing study of data scaling, risks, and capabilities. Proprietary ones like Claude keep details closed.
Can I use OpenCUA models for my own projects?
Yes, for research/education. Download from Hugging Face: https://huggingface.co/collections/xlangai/opencua-open-foundations-for-computer-use-agents-6882014ebecdbbe46074a68d
Why add reflective CoT?
It teaches planning, error detection, and recovery, and it scales better than training on plain state-action pairs; with it, OpenCUA-32B reaches 34.8% on OSWorld-Verified, far above the single-digit scores of the base Qwen2.5-VL models.
Is AgentNetTool free to use?
Yes, available at https://agentnet-tool.xlang.ai/ with docs.
How to evaluate my own model with AgentNetBench?
Follow the README in ./evaluation/agentnetbench/; it explains how to run the evaluator, which reports match metrics for coordinate, content, and function actions.
What OSes does it support?
Windows, macOS, Ubuntu—data collected across all.
Are there privacy concerns with data collection?
Yes, but multi-layer protection is used, and consent is required.
How scalable is OpenCUA?
Performance improves with data size and model size, as shown in scaling figures.
Can I contribute to OpenCUA?
Check the GitHub: https://github.com/xlang-ai/OpenCUA
Wrapping up, OpenCUA opens doors to building better CUAs. Whether you’re experimenting with datasets or running models, it provides solid foundations. If you try it, share your experiences—it’s all about advancing the field together.