Exploring OpenCUA: Building Open Foundations for Computer-Use Agents
Have you ever wondered how AI agents can interact with computers just like humans do—clicking buttons, typing text, or navigating apps? That’s the world of computer-use agents (CUAs), and today, I’m diving into OpenCUA, an open-source framework designed to make this technology accessible and scalable. If you’re a developer, researcher, or just someone interested in AI’s role in everyday computing, this post will walk you through what OpenCUA offers, from its datasets and tools to model performance and how to get started. I’ll break it down step by step, answering common questions along the way.
OpenCUA stands for Open Foundations for Computer-Use Agents. It’s a complete framework that includes datasets, tools, models, and benchmarks to help build and study CUAs. These agents use vision-language models to automate tasks on desktops, spanning operating systems like Windows, macOS, and Ubuntu. The framework addresses key challenges in creating open-source CUAs, such as collecting diverse data and training models that can plan and execute actions effectively.
Let’s start with the basics: Why does OpenCUA matter? In a field where many advanced CUA systems are proprietary, OpenCUA provides transparency. It allows researchers to experiment with scaling data and models, study limitations, and explore risks. The project was released on August 13, 2025, with a paper on arXiv (abs/2508.09123) and a project page at opencua.xlang.ai.
This image shows how OpenCUA-7B’s performance scales with data, highlighting the framework’s emphasis on growth through better data and modeling.
What Components Make Up OpenCUA?
If you’re asking yourself, “What exactly is in OpenCUA?”—it’s built around four main parts:
- AgentNet: A large dataset of computer-use tasks.
- AgentNetTool: A tool for capturing human demonstrations.
- AgentNetBench: An offline evaluator for testing models.
- OpenCUA Models: Trained models that generate executable actions.
These elements work together to collect data, process it, train models, and evaluate them. For instance, AgentNet includes over 22,000 tasks across more than 200 applications and websites, making it the first dataset of its scale for desktop environments.
AgentNet: The Dataset Powering It All
You might be thinking, “How do they gather all this data?” AgentNet is collected from real human interactions on personal computers. It covers tasks on Windows, macOS, and Ubuntu, with an average of 18.6 steps per task. This isn’t just simple clicks—tasks often involve complex workflows, like using professional tools or switching between apps.
Here’s a quick comparison of AgentNet with other GUI datasets to give you context:
Dataset | Tasks | Avg. Steps | Environment Type | Personalized Env. | Human Traj. | Dom/AxTree | Video | Inner Monologue |
---|---|---|---|---|---|---|---|---|
AndroidControl | 15283 | 5.5 | Mobile | No | Yes | Yes | No | Short |
AMEX | 2991 | 11.9 | Mobile | No | Yes | No | No | No |
AitW | 2346 | 8.1 | Mobile | No | Yes | Yes | No | No |
AitZ | 1987 | 6.0 | Mobile | No | Yes | No | No | Short |
GUI Odyssey | 7735 | 15.3 | Mobile | No | Yes | No | No | No |
OS-Genesis | 2451 | 6.4 | Mobile & Web | No | No | Yes | No | Short |
WonderBread | 598 | 8.4 | Web | No | Yes | Yes | Yes | No |
AgentTrek | 10398 | 12.1 | Web | No | No | Yes | Yes | Short |
Mind2Web | 2350 | 7.3 | Web | No | Yes | Yes | No | No |
GUIAct | 2482 | 6.7 | Web | No | Yes | Yes | No | No |
AgentNet | 22625 | 18.6 | Desktop | Yes | Yes | Yes | Yes | Long |
AgentNet stands out because it’s desktop-focused, uses personalized environments, and includes long inner monologues (reflective reasoning) for each step.
As seen in this domain distribution chart, AgentNet spans categories like productivity, browsing, and creative tools.
To download AgentNet:
- Install the Hugging Face Hub: `pip install -U huggingface_hub`
- Run: `huggingface-cli download xlangai/AgentNet --repo-type dataset --local-dir ./AgentNet`
For unzipping (e.g., Ubuntu images):
- Merge zips: `zip -s 0 images.zip --out images-full.zip`
- Unzip: `unzip images-full.zip -d path_to_your_target_dir`
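If you prefer to script the download instead of using the CLI, a minimal Python sketch with the same Hugging Face Hub API (same repo ID and target directory as above) would look like this:

```python
from huggingface_hub import snapshot_download

# Fetch the AgentNet dataset repository into ./AgentNet,
# equivalent to the huggingface-cli command above.
snapshot_download(
    repo_id="xlangai/AgentNet",
    repo_type="dataset",
    local_dir="./AgentNet",
)
```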
The data collection involves three steps: demonstrating tasks with AgentNetTool, processing actions, and synthesizing reasoning.
How Does AgentNetTool Work?
If you’ve ever tried recording computer tasks, you know it can be clunky. AgentNetTool changes that. It’s a cross-platform GUI recorder that captures screen videos, mouse/keyboard events, and accessibility trees without interrupting your work. It runs in the background on Windows, macOS, or Ubuntu.
The tool provides an in-browser UI for reviewing and submitting demos. Annotations focus on diversity and complexity—tasks must have at least 15 steps, covering over 100 apps and 190 websites.
After recording, raw data (dense actions like thousands of mouse moves) is processed:
- Action Reduction: Merges signals into PyAutoGUI actions (e.g., combining mouse moves and clicks).
- State-Action Matching: Aligns each action with the last distinct screenshot before it, to avoid leaking future information.
This results in compact trajectories used for training.
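To make action reduction concrete, here is a hypothetical sketch: a burst of raw mouse events is collapsed into a single PyAutoGUI-style click. The event format and helper function are my own illustration, not the project's actual processing code.

```python
# Hypothetical raw event stream: several mouse moves, then a press/release.
raw_events = [
    {"type": "mouse_move", "x": 102, "y": 310},
    {"type": "mouse_move", "x": 305, "y": 372},
    {"type": "mouse_press", "x": 413, "y": 390, "button": "left"},
    {"type": "mouse_release", "x": 413, "y": 390, "button": "left"},
]

def reduce_to_click(events):
    """Collapse a move...press/release run into one compact click action."""
    press = next(e for e in events if e["type"] == "mouse_press")
    return f'pyautogui.click(x={press["x"]}, y={press["y"]}, button="{press["button"]}")'

print(reduce_to_click(raw_events))
# pyautogui.click(x=413, y=390, button="left")
```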
Synthesizing Reflective Reasoning in OpenCUA
One common question is, “How do these models learn to think like humans?” OpenCUA augments data with reflective long Chain-of-Thought (CoT) reasoning. This isn’t just basic steps—it’s detailed inner monologues that include reflection on past actions, error detection, and planning.
The synthesis pipeline:
- Reflector: Checks for errors by comparing before/after screenshots and generates reflections.
- Generator: Creates structured CoT based on history, screenshots, and actions, using visual cues like red markers for coordinates.
- Summarizer: Refines task goals for clarity.
This CoT has three levels:
- L3: Contextual observation of key elements.
- L2: Reflective reasoning on transitions, history, and alternatives.
- L1: Executable action.
Training on this boosts generalization, as models learn to recover from mistakes.
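As a rough illustration of how the three levels fit together (the field names here are mine, not the actual AgentNet schema), one augmented step might look like this:

```python
# Hypothetical record for a single step; illustrates only the L3/L2/L1 layering.
step = {
    "instruction": "Export the open spreadsheet as a PDF",
    "screenshot": "step_07.png",
    "L3_observation": "The File menu is open and an 'Export As' entry is visible.",
    "L2_reflection": (
        "The previous click opened the File menu as intended. "
        "To export as PDF, 'Export As' is the right choice rather than 'Print'."
    ),
    "L1_action": "pyautogui.click(x=212, y=398)",
}
```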
OpenCUA Models: From Training to Execution
Now, let’s talk models. OpenCUA includes several, like OpenCUA-7B and OpenCUA-32B, based on Qwen architectures but modified for better alignment (e.g., replacing M-RoPE with 1D RoPE and using Kimi-VL tokenizer).
Quick Start: Installing and Running Models
Wondering how to get started with OpenCUA models? Here’s a step-by-step guide.
Installation
- Create a Conda environment: `conda create -n opencua python=3.10`
- Activate it: `conda activate opencua`
- Install requirements: `pip install -r requirement.txt`
Download Model Weights
Use Hugging Face:
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="xlangai/OpenCUA-7B",
    local_dir="OpenCUA-7B",
    local_dir_use_symlinks=False,
)
```
Note for Qwen-based models: Use custom tokenizer and chat template; don’t rely on default Transformers or vLLM classes. vLLM support is in progress—stay tuned.
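Because the checkpoints ship their own tokenizer and chat template, loading them with `trust_remote_code=True` is the usual pattern; here is a minimal sketch (the exact Auto classes are an assumption on my part, so check the model card):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True loads the custom tokenizer/model code bundled with
# the checkpoint instead of the default Transformers classes.
model_path = "OpenCUA-7B"  # local directory from snapshot_download above
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```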
GUI Grounding Examples
To test grounding (identifying UI elements):
- Navigate to `./model/inference/`
- Run: `python huggingface_inference.py`
This runs five examples.
Running as a Computer-Use Agent
OpenCUA agents work in environments like OSWorld. The agent perceives via screenshots, generates CoT, and predicts actions.
Command for OSWorld:
```bash
python run_multienv_opencua.py \
    --headless \
    --observation_type screenshot \
    --model OpenCUA-32B \
    --result_dir ./results \
    --test_all_meta_path evaluation_examples/test_all_no_gdrive.json \
    --max_steps 100 \
    --num_envs 30 \
    --coordinate_type qwen25
```
It uses 3 images and L2 CoT by default. Currently, only Hugging Face inference is supported.
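Conceptually, the agent loop behind that command looks like the sketch below. The `env` and `agent` objects are placeholders for whatever environment and inference wrappers you use; only the perceive, reason, act, repeat structure is the point.

```python
def run_episode(env, agent, instruction, max_steps=100):
    """Illustrative perceive-reason-act loop for a screenshot-based CUA.

    `env` must return screenshots and execute PyAutoGUI-style actions;
    `agent` must map (instruction, screenshot, history) to a chain of
    thought plus one executable action. Both are assumed interfaces.
    """
    history = []  # recent screenshots and actions (e.g., the last 3 images)
    for _ in range(max_steps):
        screenshot = env.screenshot()
        thought, action = agent.predict(instruction, screenshot, history)
        if action.startswith("terminate"):
            return action  # terminate('success') or terminate('failure')
        env.execute(action)  # e.g., pyautogui.click(x=..., y=...)
        history.append((screenshot, action))
    return "terminate('failure')"  # step budget exhausted
```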
Performance Insights: How Good Are These Models?
If performance is your main concern, OpenCUA shines on benchmarks.
Online Agent Evaluation on OSWorld-Verified
OpenCUA-32B hits 34.8% success rate at 100 steps, topping open-source models.
Model | 15 Steps | 50 Steps | 100 Steps |
---|---|---|---|
Proprietary | | |
OpenAI CUA | 26.0 | 31.3 | 31.4 |
Seed 1.5-VL | 27.9 | — | 34.1 |
Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
Open-Source | | |
Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
Kimi-VL-A3B | 9.7 | — | 10.3 |
UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
OpenCUA-7B (Ours) | 24.3 | 27.9 | 26.6 |
OpenCUA-32B (Ours) | 29.7 | 34.1 | 34.8 |
Scores are averages from 3 runs. It closes the gap to proprietary models like Claude.
GUI Grounding Performance
On tasks like OSWorld-G and ScreenSpot:
Model | OSWorld-G | ScreenSpot-V2 | ScreenSpot-Pro |
---|---|---|---|
Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 |
Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 |
UI-TARS-72B | 57.1 | 90.3 | 38.1 |
OpenCUA-A3B | 48.6 | 91.4 | 28.5 |
OpenCUA-7B | 45.7 | 88.5 | 23.7 |
OpenCUA-2.5-7B | 55.3 | 92.3 | 50.0 |
OpenCUA-2.5-32B | 59.6 | 93.4 | 55.3 |
Offline Evaluation with AgentNetBench
AgentNetBench compares predicted actions to ground-truth, focusing on coordinates, content, and functions.
Model | Coordinate Actions | Content Actions | Function Actions | Average |
---|---|---|---|---|
Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
OpenAI CUA | 71.7 | 57.3 | 80.0 | 73.1 |
OpenCUA-2.5-7B | 79.0 | 62.0 | 44.3 | 75.2 |
OpenCUA-2.5-32B | 81.9 | 66.1 | 55.7 | 79.1 |
This offline setup speeds up testing by using gold-standard actions.
AgentNetBench: Evaluating Agents Offline
“What if I don’t want to set up a full environment for testing?” AgentNetBench is your answer. It’s an offline evaluator that matches model actions (like click, moveTo, write) against human trajectories. See the README in ./evaluation/agentnetbench/ for details.
The action space includes:
- Click: `click(x, y, button)`
- Mouse Move: `moveTo(x, y)`
- Type: `write(text)`
- Press: `press(key)`
- Terminate: `terminate('success')` or `terminate('failure')`
And more, as listed in the framework.
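To give a feel for what offline matching involves, here is a hypothetical coordinate check: a predicted click counts as correct if it lands close enough to the ground-truth click. The tolerance rule is an assumption of mine; AgentNetBench's actual criteria (for example, element bounding boxes) are documented in its README.

```python
def click_matches(pred_xy, gold_xy, screen_wh, rel_tol=0.05):
    """Hypothetical check: is a predicted click close enough to the gold click?

    rel_tol is a fraction of the screen size; the benchmark's real rule
    may differ.
    """
    (px, py), (gx, gy), (w, h) = pred_xy, gold_xy, screen_wh
    return abs(px - gx) <= rel_tol * w and abs(py - gy) <= rel_tol * h

print(click_matches((950, 530), (940, 525), (1920, 1080)))  # True
```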
Training and Modeling Details
Curious about the training process? OpenCUA uses supervised fine-tuning on state-action pairs augmented with CoT. Key insights:
- Scaling works better with reflective CoT than with plain actions.
- Multi-image history improves context.
- Data mixtures (reasoning + general text) enhance performance.
The framework models tasks as state-action trajectories: Given instruction I, predict actions from screenshots until termination.
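In symbols (my paraphrase of that setup, with three retained screenshots taken from the default mentioned earlier), one step of the policy can be written as:

```latex
% Given instruction I, the most recent screenshots o_{t-2}, o_{t-1}, o_t,
% and the history h_t of prior thoughts and actions, the model produces a
% chain of thought c_t and an executable action a_t, repeating until a_t
% is a terminate(...) action.
(c_t, a_t) \sim \pi_\theta\left( \cdot \mid I,\; o_{t-2:t},\; h_t \right)
```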
Upcoming Features and Acknowledgments
The team is working on vLLM support and open-source training code (currently based on Kimi infrastructure).
Acknowledgments go to collaborators like Yu Su, Caiming Xiong, and Moonshot AI for data and infrastructure. The tool builds on DuckTrack and OpenAdapt.
For research use only—prohibited for illegal activities. Cite the paper if using:
```bibtex
@misc{wang2025opencuaopenfoundationscomputeruse,
  title={OpenCUA: Open Foundations for Computer-Use Agents},
  author={Xinyuan Wang and Bowen Wang and Dunjie Lu and Junlin Yang and Tianbao Xie and Junli Wang and Jiaqi Deng and Xiaole Guo and Yiheng Xu and Chen Henry Wu and Zhennan Shen and Zhuokai Li and Ryan Li and Xiaochuan Li and Junda Chen and Boyuan Zheng and Peihang Li and Fangyu Lei and Ruisheng Cao and Yeqiao Fu and Dongchan Shin and Martin Shin and Jiarui Hu and Yuxiao Ye and Danyang Zhang and Dikang Du and Hao Hu and Huarong Chen and Zaida Zhou and Haotian Yao and Ziwei Chen and Qizheng Gu and Yipu Wang and Heng Wang and Diyi Yang and Victor Zhong and Flood Sung and Y. Charles and Zhilin Yang and Tao Yu},
  year={2025},
  eprint={2508.09123},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.09123},
}
```
FAQ: Answering Your Questions on OpenCUA
Here are some common queries based on what users might ask.
What is a computer-use agent in OpenCUA?
A CUA is an AI that automates desktop tasks using vision-language models. It observes screenshots and predicts actions like clicks or typing.
How does OpenCUA differ from proprietary CUAs?
It’s fully open-source, allowing study of data scaling, risks, and capabilities. Proprietary ones like Claude keep details closed.
Can I use OpenCUA models for my own projects?
Yes, for research/education. Download from Hugging Face: https://huggingface.co/collections/xlangai/opencua-open-foundations-for-computer-use-agents-6882014ebecdbbe46074a68d
Why add reflective CoT?
It teaches planning, error detection, and recovery, and it scales better than training on plain state-action pairs; with it, OpenCUA-32B reaches 34.8% on OSWorld-Verified, far above the single-digit scores of the base Qwen2.5-VL models.
Is AgentNetTool free to use?
Yes, available at https://agentnet-tool.xlang.ai/ with docs.
How to evaluate my own model with AgentNetBench?
Follow the README in ./evaluation/agentnetbench/; it explains how to run the evaluator, which reports match metrics for coordinate, content, and function actions.
What OSes does it support?
Windows, macOS, Ubuntu—data collected across all.
Are there privacy concerns with data collection?
Yes, but multi-layer protection is used, and consent is required.
How scalable is OpenCUA?
Performance improves with data size and model size, as shown in scaling figures.
Can I contribute to OpenCUA?
Check the GitHub: https://github.com/xlang-ai/OpenCUA
Wrapping up, OpenCUA opens doors to building better CUAs. Whether you’re experimenting with datasets or running models, it provides solid foundations. If you try it, share your experiences—it’s all about advancing the field together.