From Clicking to Coding: How CoAct-1 Teaches Your Computer to Actually Understand You

Imagine telling your laptop, “Resize every photo on my desktop to 512 × 512 and zip them before I grab my coffee.”
Traditional automation tools would obediently open each file, click through menus, and—twenty minutes later—still be working.
CoAct-1, a new research prototype, finishes the same job in seconds by deciding when to write a quick script and when to click the interface like a human.
Below you’ll learn exactly how it works, how well it performs, and what limits still remain—no hype, just facts.


Table of Contents

  1. Quick-Start: What problem does CoAct-1 solve?
  2. The Three-Team Setup: Orchestrator, Programmer, GUI Operator
  3. Walk-Through: From One Sentence to Finished Task
  4. Real Numbers: 369 OSWorld Benchmark Tasks
  5. Known Limitations (and why they matter)
  6. FAQ: Ten Questions Beginners Ask
  7. Next Steps for Curious Users

1. Quick-Start: What Problem Does CoAct-1 Solve?

  • Resize 200 images
    Traditional GUI agent: opens each file → clicks resize → saves → repeats
    CoAct-1: the Programmer writes a short Python script and runs it once
  • Create a Thunderbird filter
    Traditional GUI agent: navigates menus and fills forms by hand
    CoAct-1: the GUI Operator opens the window; the Programmer pastes a ready-made filter rule
  • Export Excel charts to PowerPoint and save as PDF
    Traditional GUI agent: manual copy-paste across apps
    CoAct-1: the Orchestrator splits the job; a script handles the Excel export, the GUI Operator handles PowerPoint's save dialog

Key takeaway
CoAct-1 chooses the fastest, most reliable method for every sub-task instead of forcing everything into mouse clicks.


2. The Three-Team Setup: Orchestrator, Programmer, GUI Operator

CoAct-1 is not a single model; it is a multi-agent system where each member keeps its own short-term memory and communicates only through concise notes and screenshots.

  • Orchestrator ("The Planner")
    Core skill: breaks big requests into steps and decides who should act next
    Typical output: a task list, updated every round
  • Programmer ("The Coder")
    Core skill: writes and runs Python or Bash scripts
    Typical output: .py or .sh files, terminal logs
  • GUI Operator ("The Clicker")
    Core skill: sees the screen like a human, clicks buttons, fills forms
    Typical output: click sequences, keyboard shortcuts

They never share long chat histories; they only exchange current status + screenshot. This keeps each agent focused and limits confusion.
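The note passed between agents can be pictured as a small structured record. The sketch below is purely illustrative; the field names are invented, not the paper's actual interface:

```python
from dataclasses import dataclass

# Illustrative per-round note one agent hands to the next.
# Field names are assumptions for illustration, not from the paper.
@dataclass
class AgentNote:
    sender: str           # "orchestrator", "programmer", or "gui_operator"
    status: str           # short summary of what just happened
    screenshot_path: str  # path to the latest screen capture

note = AgentNote(
    sender="programmer",
    status="resize_and_zip.py ran without errors",
    screenshot_path="/tmp/screen_0042.png",
)
```

Because only this compact note travels between rounds, no agent has to re-read a long transcript to decide its next move.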


3. Walk-Through: From One Sentence to Finished Task

Let’s follow the exact pipeline for the user request:

“Resize every JPG on my desktop to 512 × 512, zip them, and email the archive.”

Step 1 – Planning (Orchestrator)

Sub-task 1  Find all .jpg files on the desktop  
Sub-task 2  Resize each to 512 × 512  
Sub-task 3  Zip the results  
Sub-task 4  Open Thunderbird, compose email, attach zip

Step 2 – Assignment

  • Sub-tasks 1–3 → Programmer (script is faster)
  • Sub-task 4 → GUI Operator (email client must be clicked)
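The assignment step can be sketched as a tiny routing rule. The verb list below is an invented heuristic for illustration only; the real Orchestrator is a language model reasoning over context, not a keyword matcher:

```python
# Toy routing rule in the spirit of the assignment step above.
# The verb list is an invented heuristic, not the paper's policy.
SCRIPTABLE_VERBS = ("find", "resize", "zip", "convert", "rename")

def assign(subtask: str) -> str:
    """Route a sub-task to the agent likely to finish it fastest."""
    first_verb = subtask.lower().split()[0]
    if first_verb in SCRIPTABLE_VERBS:
        return "programmer"    # a script is faster and more reliable
    return "gui_operator"      # needs clicks in a real window

plan = [
    "Find all .jpg files on the desktop",
    "Resize each to 512 x 512",
    "Zip the results",
    "Open Thunderbird, compose email, attach zip",
]
assignments = {task: assign(task) for task in plan}
```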

Step 3 – Execution

  1. Programmer creates resize_and_zip.py

    from PIL import Image
    import os, zipfile
    
    desktop = os.path.expanduser("~/Desktop")
    out_dir = os.path.join(desktop, "resized")
    os.makedirs(out_dir, exist_ok=True)
    
    # Resize every JPG on the desktop into the "resized" folder
    for file in os.listdir(desktop):
        if file.lower().endswith(".jpg"):
            with Image.open(os.path.join(desktop, file)) as img:
                img.resize((512, 512)).save(os.path.join(out_dir, file))
    
    # Bundle the resized copies into a single archive on the desktop
    with zipfile.ZipFile(os.path.join(desktop, "pics.zip"), "w") as z:
        for file in os.listdir(out_dir):
            z.write(os.path.join(out_dir, file), arcname=file)
    
  2. GUI Operator performs clicks

    • Start menu → type “Thunderbird” → Enter
    • New message → enter recipient → attach pics.zip → send
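The click sequence can be thought of as a short action trace. The (action, argument) vocabulary and the recipient address below are invented for illustration; a real GUI operator grounds each step in pixel coordinates from screenshots:

```python
# Hypothetical, simplified action trace for sub-task 4.
# The (action, argument) vocabulary is invented for illustration;
# real GUI operators emit low-level mouse and keyboard events.
actions = [
    ("open_app", "Thunderbird"),
    ("click", "New message"),
    ("type_recipient", "alice@example.com"),  # placeholder address
    ("attach", "~/Desktop/pics.zip"),
    ("click", "Send"),
]

def describe(trace):
    """Render an action trace as human-readable log lines."""
    return [f"{action}: {argument}" for action, argument in trace]
```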

Step 4 – Feedback Loop

If any step fails (e.g., Thunderbird not installed), Orchestrator sees the error screenshot and revises the plan:

“GUI Operator, open the browser and log in to webmail instead.”
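The observe-and-replan behavior can be sketched as a loop. The `execute` and `replan` callables stand in for the worker agents and the Orchestrator; nothing here is the paper's actual interface:

```python
# Sketch of the feedback loop described above. "execute" stands in for
# the Programmer/GUI Operator and "replan" for the Orchestrator; both
# are placeholders, not interfaces from the paper.
def run_with_replanning(plan, execute, replan, max_rounds=3):
    """Run sub-tasks, asking for a revised plan whenever some fail."""
    for _ in range(max_rounds):
        failures = [task for task in plan if not execute(task)]
        if not failures:
            return "done"
        plan = replan(failures)  # e.g. swap Thunderbird for webmail
    return "gave up"

# Toy run: the Thunderbird step fails, the replanned webmail step succeeds.
outcomes = {"send via Thunderbird": False}
execute = lambda task: outcomes.get(task, True)
replan = lambda failed: ["send via webmail" for _ in failed]
result = run_with_replanning(["zip photos", "send via Thunderbird"], execute, replan)
```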


4. Real Numbers: 369 OSWorld Benchmark Tasks

The research team tested CoAct-1 on OSWorld, a public benchmark that simulates real desktop environments with 369 tasks covering browsers, Office suites, file managers, and email clients.

  • Overall success rate: 60.76 %, vs. 53.10 % for the previous best system (GTA-1), a gain of 7.66 percentage points.
  • Average steps per task: 10.15, vs. more than 15 (roughly 30 % fewer steps).
  • Operating-system-level tasks: 75 % success (not reported for prior systems).
  • Cross-application tasks: 47.88 % success (not reported for prior systems).
  • Thunderbird email tasks: 66.67 % success (not reported for prior systems).

Observations

  • Scripts excel at file operations and system-level changes.
  • GUI clicks remain essential for visual confirmation or apps without APIs.
  • Fewer steps directly translate into shorter wall-clock time.

5. Known Limitations (and Why They Matter)

The paper highlights three open challenges:

  1. High-level Abstract Instructions
    Example prompt: “Keep the cursor in the VSCode console while debugging.”
    The system must infer the exact setting (debug.focusEditorOnBreak: false) rather than simply clicking menus.

  2. Ambiguous Scope
    Example prompt: “Hide the __pycache__ folder in VSCode.”
    The agent might wrongly edit a global config instead of collapsing the folder view.

  3. Pure GUI Bottlenecks
    Tasks that rely solely on vision—such as solving a CAPTCHA—still hit the ceiling of the GUI Operator’s visual model.


6. FAQ: Ten Questions Beginners Ask

Q1 Is CoAct-1 open-source?
The paper does not mention a public release; all details come from the published study.

Q2 Do I need to train three separate models?
The roles are language-model agents distinguished by their prompts, so you do not record or train anything yourself; the paper does not provide per-role training specifics.

Q3 What operating system is required?
Experiments ran on Ubuntu with X11. Adapting to Windows or macOS requires handling different GUI toolkits and script syntax.

Q4 Could it accidentally delete my files?
Scripts run under the user’s account with normal permissions. The paper mentions dry-run checks, but no sandbox details are given.

Q5 Does it handle non-English file names?
Python and Bash support UTF-8 natively; the GUI Operator’s vision model must still recognize non-English text.
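The scripting side is easy to check with a quick sketch (the file names below are invented examples):

```python
# Python 3 strings are Unicode, so filtering non-English file names
# works the same as ASCII ones. The names here are invented examples.
names = ["照片.JPG", "фото.jpg", "notes.txt"]
jpgs = [n for n in names if n.lower().endswith(".jpg")]
```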

Q6 How is this different from classic RPA tools like UiPath?
Traditional RPA requires manual recording of click paths; CoAct-1 generates them from natural language and can swap to code at any step.

Q7 Will it work on my phone or tablet?
The study focused on desktop environments; mobile interfaces differ significantly and were not tested.

Q8 Can I undo a failed task?
Each agent returns a screenshot and log; Orchestrator can in theory roll back, but the paper does not detail an undo mechanism.

Q9 Can I describe loops in plain English?
Yes—statements like “process every CSV the same way” are converted into for-loops by the Programmer.
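A request like that might expand into code along these lines; the folder layout and the per-file operation (counting data rows) are invented for illustration:

```python
import csv
import glob
import os

# Hypothetical expansion of "process every CSV the same way":
# here the per-file operation is simply counting data rows.
def count_rows(folder):
    """Count non-header rows in every CSV directly inside a folder."""
    counts = {}
    for path in glob.glob(os.path.join(folder, "*.csv")):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        counts[os.path.basename(path)] = max(len(rows) - 1, 0)  # minus header
    return counts
```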

Q10 Is my data sent to the cloud?
All tests were performed on a local virtual machine. Productized versions would need clear data-residency policies.


7. Next Steps for Curious Users

If you’d like to explore similar automation today:

  1. Set up a local Ubuntu VM (or WSL on Windows).
  2. Keep reusable Python snippets for common file operations.
  3. Map frequently used GUI apps with accessibility IDs to reduce visual recognition failures.
  4. Experiment with prompt templates that mimic the Orchestrator’s task-splitting style.

When you finally watch a single line of plain English become both a concise script and a handful of exact clicks, you’ll experience the shift:
Computers stop being dumb click-machines and start acting like helpful junior developers who also know where the “Send” button is.