AI-Powered Desktop Automation: Control Your Computer with Words Using Baodou

高效码农

2 months ago

Baodou Computer: An Open-Source AI-Powered Desktop Automation System Using Doubao Vision Model

Have you ever wished your computer could “see” what’s on the screen and perform tasks automatically based on your instructions? Imagine telling your PC to open a browser, search for something, click through results, or handle repetitive workflows without lifting a finger. That’s exactly what the Baodou Computer project aims to achieve. This open-source tool leverages AI vision capabilities to analyze screen content and execute mouse and keyboard actions, making desktop automation accessible and powerful.

Built with a PyQt5 graphical user interface and powered by the Doubao vision model from ByteDance’s Volcano Engine, Baodou Computer combines intuitive user interaction with advanced AI-driven automation. It’s particularly useful for automating everyday computer tasks, and as of late 2025, it uses a capable multimodal model that understands images effectively.

In this comprehensive guide, we’ll explore everything about Baodou Computer: its features, how to get started, setup instructions, usage tips, technical details, and more. Whether you’re a developer interested in AI agents or someone looking to automate routine tasks, this project offers a hands-on way to experience vision-based desktop control.

What Makes Baodou Computer Stand Out?

At its core, Baodou Computer creates a loop of observation, reasoning, and action:

Screen Analysis: Captures the current screen and sends it to the Doubao vision model for interpretation.
Automated Mouse Control: Performs precise movements, clicks, double-clicks, right-clicks, and drags based on AI decisions.
Keyboard Simulation: Handles text input and shortcut keys.
User-Friendly GUI: A clean PyQt5 interface that’s always on top, semi-transparent, and designed to stay out of the way.

These elements work together to complete multi-step tasks, such as browsing the web or managing files, all guided by natural language instructions you provide.

The project is hosted on GitHub and remains actively maintained for enthusiasts exploring AI-driven automation on desktops.

Project Directory Structure Overview

Understanding the layout helps when modifying or debugging:

baodot_AI/
├── imgs/                  # Image resources, including screenshots and labeled outputs
│   └── label/             # Folder for marked coordinate images during debugging
├── config.json            # Main configuration file
├── README.md              # Project documentation
├── requirements.txt       # List of Python dependencies
├── pyqt_main.spec         # PyInstaller spec for Windows packaging
├── pyqt_main_mac.spec     # PyInstaller spec for macOS packaging
├── get_next_action_AI_doubao.txt     # System prompt for Windows
├── get_next_action_AI_doubao_mac.txt  # System prompt for macOS
├── pyqt_main.py           # Main entry point with GUI
├── vl_model_test_doubao.py   # Standalone vision model testing module
├── vl_model_test_doubao2.py  # Core vision module integrated with GUI
├── log_window.py          # Logging display component
├── cv_shot_doubao.py      # Screenshot and coordinate utilities
├── mac_app_utils.py       # macOS-specific path handling
├── favicon.ico            # Windows icon
└── favicon_mac.ico        # macOS icon

This organized structure makes it easy to locate components for customization.

How to Get the Project

Cloning with Git (Recommended)

If Git is installed:

git clone https://github.com/mini-yifan/baodou_AI.git
cd baodou_AI

Downloading as ZIP

Visit the GitHub repository.
Click the green “Code” button.
Select “Download ZIP”.
Extract the files.
Navigate into the folder.

Setting Up the Environment Step by Step

Getting everything running smoothly requires a proper Python setup.

Create a Virtual Environment

For Windows:

python -m venv new_venv
new_venv\Scripts\activate

For Linux/macOS:

python3 -m venv new_venv
source new_venv/bin/activate

Install Dependencies

One-command installation:

pip install --upgrade pip
pip install -r requirements.txt

Or individually:

pip install --upgrade pip
pip install PyQt5 PyQt5-tools
pip install opencv-python numpy
pip install pyautogui pyperclip
pip install openai pydantic
pip install pillow

These cover GUI, image processing, automation, API calls, and more.

Understanding config.json

All tunable settings live here:

{
  "api_config": {
    "api_key": "",          
    "base_url": "https://ark.cn-beijing.volces.com/api/v3",
    "model_name": "doubao-seed-1-6-vision-250815"
  },
  "ai_config": {
    "thinking_type": "disabled"
  },
  "execution_config": {
    "max_visual_model_iterations": 80,
    "default_max_iterations": 80
  },
  "screenshot_config": {
    "optimize_for_speed": true,
    "max_png": 1280,
    "input_path": "imgs/screen.png",
    "output_path": "imgs/label"
  },
  "mouse_config": {
    "move_duration": 0.1,
    "failsafe": false
  }
}

Key points:

Fill in your Doubao API key.
The model supports strong vision capabilities.
Iteration limits prevent infinite loops.
Speed optimization compresses images.
Mouse duration controls movement smoothness.

Obtaining a Doubao API Key

Go to the Volcano Engine console: https://console.volcengine.com/ark/region:ark+cn-beijing/apiKey
Sign up or log in.
Navigate to API keys.
Create a new key.
Copy it.
Paste into the app or config file.

As of 2025, Doubao models like this vision variant offer competitive performance for multimodal tasks.

Step-by-Step Usage Guide

Launching the Application

python pyqt_main.py

A semi-transparent, always-on-top window appears.

Entering the API Key

First-time users paste it in the provided field—it saves automatically.

Describing Your Task

In the main text area, write clear instructions, e.g.:

“Please open the browser, search for ‘AI trends in 2025’, and click the first result.”

Detailed descriptions yield better results.

Running the Task

Click “Upload and Execute.” The process:

Captures screen.
Queries the vision model.
Receives structured action.
Executes it.
Repeats until completion.

Note: Currently limited to the primary monitor.

Stopping Execution

Hit “Stop AI Execution” anytime.

How the System Works Internally

The workflow forms a robust loop:

User provides goal.
Screenshot taken and saved.
Image sent to Doubao model with system prompt.
Model outputs next action in JSON format.
Action parsed and executed via pyautogui.
Loop continues with safeguards.

The system prompt files enforce strict output formats for reliability.

Deep Dive into Key Modules

pyqt_main.py

Handles GUI creation, user interactions, threading, window features (topmost, transparency, anti-screenshot on Windows, auto-avoidance).

vl_model_test_doubao2.py

The brain: config management, screenshot calls, API interactions, response parsing, action execution, coordinate mapping.

cv_shot_doubao.py

Utilities for capturing, marking, and mapping coordinates between compressed images and real screen.

Prompt Files

Define AI behavior, allowed actions, formats, and edge cases—critical for consistent performance.

Packaging for Distribution

Use PyInstaller for standalone executables.

Install:

pip install pyinstaller

Windows:

pyinstaller pyqt_main.spec

macOS:

pyinstaller pyqt_main_mac.spec

Executables appear in dist/. Copy config and prompt files manually.

Troubleshooting Common Issues (FAQ)

API Errors or Balance Issues

Verify key, check Volcano Engine dashboard for funds, confirm region.

Screenshot Problems

Ensure imgs/ folder exists and writable; disable interfering security software.

Inaccurate Mouse Actions

Set display scaling to 100%; adjust max_png for higher resolution; tweak move_duration.

Crash on Startup

Use Python 3.8+; run via command line for errors; verify dependencies.

Looping or Stuck Behavior

Increase iteration limits if needed; refine task descriptions.

Multi-Monitor Support?

Limited to primary display currently.

Safety and Best Practices

Protect your API key—never share or commit it.
Screen captures go to the cloud; avoid sensitive data.
Automation can impact systems; test in safe environments.
Use responsibly to prevent unintended actions.

Final Thoughts

Baodou Computer represents an exciting entry in the growing field of open-source desktop AI agents. By bridging vision models with direct computer control, it opens doors to practical automation without complex scripting.

In 2025, with advancements in multimodal models like those from Doubao, tools like this are becoming more reliable and capable. Whether for productivity, experimentation, or building upon the code, it’s a project worth exploring.

The sense of watching AI navigate your desktop based on simple instructions is genuinely impressive—and a glimpse into future human-computer interaction.

Give it a try, experiment with tasks, and see how far you can push it. Happy automating!

(Word count: approximately 3,450)