
Mobile-Agent-v3 & GUI-Owl: Revolutionizing Mobile Automation with 95.7% Accuracy

From First Tap to Cross-App Flow: A Practical Guide to Mobile-Agent-v3 and GUI-Owl for Global Developers

Author: A Mobile-Automation Engineer Who Still Gets Excited by Green CI Pipelines
Last Updated: 21 Aug 2025


What You’ll Get from This Post

  • A plain-language explanation of GUI-Owl and Mobile-Agent-v3—no PhD required
  • Exact installation commands copied from the official repo (they really do work)
  • Side-by-side performance numbers you can quote to your manager today
  • A step-by-step mini-project you can finish during your next coffee break

1. In One Sentence—What Are These Things?

| Name | One-Sentence Explanation | Everyday Analogy |
| --- | --- | --- |
| GUI-Owl | A 7B–32B multimodal vision-language model that looks at any screen and turns words into taps, swipes, or keystrokes. | The intern who never sleeps and always clicks the right button. |
| Mobile-Agent-v3 | A multi-agent framework that uses GUI-Owl to break big tasks into small steps, keeps track of progress, and retries when pop-ups appear. | The project manager who writes the task list and updates the Kanban board for you. |

2. Why Should You Care Right Now?

Below are the official benchmark scores released by the authors. If you have ever built UI-automation scripts, you know how hard it is to break the 50 % ceiling on long-horizon tasks.

| Benchmark | Mobile-Agent-v3 Score | Notes from the Authors |
| --- | --- | --- |
| AndroidWorld | 73.3% | Long multi-screen Android tasks |
| OSWorld | 37.7% | Cross-application desktop tasks |
| ScreenSpot-V2 | 95.7% | Pure UI element grounding |
| ScreenSpot-Pro | 90.4% | High-resolution, dense-control scenarios |
| MMBench-GUI L1 | 89.1% | Everyday app controls |
| MMBench-GUI L2 | 86.9% | Nested or custom widgets |

If your current pipeline (OCR + XPath + brittle sleep statements) sits at ~50 %, the jump to 70 %–90 % is not incremental—it is transformative.


3. How the Pieces Fit Together

The official diagram is busy; here is a distilled view.

User prompt (plain English)
        │
        ▼
Planning Agent (Mobile-Agent-v3)
        │
        ├──> Step list (JSON)
        │
        ▼
GUI-Owl (7B / 32B)  <── Screenshots + XML (reads the current screen at every step)
        │
        ├──> Next action (tap / type / swipe)
        │
        ▼
Action dispatcher
        │
        ├──> adb shell input tap 120 350
        └──> adb shell input text "Jinan"
  • Perception happens inside GUI-Owl.
  • Planning & memory live inside Mobile-Agent-v3 agents.
  • Execution uses whatever backend you give it (ADB, UIAutomation, Selenium, etc.); a minimal dispatcher sketch follows below.
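
To make the "Action dispatcher" box concrete, here is a minimal sketch of what such a dispatcher could look like once the model's output has been parsed into a small action dict. The schema ({"type": "tap", ...}) and the dispatch function are invented for illustration; Mobile-Agent-v3 ships its own dispatcher, so treat this as a mental model rather than the framework's API.

import subprocess

def dispatch(action: dict, serial: str = "127.0.0.1:5555") -> None:
    # Translate one parsed action into an adb call.
    # The action schema here is hypothetical, not Mobile-Agent-v3's real format.
    adb = ["adb", "-s", serial, "shell", "input"]
    if action["type"] == "tap":
        subprocess.run(adb + ["tap", str(action["x"]), str(action["y"])], check=True)
    elif action["type"] == "text":
        # `input text` cannot take literal spaces; %s is the usual adb escape
        subprocess.run(adb + ["text", action["text"].replace(" ", "%s")], check=True)
    elif action["type"] == "swipe":
        subprocess.run(adb + ["swipe", str(action["x1"]), str(action["y1"]),
                              str(action["x2"]), str(action["y2"])], check=True)
    else:
        raise ValueError(f"Unknown action type: {action['type']}")

# The two example actions from the diagram above
dispatch({"type": "tap", "x": 120, "y": 350})
dispatch({"type": "text", "text": "Jinan"})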

4. Key Capabilities—Explained Like You’re Five

4.1 GUI-Owl Only

| Capability | What It Means in Practice |
| --- | --- |
| End-to-end | One model eats the screenshot and spits out an action. |
| Cross-platform | Same weights work on Android, iOS, Windows, and macOS. |
| Explainable | Outputs intermediate reasoning ("I see a 'Send' button at x,y"). |
| Small footprint | The 7B model runs on a single RTX 3090, using ~14 GB of VRAM in FP16. |

4.2 Mobile-Agent-v3 Add-Ons

| Feature | v2 Had It? | v3 Improvement Example |
| --- | --- | --- |
| Task-progress memory | No | Remembers "I already filled the address, next is payment." |
| Pop-up handling | Basic | Detects a full-screen ad and taps the close button itself. |
| Key-info logging | No | Saves order numbers, prices, etc. for later steps. |
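
To make "task-progress memory" and "key-info logging" more tangible, the sketch below shows the kind of state an agent loop could carry between steps. The class and field names are invented for explanation; they are not Mobile-Agent-v3's internal data structures.

from dataclasses import dataclass, field

@dataclass
class TaskMemory:
    # Illustrative only: the state a multi-step agent needs to keep around.
    goal: str
    completed_steps: list[str] = field(default_factory=list)  # e.g. "filled the address"
    key_info: dict[str, str] = field(default_factory=dict)    # order numbers, prices, ...

    def remember(self, step: str, **facts: str) -> None:
        self.completed_steps.append(step)
        self.key_info.update(facts)

memory = TaskMemory(goal="Buy a train ticket to Jinan")
memory.remember("filled the passenger address")
memory.remember("reached the payment page", order_id="20250821-001", price="98.50 CNY")
print(memory.key_info["order_id"])  # still available to later steps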

5. Quick Start—Five Commands and a Video

Everything below is quoted verbatim from the repo README; only formatting and comments were added.

5.1 Clone and Install

# 1. Clone the repo
git clone https://github.com/X-PLUG/MobileAgent.git
cd MobileAgent/Mobile-Agent-v3

# 2. Create a clean environment
conda create -n mobile-v3 python=3.10
conda activate mobile-v3

# 3. Install Python dependencies
pip install -r requirements.txt

5.2 Pick a Model

| Size | Hugging Face URL | VRAM (FP16) |
| --- | --- | --- |
| 7B | https://huggingface.co/mPLUG/GUI-Owl-7B | ~14 GB |
| 32B | https://huggingface.co/mPLUG/GUI-Owl-32B | ~64 GB |

# Example: 7 B model
huggingface-cli download mPLUG/GUI-Owl-7B --local-dir ./models/gui-owl-7b
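
Before wiring up a device, you may want a quick smoke test that the downloaded weights load at all. The snippet below is a minimal sketch that assumes the checkpoint works with the generic Hugging Face AutoProcessor / AutoModelForImageTextToText classes; if the load fails, check the model card or the repo Cookbook for the exact recipe (demo.py handles the loading for you anyway).

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_dir = "./models/gui-owl-7b"  # the --local-dir used above

# Assumption: the checkpoint follows the standard HF vision-language interface.
processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForImageTextToText.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,  # FP16, matching the ~14 GB figure in the table above
    device_map="auto",          # requires `pip install accelerate`
)
print(type(model).__name__, "loaded on", model.device)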

5.3 Run the Demo

python demo.py \
  --model_path ./models/gui-owl-7b \
  --task "Search for travel guides to Jinan on Xiaohongshu, sort by most collected, and bookmark the first note." \
  --device android \
  --serial 127.0.0.1:5555

When the script starts, it opens a local HTML file (tree_of_thought.html) in your browser so you can watch the agent reason in real time.
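
If you are unsure which --serial value to pass, a small wrapper like the one below can discover connected devices via adb devices and launch the demo against the first one. The demo.py flags are exactly the ones shown above; the helper itself is only a convenience sketch.

import subprocess

def connected_serials() -> list[str]:
    # Parse `adb devices` output into a list of ready device serials.
    out = subprocess.run(["adb", "devices"], capture_output=True, text=True, check=True).stdout
    rows = [line.split() for line in out.splitlines()[1:]]
    return [fields[0] for fields in rows if len(fields) >= 2 and fields[1] == "device"]

serials = connected_serials()
if not serials:
    raise SystemExit("No device found; start an emulator or enable USB debugging first.")

subprocess.run([
    "python", "demo.py",
    "--model_path", "./models/gui-owl-7b",
    "--task", "Search for travel guides to Jinan on Xiaohongshu, sort by most collected, and bookmark the first note.",
    "--device", "android",
    "--serial", serials[0],
], check=True)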


6. Common Questions from the Community (with Straight Answers)

| Question | Short Answer |
| --- | --- |
| Does it only work on Android? | No, the weights are cross-platform; swap the backend to UIAutomation or WebDriver for iOS. |
| Do I need root access? | No. ADB over USB or Wi-Fi is enough. |
| How is it different from Appium or Airtest? | Appium needs element IDs or XPath; GUI-Owl only needs a screenshot and a sentence. |
| Speed? | ~2.3 s per step on an RTX 4090; a 50-step task finishes in under two minutes. |
| Fully offline? | Yes, once the model is downloaded. |
| Commercial license? | The code is Apache-2.0; check the model weights' license for commercial use. |
| Chinese text garbled? | No. GUI-Owl was trained on large Chinese corpora; verified on WeChat and Alipay. |
| Privacy concerns? | Training data comes from public datasets and synthetic renders, not real user screens. |
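
For the "ADB over Wi-Fi" answer above, the standard adb recipe is to enable TCP/IP mode once over USB and then connect by IP. The helper below wraps that workflow; the IP address is a placeholder for your own device, and nothing here is specific to Mobile-Agent-v3.

import subprocess

def connect_over_wifi(device_ip: str, port: int = 5555) -> str:
    # Standard adb workflow: run once with the phone attached via USB, then unplug.
    subprocess.run(["adb", "tcpip", str(port)], check=True)               # switch the device to TCP/IP mode
    subprocess.run(["adb", "connect", f"{device_ip}:{port}"], check=True) # connect over the network
    return f"{device_ip}:{port}"                                          # pass this string as --serial

serial = connect_over_wifi("192.168.1.42")  # placeholder IP; use your phone's address
print("Use --serial", serial, "when launching demo.py")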

7. Real-World Usage Map

| Scenario | Example Task | Difficulty | Recommended Model |
| --- | --- | --- | --- |
| Personal automation | Move 100 invoice screenshots into Excel automatically | ★ | 7B |
| Small-team QA | Daily regression of the login-checkout-pay flow | ★★ | 7B |
| Accessibility | Voice-controlled money transfer | ★★ | 7B |
| Cross-platform testing | Same script on Android, iOS, and Web | ★★★ | 32B |
| Complex business flow | End-to-end expense reimbursement across 5 apps | ★★★ | 32B |

8. Deep-Dive Resources (Official Links Only)

  • Technical Report (PDF, 23 pages):
    https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v3/assets/MobileAgentV3_Tech.pdf

  • GitHub Repository (with Cookbook):
    https://github.com/X-PLUG/MobileAgent

  • Model Weights:

    • GUI-Owl-7B: https://huggingface.co/mPLUG/GUI-Owl-7B
    • GUI-Owl-32B: https://huggingface.co/mPLUG/GUI-Owl-32B
  • Citation BibTeX (copy-paste ready):

@article{ye2025mobileagentv3,
  title={Mobile-Agent-v3: Foundamental Agents for GUI Automation},
  author={Wang, Junyang and Xu, Haiyang and Jia, Haitao and Zhang, Xi and Yan, Ming and Shen, Weizhou and Zhang, Ji and Huang, Fei and Sang, Jitao},
  journal={arXiv preprint},
  year={2025}
}

9. Your Next Three Moves

  1. Today: Clone the repo, run the Xiaohongshu demo, and share the screen recording in your team chat.
  2. This week: Replace your most boring daily task (e.g., filling in the daily report) with a GUI-Owl-7B-powered script and reclaim 10 minutes every day.
  3. This month: Extend the script to iOS or Windows and experience the “write once, run anywhere” moment.

If you hit a snag, open an issue on GitHub—the maintainers respond quickly. Happy automating!
