From First Tap to Cross-App Flow: A Practical Guide to Mobile-Agent-v3 and GUI-Owl for Global Developers
Author: A Mobile-Automation Engineer Who Still Gets Excited by Green CI Pipelines
Last Updated: 21 Aug 2025
What You’ll Get from This Post
- A plain-language explanation of GUI-Owl and Mobile-Agent-v3—no PhD required
- Exact installation commands copied from the official repo (they really do work)
- Side-by-side performance numbers you can quote to your manager today
- A step-by-step mini-project you can finish during your next coffee break
1. In One Sentence—What Are These Things?
Name | One-Sentence Explanation | Everyday Analogy |
---|---|---|
GUI-Owl | A 7 B–32 B multimodal vision-language model that looks at any screen and turns words into taps, swipes, or keystrokes. | The intern who never sleeps and always clicks the right button. |
Mobile-Agent-v3 | A multi-agent framework that uses GUI-Owl to break big tasks into small steps, keeps track of progress, and retries when pop-ups appear. | The project manager who writes the task list and updates the Kanban board for you. |
2. Why Should You Care Right Now?
Below are the official benchmark scores released by the authors. If you have ever built UI-automation scripts, you know how hard it is to break the 50 % ceiling on long-horizon tasks.
Benchmark | Mobile-Agent-v3 Score | Notes from the Authors |
---|---|---|
AndroidWorld | 73.3 % | Long multi-screen Android tasks |
OSWorld | 37.7 % | Cross-application desktop tasks |
ScreenSpot-V2 | 95.7 % | Pure UI element grounding |
ScreenSpot-Pro | 90.4 % | High-resolution, dense-control scenarios |
MMBench-GUI L1 | 89.1 % | Everyday app controls |
MMBench-GUI L2 | 86.9 % | Nested or custom widgets |
If your current pipeline (OCR + XPath + brittle sleep statements) sits at ~50 %, the jump to 70 %–90 % is not incremental—it is transformative.
3. How the Pieces Fit Together
The official diagram is busy; here is a distilled view.
User prompt (plain English)
│
▼
Planning Agent (Mobile-Agent-v3)
│
├──> Step list (JSON)
│
▼
GUI-Owl (7 B/32 B)
│
├──> Screenshots + XML
│
▼
Action dispatcher
│
               ├──> adb shell input tap 120 350
               └──> adb shell input text "Jinan"
- Perception happens inside GUI-Owl.
- Planning & memory live inside Mobile-Agent-v3 agents.
- Execution uses whatever backend you give it (ADB, UIAutomation, Selenium, etc.).
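The action dispatcher at the bottom of the diagram can be sketched in a few lines of Python. The action schema below ({"type": ..., "x": ...}) is an assumption for illustration only; the repo defines its own action format.

```python
import shlex
import subprocess

def to_adb_command(action: dict, serial: str = "127.0.0.1:5555") -> list[str]:
    """Translate one model action dict into an adb shell command.

    NOTE: the action schema here is illustrative, not GUI-Owl's exact output.
    """
    base = ["adb", "-s", serial, "shell"]
    if action["type"] == "tap":
        return base + ["input", "tap", str(action["x"]), str(action["y"])]
    if action["type"] == "swipe":
        return base + ["input", "swipe",
                       str(action["x1"]), str(action["y1"]),
                       str(action["x2"]), str(action["y2"])]
    if action["type"] == "text":
        return base + ["input", "text", shlex.quote(action["text"])]
    raise ValueError(f"unsupported action type: {action['type']}")

def dispatch(action: dict, serial: str = "127.0.0.1:5555") -> None:
    # Runs the command on the connected device; requires adb on PATH.
    subprocess.run(to_adb_command(action, serial), check=True)
```

Because the dispatcher is just a thin translation layer, swapping ADB for UIAutomation or Selenium only means replacing `to_adb_command` with an equivalent builder for that backend.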
4. Key Capabilities—Explained Like You’re Five
4.1 GUI-Owl Only
Capability | What It Means in Practice |
---|---|
End-to-end | One model eats the screenshot and spits out an action. |
Cross-platform | Same weights work on Android, iOS, Windows, and macOS. |
Explainable | Outputs intermediate reasoning (“I see a ‘Send’ button at x,y”). |
Small footprint | The 7 B model runs on a single RTX 3090, using ~14 GB of its 24 GB VRAM. |
4.2 Mobile-Agent-v3 Add-Ons
Feature | v2 Had It? | v3 Improvement Example |
---|---|---|
Task-progress memory | No | Remembers “I already filled the address, next is payment.” |
Pop-up handling | Basic | Detects a full-screen ad and taps the close button itself. |
Key-info logging | No | Saves order numbers, prices, etc. for later steps. |
5. Quick Start—Five Commands and a Video
Everything below is quoted verbatim from the repo README; only formatting and comments were added.
5.1 Clone and Install
# 1. Clone the repo
git clone https://github.com/X-PLUG/MobileAgent.git
cd MobileAgent/Mobile-Agent-v3
# 2. Create a clean environment
conda create -n mobile-v3 python=3.10
conda activate mobile-v3
# 3. Install Python dependencies
pip install -r requirements.txt
5.2 Pick a Model
Size | Hugging Face URL | VRAM (FP16) |
---|---|---|
7 B | https://huggingface.co/mPLUG/GUI-Owl-7B | ~14 GB |
32 B | https://huggingface.co/mPLUG/GUI-Owl-32B | ~64 GB |
# Example: 7 B model
huggingface-cli download mPLUG/GUI-Owl-7B --local-dir ./models/gui-owl-7b
5.3 Run the Demo
python demo.py \
--model_path ./models/gui-owl-7b \
--task "Search for travel guides to Jinan on Xiaohongshu, sort by most collected, and bookmark the first note." \
--device android \
--serial 127.0.0.1:5555
When the script starts, it opens a local HTML file (tree_of_thought.html) in your browser so you can watch the agent reason in real time.
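Under the hood, demo.py drives a perceive-plan-act loop. Here is a minimal sketch of that loop with the model, screenshot capture, and executor passed in as plain callables; the interfaces are illustrative assumptions, not the repo's actual API.

```python
def run_task(task: str, model, capture, execute, max_steps: int = 50) -> list:
    """Sketch of a perceive-plan-act loop (interfaces assumed, not the repo's).

    model(task, screen, history) -> action dict; an action with type "done"
    signals task completion.
    """
    history = []
    for _ in range(max_steps):
        screen = capture()                    # perceive: grab the current UI
        action = model(task, screen, history) # plan: GUI-Owl picks the next action
        if action["type"] == "done":
            return history
        execute(action)                       # act: tap / swipe / type
        history.append(action)                # remember what was done
    return history
```

The `max_steps` guard mirrors why long-horizon benchmarks are hard: every extra step is another chance to misread a screen, so the framework caps and retries rather than looping forever.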
6. Common Questions from the Community (with Straight Answers)
Question | Short Answer |
---|---|
Does it only work on Android? | No, the weights are cross-platform; swap the backend to UIAutomation or WebDriver for iOS. |
Do I need root access? | No. ADB over USB or Wi-Fi is enough. |
How is it different from Appium or Airtest? | Appium needs element IDs or XPath; GUI-Owl only needs a screenshot and a sentence. |
Speed? | ~2.3 s per step on an RTX 4090; a 50-step task finishes in under two minutes. |
Fully offline? | Yes, once the model is downloaded. |
Commercial license? | Code is Apache-2.0; check the model weights’ license for commercial use. |
Chinese text garbled? | GUI-Owl was trained with large Chinese corpora; verified on WeChat and Alipay. |
Privacy concerns? | Training data is from public datasets and synthetic renders—no real user screens. |
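To make the "screenshot and a sentence" answer concrete: capturing the model's only required input takes one adb command and no root access, as the FAQ notes. The helper names below are ours, for illustration.

```python
import subprocess

def screencap_cmd(serial: str) -> list[str]:
    """adb command that streams the current screen as PNG bytes (no root needed)."""
    return ["adb", "-s", serial, "exec-out", "screencap", "-p"]

def capture(serial: str, out_path: str = "screen.png") -> str:
    """Save a device screenshot to disk; requires adb on PATH."""
    png = subprocess.run(screencap_cmd(serial), check=True,
                         capture_output=True).stdout
    with open(out_path, "wb") as f:
        f.write(png)
    return out_path
```

Compare this with an Appium setup, where you would first need the app's element IDs or an XPath locator before you could do anything at all.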
7. Real-World Usage Map
Scenario | Example Task | Difficulty | Recommended Model |
---|---|---|---|
Personal automation | Move 100 invoice screenshots into Excel automatically | ★ | 7 B |
Small-team QA | Daily regression of login-checkout-pay flow | ★★ | 7 B |
Accessibility | Voice-controlled money transfer | ★★ | 7 B |
Cross-platform testing | Same script on Android, iOS, and Web | ★★★ | 32 B |
Complex business flow | End-to-end expense reimbursement across 5 apps | ★★★ | 32 B |
8. Deep-Dive Resources (Official Links Only)
- Technical Report (PDF, 23 pages): https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v3/assets/MobileAgentV3_Tech.pdf
- GitHub Repository (with Cookbook): https://github.com/X-PLUG/MobileAgent
- Model Weights:
  - GUI-Owl-7B: https://huggingface.co/mPLUG/GUI-Owl-7B
  - GUI-Owl-32B: https://huggingface.co/mPLUG/GUI-Owl-32B
- Citation BibTeX (copy-paste ready):
@article{ye2025mobileagentv3,
title={Mobile-Agent-v3: Fundamental Agents for GUI Automation},
author={Wang, Junyang and Xu, Haiyang and Jia, Haitao and Zhang, Xi and Yan, Ming and Shen, Weizhou and Zhang, Ji and Huang, Fei and Sang, Jitao},
journal={arXiv preprint},
year={2025}
}
9. Your Next Three Moves
- Today: Clone the repo, run the Xiaohongshu demo, and share the screen recording in your team chat.
- This week: Replace your most boring daily task (e.g., filling the daily report) with a 7 B script and reclaim 10 minutes every day.
- This month: Extend the script to iOS or Windows and experience the “write once, run anywhere” moment.
If you hit a snag, open an issue on GitHub—the maintainers respond quickly. Happy automating!