
MobiAgent: The Most Practical and Powerful Open-Source Mobile Agent Framework in 2025

As of November 2025, the race to build mobile agents has quietly entered a new stage. While most projects are still showing flashy demos on carefully selected screenshots, a research team from Shanghai Jiao Tong University's IPADS laboratory has open-sourced a complete, production-ready mobile agent system that actually works on real phones: MobiAgent.

This is not another proof-of-concept. It is a full-stack solution that includes specialized foundation models, an acceleration framework that makes the agent faster the more you use it, a brand-new real-world evaluation benchmark, and even a ready-to-install Android app. Most importantly, in blind tests on mainstream Chinese apps, its task success rate significantly surpasses GPT-5, Gemini 2.5 Pro, and the previous state-of-the-art GUI agent, UI-TARS.

This article explains everything you need to know about MobiAgent in plain English: what it is, why it works so well, how the whole system is designed, the latest features released in November 2025, and most importantly — how you can run it on your own phone in minutes.

What Exactly Is MobiAgent?

MobiAgent consists of four tightly integrated core components:

| Component | Purpose | Size / Coverage | Where to Find It |
|---|---|---|---|
| MobiMind | A family of vision-language models fine-tuned specifically for mobile GUI tasks | 3B–7B | https://huggingface.co/IPADS-SAI |
| AgentRR | Record & Replay acceleration framework that makes repeated tasks 2–3× faster | – | agent_rr/ directory |
| MobiFlow | A milestone-DAG-based evaluation benchmark that works on real devices | Covers 20+ popular apps | MobiFlow/ directory |
| Runner + App | One-click execution engine plus the official Android app | – | runner/ and app/ directories |

The design philosophy is simple but extremely effective: split the three hardest parts of mobile interaction — planning, decision-making, and precise grounding — into three dedicated models, cache past experience with AgentRR, and evaluate everything on real phones with MobiFlow.

Why Most Mobile Agents Still Fail in the Real World

If you have played with any of the current mobile agent demos, you have probably noticed the same problems:

  • Low success rate, especially on Chinese apps
  • One wrong step → the rest of the task collapses
  • Every single action is reasoned from scratch → painfully slow
  • Benchmarks are done on static screenshots or emulators, not real phones

MobiAgent was built from the ground up to solve these exact pain points.

MobiMind: A Three-Role Specialized Model Family

| Role | Model Name | Parameters | Responsibility |
|---|---|---|---|
| Planner | Qwen3-4B-Instruct | 4B | Generates high-level task plans |
| Decider | MobiMind-Decider-7B | 7B | Looks at the current screen and decides the next action |
| Grounder | MobiMind-Grounder-3B or MobiMind-Mixed-7B | 3B / 7B | Converts natural-language element descriptions into exact coordinates |

The biggest news from September 2025: the newly released MobiMind-Mixed-7B can serve as both Decider and Grounder at the same time, meaning the entire agent can run inference on a single 80 GB A100.
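To make the division of labor concrete, here is a minimal sketch of one Decider-to-Grounder step against the vLLM endpoints started in the deployment section below. The prompt wording and response parsing are illustrative assumptions; the runner in the runner/ directory handles the real formats.

# Minimal sketch of one Decider -> Grounder step via vLLM's OpenAI-compatible API.
# Prompt wording and response parsing are illustrative, not MobiMind's exact schema.
import base64
from openai import OpenAI

decider = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
grounder = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

def screenshot_as_data_url(path):
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

image_url = screenshot_as_data_url("screen.png")

# 1) Decider: look at the screen and describe the next action in natural language.
decision = decider.chat.completions.create(
    model="IPADS-SAI/MobiMind-Decider-7B",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": "Task: add iPhone 16 Pro to cart. What is the next action?"},
    ]}],
).choices[0].message.content   # e.g. 'tap the "Add to Cart" button'

# 2) Grounder: convert that description into exact screen coordinates.
coords = grounder.chat.completions.create(
    model="IPADS-SAI/MobiMind-Grounder-3B",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": "Locate this element: " + decision},
    ]}],
).choices[0].message.content   # e.g. "(540, 1820)", then executed via ADB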

Real-World Performance (Measured on MobiFlow Benchmark)

| Model Combination | Overall Success Rate | Easy Tasks | Hard Tasks |
|---|---|---|---|
| MobiMind-Decider-7B + Grounder-3B | 86.4% | 94.2% | 78.9% |
| GPT-5 | 71.2% | 88.5% | 55.3% |
| Gemini 2.5 Pro | 73.8% | 91.0% | 58.1% |
| UI-TARS-1.5-7B | 68.7% | 85.4% | 52.6% |

On complex apps such as Meituan, Taobao, and Ctrip, MobiAgent leads by 20–30 percentage points. It almost never falls into infinite loops or fails to terminate properly — something that still plagues even the most advanced closed-source models.

AgentRR: The Secret Behind “The More You Use, The Faster It Gets”

People use their phones in highly repetitive patterns. Why should an agent re-reason the same sequence every time?

AgentRR (Agent Record & Replay) solves this beautifully:

  1. Every complete execution trace (Planner output → Decider reasoning → Grounder coordinates) is saved as a multi-layer experience tree.
  2. A tiny latent memory model (just a few tens of MB) decides in milliseconds whether the current situation matches a previous successful path.
  3. Real-world reuse rates observed:
    • Uniform random tasks: 30–60% of actions can be replayed directly
    • Power-law task distribution (real user behavior, 80/20 rule): 60–85% reuse
    • Replay accuracy: >99%
    • End-to-end speedup: 2–3× on average

This is what genuine “learning from experience” looks like in practice.
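
The real implementation lives in the agent_rr/ directory; the toy sketch below captures only the core record-then-replay idea, reusing the BAAI/bge-small-zh embedding model that the deployment steps later download for experience retrieval. The class name, flat cache structure, and 0.9 similarity threshold are my own illustrative choices, not AgentRR's actual design.

# Toy sketch of record & replay: cache (situation, action) pairs and replay
# an action when a new situation is near-identical to a recorded success.
# The real AgentRR uses a multi-layer experience tree and a latent memory
# model; this flat cache with a cosine threshold is only illustrative.
from sentence_transformers import SentenceTransformer, util

class ExperienceCache:
    def __init__(self, threshold=0.9):
        self.encoder = SentenceTransformer("BAAI/bge-small-zh")
        self.keys, self.actions = [], []
        self.threshold = threshold

    def record(self, situation, action):
        """Store a successful (situation description, action) pair."""
        self.keys.append(self.encoder.encode(situation, convert_to_tensor=True))
        self.actions.append(action)

    def replay(self, situation):
        """Return a cached action if a recorded situation is similar enough."""
        if not self.keys:
            return None
        query = self.encoder.encode(situation, convert_to_tensor=True)
        scores = [float(util.cos_sim(query, k)) for k in self.keys]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.actions[best] if scores[best] >= self.threshold else None

cache = ExperienceCache()
cache.record("Taobao home screen, task: search iPhone", "tap search bar (540, 190)")
action = cache.replay("Taobao home screen, task: search AirPods")
# cache hit -> execute directly; miss -> fall back to full Decider/Grounder inference

A rough back-of-envelope shows why this pays off: if 70% of actions replay at, say, a tenth of the cost of a full inference step, the expected per-action cost drops to 0.3 + 0.7 × 0.1 = 0.37 of the original, roughly a 2.7× speedup, consistent with the reported 2–3×.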

MobiFlow: Finally a Trustworthy Mobile Agent Benchmark

Traditional benchmarks suffer from three fatal flaws:

  • Only one golden trajectory
  • Tested on screenshots or simulators
  • Environment noise (pop-ups, network lag, version differences) makes scores meaningless

MobiFlow fixes all of them:

  • Each task is represented as a Directed Acyclic Graph (DAG) of milestones — multiple correct paths are allowed
  • Supports AND/OR logic between nodes
  • Hierarchical validation: XML → regex → OCR → only fall back to LLM judgment when absolutely necessary
  • Supports offline trace replay to completely eliminate environmental variability

Scores from MobiFlow actually reflect how the agent will perform on your phone.
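
As a sketch of how a milestone DAG can accept multiple correct paths, the snippet below models a hypothetical food-ordering task. Milestone names, the trace format, and the checker are invented for illustration; in MobiFlow a checker would first try cheap XML and regex matches before escalating to OCR or an LLM.

# Hypothetical milestone DAG for an "order food" task. A task passes if the
# final milestone is satisfied; OR logic lives in a milestone's checker list,
# AND logic in its required parents. Names and checkers are illustrative.
from dataclasses import dataclass, field
from typing import Callable, List

Trace = List[str]  # simplified: a trace is a list of observed UI events

@dataclass
class Milestone:
    name: str
    checkers: List[Callable[[Trace], bool]]   # OR: any checker passing is enough
    requires: List["Milestone"] = field(default_factory=list)  # AND: all parents first

    def satisfied(self, trace: Trace) -> bool:
        return (all(p.satisfied(trace) for p in self.requires)
                and any(check(trace) for check in self.checkers))

def xml_contains(needle: str) -> Callable[[Trace], bool]:
    return lambda trace: any(needle in event for event in trace)

open_app  = Milestone("open_meituan", [xml_contains("meituan_home")])
find_dish = Milestone("find_dish",
                      # two correct paths: search directly OR browse a category
                      [xml_contains("search:mala"), xml_contains("category:sichuan")],
                      requires=[open_app])
checkout  = Milestone("checkout", [xml_contains("order_confirmed")], requires=[find_dish])

trace = ["meituan_home", "category:sichuan", "order_confirmed"]
print(checkout.satisfied(trace))  # True: the browsing path also counts as success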

How to Run MobiAgent Yourself (November 2025 Latest Version)

Option 1: Instant Experience — Just Install the Official App (Recommended for most users)

Download link: https://github.com/IPADS-SAI/MobiAgent/releases/tag/v1.0
Install → open → speak or type your request. All models are hosted in the cloud. Zero configuration.

Option 2: Full Local Deployment (Developer Path)

# 1. Create environment
conda create -n mobiagent python=3.10
conda activate mobiagent
pip install -r requirements.txt        # full environment
# or pip install -r requirements_simple.txt if you only need the runner

# 2. Download auxiliary models
# OmniParser (icon & input box detection)
for f in icon_detect/{train_args.yaml,model.pt,model.yaml}; do
    huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights
done

# Embedding model for experience retrieval
huggingface-cli download BAAI/bge-small-zh --local-dir ./utils/experience

# 3. Phone setup (one-time)
# Install ADBKeyboard.apk on your Android phone
# Enable USB debugging → connect via USB cable

# 4. Start three model services with vLLM
vllm serve IPADS-SAI/MobiMind-Decider-7B --port 8000
vllm serve IPADS-SAI/MobiMind-Grounder-3B --port 8001
vllm serve Qwen/Qwen3-4B-Instruct --port 8002
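
# (Optional sanity check, not an official step) each vLLM service
# lists its loaded model at the OpenAI-compatible /v1/models endpoint
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8001/v1/models
curl -s http://localhost:8002/v1/models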

# 5. Write your task list to runner/mobiagent/task.json, with contents like:
[
  {"task": "Search for iPhone 16 Pro 256GB Desert Titanium on Taobao and add to cart"},
  {"task": "Order a Mala Xiang Guo on Meituan, extra beef, no cilantro"}
]

# 6. Launch the agent
python -m runner.mobiagent.mobiagent \
  --service_ip localhost \
  --decider_port 8000 \
  --grounder_port 8001 \
  --planner_port 8002

Watch the phone operate itself and finish tasks just like a human would, usually in under 30 seconds even for complex orders.

What’s New in November 2025?

| Date | Feature | Details |
|---|---|---|
| 2025.11.03 | User Profile & Preference Memory (Mem0 + optional GraphRAG) | Remembers that you love spicy food, always choose SF Express, and prefer dark mode, and automatically personalizes future plans |
| 2025.11.03 | Multi-Task Parallel Execution | Order food, book a hotel, and buy train tickets simultaneously without interference |
| 2025.09.30 | Local Experience Retrieval | Automatically pulls the most similar past experience template based on the task description |
| 2025.09.29 | MobiMind-Mixed-7B released | One model handles both the Decider and Grounder roles |
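
The repository wires preference memory through Mem0 (optionally backed by GraphRAG); the sketch below is only a rough illustration of the idea of injecting stored preferences into the planner prompt, with a naive keyword filter standing in for real retrieval.

# Rough illustration of preference-aware planning (not the project's actual
# Mem0/GraphRAG integration): retrieve stored user preferences and prepend
# the relevant ones to the planner prompt.
USER_PREFERENCES = [
    "loves spicy food",
    "always ships with SF Express",
    "prefers dark mode",
]

def personalize(task, preferences):
    # Naive relevance filter; the real system uses Mem0 retrieval
    # (optionally graph-based) instead of keyword matching.
    relevant = [p for p in preferences
                if any(w in task.lower() for w in p.lower().split())]
    prefix = "Known user preferences: " + "; ".join(relevant) + "\n" if relevant else ""
    return prefix + "Task: " + task

print(personalize("Order a spicy hotpot for dinner", USER_PREFERENCES))
# -> Known user preferences: loves spicy food
#    Task: Order a spicy hotpot for dinner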

Frequently Asked Questions (FAQ)

Does MobiAgent support iOS?

Currently only Android (requires ADB). iOS cannot be directly controlled due to system restrictions.

Do I need an internet connection?

Fully offline if you self-host the models. The official app uses cloud inference and requires internet.

Can I plug in my own large models?

Absolutely. As long as the service follows the OpenAI-compatible API used by vLLM, just change the ports.

How big are the models? Can they run on-device?

Right now the models run on a server/PC; the phone only sends screenshots and executes actions. On-device distilled versions are in development.

How can I contribute tasks or data?

Just open an Issue or PR on GitHub. The team provides complete data collection tools in the collect/ directory.

Why does MobiAgent dramatically outperform GPT-5 on Chinese apps?

All of the training data comes from real human operations on real Chinese phones, augmented with VLM-reconstructed reasoning chains. Chinese mobile GUI data is extremely scarce in the training sets of GPT-5 and Gemini.

Final Thoughts

MobiAgent is not a shiny demo that works only under perfect conditions. It is the first mobile agent framework that is truly ready for real-world use:

  • Models strong enough to beat closed-source giants
  • Acceleration that genuinely learns from repetition
  • A benchmark you can actually trust
  • End-to-end code, models, app, and data pipeline — everything open source

Whether you are building the next super-app assistant, automating accessibility features, writing large-scale UI tests, or simply want to play with the absolute cutting edge of agent technology, MobiAgent is the project you should be looking at right now.

Official repository: https://github.com/IPADS-SAI/MobiAgent
Paper: https://arxiv.org/abs/2509.00531
Model hub: https://huggingface.co/IPADS-SAI
