Core question of this article:
What is GELab-Zero, what problems does it solve in real mobile environments, and why does its design matter for the future of GUI-based mobile agents?
This article is a full English rewrite of the selected portions of the original Chinese content. It covers the Background, Capabilities, Application Examples, AndroidDaily Benchmark, and Open Benchmark Results.
All content is strictly derived from the provided source file, translated and adapted for a global technical audience.
No external facts are added.
Introduction
This article focuses on one essential question:
How can a fully local, open-source mobile GUI Agent be designed so that it works reliably across fragmented mobile ecosystems?
GELab-Zero addresses this challenge by offering both:
- A lightweight 4B model capable of running locally on consumer hardware
- A complete engineering infrastructure that manages device control, environment setup, task orchestration, and trajectory recording
Unlike many research-only agent projects, GELab-Zero is designed to run on real mobile devices with fully local inference, granting users complete control over data, privacy, and the entire execution chain.
This article presents a clear, structured explanation of the system’s purpose, capabilities, and benchmark performance—based solely on the original source text.
Why Mobile GUI Agents Matter
Core question:
Why is a dedicated system like GELab-Zero necessary for GUI-based mobile automation?
As AI systems expand into consumer environments, mobile agents are at a turning point. They are shifting from proof-of-concept prototypes toward tools that must handle real tasks across diverse devices.
The mobile ecosystem is highly fragmented:
- Different manufacturers
- Different OS customizations
- Different UI structures
- Highly inconsistent app behaviors
A practical GUI agent must deal with:
- ADB connections
- Permissions and developer options
- Environment dependencies
- Model inference services
- Multi-device scheduling
- Recording and replaying task trajectories
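The first item on that list, managing ADB connections, usually begins with enumerating which devices are attached and ready. Below is a minimal sketch in Python; it assumes the `adb` binary is on `PATH`, and the helper names (`parse_adb_devices`, `list_adb_devices`) are illustrative, not part of GELab-Zero's actual API:

```python
import subprocess

def parse_adb_devices(output: str) -> list[str]:
    """Parse `adb devices` output into a list of ready device serials."""
    serials = []
    for line in output.splitlines()[1:]:          # skip "List of devices attached" header
        parts = line.split("\t")
        if len(parts) == 2 and parts[1] == "device":   # skip "offline"/"unauthorized" entries
            serials.append(parts[0])
    return serials

def list_adb_devices() -> list[str]:
    """Run `adb devices` (adb must be on PATH) and return ready serials."""
    out = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    ).stdout
    return parse_adb_devices(out)
```

Even this small step illustrates the burden the article describes: every downstream feature (permissions, scheduling, recording) depends on reliably knowing which devices are connected and authorized.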
For researchers, this creates a heavy engineering burden that distracts from innovation in agent strategies and model design.
For enterprise adopters, the variability of mobile environments makes it difficult to build stable automation pipelines.
GELab-Zero was created specifically to remove these barriers.
What GELab-Zero Provides
Core question:
What exactly does GELab-Zero include, and how does it help developers and organizations?
According to the source document, GELab-Zero consists of:
1. A complete plug-and-play inference infrastructure
This infrastructure handles all engineering complexity required to run a GUI agent on mobile devices, including:
- Device management
- Dependency installation
- Permission configuration
- A unified deployment pipeline
- Automatic environment setup
- Task replay and trajectory recording
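The last item, task replay and trajectory recording, amounts to persisting every action the agent takes along with the screen it acted on. A minimal sketch of such a record, using action types from the AndroidDaily action table (`CLICK`, `TYPE`, `SLIDE`); the `Step`/`Trajectory` classes are illustrative, not GELab-Zero's actual schema:

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class Step:
    action: str                 # e.g. "CLICK", "TYPE", "SLIDE"
    target: str                 # element identifier or coordinates
    screenshot: str             # path to the screenshot taken before the action
    timestamp: float = field(default_factory=time.time)

@dataclass
class Trajectory:
    task: str
    device: str
    steps: list = field(default_factory=list)

    def record(self, action: str, target: str, screenshot: str) -> None:
        """Append one executed action to the trajectory."""
        self.steps.append(Step(action, target, screenshot))

    def save(self, path: str) -> None:
        """Serialize the full trajectory to JSON for later analysis or replay."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f, ensure_ascii=False, indent=2)
```

Because each step stores the screenshot taken before the action, a saved trajectory can be replayed step by step or reviewed offline.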
2. A locally deployable 4B GUI Agent model
The model is designed for real mobile interaction, offering:
- Lightweight inference on consumer hardware
- Low latency
- Fully local, privacy-preserving execution
Summary of Key Capabilities
GELab-Zero supports several major features:
• Local lightweight inference
Runs the 4B model directly on local hardware, reducing latency and keeping data private.
• One-click task launch
The system manages all required setup steps without manual handling of dependencies.
• Multi-device task distribution
One task can be dispatched to multiple phones.
Each execution generates an interaction trajectory for later analysis or replay.
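Dispatching one task to several phones in parallel, with one trajectory collected per device, can be sketched with a thread pool. This is an illustrative sketch, not GELab-Zero's scheduler; `run_task_on_device` stands in for the real per-device executor:

```python
from concurrent.futures import ThreadPoolExecutor

def run_task_on_device(serial: str, task: str) -> dict:
    """Placeholder for the real executor: drive one device and return its trajectory."""
    return {"device": serial, "task": task, "steps": []}

def dispatch(task: str, serials: list[str]) -> list[dict]:
    """Send the same task to every listed device in parallel and collect
    one trajectory per device, in the same order as `serials`."""
    with ThreadPoolExecutor(max_workers=len(serials) or 1) as pool:
        return list(pool.map(lambda s: run_task_on_device(s, task), serials))
```

A thread pool is sufficient here because each worker spends most of its time waiting on device I/O rather than computing.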
• Multiple agent modes
Including:
- ReAct mode
- Multi-Agent mode
- Scheduled tasks
These modes allow the system to match different real-world workflows.
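The ReAct pattern interleaves reasoning and acting: observe the screen, let the model propose a thought and an action, execute the action, and repeat. A minimal sketch of that control loop (not GELab-Zero's implementation; `observe`, `think_and_act`, and `execute` are caller-supplied stand-ins for the screen capture, model call, and device input layers):

```python
def react_loop(task: str, observe, think_and_act, execute, max_steps: int = 20):
    """Minimal ReAct-style loop: observe, reason, act, and repeat until the
    model emits COMPLETE or the step budget is exhausted."""
    history = []
    for _ in range(max_steps):
        state = observe()                                   # e.g. screenshot + UI tree
        thought, action = think_and_act(task, state, history)
        history.append({"thought": thought, "action": action})
        if action == "COMPLETE":                            # model signals the task is done
            break
        execute(action)                                     # send tap/type/slide to the device
    return history
```

The step budget bounds runaway loops, and the accumulated `history` doubles as the interaction trajectory described earlier.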
Application Demonstrations
Core question:
What can the agent actually do on a real phone?
The document provides several application examples across different categories. These demonstrations reflect real interactions with actual mobile apps.
Below is a structured summary of the tasks shown.
1. Recommendation Tasks
Sci-Fi Movie Recommendations
Task: Find some good recent sci-fi movies.
The agent navigates a content app, performs a search, scans options, and produces suggestions.
Family-Friendly Travel Destinations
Task: Find a place suitable for a weekend trip with children.
The agent identifies relevant categories and filters results suitable for families.
2. Practical Everyday Tasks
Claiming a Meal Subsidy
Task: Claim meal vouchers in a corporate benefits platform.
This involves form filling, navigation across menus, and confirming submission.
Metro Line Status and Navigation
Task:
- Check whether Metro Line 1 is running normally
- Navigate to the nearest entrance for that line
This example demonstrates multi-step reasoning and context-dependent navigation.
3. Complex Tasks
These tasks demonstrate the ability to interpret multi-item, multi-condition instructions.
Bulk Shopping on Ele.me
Task:
Purchase a long list of groceries from the nearest Hema store, including:
- Fruits
- Vegetables
- Prepared foods
- Snacks
- Drinks
This requires multi-step search, item verification, and cart management.
Knowledge Search on Zhihu
Task:
Search for “How to learn financial management,” then view the first answer with over 10,000 likes.
The system must interpret quantitative constraints and UI elements to identify the target content.
Filter-Based Search on Taobao
Task:
Find size-37 white canvas shoes priced under 100 RMB and favorite the first matching item.
This involves navigation, filtering, item verification, and performing a favorite action.
Completing Vocabulary Practice on Baicizhan
Task:
Perform learning tasks in a vocabulary app.
The agent interacts with question-answer screens and navigates between exercises.
AndroidDaily: A Real-Life Benchmark for Mobile Agents
Core question:
How can we objectively evaluate GUI agents in real-world mobile scenarios?
Most existing benchmarks focus on productivity apps such as email or office tools.
However, daily mobile usage revolves around:
- Food delivery
- Ride-hailing
- Messaging
- Shopping
- Local services
- Payments
To reflect this reality, AndroidDaily was built as a multi-dimensional, real-life benchmark covering six core aspects of modern living:
- Eating
- Transportation
- Shopping
- Housing
- Information
- Entertainment
The benchmark uses popular apps representative of each category, ensuring that tasks have real consequences such as purchases, reservations, and navigation.
Static Testing
Core question:
What does the static portion of AndroidDaily measure?
Static testing contains 3,146 actions, each with:
- Task description
- Step-by-step screenshots
- Expected action type
- Expected action values
The agent must predict each action without running a real app.
Action Types Distribution
| Action Type | Count | Description |
|---|---|---|
| CLICK | 1354 | Click an element |
| COMPLETE | 410 | Task completion |
| AWAKE | 528 | Wake an app |
| TYPE | 371 | Enter text |
| INFO | 305 | Query information |
| WAIT | 85 | Wait |
| SLIDE | 93 | Slide gesture |
This structure supports fast, large-scale testing without full device deployments.
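Since each static item annotates both an expected action type and expected action values, a static evaluation reduces to exact-match scoring over predicted actions. A sketch of that idea (not the official AndroidDaily scorer; the dict keys are illustrative):

```python
def score_static(predictions: list[dict], references: list[dict]) -> float:
    """Exact-match accuracy over static-test actions: a prediction counts as
    correct only if both the action type and the action value match the
    annotated reference."""
    correct = sum(
        p["type"] == r["type"] and p["value"] == r["value"]
        for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

Because no app needs to run, scoring thousands of annotated actions like this takes seconds, which is what makes the static split suitable for fast, large-scale comparison.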
Static Testing Accuracy (from source file)
| Model | Accuracy |
|---|---|
| GPT-4o | 0.196 |
| Gemini-2.5-pro-thinking | 0.366 |
| UI-TARS-1.5 | 0.470 |
| GELab-Zero-4B-preview | 0.734 |
The model achieves the highest accuracy among the listed options.
End-to-End Benchmark
Core question:
How does the agent perform in full real-device tasks?
End-to-end testing includes 235 tasks executed on physical devices or simulators.
The agent must complete each task autonomously.
Scenario Distribution
| Scenario | Tasks | Percentage |
|---|---|---|
| Transportation | 78 | 33.19% |
| Shopping | 61 | 25.96% |
| Social Communication | 43 | 18.30% |
| Content Consumption | 37 | 15.74% |
| Local Services | 16 | 6.81% |
Tasks cover actions such as:
- Ride-hailing
- Shopping and payment
- Messaging
- Bookmarking content
- Ordering food
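With a scenario label and a pass/fail outcome per task, end-to-end results can be rolled up into per-scenario success rates. A minimal sketch of that aggregation (illustrative only; the `results` schema is an assumption, not GELab-Zero's actual format):

```python
from collections import defaultdict

def success_by_scenario(results: list[dict]) -> dict[str, float]:
    """Aggregate end-to-end outcomes into a per-scenario success rate.
    Each result is a dict like {"scenario": "Shopping", "success": True}."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["scenario"]] += 1
        wins[r["scenario"]] += int(r["success"])
    return {s: wins[s] / totals[s] for s in totals}
```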
The document highlights that GELab-Zero-4B-preview achieved a 75.86% success rate on AndroidWorld, demonstrating strong real-world capability.
Open Benchmark Performance
Core question:
How does the model compare with other open-source GUI-focused models?
The source document includes a comparison chart showing that:
- GELab-Zero-4B-preview performs strongly across several open benchmarks
- Results are especially competitive in AndroidWorld, which reflects realistic mobile usage
- Its combination of model tuning and engineering infrastructure leads to stable behavior in real tasks
This reinforces its positioning as a practical model for GUI automation rather than a purely experimental system.
Reflections and Practical Insights
What this system teaches us about building practical mobile agents:
1. Engineering matters as much as the model
Even a strong model cannot operate a phone without stable device management, consistent input/output flow, and reliable task orchestration.
2. Real-world tasks require real-world benchmarks
AndroidDaily focuses on everyday apps—food delivery, messaging, payments—rather than artificial interactions.
3. Local execution is increasingly important
Running on local hardware:
- Reduces latency
- Improves privacy
- Gives researchers full control of the inference chain
4. Multi-device support is essential for scaling
Researchers and companies need to test agents across multiple phones and configurations.
5. ReAct and Multi-Agent modes offer flexibility
The system is prepared for different workflows and task complexities.
These reflections come directly from analyzing the behavior of GELab-Zero as described in the source document.
Conclusion
GELab-Zero presents a clear direction for how mobile GUI agents can be built for real use:
- It ships with a locally runnable 4B model
- It includes a complete engineering infrastructure
- It handles device management, task orchestration, and trajectory replay
- It offers both static and end-to-end evaluation datasets
- It demonstrates strong performance across real-world benchmarks
By focusing on practical tasks and delivering a plug-and-play experience, GELab-Zero lowers the barrier for researchers, developers, and organizations who want to explore or deploy mobile AI agents without depending on cloud services or vendor-specific toolkits.
FAQ
1. What types of tasks can GELab-Zero handle?
It can perform recommendations, complete daily service tasks, navigate apps, filter items, execute bulk shopping, and accomplish multi-step interactions demonstrated in the examples.
2. Does the model run locally?
Yes. The 4B model is designed for local deployment on consumer hardware.
3. Why is AndroidDaily important?
It focuses on real-life applications—shopping, ride-hailing, social interactions—making evaluations more representative of everyday mobile use.
4. What makes the engineering infrastructure significant?
It simplifies ADB management, environment setup, task distribution, and trajectory visualization, reducing the overhead for developers.
5. How does the model perform in benchmarks?
According to the source file, it achieves strong results, particularly in AndroidWorld and in static prediction accuracy.
6. Is multi-device execution supported?
Yes. The system can dispatch tasks to multiple phones and record each trajectory.
7. What interaction modes are available?
ReAct mode, Multi-Agent mode, and scheduled task mode are supported.
