Core question of this article:
What is GELab-Zero, what problems does it solve in real mobile environments, and why does its design matter for the future of GUI-based mobile agents?
This article is a full English rewrite of the selected portions of the original Chinese content. It covers the Background, Capabilities, Application Examples, AndroidDaily Benchmark, and Open Benchmark Results.
All content is strictly derived from the provided source file, translated and adapted for a global technical audience.
No external facts are added.
Introduction
This article focuses on one essential question:
How can a fully local, open-source mobile GUI Agent be designed so that it works reliably across fragmented mobile ecosystems?
GELab-Zero addresses this challenge by offering both:
- A lightweight 4B model capable of running locally on consumer hardware
- A complete engineering infrastructure that manages device control, environment setup, task orchestration, and trajectory recording
Unlike many research-only agent projects, GELab-Zero is designed to run on real mobile devices with fully local inference, granting users complete control over data, privacy, and the entire execution chain.
This article presents a clear, structured explanation of the system’s purpose, capabilities, and benchmark performance—based solely on the original source text.
Why Mobile GUI Agents Matter
Core question:
Why is a dedicated system like GELab-Zero necessary for GUI-based mobile automation?
As AI systems expand into consumer environments, mobile agents are at a turning point. They are shifting from proof-of-concept prototypes toward tools that must handle real tasks across diverse devices.
The mobile ecosystem is highly fragmented:
- Different manufacturers
- Different OS customizations
- Different UI structures
- Highly inconsistent app behaviors
A practical GUI agent must deal with:
- ADB connections
- Permissions and developer options
- Environment dependencies
- Model inference services
- Multi-device scheduling
- Recording and replaying task trajectories
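The first item on that list, managing ADB connections, usually begins with enumerating which devices are attached and ready. Below is a minimal sketch in Python; it assumes the `adb` binary is on `PATH`, and the helper names (`parse_adb_devices`, `list_adb_devices`) are illustrative, not part of GELab-Zero's actual API:

```python
import subprocess

def parse_adb_devices(output: str) -> list[str]:
    """Parse `adb devices` output into a list of ready device serials."""
    serials = []
    for line in output.splitlines()[1:]:          # skip "List of devices attached" header
        parts = line.split("\t")
        if len(parts) == 2 and parts[1] == "device":   # skip "offline"/"unauthorized" entries
            serials.append(parts[0])
    return serials

def list_adb_devices() -> list[str]:
    """Run `adb devices` (adb must be on PATH) and return ready serials."""
    out = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    ).stdout
    return parse_adb_devices(out)
```

Even this small step illustrates the burden the article describes: every downstream feature (permissions, scheduling, recording) depends on reliably knowing which devices are connected and authorized.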
For researchers, this creates a heavy engineering burden that distracts from innovation in agent strategies and model design.
For enterprise adopters, the variability of mobile environments makes it difficult to build stable automation pipelines.
GELab-Zero was created specifically to remove these barriers.
What GELab-Zero Provides
Core question:
What exactly does GELab-Zero include, and how does it help developers and organizations?
According to the source document, GELab-Zero consists of:
1. A complete plug-and-play inference infrastructure
This infrastructure handles all engineering complexity required to run a GUI agent on mobile devices, including:
- Device management
- Dependency installation
- Permission configuration
- A unified deployment pipeline
- Automatic environment setup
- Task replay and trajectory recording
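The last item, task replay and trajectory recording, amounts to persisting every action the agent takes along with the screen it acted on. A minimal sketch of such a record, using action types from the AndroidDaily action table (`CLICK`, `TYPE`, `SLIDE`); the `Step`/`Trajectory` classes are illustrative, not GELab-Zero's actual schema:

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class Step:
    action: str                 # e.g. "CLICK", "TYPE", "SLIDE"
    target: str                 # element identifier or coordinates
    screenshot: str             # path to the screenshot taken before the action
    timestamp: float = field(default_factory=time.time)

@dataclass
class Trajectory:
    task: str
    device: str
    steps: list = field(default_factory=list)

    def record(self, action: str, target: str, screenshot: str) -> None:
        """Append one executed action to the trajectory."""
        self.steps.append(Step(action, target, screenshot))

    def save(self, path: str) -> None:
        """Serialize the full trajectory to JSON for later analysis or replay."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f, ensure_ascii=False, indent=2)
```

Because each step stores the screenshot taken before the action, a saved trajectory can be replayed step by step or reviewed offline.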
2. A locally deployable 4B GUI Agent model
The model is designed for real mobile interaction, offering:
- Lightweight inference on consumer hardware
- Low latency
- Fully local, privacy-preserving execution
Summary of Key Capabilities
GELab-Zero supports several major features:
• Local lightweight inference
Runs the 4B model directly on local hardware, reducing latency and keeping data private.
• One-click task launch
The system manages all required setup steps without manual handling of dependencies.
• Multi-device task distribution
One task can be dispatched to multiple phones.
Each execution generates an interaction trajectory for later analysis or replay.
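Dispatching one task to several phones in parallel, with one trajectory collected per device, can be sketched with a thread pool. This is an illustrative sketch, not GELab-Zero's scheduler; `run_task_on_device` stands in for the real per-device executor:

```python
from concurrent.futures import ThreadPoolExecutor

def run_task_on_device(serial: str, task: str) -> dict:
    """Placeholder for the real executor: drive one device and return its trajectory."""
    return {"device": serial, "task": task, "steps": []}

def dispatch(task: str, serials: list[str]) -> list[dict]:
    """Send the same task to every listed device in parallel and collect
    one trajectory per device, in the same order as `serials`."""
    with ThreadPoolExecutor(max_workers=len(serials) or 1) as pool:
        return list(pool.map(lambda s: run_task_on_device(s, task), serials))
```

A thread pool is sufficient here because each worker spends most of its time waiting on device I/O rather than computing.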
• Multiple agent modes
Including:
- ReAct mode
- Multi-Agent mode
- Scheduled tasks
These modes allow the system to match different real-world workflows.
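The ReAct pattern interleaves reasoning and acting: observe the screen, let the model propose a thought and an action, execute the action, and repeat. A minimal sketch of that control loop (not GELab-Zero's implementation; `observe`, `think_and_act`, and `execute` are caller-supplied stand-ins for the screen capture, model call, and device input layers):

```python
def react_loop(task: str, observe, think_and_act, execute, max_steps: int = 20):
    """Minimal ReAct-style loop: observe, reason, act, and repeat until the
    model emits COMPLETE or the step budget is exhausted."""
    history = []
    for _ in range(max_steps):
        state = observe()                                   # e.g. screenshot + UI tree
        thought, action = think_and_act(task, state, history)
        history.append({"thought": thought, "action": action})
        if action == "COMPLETE":                            # model signals the task is done
            break
        execute(action)                                     # send tap/type/slide to the device
    return history
```

The step budget bounds runaway loops, and the accumulated `history` doubles as the interaction trajectory described earlier.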
Application Demonstrations
Core question:
What can the agent actually do on a real phone?
The document provides several application examples across different categories. These demonstrations reflect real interactions with actual mobile apps.
Below is a structured summary of the tasks shown.
1. Recommendation Tasks
Sci-Fi Movie Recommendations
Task: Find some good recent sci-fi movies.
The agent navigates a content app, performs a search, scans options, and produces suggestions.
Family-Friendly Travel Destinations
Task: Find a place suitable for a weekend trip with children.
The agent identifies relevant categories and filters results suitable for families.
2. Practical Everyday Tasks
Claiming a Meal Subsidy
Task: Claim meal vouchers in a corporate benefits platform.
This involves form filling, navigation across menus, and confirming submission.
Metro Line Status and Navigation
Task:
- Check whether Metro Line 1 is running normally
- Navigate to the nearest entrance for that line
This example demonstrates multi-step reasoning and context-dependent navigation.
3. Complex Tasks
These tasks demonstrate the ability to interpret multi-item, multi-condition instructions.
Bulk Shopping on Ele.me
Task:
Purchase a long list of groceries from the nearest Hema store, including:
- Fruits
- Vegetables
- Prepared foods
- Snacks
- Drinks
This requires multi-step search, item verification, and cart management.
Knowledge Search on Zhihu
Task:
Search for “How to learn financial management,” then view the first answer with over 10,000 likes.
The system must interpret quantitative constraints and UI elements to identify the target content.
Filter-Based Search on Taobao
Task:
Find size-37 white canvas shoes priced under 100 RMB and favorite the first matching item.
This involves navigation, filtering, item verification, and performing a favorite action.
Completing Vocabulary Practice on Baicizhan
Task:
Perform learning tasks in a vocabulary app.
The agent interacts with question-answer screens and navigates between exercises.
AndroidDaily: A Real-Life Benchmark for Mobile Agents
Core question:
How can we objectively evaluate GUI agents in real-world mobile scenarios?
Most existing benchmarks focus on productivity apps such as email or office tools.
However, daily mobile usage revolves around:
- Food delivery
- Ride-hailing
- Messaging
- Shopping
- Local services
- Payments
To reflect this reality, AndroidDaily was built as a multi-dimensional, real-life benchmark covering six core aspects of modern living:
- Eating
- Transportation
- Shopping
- Housing
- Information
- Entertainment
The benchmark uses popular apps representative of each category, ensuring that tasks have real consequences such as purchases, reservations, and navigation.
Static Testing
Core question:
What does the static portion of AndroidDaily measure?
Static testing contains 3,146 actions, each with:
- Task description
- Step-by-step screenshots
- Expected action type
- Expected action values
The agent must predict each action without running a real app.
Action Types Distribution
| Action Type | Count | Description |
|---|---|---|
| CLICK | 1354 | Click an element |
| COMPLETE | 410 | Task completion |
| AWAKE | 528 | Wake an app |
| TYPE | 371 | Enter text |
| INFO | 305 | Query information |
| WAIT | 85 | Wait |
| SLIDE | 93 | Slide gesture |
This structure supports fast, large-scale testing without full device deployments.
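Since each static item annotates both an expected action type and expected action values, a static evaluation reduces to exact-match scoring over predicted actions. A sketch of that idea (not the official AndroidDaily scorer; the dict keys are illustrative):

```python
def score_static(predictions: list[dict], references: list[dict]) -> float:
    """Exact-match accuracy over static-test actions: a prediction counts as
    correct only if both the action type and the action value match the
    annotated reference."""
    correct = sum(
        p["type"] == r["type"] and p["value"] == r["value"]
        for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

Because no app needs to run, scoring thousands of annotated actions like this takes seconds, which is what makes the static split suitable for fast, large-scale comparison.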
Static Testing Accuracy (from source file)
| Model | Accuracy |
|---|---|
| GPT-4o | 0.196 |
| Gemini-2.5-pro-thinking | 0.366 |
| UI-TARS-1.5 | 0.470 |
| GELab-Zero-4B-preview | 0.734 |
The model achieves the highest accuracy among the listed options.
End-to-End Benchmark
Core question:
How does the agent perform in full real-device tasks?
End-to-end testing includes 235 tasks executed on physical devices or simulators.
The agent must complete each task autonomously.
Scenario Distribution
| Scenario | Tasks | Percentage |
|---|---|---|
| Transportation | 78 | 33.19% |
| Shopping | 61 | 25.96% |
| Social Communication | 43 | 18.30% |
| Content Consumption | 37 | 15.74% |
| Local Services | 16 | 6.81% |
Tasks cover actions such as:
- Ride-hailing
- Shopping and payment
- Messaging
- Bookmarking content
- Ordering food
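With a scenario label and a pass/fail outcome per task, end-to-end results can be rolled up into per-scenario success rates. A minimal sketch of that aggregation (illustrative only; the `results` schema is an assumption, not GELab-Zero's actual format):

```python
from collections import defaultdict

def success_by_scenario(results: list[dict]) -> dict[str, float]:
    """Aggregate end-to-end outcomes into a per-scenario success rate.
    Each result is a dict like {"scenario": "Shopping", "success": True}."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["scenario"]] += 1
        wins[r["scenario"]] += int(r["success"])
    return {s: wins[s] / totals[s] for s in totals}
```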
The document highlights that GELab-Zero-4B-preview achieved a 75.86% success rate on AndroidWorld, demonstrating strong real-world capability.
Open Benchmark Performance
Core question:
How does the model compare with other open-source GUI-focused models?
The source document includes a comparison chart showing that:
- GELab-Zero-4B-preview performs strongly across several open benchmarks
- Results are especially competitive in AndroidWorld, which reflects realistic mobile usage
- Its combination of model tuning and engineering infrastructure leads to stable behavior in real tasks
This reinforces its positioning as a practical model for GUI automation rather than a purely experimental system.
Reflections and Practical Insights
What this system teaches us about building practical mobile agents:
1. Engineering matters as much as the model
Even a strong model cannot operate a phone without stable device management, consistent input/output flow, and reliable task orchestration.
2. Real-world tasks require real-world benchmarks
AndroidDaily focuses on everyday apps—food delivery, messaging, payments—rather than artificial interactions.
3. Local execution is increasingly important
Running on local hardware:
- Reduces latency
- Improves privacy
- Gives researchers full control of the inference chain
4. Multi-device support is essential for scaling
Researchers and companies need to test agents across multiple phones and configurations.
5. ReAct and Multi-Agent modes offer flexibility
The system is prepared for different workflows and task complexities.
These reflections come directly from analyzing the behavior of GELab-Zero as described in the source document.
Conclusion
GELab-Zero presents a clear direction for how mobile GUI agents can be built for real use:
- It ships with a locally runnable 4B model
- It includes a complete engineering infrastructure
- It handles device management, task orchestration, and trajectory replay
- It offers both static and end-to-end evaluation datasets
- It demonstrates strong performance across real-world benchmarks
By focusing on practical tasks and delivering a plug-and-play experience, GELab-Zero lowers the barrier for researchers, developers, and organizations who want to explore or deploy mobile AI agents without depending on cloud services or vendor-specific toolkits.
FAQ
1. What types of tasks can GELab-Zero handle?
It can perform recommendations, complete daily service tasks, navigate apps, filter items, execute bulk shopping, and accomplish multi-step interactions demonstrated in the examples.
2. Does the model run locally?
Yes. The 4B model is designed for local deployment on consumer hardware.
3. Why is AndroidDaily important?
It focuses on real-life applications—shopping, ride-hailing, social interactions—making evaluations more representative of everyday mobile use.
4. What makes the engineering infrastructure significant?
It simplifies ADB management, environment setup, task distribution, and trajectory visualization, reducing the overhead for developers.
5. How does the model perform in benchmarks?
According to the source file, it achieves strong results, particularly in AndroidWorld and in static prediction accuracy.
6. Is multi-device execution supported?
Yes. The system can dispatch tasks to multiple phones and record each trajectory.
7. What interaction modes are available?
ReAct mode, Multi-Agent mode, and scheduled task mode are supported.
