Imagine telling your phone, “Open Xiaohongshu and find me some weekend travel ideas,” and watching as it silently unlocks, opens the app, taps the search bar, types the query, and scrolls through the results to show you the perfect guide. This scene, straight out of science fiction, is now a tangible reality thanks to the open-source project AutoGLM-Phone-9B. This article will demystify this intelligent agent framework that can “see” your phone screen and “act” on your behalf. We’ll provide a comprehensive, step-by-step guide from zero to deployment, showing you exactly how to bring this automated phone assistant to life.
In a nutshell: AutoGLM-Phone-9B is a smartphone intelligent assistant framework built on a multimodal large language model. It understands on-screen content through visual-language comprehension and, combined with intelligent planning, translates natural language commands (like “Open Taobao and search for headphones”) into a series of taps, swipes, and other actions. The system controls devices via ADB, supports over 50 mainstream Chinese apps, and includes safety mechanisms like sensitive action confirmation and human-in-the-loop fallback, offering a secure and powerful platform for research and development in mobile automation.
Part 1: The Big Picture – What Is This “Digital Employee” for Your Phone?
The Phone Agent, at its core, aims to liberate humans from repetitive and tedious phone operations. It is not a simple “macro” or “script” but an intelligent system with the capabilities to perceive, reason, and execute.
Its workflow can be summarized as a precise “See-Think-Act” loop (a code sketch follows this list):

- Multimodal Perception: The system captures a real-time screenshot of the phone screen via ADB (Android Debug Bridge) and feeds it into a powerful Vision-Language Model (VLM). The model doesn’t just “see” an image; it understands the screen content the way a human would, identifying buttons, text fields, and which app is currently running.
- Intelligent Planning & Decision-Making: Given a user’s natural language instruction (e.g., “Order me the highest-rated pizza nearby”) and its understanding of the current screen, the system decomposes the task and plans a path, reasoning through the sequence of steps needed: unlock screen -> open food delivery app -> tap search box -> type "pizza" -> sort by rating -> select the top restaurant -> place the order.
- Automated Execution: The planned action sequence (tap [x, y] coordinates, input text, swipe the screen, etc.) is sent to the mobile device as precise ADB commands, driving the app through the entire workflow.
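To make the loop concrete, here is a minimal, hypothetical sketch of the See-Think-Act cycle in Python. The function names and the action format are illustrative stand-ins, not the project’s actual internals; only the `adb` commands themselves are standard Android tooling:

```python
import subprocess
from typing import Any


def capture_screenshot(device_id: str) -> bytes:
    """See: grab the current screen as PNG bytes via ADB."""
    return subprocess.run(
        ["adb", "-s", device_id, "exec-out", "screencap", "-p"],
        capture_output=True, check=True,
    ).stdout


def plan_next_action(instruction: str, screenshot: bytes) -> dict[str, Any]:
    """Think: in the real system this sends the screenshot plus the instruction
    to the VLM service. Stubbed here so the sketch stays self-contained."""
    return {"action": "finish"}


def execute_action(device_id: str, action: dict[str, Any]) -> None:
    """Act: translate a planned action into an ADB input command (Tap shown)."""
    if action["action"] == "Tap":
        x, y = action["coordinates"]
        subprocess.run(
            ["adb", "-s", device_id, "shell", "input", "tap", str(x), str(y)],
            check=True,
        )


def run_task(device_id: str, instruction: str, max_steps: int = 50) -> None:
    for _ in range(max_steps):
        screenshot = capture_screenshot(device_id)           # See
        action = plan_next_action(instruction, screenshot)   # Think
        if action["action"] == "finish":
            break
        execute_action(device_id, action)                    # Act
```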
The value of this framework lies not only in automation but in its generalization capability. Traditional automation scripts must be painstakingly written for each specific app and interface. In contrast, the large model-based Phone Agent can learn to adapt to different app UI layouts and handle novel tasks, demonstrating its potential as a “foundation agent for GUIs.”
⚠️ Crucial Notice: This project (including its model and code) is strictly intended for research and educational purposes only. Any use for illegal data access, system interference, or other unlawful activities is strictly prohibited. Please review the project’s Terms of Use carefully before proceeding.
Part 2: Getting Started – The Complete Environment Setup Guide
Theory is one thing, but practice is everything. To get the Phone Agent up and running, you need to prepare both the “stage” (the runtime environment) and the “actor” (the model service). Don’t worry, the process is methodical. Follow this guide step-by-step, and you’ll succeed.
Step 1: Setting the Foundation (Python & ADB)
1. Python Environment
This is the base for running all the code. Python 3.10 or later is recommended; the quick check below verifies your interpreter.
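A tiny, generic sanity check (not part of the project) that fails fast if your interpreter is too old:

```python
import sys

# The project recommends Python 3.10+; abort early if the interpreter is older.
assert sys.version_info >= (3, 10), f"Python 3.10+ required, found {sys.version}"
print(f"OK: Python {sys.version.split()[0]}")
```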
2. Install and Configure ADB
ADB is the bridge between your computer and the Android device.
- Download: Get the Platform Tools package from the official Android developer site and extract it.
- Configure Environment Variables: Add the path of the extracted directory (e.g., ~/Downloads/platform-tools) to your system’s PATH variable. This lets you run the `adb` command from any terminal location.
- Windows users can refer to detailed third-party setup tutorials.
3. Prepare an Android Device
You will need a phone or emulator running Android 7.0 or above, with two critical switches enabled:
- Developer Options: Go to “Settings > About Phone > Build Number” and tap it 7-10 times in quick succession until you see a confirmation message.
- USB Debugging: Enter the newly revealed “Developer Options” menu, then find and enable “USB Debugging.”
- Connection Test: Connect your phone to the computer with a USB cable and run `adb devices` in the terminal. If you see your device serial number followed by the word `device`, congratulations, the connection is successful! (A scripted version of this check appears below.)
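If you prefer to verify the connection from code, a small sketch like the following parses the `adb devices` output (illustrative only; the helper name is mine):

```python
import subprocess


def list_adb_devices() -> list[str]:
    """Return serials of devices that adb reports as fully authorized ('device' state)."""
    out = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    ).stdout
    devices = []
    for line in out.strip().splitlines()[1:]:  # skip the "List of devices attached" header
        parts = line.split()
        if len(parts) == 2 and parts[1] == "device":
            devices.append(parts[0])
    return devices


print(list_adb_devices() or "No authorized devices found; check the cable and USB debugging.")
```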
4. Install ADB Keyboard (For Text Input)
To allow the Agent to type text on your phone, a special input method is needed.
- Download ADBKeyboard.apk and install it on your phone.
- On your phone, go to “Settings > System > Languages & Input > Virtual Keyboard” and enable ADB Keyboard. (The sketch below shows how text is delivered through it.)
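Under the hood, ADBKeyboard receives text through an Android broadcast intent. The sketch below uses the `ADB_INPUT_TEXT` broadcast documented by the ADBKeyboard project; treat the intent name as an assumption to verify against the APK version you install, and note that ADB Keyboard must be the active input method for this to work:

```python
import shlex
import subprocess


def adb_type_text(text: str, device_id: str | None = None) -> None:
    """Send text to the focused field via the ADBKeyboard broadcast intent.

    shlex.quote keeps multi-word text as a single argument when adb shell
    re-joins its arguments on the device side.
    """
    cmd = ["adb"] + (["-s", device_id] if device_id else [])
    cmd += ["shell", "am", "broadcast", "-a", "ADB_INPUT_TEXT", "--es", "msg", shlex.quote(text)]
    subprocess.run(cmd, check=True)


adb_type_text("weather forecast")
```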
Step 2: Bringing On the “Star” (Deploying the Model Service)
The “brain” of the Phone Agent is the AutoGLM-Phone-9B model. You can download it from:
- 🤗 Hugging Face: zai-org/AutoGLM-Phone-9B
- 🤖 ModelScope: ZhipuAI/AutoGLM-Phone-9B
Next, we need to load and run this model on a high-performance inference engine, turning it into a service callable via API. The official recommendation is to use vLLM.
1. Install Project Dependencies & Inference Engine
```bash
# Clone the project and install base dependencies
git clone https://github.com/zai-org/Open-AutoGLM
cd Open-AutoGLM
pip install -r requirements.txt
pip install -e .

# Install vLLM (choose the appropriate command for your CUDA version)
pip install vllm
```
2. Launch the Model Service
Run the following command to start an OpenAI-API-compatible server. It is crucial to follow the given parameters strictly for the efficient operation of the multimodal model.
```bash
python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name autoglm-phone-9b \
    --allowed-local-media-path / \
    --mm-encoder-tp-mode data \
    --mm_processor_cache_type shm \
    --mm_processor_kwargs '{"max_pixels":5000000}' \
    --max-model-len 25480 \
    --chat-template-content-format string \
    --limit-mm-per-prompt '{"image":10}' \
    --model zai-org/AutoGLM-Phone-9B \
    --port 8000
```
Key Parameter Breakdown:
- `--max-model-len 25480`: Sets the model’s context length to 25,480 tokens to handle long-sequence tasks.
- `--mm_processor_kwargs '{"max_pixels":5000000}'`: Caps the pixels processed per image at 5 million, balancing speed and accuracy.
- `--limit-mm-per-prompt '{"image":10}'`: Allows at most 10 images per prompt.
3. Verify the Service
Once successfully launched, the model API service will be running at http://localhost:8000/v1. You can test it with a simple curl command or the Python snippet below before moving on.
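Since vLLM exposes an OpenAI-compatible API, the `openai` Python client works as a quick smoke test (the `api_key` value is arbitrary for a local vLLM server, which ignores it by default):

```python
from openai import OpenAI

# Point the client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The served model name should appear in the model list.
for model in client.models.list().data:
    print(model.id)  # expected: autoglm-phone-9b
```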
Technical Note: The model architecture is identical to GLM-4.1V-9B-Thinking. For more in-depth deployment details, refer to the GLM-V project repository.
Part 3: Hands-On – Making Your Phone “Move”
With the model service running in the background, let’s interact with the Phone Agent in two ways: via a straightforward command line and a more flexible Python API.
Method 1: Using the Command Line (Quick Experience)
In the project root directory, you can use the main.py script.
```bash
# 1. Interactive Mode: starts a session that waits for your continuous instructions
python main.py --base-url http://localhost:8000/v1 --model "autoglm-phone-9b"

# 2. Single-Task Mode: executes one instruction directly
python main.py --base-url http://localhost:8000/v1 "Open Meituan and search for hotpot restaurants nearby"

# 3. View all supported applications
python main.py --list-apps
```
Method 2: Using the Python API (Integrated Development)
This method allows you to seamlessly embed the Phone Agent’s capabilities into your own applications or workflows.
```python
from phone_agent import PhoneAgent
from phone_agent.model import ModelConfig

# Step 1: Configure the model connection
model_config = ModelConfig(
    base_url="http://localhost:8000/v1",  # Your model service address
    model_name="autoglm-phone-9b",        # Model name
)

# Step 2: Create an agent instance
agent = PhoneAgent(model_config=model_config)

# Step 3: Give it a task!
result = agent.run("Open Taobao and search for wireless headphones")
print(f"Task execution result: {result}")
```
After running the code above, you’ll witness your phone automatically wake up, unlock (if no password is set), find the Taobao icon, tap to open it, navigate to the search page, type “wireless headphones”… The entire process is smooth and natural, as if an invisible finger is doing the work.
Enable Verbose Mode to Peek into the AI’s Thoughts
If you want to understand how the Agent makes each decision, you can enable detailed logging during creation:
```python
from phone_agent.agent import AgentConfig

agent_config = AgentConfig(verbose=True)
agent = PhoneAgent(model_config=model_config, agent_config=agent_config)
```
Once enabled, the console will output information like the following, letting you clearly see the AI’s “chain of thought”:
```
💭 Thinking Process:
---------------------------------
Currently on the home screen. Need to launch the Xiaohongshu app first.
---------------------------------
🎯 Executing Action:
{ "action": "Launch", "app": "Xiaohongshu" }
```
Part 4: Advanced Skills & Deep Configuration
Skill 1: Cutting the Cord – Remote Wireless Debugging
You don’t need to keep your phone tethered with a USB cable; connecting ADB over WiFi gives you more flexible control.
On your phone, enable Wireless Debugging:
- Ensure your phone and computer are on the same WiFi network.
- Go to “Developer Options,” find “Wireless Debugging,” and enable it.
- Note the IP address and port displayed on the screen (e.g., 192.168.1.100:5555).
On your computer, connect:
```bash
adb connect 192.168.1.100:5555
adb devices  # Verify the connection; you should see the device listed
```
Specify the remote device in a task:
```bash
python main.py --device-id 192.168.1.100:5555 --base-url http://localhost:8000/v1 "Open Douyin and watch videos"
```
Skill 2: Customizing Your Agent
The Phone Agent offers rich configuration options to adapt to different scenarios.
1. Configure via Environment Variables
```bash
export PHONE_AGENT_BASE_URL="http://your-server-ip:8000/v1"
export PHONE_AGENT_MODEL="autoglm-phone-9b"
export PHONE_AGENT_MAX_STEPS=50  # Limit a single task to at most 50 steps to prevent infinite loops
```
2. Fine-Tune Model Parameters
```python
model_config = ModelConfig(
    base_url="http://localhost:8000/v1",
    model_name="autoglm-phone-9b",
    max_tokens=3000,        # Maximum tokens generated per model response
    temperature=0.1,        # Lower sampling temperature makes output more deterministic and stable
    frequency_penalty=0.2,  # Penalizes repetition, reducing the probability of repeated words
)
```
3. Handle Sensitive Operations: Custom Callback Functions
This is a key safety mechanism. When the Agent encounters sensitive pages like payments or password entry, it can trigger your custom functions.
```python
def my_confirmation(message: str) -> bool:
    """Sensitive-action confirmation callback, e.g., for tapping a 'Pay' button."""
    answer = input(f"⚠️ About to perform sensitive action: {message}. Proceed? (y/n): ")
    return answer.lower() == "y"


def my_takeover(message: str) -> None:
    """Human-in-the-loop callback, e.g., for encountering a CAPTCHA."""
    print(f"🤖 Agent requests human takeover: {message}")
    input("👤 Please complete the step manually, then press Enter for the Agent to continue…")


# Inject the callback functions into the Agent
agent = PhoneAgent(
    model_config=model_config,
    confirmation_callback=my_confirmation,
    takeover_callback=my_takeover,
)
```
Part 5: Capabilities & Supported Ecosystem
Understanding what a tool can and cannot do is key to using it effectively.
What Operations Are Supported?
The Phone Agent can perform the following core atomic operations, covering the vast majority of interaction scenarios (a dispatch sketch follows the table):
| Operation | Description | Example Scenario |
|---|---|---|
| `Launch` | Launch an app | Open WeChat |
| `Tap` | Tap coordinates | Tap the login button at [300, 500] |
| `Type` | Input text | Type “weather forecast” in the search box |
| `Swipe` | Swipe the screen | Swipe up/down to browse news |
| `Back` | Go back | Exit the current page |
| `Home` | Return to the home screen | Switch to the home screen |
| `Long Press` | Long press | Long press an app icon to enter uninstall mode |
| `Double Tap` | Double tap | Double tap to like |
| `Wait` | Wait for loading | Wait for a page transition |
| `Take_over` | Request human takeover | Encounter a login CAPTCHA |
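To see how such atomic operations typically map onto the device, here is a hypothetical dispatcher built on standard `adb shell input` commands. The real project’s executor may differ; only the ADB commands themselves (`input tap`, `input swipe`, `input keyevent`) are standard Android tooling, and the action dictionary format is my own:

```python
import subprocess


def adb_shell(device_id: str, *args: str) -> None:
    subprocess.run(["adb", "-s", device_id, "shell", *args], check=True)


def execute(device_id: str, op: dict) -> None:
    """Illustrative mapping from atomic operations to adb 'input' commands."""
    match op["action"]:
        case "Tap":
            adb_shell(device_id, "input", "tap", str(op["x"]), str(op["y"]))
        case "Swipe":
            adb_shell(device_id, "input", "swipe",
                      str(op["x1"]), str(op["y1"]), str(op["x2"]), str(op["y2"]), "300")
        case "Long Press":  # a long press is a swipe that stays in place for ~800 ms
            adb_shell(device_id, "input", "swipe",
                      str(op["x"]), str(op["y"]), str(op["x"]), str(op["y"]), "800")
        case "Back":
            adb_shell(device_id, "input", "keyevent", "KEYCODE_BACK")
        case "Home":
            adb_shell(device_id, "input", "keyevent", "KEYCODE_HOME")


execute("192.168.1.100:5555", {"action": "Tap", "x": 300, "y": 500})
```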
Which Apps Are Supported?
The project comes pre-configured to support over 50 mainstream Chinese apps across eight categories including social messaging, e-commerce, lifestyle services, and content/entertainment. You can always view the complete list by running python main.py --list-apps.
| Category | Example Apps |
|---|---|
| Social & Messaging | WeChat, QQ, Weibo |
| E-commerce & Shopping | Taobao, JD.com, Pinduoduo |
| Food & Delivery | Meituan, Ele.me |
| Travel & Transportation | Ctrip, 12306, Didi Chuxing |
| Video & Entertainment | bilibili, Douyin, iQiyi |
| Content & Communities | Xiaohongshu, Zhihu, Douban |
Part 6: Common Questions & Solutions (FAQ)
During actual deployment and use, you might encounter some roadblocks. Here are proven solutions.
Q1: After running adb devices, the list is empty. My device is not found.
- Check the USB cable: Make sure you are using a data-capable USB cable, not a charge-only one.
- Restart the ADB service: Run `adb kill-server`, then `adb start-server`, then `adb devices` in the terminal.
- Check phone authorization: On the first connection, a dialog asking “Allow USB debugging?” pops up on the phone. Tap “Allow.”
Q2: The Agent cannot input text on the phone.
- Confirm installation and enablement: Ensure the ADB Keyboard input method is correctly installed and enabled.
- The Agent handles switching: No manual switching is needed. When input is required, the Agent automatically switches the input method to ADB Keyboard via ADB commands.
Q3: Screenshot fails. The captured screen image is black or blank.
- Security mechanism triggered: This commonly happens with banking or payment apps, or certain system password-entry screens, where Android prohibits screenshots for security reasons. The Phone Agent detects this situation and automatically triggers the `Take_over` callback, requesting human intervention. (A simple self-check sketch follows.)
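If you want to detect this condition in your own tooling, a simple heuristic is to check the mean brightness of the capture. This sketch uses Pillow and is an illustration, not the project’s internal check:

```python
import io
import subprocess

from PIL import Image, ImageStat


def screenshot_is_blank(device_id: str, threshold: float = 2.0) -> bool:
    """Heuristic: a secure (FLAG_SECURE) screen usually captures as near-black pixels."""
    png = subprocess.run(
        ["adb", "-s", device_id, "exec-out", "screencap", "-p"],
        capture_output=True, check=True,
    ).stdout
    img = Image.open(io.BytesIO(png)).convert("L")  # grayscale
    return ImageStat.Stat(img).mean[0] < threshold  # mean brightness near zero => blank
```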
Q4: Remote ADB connection frequently disconnects.
- Network stability: Check the WiFi signal strength and make sure the device and computer are on the same local network with a stable connection.
- Re-enable the port: Some phones disable TCP/IP debugging after a reboot. Reconnect via USB once, run `adb tcpip 5555`, then unplug the cable and reconnect wirelessly (a helper script follows).
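A small helper can automate that recovery sequence whenever the wireless link drops. It assumes the phone is temporarily reachable over USB, and the port 5555 matches the example above:

```python
import subprocess


def reenable_wireless_adb(ip: str, port: int = 5555) -> None:
    """Re-open TCP/IP debugging over USB, then reconnect wirelessly."""
    subprocess.run(["adb", "tcpip", str(port)], check=True)  # requires a USB connection
    input(f"Unplug the USB cable, then press Enter to connect to {ip}:{port}…")
    subprocess.run(["adb", "connect", f"{ip}:{port}"], check=True)
    subprocess.run(["adb", "devices"], check=True)  # verify the device reappears


reenable_wireless_adb("192.168.1.100")
```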
Part 7: The Research Behind It & The Open-Source Spirit
AutoGLM-Phone-9B isn’t built in a vacuum. It stands on solid academic research and embodies the spirit of open source and open science. If you use this project in your research, please consider citing the following papers:
```bibtex
@article{liu2024autoglm,
  title={AutoGLM: Autonomous Foundation Agents for GUIs},
  author={Liu, Xiao and Qin, Bo and Liang, Dongzhu and Dong, Guang and Lai, Hanyu and Zhang, Hanchen and Zhao, Hanlin and Iong, Iat Long and Sun, Jiadai and Wang, Jiaqi and others},
  journal={arXiv preprint arXiv:2411.00820},
  year={2024}
}

@article{xu2025mobilerl,
  title={MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents},
  author={Xu, Yifan and Liu, Xiao and Liu, Xinghan and Fu, Jiaqi and Zhang, Hanchen and Jing, Bohao and Zhang, Shudan and Wang, Yuting and Zhao, Wenyi and Dong, Yuxiao},
  journal={arXiv preprint arXiv:2509.18119},
  year={2025}
}
```
Through this article, we have systematically broken down the entire process of the AutoGLM-Phone-9B project, from its concept and deployment to practical application. It showcases an exciting direction where large models meet embodied intelligence: enabling AI not only to talk but also to “see the screen and operate it.” Whether as a tool for automated testing, an aid for accessibility, or a cutting-edge platform for research on general-purpose agents, this project opens a door full of possibilities. The environment is ready, the code is within reach. It’s time to start your model and give your phone its first command.

