Goodbye, Complex Scripts: Control Your Android Phone with Just a Sentence
Have you ever been frustrated by these scenarios?
- Needing to repeat the same taps and swipes across multiple test phones?
- Wanting to automate app testing but getting discouraged by complex scripts and steep API learning curves?
- Having to manually collect data from apps, a process that’s both tedious and error-prone?
- Wishing for a smarter tool to record and replay your actions?
Today, I’m introducing an open-source project that can fundamentally change how you interact with Android devices: AI Auto Touch. This isn’t just a remote control; it’s an AI automation platform that truly “listens” to your commands and “sees” your phone screen. You simply instruct it in everyday language, and the AI handles the rest.
Core Summary: What is AI Auto Touch?
AI Auto Touch is an intelligent Android device control platform powered by Tsinghua University’s AutoGLM multimodal large language model. It allows users to control phones directly using natural language (e.g., “Open Xiaohongshu, search for the blogger ‘Tech Enthusiast Ox’”) without writing any automation scripts. The platform integrates low-latency (30-100ms) real-time screen mirroring, multi-device batch management, and a complete REST API, seamlessly combining AI comprehension with device control. This significantly reduces the technical barrier and time cost for automation testing, data collection, and repetitive operational tasks.
The Core Innovation: When AI “Learns” to Operate a Phone
Traditional mobile automation relies on precise coordinate clicking or element-locating scripts, which are time-consuming to write and break easily when app interfaces update. The breakthrough of AI Auto Touch is that it introduces a “brain” that understands both text and images—the AutoGLM multimodal model.
This means you don’t need to tell the AI “click at screen coordinates (320, 450)”. Instead, you can say: “Open Douyin, swipe through 10 videos, and like the ones that contain ‘food’.”
The AI will autonomously complete the following thought and action chain:
- Understand the Command: Parse your natural language to clarify the goal (open Douyin, swipe videos, conditionally like).
- Perceive the Screen: Analyze a real-time screenshot of the phone screen, identifying UI elements like app icons, search bars, video content, and like buttons.
- Plan the Steps: Generate a sequence of specific action instructions (launch app -> swipe -> recognize video content -> judge and tap like); a sketch of such a sequence follows this list.
- Execute and Feedback: Execute the actions via ADB (Android Debug Bridge) and provide real-time feedback on the process and results.
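To make the planning step concrete, here is what such an action sequence might look like for the Douyin command above. The action names and fields are illustrative assumptions, not the project's actual schema:

# A hypothetical action sequence for the Douyin example above.
# The action names and fields are assumptions for illustration only.
planned_actions = [
    {"type": "launch_app", "package": "com.ss.android.ugc.aweme"},
    {"type": "swipe", "direction": "up", "repeat": 10},
    {"type": "analyze_frame", "question": "Does this video show food?"},
    {"type": "tap", "target": "like_button", "if": "previous_answer == 'yes'"},
]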
More Examples of “Speak, Don’t Touch”
- Social Media Recon: “Open Xiaohongshu, search for the blogger ‘Tech Enthusiast Ox’, see what this blogger is about, is he worth following?” The AI will perform the search, analyze the profile content, and provide a summary recommendation.
- Comparison Shopping: “Open Taobao, search for ‘mechanical keyboard’, check the price of the first three products.” The AI automates the entire process of searching, browsing, and information extraction.
- Routine Operation: “Open WeChat, send the message ‘test message’ to ‘File Transfer Assistant’.”
Technical Architecture Deep Dive: How “What You Say Is What You Get” is Achieved
How is this platform built? Its technology stack is clear and modern, divided into three layers: frontend, backend, and underlying services.
Overall Architecture Diagram
┌─────────────────────────────────────────────────┐
│        Web Frontend (React + TypeScript)        │
└─────────────────┬───────────────────────────────┘
                  │ REST API + WebSocket (Real-time)
┌─────────────────▼───────────────────────────────┐
│             FastAPI Backend Service             │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐ │
│  │   Device   │  │     AI     │  │   Screen   │ │
│  │ Management │  │  Service   │  │  Service   │ │
│  └────────────┘  └────────────┘  └────────────┘ │
└─────────────────┬───────────────────────────────┘
                  │
        ┌─────────┼─────────┐
        │         │         │
┌───────▼──┐ ┌────▼───┐ ┌───▼──────┐
│ AutoGLM  │ │  ADB   │ │  scrcpy  │
│ AI Model │ │ Device │ │  Screen  │
│          │ │Control │ │ Mirroring│
└──────────┘ └────────┘ └──────────┘
Backend (FastAPI): Acts as the central hub, coordinating three major services.
- Device Management Service: Scans, connects, and manages multiple Android devices via ADB (a minimal device-scan sketch follows this list).
- AI Service: Calls the AutoGLM model, processes natural language commands and screen images, and returns executable action sequences.
- Screen Service: Provides high-performance screen mirroring streams based on the open-source scrcpy project (from Genymobile). This is key to achieving real-time visuals with under 100 milliseconds of latency at 20-30 FPS.
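To give a feel for what the device-management layer has to do, here is a minimal device-scan sketch built directly on the standard adb CLI; it is an illustration, not the project's actual code:

import subprocess
from typing import List

def scan_devices() -> List[str]:
    """Return the serial numbers of devices that adb reports as ready."""
    output = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    ).stdout
    serials = []
    # The first line is the "List of devices attached" header; every other
    # line looks like "<serial>\t<state>".
    for line in output.splitlines()[1:]:
        parts = line.split()
        if len(parts) == 2 and parts[1] == "device":  # skips "unauthorized"/"offline"
            serials.append(parts[0])
    return serials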
Frontend (React 18 + TypeScript): Provides an intuitive web control interface. Here you can view the device list, send AI commands, and, importantly, see a real-time mirror of the phone screen and control the physical device directly by clicking on the image in the webpage. The frontend receives video streams and real-time logs via WebSocket, ensuring a smooth experience.
A Glimpse at Core Code
- AI Decision Process (Backend, Simplified Example):
async def execute_ai_command(device_id: str, command: str):
    # 1. Capture the current phone screen
    screenshot = await capture_screen(device_id)
    # 2. Send both the command and the screenshot to the AutoGLM model for analysis
    response = await ai_model.analyze(command, screenshot)
    # 3. Parse the action sequence ("click here", "input text", etc.) returned by the AI
    actions = parse_ai_response(response)
    # 4. Execute the actions sequentially via ADB
    for action in actions:
        await execute_action(device_id, action)
- Precise “Click-What-You-See” Coordinate Conversion (Frontend): When you click on the screen mirror in the webpage, the system precisely calculates the physical coordinates on the real phone screen that correspond to that click location, ensuring accurate operation. A sketch of the math follows.
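The mapping itself is simple proportional scaling. Below is a minimal sketch of that math, written in Python to match the other snippets (the real frontend does the equivalent in TypeScript); the function name and the clamping are my own additions:

from typing import Tuple

def to_device_coords(click_x: float, click_y: float,
                     view_w: float, view_h: float,
                     device_w: int, device_h: int) -> Tuple[int, int]:
    """Map a click inside the browser's mirror view to physical screen pixels."""
    # Scale each axis by the ratio between the real screen and the rendered view.
    x = int(click_x * device_w / view_w)
    y = int(click_y * device_h / view_h)
    # Clamp so rounding can never produce an out-of-bounds tap.
    return min(max(x, 0), device_w - 1), min(max(y, 0), device_h - 1)

# A click at (160, 225) on a 360x800 mirror of a 1080x2400 phone lands at (480, 675).
print(to_device_coords(160, 225, 360, 800, 1080, 2400))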
Quick Start Guide: Begin AI Control in 5 Minutes
By now, you’re probably eager to try it yourself. The deployment process is very straightforward.
Environment Preparation
- Computer: Install Python 3.8+ and Node.js 16+.
- Phone: An Android device with “USB Debugging” enabled in the “Developer Options”.
- Base Tools: Install ADB and scrcpy on your computer (for underlying device communication and screen mirroring).
The Key Step: Configuring the AI “Brain”
The power of AI Auto Touch relies on the AutoGLM model. You have two choices:
- Option A (Recommended for Beginners): Use a cloud API service (like Zhipu AI’s BigModel platform). Simply register to get an API Key and fill it into the project configuration file. No local hardware is needed, perfect for quick experimentation.
- Option B (For Performance & Privacy): Deploy the model locally. This requires a computer with an NVIDIA GPU and at least 24GB of VRAM, but offers faster response times.
Launching the Project
- Clone the open-source project to your local machine.
- Start the Backend: Navigate to the backend directory, install the Python dependencies, and run the startup script. The service will run at http://localhost:8001 and provide complete API documentation (an example request is sketched after this list).
- Start the Frontend: Navigate to the frontend directory, install the Node.js dependencies, and start the development server. Access http://localhost:5173 to open the control interface.
- Connect Your Device: Connect your phone via USB, click “Scan Devices” on the webpage, and select your phone to connect.
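Because the backend is a plain FastAPI service, you can also drive it from scripts instead of the web UI. The endpoint path and payload below are hypothetical, shown only to illustrate the idea; consult the auto-generated API docs (for a FastAPI app, typically http://localhost:8001/docs) for the real routes:

import requests

# Hypothetical endpoint and payload, for illustration only -- check the
# backend's generated API documentation for the real routes and schemas.
resp = requests.post(
    "http://localhost:8001/api/ai/execute",
    json={
        "device_id": "emulator-5554",
        "command": "Open WeChat, send the message 'test message' to 'File Transfer Assistant'",
    },
    timeout=120,
)
print(resp.status_code, resp.json())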
Upon successful connection, you’ll immediately see two core functional pages:
- Real-Time Screen Display & Control: Experience a clear phone mirror with a latency of only 30-100ms. You can control the phone directly by clicking on the screen in the webpage, or use the virtual Home, Back, and Volume buttons.
- AI Smart Control: Enter commands in natural language in the input box and watch the AI think through and execute your orders step by step.
Real-World Application Scenarios: More Than Just a “Toy”
This platform was designed to solve real problems and can greatly improve efficiency in several scenarios.
Scenario 1: App Automation Testing
No need to learn the complex syntax of Appium or UIAutomator. Testers can write test cases directly in natural language:
test_cases = [
    "Open the app, check if the home page displays correctly",
    "Click the login button, enter test account credentials, check if login is successful",
    "Navigate to the personal center, check if user information is correct",
]
The AI will execute automatically and report results, making writing test cases as simple as writing a checklist.
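Wiring such a checklist into the platform could look roughly like the snippet below, reusing the simplified execute_ai_command helper shown earlier; how failures actually surface is an assumption on my part:

async def run_test_suite(device_id, cases):
    # Run each natural-language case through the AI helper shown earlier.
    for i, case in enumerate(cases, start=1):
        try:
            await execute_ai_command(device_id, case)
            print(f"[{i}/{len(cases)}] PASSED: {case}")
        except Exception as err:  # how failures are reported is an assumption
            print(f"[{i}/{len(cases)}] FAILED: {case} ({err})")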
Scenario 2: Batch Data Collection
When operations or marketing personnel need to collect competitor information, manual screenshots and notes are no longer necessary.
Command: “Open Taobao, search for 'mechanical keyboard', record the name, price, sales volume, and store name of the first 20 products.”
The AI will automatically operate, scroll, recognize, and extract structured data, which can ultimately be exported to Excel.
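If you prompt the AI to return its findings as structured records, dumping them into a spreadsheet-friendly file takes only a few lines. The record layout and field names below are illustrative assumptions:

import csv

# Hypothetical records extracted by the AI; the field names are illustrative.
products = [
    {"name": "87-key mechanical keyboard", "price": 299.0, "sales": "10k+", "store": "Example Flagship Store"},
]

with open("taobao_keyboards.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "sales", "store"])
    writer.writeheader()
    writer.writerows(products)  # the resulting file opens directly in Excel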
Scenario 3: Multi-Device Batch Operations
Facing a room full of test phones? No need to operate each one individually.
# Send the same command to three devices simultaneously
await batch_control("Open WeChat, send the message 'test message' to 'File Transfer Assistant'")
This is a huge efficiency boost for application compatibility testing, batch installation, or device configuration.
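The snippet above leaves batch_control undefined. A plausible sketch, reusing the per-device helper from earlier and assuming a hard-coded device list, is:

import asyncio

DEVICES = ["SERIAL_A", "SERIAL_B", "SERIAL_C"]  # placeholder serial numbers

async def batch_control(command):
    # Fan the same natural-language command out to every device concurrently.
    await asyncio.gather(
        *(execute_ai_command(device_id, command) for device_id in DEVICES)
    )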
Performance & Optimization: Engineering for a Smooth Experience
To deliver the best experience, the project has been optimized on multiple levels:
- Smart Video Stream Transmission: The screen streaming service dynamically adjusts image quality and frame rate. It automatically reduces quality to ensure smoothness when the network is poor and intelligently skips frames when the screen is static to reduce bandwidth consumption.
- Concurrency Control: Uses a semaphore mechanism to limit the number of high-load requests (like AI inference) processed simultaneously, preventing system overload; see the sketch after this list.
- Caching Strategy: Caches immutable data like device information and screen dimensions to reduce repetitive query overhead.
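As a rough illustration of the concurrency-control idea (not the project's actual code), an asyncio semaphore around the ai_model.analyze call from the earlier snippet could look like this:

import asyncio

# Cap concurrent AI-inference calls; the limit of 2 is an illustrative choice.
AI_SEMAPHORE = asyncio.Semaphore(2)

async def analyze_with_limit(command, screenshot):
    # Extra requests queue up here instead of hitting the model all at once.
    async with AI_SEMAPHORE:
        return await ai_model.analyze(command, screenshot)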
Frequently Asked Questions (FAQ)
Q1: Are there specific browser requirements?
A: Yes. For the best real-time screen experience, use Chrome 94+ or Edge 94+, since these browsers support the modern video decoding technology (the WebCodecs API) that the mirror relies on. Firefox and Safari are not currently supported.
Q2: Why does the video stream stutter after I click a control button?
A: This was a known issue that has been fixed in the latest code. The earlier screenshot-based control mode could cause stuttering. Now, after fully switching to the video stream mode, control commands are sent in a non-blocking manner, avoiding interface freezes. Please ensure you are using the latest version of the project.
Q3: I’ve connected my phone, but the webpage doesn’t detect the device. What should I do?
A: Please follow these troubleshooting steps:
- Run adb devices in your computer’s terminal to see whether the device is listed with a “device” status (not “unauthorized”).
- If it shows “unauthorized”, confirm the “Allow USB debugging?” prompt that appears on your phone screen and check the “Always allow” box.
- Ensure “Developer Options” and “USB Debugging” are enabled on your phone.
Q4: Can I connect wirelessly instead of via USB?
A: Yes. For devices running Android 11 or above, you can complete a one-time pairing authorization via USB first. After that, you can use wireless ADB connection for control, offering more flexibility.
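For reference, the classic USB-bootstrapped flow with the stock adb tool looks roughly like this (Android 11+ also offers a pairing-code flow via adb pair); this is standard ADB usage rather than project-specific code:

import subprocess

PHONE_IP = "192.168.1.23"  # replace with your phone's Wi-Fi IP address

# While the phone is still attached over USB, switch its ADB daemon to TCP mode...
subprocess.run(["adb", "tcpip", "5555"], check=True)
# ...then unplug the cable and connect over the network.
subprocess.run(["adb", "connect", f"{PHONE_IP}:5555"], check=True)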
Q5: The hardware requirements for local AI model deployment are too high. Is there an alternative?
A: Absolutely. The project primarily recommends using cloud API services (like Zhipu AI). This method does not require a powerful local GPU. With just an API Key, you can experience the full AI control functionality, making it ideal for beginners and individual developers.
Future Outlook: Potential Revealed in the Roadmap
The development team has outlined a clear path forward:
- Short-term (1-3 months): Will perfect operation recording and playback functionality, allowing you to record sequences of AI or manual actions and replay them like a video.
- Mid-term (3-6 months): Plans to support more AI models (like GPT-4V) and explore control support for iOS devices.
- Long-term (6-12 months): Aims to progress towards a more intelligent, integrated enterprise-level testing platform with features like visual workflow orchestration and AI-generated test cases.
Conclusion: Open Source Makes Automation Accessible
The AI Auto Touch project combines cutting-edge multimodal AI technology with solid device control engineering, truly lowering the barrier to automation. It’s not just an efficiency tool for developers and testers; it also provides an excellent learning and practical platform for anyone interested in exploring the possibilities of “AI + Automation.”
The project is open-sourced on GitHub under the MIT license, meaning you are free to use, modify, or even use it commercially. Whether you want to apply it directly to solve a current problem or dive into the source code to learn how to integrate large models with hardware control, this project offers immense value.
Project Address: https://github.com/github653224/ai-auto-touch
If you find this project helpful, consider giving it a Star on GitHub or contributing to its development. Progress in the world of technology stems from every act of sharing and collaboration.

