ApkClaw: How to Control Android Devices with Natural Language Using an AI Agent
What is the core question this article answers? How can you turn an old Android phone into a 24/7 AI-powered assistant that autonomously executes tasks just by receiving a chat message from you?
ApkClaw is an AI-driven Android automation application built on a straightforward premise: use natural language to let an LLM Agent control an Android device. You do not need to write scripts, connect data cables, or understand programming. By sending a message through apps like DingTalk, Feishu, QQ, Discord, or Telegram, the AI Agent understands your intent and autonomously executes operations on the phone.
This product is particularly suited for idle, older Android phones. Any spare device running Android 9 or higher becomes, with ApkClaw installed, a smart, always-on assistant. In Chinese tech communities, tools like this are sometimes colloquially referred to as a “phone lobster” (a term for device automation agents that mimic human interaction). As one tester noted, trying such a tool revealed genuine practical value, turning an otherwise unused device into something that handles both daily tasks and money-making opportunities.
How Does ApkClaw’s Architecture Turn a Chat Message into a Physical Screen Tap?
What is the core question this section answers? When you type “help me clock in” in a chat app, what exact path does that text message take to become a physical tap on the phone screen?
ApkClaw’s overall architecture is divided into four distinct layers, flowing from top to bottom: the messaging channel layer, the message routing layer, the task orchestration layer, and the Agent execution layer.
┌───────────────────────────────────────────────────────────────┐
│ Messaging Channels │
│ DingTalk │ Feishu │ QQ │ Discord │ Telegram │ WeChat │
└──────────────────────┬────────────────────────────────────────┘
│ Message received
▼
┌─────────────────┐
│ ChannelManager │ Message routing & distribution
└────────┬────────┘
│
┌────────▼────────┐
│ TaskOrchestrator │ Task lock, lifecycle management
└────────┬────────┘
│
┌────────▼────────┐
│ AgentService │ Agent loop
│ │
│ ┌────────────┐ │
│ │ LLM Call │◄─┼── LangChain4j (OpenAI / Anthropic)
│ └─────┬──────┘ │
│ │ │
│ ┌─────▼──────┐ │
│ │ Tool Exec │◄─┼── ToolRegistry → ClawAccessibilityService
│ └─────┬──────┘ │
│ │ │
│ Loop until │
│ task finished │
└────────┬────────┘
│
▼
Reply to user via channel
Layer 1: Messaging Channels. This is the entry point where users interact with ApkClaw. Currently, it supports DingTalk, Feishu, QQ, Discord, Telegram, and WeChat. Each channel corresponds to an independent protocol and credential system. For instance, DingTalk uses the App Stream Client requiring a Client ID and Client Secret, while Discord uses Gateway WebSocket combined with REST, requiring only a Bot Token. This multi-channel design means you do not need to install a new app to use ApkClaw; you simply send commands from the chat tools you already use daily.
Layer 2: ChannelManager (Message Routing). When any channel receives a message, the ChannelManager intercepts it, performs initial validation (such as checking whether the Accessibility Service is enabled), and passes it downstream.
Layer 3: TaskOrchestrator (Task Orchestration). This layer handles two critical tasks. First, it acquires a task lock—ApkClaw uses a single-task model, meaning it only executes one task at a time to prevent multiple instructions from conflicting on the phone screen. Second, it presses the Home key to reset the device state, ensuring every task starts from a clean desktop rather than being stuck deep inside an application.
Layer 4: AgentService (Agent Loop). This is the brain of the system. It calls the LLM to understand the instruction, decides which tool to invoke, executes the tool, feeds the result back to the LLM, and loops until the task is complete.
Finally, the execution result is routed back through the same channel you used to send the message, allowing you to see the execution process and final result right in your chat window.
Author’s Reflection: The four-layer architecture seems simple, but the “single-task lock + Home key reset” design detail is worth examining. Many automation tools fail not because they “do the wrong thing,” but because “the state gets messy”—a previous task stops at a popup, and the next task starts from an incorrect state. ApkClaw forcefully resets the state at the orchestration layer, moving the complexity of state management out of the Agent layer. This is a highly pragmatic engineering choice.
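The “single-task lock + Home key reset” pattern can be sketched in a few lines. This is an illustrative sketch only; the class and method names (`TaskGate`, `pressHome`) are hypothetical and not ApkClaw’s actual code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of the single-task model: one lock, state reset
// before each task. Names are hypothetical, not ApkClaw's actual classes.
class TaskGate {
    private final AtomicBoolean busy = new AtomicBoolean(false);

    /** Tries to start a task; returns false if another task is already running. */
    boolean tryStart(Runnable pressHome) {
        if (!busy.compareAndSet(false, true)) {
            return false;          // reject a concurrent instruction
        }
        pressHome.run();           // reset the device to a clean desktop state
        return true;
    }

    void finish() {
        busy.set(false);           // release the lock for the next task
    }
}
```

The point of the sketch is that rejection and reset both happen before the Agent ever sees the task, which is exactly why the Agent layer never has to reason about stale screen state.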
How Does the Agent Execute Tasks Step-by-Step?
What is the core question this section answers? After the Agent receives a natural language instruction, what exact mechanism does it use to “see the screen, think of a plan, and take action”?
ApkClaw’s core execution flow can be broken down into five stages:
1. User sends a message: You send a natural language message, such as “open Feishu and clock in,” through any connected channel.
2. Channel validation: ChannelSetup verifies that the Accessibility Service is enabled. If not, the task halts because all subsequent operations depend on it.
3. Task orchestration: TaskOrchestrator acquires the task lock and presses the Home key to reset the phone to the desktop.
4. Agent loop: DefaultAgentService enters the core Agent loop, the most critical part of the system.
5. Result reply: Once the task is finished, the result is replied to you through the same channel.
Inside the Agent loop, the system follows an Observe → Think → Act → Verify protocol. Each loop iteration performs the following:
Building context. The system prompt injects device context information, including the phone’s brand, model, Android version, screen resolution, and the list of all registered tools and safety constraints. This information lets the LLM “know” what kind of device it is operating and what capabilities it has.
Calling the LLM. Via the LangChain4j bridge layer, the system prompt, user message, and tool definitions are sent to the LLM. The LLM does not return a pure text answer; instead, it returns a tool call instruction—such as “invoke the tap tool at coordinates (540, 1200).”
Executing the tool. The tool call is extracted from the LLM’s response. The ToolRegistry locates the corresponding tool implementation, and ultimately, the ClawAccessibilityService (Accessibility Service) physically executes the operation on the device.
Feedback and looping. The tool’s execution result (success/failure, screen changes) is fed back to the LLM. The LLM decides the next step based on this new information. This loop continues until the LLM actively calls the finish tool to indicate the task is complete, or it hits the maximum iteration limit (40 rounds).
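The loop described above can be reduced to a small skeleton. This is a simplified sketch under stated assumptions: the `llm` function stands in for the real LangChain4j call, tool execution is stubbed, and all names (`AgentLoop`, `run`, `execute`) are illustrative rather than ApkClaw’s actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative sketch of the Observe -> Think -> Act -> Verify loop.
// "llm" stands in for the real model call; all names are hypothetical.
class AgentLoop {
    static String run(Function<List<String>, String> llm, int maxIterations) {
        List<String> history = new ArrayList<>();
        for (int i = 0; i < maxIterations; i++) {
            String toolCall = llm.apply(history);    // Think: model picks a tool
            if (toolCall.startsWith("finish:")) {    // model signals completion
                return toolCall.substring("finish:".length());
            }
            String result = execute(toolCall);       // Act: run tool on device
            history.add(toolCall + " -> " + result); // Verify: feed result back
        }
        return "aborted: hit max iteration limit";
    }

    private static String execute(String toolCall) {
        return "ok"; // placeholder for the ClawAccessibilityService execution
    }
}
```

Note that termination is model-driven (the `finish` tool) with the iteration cap as a hard backstop, mirroring the 40-round limit described above.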
Four Critical Mechanisms Inside the Agent Loop
LLM Call Retry. Network requests are not always successful. ApkClaw sets a maximum of 3 retries for LLM calls using an exponential backoff strategy (1 second → 2 seconds → 4 seconds). However, if it encounters a 401 (Unauthorized) or 403 (Forbidden) error, it does not retry. These errors are usually API Key configuration issues that retrying will not fix.
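The retry policy is easy to sketch. This is an illustrative implementation of the described behavior, not ApkClaw’s code: `LlmRetry` and `HttpException` are hypothetical names, and `baseDelayMs` would be 1000 in practice (1s → 2s → 4s).

```java
import java.util.concurrent.Callable;

// Illustrative retry helper: up to 3 retries with exponential backoff,
// no retry on 401/403. Class and exception names are hypothetical.
class LlmRetry {
    static class HttpException extends Exception {
        final int status;
        HttpException(int status) { this.status = status; }
    }

    static <T> T callWithRetry(Callable<T> call, long baseDelayMs) throws Exception {
        int maxRetries = 3;
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (HttpException e) {
                // 401/403 are credential problems; retrying cannot fix them.
                if (e.status == 401 || e.status == 403 || attempt == maxRetries) {
                    throw e;
                }
                Thread.sleep(baseDelayMs << attempt); // 1s, 2s, 4s at baseDelayMs = 1000
            }
        }
    }
}
```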
Dead Loop Detection. An AI executing tasks might fall into repetitive operations—such as continuously tapping the same location. ApkClaw maintains a 4-turn sliding window recording the (screenHash, toolCall) fingerprint for each round. If the fingerprints for 4 consecutive rounds are completely identical, the system injects a system message forcing the Agent to try a different approach, breaking the dead loop.
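The sliding-window fingerprint check might look like the following. This is a minimal sketch of the idea, assuming a simple string fingerprint; the name `DeadLoopDetector` and the exact fingerprint format are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative 4-turn sliding window of (screenHash, toolCall) fingerprints.
// Names are hypothetical; ApkClaw's actual implementation may differ.
class DeadLoopDetector {
    private static final int WINDOW = 4;
    private final Deque<String> window = new ArrayDeque<>();

    /** Records one round; returns true when 4 consecutive rounds are identical. */
    boolean record(int screenHash, String toolCall) {
        String fingerprint = screenHash + "|" + toolCall;
        window.addLast(fingerprint);
        if (window.size() > WINDOW) window.removeFirst();
        if (window.size() < WINDOW) return false;
        String first = window.peekFirst();
        return window.stream().allMatch(f -> f.equals(first));
    }
}
```

When `record` returns true, the orchestrating layer would inject the “try a different approach” system message described above.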
Token Optimization. In multi-turn loops, the get_screen_info tool frequently returns UI hierarchy tree data, which consumes a massive amount of tokens. ApkClaw’s strategy is to replace historical get_screen_info results with placeholders, keeping only the most recent complete result. In long-task scenarios, this significantly reduces token consumption.
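The history-compaction strategy can be sketched as a single pass that keeps only the newest screen dump. This is an illustrative sketch with hypothetical names (`HistoryCompactor`, the placeholder text); the real code operates on chat-message objects, not plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative history compaction: keep only the newest get_screen_info
// result, replace older ones with a short placeholder.
class HistoryCompactor {
    static List<String> compact(List<String> toolResults) {
        int lastScreenInfo = -1;
        for (int i = 0; i < toolResults.size(); i++) {
            if (toolResults.get(i).startsWith("get_screen_info:")) lastScreenInfo = i;
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < toolResults.size(); i++) {
            String r = toolResults.get(i);
            if (r.startsWith("get_screen_info:") && i != lastScreenInfo) {
                out.add("get_screen_info: [omitted earlier UI tree]"); // stub old dumps
            } else {
                out.add(r);
            }
        }
        return out;
    }
}
```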
System Popup Handling. When getRootInActiveWindow() returns null, it means a protected system popup (like a permission confirmation dialog) has been detected. At this point, the Agent can neither read the interface nor inject gestures. It will automatically take a screenshot, send it to the user, and terminate the task for manual handling.
Author’s Reflection: Dead loop detection and token optimization reflect the essential difference between “running an Agent on a physical device” and “running an Agent in a pure text environment.” In a text environment, if tokens run out, you just truncate; if a dead loop occurs, you simply restart. But on a physical device, a dead loop means the phone is operating meaninglessly, and a token explosion means your API bill is burning. These mechanisms are not nice-to-haves; they are necessary thresholds for moving from “it can run” to “it can be used in production.”
What Tools Does the AI Agent Have to Interact with the Screen?
What is the core question this section answers? What exact “hands and eyes” does the ApkClaw AI Agent possess to complete different types of phone operations?
The tool system acts as the “limbs” of ApkClaw. All tools are registered in the ToolRegistry by device type. Each tool inherits from BaseTool, implements the execute(Map<String, Any>): ToolResult method, and provides bilingual (Chinese and English) descriptions with typed parameter declarations.
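In Java terms, the abstraction described above might look like the following sketch. The real `BaseTool`/`ToolResult` definitions in ApkClaw may differ in detail (the project’s signature is Kotlin-style); the `WaitTool` example here is illustrative.

```java
import java.util.Map;

// Illustrative sketch of the tool abstraction described above.
abstract class BaseTool {
    abstract String name();
    abstract String description();        // bilingual in the real project
    abstract ToolResult execute(Map<String, Object> args);
}

class ToolResult {
    final boolean success;
    final String message;
    ToolResult(boolean success, String message) {
        this.success = success;
        this.message = message;
    }
}

// A minimal example tool in the spirit of the `wait` tool.
class WaitTool extends BaseTool {
    String name() { return "wait"; }
    String description() { return "Waits for a specified duration / 等待指定时长"; }
    ToolResult execute(Map<String, Object> args) {
        long ms = ((Number) args.getOrDefault("milliseconds", 0)).longValue();
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new ToolResult(false, "interrupted");
        }
        return new ToolResult(true, "waited " + ms + "ms");
    }
}
```

The ToolRegistry would then map each tool’s `name()` to its instance so the Agent loop can dispatch the LLM’s tool-call instruction by name.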
Tools are divided into two main categories: general tools and phone-specific tools.
General Tools (Available on All Devices)
| Tool | Description | Scenario Example |
|---|---|---|
| `get_screen_info` | Gets the UI hierarchy tree for AI to analyze the current interface | “Looking” at the screen to see what is there at the start of each loop |
| `find_node_info` | Finds elements by text or resource ID | Locating the “Sign In” button on the current page |
| `take_screenshot` | Takes a screenshot of the current screen as a PNG | Capturing the screen to send to the user for confirmation |
| `input_text` | Inputs text into the focused input field | Typing “CreatorName” into the search box |
| `open_app` | Opens an app by name | “Open Douyin” |
| `get_installed_apps` | Gets the list of installed apps | Checking if Feishu is installed on the phone |
| `press_back` | Goes back to the previous page | Navigating back from a sub-page |
| `press_home` | Returns to the desktop | Resetting the device state |
| `open_recent_apps` | Opens the recent tasks list | Switching to another app |
| `expand_notifications` | Expands the notification shade | Checking for new messages |
| `collapse_notifications` | Collapses the notification shade | Closing the notification shade |
| `lock_screen` | Locks the screen | Locking the screen to save battery after a task finishes |
| `wait` | Waits for a specified duration | Waiting for a page to finish loading |
| `repeat_actions` | Repeats a set of actions | Batch liking posts |
| `send_file` | Sends a file to the user via the channel | Sending a screenshot or log file to the user |
| `finish` | Finishes the task and returns a summary | Telling the user “Clock-in completed” |
Phone-Specific Tools
| Tool | Description | Scenario Example |
|---|---|---|
| `tap` | Taps specified coordinates (x, y) | Tapping a specific button on the screen |
| `long_press` | Long presses specified coordinates | Long pressing a message to delete it |
| `swipe` | Swipes from point A to point B | Swiping up to browse the Douyin video feed |
| `click_by_text` | Clicks an element by its visible text | Clicking the “Send” button |
| `click_by_id` | Clicks an element by its resource ID | Clicking com.example.app:id/submit |
| `search_app_in_store` | Searches for an app in the app store | Searching for and installing an app |
Let us look at a concrete example using these tools. When you send the message “Open Douyin, search for a creator’s video, like it, and comment,” the Agent’s actual execution chain looks roughly like this: call open_app to open Douyin → call tap to click the search box → call input_text to type the creator’s name → call tap to click the search result → call get_screen_info to confirm the video is playing → call tap to like the video → call tap to open the comment section → call input_text to type a comment → call tap to hit send → call finish to end the task.
Author’s Reflection: The granularity of tool design is an interesting choice. ApkClaw provides both `tap` (coordinate tap) and `click_by_text` (text tap). Coordinate tapping is more universal but fragile (it breaks if the screen resolution changes), while text tapping is more semantic but relies on the text attributes of UI nodes. Having both allows the LLM to choose flexibly based on the actual situation. This approach of “giving the AI the power to choose rather than making the decision for it” is highly worth referencing in Agent tool design.
How Do You Choose and Configure the LLM Backend?
What is the core question this section answers? Which large language models does ApkClaw support, how do you configure them, and why is the temperature parameter set so low?
ApkClaw implements a pluggable LLM backend through LlmClientFactory, currently supporting two providers:
| Provider | Client Class | Model Builders |
|---|---|---|
| OpenAI Compatible | `OpenAiLlmClient` | `OpenAiChatModel` / `OpenAiStreamingChatModel` |
| Anthropic | `AnthropicLlmClient` | `AnthropicChatModel` / `AnthropicStreamingChatModel` |
“OpenAI Compatible” means it is not limited to the official OpenAI API; any third-party service compatible with the OpenAI interface format can be connected. This leaves ample room for using domestic LLM service providers, which typically offer OpenAI-compatible endpoints.
Both providers support streaming and non-streaming modes. At the HTTP layer, ApkClaw uses a custom OkHttpClientBuilderAdapter based on OkHttp instead of LangChain4j’s default JDK HttpClient, simply because JDK HttpClient has poor compatibility on Android.
Core Configuration Items
Configuration is centralized in AgentConfig:
- `apiKey`: Your API Key, filled in local settings. It is not uploaded to any third-party server.
- `baseUrl`: The LLM endpoint address, defaulting to `https://api.openai.com/v1`. If you use a third-party compatible service, you must change this to the corresponding endpoint.
- `modelName`: The model name, such as `gpt-4o` or `claude-sonnet-4-20250514`, chosen by the user.
- `provider`: Select `OPENAI` (default) or `ANTHROPIC`.
- `temperature`: Defaults to 0.1. This is a very low value, meaning the model output is highly deterministic. For an operation like “tap coordinates (540, 1200),” you need precision and stability, not creativity and divergence.
- `maxIterations`: The maximum loop count, defaulting to 40.
- `streaming`: Whether to enable streaming output, defaulting to off.
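Put together, the configuration items above amount to a small value object. This is an illustrative sketch only: the field names follow the article, but the actual AgentConfig class may declare them differently.

```java
// Illustrative defaults mirroring the configuration items listed above;
// the actual AgentConfig class in ApkClaw may differ.
class AgentConfig {
    String apiKey = "";
    String baseUrl = "https://api.openai.com/v1";
    String modelName = "gpt-4o";
    String provider = "OPENAI";   // or "ANTHROPIC"
    double temperature = 0.1;     // low: deterministic taps, not creative text
    int maxIterations = 40;       // hard cap on the Agent loop
    boolean streaming = false;
}
```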
The LangChain4j Bridge Layer
ApkClaw does not directly use LangChain4j’s @Tool annotation approach to define tools. Instead, it defines its own BaseTool abstraction, then uses LangChain4jToolBridge to convert it into LangChain4j’s ToolSpecification format. Parameter types (string, integer, number, boolean) are mapped to JSON Schema. This bridge design keeps tool definitions entirely under ApkClaw’s control while reusing LangChain4j’s Agent orchestration capabilities.
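The type-to-JSON-Schema mapping at the heart of the bridge can be sketched as follows. This is a simplified, hypothetical illustration (`SchemaBridge` is not a real class; LangChain4j’s `ToolSpecification` builder does this with typed objects rather than raw strings).

```java
import java.util.Map;

// Illustrative parameter-type -> JSON Schema mapping in the spirit of the
// bridge layer; the real LangChain4jToolBridge is more involved.
class SchemaBridge {
    /** params maps parameter name -> JSON type ("string", "integer", "number", "boolean"). */
    static String toJsonSchema(Map<String, String> params) {
        StringBuilder sb = new StringBuilder("{\"type\":\"object\",\"properties\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(e.getKey()).append("\":{\"type\":\"")
              .append(e.getValue()).append("\"}");
        }
        return sb.append("}}").toString();
    }
}
```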
Author’s Reflection: The detail of setting temperature to 0.1 seems inconsequential, but it reflects the fundamentally different requirements for LLMs in “device control” versus “content generation.” When writing articles, you want a higher temperature for creative flair, but when controlling a phone, you want it to tap the exact same spot every time. Many developers run automation tasks with the default temperature of 0.7, resulting in a different execution path for the same instruction every time, making debugging incredibly painful.
Through Which Platforms Can You Send Remote Commands?
What is the core question this section answers? Besides operating directly on the phone, through which chat platforms can you remotely issue commands to the device?
ApkClaw currently supports the following channels, along with their protocols and credential requirements:
| Channel | Protocol | Required Credentials |
|---|---|---|
| DingTalk | App Stream Client | Client ID + Client Secret |
| Feishu | OAPI SDK | App ID + App Secret |
| QQ | Bot API | App ID + App Secret |
| Discord | Gateway WebSocket + REST | Bot Token |
| Telegram | Bot HTTP API | Bot Token |
Each channel’s implementation resides in its own independent handler module under the channel/ directory of the project.
Take the remote clock-in scenario as an example. You leave your Android phone at the office, plugged into a charger and connected to Wi-Fi. One morning you oversleep. Instead of panicking, you simply send a message in Feishu: “Open Feishu and clock in.” The ApkClaw on the phone receives the message via the Feishu channel, and the Agent autonomously completes the entire process of opening the app, finding the clock-in entry, and executing the clock-in, then replies with the result via Feishu.
For social media management, you might send a message in Discord: “Open Douyin, search for a specific creator’s video, like it, and comment.” ApkClaw receives the instruction through the Discord channel and automatically performs the search, like, and comment operations on the phone. Because this automation is based on real, physical phone operations simulating normal user interaction behavior—rather than script injection—it is fundamentally different from botting and far less likely to trigger platform risk control strategies.
Channel credentials can be configured in two ways: manually entered in the phone app’s settings page, or configured via a PC browser through the LAN HTTP server, which will be detailed later.
Author’s Reflection: Multi-channel design might seem like just “connecting a few more SDKs,” but it solves a real pain point: different teams and scenarios use different communication tools. Product teams use Feishu, tech communities use Discord, overseas users use Telegram, and individual users might use QQ. ApkClaw does not force users to migrate to a specific platform; instead, it adopts a “meet the user where they are” philosophy. This product mindset is crucial for utility tools that sit on the boundary between consumer and enterprise use.
How Does the Accessibility Service Work, and What Are Its Limits?
What is the core question this section answers? Through what underlying mechanism does ApkClaw achieve clicks, swipes, and screen reading at the Android system level, and what operations are impossible to perform?
ClawAccessibilityService is the physical execution layer of the entire ApkClaw system. Implemented in Java, it handles four core types of operations:
Gesture operations. Implemented through dispatchGesture() to achieve clicks, swipes, and long presses. This is the underlying implementation for phone-specific tools like tap, swipe, and long_press. When the system “taps coordinates (x, y),” it ultimately triggers a touch event at the corresponding screen position through this method.
Node traversal. Achieved through getRootInActiveWindow() to obtain the current interface’s UI hierarchy tree. This is the data source for get_screen_info and find_node_info. The UI hierarchy tree contains information about all accessibility nodes on the interface—text content, resource IDs, coordinate bounds, clickability, and so on.
Key injection. Achieved through performGlobalAction() to implement system-level key presses like Home, Back, and Recent Apps. The press_back, press_home, and open_recent_apps tools are backed by this method.
Screen capture. Achieved through takeScreenshot() to generate a PNG image of the screen. This feature requires Android 11 or higher.
Known Limitation: System-Protected Windows
Android has a security mechanism called filterTouchesWhenObscured. When certain system-protected windows appear—such as permission request dialogs, typically the permission confirmation dialog from com.android.permissioncontroller—they simultaneously block two things:
- Node tree reading: `getRootInActiveWindow()` returns null, meaning the Agent cannot “see” what is on the screen.
- Gesture injection: `dispatchGesture()` cannot penetrate the protected window to execute operations.
In other words, the Agent is both “blind” and “paralyzed” in this situation. ApkClaw’s handling strategy is straightforward: when it detects that getRootInActiveWindow() returns null, it automatically takes a screenshot, sends it to the user via the channel, and terminates the current task for the user to handle manually.
Author’s Reflection: The system-protected window limitation is not an ApkClaw bug; it is an Android security design. Any automation solution based on the Accessibility Service will encounter this issue. What ApkClaw does well is that it does not pretend the problem does not exist. Instead, it implements clear detection and graceful degradation—capturing a screenshot to notify the user rather than freezing or throwing an incomprehensible error. In real-world usage, this honesty about its capabilities—”knowing what it can and cannot do”—is far more valuable than trying to do everything but failing at all of it.
What Real-World Scenarios Can an Old Phone Handle?
What is the core question this section answers? Once you install ApkClaw on an idle Android phone, what practical, everyday uses does it actually enable?
Below are application scenarios that have been practically verified or naturally derived from the product’s capabilities:
Scenario 1: Remote Clock-In
Leave an Android phone at the office, plugged into a charger. If you oversleep one day or get stuck in traffic, open Feishu or DingTalk on your current phone and send a message: “Open Feishu and clock in.” ApkClaw receives the instruction, automatically opens the Feishu app, finds the clock-in entry, completes the operation, and replies with the result. The value here is crossing the physical distance barrier—the phone is at the office, but you are at home. Traditional automation scripts require a PC to be online, whereas ApkClaw only requires the phone to be online.
Scenario 2: Automated Social Media Interaction
Send the instruction: “Open Douyin, search for a specific creator’s video, like it, and comment.” ApkClaw automatically opens Douyin on the phone, enters search, finds the target video, and executes the like and comment. The unique aspect of this scenario is that because the operations happen on a real phone via normal user interaction paths (clicking, swiping, typing) rather than through APIs or script injection, the behavioral pattern mimics a legitimate user, making it much less likely to trigger platform risk control.
Scenario 3: In-App Ticket Snatching
Many ticketing platforms only support ticket purchases within their mobile apps, with no PC endpoint. ApkClaw runs natively on the phone and operates directly within the app, making it a perfect fit for these scenarios.
Scenario 4: Automated App Testing and Reviewing
If you are a product experience officer needing to perform comprehensive operational tests and screenshot records for a new app, you can let ApkClaw automatically open the app, browse various pages, execute key operations, and save screenshots, drastically reducing repetitive manual work.
Scenario 5: Social Media Content Publishing
Social media operators often need to publish content across multiple platforms like Douyin and Xiaohongshu (RED). ApkClaw can simulate normal phone tapping behavior to complete the open, edit, and publish workflows across various apps.
Scenario 6: Flight and Hotel Planning
When traveling for business or leisure, you often need to search across multiple apps for the best routes, suitable hotels, estimated taxi times, and set alarm reminders. These repetitive, cross-app operations can be handed over to ApkClaw all at once—it can check itineraries in Feishu, book hotels in a travel app, and set reminders in the calendar.
Scenario 7: Remotely Helping Parents Use Their Phones
Many parents struggle with using smartphones to handle personal affairs like social security or pension services. Children can send commands via chat tools to remotely control their parents’ phones to complete these operations, eliminating the need to describe “tap the button on the left” over a phone call.
Author’s Reflection: These seven scenarios span a wide range, but they share a common trait: they are all operations that “can only be done on a phone” or “are more naturally done on a phone.” Traditional PC automation tools (like legacy RPA software) are either completely powerless in these scenarios (because there is no app) or have to take a very roundabout path (through emulators). The fact that ApkClaw runs directly on the phone is its biggest competitive moat. A phone-based agent and a PC-based agent are not substitutes; they are complements.
How to Install and Configure ApkClaw from Scratch
What is the core question this section answers? Starting from zero, what are the exact steps to get ApkClaw running and complete your first task?
Environment Requirements
- Java 17 or higher
- Android Studio (Ladybug or newer recommended)
- Android SDK 36 (compile and target versions), minimum SDK 28 (Android 9)
Building from Source
```shell
# Clone the repository
git clone https://github.com/apkclaw-team/ApkClaw.git
cd ApkClaw

# Debug build
./gradlew assembleDebug

# Release build
./gradlew assembleRelease
```
If you prefer not to compile it yourself, you can directly download the pre-compiled APK and install it on your device.
Installation and Authorization
After installing the APK on a device running Android 9 or higher, open the app. On the home page, enable the following permissions in order:
1. Accessibility Service: The most critical permission; all device interactions depend on it.
2. Notification Permission: Required to receive channel messages.
3. Overlay Permission: Required to display floating UI components like the floating ball.
4. Battery Whitelist: Prevents the system from killing the background service due to power-saving policies.
5. File Access Permission: Required for saving screenshots and sending files.
Configuring the LLM
Navigate to Settings > LLM Config and fill in the following information:
- API Key: Your OpenAI or Anthropic API Key.
- Base URL: The LLM endpoint address. Keep the default `https://api.openai.com/v1` if using the official OpenAI API; modify it to the corresponding endpoint if using a third-party compatible service.
- Model Name: Enter the model name you wish to use, such as `gpt-4o` or `claude-sonnet-4-20250514`.
Configuring a Messaging Channel
Go to the Settings page, select at least one messaging channel, and fill in the corresponding bot credentials. For Telegram, for instance, you only need to fill in a Bot Token.
Sending Your First Command
Once configuration is complete, send a simple message through your configured channel to test it, such as “take a screenshot and send it to me” or “open settings.” If everything is configured correctly, you should receive the Agent’s execution result in your chat window.
How to Use the LAN Configuration Server for Easier Setup
What is the core question this section answers? Is there a more convenient way to configure settings without typing long API Keys and Tokens on a tiny phone screen?
ApkClaw includes a built-in HTTP server based on NanoHTTPD running on port 9527. To enable it, simply turn on the “LAN Config” switch in the settings, then open a PC browser on the same network and navigate to http://<device-ip>:9527.
This server provides the following endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| `/` | GET | Configuration page (a Web UI) |
| `/api/channels` | GET | Read channel credentials (sensitive info masked, showing only last 4 characters) |
| `/api/channels` | POST | Update channel credentials |
| `/api/llm` | GET | Read LLM configuration (sensitive info masked) |
| `/api/llm` | POST | Update LLM configuration |
When returning data via GET requests, all sensitive information (like API Keys and Tokens) is masked to display only the last 4 characters, preventing accidental exposure in the browser.
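The masking rule is simple enough to sketch in a few lines. This is an illustrative helper, not ApkClaw’s actual code; the class name `Mask` is hypothetical.

```java
// Illustrative masking helper: show only the last 4 characters, as the
// config server does for API Keys and Tokens. Name is hypothetical.
class Mask {
    static String last4(String secret) {
        if (secret == null || secret.length() <= 4) return "****";
        return "****" + secret.substring(secret.length() - 4);
    }
}
```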
Additionally, Debug build versions provide an extra /debug.html tool debugging console, allowing you to directly test the execution effects of various tools in the browser, which is highly useful during development and debugging phases.
Author’s Reflection: The LAN configuration server feature is a perfect example of user-centric design thinking. From a technical perspective, adding a few input fields in the phone app is the simplest implementation. But from a user perspective, copying and pasting an API Key dozens of characters long on a phone is a terrible experience. A lightweight HTTP server (NanoHTTPD has a very small footprint) trades for a massive improvement in configuration efficiency. Furthermore, the masking design shows that the developers have basic security awareness—it is not just “if it runs, it’s fine,” but “it needs to be secure when it runs.”
Project Structure and Key Technical Dependencies
What is the core question this section answers? How is the ApkClaw codebase organized, and what key technology stacks power it under the hood?
Project Directory Structure
app/src/main/java/com/apk/claw/android/
├── agent/ # Agent loop, configuration, callbacks
│ ├── langchain/ # LangChain4j bridge layer & OkHttp adapter
│ └── llm/ # LLM clients (OpenAI, Anthropic)
├── base/ # BaseActivity (screen density adaptation)
├── channel/ # Messaging channel handlers
│ ├── dingtalk/
│ ├── feishu/
│ ├── qqbot/
│ ├── discord/
│ └── telegram/
├── floating/ # Floating ball UI management
├── server/ # LAN config & debug HTTP server
├── service/ # Accessibility service, foreground service, keep-alive service
├── tool/ # Tool abstraction layer & registry
│ └── impl/ # Tool implementations (general/phone/TV)
├── ui/ # Activities (splash, home, onboarding, settings)
├── utils/ # KVUtils, XLog, formatting utilities
└── widget/ # Custom UI components
The structure is clean with clear module boundaries. agent/ handles the AI brain, channel/ handles message entry points, tool/ handles the limbs, service/ handles system-level capabilities, and server/ handles auxiliary configuration.
Key Dependencies
AI / Agent Layer
| Dependency | Version | Purpose |
|---|---|---|
| LangChain4j | 1.12.2 | Agent orchestration, tool definition, LLM integration |
Messaging Channel Layer
| Dependency | Version | Purpose |
|---|---|---|
| DingTalk Stream Client | 1.3.12 | DingTalk channel integration |
| Feishu OAPI SDK | 2.5.3 | Feishu channel integration |
Network Layer
| Dependency | Version | Purpose |
|---|---|---|
| OkHttp | 4.12.0 | HTTP client for LLM calls |
| Retrofit | 2.11.0 | REST API client |
| NanoHTTPD | 2.3.1 | LAN configuration and debug HTTP server |
Storage & Utility Layer
| Dependency | Version | Purpose |
|---|---|---|
| MMKV | 2.3.0 | High-performance local key-value storage |
| Gson | 2.13.2 | JSON serialization and deserialization |
| ZXing | 3.5.3 | QR code generation |
| UtilCode | 1.31.1 | General Android utility function library |
UI Layer
| Dependency | Version | Purpose |
|---|---|---|
| Glide | 5.0.5 | Image loading |
| EasyFloat | 2.0.4 | Floating window management |
| MultiType | 4.3.0 | RecyclerView multi-type adapter |
The overall technology stack choices are highly pragmatic: LangChain4j for Agent orchestration saves enormous amounts of wheel-reinventing work; OkHttp replaces JDK HttpClient to solve Android compatibility issues; MMKV replaces SharedPreferences to improve storage performance; NanoHTTPD implements a LAN configuration server at a minimal cost.
Final Reflections and Engineering Insights
What is the core question this section answers? What is the essential value of a “mobile AI Agent” product like ApkClaw, and what are its inherent limitations and future directions?
After dissecting all the technical details above, I want to take a step back and discuss the overall understanding of this category of products.
The essence of ApkClaw is not “a script that can control a phone,” but rather extending the reasoning capabilities of large language models to physical devices. An LLM itself can only output text, but through ApkClaw’s tool system, the LLM’s text output is translated into screen taps and swipes. This “translation layer” seems simple—it is just a tool call mapping—but it bridges the boundary between the digital world and the physical world.
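The "translation layer" idea can be sketched in a few lines of Java. This is an illustrative toy, not ApkClaw's actual code: the `ToolCall` and `ToolDispatcher` names are invented here, and a real implementation would dispatch gestures through Android's AccessibilityService rather than returning strings.

```java
// Hypothetical sketch of the LLM-to-device "translation layer":
// a structured tool call emitted by the LLM is mapped to a concrete gesture.
public class ToolDispatcher {

    /** Minimal stand-in for a parsed LLM tool call. */
    public record ToolCall(String name, int x, int y) {}

    /** Translate the LLM's textual intent into a device-level gesture description. */
    public static String dispatch(ToolCall call) {
        switch (call.name()) {
            case "tap":   return "dispatchGesture: tap at (" + call.x() + ", " + call.y() + ")";
            case "swipe": return "dispatchGesture: swipe from (" + call.x() + ", " + call.y() + ")";
            default:      return "unknown tool: " + call.name();
        }
    }

    public static void main(String[] args) {
        System.out.println(dispatch(new ToolCall("tap", 540, 1200)));
    }
}
```

The point of the sketch is that the mapping itself is trivial; the value lies in the fact that it exists at all, turning text tokens into physical interaction.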
From an engineering perspective, ApkClaw makes the correct choices on several key points:
- The single-task model avoids the chaos of conflicting device states during multi-task concurrency.
- The Home key reset elevates state management from the Agent layer to the orchestration layer.
- Dead loop detection and token optimization make it viable for the system to run for extended periods on a real device.
- The graceful degradation for system popups demonstrates honesty about its own capability boundaries.
- Multi-channel integration lowers the barrier to entry for users.
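The dead loop detection mentioned above is described elsewhere in the article only as "4-turn sliding window fingerprint matching." A minimal sketch of how such a check might work is shown below; the fingerprint format and the "all four identical" trigger condition are assumptions made here for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hedged sketch of a 4-turn sliding-window dead loop detector.
// Each agent turn is reduced to a fingerprint string; if the last
// four fingerprints are identical, the agent is presumed stuck.
public class DeadLoopDetector {
    private static final int WINDOW = 4;
    private final Deque<String> window = new ArrayDeque<>();

    /** Record one turn's tool call; returns true if a dead loop is suspected. */
    public boolean record(String toolName, String args) {
        String fingerprint = toolName + "|" + args;
        if (window.size() == WINDOW) window.pollFirst(); // slide the window
        window.addLast(fingerprint);
        return window.size() == WINDOW
                && window.stream().distinct().count() == 1;
    }

    public static void main(String[] args) {
        DeadLoopDetector d = new DeadLoopDetector();
        for (int i = 0; i < 4; i++) {
            System.out.println("loop suspected: " + d.record("tap", "100,200"));
        }
    }
}
```

The design choice is notable: detection happens outside the LLM, so even a weak model that keeps repeating itself cannot burn tokens indefinitely.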
Of course, the limitations are also clear: system-protected windows cannot be handled automatically; the single-task model limits concurrency; the 40-iteration upper limit might not be enough for extremely complex tasks; and it has a high dependency on the LLM’s reasoning capabilities (if the model is not smart enough, tool calls will fail).
However, judged against its positioning of "turning old phones into smart assistants," ApkClaw has found a precise entry point. It does not require the latest flagship phone, does not require root access, and does not require complex environment configuration. An Android 9 spare phone plus an API Key is all it takes to run. This low barrier to entry is its core differentiating advantage over PC-based automation solutions.
Actionable Summary / Setup Checklist
To get ApkClaw up and running from scratch, follow this exact sequence:
1. Prepare an Android phone running version 9 or higher.
2. Install the ApkClaw APK (compile from source or download a pre-built version).
3. On the home page, enable in order: Accessibility Service, Notification Permission, Overlay Permission, Battery Whitelist, File Access Permission.
4. Go to Settings > LLM Config and fill in your API Key, Base URL, and Model Name.
5. Go to Settings, select at least one messaging channel, and fill in the corresponding credentials.
6. Send your first test message through the configured channel.
7. (Optional) Enable LAN Config in settings, then access http://&lt;device-ip&gt;:9527 via a PC browser for easier configuration.
One-Page Summary
| Dimension | Details |
|---|---|
| Product Positioning | AI-driven Android automation Agent, remotely controlling phones via natural language |
| Minimum System Requirement | Android 9 (SDK 28) |
| Supported LLMs | OpenAI compatible APIs, Anthropic |
| Agent Framework | LangChain4j 1.12.2 |
| Supported Channels | DingTalk, Feishu, QQ, Discord, Telegram |
| Tool Count | 15 general tools + 6 phone-specific tools |
| Max Iterations | 40 rounds |
| LLM Retry Strategy | 3 attempts with exponential backoff (1s→2s→4s); no retry on 401/403 |
| Dead Loop Detection | 4-turn sliding window fingerprint matching |
| LAN Configuration | Port 9527, Web UI + REST API |
| Known Limitations | Cannot automatically handle system-protected windows; single-task model (no concurrency) |
| Open Source License | Apache License 2.0 |
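The retry strategy in the table above (3 attempts, exponential backoff 1s→2s→4s, no retry on 401/403) can be sketched as follows. `LlmRetry` and the `IntSupplier`-based request model are illustrative stand-ins, not ApkClaw's actual API, and the base delay is parameterized so the behavior can be exercised without real one-second sleeps.

```java
import java.util.function.IntSupplier;

// Sketch of the documented retry policy: up to maxAttempts tries,
// exponential backoff between attempts, and an immediate stop on
// authentication errors (401/403), which retrying cannot fix.
public class LlmRetry {
    public static int callWithRetry(IntSupplier request, int maxAttempts, long baseDelayMs) {
        int status = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = request.getAsInt();
            if (status == 200) return status;                  // success
            if (status == 401 || status == 403) return status; // auth errors: never retry
            if (attempt < maxAttempts) {
                try {
                    // exponential backoff: base, 2x base, 4x base, ...
                    Thread.sleep(baseDelayMs << (attempt - 1));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return status;
                }
            }
        }
        return status; // attempts exhausted
    }

    public static void main(String[] args) {
        System.out.println(callWithRetry(() -> 200, 3, 1000));
    }
}
```

Skipping retries on 401/403 is the sensible part: a bad API Key will not become valid on the third attempt, so failing fast surfaces the configuration error to the user immediately.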
Frequently Asked Questions
Does ApkClaw support iOS?
No, it currently only supports devices running Android 9 and above.
Do I have to use the official OpenAI API?
Not necessarily. The OpenAI client is compatible with any service that uses the OpenAI interface format, so you can use third-party compatible providers by changing the Base URL.
Can it execute multiple tasks at the same time?
No. ApkClaw uses a single-task model and can only execute one task at a time. New tasks must wait for the current one to finish.
What happens when a system permission popup appears?
The Agent detects the protected window, automatically takes a screenshot, sends it to you, and terminates the current task. You can manually handle the popup and then send the next instruction.
Is the LAN configuration server secure?
Sensitive information (API Keys, Tokens) returned by GET requests is masked to show only the last 4 characters. It is recommended to use this feature only within a trusted local network.
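The last-4-characters masking rule could look like the following sketch; `SecretMask` is a hypothetical name chosen here, not ApkClaw's actual class.

```java
// Illustrative sketch of masking a secret so that only its last 4
// characters remain visible, as the LAN config server is described to do.
public class SecretMask {
    public static String mask(String secret) {
        // Very short or missing secrets are fully masked to avoid leaking anything.
        if (secret == null || secret.length() <= 4) return "****";
        return "*".repeat(secret.length() - 4)
                + secret.substring(secret.length() - 4);
    }

    public static void main(String[] args) {
        System.out.println(mask("sk-abcdef1234"));
    }
}
```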
Can I use it directly on the phone without configuring a messaging channel?
ApkClaw is designed to receive and execute tasks through messaging channels. Without configuring a channel, it cannot receive instructions.
What is the maximum number of loops it can perform?
The default maximum iteration count is 40 rounds. If the LLM does not call the finish tool within 40 rounds, the task is forcibly terminated.
Why is the temperature defaulted to 0.1?
Device control requires highly deterministic output. A low temperature ensures that the same screen state and instruction produce consistent tool calls, avoiding randomization in the operation path.
