ClickClickClick in Depth: How to Let Any LLM Drive Your Android Phone or Mac Without Writing UI Scripts

What’s the shortest path from a spoken sentence to a working UI automation?
Install ClickClickClick, pick an LLM, type one line—done in under three minutes.


What This Article Answers

  1. What exactly is ClickClickClick and how does it turn words into clicks?
  2. Which real-world tasks (with exact commands) can I copy-paste today?
  3. How do I install, configure, and run my first task on both Android and macOS?
  4. How do I mix and match LLMs so the job finishes fast, accurately, and cheaply?
  5. Where did the author fail the first time, and what one-line fix saved the day?

1. Product Snapshot: One Sentence to Remember

ClickClickClick is a pip-installable framework that feeds a screenshot to any LLM (OpenAI, Claude, Gemini, or local Ollama), asks it where to tap or click, and then uses ADB (Android) or macOS accessibility APIs to actually do it—no XML digging, no element IDs, no code.
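That screenshot-in, coordinates-out loop can be sketched in a few lines. This is a conceptual mock, not the framework's real API: every function name here is hypothetical, standing in for what the planner/finder/executor classes actually do.

```python
# Conceptual sketch of the perceive → plan → act loop (hypothetical names;
# the real framework wires this up via its planner/finder/executor classes).

def take_screenshot() -> bytes:
    """Stand-in for capturing the screen, e.g. via `adb exec-out screencap`."""
    return b"<png bytes>"

def ask_llm_for_coordinates(screenshot: bytes, instruction: str) -> tuple[int, int]:
    """Stand-in for the Finder LLM: given the full screenshot plus a
    natural-language label, return where to tap."""
    return (540, 1200)  # e.g. the centre of a Compose button

def tap(x: int, y: int) -> str:
    """Stand-in for `adb shell input tap X Y`."""
    return f"input tap {x} {y}"

def run_step(instruction: str) -> str:
    shot = take_screenshot()
    x, y = ask_llm_for_coordinates(shot, instruction)
    return tap(x, y)

print(run_step("tap the Compose button"))  # → input tap 540 1200
```

The point of the sketch: nothing in the loop references element IDs or XML trees, which is why the same flow works on any app the LLM can see.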


2. Real Tasks You Can Steal Immediately

Below are five copy-ready commands lifted straight from the framework’s own examples. Each command is followed by the exact screen actions the LLM will perform so you know what to expect.

| # | Task | CLI Command (add `--platform=osx` on Mac) | On-Screen Action Sequence |
|---|------|-------------------------------------------|---------------------------|
| 1 | Draft an email | `click3 run "create a draft email to boss@corp.com subject 'Sick Leave' body 'Need one day off'"` | Open Gmail → tap Compose → type recipient → type subject → type body → leave in Drafts |
| 2 | Check weather | `click3 run "open weather app and tell me if I need an umbrella today"` | Locate weather icon → tap → read precipitation field → reply "Yes/No" |
| 3 | Start a chess game | `click3 run "start a 3+2 game on lichess"` | Open browser → type lichess.org → tap Play → choose 3+2 → tap Start |
| 4 | Find bus stops | `click3 run "open Google Maps and find bus stops in Alanson, MI"` | Open Maps → tap search → enter text → tap transit filter → screenshot results |
| 5 | Quick tax calc | `click3 run "open calculator and compute 15000*0.8-5000"` | Open calculator → press digits and operators → read final number |

Author’s reflection
I originally doubted whether an LLM could reliably find the right icon on a crowded home screen. The surprise: because the Finder model gets the full screenshot plus the natural-language label, it locates the weather icon even when three different weather apps look almost identical. Turning the whole screen into context beats brittle XPath every time.


3. Three-Minute Quick-Start Guide

Core Question Answered

“I just want to see it work—what is the absolute minimum to type?”

Answer: Install → plug in cable → export key → type one click3 run line.
Below are the expanded steps with sanity checks.

3.1 Prerequisites Checklist

| Platform | Item | One-Line Validation |
|----------|------|---------------------|
| Android | USB debugging ON, ADB installed | `adb devices` should print one device ID |
| macOS | Python ≥ 3.11, Accessibility permission granted | `python3 --version` should print 3.11 or higher |
| Either | One API key exported | `echo $OPENAI_API_KEY` (or `$GEMINI_API_KEY`) prints a value |
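If you want the Android check scripted rather than eyeballed, the output of `adb devices` is easy to parse; note it lists unauthorized devices too, which still can't be driven. The helper name below is ours, not part of ClickClickClick.

```python
# Preflight check: parse `adb devices` output before blaming the framework.
import subprocess
import sys

def parse_adb_devices(output: str) -> list[str]:
    """Return serials of devices in the 'device' state (skips 'unauthorized')."""
    devices = []
    for line in output.splitlines()[1:]:          # first line is the header
        parts = line.split()
        if len(parts) == 2 and parts[1] == "device":
            devices.append(parts[0])
    return devices

if __name__ == "__main__":
    out = subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout
    serials = parse_adb_devices(out)
    if not serials:
        sys.exit("No authorized device — check the USB-debugging dialog on the phone.")
    print(f"OK, talking to {serials[0]}")
```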

3.2 Installation

# stable build
pip install git+https://github.com/instavm/clickclickclick.git

# or editable for hacking
git clone https://github.com/instavm/clickclickclick
cd clickclickclick && pip install -e .

# confirm
click3 --help          # should show sub-commands: run, gradio, etc.

3.3 Export One API Key

# pick ONE; Gemini gives 15 free hits/day
export GEMINI_API_KEY="AIz********"
# OR
export OPENAI_API_KEY="sk-********"

3.4 Run the Canonical 30-Second Test

click3 run "open calculator and multiply 25 by 47" --platform=android \
       --planner-model=gemini --finder-model=gemini

You will see live logs: screenshot captured → icons detected → tap coordinates sent → final answer “1175” returned. No code, no element IDs, no root.


4. Four Ways to Drive It: CLI, Python, REST, Gradio

Core Question Answered

“I’m a tester / data scientist / front-end dev / product manager—which interface should I pick?”

Summary:

  • CLI is fastest for ad-hoc jobs.
  • Python API embeds into notebooks and cron.
  • REST API lets any stack call it.
  • Gradio gives non-tech users a clickable web page.

4.1 CLI Deep Dive

Basic shape

click3 run "your sentence" [--platform android|osx] [--planner-model MODEL] [--finder-model MODEL]

Real-world loop

# batch test top-3 Android launchers
for app in "Chrome" "Gmail" "Maps"; do
  click3 run "open $app and go back to home screen" --platform=android \
         --planner-model=gpt-4o-mini --finder-model=gemini
done

4.2 Python API Example

from clickclickclick.config import get_config
from clickclickclick.planner.task import execute_task
from clickclickclick.utils import get_executor, get_planner, get_finder
import schedule  # third-party: pip install schedule
import time

config = get_config("android", "gemini", "gemini")
executor = get_executor("android")
planner = get_planner("gemini", config, executor)
finder = get_finder("gemini", config, executor)

def morning_report():
    execute_task("open weather app and screenshot today's forecast",
                 executor, planner, finder, config)

schedule.every().day.at("07:00").do(morning_report)

while True:
    schedule.run_pending()
    time.sleep(1)

Author’s reflection
I usually avoid while True loops, but here the process sleeps 99.9 % of the time and CPU stays at zero. The bigger lesson: once the heavy lifting (screenshot + LLM) is abstracted, the rest is plain Python—no new syntax to memorize.

4.3 REST API in Two Commands

Start server:

uvicorn clickclickclick.api:app --host 0.0.0.0 --port 8000

Call it:

curl -X POST http://localhost:8000/execute \
  -H "Content-Type: application/json" \
  -d '{"task_prompt":"open settings","platform":"android",
       "planner_model":"gemini","finder_model":"gemini"}'

Return body: {"result": true}
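The same call from Python, using only the standard library. The payload fields mirror the curl example above; everything else (function names, the localhost default) is our own scaffolding.

```python
# Minimal stdlib client for the /execute endpoint shown above.
import json
import urllib.request

def build_payload(task_prompt: str, platform: str = "android",
                  planner_model: str = "gemini",
                  finder_model: str = "gemini") -> dict:
    """Assemble the JSON body with the same fields as the curl example."""
    return {
        "task_prompt": task_prompt,
        "platform": platform,
        "planner_model": planner_model,
        "finder_model": finder_model,
    }

def execute(task_prompt: str, host: str = "http://localhost:8000") -> dict:
    req = urllib.request.Request(
        f"{host}/execute",
        data=json.dumps(build_payload(task_prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:   # expects {"result": true}
        return json.load(resp)

if __name__ == "__main__":
    print(execute("open settings"))
```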

4.4 Gradio Web UI

Launch:

click3 gradio

Browser opens; drop-downs let you pick models, the text-box accepts plain English, and a live screenshot pane shows what the phone is seeing—perfect for demos to management.


5. Choosing Models: Speed, Cost, Privacy Triangle

Core Question Answered

“Which planner + finder combo gives me the best trade-off for everyday tasks?”

Official benchmark plus my own bill at 2025-12 list prices (1 USD ≈ 7.2 CNY):

| Setup | Planner | Finder | Median Time | Cost per 100 Calls | Notes |
|-------|---------|--------|-------------|--------------------|-------|
| Best Overall | GPT-4o | Gemini-1.5-Flash | 8 s | ~3.6 USD | Highest success rate on multi-step flows |
| Budget King | GPT-4o-mini | Gemini-1.5-Flash | 6 s | ~0.9 USD | My daily driver; fails only if screen glare hides an icon |
| Offline | Ollama llama3.2-vision | Ollama | 25 s | 0 USD | Good for NDA work; needs 8 GB VRAM |
| Speed Demon | Gemini-Flash | Gemini-Flash | 5 s | Free (15/day) | Starred in all my live demos |
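The decision logic behind that table fits in one illustrative function. The thresholds simply encode the costs quoted above; the table, not this helper, is the source of truth.

```python
# Illustrative combo picker encoding the trade-off table above.
def pick_combo(offline: bool = False,
               budget_usd_per_100: float = 10.0) -> tuple[str, str]:
    """Return a (planner, finder) pair from two constraints."""
    if offline:
        return ("ollama llama3.2-vision", "ollama")   # 0 USD, needs 8 GB VRAM
    if budget_usd_per_100 < 0.9:
        return ("gemini-flash", "gemini-flash")       # free tier, 15 calls/day
    if budget_usd_per_100 < 3.6:
        return ("gpt-4o-mini", "gemini-1.5-flash")    # ~0.9 USD per 100 calls
    return ("gpt-4o", "gemini-1.5-flash")             # best multi-step accuracy

print(pick_combo(budget_usd_per_100=2.0))  # → ('gpt-4o-mini', 'gemini-1.5-flash')
```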

Author’s reflection
I used to think “bigger model always wins.” Reality: for single-screen tasks (toggle Wi-Fi, screenshot a QR code) GPT-4o-mini equals GPT-4o in accuracy but costs 4× less. The real edge of GPT-4o appears only when the Planner must reason across five or more screens—then the extra few cents pay for themselves by avoiding one failed re-run.


6. Configuration YAML Line-by-Line

Core Question Answered

“What can I tweak if the click lands three pixels off or the scroll overshoots?”

Key snippets from config/models.yaml with commentary:

openai:
  api_key: !ENV OPENAI_API_KEY        # never hard-code secrets
  model_name: gpt-4o-mini
  image_width: 512                    # resize before upload; 384→50 % faster, slightly less accurate
  image_height: 512

gemini:
  api_key: !ENV GEMINI_API_KEY
  model_name: gemini-1.5-flash
  image_width: 768                    # Gemini free tier caps at 768 anyway

executor:
  android:
    screen_center_x: 500              # default tap centre; tweak if device has notch
    screen_center_y: 1000
    scroll_distance: 1000             # pixels per swipe; smaller = more steps, better precision
    swipe_distance: 600
    long_press_duration: 1000         # ms; raise if app mis-reads short taps as swipe

Tuning recipe:

  • Icons too small? Raise image_width to 768 so the Finder keeps more detail.
  • List scrolls past the button? Lower scroll_distance to 400 and retry.
  • Long-press triggers edit mode instead of popup? Increase long_press_duration to 1500.
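That three-bullet recipe is mechanical enough to express as code. The dict below is a plain stand-in for the YAML executor section (same keys and defaults as the snippet); the `tune` function is ours, purely to make the rules explicit.

```python
# The tuning recipe above, expressed as a function over a plain-dict
# stand-in for the YAML config (keys mirror the snippet).
def tune(cfg: dict, *, icons_too_small: bool = False,
         overshoots: bool = False, long_press_misfires: bool = False) -> dict:
    cfg = dict(cfg)  # don't mutate the caller's config
    if icons_too_small:
        cfg["image_width"] = max(cfg.get("image_width", 512), 768)
    if overshoots:
        cfg["scroll_distance"] = min(cfg.get("scroll_distance", 1000), 400)
    if long_press_misfires:
        cfg["long_press_duration"] = max(cfg.get("long_press_duration", 1000), 1500)
    return cfg

base = {"image_width": 512, "scroll_distance": 1000, "long_press_duration": 1000}
print(tune(base, overshoots=True))
# → {'image_width': 512, 'scroll_distance': 400, 'long_press_duration': 1000}
```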

7. Troubleshooting Checklist

Core Question Answered

“I typed the command and nothing happened—what now?”

| Symptom | Likely Cause | Instant Fix |
|---------|--------------|-------------|
| `adb devices` prints nothing | Cable in "charge only" mode | Swipe down the notification shade → tap "File transfer" |
| macOS click does nothing | Terminal lacks Accessibility permission | System Settings → Privacy & Security → Accessibility → add Terminal |
| Gemini 429 quota exceeded | 15 free hits used up | Switch to a secondary key or pay-as-you-go |
| Ollama "model not found" | Forgot to pull the model | `ollama pull llama3.2-vision` |
| Click lands in Narnia | Non-standard screen density | Edit `screen_center_x/y` in the YAML |
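For the "Narnia" row, you can derive correct `screen_center_x/y` values from the device's real resolution instead of guessing. `adb shell wm size` reports a line like `Physical size: 1080x2400`; the parsing helper below is ours.

```python
# Derive screen_center_x/y from the device's actual resolution.
import re
import subprocess

def parse_wm_size(output: str) -> tuple[int, int]:
    """Parse `adb shell wm size` output like 'Physical size: 1080x2400'."""
    m = re.search(r"(\d+)x(\d+)", output)
    if not m:
        raise ValueError(f"unexpected wm size output: {output!r}")
    width, height = int(m.group(1)), int(m.group(2))
    return (width // 2, height // 2)   # values for screen_center_x / _y

if __name__ == "__main__":
    out = subprocess.run(["adb", "shell", "wm", "size"],
                         capture_output=True, text=True).stdout
    print("screen_center_x, screen_center_y =", parse_wm_size(out))
```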

Author’s reflection
My first run failed because I ignored the tiny “allow USB debugging” RSA fingerprint dialog on the phone. Moral: when a step feels too obvious, that’s exactly where the devil sits.


8. Roadmap: What the Maintainers Have Publicly Committed

  • iOS support via WebDriverAgent (signing headache, no ETA)
  • Windows executor using Win32API (high demand, 2026 Q1 target)
  • Voice-command integration (whisper + ClickClickClick)
  • Multi-device orchestration (phone + tablet in same script)
  • Plugin system for custom actions (open door for RPA vendors)

9. Practical Action Checklist

  1. Install ADB → enable USB-debugging → adb devices shows ID
  2. pip install git+https://github.com/instavm/clickclickclick.git
  3. export GEMINI_API_KEY="***" (or OpenAI key)
  4. Run click3 run "open calculator and multiply 25 by 47" → success
  5. Adjust config/models.yaml scroll/center values if clicks are off
  6. Migrate working command into cron or Python schedule
  7. Monitor quota; switch to GPT-4o-mini when free Gemini hits limit

One-page Overview

ClickClickClick turns plain English into UI automation by asking an LLM to look at a screenshot and return coordinates. It works on Android (ADB) and macOS (accessibility API), supports OpenAI, Claude, Gemini, and local Ollama, and is packaged as CLI, Python API, REST server, and Gradio web UI. Installation needs Python 3.11+, ADB for Android, and an API key. Cost ranges from free (Ollama) to ~3.6 USD per 100 calls (GPT-4o + Gemini Flash). Best accuracy on multi-step flows is GPT-4o planner + Gemini finder; cheapest reliable combo is GPT-4o-mini + Gemini Flash. Tune scroll_distance and image_width in YAML to fix missed buttons. Roadmap adds iOS, Windows, and a plugin system.


FAQ

  1. Does it work offline?
    Yes—run local Ollama models; download once with ollama pull llama3.2-vision.

  2. Can it handle Chinese app names?
    The LLM sees the same screenshot you see; Chinese labels work, but pinyin or English labels lower error rate slightly.

  3. Is root or jail-break required?
    No. Android uses normal ADB, macOS uses granted accessibility permission.

  4. How accurate is it?
    On 1–5-step tasks success rate >95%; accuracy drops as scroll depth increases.

  5. What if the click misses by a few pixels?
    Edit screen_center_x/y or reduce scroll_distance for finer control.

  6. Can I chain tasks?
    Today you call execute_task multiple times; future plugin system will expose formal chaining.

  7. Is there a Windows version?
    Not yet; roadmap lists Win32API support next year.