ClickClickClick in Depth: How to Let Any LLM Drive Your Android Phone or Mac Without Writing UI Scripts

What’s the shortest path from a spoken sentence to a working UI automation?
Install ClickClickClick, pick an LLM, type one line—done in under three minutes.


What This Article Answers

  1. What exactly is ClickClickClick and how does it turn words into clicks?
  2. Which real-world tasks (with exact commands) can I copy-paste today?
  3. How do I install, configure, and run my first task on both Android and macOS?
  4. How do I mix and match LLMs so the job finishes fast, accurately, and cheaply?
  5. Where did the author fail the first time, and what one-line fix saved the day?

1. Product Snapshot: One Sentence to Remember

ClickClickClick is a pip-installable framework that feeds a screenshot to any LLM (OpenAI, Claude, Gemini, or local Ollama), asks it where to tap or click, and then uses ADB (Android) or macOS accessibility APIs to actually do it—no XML digging, no element IDs, no code.
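That screenshot-in, coordinates-out loop can be sketched in a few lines. This is a conceptual mock, not the framework's real API: every function name here is hypothetical, standing in for what the planner/finder/executor classes actually do.

```python
# Conceptual sketch of the perceive → plan → act loop (hypothetical names;
# the real framework wires this up via its planner/finder/executor classes).

def take_screenshot() -> bytes:
    """Stand-in for capturing the screen, e.g. via `adb exec-out screencap`."""
    return b"<png bytes>"

def ask_llm_for_coordinates(screenshot: bytes, instruction: str) -> tuple[int, int]:
    """Stand-in for the Finder LLM: given the full screenshot plus a
    natural-language label, return where to tap."""
    return (540, 1200)  # e.g. the centre of a Compose button

def tap(x: int, y: int) -> str:
    """Stand-in for `adb shell input tap X Y`."""
    return f"input tap {x} {y}"

def run_step(instruction: str) -> str:
    shot = take_screenshot()
    x, y = ask_llm_for_coordinates(shot, instruction)
    return tap(x, y)

print(run_step("tap the Compose button"))  # → input tap 540 1200
```

The point of the sketch: nothing in the loop references element IDs or XML trees, which is why the same flow works on any app the LLM can see.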


2. Real Tasks You Can Steal Immediately

Below are five copy-ready commands lifted straight from the framework’s own examples. Each command is followed by the exact screen actions the LLM will perform so you know what to expect.

| # | Task | CLI Command (add `--platform=osx` on Mac) | On-Screen Action Sequence |
|---|------|-------------------------------------------|---------------------------|
| 1 | Draft an email | `click3 run "create a draft email to boss@corp.com subject 'Sick Leave' body 'Need one day off'"` | Open Gmail → tap Compose → type recipient → type subject → type body → leave in Drafts |
| 2 | Check weather | `click3 run "open weather app and tell me if I need an umbrella today"` | Locate weather icon → tap → read precipitation field → reply "Yes/No" |
| 3 | Start a chess game | `click3 run "start a 3+2 game on lichess"` | Open browser → type lichess.org → tap Play → choose 3+2 → tap Start |
| 4 | Find bus stops | `click3 run "open Google Maps and find bus stops in Alanson, MI"` | Open Maps → tap search → enter text → tap transit filter → screenshot results |
| 5 | Quick tax calc | `click3 run "open calculator and compute 15000*0.8-5000"` | Open calculator → press digits and operators → read final number |

Author’s reflection
I originally doubted whether an LLM could reliably find the right icon on a crowded home screen. The surprise: because the Finder model gets the full screenshot plus the natural-language label, it locates the weather icon even when three different weather apps look almost identical. Turning the whole screen into context beats brittle XPath every time.


3. Three-Minute Quick-Start Guide

Core Question Answered

“I just want to see it work—what is the absolute minimum to type?”

Answer: Install → plug in cable → export key → type one click3 run line.
Below are the expanded steps with sanity checks.

3.1 Prerequisites Checklist

| Platform | Item | One-Line Validation |
|----------|------|---------------------|
| Android | USB debugging ON, ADB installed | `adb devices` should print one device ID |
| macOS | Python ≥ 3.11, Accessibility permission granted | `python3 --version` should print 3.11 or higher |
| Either | One API key exported | `echo $OPENAI_API_KEY` (or `$GEMINI_API_KEY`) prints a value |
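If you want the Android check scripted rather than eyeballed, the output of `adb devices` is easy to parse; note it lists unauthorized devices too, which still can't be driven. The helper name below is ours, not part of ClickClickClick.

```python
# Preflight check: parse `adb devices` output before blaming the framework.
import subprocess
import sys

def parse_adb_devices(output: str) -> list[str]:
    """Return serials of devices in the 'device' state (skips 'unauthorized')."""
    devices = []
    for line in output.splitlines()[1:]:          # first line is the header
        parts = line.split()
        if len(parts) == 2 and parts[1] == "device":
            devices.append(parts[0])
    return devices

if __name__ == "__main__":
    out = subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout
    serials = parse_adb_devices(out)
    if not serials:
        sys.exit("No authorized device — check the USB-debugging dialog on the phone.")
    print(f"OK, talking to {serials[0]}")
```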

3.2 Installation

# stable build
pip install git+https://github.com/instavm/clickclickclick.git

# or editable for hacking
git clone https://github.com/instavm/clickclickclick
cd clickclickclick && pip install -e .

# confirm
click3 --help          # should show sub-commands: run, gradio, etc.

3.3 Export One API Key

# pick ONE; Gemini gives 15 free hits/day
export GEMINI_API_KEY="AIz********"
# OR
export OPENAI_API_KEY="sk-********"

3.4 Run the Canonical 30-Second Test

click3 run "open calculator and multiply 25 by 47" --platform=android \
       --planner-model=gemini --finder-model=gemini

You will see live logs: screenshot captured → icons detected → tap coordinates sent → final answer “1175” returned. No code, no element IDs, no root.


4. Four Ways to Drive It: CLI, Python, REST, Gradio

Core Question Answered

“I’m a tester / data scientist / front-end dev / product manager—which interface should I pick?”

Summary:

  • CLI is fastest for ad-hoc jobs.
  • Python API embeds into notebooks and cron.
  • REST API lets any stack call it.
  • Gradio gives non-tech users a clickable web page.

4.1 CLI Deep Dive

Basic shape

click3 run "your sentence" [--platform android|osx] [--planner-model MODEL] [--finder-model MODEL]

Real-world loop

# batch test top-3 Android launchers
for app in "Chrome" "Gmail" "Maps"; do
  click3 run "open $app and go back to home screen" --platform=android \
         --planner-model=gpt-4o-mini --finder-model=gemini
done

4.2 Python API Example

from clickclickclick.config import get_config
from clickclickclick.planner.task import execute_task
from clickclickclick.utils import get_executor, get_planner, get_finder
import schedule  # third-party: pip install schedule
import time

config = get_config("android", "gemini", "gemini")
executor = get_executor("android")
planner = get_planner("gemini", config, executor)
finder = get_finder("gemini", config, executor)

def morning_report():
    execute_task("open weather app and screenshot today's forecast",
                 executor, planner, finder, config)

schedule.every().day.at("07:00").do(morning_report)

while True:
    schedule.run_pending()
    time.sleep(1)

Author’s reflection
I usually avoid while True loops, but here the process sleeps 99.9 % of the time and CPU stays at zero. The bigger lesson: once the heavy lifting (screenshot + LLM) is abstracted, the rest is plain Python—no new syntax to memorize.

4.3 REST API in Two Commands

Start server:

uvicorn clickclickclick.api:app --host 0.0.0.0 --port 8000

Call it:

curl -X POST http://localhost:8000/execute \
  -H "Content-Type: application/json" \
  -d '{"task_prompt":"open settings","platform":"android",
       "planner_model":"gemini","finder_model":"gemini"}'

Return body: {"result": true}
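The same call from Python, using only the standard library. The payload fields mirror the curl example above; everything else (function names, the localhost default) is our own scaffolding.

```python
# Minimal stdlib client for the /execute endpoint shown above.
import json
import urllib.request

def build_payload(task_prompt: str, platform: str = "android",
                  planner_model: str = "gemini",
                  finder_model: str = "gemini") -> dict:
    """Assemble the JSON body with the same fields as the curl example."""
    return {
        "task_prompt": task_prompt,
        "platform": platform,
        "planner_model": planner_model,
        "finder_model": finder_model,
    }

def execute(task_prompt: str, host: str = "http://localhost:8000") -> dict:
    req = urllib.request.Request(
        f"{host}/execute",
        data=json.dumps(build_payload(task_prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:   # expects {"result": true}
        return json.load(resp)

if __name__ == "__main__":
    print(execute("open settings"))
```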

4.4 Gradio Web UI

Launch:

click3 gradio

Browser opens; drop-downs let you pick models, the text-box accepts plain English, and a live screenshot pane shows what the phone is seeing—perfect for demos to management.


5. Choosing Models: Speed, Cost, Privacy Triangle

Core Question Answered

“Which planner + finder combo gives me the best trade-off for everyday tasks?”

Official benchmark plus my own bill at 2025-12 list prices (1 USD ≈ 7.2 CNY):

| Setup | Planner | Finder | Median Time | Cost per 100 Calls | Notes |
|-------|---------|--------|-------------|--------------------|-------|
| Best Overall | GPT-4o | Gemini-1.5-Flash | 8 s | ~3.6 USD | Highest success rate on multi-step flows |
| Budget King | GPT-4o-mini | Gemini-1.5-Flash | 6 s | ~0.9 USD | My daily driver; fails only if screen glare hides an icon |
| Offline | Ollama llama3.2-vision | Ollama | 25 s | 0 USD | Good for NDA work; needs 8 GB VRAM |
| Speed Demon | Gemini-Flash | Gemini-Flash | 5 s | Free (15/day) | Starred in all my live demos |
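The decision logic behind that table fits in one illustrative function. The thresholds simply encode the costs quoted above; the table, not this helper, is the source of truth.

```python
# Illustrative combo picker encoding the trade-off table above.
def pick_combo(offline: bool = False,
               budget_usd_per_100: float = 10.0) -> tuple[str, str]:
    """Return a (planner, finder) pair from two constraints."""
    if offline:
        return ("ollama llama3.2-vision", "ollama")   # 0 USD, needs 8 GB VRAM
    if budget_usd_per_100 < 0.9:
        return ("gemini-flash", "gemini-flash")       # free tier, 15 calls/day
    if budget_usd_per_100 < 3.6:
        return ("gpt-4o-mini", "gemini-1.5-flash")    # ~0.9 USD per 100 calls
    return ("gpt-4o", "gemini-1.5-flash")             # best multi-step accuracy

print(pick_combo(budget_usd_per_100=2.0))  # → ('gpt-4o-mini', 'gemini-1.5-flash')
```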

Author’s reflection
I used to think “bigger model always wins.” Reality: for single-screen tasks (toggle Wi-Fi, screenshot a QR code) GPT-4o-mini equals GPT-4o in accuracy but costs 4× less. The real edge of GPT-4o appears only when the Planner must reason across five or more screens—then the extra few cents pay for themselves by avoiding one failed re-run.


6. Configuration YAML Line-by-Line

Core Question Answered

“What can I tweak if the click lands three pixels off or the scroll overshoots?”

Key snippets from config/models.yaml with commentary:

openai:
  api_key: !ENV OPENAI_API_KEY        # never hard-code secrets
  model_name: gpt-4o-mini
  image_width: 512                    # resize before upload; 384→50 % faster, slightly less accurate
  image_height: 512

gemini:
  api_key: !ENV GEMINI_API_KEY
  model_name: gemini-1.5-flash
  image_width: 768                    # Gemini free tier caps at 768 anyway

executor:
  android:
    screen_center_x: 500              # default tap centre; tweak if device has notch
    screen_center_y: 1000
    scroll_distance: 1000             # pixels per swipe; smaller = more steps, better precision
    swipe_distance: 600
    long_press_duration: 1000         # ms; raise if app mis-reads short taps as swipe

Tuning recipe:

  • Icons too small? Raise image_width to 768 so the Finder keeps more detail.
  • List scrolls past the button? Lower scroll_distance to 400 and retry.
  • Long-press triggers edit mode instead of popup? Increase long_press_duration to 1500.
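That three-bullet recipe is mechanical enough to express as code. The dict below is a plain stand-in for the YAML executor section (same keys and defaults as the snippet); the `tune` function is ours, purely to make the rules explicit.

```python
# The tuning recipe above, expressed as a function over a plain-dict
# stand-in for the YAML config (keys mirror the snippet).
def tune(cfg: dict, *, icons_too_small: bool = False,
         overshoots: bool = False, long_press_misfires: bool = False) -> dict:
    cfg = dict(cfg)  # don't mutate the caller's config
    if icons_too_small:
        cfg["image_width"] = max(cfg.get("image_width", 512), 768)
    if overshoots:
        cfg["scroll_distance"] = min(cfg.get("scroll_distance", 1000), 400)
    if long_press_misfires:
        cfg["long_press_duration"] = max(cfg.get("long_press_duration", 1000), 1500)
    return cfg

base = {"image_width": 512, "scroll_distance": 1000, "long_press_duration": 1000}
print(tune(base, overshoots=True))
# → {'image_width': 512, 'scroll_distance': 400, 'long_press_duration': 1000}
```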

7. Troubleshooting Checklist

Core Question Answered

“I typed the command and nothing happened—what now?”

| Symptom | Likely Cause | Instant Fix |
|---------|--------------|-------------|
| `adb devices` prints nothing | Cable in "charge only" mode | Swipe down the notification shade → tap "File transfer" |
| macOS click does nothing | Terminal lacks Accessibility permission | System Settings → Privacy & Security → Accessibility → add Terminal |
| Gemini 429 quota exceeded | 15 free hits used up | Switch to a secondary key or pay-as-you-go |
| Ollama "model not found" | Forgot to pull the model | `ollama pull llama3.2-vision` |
| Click lands in Narnia | Non-standard screen density | Edit `screen_center_x/y` in the YAML |
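For the "Narnia" row, you can derive correct `screen_center_x/y` values from the device's real resolution instead of guessing. `adb shell wm size` reports a line like `Physical size: 1080x2400`; the parsing helper below is ours.

```python
# Derive screen_center_x/y from the device's actual resolution.
import re
import subprocess

def parse_wm_size(output: str) -> tuple[int, int]:
    """Parse `adb shell wm size` output like 'Physical size: 1080x2400'."""
    m = re.search(r"(\d+)x(\d+)", output)
    if not m:
        raise ValueError(f"unexpected wm size output: {output!r}")
    width, height = int(m.group(1)), int(m.group(2))
    return (width // 2, height // 2)   # values for screen_center_x / _y

if __name__ == "__main__":
    out = subprocess.run(["adb", "shell", "wm", "size"],
                         capture_output=True, text=True).stdout
    print("screen_center_x, screen_center_y =", parse_wm_size(out))
```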

Author’s reflection
My first run failed because I ignored the tiny “allow USB debugging” RSA fingerprint dialog on the phone. Moral: when a step feels too obvious, that’s exactly where the devil sits.


8. Roadmap: What the Maintainers Have Publicly Committed

  • iOS support via WebDriverAgent (signing headache, no ETA)
  • Windows executor using Win32API (high demand, 2026 Q1 target)
  • Voice-command integration (whisper + ClickClickClick)
  • Multi-device orchestration (phone + tablet in same script)
  • Plugin system for custom actions (open door for RPA vendors)

9. Practical Action Checklist

  1. Install ADB → enable USB-debugging → adb devices shows ID
  2. pip install git+https://github.com/instavm/clickclickclick.git
  3. export GEMINI_API_KEY="***" (or OpenAI key)
  4. Run click3 run "open calculator and multiply 25 by 47" → success
  5. Adjust config/models.yaml scroll/center values if clicks are off
  6. Migrate working command into cron or Python schedule
  7. Monitor quota; switch to GPT-4o-mini when free Gemini hits limit

One-page Overview

ClickClickClick turns plain English into UI automation by asking an LLM to look at a screenshot and return coordinates. It works on Android (ADB) and macOS (accessibility API), supports OpenAI, Claude, Gemini, and local Ollama, and is packaged as CLI, Python API, REST server, and Gradio web UI. Installation needs Python 3.11+, ADB for Android, and an API key. Cost ranges from free (Ollama) to ~3.6 USD per 100 calls (GPT-4o + Gemini Flash). Best accuracy on multi-step flows is GPT-4o planner + Gemini finder; cheapest reliable combo is GPT-4o-mini + Gemini Flash. Tune scroll_distance and image_width in YAML to fix missed buttons. Roadmap adds iOS, Windows, and a plugin system.


FAQ

  1. Does it work offline?
    Yes—run local Ollama models; download once with ollama pull llama3.2-vision.

  2. Can it handle Chinese app names?
    The LLM sees the same screenshot you see; Chinese labels work, but pinyin or English labels lower error rate slightly.

  3. Is root or jail-break required?
    No. Android uses normal ADB, macOS uses granted accessibility permission.

  4. How accurate is it?
    On 1–5-step tasks success rate >95%; accuracy drops as scroll depth increases.

  5. What if the click misses by a few pixels?
    Edit screen_center_x/y or reduce scroll_distance for finer control.

  6. Can I chain tasks?
    Today you call execute_task multiple times; future plugin system will expose formal chaining.

  7. Is there a Windows version?
    Not yet; roadmap lists Win32API support next year.