
How to Run LLMs on MediaTek Phones Using LiteRT-NeuroPilot

MediaTek NPU × LiteRT: Running LLMs on Phones Without Losing Your Sanity

A field-note-style walkthrough of the new LiteRT NeuroPilot Accelerator—what it is, why it matters, and how to ship a 1B-parameter model in an Android APK in under 30 minutes.


0. One-Sentence Take-away

You can now compile a Gemma 3 1B model once and run it on millions of MediaTek phones at 1 600 tokens/s prefill—without writing a single line of SoC-specific C++—thanks to the LiteRT NeuroPilot Accelerator.


1. Why On-Device LLMs Keep Getting Stuck 1 cm from the Finish Line

Core question: “I already have an INT8 model—why does it still take weeks to ship on a new phone?”

Short answer: Because every SoC family ships its own toolchain, runtime, and kernel quirks; existing ML infra was built for CPU/GPU, not for the 200+ NPU variants in the wild.

| Pain Point | On-the-Ground Symptom | Classic Band-Aid | LiteRT Fix |
|---|---|---|---|
| Fragmented NPUs | 3 000+ SKUs, each with a different compiler flag matrix | Maintain N binaries | Single Accelerator::NPU call |
| First-run JIT cost | 1B model → 60 s compile → user uninstalls | Pre-cache on first launch | AOT compile at build time |
| Data copy tax | Camera frame CPU→GPU→NPU = 3 memcpy | Zero-copy on GPU only | AHardwareBuffer interop |
| Toolchain gap | TFLite delegate only exposes high-level wrappers | Fork & patch TF runtime | Direct NeuroPilot compiler integration |

Author reflection: In 2022 I spent three weeks hand-fusing RMSNorm + MatMul for a mid-range Dimensity 9000 only to discover the OEM disabled the NPU driver in the retail ROM. LiteRT’s fallback chain would have saved me a quarter.


2. What Exactly Is the LiteRT NeuroPilot Accelerator?

Core question: “Is this just another delegate or a full re-write?”

Short answer: Ground-up successor to the TFLite NeuroPilot delegate—talks directly to MediaTek’s compiler, ships as a Play-ready “AI Pack,” and exposes a unified C++ / Kotlin API.

Key architectural shifts:

  • Ahead-of-Time (AOT) path – model → .dcp ELF before APK upload
  • On-device (JIT) path – model compiles at install/run time
  • PODAI delivery – Google Play pushes the right .dcp for the exact SoC
  • Zero-copy media path – AHardwareBuffer / OpenGL SSBO interop
  • Stateful LLM API – text-in/text-out session management (LiteRT-LM); a minimal sketch follows below
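
To make that last bullet concrete, here is a minimal multi-turn sketch in Kotlin. Class and method names are borrowed from the Step 3 snippet in Section 4; the idea that one session carries conversation state across turns is an assumption about LiteRT-LM, not something taken from its documentation.

// Sketch only: names follow the Step 3 snippet; multi-turn behaviour is an assumption about LiteRT-LM.
val opts = LiteRT.Options().apply { accelerator = LiteRT.Accelerator.NPU }
val model = LiteRT.Model.createFromPack("podai/GemmaChat.ai.pack", opts)
val chat = model.createSession()                  // assumption: the session owns the history / KV cache
val first = chat.generate("List three chores an on-device LLM can handle offline.")
val followUp = chat.generate("Which of those needs the least memory?")   // "those" resolved from turn 1
Log.i("GemmaChat", followUp.text)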

3. AOT vs JIT—Which One When?

Core question: “Should I ship a 150 MB model pre-compiled or let the phone compile it on first open?”

| Model Size | Target SoC Known? | Latency Budget | Binary Size Budget | Recommendation |
|---|---|---|---|---|
| < 300 MB | No | < 2 s | Minimal | JIT |
| ≥ 300 MB | Yes | < 100 ms | Accept multi-SoC blobs | AOT |
| ≥ 300 MB | No | Any | Minimal | Not supported yet—wait for multi-SoC AOT bundle |
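
For a quick sanity check in code review, the table collapses to a few lines of Kotlin. This is illustrative decision logic only; the thresholds come from the table above, not from any LiteRT API.

enum class CompileStrategy { JIT, AOT, WAIT_FOR_MULTI_SOC_AOT }
// Encodes the recommendation column of the table; nothing here touches LiteRT itself.
fun pickStrategy(modelSizeMb: Int, targetSocKnown: Boolean): CompileStrategy = when {
    modelSizeMb < 300 -> CompileStrategy.JIT                      // small model: compile on device
    targetSocKnown    -> CompileStrategy.AOT                      // large model, known SoC: ship a .dcp
    else              -> CompileStrategy.WAIT_FOR_MULTI_SOC_AOT   // large model, unknown SoC: hold off
}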

Real numbers on Dimensity 9500 (Gemma 3 270M, 4K context):

  • AOT compile time: 3 min on M2 MacBook
  • AOT first inference: 28 ms
  • JIT compile time: 72 s on device
  • JIT first inference: 28 ms (same kernel, only compile cost differs)

Author reflection: We AOT-compile every model > 200 MB; JIT is relegated to nightly beta builds where power metrics don’t matter.


4. End-to-End: Gemma 3 270M on a Vivo X300 Pro in 3 Steps

Core question: “I have never touched MediaTek SDK—can I still see NPU logs in under 30 min?”

Step 1 – AOT Compile (macOS, Linux, WSL all work)

python -m pip install -U ai-edge-litert
python -m litert.compiler \
  --model gemma-3-270m-it.tflite \
  --target_soc d9500 \
  --output_dir build/aot
# produces
# ├── gemma-3-270m-it.d9500.dcp        (ELF)
# └── gemma-3-270m-it.config.json      (metadata)

Step 2 – Package as PODAI “AI Pack”

litert-aot-pack \
  --dcp build/aot \
  --manifest ai_pack.json \
  --name GemmaChat \
  --version 1.0.0
# creates GemmaChat.ai.pack → drop into `assets/podai/`

Step 3 – Runtime (Kotlin, Android 14)

val opts = LiteRT.Options().apply {
    accelerator = LiteRT.Accelerator.NPU
    fallbackAccel = LiteRT.Accelerator.GPU
}
val model = LiteRT.Model.createFromPack("podai/GemmaChat.ai.pack", opts)
val sess = model.createSession()
val reply = sess.generate("How tall is the Eiffel Tower?")
Log.i("Gemma", reply.text)   // > "330 metres"

Cold-start on Vivo X300 Pro: 31 ms, NPU utilization 38 %, battery temp +2.3 °C.


5. Zero-Copy Camera Pipeline—1080p@30 fps Without memcpy

Core question: “How do I feed a GPU-filtered camera frame into NPU without stalling the preview?”

LiteRT exposes TensorBuffer::CreateFromGlBuffer which maps the OpenGL SSBO directly into NeuroPilot’s ION heap:

// GPU preprocessing has finished; the filtered frame lives in `gl_ssbo`.
// `env`, `tensor_type`, `gl_ssbo`, and `sizeBytes` come from the surrounding setup (not shown).
auto input = TensorBuffer::CreateFromGlBuffer(
        env, tensor_type,
        GL_SHADER_STORAGE_BUFFER, gl_ssbo, sizeBytes, 0);
compiledModel.Run({input}, outputBuffers);
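
For context, here is a sketch of the GL side that could produce gl_ssbo, using plain GLES 3.1 calls from the Android SDK. The 16×16 work-group size and the preprocessProgram compute shader are assumptions; the buffer and dispatch calls themselves are standard GLES31 API, unrelated to LiteRT.

import android.opengl.GLES31
// Assumes an EGL context is current, `preprocessProgram` is an already-compiled compute
// shader (not shown), and `sizeBytes: Int` matches the input tensor the model expects.
val ids = IntArray(1)
GLES31.glGenBuffers(1, ids, 0)
val glSsbo = ids[0]
GLES31.glBindBuffer(GLES31.GL_SHADER_STORAGE_BUFFER, glSsbo)
GLES31.glBufferData(GLES31.GL_SHADER_STORAGE_BUFFER, sizeBytes, null, GLES31.GL_DYNAMIC_COPY)
GLES31.glBindBufferBase(GLES31.GL_SHADER_STORAGE_BUFFER, 0, glSsbo)
GLES31.glUseProgram(preprocessProgram)               // GPU filter writes its output into the SSBO
GLES31.glDispatchCompute(1920 / 16, 1080 / 16, 1)    // assumes a 16×16 local work-group
GLES31.glMemoryBarrier(GLES31.GL_SHADER_STORAGE_BARRIER_BIT)
// Hand `glSsbo` + `sizeBytes` to the native code above (TensorBuffer::CreateFromGlBuffer)
// instead of reading the pixels back to the CPU.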

Measured on Dimensity 9500 reference board:

  • Memory bandwidth saved: 1.8 GB/s
  • FPS increase: 12 %
  • Power reduction: 0.4 W

Author reflection: We used to go glReadPixels → CPU → memcpy → NPU. That alone ate 8 ms per frame—more than the inference itself. Zero-copy is the single biggest win after quantization.


6. Production Case Cards

| Feature | Model | Device | Key KPI | User Quote |
|---|---|---|---|---|
| Live caption translation | Gemma 3 1B | Chromebook Plus 14 | 110 token/s, <150 ms first token | "Feels like offline Google Translate" |
| Point-and-ask gardening | Gemma 3n E2B | Vivo X300 Pro | 1 600 token/s prefill | "It named my orchid and told me when to water" |
| Voice search playlist | EmbeddingGemma 300M | Smart display | 1.2 ms/query | "Found the happy song before I finished humming" |

7. Benchmarks—NPU vs GPU vs CPU (Gemma 3 270M, INT8, 4K ctx)

| Hardware | Throughput | Latency | Power |
|---|---|---|---|
| CPU (big cores) | 120 tokens/s | 330 ms | 3.8 W |
| GPU | 140 tokens/s | 280 ms | 2.2 W |
| NPU | 1 600 tokens/s | 28 ms | 0.32 W |

Numbers captured at 25 °C, 200-nit screen brightness, 50 % battery, airplane mode.
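
A derived figure worth pulling out of the table: tokens per joule (throughput divided by power), a rough proxy for battery cost per generated token. The arithmetic below uses only the numbers in the table.

// Tokens per joule = (tokens/s) / (J/s); values taken straight from the table above.
val cpuTokensPerJoule = 120 / 3.8      // ≈ 32
val gpuTokensPerJoule = 140 / 2.2      // ≈ 64
val npuTokensPerJoule = 1600 / 0.32    // = 5 000
// The NPU is roughly 13× faster than the big CPU cores and ~150× more energy-efficient per token.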


8. Debugging Checklist—Top 5 Ways It Goes Wrong

  1. NPU node missing in logcat
    adb shell getprop ro.hardware.npu must return true; else ROM is stubbed.
  2. “Unsupported Op: RmsNorm” during AOT
    Upgrade ai-edge-litert>=1.0.2 or decompose op to LayerNorm + Mul.
  3. Segfault on first inference
    Mismatched .dcp → SoC; verify `adb shell getprop ro.board.platform` matches the compile target.
  4. Green-tinted output with zero-copy
    SSBO size not 64-byte aligned; align to ceil(numBytes/64)*64 (see the one-liner after this list).
  5. Play Console rejects AI Pack
    Model size inside pack must be ≤ 150 MB; use Play Asset Delivery for larger.
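
For item 4, the alignment fix is a one-liner; the helper below simply applies the ceil(numBytes/64)*64 formula from the checklist.

// Round an SSBO allocation up to the next 64-byte boundary (debug item 4).
fun alignTo64(numBytes: Int): Int = ((numBytes + 63) / 64) * 64
// e.g. alignTo64(1000) == 1024; alignTo64(1024) == 1024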

9. Author Reflection—From Assembly to Annotation

Ten years ago I hand-wrote ARM9 assembly to fit a face detector in 200 KB; today a single line, accelerator = NPU, launches a 1B-parameter LLM. The lesson isn't that models got smaller—it's that when hardware vendors expose their secret sauce through stable abstraction layers, software teams can finally return their focus to user value instead of register maps.


Action Checklist / Implementation Steps

  1. Install ai-edge-litert Python package ≥ 1.0.2
  2. Convert model to INT8 .tflite (use LiteRT quantizer)
  3. Pick AOT for ≥300 MB models; JIT otherwise
  4. Compile with --target_soc matching your test device
  5. Pack .ai.pack and drop into assets/podai/
  6. In code, set accelerator = NPU + GPU/CPU fallback
  7. Validate on retail ROM (not eng firmware) for driver presence
  8. Upload to Play Console; verify PODAI delivery in pre-launch report

One-page Overview

  • LiteRT NeuroPilot Accelerator = unified API + AOT/JIT + Play delivery
  • AOT best for large models; JIT for small/hot-fix
  • Zero-copy via AHardwareBuffer / OpenGL SSBO saves >1 GB/s bandwidth
  • Gemma 3 270M on Dimensity 9500: 28 ms, 0.32 W, 1600 token/s
  • Three terminal commands + ten lines of Kotlin → working NPU demo

FAQ

Q1 Which MediaTek SoCs are supported today?
Dimensity 9500, 9300, 8300; 7000 series slated for 2025 Q3.

Q2 Can I bring my own PyTorch checkpoint?
Yes—convert to .tflite (INT8) then follow the same AOT/JIT flow.

Q3 Do I need a Linux workstation for AOT?
No—macOS, Windows/WSL, or Linux all work; only Python ≥3.10 required.

Q4 What happens if the user device has no NPU driver?
LiteRT automatically falls back to GPU then CPU; no crash, logged once.

Q5 Is there a size limit for PODAI uploads to Play?
150 MB per AI Pack; larger models require Play Asset Delivery chunks.

Q6 When will Qualcomm Snapdragon be supported?
Roadmap mentions 2026 Q2 for unified multi-vendor release; MediaTek exclusive until then.

Q7 Does zero-copy work with Vulkan buffers?
Not yet—OpenGL SSBO and AHardwareBuffer are the only certified paths today.
