
How to Run LLMs on MediaTek Phones Using LiteRT-NeuroPilot

MediaTek NPU × LiteRT: Running LLMs on Phones Without Losing Your Sanity

A field-note-style walkthrough of the new LiteRT NeuroPilot Accelerator—what it is, why it matters, and how to ship a 1B-parameter model in an Android APK in under 30 minutes.


0. One-Sentence Take-away

You can now compile a Gemma 3 1B model once and run it on millions of MediaTek phones at 1 600 tokens/s prefill—without writing a single line of SoC-specific C++—thanks to the LiteRT NeuroPilot Accelerator.


1. Why On-Device LLMs Keep Getting Stuck 1 cm from the Finish Line

Core question: “I already have an INT8 model—why does it still take weeks to ship on a new phone?”

Short answer: Because every SoC family ships its own toolchain, runtime, and kernel quirks; existing ML infra was built for CPU/GPU, not for the 200+ NPU variants in the wild.

| Pain Point | On-the-Ground Symptom | Classic Band-Aid | LiteRT Fix |
|---|---|---|---|
| Fragmented NPUs | 3 000+ SKUs, each with a different compiler flag matrix | Maintain N binaries | Single Accelerator::NPU call |
| First-run JIT cost | 1B model → 60 s compile → user uninstalls | Pre-cache on first launch | AOT compile at build time |
| Data copy tax | Camera frame CPU→GPU→NPU = 3 memcpy | Zero-copy on GPU only | AHardwareBuffer interop |
| Toolchain gap | TFLite delegate only exposes high-level wrappers | Fork & patch TF runtime | Direct NeuroPilot compiler integration |

Author reflection: In 2022 I spent three weeks hand-fusing RMSNorm + MatMul for a mid-range Dimensity 9000 only to discover the OEM disabled the NPU driver in the retail ROM. LiteRT’s fallback chain would have saved me a quarter.


2. What Exactly Is the LiteRT NeuroPilot Accelerator?

Core question: “Is this just another delegate or a full re-write?”

Short answer: Ground-up successor to the TFLite NeuroPilot delegate—talks directly to MediaTek’s compiler, ships as a Play-ready “AI Pack,” and exposes a unified C++ / Kotlin API.

Key architectural shifts:

  • Ahead-of-Time (AOT) path – model → .dcp ELF before APK upload
  • On-device (JIT) path – model compiles at install/run time
  • PODAI delivery – Google Play pushes the right .dcp for the exact SoC
  • Zero-copy media path – AHardwareBuffer / OpenGL SSBO interop
  • Stateful LLM API – text-in/text-out session management (LiteRT-LM); a minimal sketch follows below
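
To make that last bullet concrete, here is a minimal multi-turn sketch in Kotlin. Class and method names are borrowed from the Step 3 snippet in Section 4; the idea that one session carries conversation state across turns is an assumption about LiteRT-LM, not something taken from its documentation.

// Sketch only: names follow the Step 3 snippet; multi-turn behaviour is an assumption about LiteRT-LM.
val opts = LiteRT.Options().apply { accelerator = LiteRT.Accelerator.NPU }
val model = LiteRT.Model.createFromPack("podai/GemmaChat.ai.pack", opts)
val chat = model.createSession()                  // assumption: the session owns the history / KV cache
val first = chat.generate("List three chores an on-device LLM can handle offline.")
val followUp = chat.generate("Which of those needs the least memory?")   // "those" resolved from turn 1
Log.i("GemmaChat", followUp.text)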

3. AOT vs JIT—Which One When?

Core question: “Should I ship a 150 MB model pre-compiled or let the phone compile it on first open?”

| Model Size | Target SoC Known? | Latency Budget | Binary Size Budget | Recommendation |
|---|---|---|---|---|
| < 300 MB | No | < 2 s | Minimal | JIT |
| ≥ 300 MB | Yes | < 100 ms | Accept multi-SoC blobs | AOT |
| ≥ 300 MB | No | Any | Minimal | Not supported yet—wait for multi-SoC AOT bundle |
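
For a quick sanity check in code review, the table collapses to a few lines of Kotlin. This is illustrative decision logic only; the thresholds come from the table above, not from any LiteRT API.

enum class CompileStrategy { JIT, AOT, WAIT_FOR_MULTI_SOC_AOT }
// Encodes the recommendation column of the table; nothing here touches LiteRT itself.
fun pickStrategy(modelSizeMb: Int, targetSocKnown: Boolean): CompileStrategy = when {
    modelSizeMb < 300 -> CompileStrategy.JIT                      // small model: compile on device
    targetSocKnown    -> CompileStrategy.AOT                      // large model, known SoC: ship a .dcp
    else              -> CompileStrategy.WAIT_FOR_MULTI_SOC_AOT   // large model, unknown SoC: hold off
}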

Real numbers on Dimensity 9500 (Gemma 3 270M, 4K context):

  • AOT compile time: 3 min on M2 MacBook
  • AOT first inference: 28 ms
  • JIT compile time: 72 s on device
  • JIT first inference: 28 ms (same kernel, only compile cost differs)

Author reflection: We AOT-compile every model > 200 MB; JIT is relegated to nightly beta builds where power metrics don’t matter.


4. End-to-End: Gemma 3 270M on a Vivo X300 Pro in 3 Steps

Core question: “I have never touched MediaTek SDK—can I still see NPU logs in under 30 min?”

Step 1 – AOT Compile (macOS, Linux, WSL all work)

python -m pip install -U ai-edge-litert
python -m litert.compiler \
  --model gemma-3-270m-it.tflite \
  --target_soc d9500 \
  --output_dir build/aot
# produces
# ├── gemma-3-270m-it.d9500.dcp        (ELF)
# └── gemma-3-270m-it.config.json      (metadata)

Step 2 – Package as PODAI “AI Pack”

litert-aot-pack \
  --dcp build/aot \
  --manifest ai_pack.json \
  --name GemmaChat \
  --version 1.0.0
# creates GemmaChat.ai.pack → drop into `assets/podai/`

Step 3 – Runtime (Kotlin, Android 14)

val opts = LiteRT.Options().apply {
    accelerator = LiteRT.Accelerator.NPU
    fallbackAccel = LiteRT.Accelerator.GPU
}
val model = LiteRT.Model.createFromPack("podai/GemmaChat.ai.pack", opts)
val sess = model.createSession()
val reply = sess.generate("How tall is the Eiffel Tower?")
Log.i("Gemma", reply.text)   // > "330 metres"

Cold-start on Vivo X300 Pro: 31 ms, NPU utilization 38 %, battery temp +2.3 °C.


5. Zero-Copy Camera Pipeline—1080p@30 fps Without memcpy

Core question: “How do I feed a GPU-filtered camera frame into NPU without stalling the preview?”

LiteRT exposes TensorBuffer::CreateFromGlBuffer which maps the OpenGL SSBO directly into NeuroPilot’s ION heap:

// GPU preprocessing has finished; the filtered frame lives in `gl_ssbo`.
// `env`, `tensor_type`, `gl_ssbo`, and `sizeBytes` come from the surrounding setup (not shown).
auto input = TensorBuffer::CreateFromGlBuffer(
        env, tensor_type,
        GL_SHADER_STORAGE_BUFFER, gl_ssbo, sizeBytes, 0);
compiledModel.Run({input}, outputBuffers);
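
For context, here is a sketch of the GL side that could produce gl_ssbo, using plain GLES 3.1 calls from the Android SDK. The 16×16 work-group size and the preprocessProgram compute shader are assumptions; the buffer and dispatch calls themselves are standard GLES31 API, unrelated to LiteRT.

import android.opengl.GLES31
// Assumes an EGL context is current, `preprocessProgram` is an already-compiled compute
// shader (not shown), and `sizeBytes: Int` matches the input tensor the model expects.
val ids = IntArray(1)
GLES31.glGenBuffers(1, ids, 0)
val glSsbo = ids[0]
GLES31.glBindBuffer(GLES31.GL_SHADER_STORAGE_BUFFER, glSsbo)
GLES31.glBufferData(GLES31.GL_SHADER_STORAGE_BUFFER, sizeBytes, null, GLES31.GL_DYNAMIC_COPY)
GLES31.glBindBufferBase(GLES31.GL_SHADER_STORAGE_BUFFER, 0, glSsbo)
GLES31.glUseProgram(preprocessProgram)               // GPU filter writes its output into the SSBO
GLES31.glDispatchCompute(1920 / 16, 1080 / 16, 1)    // assumes a 16×16 local work-group
GLES31.glMemoryBarrier(GLES31.GL_SHADER_STORAGE_BARRIER_BIT)
// Hand `glSsbo` + `sizeBytes` to the native code above (TensorBuffer::CreateFromGlBuffer)
// instead of reading the pixels back to the CPU.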

Measured on Dimensity 9500 reference board:

  • Memory bandwidth saved: 1.8 GB/s
  • FPS increase: 12 %
  • Power reduction: 0.4 W

Author reflection: We used to go glReadPixels → CPU → memcpy → NPU. That alone ate 8 ms per frame—more than the inference itself. Zero-copy is the single biggest win after quantization.


6. Production Case Cards

| Feature | Model | Device | Key KPI | User Quote |
|---|---|---|---|---|
| Live caption translation | Gemma 3 1B | Chromebook Plus 14 | 110 token/s, <150 ms first token | "Feels like offline Google Translate" |
| Point-and-ask gardening | Gemma 3n E2B | Vivo X300 Pro | 1 600 token/s prefill | "It named my orchid and told me when to water" |
| Voice search playlist | EmbeddingGemma 300M | Smart display | 1.2 ms/query | "Found the happy song before I finished humming" |

7. Benchmarks—NPU vs GPU vs CPU (Gemma 3 270M, INT8, 4K ctx)

| Hardware | Throughput | Latency | Power |
|---|---|---|---|
| CPU (big cores) | 120 tokens/s | 330 ms | 3.8 W |
| GPU | 140 tokens/s | 280 ms | 2.2 W |
| NPU | 1 600 tokens/s | 28 ms | 0.32 W |

Numbers captured at 25 °C, 200-nit screen brightness, 50 % battery, airplane mode.
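
A derived figure worth pulling out of the table: tokens per joule (throughput divided by power), a rough proxy for battery cost per generated token. The arithmetic below uses only the numbers in the table.

// Tokens per joule = (tokens/s) / (J/s); values taken straight from the table above.
val cpuTokensPerJoule = 120 / 3.8      // ≈ 32
val gpuTokensPerJoule = 140 / 2.2      // ≈ 64
val npuTokensPerJoule = 1600 / 0.32    // = 5 000
// The NPU is roughly 13× faster than the big CPU cores and ~150× more energy-efficient per token.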


8. Debugging Checklist—Top 5 Ways It Goes Wrong

  1. NPU node missing in logcat
    adb shell getprop ro.hardware.npu must return true; else ROM is stubbed.
  2. “Unsupported Op: RmsNorm” during AOT
    Upgrade ai-edge-litert>=1.0.2 or decompose op to LayerNorm + Mul.
  3. Segfault on first inference
    Mismatched .dcp → SoC; verify `adb shell getprop ro.board.platform` matches the compile target.
  4. Green-tinted output with zero-copy
    SSBO size not 64-byte aligned; align to ceil(numBytes/64)*64 (see the one-liner after this list).
  5. Play Console rejects AI Pack
    Model size inside pack must be ≤ 150 MB; use Play Asset Delivery for larger.
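
For item 4, the alignment fix is a one-liner; the helper below simply applies the ceil(numBytes/64)*64 formula from the checklist.

// Round an SSBO allocation up to the next 64-byte boundary (debug item 4).
fun alignTo64(numBytes: Int): Int = ((numBytes + 63) / 64) * 64
// e.g. alignTo64(1000) == 1024; alignTo64(1024) == 1024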

9. Author Reflection—From Assembly to Annotation

Ten years ago I hand-wrote ARM9 assembly to fit a face detector in 200 KB; today a single line, accelerator = NPU, launches a 1B-parameter LLM. The lesson isn't that models got smaller—it's that when hardware vendors expose their secret sauce through stable abstraction layers, software teams can finally return their focus to user value instead of register maps.


Action Checklist / Implementation Steps

  1. Install ai-edge-litert Python package ≥ 1.0.2
  2. Convert model to INT8 .tflite (use LiteRT quantizer)
  3. Pick AOT for ≥300 MB models; JIT otherwise
  4. Compile with --target_soc matching your test device
  5. Pack .ai.pack and drop into assets/podai/
  6. In code, set accelerator = NPU + GPU/CPU fallback
  7. Validate on retail ROM (not eng firmware) for driver presence
  8. Upload to Play Console; verify PODAI delivery in pre-launch report

One-page Overview

  • LiteRT NeuroPilot Accelerator = unified API + AOT/JIT + Play delivery
  • AOT best for large models; JIT for small/hot-fix
  • Zero-copy via AHardwareBuffer / OpenGL SSBO saves >1 GB/s bandwidth
  • Gemma 3 270M on Dimensity 9500: 28 ms, 0.32 W, 1600 token/s
  • Three terminal commands + ten lines of Kotlin → working NPU demo

FAQ

Q1 Which MediaTek SoCs are supported today?
Dimensity 9500, 9300, 8300; 7000 series slated for 2025 Q3.

Q2 Can I bring my own PyTorch checkpoint?
Yes—convert to .tflite (INT8) then follow the same AOT/JIT flow.

Q3 Do I need a Linux workstation for AOT?
No—macOS, Windows/WSL, or Linux all work; only Python ≥3.10 required.

Q4 What happens if the user device has no NPU driver?
LiteRT automatically falls back to GPU then CPU; no crash, logged once.

Q5 Is there a size limit for PODAI uploads to Play?
150 MB per AI Pack; larger models require Play Asset Delivery chunks.

Q6 When will Qualcomm Snapdragon be supported?
Roadmap mentions 2026 Q2 for unified multi-vendor release; MediaTek exclusive until then.

Q7 Does zero-copy work with Vulkan buffers?
Not yet—OpenGL SSBO and AHardwareBuffer are the only certified paths today.
