Google LiteRT NeuroPilot: Making Phone NPUs “First-Class Citizens” for On-Device LLMs
In the push for faster, more private AI experiences, running Large Language Models (LLMs) directly on devices is the critical next step. Yet fitting models with billions of parameters into smartphones and running them smoothly has remained a significant challenge for developers. The LiteRT NeuroPilot Accelerator stack, recently launched by Google and MediaTek, aims to turn the NPUs (Neural Processing Units) in MediaTek’s Dimensity series chips into the “preferred target” for on-device LLMs. This is not just another technical update; it seeks to fundamentally change how developers interact with dedicated AI hardware in phones.
Overview: Goodbye Fragmentation, Hello Unified NPU Development
Imagine developing a mobile AI app that requires different optimization, compilation, and packaging for dozens of different MediaTek chip models on the market—this was the norm for on-device machine learning development in the past. While powerful, NPUs from different vendors came with distinct software development kits (SDKs) and toolchains, leading to complex workflows and high adaptation costs.
The core goal of the LiteRT NeuroPilot Accelerator is to end this chaos. By deeply integrating Google’s LiteRT runtime with MediaTek’s NeuroPilot NPU software stack, it provides developers with a unified, high-level API abstraction layer. Now, developers can deploy models to a wide range of Dimensity chips—covering mid-range to flagship segments—by simply selecting the Accelerator.NPU option, much like they would for CPU or GPU, without writing custom code for each chip.
Core Summary: Google’s LiteRT NeuroPilot Accelerator stack uses a unified API to deeply integrate MediaTek Dimensity series NPUs into the on-device AI runtime. It supports open-weight models like Qwen3 and the Gemma-3 series. On the Dimensity 9500, it enables the Gemma 3n E2B model to achieve over 1600 tokens/s in prefill speed and reduces first-run compilation time for large models from minutes to zero via AOT (Ahead-of-Time) compilation, dramatically simplifying the deployment of high-performance on-device LLMs.
Deep Dive: What is the LiteRT NeuroPilot Accelerator?
To understand this technology, let’s clarify a few key components:
- LiteRT: Think of it as the evolution of TensorFlow Lite. It’s a high-performance, on-device runtime designed to execute models in the .tflite format. Its key feature is a unified hardware acceleration layer that can intelligently dispatch computational tasks to the CPU, GPU, and now, seamlessly to the NPU via the new Accelerator.
- NeuroPilot: This is MediaTek’s software ecosystem for its NPUs, comprising the compiler, drivers, and runtime.
- LiteRT NeuroPilot Accelerator: This is the bridge connecting the two above. It no longer treats the NPU as a peripheral plugin like the old “TFLite NeuroPilot Delegate” did. Instead, it achieves direct integration with the NeuroPilot compiler and runtime. This brings a fundamental shift: an upgrade from the “delegate” model to a “Compiled Model” API.
This new API supports both Ahead-of-Time (AOT) compilation and on-device Just-in-Time (JIT) compilation, exposed to developers through the same C++ and Kotlin APIs. This means developers can use the same code logic whether they are pre-optimizing a model for a specific chip for peak performance or compiling it on the user’s device at first run for distribution flexibility.
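To make that concrete, here is a minimal, hypothetical sketch of the shared code path. It reuses only API names that appear elsewhere in this article (Model::CreateFromFile, Options, CompiledModel, kLiteRtHwAcceleratorNpu); the file names are illustrative, and it assumes a LiteRT Environment named env has already been created.
// Hypothetical sketch: the same CompiledModel path serves both compilation modes;
// only the artifact handed to Model::CreateFromFile differs.
// AOT: an artifact pre-compiled for the target SoC (e.g. delivered in an AI Pack).
auto model = Model::CreateFromFile("gemma3_270m_npu_aot.bin");   // file name is illustrative
// JIT alternative: the generic .tflite, compiled on-device by NeuroPilot at first run.
// auto model = Model::CreateFromFile("gemma3_270m.tflite");
// Everything downstream is identical in both cases.
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);
auto compiled = CompiledModel::Create(*env, *model, *options);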
Supported Hardware: Currently, the integration explicitly targets MediaTek Dimensity 7300, 8300, 9000, 9200, 9300, and 9400 series Systems-on-Chip (SoCs). These chips cover a large portion of the Android mid-range and flagship smartphone market, providing applications with a vast potential hardware base.
Why Should Developers Care? A Workflow Revolution
For engineers, the most immediate value of LiteRT NeuroPilot is a highly simplified three-step workflow. Whether the target device has a Dimensity 7300 or 9400, the process remains identical:
- Model Preparation: Convert or load your .tflite model as usual.
- Optional AOT Compilation: Use LiteRT’s Python tools for AOT compilation to produce an “AI Pack” tied to one or more target SoCs. This is the critical step for deploying production-grade LLMs.
- Deployment & Execution: Distribute the AI Pack via Google Play’s on-device AI services. At runtime, simply select Accelerator.NPU. LiteRT handles device targeting, runtime loading, and automatically falls back to GPU or CPU if the NPU is unavailable.
This design strips complex device adaptation logic from the application code, moving it into structured configuration files and distribution channels. As a developer, your main interaction is with the CompiledModel API and the choice of hardware accelerator.
AOT vs. On-Device Compilation: A Critical Performance Decision
Both compilation modes have pros and cons; the choice depends on your model size and deployment strategy:
- AOT Compilation: The model is compiled in advance on the developer’s side for a known SoC model. This is the recommended approach for deploying larger models (like LLMs). It completely eliminates the time-consuming first-run compilation on the user’s device. Data shows that for a model like Gemma-3-270M, pure on-device compilation can take over 1 minute, which is unacceptable in practical applications. AOT absorbs this cost upfront, allowing users to experience peak performance immediately upon opening the app.
- On-Device JIT Compilation: The generic .tflite file is distributed and compiled on the user’s device during the first run. This is better suited for smaller models or scenarios requiring maximum distribution flexibility, at the cost of significant “first-run latency” for the user (one mitigation is sketched below).
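If you do ship the generic .tflite and rely on on-device compilation, one app-level mitigation is to trigger that first compilation in the background right after install rather than on the first user request. The sketch below is a hypothetical pattern, not part of the official workflow; it assumes a valid Environment and reuses only API names from this article.
#include <thread>
// Hypothetical warm-up sketch: kick off on-device (JIT) compilation on a
// background thread at first launch, so the ~1-minute cost is not paid when
// the user first invokes the feature.
void WarmUpNpuModelAtFirstLaunch(Environment* env) {
  std::thread([env] {
    auto model = Model::CreateFromFile("model.tflite");
    auto options = Options::Create();
    options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);
    // Creating the CompiledModel here triggers NeuroPilot's on-device compilation.
    // In a real app you would hand the resulting object to your inference layer
    // (or rely on any on-device compilation cache, if available) instead of
    // discarding it when the thread exits.
    auto compiled = CompiledModel::Create(*env, *model, *options);
    (void)compiled;
  }).detach();
}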
Performance in Practice: Which Models Can Sprint on MediaTek NPUs?
This technology stack is not just a theoretical framework; it’s built around a set of concrete, open-weight models with explicit production-ready support. These models cover various tasks from text generation to multimodal understanding:
- Qwen3 0.6B: Designed for text generation, particularly for markets like Mainland China.
- Gemma-3-270M: A compact base model, easy to fine-tune for tasks like sentiment analysis and entity extraction.
- Gemma-3-1B: A multilingual text-only model for summarization and general reasoning.
- Gemma-3n E2B: A multimodal model capable of handling text, audio, and visual inputs for use cases like real-time translation and visual question answering.
- EmbeddingGemma 300M: A text embedding model optimized for retrieval-augmented generation, semantic search, and classification.
Quantified Performance:
On a Vivo X300 Pro device powered by the latest Dimensity 9500 chip, the Gemma 3n E2B model running on the NPU demonstrated impressive performance:
- Prefill Phase: Speed exceeds 1600 tokens per second (at a 4K context length).
- Decode/Generation Phase: Speed reaches 28 tokens per second (also at 4K context length).
- Comparative Advantage: For LLM workloads, measured throughput on the NPU was up to 12 times faster than CPU and 10 times faster than GPU.
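To put those numbers in perspective: at more than 1600 tokens/s, a full 4,096-token prompt is prefilled in roughly 4096 / 1600 ≈ 2.6 seconds, and a 28 tokens/s decode rate is comfortably above typical human reading speed.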
API Layers for Different Tasks:
- Text Generation: Uses the higher-level LiteRT-LM engine. Built on top of LiteRT, it provides a stateful, text-in/text-out API. A typical flow involves creating model assets, building an engine that specifies litert::lm::Backend::NPU, then creating a session to generate content (see the sketch after this list).
- Embedding Tasks: Uses the lower-level LiteRT CompiledModel API in a tensor-in/tensor-out configuration, again selecting the NPU via hardware accelerator options.
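For the text-generation path, the flow described above looks roughly like the sketch below. Treat it as a hedged illustration: litert::lm::Backend::NPU is the only identifier taken from this article, while ModelAssets, EngineSettings, Engine::CreateEngine, CreateSession, and GenerateContent are assumed names for the “assets → engine → session” steps, not verified signatures. (The embedding path instead uses the CompiledModel API shown in the next section.)
// Hedged sketch of the LiteRT-LM text-generation flow targeting the NPU.
// Error handling is omitted; names other than litert::lm::Backend::NPU are
// illustrative assumptions, and the model file path is made up.
auto assets = litert::lm::ModelAssets::Create("gemma3_1b.litertlm");
auto settings = litert::lm::EngineSettings::CreateDefault(*assets, litert::lm::Backend::NPU);
auto engine = litert::lm::Engine::CreateEngine(*settings);
// A session holds conversation state; generation is text-in / text-out.
auto session = (*engine)->CreateSession({});
auto response = (*session)->GenerateContent({"Summarize today's meeting notes."});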
Developer Experience Guide: C++ API & Zero-Copy Memory
LiteRT introduces a new, object-oriented C++ API that replaces the older C entry points, centered around core objects like Environment, Model, CompiledModel, and TensorBuffer.
For developers targeting MediaTek NPUs, a key advantage of this API is its deep integration with Android hardware buffers. You can construct input TensorBuffer instances directly from OpenGL or OpenCL buffers using methods like TensorBuffer::CreateFromGlBuffer. This allows data from camera or video processing pipelines to be fed directly into the NPU for computation without an intermediate copy through CPU memory. Avoiding multiple copies per frame is crucial for real-time video processing and other scenarios highly sensitive to memory bandwidth.
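A minimal sketch of this zero-copy path is shown below. TensorBuffer::CreateFromGlBuffer is the method named by the LiteRT C++ API, but the exact parameter list used here is an assumption rather than a verified signature; the placeholder variables stand in for objects your pipeline already owns.
// Hedged zero-copy sketch: wrap an OpenGL buffer that the camera/video pipeline
// already filled as a LiteRT input tensor, so no CPU-side copy is needed.
// This parameter list is an assumption, not a verified signature.
GLuint camera_ssbo = ...;              // GL buffer object already populated upstream
size_t buffer_size_bytes = ...;        // size of that buffer in bytes
auto input_tensor_type = ...;          // tensor type of the model input (assumed)
auto gl_input = TensorBuffer::CreateFromGlBuffer(
    *env,                              // LiteRT Environment
    input_tensor_type,
    GL_SHADER_STORAGE_BUFFER,          // GL target (assumed)
    camera_ssbo,
    buffer_size_bytes,
    /*buffer_offset=*/0);
// The wrapped buffer is then passed to CompiledModel::Run like any other input.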
Here is a simplified C++ code example illustrating how to use the new API to run a compiled model on-device:
// 0. Create the LiteRT environment (owns runtime and accelerator state)
auto env = Environment::Create({});
// 1. Load the model (could be a .tflite or a pre-compiled AI Pack)
auto model = Model::CreateFromFile("model.tflite");
// 2. Create configuration options and explicitly specify NPU acceleration
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);
// 3. Create the compiled model instance (LiteRT chooses the optimal path based on options & hardware)
auto compiled = CompiledModel::Create(*env, *model, *options);
// 4. Allocate input and output buffers
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
// 5. Write input data, execute inference, read output data
input_buffers[0].Write<float>(input_span); // This could be a zero-copy buffer from a camera stream
compiled->Run(input_buffers, output_buffers);
output_buffers[0].Read(output_span);
Whether your ultimate target is CPU, GPU, or a MediaTek NPU, this code structure remains consistent, significantly reducing conditional logic and platform-specific code in the application.
Frequently Asked Questions
Q1: What’s the difference between LiteRT NeuroPilot Accelerator and the old TensorFlow Lite with NeuroPilot Delegate?
A1: This is an architectural upgrade from a “plugin delegate” to a “first-class integration.” The old approach treated the NPU as an add-on delegate with a disjointed flow. The new solution integrates the NPU as a primary acceleration target within the LiteRT runtime, offering a unified CompiledModel API that supports both AOT and on-device compilation, resulting in a more cohesive developer experience and better performance scheduling.
Q2: Do I have to use AOT compilation to make my LLM app usable on phones?
A2: For LLMs like Gemma-3-270M or larger, AOT compilation is strongly recommended. Real-world measurements show that such models can take over 1 minute to compile on-device during the first run, severely impacting user experience. AOT compilation moves this time-consuming step upstream; users install a version already optimized for their chip, ready to run at high speed immediately.
Q3: What happens if a user’s phone NPU doesn’t support my model?
A3: The LiteRT runtime has built-in graceful fallback mechanisms. When you specify Accelerator.NPU in your code, if the target device’s NPU cannot execute the model for any reason (unsupported, insufficient memory, incompatible operators), LiteRT will automatically attempt to use the GPU. If the GPU is also unavailable, it will fall back to the CPU. This ensures the application remains functional on all devices, albeit with potential performance differences.
Q4: Which specific MediaTek chips are supported?
A4: Currently, SoCs explicitly supported in the official documentation include the Dimensity 7300, 8300, 9000, 9200, 9300, and 9400 series. This covers most mainstream mid-to-high-end 5G mobile platforms available today.
Q5: Can I run my own custom model, aside from the mentioned Gemma and Qwen models?
A5: Yes. The foundation of this stack is support for the standard .tflite format. You can convert your own model to this format and attempt to compile and deploy it via the LiteRT NeuroPilot workflow. However, whether the model gains full acceleration on the NPU depends on whether its operators are supported by the NeuroPilot NPU hardware and software stack. For the best experience, it’s advisable to reference the officially optimized and validated model architectures.
Key Takeaways
- Unified Abstraction Layer: LiteRT NeuroPilot Accelerator establishes a primary, deep integration between Google LiteRT and MediaTek NeuroPilot, shielding developers from NPU hardware fragmentation through a unified CompiledModel API.
- Production-Ready Model Support: The stack focuses on concrete open-weight models like Qwen3-0.6B, the Gemma-3 series, and EmbeddingGemma, providing validated, high-performance solutions for on-device text generation, multimodal understanding, and embedding tasks.
- AOT Compilation is Key: For the practical deployment of on-device LLMs, ahead-of-time compilation is indispensable. It turns minutes of user-side waiting into zero, making it effectively a required step for production LLM deployment.
- Performance Leap: On a Dimensity 9500-class NPU, complex models can achieve prefill speeds exceeding 1600 tokens/s, offering order-of-magnitude improvements over CPU and GPU, making real-time, multi-turn interactive AI assistants on phones a tangible reality.
- Modern Developer Experience: The new C++/Kotlin APIs, support for zero-copy memory transfers, and automatic fallback mechanisms combine to create a powerful yet easy-to-use development environment, allowing developers to focus on application logic rather than low-level hardware adaptation.
Through the LiteRT NeuroPilot Accelerator, Google and MediaTek are collaborating to bring on-device AI, particularly large language models, from labs and high-end prototypes into the palms of hundreds of millions of ordinary users. This is not just a demonstration of chip performance but a profound shift in software development paradigms, making the utilization of dedicated phone AI hardware simpler and more efficient than ever before.
