Pocket-Sized Powerhouse: Liquid AI Launches LFM2, the Fastest On-Device Generative Model You Can Actually Run Today

(Figure: Performance overview of LFM2)

If you have ever tried to run a large language model on your laptop, you probably faced three headaches:

  1. The model is huge—several gigabytes before you even start chatting.
  2. RAM usage shoots up and the cooling fan sounds like a jet engine.
  3. Each new word appears slowly, one… token… at… a… time.

Liquid AI’s new LFM2 (Liquid Foundation Models v2) is built to solve exactly these problems:

  • 350 M to 1.2 B parameters, small enough for a phone.
  • 2× faster inference on a laptop CPU compared with Qwen3 of similar size.
  • 3× cheaper to train than Liquid’s previous generation.
  • Open weights, with a commercial license for smaller companies.

Below you will find the plain-English story of what LFM2 is, how it works, and how you can run it on your own hardware today—without a data-center in sight.


Table of Contents

  1. What is LFM2?
  2. Why is it so fast?
  3. Benchmark results: scores, chat quality, and throughput
  4. Under the hood: Liquid networks + hybrid convolution-attention blocks
  5. Training at a glance: 10 T tokens, distillation, and alignment
  6. Quick-start guide: running LFM2 on your phone, laptop, or car
  7. License and commercial boundaries
  8. FAQ: 12 common questions answered

1. What is LFM2?

In one sentence: LFM2 is a family of small, production-grade generative models designed to run entirely on local hardware—phones, tablets, laptops, or embedded boards.

Model | Parameters | Context length | Typical use cases
LFM2-350M | 0.35 B | 32 k tokens | Smartwatch, earbuds, ultra-low-power IoT
LFM2-700M | 0.7 B | 32 k tokens | Smartphone assistant, on-device search
LFM2-1.2B | 1.2 B | 32 k tokens | Laptop copilot, in-car voice interface

A 7 B version is on the roadmap but has not been released yet.


2. Why is it so fast?

2.1 A new hybrid architecture

Most small models choose either pure attention (memory-hungry) or pure convolution (weak at long-range reasoning).
LFM2 uses 16 blocks:

  • 10 blocks are double-gated short convolutions that fit neatly into CPU vector units or mobile NPUs.
  • 6 blocks are grouped-query attention (GQA) for global context when needed.

2.2 Algorithmic tweaks

  • Short 3–5 tap convolutions keep the compute budget linear and cache-friendly.
  • Dual gating (input-dependent multiplicative gates) acts like a lightweight LSTM, trimming irrelevant signals early (see the sketch after this list).
  • Grouped KV heads cut memory traffic by 4–8× compared with standard multi-head attention.
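
To make the block structure concrete, here is a minimal PyTorch sketch of a double-gated short-convolution block. The hidden size, kernel length, sigmoid gates, and residual connection are illustrative assumptions on our part; Liquid AI has not published this exact code, so read it as a shape-level illustration rather than the real implementation.

# Minimal sketch of one double-gated short-convolution block (hidden size,
# kernel length, and gate activations are assumptions, not Liquid AI's code).
import torch
import torch.nn as nn

class GatedShortConvBlock(nn.Module):
    def __init__(self, dim=1024, kernel_size=3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 3 * dim)           # value + two input-dependent gates
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              groups=dim,                # depthwise: cheap on CPU/NPU vector units
                              padding=kernel_size - 1)   # left padding keeps it causal
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (batch, seq, dim)
        b_gate, c_gate, v = self.in_proj(x).chunk(3, dim=-1)
        v = torch.sigmoid(b_gate) * v                    # first gate trims the signal early
        v = self.conv(v.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # short causal conv
        v = torch.sigmoid(c_gate) * v                    # second gate on the conv output
        return x + self.out_proj(v)                      # residual connection (assumption)

block = GatedShortConvBlock()
print(block(torch.randn(2, 16, 1024)).shape)             # torch.Size([2, 16, 1024])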

2.3 Tool-chain integration

Liquid AI tested the same weights in both ExecuTorch (PyTorch mobile) and llama.cpp (C++ inference). Quantization recipes were tuned for each backend:

Backend | Quantization scheme | Typical size (1.2 B model)
ExecuTorch | 8da4w | ≈ 1.2 GB
llama.cpp | Q4_0 | ≈ 750 MB

Real numbers on Snapdragon 8 Gen 3 (Samsung Galaxy S24 Ultra):

  • 21 tokens/sec generation (LFM2-700M) vs 12 tokens/sec (Qwen3-0.6B).
  • 1.8× faster prompt processing (prefill) for the same prompt length.

3. Benchmark results

3.1 Academic scores

Liquid AI evaluated LFM2 on seven public benchmarks covering knowledge, math, instruction following, and multilingual tasks; four of them are shown below.

Benchmark (higher is better) | LFM2-350M | LFM2-700M | LFM2-1.2B | Qwen3-0.6B | Qwen3-1.7B | Llama-3.2-1B
MMLU (5-shot) | 43.4 | 49.9 | 55.2 | 44.9 | 59.1 | 46.6
GSM8K (0-shot) | 30.1 | 46.4 | 58.3 | 36.5 | 51.4 | 35.7
IFEval | 65.1 | 72.2 | 74.9 | 64.2 | 74.0 | 52.4
MMMLU (5-shot, multilingual) | 38.0 | 43.3 | 46.7 | 30.8 | 46.5 | 38.2

Key takeaway: on average, LFM2-1.2B matches Qwen3-1.7B, a model 47 % larger, while LFM2-350M already competes with Qwen3-0.6B despite being 42 % smaller.

3.2 Human preference

Liquid AI sampled 1 000 real conversations from the WildChat dataset and asked five LLM judges to pick the better answer.

Pairwise comparison | LFM2-1.2B win rate
vs Llama-3.2-1B-Instruct | 68 %
vs Gemma-3-1B-it | 64 %
vs Qwen3-1.7B | ~50 % (tie)

4. Under the hood

4.1 From Liquid Time-Constant networks to LIV operators

In 2018 the Liquid team introduced LTCs—continuous-time neural ODEs whose state evolves at input-dependent speeds.
Discrete LTCs can be viewed as Linear Input-Varying (LIV) operators where the convolution weights are generated on-the-fly from the input itself.
This single idea unifies convolutions, recurrences, and attention under one algebraic form, making neural architecture search (NAS) easier.
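
A toy way to see the unification: write every block as y = W(x)·x, where the mixing matrix W may depend on the input (attention) or be fixed and banded (a plain short convolution). The snippet below is purely illustrative and is not Liquid AI's formulation; it just shows both operators sharing the same algebraic form.

# Toy illustration: attention and a short causal convolution expressed as the
# same operation y = W · x, differing only in how W is built.
import torch

def attention_as_liv(x):                        # x: (seq, dim)
    scores = x @ x.T / x.size(-1) ** 0.5        # W(x) generated from the input itself
    w = torch.softmax(scores, dim=-1)
    return w @ x

def short_conv_as_liv(x, taps=3):
    seq = x.size(0)
    w = torch.zeros(seq, seq)                   # fixed banded W: a 3-tap causal average
    for i in range(seq):
        for j in range(max(0, i - taps + 1), i + 1):
            w[i, j] = 1.0 / taps
    return w @ x

x = torch.randn(8, 16)
print(attention_as_liv(x).shape, short_conv_as_liv(x).shape)   # both (8, 16)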

4.2 STAR architecture search

Liquid’s STAR engine searched for the best mix of LIV convolutions and GQA blocks under three constraints:

  1. Real validation loss on 50+ internal tasks (not just perplexity).
  2. Peak RAM and prefill + decode latency measured on a Snapdragon CPU.
  3. No custom kernels—only operations already accelerated on mobile SoCs.

The final blueprint is the 16-block hybrid described earlier.


5. Training snapshot

5.1 Pre-training corpus

  • 10 T tokens from web crawl, licensed books, and code.
  • Language mix: 75 % English, 20 % multilingual (Arabic, French, German, Spanish, Japanese, Korean, Chinese), 5 % code.
  • Context length: progressively stretched to 32 k.

5.2 Knowledge distillation

Liquid used its own LFM1-7B as a teacher for the entire run.
The student (LFM2) was trained to match the teacher’s logits, stabilizing convergence and boosting downstream quality.
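
For readers who want the mechanics, a common way to implement logit distillation is a temperature-scaled KL term mixed with the usual cross-entropy loss. The sketch below uses generic tensors and hyperparameters of our own choosing; Liquid AI has not published its exact loss.

# Sketch of a standard logit-distillation loss (temperature and mixing weight
# are illustrative assumptions, not Liquid AI's published recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T         # match the teacher's distribution
    hard = F.cross_entropy(student_logits, targets)        # ordinary next-token loss
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 32000)            # student logits over an example vocab
teacher = torch.randn(4, 32000)            # teacher (e.g. a 7 B model) logits
labels = torch.randint(0, 32000, (4,))
print(distillation_loss(student, teacher, labels))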

5.3 Alignment pipeline

  1. Supervised Fine-Tuning (SFT) on a curated mix of open-source and synthetic instruction data.
  2. Direct Preference Optimization (DPO) with length normalization (a toy sketch of the objective follows this list).

    • Semi-online rollouts → LLM judges rank → preference pairs.
  3. Checkpoint ensembling: several DPO runs were merged to balance helpfulness and conciseness.
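
As a reference for step 2 above, here is a toy version of a length-normalized DPO objective: sequence log-probabilities are divided by token counts before the usual preference margin is computed. The normalization scheme and beta value are assumptions, not necessarily Liquid AI's exact recipe.

# Toy length-normalized DPO loss (normalization scheme and beta are assumptions).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
             len_chosen, len_rejected, beta=0.1):
    # normalize sequence log-probs by length so verbose answers are not favored
    margin = beta * ((logp_chosen / len_chosen - ref_chosen / len_chosen)
                     - (logp_rejected / len_rejected - ref_rejected / len_rejected))
    return -F.logsigmoid(margin).mean()

# dummy sequence log-probabilities (sums of token log-probs) and lengths
print(dpo_loss(torch.tensor([-40.0]), torch.tensor([-55.0]),
               torch.tensor([-42.0]), torch.tensor([-50.0]),
               torch.tensor([20.0]),  torch.tensor([25.0])))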

6. Quick-start guide

6.1 Download the weights

  • Hugging Face: LiquidAI/LFM2-1.2B (a quick Python sanity check follows this list)
  • File sizes:

    • fp16: 2.4 GB
    • Q4_0: 0.75 GB
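
Before converting anything, you can sanity-check the download directly in Python, assuming your installed transformers version already supports the LFM2 architecture; the repo id matches the one listed above and the generation settings are arbitrary.

# Quick local sanity check with Hugging Face transformers (requires a version
# that supports the LFM2 architecture; settings are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-1.2B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "Explain quantum entanglement in simple terms."
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))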

6.2 Run with llama.cpp (macOS / Linux / Windows WSL)

# 1. Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j

# 2. Convert the downloaded Hugging Face checkpoint to GGUF (fp16), then quantize to Q4_0
python3 convert_hf_to_gguf.py ../lfm2-1.2b --outtype f16 --outfile lfm2-1.2b-f16.gguf
./build/bin/llama-quantize lfm2-1.2b-f16.gguf lfm2-1.2b-q4_0.gguf Q4_0

# 3. Chat
./build/bin/llama-cli -m lfm2-1.2b-q4_0.gguf -p "Explain quantum entanglement in simple terms." -n 200

6.3 Android demo with ExecuTorch

  1. Install Android Studio and NDK.
  2. Import the official llama_demo sample.
  3. Copy lfm2-700m-8da4w.pte into app/src/main/assets/.
  4. Build and deploy to any recent Android phone (API 28+).

Expected performance on Galaxy S24 Ultra:

  • 700 M model, 8-bit: ~18 tokens/sec.

7. License and commercial use

Use case | License needed | Notes
Academic / personal / research | ✅ Free | Apache 2.0-style permissive license
Commercial, revenue < USD 10 M per year | ✅ Free | No additional paperwork
Commercial, revenue ≥ USD 10 M per year | ❗ Contact Liquid | sales@liquid.ai for enterprise terms

8. FAQ: Twelve quick answers

Q1. How is LFM2 different from LFM1?
A1. Completely new architecture, 3× faster training, 2× faster inference, and the first small-size checkpoints released under an open license.

Q2. Can a Raspberry Pi 4 run it?
A2. Yes, the 350 M Q4_0 model is ~500 MB and runs at 3–4 tokens/sec on a 1.5 GHz Cortex-A72—usable for demos.

Q3. Does it handle Chinese?
A3. Yes, Chinese is in the multilingual corpus; the 1.2 B model answers fluently.

Q4. Can it write code?
A4. Basic Python and C++ completion is possible, but it is not specialized like CodeLlama.

Q5. Will larger sizes (3 B, 7 B) be released?
A5. A 7 B model is in training and expected in Q4 2025.

Q6. VRAM requirements?
A6.

  • fp16: parameter count × 2 bytes
  • 4-bit: parameter count × 0.5 bytes
    Example: 1.2 B Q4_0 ≈ 600 MB VRAM.
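
The same arithmetic in a few lines of Python (weights only; KV cache and runtime overhead come on top):

# Back-of-the-envelope weight memory: parameters × bytes per weight.
def weight_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_gb(1.2, 16))   # fp16  -> ≈ 2.4 GB
print(weight_gb(1.2, 4))    # 4-bit -> ≈ 0.6 GB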

Q7. Fine-tuning with LoRA?
A7. Yes. Load the model via transformers and apply PEFT/TRL scripts provided in the official repo.
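
As a rough starting point, a LoRA setup with PEFT might look like the sketch below; the repo id and target_modules choice are assumptions on our side, so check the official scripts for the exact settings.

# Hedged LoRA fine-tuning sketch with Hugging Face PEFT (repo id and
# target_modules are assumptions; see the official repo for exact settings).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-1.2B")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules="all-linear",   # adapt every linear layer (assumption)
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()               # only a small fraction of weights train
# ...then train with TRL's SFTTrainer or a standard Trainer loop.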

Q8. Compared with MobileVLM or Gemma-3-1B?
A8. LFM2 shows the lowest CPU latency and better Chinese performance at the same parameter count.

Q9. Do I need internet?
A9. No. Once downloaded, the model is fully offline.

Q10. Privacy concerns?
A10. All inference happens locally; no data is sent to Liquid AI or third parties.

Q11. Can it run in a web browser?
A11. llama.cpp has a WebGPU build; community demos load the 350 M model in ~10 s over fast Wi-Fi.

Q12. How does it stack up against GPT-4o-mini?
A12. Cloud models still lead on raw capability, but LFM2 wins on offline, low-cost, low-latency use cases.


Closing thoughts

Moving generative AI from the cloud to the edge is less about “beating OpenAI” and more about giving every device millisecond-level, privacy-first intelligence.
With 350 M, 700 M, and 1.2 B checkpoints, LFM2 offers a practical path for:

  • Hardware makers to embed AI inside earphones, watches, and cars.
  • SMBs to run customer support bots on-prem without data-leak risk.
  • Developers to fine-tune a personal assistant for less than a hundred dollars.

If you have been waiting for a truly pocket-sized large language model, download the weights from Hugging Face today and see how it feels to chat with an AI that never phones home.