Pocket-Sized Powerhouse: Liquid AI Launches LFM2, the Fastest On-Device Generative Model You Can Actually Run Today

(Figure: Performance overview of LFM2)

If you have ever tried to run a large language model on your laptop, you probably faced three headaches:

  1. The model is huge—several gigabytes before you even start chatting.
  2. RAM usage shoots up and the cooling fan sounds like a jet engine.
  3. Each new word appears slowly, one… token… at… a… time.

Liquid AI’s new LFM2 (Liquid Foundation Models v2) is built to solve exactly these problems:

  • 350 M to 1.2 B parameters, small enough for a phone.
  • 2× faster inference on a laptop CPU compared with Qwen3 of similar size.
  • 3× cheaper to train than Liquid’s previous generation.
  • Open weights, with a commercial license for smaller companies.

Below you will find the plain-English story of what LFM2 is, how it works, and how you can run it on your own hardware today—without a data-center in sight.


Table of Contents

  1. What is LFM2?
  2. Why is it so fast?
  3. Benchmark results: scores, chat quality, and throughput
  4. Under the hood: Liquid networks + hybrid convolution-attention blocks
  5. Training at a glance: 10 T tokens, distillation, and alignment
  6. Quick-start guide: running LFM2 on your phone, laptop, or car
  7. License and commercial boundaries
  8. FAQ: 12 common questions answered

1. What is LFM2?

In one sentence: LFM2 is a family of small, production-grade generative models designed to run entirely on local hardware—phones, tablets, laptops, or embedded boards.

Model | Parameters | Context length | Typical use cases
LFM2-350M | 0.35 B | 32 k tokens | Smartwatch, earbuds, ultra-low-power IoT
LFM2-700M | 0.7 B | 32 k tokens | Smartphone assistant, on-device search
LFM2-1.2B | 1.2 B | 32 k tokens | Laptop copilot, in-car voice interface

A 7 B version is on the roadmap but has not been released yet.


2. Why is it so fast?

2.1 A new hybrid architecture

Most small models choose either pure attention (memory-hungry) or pure convolution (weak at long-range reasoning).
LFM2 uses 16 blocks:

  • 10 blocks are double-gated short convolutions that fit neatly into CPU vector units or mobile NPUs.
  • 6 blocks are grouped-query attention (GQA) for global context when needed.

2.2 Algorithmic tweaks

  • Short 3–5 tap convolutions keep the compute budget linear and cache-friendly.
  • Dual gating (input-dependent multiplicative gates) acts like a lightweight LSTM, trimming irrelevant signals early (see the sketch after this list).
  • Grouped KV heads cut memory traffic by 4–8× compared with standard multi-head attention.
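
To make the block structure concrete, here is a minimal PyTorch sketch of a double-gated short-convolution block. The hidden size, kernel length, sigmoid gates, and residual connection are illustrative assumptions on our part; Liquid AI has not published this exact code, so read it as a shape-level illustration rather than the real implementation.

# Minimal sketch of one double-gated short-convolution block (hidden size,
# kernel length, and gate activations are assumptions, not Liquid AI's code).
import torch
import torch.nn as nn

class GatedShortConvBlock(nn.Module):
    def __init__(self, dim=1024, kernel_size=3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 3 * dim)           # value + two input-dependent gates
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              groups=dim,                # depthwise: cheap on CPU/NPU vector units
                              padding=kernel_size - 1)   # left padding keeps it causal
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (batch, seq, dim)
        b_gate, c_gate, v = self.in_proj(x).chunk(3, dim=-1)
        v = torch.sigmoid(b_gate) * v                    # first gate trims the signal early
        v = self.conv(v.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # short causal conv
        v = torch.sigmoid(c_gate) * v                    # second gate on the conv output
        return x + self.out_proj(v)                      # residual connection (assumption)

block = GatedShortConvBlock()
print(block(torch.randn(2, 16, 1024)).shape)             # torch.Size([2, 16, 1024])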

2.3 Tool-chain integration

Liquid AI tested the same weights in both ExecuTorch (PyTorch mobile) and llama.cpp (C++ inference). Quantization recipes were tuned for each backend:

Backend | Quantization scheme | Typical size (1.2 B model)
ExecuTorch | 8da4w | ≈ 1.2 GB
llama.cpp | Q4_0 | ≈ 750 MB

Real numbers on Snapdragon 8 Gen 3 (Samsung Galaxy S24 Ultra):

  • 21 tokens/sec generation (LFM2-700M) vs 12 tokens/sec (Qwen3-0.6B).
  • 1.8× faster prompt processing (prefill) for the same prompt length.

3. Benchmark results

3.1 Academic scores

Liquid AI evaluated LFM2 on seven public benchmarks covering knowledge, math, instruction following, and multilingual tasks; four of them are shown below.

Benchmark (higher is better) | LFM2-350M | LFM2-700M | LFM2-1.2B | Qwen3-0.6B | Qwen3-1.7B | Llama-3.2-1B
MMLU (5-shot) | 43.4 | 49.9 | 55.2 | 44.9 | 59.1 | 46.6
GSM8K (0-shot) | 30.1 | 46.4 | 58.3 | 36.5 | 51.4 | 35.7
IFEval | 65.1 | 72.2 | 74.9 | 64.2 | 74.0 | 52.4
MMMLU (5-shot, multilingual) | 38.0 | 43.3 | 46.7 | 30.8 | 46.5 | 38.2

Key takeaway: on average, LFM2-1.2B matches Qwen3-1.7B, a model 47 % larger, while LFM2-350M already competes with Qwen3-0.6B despite being 42 % smaller.

3.2 Human preference

Liquid AI sampled 1 000 real conversations from the WildChat dataset and asked five LLM judges to pick the better answer.

Pairwise comparison | LFM2-1.2B win rate
vs Llama-3.2-1B-Instruct | 68 %
vs Gemma-3-1B-it | 64 %
vs Qwen3-1.7B | ~50 % (tie)

4. Under the hood

4.1 From Liquid Time-Constant networks to LIV operators

In 2018 the Liquid team introduced LTCs—continuous-time neural ODEs whose state evolves at input-dependent speeds.
Discrete LTCs can be viewed as Linear Input-Varying (LIV) operators where the convolution weights are generated on-the-fly from the input itself.
This single idea unifies convolutions, recurrences, and attention under one algebraic form, making neural architecture search (NAS) easier.
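
A toy way to see the unification: write every block as y = W(x)·x, where the mixing matrix W may depend on the input (attention) or be fixed and banded (a plain short convolution). The snippet below is purely illustrative and is not Liquid AI's formulation; it just shows both operators sharing the same algebraic form.

# Toy illustration: attention and a short causal convolution expressed as the
# same operation y = W · x, differing only in how W is built.
import torch

def attention_as_liv(x):                        # x: (seq, dim)
    scores = x @ x.T / x.size(-1) ** 0.5        # W(x) generated from the input itself
    w = torch.softmax(scores, dim=-1)
    return w @ x

def short_conv_as_liv(x, taps=3):
    seq = x.size(0)
    w = torch.zeros(seq, seq)                   # fixed banded W: a 3-tap causal average
    for i in range(seq):
        for j in range(max(0, i - taps + 1), i + 1):
            w[i, j] = 1.0 / taps
    return w @ x

x = torch.randn(8, 16)
print(attention_as_liv(x).shape, short_conv_as_liv(x).shape)   # both (8, 16)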

4.2 STAR architecture search

Liquid’s STAR engine searched for the best mix of LIV convolutions and GQA blocks under three constraints:

  1. Real validation loss on 50+ internal tasks (not just perplexity).
  2. Peak RAM and prefill + decode latency measured on a Snapdragon CPU.
  3. No custom kernels—only operations already accelerated on mobile SoCs.

The final blueprint is the 16-block hybrid described earlier.


5. Training snapshot

5.1 Pre-training corpus

  • 10 T tokens from web crawl, licensed books, and code.
  • Language mix: 75 % English, 20 % multilingual (Arabic, French, German, Spanish, Japanese, Korean, Chinese), 5 % code.
  • Context length: progressively stretched to 32 k.

5.2 Knowledge distillation

Liquid used its own LFM1-7B as a teacher for the entire run.
The student (LFM2) was trained to match the teacher’s logits, stabilizing convergence and boosting downstream quality.
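
For readers who want the mechanics, a common way to implement logit distillation is a temperature-scaled KL term mixed with the usual cross-entropy loss. The sketch below uses generic tensors and hyperparameters of our own choosing; Liquid AI has not published its exact loss.

# Sketch of a standard logit-distillation loss (temperature and mixing weight
# are illustrative assumptions, not Liquid AI's published recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T         # match the teacher's distribution
    hard = F.cross_entropy(student_logits, targets)        # ordinary next-token loss
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 32000)            # student logits over an example vocab
teacher = torch.randn(4, 32000)            # teacher (e.g. a 7 B model) logits
labels = torch.randint(0, 32000, (4,))
print(distillation_loss(student, teacher, labels))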

5.3 Alignment pipeline

  1. Supervised Fine-Tuning (SFT) on a curated mix of open-source and synthetic instruction data.
  2. Direct Preference Optimization (DPO) with length normalization (a toy sketch of the objective follows this list).

    • Semi-online rollouts → LLM judges rank → preference pairs.
  3. Checkpoint ensembling: several DPO runs were merged to balance helpfulness and conciseness.
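
As a reference for step 2 above, here is a toy version of a length-normalized DPO objective: sequence log-probabilities are divided by token counts before the usual preference margin is computed. The normalization scheme and beta value are assumptions, not necessarily Liquid AI's exact recipe.

# Toy length-normalized DPO loss (normalization scheme and beta are assumptions).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
             len_chosen, len_rejected, beta=0.1):
    # normalize sequence log-probs by length so verbose answers are not favored
    margin = beta * ((logp_chosen / len_chosen - ref_chosen / len_chosen)
                     - (logp_rejected / len_rejected - ref_rejected / len_rejected))
    return -F.logsigmoid(margin).mean()

# dummy sequence log-probabilities (sums of token log-probs) and lengths
print(dpo_loss(torch.tensor([-40.0]), torch.tensor([-55.0]),
               torch.tensor([-42.0]), torch.tensor([-50.0]),
               torch.tensor([20.0]),  torch.tensor([25.0])))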

6. Quick-start guide

6.1 Download the weights

  • Hugging Face: LiquidAI/LFM2-1.2B (a quick Python sanity check follows this list)
  • File sizes:

    • fp16: 2.4 GB
    • Q4_0: 0.75 GB
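
Before converting anything, you can sanity-check the download directly in Python, assuming your installed transformers version already supports the LFM2 architecture; the repo id matches the one listed above and the generation settings are arbitrary.

# Quick local sanity check with Hugging Face transformers (requires a version
# that supports the LFM2 architecture; settings are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-1.2B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "Explain quantum entanglement in simple terms."
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))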

6.2 Run with llama.cpp (macOS / Linux / Windows WSL)

# 1. Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j

# 2. Convert the downloaded Hugging Face checkpoint to GGUF (fp16), then quantize to Q4_0
python3 convert_hf_to_gguf.py ../lfm2-1.2b --outtype f16 --outfile lfm2-1.2b-f16.gguf
./build/bin/llama-quantize lfm2-1.2b-f16.gguf lfm2-1.2b-q4_0.gguf Q4_0

# 3. Chat
./build/bin/llama-cli -m lfm2-1.2b-q4_0.gguf -p "Explain quantum entanglement in simple terms." -n 200

6.3 Android demo with ExecuTorch

  1. Install Android Studio and NDK.
  2. Import the official llama_demo sample.
  3. Copy lfm2-700m-8da4w.pte into app/src/main/assets/.
  4. Build and deploy to any recent Android phone (API 28+).

Expected performance on Galaxy S24 Ultra:

  • 700 M model, 8-bit: ~18 tokens/sec.

7. License and commercial use

Use case | License needed | Notes
Academic / personal / research | ✅ Free | Apache 2.0-style permissive license
Commercial, revenue < USD 10 M per year | ✅ Free | No additional paperwork
Commercial, revenue ≥ USD 10 M per year | ❗ Contact Liquid | sales@liquid.ai for enterprise terms

8. FAQ: Twelve quick answers

Q1. How is LFM2 different from LFM1?
A1. Completely new architecture, 3× faster training, 2× faster inference, and the first small-size checkpoints released under an open license.

Q2. Can a Raspberry Pi 4 run it?
A2. Yes, the 350 M Q4_0 model is ~500 MB and runs at 3–4 tokens/sec on a 1.5 GHz Cortex-A72—usable for demos.

Q3. Does it handle Chinese?
A3. Yes, Chinese is in the multilingual corpus; the 1.2 B model answers fluently.

Q4. Can it write code?
A4. Basic Python and C++ completion is possible, but it is not specialized like CodeLlama.

Q5. Will larger sizes (3 B, 7 B) be released?
A5. A 7 B model is in training and expected in Q4 2025.

Q6. VRAM requirements?
A6.

  • fp16: parameter count × 2 bytes
  • 4-bit: parameter count × 0.5 bytes
    Example: 1.2 B Q4_0 ≈ 600 MB VRAM.
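
The same arithmetic in a few lines of Python (weights only; KV cache and runtime overhead come on top):

# Back-of-the-envelope weight memory: parameters × bytes per weight.
def weight_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_gb(1.2, 16))   # fp16  -> ≈ 2.4 GB
print(weight_gb(1.2, 4))    # 4-bit -> ≈ 0.6 GB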

Q7. Fine-tuning with LoRA?
A7. Yes. Load the model via transformers and apply PEFT/TRL scripts provided in the official repo.
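
As a rough starting point, a LoRA setup with PEFT might look like the sketch below; the repo id and target_modules choice are assumptions on our side, so check the official scripts for the exact settings.

# Hedged LoRA fine-tuning sketch with Hugging Face PEFT (repo id and
# target_modules are assumptions; see the official repo for exact settings).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-1.2B")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules="all-linear",   # adapt every linear layer (assumption)
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()               # only a small fraction of weights train
# ...then train with TRL's SFTTrainer or a standard Trainer loop.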

Q8. Compared with MobileVLM or Gemma-3-1B?
A8. LFM2 shows the lowest CPU latency and better Chinese performance at the same parameter count.

Q9. Do I need internet?
A9. No. Once downloaded, the model is fully offline.

Q10. Privacy concerns?
A10. All inference happens locally; no data is sent to Liquid AI or third parties.

Q11. Can it run in a web browser?
A11. llama.cpp has a WebGPU build; community demos load the 350 M model in ~10 s over fast Wi-Fi.

Q12. How does it stack up against GPT-4o-mini?
A12. Cloud models still lead on raw capability, but LFM2 wins on offline, low-cost, low-latency use cases.


Closing thoughts

Moving generative AI from the cloud to the edge is less about “beating OpenAI” and more about giving every device millisecond-level, privacy-first intelligence.
With 350 M, 700 M, and 1.2 B checkpoints, LFM2 offers a practical path for:

  • Hardware makers to embed AI inside earphones, watches, and cars.
  • SMBs to run customer support bots on-prem without data-leak risk.
  • Developers to fine-tune a personal assistant for less than a hundred dollars.

If you have been waiting for a truly pocket-sized large language model, download the weights from Hugging Face today and see how it feels to chat with an AI that never phones home.