“Mixture-of-Experts only lives in the cloud?”
Liquid AI just proved that idea wrong with a Samsung Galaxy S24 Ultra and a 2-second local reply.

1. Opening scene – why this model matters

It is 1 a.m. and you are still polishing a slide deck. A pop-up asks:
“Summarise this 200-page English PDF into ten Chinese bullets, please.”
Old routine: copy → cloud assistant → wait → pay.
New routine: press “Run” on your phone; two seconds later the answer is there – no Internet, no fee, no data leakage.
The engine behind the new routine is LFM2-8B-A1B, Liquid AI’s first on-device Mixture-of-Experts (MoE) model. Only 1.5 billion of its 8.3 billion parameters wake up per token, yet its quality sits in the 3–4 B dense-model bracket and it outruns the popular Qwen3-1.7 B on a mobile CPU.

Below you will find three things, all extracted strictly from the files Liquid AI and Marktechpost released on 7–11 October 2025:

  1. How the sparse architecture keeps the phone cool.
  2. Real speed charts on two consumer devices.
  3. Copy-paste commands that start the model on a laptop, a GPU server or a phone within minutes.

No extra facts, no hype – just what the engineers published.


2. Eight billion weights in your pocket – the basic numbers

Item | Figure | What it means day-to-day
---|---|---
Total parameters | 8.3 B | ≈ 16.7 GB if you store every weight in float16
Active per token | 1.5 B | Roughly the compute of a 1.5 B dense model
Context length | 32 768 tokens | A 25-page paper fits in one shot
Vocabulary | 65 536 | Good for English, Chinese, code and six other languages
MoE block design | 32 experts, top-4 gated | 87.5 % of experts sleep while 4 work

Because most weights stay asleep, RAM bandwidth and battery drain grow with the 1.5 B active path, not with the 8.3 B total capacity.
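
A quick back-of-envelope check makes the point. The sketch below uses only the figures above plus the 4.7 GB Q4_0 file size from section 7.3; the bandwidth line is an illustration, not a published measurement.

# Rough per-token weight traffic: only the 1.5 B active path has to be read for each token.
total_params  = 8.3e9
active_params = 1.5e9
bytes_per_param = 4.7e9 / total_params                 # ≈ 0.57 bytes/weight in Q4_0

active_gb = active_params * bytes_per_param / 1e9      # ≈ 0.85 GB of weights touched per token
dense_gb  = total_params  * bytes_per_param / 1e9      # ≈ 4.7 GB if every weight were active

print(f"weights read per token: {active_gb:.2f} GB (sparse) vs {dense_gb:.2f} GB (dense)")
print(f"implied bandwidth at 14 tok/s: {active_gb * 14:.1f} GB/s")   # roughly within a phone's LPDDR budget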


3. Block diagram – where the experts sit

Figure 1: LFM2 architecture. MoE feed-forward blocks are inserted after the second layer; gated short convolutions and Grouped-Query Attention alternate in the "fast backbone".

  • Layers 0–1: ordinary dense feed-forward (keeps training stable)
  • Layers 2–23: sparse MoE feed-forward (32 experts each)
  • Router: normalised sigmoid gate + adaptive bias (prevents one expert hoarding tokens)

Only the four chosen experts receive the token: each applies its own up-proj / down-proj weights, the four outputs are summed, and the result proceeds to the next convolution-or-attention block.
Per-token FLOPs ≈ a 1.5 B model; representational head-room ≈ an 8 B model – that trade-off is the whole trick.
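
To make the routing concrete, here is a minimal PyTorch sketch of a top-4-of-32 MoE feed-forward block with a sigmoid gate and a load-balancing bias. The hidden sizes, activation and gate normalisation are illustrative assumptions, not Liquid AI's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    """Illustrative top-4-of-32 MoE feed-forward block – not Liquid AI's code."""
    def __init__(self, d_model=512, d_ff=1024, n_experts=32, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.router_bias = nn.Parameter(torch.zeros(n_experts))   # adaptive bias against expert hoarding
        self.up = nn.ModuleList([nn.Linear(d_model, d_ff) for _ in range(n_experts)])
        self.down = nn.ModuleList([nn.Linear(d_ff, d_model) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                                          # x: (tokens, d_model)
        scores = torch.sigmoid(self.router(x) + self.router_bias)  # sigmoid gate, one score per expert
        gate, idx = scores.topk(self.top_k, dim=-1)                # only 4 of 32 experts wake up
        gate = gate / gate.sum(dim=-1, keepdim=True)               # normalise the gate weights
        out = torch.zeros_like(x)
        for t, (g_row, e_row) in enumerate(zip(gate, idx)):
            for g, e in zip(g_row, e_row):                         # the other 28 experts stay idle
                h = F.gelu(self.up[int(e)](x[t]))
                out[t] += g * self.down[int(e)](h)
        return out

# Toy usage: route three tokens through the block (real hidden sizes are larger).
print(SparseMoEFFN()(torch.randn(3, 512)).shape)                   # torch.Size([3, 512])

A production kernel batches tokens per expert instead of looping one token at a time, but the arithmetic is identical: each token only ever touches four of the 32 expert weight matrices.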


4. Benchmark snapshot – can a sparse model really score high?

Liquid AI ran 16 public data sets with their internal evaluation library.
Numbers below are exactly what the company published; no rounding on our side.

Knowledge & instruction-following

Model | MMLU (5-shot) | MMLU-Pro | GPQA | IFEval | IFBench | Multi-IF
---|---|---|---|---|---|---
LFM2-8B-A1B | 64.84 | 37.42 | 29.29 | 77.58 | 25.85 | 58.19
Llama-3.2-3B-Instruct | 60.35 | 22.25 | 30.60 | 71.43 | 20.78 | 50.91
Qwen3-4B-Instruct-2507 | 72.25 | 52.31 | 34.85 | 85.62 | 30.28 | 75.54

Maths & multilingual

Model | GSM8K | GSM-Plus | MATH-500 | MATH-L5 | MGSM | MMMLU
---|---|---|---|---|---|---
LFM2-8B-A1B | 84.38 | 64.76 | 74.20 | 62.38 | 72.40 | 55.26
Llama-3.2-3B-Instruct | 75.21 | 38.68 | 41.20 | 24.06 | 61.68 | 47.92
Gemma-3-4B-it | 89.92 | 68.38 | 73.20 | 52.18 | 87.28 | 50.14

Coding & creative writing

Model | Active params | HumanEval+ | LCB-v6 | Creative-Writing-v3
---|---|---|---|---
LFM2-8B-A1B | 1.5 B | 69.51 % | 21.04 % | 44.22 %
Qwen3-1.7B (/no_think) | 1.7 B | 60.98 % | 24.07 % | 31.56 %
Llama-3.2-3B-Instruct | 3.2 B | 24.06 % | 11.47 % | 38.84 %

Take-away: instruction-following and maths sit near the top of the 1–2 B active-parameter class, code generation beats several 3 B dense models, and creative writing stays competitive with models that activate far more parameters.


5. Speed on real silicon – faster than Qwen3-1.7 B

Liquid used int4 weights + int8 dynamic activations on 16 CPU threads.
Below are their own bar charts translated into words.

Samsung Galaxy S24 Ultra (Snapdragon 8 Gen 3)

  • Input: 1 k token prompt, batch = 1
  • Decode speed: LFM2-8B-A1B ≈ 14.2 tok/s; Qwen3-1.7B ≈ 9.8 tok/s
  • That is a 1.45× gap in favour of the sparse 8 B model.

AMD Ryzen AI 9 HX370

  • Same prompt length
  • Decode speed: LFM2-8B-A1B ≈ 19.4 tok/s; Qwen3-1.7B ≈ 11.5 tok/s
  • Gap widens to 1.69×.

Figures 2 & 3: Decode throughput on the Galaxy S24 Ultra and the Ryzen AI 9 HX370 (higher bar is better). LFM2-8B-A1B stays above Qwen3-1.7B across token lengths.

Why it is quicker
Liquid wrote a CPU-specific MoE kernel inside their XNNPACK-based stack.
The kernel keeps the four active experts contiguous in memory, so the core’s vector unit spends more time in GEMM and less in pointer-chasing – a classic cache-friendly trick that finally made the jump from GPU papers to handset silicon.
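
As a toy illustration of the layout trick (NumPy, nothing to do with the actual XNNPACK kernel): gather the four selected experts' weight matrices into one contiguous buffer, then run a single batched multiply instead of four scattered ones.

import numpy as np

d_model, d_ff, n_experts = 256, 512, 32
experts = np.random.randn(n_experts, d_model, d_ff).astype(np.float32)   # all 32 up-proj matrices
x = np.random.randn(d_model).astype(np.float32)                          # one token's hidden state
chosen = np.array([3, 11, 19, 30])                                       # ids picked by the router

# Scattered: four separate matrix products, each jumping to a different region of the weight array.
scattered = np.stack([x @ experts[e] for e in chosen])

# Packed: fancy indexing copies the four matrices into one contiguous block,
# so a single fused pass streams through memory instead of pointer-chasing.
packed = experts[chosen]                         # shape (4, d_model, d_ff), contiguous copy
fused = np.einsum("d,edf->ef", x, packed)        # one batched multiply over the packed weights

assert np.allclose(fused, scattered, rtol=1e-3, atol=1e-3)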


6. Training recipe – 12 T tokens and length-normalised alignment

Pre-training mixture

  • 55 % English web + books
  • 25 % multilingual (Chinese, German, Japanese, Korean, Spanish, French, Arabic)
  • 20 % code (Python, JavaScript, C++, Go, Rust, …)
  • Total: ≈ 12 trillion tokens, BF16/FP8 mixed precision

Post-training

  1. Supervised fine-tune on 1 M conversations (50 % downstream tasks, 50 % open-domain).

  2. Direct Preference Optimisation with length normalisation
    General loss:

    L = ω·f(log σ(Δ)) + λ·g(Δ)   with   Δ = r_θ(x, y_w)/|y_w| − r_θ(x, y_l)/|y_l|
    

    Special cases recovered:

    • ω=1, λ=0 → length-normalised DPO (a code sketch of this case follows the list)
    • ω=0, λ=1 → length-normalised APO-zero
  3. Task-arithmetic merge of checkpoints trained under each objective.
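
For the ω=1, λ=0 case, a minimal PyTorch sketch of the length-normalised DPO term could look like this. The implicit-reward form r_θ = β·(log π_θ − log π_ref) and the β value are standard DPO conventions assumed here, not details from Liquid's report.

import torch
import torch.nn.functional as F

def length_normalised_dpo_loss(policy_logp_w, policy_logp_l,
                               ref_logp_w, ref_logp_l,
                               len_w, len_l, beta=0.1):
    """ω=1, λ=0 case: -log σ(Δ), with each implicit reward divided by its answer length."""
    r_w = beta * (policy_logp_w - ref_logp_w)        # implicit DPO reward of the chosen answer
    r_l = beta * (policy_logp_l - ref_logp_l)        # implicit DPO reward of the rejected answer
    delta = r_w / len_w - r_l / len_l                # Δ from the formula above
    return -F.logsigmoid(delta).mean()

# Dummy sequence-level log-probs and answer lengths:
loss = length_normalised_dpo_loss(
    policy_logp_w=torch.tensor([-45.0]), policy_logp_l=torch.tensor([-80.0]),
    ref_logp_w=torch.tensor([-50.0]),    ref_logp_l=torch.tensor([-75.0]),
    len_w=torch.tensor([30.0]),          len_l=torch.tensor([55.0]),
)
print(loss)   # lower when the chosen answer is relatively more likely per token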

Result: higher MMLU-Pro than the previous dense LFM2-2.6 B (+11.46 pts) and noticeably better HumanEval/LiveCodeBench scores – matching the claim that extra total capacity (8 B vs 2.6 B) soaks up more factual and coding knowledge even though the runtime path stays skinny.


7. Deployment – three proven paths

Commands are copied verbatim from the official repos; only the prompt was translated.

7.1 Hugging Face Transformers (laptop / desktop / server CPU & GPU)

# 1. Install dev snapshot that recognises "LFM2MoE" architecture
pip install git+https://github.com/huggingface/transformers.git@0c9a72e

# 2. Minimal generation script
python - <<'PY'
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "LiquidAI/LFM2-8B-A1B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto"        # GPU if present, else CPU
)

messages = [{"role": "user", "content": "Explain quantum entanglement in three sentences."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, do_sample=True, temperature=0.3, repetition_penalty=1.05, max_new_tokens=80)  # enable sampling so temperature takes effect
print(tok.decode(outputs[0], skip_special_tokens=False))
PY

Weight size:

  • bf16 full precision → 16.7 GB
  • Q4_0 GGUF → 4.7 GB (see section 7.3)

7.2 vLLM (single- or multi-GPU, production serving)

git clone https://github.com/vllm-project/vllm.git && cd vllm
pip install -e . -v                          # source build compiles the CUDA kernels (takes a while)

python - <<'PY'
from vllm import LLM, SamplingParams

llm = LLM(model="LiquidAI/LFM2-8B-A1B", dtype="bfloat16")
sp = SamplingParams(temperature=0.3, min_p=0.15, max_tokens=60)
prompt_set = [[{"role":"user","content":"Write hello-world in JSON"}]]
out = llm.chat(prompt_set, sp)
print(out[0].outputs[0].text)
PY

Tip: vLLM already ships a LoRA adapter interface, so you can plug in your own fine-tune within minutes.
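
If you want to try that, a rough sketch might look like the following; the adapter name and path are placeholders, and the exact LoRA flags can vary between vLLM versions, so check the docs for yours.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="LiquidAI/LFM2-8B-A1B", dtype="bfloat16",
          enable_lora=True, max_lora_rank=64)     # rank must cover your adapter (64 in section 8's settings)
sp = SamplingParams(temperature=0.3, min_p=0.15, max_tokens=60)

out = llm.chat(
    [[{"role": "user", "content": "Summarise today's field-service report."}]],
    sp,
    lora_request=LoRARequest("my-domain-lora", 1, "/path/to/adapter"),   # placeholder adapter
)
print(out[0].outputs[0].text)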

7.3 llama.cpp (edge favourite – works on phones, Raspberry Pi, M2 Mac)

# 1. Grab a fresh binary with lfm2moe support (b6709 or newer)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# 2. Download quantised weights
wget https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF/resolve/main/lfm2-8b-a1b.Q4_0.gguf

# 3. Run locally
./build/bin/llama-cli -m lfm2-8b-a1b.Q4_0.gguf \
            -p "List three benefits of on-device AI." \
            -n 40 --temp 0.3 -ngl 0      # -ngl 0 = CPU only

On an M2 Pro MacBook with Metal enabled the same binary reaches ≈ 38 tok/s in Q4_0; on the Galaxy S24 Ultra you will see ≈ 14 tok/s – both numbers come from Liquid's own screen recordings bundled in the release repo.
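
If you prefer to drive the same GGUF from Python, the llama-cpp-python binding works too, provided the wheel you install bundles a llama.cpp new enough (b6709+) to know the lfm2moe architecture. A minimal sketch:

from llama_cpp import Llama

llm = Llama(model_path="lfm2-8b-a1b.Q4_0.gguf",       # the file downloaded in step 2
            n_ctx=4096, n_threads=8, n_gpu_layers=0)  # CPU only, like -ngl 0 above
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three benefits of on-device AI."}],
    max_tokens=60, temperature=0.3,
)
print(out["choices"][0]["message"]["content"])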


8. Fine-tuning – make the model learn your jargon

Liquid AI explicitly encourages “narrow fine-tuning” because the small active set adapts quickly. Two ready-made notebooks exist:

Notebook | Task | Link
---|---|---
SFT (TRL) | LoRA supervised fine-tune | Google Colab
DPO (TRL) | Direct preference alignment | Google Colab

Typical hyper-parameters (from the notebook code cells; a minimal config sketch follows this list):

  • LoRA rank = 64, alpha = 128, dropout = 0.05
  • learning_rate = 2e-4, batch_size = 1, gradient_accumulation = 16
  • 1 × A100 80 GB → a 3-hour run finishes a 5 k-sample domain dataset with less than a 0.5 % increase over the original perplexity.
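
Under those settings, a minimal TRL + PEFT configuration could look like this; the dataset path, output directory and the all-linear target selection are assumptions, not copied from the notebook.

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(
    r=64, lora_alpha=128, lora_dropout=0.05,
    target_modules="all-linear",          # broad default; the notebook may pick specific projections
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="lfm2-8b-a1b-lora",        # placeholder output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
)

# Placeholder dataset: one JSON record per conversation, in the chat "messages" format.
dataset = load_dataset("json", data_files="my_domain_chats.jsonl", split="train")

trainer = SFTTrainer(
    model="LiquidAI/LFM2-8B-A1B",
    train_dataset=dataset,
    args=args,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model()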

9. FAQ – the questions early testers actually asked

Q1 Will the 4.7 GB Q4_0 file run on my 8 GB phone?
A1 Yes. Memory footprint during inference is ~5.2 GB (weights + 2 k context cache), leaving 2+ GB for the OS and your app.

Q2 Do I need root / jail-break?
A2 No. llama.cpp runs in user space; ExecuTorch integration (mentioned in the release) will bind to Android NNAPI without special permissions.

Q3 Is the licence business-friendly?
A3 The LFM Open License v1.0 allows commercial use; you only have to document any modification of the weights. A summary of the obligations is included in the GGUF repository.

Q4 How does it compare to GPT-4 or Llama-3-70 B?
A4 Those models are > 40 B active parameters and need data-centre GPUs. LFM2-8B-A1B targets private, latency-critical use-cases where cloud calls are impossible or undesirable – think medical note-taking, field technicians, or inflight translation.

Q5 Can I merge my LoRA back into the 8 B weights and ship one file?
A5 Yes. The TRL notebook shows the merge_and_unload() call; afterwards you can convert the merged checkpoint with llama.cpp's convert_hf_to_gguf.py and re-quantise it to obtain a single GGUF.
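
For reference, the merge-and-export step typically looks like this; the adapter and output paths are placeholders, not taken from the notebook.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-8B-A1B", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "my-lfm2-lora").merge_and_unload()   # fold LoRA deltas into the base weights
merged.save_pretrained("lfm2-8b-a1b-merged")                                  # placeholder output folder
AutoTokenizer.from_pretrained("LiquidAI/LFM2-8B-A1B").save_pretrained("lfm2-8b-a1b-merged")
# The merged folder can then be converted to GGUF with llama.cpp's converter and re-quantised.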


10. Take-away – sparse is no longer a lab toy

For years MoE papers ended with “…and when we scale to 1 T parameters”. Liquid AI flipped the script: small active path, still-big total capacity, mobile-CPU viable. The release artefacts – the permissive LFM Open License v1.0, GGUF quants, a vLLM back-end, TRL notebooks – mean you can treat LFM2-8B-A1B like any mainstream dense model, but gain the speed and privacy edge of sparsity.

If your roadmap includes:

  • An offline copilot inside a CAD tool,
  • A multilingual voice note recorder for journalists,
  • Or a RAG plug-in that must never leak source documents,

then cloning the repo, downloading 4.7 GB and typing ./llama-cli is now the fastest path to a 3 B-class brain that fits in a backpack and runs without a subscription.

Sparse experts have left the data-centre. Time to let them loose on the edge.