The 2025 Landscape of Open-Weight Large Language Models: A Plain-English Tour from DeepSeek-V3 to Kimi 2
“Seven years after the first GPT paper, are we still stacking the same Lego blocks?”
“Which model can I actually run on a single RTX 4090?”
“What do MoE, MLA, NoPE, and QK-Norm mean for my weekend side-project?”
This article answers those questions in plain language. Every fact, number, and code snippet comes from the official papers or repositories of the eight model families discussed—no outside sources, no hype.
Table of Contents
1. Why Architecture Still Matters in 2025
2. One Map, Eight Models
3. Model-by-Model Walk-Through
   3.1 DeepSeek-V3 / R1 – MLA + MoE for Memory-Smart Serving
   3.2 OLMo 2 – Moving the Layer-Norm Around
   3.3 Gemma 3 – Sliding-Window Attention and a “Sandwich” of Norms
   3.4 Gemma 3n – Running 4 B on a Phone
   3.5 Mistral Small 3.1 – Back to Vanilla for Latency
   3.6 Llama 4 – A Classic Flavor of MoE
   3.7 Qwen3 – Dense and Sparse Lines for Every Appetite
   3.8 SmolLM3 – A 3 B Model That Drops Positional Embeddings
   3.9 Kimi 2 – A 1 T Parameter DeepSeek-V3 Clone
4. Quick-Look Decision Table
5. Developer FAQ – 10 Real-World Questions
6. Key Takeaways and Next Steps
1. Why Architecture Still Matters in 2025
In 2018, the original GPT stacked Transformer blocks and stunned the world.
In 2025, DeepSeek-V3, Llama 4, and Gemma 3 still stack Transformer blocks—but the details have quietly changed:
- KV-cache compression (MLA) shrinks memory by 50-70 %.
- Mixture-of-Experts (MoE) lets a 671 B model run with only 37 B active weights.
- Sliding-window attention cuts context memory by 5× on consumer GPUs.
- NoPE (No Positional Embedding) removes positional encodings entirely in some layers.
These tweaks decide whether a model fits on your laptop, your phone, or just your cloud budget.
2. One Map, Eight Models
Model | Stand-out Trick | Total Params | Active Params | Sweet-Spot Use Case |
---|---|---|---|---|
DeepSeek-V3 | MLA + MoE + Shared Expert | 671 B | 37 B | High-throughput APIs |
OLMo 2 | Post-Norm + QK-Norm + MHA | 7 B / 13 B | 7 B / 13 B | Reproducible research |
Gemma 3 | Sliding Window + Dual Norm + GQA | 27 B | 27 B | Single-GPU local chat |
Gemma 3n | PLE + MatFormer + Sliding Window | 4 B | 4 B | On-device mobile |
Mistral Small 3.1 | Vanilla GQA + Slim Layers | 24 B | 24 B | Low-latency serving |
Llama 4 | MoE + GQA | 400 B | 17 B | General-purpose base |
Qwen3 | Dense & MoE lines | 0.6 B–235 B | 0.6 B–22 B | Any size you need |
SmolLM3 | NoPE (¼ layers) | 3 B | 3 B | Tiny local assistant |
Kimi 2 | Scaled-up DeepSeek-V3 | 1 T | ~32 B | Public-weight SOTA |
3. Model-by-Model Walk-Through
3.1 DeepSeek-V3 / R1 – MLA + MoE for Memory-Smart Serving
3.1.1 Multi-Head Latent Attention (MLA) – KV-Cache in a Zip File
Imagine the KV cache as a warehouse shelf.
- Standard MHA: every shelf is full.
- GQA: two shelves share one box—cheaper, sometimes messy.
- MLA: vacuum-seal each box before storing, inflate when needed—same contents, 70 % less space.
Implementation notes
- Training: compress queries too.
- Inference: only keys/values are compressed; one extra matmul brings them back (see the sketch below).
- DeepSeek-V2 ablation: MLA beats both GQA and MHA in perplexity while using the least cache.
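To make the zip-file analogy concrete, here is a minimal PyTorch sketch of the compress-then-inflate step. The dimensions are made up for illustration, and the real DeepSeek-V3 module also keeps a small RoPE-carrying path and compresses queries during training; none of that is shown here.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Toy MLA-style KV compression: cache a small latent, inflate it to K/V on use."""
    def __init__(self, d_model=4096, d_latent=512, n_heads=32):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.down = nn.Linear(d_model, d_latent, bias=False)   # compress: this is what gets cached
        self.up_k = nn.Linear(d_latent, d_model, bias=False)   # inflate latent -> keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)   # inflate latent -> values

    def forward(self, x):
        # x: (batch, seq, d_model)
        latent = self.down(x)                                  # (B, T, d_latent), stored in the cache
        B, T, _ = latent.shape
        k = self.up_k(latent).view(B, T, self.n_heads, self.d_head)
        v = self.up_v(latent).view(B, T, self.n_heads, self.d_head)
        return latent, k, v

latent, k, v = LatentKVCache()(torch.randn(1, 8, 4096))
print(latent.shape, k.shape)   # cache holds (1, 8, 512) instead of full-width keys plus values
```

The saving comes entirely from caching `latent` instead of `k` and `v`; the extra matmul mentioned above is the `up_k` / `up_v` projection applied at inference time.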
3.1.2 Mixture-of-Experts (MoE) – 256 Stoves, Only 9 Are On
- 256 feed-forward experts = 256 stoves.
- Router picks 8 experts + 1 “shared” stove per token.
- Inference footprint drops from 671 B to 37 B params.
Shared expert idea
Keep one universal stove always hot while the rest specialize. The idea was first shown in DeepSpeed-MoE (2022), and DeepSeek still keeps it in 2025. A toy routing layer with a shared expert is sketched below.
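The sizes here are illustrative (8 experts, top-2 routing) rather than DeepSeek-V3's 256 experts with 8 routed, and the plain softmax gate omits load-balancing losses; treat it as a sketch of the routing idea, not the production layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithSharedExpert(nn.Module):
    """Toy top-k MoE layer plus one always-on shared expert."""
    def __init__(self, d_model=256, d_hidden=512, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.shared = nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
        self.top_k = top_k

    def forward(self, x):                                    # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)       # pick k experts per token
        out = self.shared(x)                                  # shared expert is always on
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                      # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

y = MoEWithSharedExpert()(torch.randn(4, 256))
print(y.shape)   # (4, 256); only k routed experts plus the shared one run per token
```

Note that every expert's weights still have to exist in memory; routing only shrinks the compute (and activations) per token, which is exactly why the inference footprint is quoted in "active" parameters.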
3.2 OLMo 2 – Moving the Layer-Norm Around
3.2.1 Post-Norm vs Pre-Norm vs OLMo 2’s Post-Norm
Style | Where Norm Lives | Gradient Stability | Warm-up Needed |
---|---|---|---|
Post-Norm (2017 Transformer) | After attention & FFN | Fragile | Yes |
Pre-Norm (GPT-2 → today) | Before attention & FFN | Safer | No |
OLMo 2 | After attention & FFN, inside residual path | Stable (paper fig. 9) | No |
Two lines of code change the order; training loss becomes smoother.
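Here is roughly what that reordering looks like, as a minimal sketch with stand-in sublayers. It assumes PyTorch 2.4+ for `nn.RMSNorm` and is not OLMo 2's actual code.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """GPT-2-style Pre-Norm: normalize before each sublayer."""
    def __init__(self, d, attn, ffn):
        super().__init__()
        self.n1, self.n2 = nn.RMSNorm(d), nn.RMSNorm(d)
        self.attn, self.ffn = attn, ffn
    def forward(self, x):
        x = x + self.attn(self.n1(x))      # norm -> sublayer -> add
        return x + self.ffn(self.n2(x))

class OlmoStylePostNormBlock(nn.Module):
    """OLMo 2-style Post-Norm: normalize after each sublayer, still inside the residual add."""
    def __init__(self, d, attn, ffn):
        super().__init__()
        self.n1, self.n2 = nn.RMSNorm(d), nn.RMSNorm(d)
        self.attn, self.ffn = attn, ffn
    def forward(self, x):
        x = x + self.n1(self.attn(x))      # sublayer -> norm -> add
        return x + self.n2(self.ffn(x))

blk = OlmoStylePostNormBlock(64, nn.Linear(64, 64), nn.Linear(64, 64))
print(blk(torch.randn(2, 8, 64)).shape)    # torch.Size([2, 8, 64])
```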
3.2.2 QK-Norm – One More RMSNorm for Queries and Keys
- Extra RMSNorm on queries and keys before RoPE.
- Stabilizes training when combined with Post-Norm.
- First used in 2023 vision transformers; OLMo 2 brings it to language models (sketch below).
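The sketch below shows where those two norms sit inside an attention block. Dimensions are illustrative, `apply_rope` is only indicated as a comment, and `nn.RMSNorm` needs PyTorch 2.4+; treat it as a reading aid rather than OLMo 2's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Causal self-attention with QK-Norm: per-head RMSNorm on queries and keys."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.q_norm = nn.RMSNorm(self.d_head)   # the two extra norms
        self.k_norm = nn.RMSNorm(self.d_head)

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)   # QK-Norm, applied before RoPE
        # q, k = apply_rope(q), apply_rope(k)   # rotary embeddings would go here
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, -1))

print(QKNormAttention()(torch.randn(1, 16, 256)).shape)   # torch.Size([1, 16, 256])
```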
3.3 Gemma 3 – Sliding-Window Attention and a “Sandwich” of Norms
3.3.1 Sliding-Window Attention – Only Look at Your Neighbors
- Regular attention: every token sees the whole sentence.
- Sliding-window: each token sees only 1024 neighbors.
- Gemma 3 uses a 5:1 ratio—five local layers for every global layer—so the KV-cache shrinks 5× with < 0.3 % perplexity loss (paper fig. 13). A sketch of the local mask follows.
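The snippet below builds the local causal mask for a toy window of 4 tokens. Gemma 3's real window is 1024, and the 5:1 local/global mix is decided per layer, which is not shown here.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where a query position may attend: causal AND within the local window."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=8, window=4)
print(mask.int())   # row t has at most 4 ones: token t sees tokens t-3 .. t, nothing earlier
```

The resulting boolean mask can be passed as `attn_mask` to `F.scaled_dot_product_attention`, where `True` marks positions a query is allowed to attend to.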
3.3.2 Norm “Sandwich” – Pre-Norm + Post-Norm Around Attention
Input → RMSNorm → Attention → RMSNorm → + Residual → RMSNorm → FeedForward → RMSNorm → + Residual
Gemma 3 puts an RMSNorm both before and after the attention block (and the feed-forward block)—cheap insurance.
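As a minimal sketch (stand-in sublayers, `nn.RMSNorm` from PyTorch 2.4+, not Gemma's real module), the sandwich looks like this:

```python
import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    """RMSNorm before and after each sublayer, then add back to the residual stream."""
    def __init__(self, d, attn, ffn):
        super().__init__()
        self.pre_attn, self.post_attn = nn.RMSNorm(d), nn.RMSNorm(d)
        self.pre_ffn, self.post_ffn = nn.RMSNorm(d), nn.RMSNorm(d)
        self.attn, self.ffn = attn, ffn

    def forward(self, x):
        x = x + self.post_attn(self.attn(self.pre_attn(x)))   # norm -> attention -> norm -> add
        return x + self.post_ffn(self.ffn(self.pre_ffn(x)))   # same sandwich around the FFN

blk = SandwichBlock(64, nn.Linear(64, 64), nn.Linear(64, 64))
print(blk(torch.randn(2, 8, 64)).shape)   # torch.Size([2, 8, 64])
```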
3.4 Gemma 3n – Running 4 B on a Phone
Trick | Plain-English What & Why |
---|---|
PLE (Per-Layer Embedding) | Keep embeddings on SSD; stream to GPU when needed—like a mobile game loading level assets. |
MatFormer | One trained model can be sliced into ½, ¼, ⅛ size sub-models, each usable standalone—Russian-doll transformers. |
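To give a feel for the MatFormer row above, here is a toy FFN whose hidden width can be sliced at inference time. The slicing interface and sizes are invented for illustration; Gemma 3n's actual nested training and sub-model extraction procedure is more involved.

```python
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    """Toy Russian-doll FFN: a sub-model uses only the first fraction of the hidden units."""
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x, fraction=1.0):
        h = int(self.up.out_features * fraction)                    # keep the first h hidden units
        hidden = torch.relu(x @ self.up.weight[:h].T + self.up.bias[:h])
        return hidden @ self.down.weight[:, :h].T + self.down.bias

ffn = NestedFFN()
x = torch.randn(2, 256)
full = ffn(x, fraction=1.0)    # full-width model
half = ffn(x, fraction=0.5)    # "sliced" sub-model reusing half of the same weights
print(full.shape, half.shape)  # both (2, 256)
```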
3.5 Mistral Small 3.1 – Back to Vanilla for Latency
- Dropped sliding-window attention → can use FlashAttention’s fastest kernels.
- Fewer layers + custom tokenizer → 15–25 % lower first-token latency vs Gemma 3 27 B.
- Still vanilla GQA—no MoE, no MLA—proving simpler sometimes wins (a GQA sketch follows).
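Since “vanilla GQA” does a lot of work in that sentence, here is a quick sketch of the core trick: only a few key/value heads are cached and then repeated to serve many query heads. Head counts are illustrative, not Mistral's.

```python
import torch
import torch.nn.functional as F

n_q_heads, n_kv_heads, d_head, T = 8, 2, 32, 16
q = torch.randn(1, n_q_heads, T, d_head)
k = torch.randn(1, n_kv_heads, T, d_head)   # only 2 KV heads ever enter the cache
v = torch.randn(1, n_kv_heads, T, d_head)

group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)        # expand KV heads to match the query heads
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # (1, 8, 16, 32); the cache held 2 of 8 heads' worth of K/V, a 4x saving
```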
3.6 Llama 4 – A Classic Flavor of MoE
Detail | Llama 4 Maverick | DeepSeek-V3 |
---|---|---|
Total Params | 400 B | 671 B |
Active Params | 17 B | 37 B |
Expert Hidden Size | 8 k (large) | 2 k (small) |
MoE Layer Pattern | Every other layer | Almost every layer |
Take-away: there is more than one right way to mix experts.
3.7 Qwen3 – Dense and Sparse Lines for Every Appetite
3.7.1 Dense Line – 0.6 B to 32 B
- 0.6 B: 1,800 tokens/s on an A100, < 2 GB VRAM—perfect for class demos.
- Full training logs, data cards, and PyTorch reference code released.
3.7.2 MoE Line – 30 B-A3B and 235 B-A22B
Naming cheat-sheet: 235B-A22B = 235 B total, 22 B active.
- Qwen3 removes the shared expert that Qwen2.5-MoE used—possibly because 8 experts already cover common patterns.
- Provides both dense (easy to fine-tune) and MoE (cheap to serve) lines with the same tokenizer and vocabulary.
3.8 SmolLM3 – A 3 B Model That Drops Positional Embeddings
- NoPE (No Positional Embedding)
  - No sinusoidal, no RoPE, nothing.
  - The model learns order from the causal mask alone.
  - The NoPE paper shows better length generalization on 100 M-param GPT-style models.
- Practical compromise: SmolLM3 applies NoPE only in every 4th layer to stay safe (sketch below).
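A sketch of that compromise is below. Exactly which layers skip RoPE is an assumption made for illustration (here every 4th layer), and `apply_rope` stands in for a real rotary-embedding function.

```python
def use_rope(layer_idx: int, nope_every: int = 4) -> bool:
    """Return False for the NoPE layers (every 4th layer in this illustration)."""
    return (layer_idx + 1) % nope_every != 0

def attention_inputs(q, k, layer_idx, apply_rope):
    # RoPE layers rotate q and k; NoPE layers pass them through untouched and
    # rely on the causal mask alone for ordering information.
    if use_rope(layer_idx):
        q, k = apply_rope(q), apply_rope(k)
    return q, k

print([use_rope(i) for i in range(8)])   # [True, True, True, False, True, True, True, False]
```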
3.9 Kimi 2 – A 1 T Parameter DeepSeek-V3 Clone
- Scale: 1 T parameters, the largest open-weight model as of July 2025.
- Architecture: the same MLA + MoE blueprint as DeepSeek-V3, but with more experts per layer and fewer MLA heads.
- Training: first production-scale use of the Muon optimizer beyond 16 B params.
- Result: public weights that rival proprietary giants.
4. Quick-Look Decision Table
If You Need … | Pick | One-Line Reason |
---|---|---|
27 B-class on one RTX 4090 | Gemma 3 27 B | 27 B power, 5× less KV cache |
Cheap high-throughput API | DeepSeek-V3 | 37 B active, 671 B knowledge |
100 % reproducible paper | OLMo 2 | Full data, code, configs |
Offline phone chat | Gemma 3n 4 B | PLE + MatFormer fit in RAM |
Lowest latency | Mistral Small 3.1 24 B | 24 B, FlashAttention-ready |
One family, all sizes | Qwen3 series | 0.6 B–235 B with same tokenizer |
Tiny model with NoPE | SmolLM3 3 B | 3 B, no positional hassle |
Public SOTA brute force | Kimi 2 1 T | Largest open weights |
5. Developer FAQ – 10 Real-World Questions
Q1: Can I run a 70 B model on a single RTX 4090?
A: Not a dense 70 B model. With the MoE models (Llama 4, DeepSeek-V3) only 17–37 B parameters are active per token, which cuts compute, but the full expert weights still have to sit in system RAM or be offloaded.
Q2: Does sliding-window attention hurt accuracy?
A: Gemma 3’s ablation shows < 0.3 % perplexity loss when 5/6 layers use local attention.
Q3: MLA vs GQA—worth the switch?
A: If you already use KV-cache, MLA gives better perplexity and smaller cache. Implementation takes ~20 extra lines of PyTorch.
Q4: Post-Norm looks scary—will my gradients explode?
A: OLMo 2’s logs show the opposite when combined with QK-Norm. You can usually drop learning-rate warm-up.
Q5: Can I port NoPE to a 70 B model tomorrow?
A: SmolLM3 cautiously uses NoPE only every 4th layer. Test at small scale first.
Q6: How complex is the MoE router?
A: DeepSeek uses plain top-k gating; CUDA kernels are available in DeepSpeed and xFormers.
Q7: Shared expert—should I keep it?
A: DeepSeek keeps it, Qwen3 drops it. The accuracy delta is < 0.2 %. Keep if compute budget allows.
Q8: MatFormer doubles training cost?
A: One “nested” forward pass trains all sub-sizes simultaneously; cost increase is < 10 %.
Q9: Muon optimizer—drop-in for AdamW?
A: Kimi 2 proves it scales, but you must shard optimizer states differently. Code is not yet upstream.
Q10: I need a 1 B model for class—safest starting point?
A: Qwen3 0.6 B or SmolLM3 3 B. Both publish exact training scripts and tokenizers.
6. Key Takeaways and Next Steps
Seven years of Transformer tweaks have not changed the foundation, but they have changed the cost equation:
- Memory: MLA and sliding-window attention cut KV-cache by 3–5×.
- Compute: MoE lets you “own” a trillion parameters while paying for 20–40 B during inference.
- Edge: PLE and MatFormer move 4 B models onto phones.
The next leap may be a brand-new architecture—or simply smarter stacking of today’s bricks. Keep the receipts (open weights), keep the benchmarks, and keep experimenting.