The 2025 Landscape of Open-Weight Large Language Models: A Plain-English Tour from DeepSeek-V3 to Kimi 2

“Seven years after the first GPT paper, are we still stacking the same Lego blocks?”
“Which model can I actually run on a single RTX 4090?”
“What do MoE, MLA, NoPE, and QK-Norm mean for my weekend side-project?”

This article answers those questions in plain language. Every fact, number, and code snippet comes from the official papers or repositories of the eight model families discussed—no outside sources, no hype.


Table of Contents

  1. Why Architecture Still Matters in 2025
  2. One Map, Nine Models
  3. Model-by-Model Walk-Through
    3.1 DeepSeek-V3 / R1 – MLA + MoE for Memory-Smart Serving
    3.2 OLMo 2 – Moving the Layer-Norm Around
    3.3 Gemma 3 – Sliding-Window Attention and a “Sandwich” of Norms
    3.4 Gemma 3n – Running 4 B on a Phone
    3.5 Mistral Small 3.1 – Back to Vanilla for Latency
    3.6 Llama 4 – A Classic Flavor of MoE
    3.7 Qwen3 – Dense and Sparse, from 0.6 B to 235 B
    3.8 SmolLM3 – A 3 B Model That Drops Positional Embeddings
    3.9 Kimi 2 – A 1 T Parameter DeepSeek-V3 Clone
  4. Quick-Look Decision Table
  5. Developer FAQ – 10 Real-World Questions
  6. Key Takeaways and Next Steps

1. Why Architecture Still Matters in 2025

In 2018, the original GPT stacked Transformer blocks and stunned the world.
In 2025, DeepSeek-V3, Llama 4, and Gemma 3 still stack Transformer blocks—but the details have quietly changed:

  • KV-cache compression (MLA) shrinks memory by 50-70 %.
  • Mixture-of-Experts (MoE) lets a 671 B model run with only 37 B active weights.
  • Sliding-window attention cuts context memory by 5× on consumer GPUs.
  • NoPE (No Positional Embedding) removes positional encodings entirely in some layers.

These tweaks decide whether a model fits on your laptop, your phone, or just your cloud budget.


2. One Map, Nine Models

| Model | Stand-out Trick | Total Params | Active Params | Sweet-Spot Use Case |
|---|---|---|---|---|
| DeepSeek-V3 | MLA + MoE + Shared Expert | 671 B | 37 B | High-throughput APIs |
| OLMo 2 | Post-Norm + QK-Norm + MHA | 7 B / 13 B | 7 B / 13 B | Reproducible research |
| Gemma 3 | Sliding Window + Dual Norm + GQA | 27 B | 27 B | Single-GPU local chat |
| Gemma 3n | PLE + MatFormer + Sliding Window | 4 B | 4 B | On-device mobile |
| Mistral Small 3.1 | Vanilla GQA + Slim Layers | 24 B | 24 B | Low-latency serving |
| Llama 4 | MoE + GQA | 400 B | 17 B | General-purpose base |
| Qwen3 | Dense & MoE lines | 0.6 B–235 B | 0.6 B–22 B | Any size you need |
| SmolLM3 | NoPE (¼ of layers) | 3 B | 3 B | Tiny local assistant |
| Kimi 2 | Scaled-up DeepSeek-V3 | 1 T | ~55 B | Public-weight SOTA |

3. Model-by-Model Walk-Through

3.1 DeepSeek-V3 / R1 – MLA + MoE for Memory-Smart Serving

3.1.1 Multi-Head Latent Attention (MLA) – KV-Cache in a Zip File

Imagine the KV cache as a warehouse shelf.

  • Standard MHA: every shelf is full.
  • GQA: two shelves share one box—cheaper, sometimes messy.
  • MLA: vacuum-seal each box before storing, inflate when needed—same contents, 70 % less space.

Implementation notes

  • Training: compress queries too.
  • Inference: only keys/values are compressed; one extra matmul brings them back.
  • DeepSeek-V2 ablation: MLA beats both GQA and MHA in perplexity while using the least cache.
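
To make the zip-file picture concrete, here is a minimal PyTorch sketch of the inference-side idea. Dimensions and module names are illustrative, not DeepSeek's actual code, and the real MLA also keeps a small decoupled RoPE key component that is omitted here.

```python
import torch
import torch.nn as nn

class MLAKVCache(nn.Module):
    """Sketch of MLA-style caching: store a small latent per token,
    re-expand it to full keys/values at attention time."""

    def __init__(self, d_model: int = 4096, d_latent: int = 512, n_heads: int = 32):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)  # compress ("vacuum-seal")
        self.up_k = nn.Linear(d_latent, d_model, bias=False)     # decompress keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)     # decompress values

    def forward(self, x: torch.Tensor, latent_cache: torch.Tensor):
        # x: (batch, new_tokens, d_model); latent_cache: (batch, past_tokens, d_latent)
        latent_cache = torch.cat([latent_cache, self.down_kv(x)], dim=1)  # cache only the latent
        b, t, _ = latent_cache.shape
        k = self.up_k(latent_cache).view(b, t, self.n_heads, self.d_head)  # the extra matmul
        v = self.up_v(latent_cache).view(b, t, self.n_heads, self.d_head)
        return k, v, latent_cache
```

With these toy sizes, each cached token stores d_latent = 512 values instead of 2 × d_model = 8192 for full keys and values, which is where the cache savings come from.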

3.1.2 Mixture-of-Experts (MoE) – 256 Stoves, Only 9 Are On

  • 256 feed-forward experts = 256 stoves.
  • Router picks 8 experts + 1 “shared” stove per token.
  • Active parameters per token drop from 671 B to 37 B (the full 671 B still has to be stored).

Shared expert idea
Keep one universal stove always hot while the rest specialize. First shown in DeepSpeed-MoE (2022), the idea is still in DeepSeek-V3 in 2025.
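
The routing itself is just a top-k pick plus one expert that always fires. Below is a toy PyTorch sketch under simplified assumptions (8 experts, top-2, made-up sizes; DeepSeek-V3 routes to 8 of 256 experts plus 1 shared expert and adds load-balancing machinery not shown here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithSharedExpert(nn.Module):
    """Toy MoE layer: top-k routed experts plus one always-on shared expert."""

    def __init__(self, d_model: int = 1024, d_ff: int = 2048,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        make_ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                         nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))
        self.shared_expert = make_ffn()                # the always-hot stove
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); batch and sequence dims are flattened beforehand
        weights, idx = self.router(x).topk(self.top_k, dim=-1)   # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        routed = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                         # tokens sent to expert e
                if mask.any():
                    routed[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return self.shared_expert(x) + routed                    # shared expert always runs
```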


3.2 OLMo 2 – Moving the Layer-Norm Around

3.2.1 Post-Norm vs Pre-Norm vs OLMo 2’s Post-Norm

| Style | Where the Norm Lives | Gradient Stability | Warm-up Needed |
|---|---|---|---|
| Post-Norm (2017 Transformer) | After attention & FFN, following the residual addition | Fragile | Yes |
| Pre-Norm (GPT-2 → today) | Before attention & FFN | Safer | No |
| OLMo 2 | After attention & FFN, but before the residual addition (norm on the sub-layer output) | Stable (paper fig. 9) | No |

Two lines of code change the order; training loss becomes smoother.
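
In code the change really is a reordering. A minimal sketch, assuming a recent PyTorch (nn.RMSNorm needs ≥ 2.4) and toy sizes; the causal mask and RoPE are omitted:

```python
import torch
import torch.nn as nn

class Olmo2StyleBlock(nn.Module):
    """Toy block with OLMo 2's norm placement: RMSNorm on each sub-layer's
    output, applied before the residual addition."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.attn_norm = nn.RMSNorm(d_model)
        self.ffn_norm = nn.RMSNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-Norm (GPT-2 style) would instead be: x = x + self.attn(self.attn_norm(x), ...)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = x + self.attn_norm(attn_out)    # norm *after* attention, before the skip add
        x = x + self.ffn_norm(self.ffn(x))  # norm *after* the FFN, before the skip add
        return x
```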

3.2.2 QK-Norm – One More RMSNorm for Queries and Keys

  • Extra RMSNorm on queries and keys before RoPE.
  • Stabilizes training when combined with Post-Norm.
  • First used in 2023 vision transformers; OLMo 2 brings it to language models.
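
A minimal sketch of the projection path, again assuming PyTorch ≥ 2.4 for nn.RMSNorm; the RoPE step and the attention itself are left out, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class QKNormProjection(nn.Module):
    """QK-Norm sketch: an extra RMSNorm on queries and keys right after the
    linear projections, before RoPE and the dot-product."""

    def __init__(self, d_model: int = 512, d_head: int = 64):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_head, bias=False)
        self.w_k = nn.Linear(d_model, d_head, bias=False)
        self.q_norm = nn.RMSNorm(d_head)   # the two extra norms
        self.k_norm = nn.RMSNorm(d_head)

    def forward(self, x: torch.Tensor):
        q = self.q_norm(self.w_q(x))   # normalized queries; RoPE would follow
        k = self.k_norm(self.w_k(x))   # normalized keys; RoPE would follow
        return q, k
```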

3.3 Gemma 3 – Sliding-Window Attention and a “Sandwich” of Norms

3.3.1 Sliding-Window Attention – Only Look at Your Neighbors

  • Regular attention: every token sees the whole sentence.
  • Sliding-window: each token sees only 1024 neighbors.
  • Gemma 3 uses a 5:1 ratio (five local layers for every global layer); the KV-cache shrinks about 5× with < 0.3 % perplexity loss (paper fig. 13).
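
The "only look at your neighbors" rule is literally a band of allowed positions in the attention mask. A small sketch (window size 1024 matches the text above; the helper name is illustrative):

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int = 1024) -> torch.Tensor:
    """Boolean mask where True means 'query i may attend to key j'.
    Each token sees itself and at most window - 1 previous tokens."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    return (j <= i) & ((i - j) < window)     # causal AND within the local window

# Example: with window=4, token 10 attends only to tokens 7, 8, 9, 10.
mask = sliding_window_causal_mask(seq_len=16, window=4)
```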

3.3.2 Norm “Sandwich” – Pre-Norm + Post-Norm Around Attention

Input → RMSNorm → Attention → RMSNorm → + Residual → RMSNorm → FeedForward → RMSNorm → + Residual

Gemma 3 puts an RMSNorm both before and after the attention and feed-forward blocks: cheap insurance against training instability.
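
Here is a toy sketch of that block, assuming a recent PyTorch (nn.RMSNorm needs ≥ 2.4); sizes are illustrative and the causal mask, GQA, and RoPE are left out.

```python
import torch
import torch.nn as nn

class Gemma3StyleBlock(nn.Module):
    """Toy block with the norm 'sandwich': RMSNorm before *and* after
    both the attention and the feed-forward sub-layers."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.pre_attn_norm = nn.RMSNorm(d_model)
        self.post_attn_norm = nn.RMSNorm(d_model)
        self.pre_ffn_norm = nn.RMSNorm(d_model)
        self.post_ffn_norm = nn.RMSNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.pre_attn_norm(x)
        attn_out, _ = self.attn(a, a, a, need_weights=False)
        x = x + self.post_attn_norm(attn_out)                       # norm on both sides of attention
        x = x + self.post_ffn_norm(self.ffn(self.pre_ffn_norm(x)))  # and of the feed-forward
        return x
```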


3.4 Gemma 3n – Running 4 B on a Phone

| Trick | Plain-English What & Why |
|---|---|
| PLE (Per-Layer Embedding) | Keep embeddings on SSD and stream them to the GPU when needed, like a mobile game loading level assets. |
| MatFormer | One trained model can be sliced into ½-, ¼-, or ⅛-size sub-models, each usable standalone: Russian-doll transformers. |
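
To make the Russian-doll idea concrete, here is a toy Matryoshka-style feed-forward layer under our own simplifying assumptions (class name, sizes, and the prefix-slicing detail are illustrative, not Gemma 3n's actual code): a smaller sub-model simply uses a prefix of the trained hidden width.

```python
import torch
import torch.nn as nn

class MatryoshkaFFN(nn.Module):
    """Toy nested FFN: train once at full width, then run a half- or
    quarter-width sub-model by slicing the same weights."""

    def __init__(self, d_model: int = 512, d_ff: int = 4096):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor, fraction: float = 1.0) -> torch.Tensor:
        d = int(self.up.out_features * fraction)                   # e.g. 0.5 -> half-width slice
        h = torch.relu(x @ self.up.weight[:d].t() + self.up.bias[:d])
        return h @ self.down.weight[:, :d].t() + self.down.bias

ffn = MatryoshkaFFN()
x = torch.randn(2, 512)
full, half = ffn(x), ffn(x, fraction=0.5)   # same weights, two model sizes
```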

3.5 Mistral Small 3.1 – Back to Vanilla for Latency

  • Dropped sliding-window attention → can use FlashAttention’s fastest kernels.
  • Fewer layers + custom tokenizer → 15–25 % lower first-token latency vs Gemma 3 27 B.
  • Still vanilla GQA—no MoE, no MLA—proving simpler sometimes wins.

3.6 Llama 4 – A Classic Flavor of MoE

| Detail | Llama 4 Maverick | DeepSeek-V3 |
|---|---|---|
| Total Params | 400 B | 671 B |
| Active Params | 17 B | 37 B |
| Expert Hidden Size | 8 k (large) | 2 k (small) |
| MoE Layer Pattern | Every other layer | Almost every layer |

Take-away: there is more than one right way to mix experts.


3.7 Qwen3 – Dense and Sparse Lines for Every Appetite

3.7.1 Dense Line – 0.6 B to 32 B

  • 0.6 B: 1800 tokens/s on an A100, < 2 GB VRAM, perfect for class demos.
  • Full training logs, data cards, and PyTorch reference code released.

3.7.2 MoE Line – 30 B-A3B and 235 B-A22B

Naming cheat-sheet
235B-A22B = 235 B total, 22 B active.

  • Qwen3 removes the shared expert that Qwen2.5-MoE used—possibly because 8 experts already cover common patterns.
  • Provides both dense (easy to fine-tune) and MoE (cheap to serve) for the same tokenizer and vocabulary.

3.8 SmolLM3 – A 3 B Model That Drops Positional Embeddings

  • NoPE (No Positional Embedding)

    • No sinusoidal, no RoPE, nothing.
    • Model learns order from the causal mask alone.
    • Paper shows better length generalization on 100 M-param GPT models.
  • Practical compromise: SmolLM3 applies NoPE only in every 4th layer to stay safe, as the sketch below shows.
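
A minimal sketch of that compromise, with an illustrative layer count and helper names of our own (SmolLM3's actual configuration may differ in the details):

```python
import torch
import torch.nn.functional as F

N_LAYERS = 36  # illustrative

def layer_uses_rope(layer_idx: int) -> bool:
    """NoPE in every 4th layer: those layers get no positional signal at all
    and rely on the causal mask alone."""
    return (layer_idx + 1) % 4 != 0

rope_schedule = [layer_uses_rope(i) for i in range(N_LAYERS)]
# -> RoPE, RoPE, RoPE, NoPE, RoPE, RoPE, RoPE, NoPE, ...

def attend(q, k, v, use_rope: bool, rope_fn):
    # q, k, v: (batch, heads, seq, head_dim)
    if use_rope:
        q, k = rope_fn(q), rope_fn(k)   # RoPE layers rotate q and k as usual
    # NoPE layers skip straight to plain causal attention
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```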

3.9 Kimi 2 – A 1 T Parameter DeepSeek-V3 Clone

  • Scale: 1 T parameters, the largest open-weight model as of July 2025.
  • Architecture: same MLA + MoE blueprint as DeepSeek-V3, but

    • more experts per layer
    • fewer MLA heads
  • Training: first production-scale use of Muon optimizer (> 16 B params).
  • Result: public weights that rival proprietary giants.

4. Quick-Look Decision Table

| If You Need … | Pick | One-Line Reason |
|---|---|---|
| 27 B-class on one RTX 4090 | Gemma 3 27 B | 27 B power, 5× less KV cache |
| Cheap high-throughput API | DeepSeek-V3 | 37 B active, 671 B knowledge |
| 100 % reproducible paper | OLMo 2 | Full data, code, configs |
| Offline phone chat | Gemma 3n 4 B | PLE + MatFormer fit in RAM |
| Lowest latency | Mistral Small 3.1 24 B | 24 B, FlashAttention-ready |
| One family, all sizes | Qwen3 series | 0.6 B–235 B with same tokenizer |
| Tiny model with NoPE | SmolLM3 3 B | 3 B, no positional hassle |
| Public SOTA brute force | Kimi 2 1 T | Largest open weights |

5. Developer FAQ – 10 Real-World Questions

Q1: Can I run a 70 B model on a single RTX 4090?
A: Not a dense 70 B at full precision; 24 GB of VRAM is not enough. MoE models (Llama 4, DeepSeek-V3) activate only 17–37 B parameters per token, which cuts compute, but their full weights still have to be stored or offloaded. A quantized dense 24–27 B model (Gemma 3, Mistral Small 3.1) is the more realistic single-4090 fit.

Q2: Does sliding-window attention hurt accuracy?
A: Gemma 3’s ablation shows < 0.3 % perplexity loss when 5/6 layers use local attention.

Q3: MLA vs GQA—worth the switch?
A: If KV-cache memory is your bottleneck, yes: MLA gives better perplexity and a smaller cache than GQA. Implementation takes ~20 extra lines of PyTorch.

Q4: Post-Norm looks scary—will my gradients explode?
A: OLMo 2’s logs show the opposite when combined with QK-Norm. You can usually drop learning-rate warm-up.

Q5: Can I port NoPE to a 70 B model tomorrow?
A: SmolLM3 cautiously uses NoPE only every 4th layer. Test at small scale first.

Q6: How complex is the MoE router?
A: DeepSeek uses plain top-k gating; CUDA kernels are available in DeepSpeed and xFormers.

Q7: Shared expert—should I keep it?
A: DeepSeek keeps it, Qwen3 drops it. The accuracy delta is < 0.2 %. Keep if compute budget allows.

Q8: MatFormer doubles training cost?
A: One “nested” forward pass trains all sub-sizes simultaneously; cost increase is < 10 %.

Q9: Muon optimizer—drop-in for AdamW?
A: Kimi 2 proves it scales, but you must shard optimizer states differently. Code is not yet upstream.

Q10: I need a 1 B model for class—safest starting point?
A: Qwen3 0.6 B or SmolLM3 3 B. Both publish exact training scripts and tokenizers.


6. Key Takeaways and Next Steps

Seven years of Transformer tweaks have not changed the foundation, but they have changed the cost equation:

  • Memory: MLA and sliding-window attention cut KV-cache by 3–5×.
  • Compute: MoE lets you “own” a trillion parameters while paying for 20–40 B during inference.
  • Edge: PLE and MatFormer move 4 B models onto phones.

The next leap may be a brand-new architecture—or simply smarter stacking of today’s bricks. Keep the receipts (open weights), keep the benchmarks, and keep experimenting.