From GPT-2 to Kimi 2: A Visual Guide to 2025’s Leading Large Language Model Architectures

If you already use large language models but still get lost in technical jargon, this post is for you. In one long read you’ll learn:

  • Why DeepSeek-V3, with 671 B parameters, is cheaper to run than Llama 3’s 405 B
  • How sliding-window attention lets a 27 B model run on a Mac Mini
  • Which open-weight model to download for your next side project

Table of Contents

  1. Seven Years of the Same Backbone—What Actually Changed?
  2. DeepSeek-V3 / R1: MLA + MoE, the Memory-Saving Duo
  3. OLMo 2: Moving RMSNorm One Step Back for Stable Training
  4. Gemma 3: Sliding-Window Attention Shrinks the KV Cache
  5. Mistral Small 3.1: 24 B Beating 27 B with Smaller Tricks
  6. Llama 4: Meta’s Take on Mixture-of-Experts
  7. Qwen3: Dense vs MoE—Pick Your Flavor
  8. SmolLM3: A 3 B Model That Drops Positional Embeddings
  9. Kimi 2: Scaling DeepSeek-V3 to 1 T Parameters
  10. One Comparison Table to Rule Them All
  11. FAQ: Quick Answers to Common Reader Questions

1. Seven Years of the Same Backbone—What Actually Changed?

| Component | 2019 GPT-2 | 2025 Mainstream | One-Sentence Summary |
| --- | --- | --- | --- |
| Position Encoding | Fixed Absolute | RoPE (Rotary) | Longer context, less memory |
| Attention | Multi-Head | GQA / MLA | Fewer K/V heads, smaller cache |
| Activation | GELU | SwiGLU | Faster and slightly more accurate |
| Feed-Forward | Single Dense | MoE or Slimmer FFN | Train huge, infer small |
| Norm Placement | Post-LN | Pre-LN / Hybrid | More stable training dynamics |

2. DeepSeek-V3 / R1: MLA + MoE, the Memory-Saving Duo

2.1 Multi-Head Latent Attention (MLA)

  • Problem: The KV cache is the main GPU-memory bottleneck during inference.
  • Classic Fix (GQA): Group query heads so that several of them (e.g., two) share one K/V pair.
  • DeepSeek’s Fix (MLA): Compress K/V into a low-dimensional latent before caching, then decompress on the fly.

    • Cache size drops ≈ 50 %.
    • DeepSeek-V2 ablations show MLA outperforms vanilla MHA and GQA.
# Pseudocode for MLA decompression at inference time
compressed_kv = cache[token_id]         # low-dimensional latent stored in the cache, (batch, latent_dim)
key, value = up_project(compressed_kv)  # expanded back to full-size K and V on the fly
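
For a more concrete picture, here is a minimal runnable sketch of the idea in PyTorch. All module names (down_proj, up_proj_kv) and dimensions are illustrative assumptions, not DeepSeek’s actual implementation, and RoPE handling is omitted.

import torch
import torch.nn as nn

batch, d_model, latent_dim, n_heads, head_dim = 2, 1024, 128, 8, 64

# Compression path (cache-write time): project the token's hidden state
# down to a small latent and store only that in the KV cache.
down_proj = nn.Linear(d_model, latent_dim, bias=False)
# Decompression path (read time): expand the latent back to full-size K and V.
up_proj_kv = nn.Linear(latent_dim, 2 * n_heads * head_dim, bias=False)

hidden = torch.randn(batch, d_model)          # hidden state of the newest token
latent = down_proj(hidden)                    # (batch, latent_dim) -> this is what gets cached
k, v = up_proj_kv(latent).chunk(2, dim=-1)    # decompress on the fly
k = k.view(batch, n_heads, head_dim)
v = v.view(batch, n_heads, head_dim)
print(latent.shape, k.shape)                  # the cache holds 128 values per token instead of 2 * 8 * 64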

2.2 Mixture-of-Experts (MoE)

  • Concept: Replace each Feed-Forward block with 256 smaller expert FFNs, but let a router activate only 8 of them plus 1 shared expert per token (see the routing sketch after this list).
  • Numbers that matter

    • Total params: 671 B
    • Active params: 37 B (5.5 %)
  • Shared Expert Trick

    • Always-on expert for common patterns, freeing the others to specialize.
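
Here is a minimal sketch of that routing step, with a generic top-k router plus one always-on shared expert. The sizes (8 experts, top-2) and all names are illustrative assumptions rather than DeepSeek’s code, and the per-token Python loop trades speed for readability.

import torch
import torch.nn as nn

d_model, d_hidden, num_experts, top_k = 256, 512, 8, 2   # toy sizes; DeepSeek-V3 uses 256 experts, 8 active

def make_ffn():
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))

experts = nn.ModuleList([make_ffn() for _ in range(num_experts)])  # routed experts
shared_expert = make_ffn()                                         # always-on shared expert
router = nn.Linear(d_model, num_experts, bias=False)               # scores each expert per token

def moe_forward(x):                                   # x: (num_tokens, d_model)
    scores = router(x).softmax(dim=-1)                # routing probabilities
    weights, idx = scores.topk(top_k, dim=-1)         # keep only the top-k experts per token
    outputs = []
    for t in range(x.shape[0]):
        routed = sum(w * experts[e](x[t]) for w, e in zip(weights[t], idx[t].tolist()))
        outputs.append(shared_expert(x[t]) + routed)  # shared expert handles common patterns
    return torch.stack(outputs)

print(moe_forward(torch.randn(4, d_model)).shape)     # torch.Size([4, 256])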

3. OLMo 2: Moving RMSNorm One Step Back for Stable Training

| Placement | Original Transformer | GPT-2 / Llama 3 | OLMo 2 |
| --- | --- | --- | --- |
| Where is the norm? | After the block (Post-LN) | Before the block (Pre-LN) | After the block, but inside the residual |
  • Why change? Authors show Post-Norm + RMSNorm + QK-Norm yields smoother loss curves (see their Figure 9).
  • QK-Norm = an extra RMSNorm inside the attention layer, applied to queries and keys before RoPE (a minimal sketch follows this list).
  • Take-away: If your pre-training keeps diverging, try OLMo 2’s recipe.
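
To make QK-Norm concrete, here is a minimal sketch of the query/key path with RMSNorm applied per head before RoPE. It assumes PyTorch ≥ 2.4 for nn.RMSNorm; the module names and the omitted apply_rope step are illustrative, not OLMo 2’s actual code.

import torch
import torch.nn as nn

batch, seq_len, d_model, n_heads = 2, 16, 512, 8
head_dim = d_model // n_heads

w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
q_norm = nn.RMSNorm(head_dim)   # QK-Norm: RMSNorm over each head's query vector...
k_norm = nn.RMSNorm(head_dim)   # ...and key vector, applied before RoPE

x = torch.randn(batch, seq_len, d_model)
q = q_norm(w_q(x).view(batch, seq_len, n_heads, head_dim))
k = k_norm(w_k(x).view(batch, seq_len, n_heads, head_dim))
# q, k = apply_rope(q), apply_rope(k)   # rotary embedding would be applied here (omitted)
print(q.shape, k.shape)                 # torch.Size([2, 16, 8, 64]) each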

4. Gemma 3: Sliding-Window Attention Shrinks the KV Cache

4.1 Sliding-Window Attention

  • Local vs Global

    • Local: each token attends only to the most recent 1024 tokens (its sliding window).
    • Global: full self-attention.
  • Gemma 3 ratio: 5 local layers : 1 global layer.
  • Memory win: the KV cache is about 5× smaller with a negligible perplexity increase (see Gemma 3 paper, Figure 13); a minimal masking sketch follows this list.
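
The sketch below shows how a local (sliding-window) mask differs from a full causal mask, with a toy window of 4 tokens rather than Gemma 3’s 1024; the helper names are assumptions for illustration.

import torch

def causal_mask(seq_len):
    # Global attention: token i may attend to every earlier token j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len, window):
    # Local attention: token i may attend only to the last `window` tokens, itself included.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

print(causal_mask(8).sum(dim=-1))             # tensor([1, 2, 3, 4, 5, 6, 7, 8]) keys per query
print(sliding_window_mask(8, 4).sum(dim=-1))  # tensor([1, 2, 3, 4, 4, 4, 4, 4]) capped at the window
# For local layers the KV cache never needs more than `window` past tokens per sequence.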

4.2 Sandwich Norm

  • RMSNorm before and after the attention block.
  • Combines Pre-Norm stability with Post-Norm signal strength (a minimal sketch of the pattern follows below).
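
Here is a minimal sketch of the sandwich placement, assuming PyTorch’s nn.RMSNorm and a generic sub-layer (a feed-forward block in the demo); this illustrates the pattern, not Gemma 3’s actual module.

import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    """RMSNorm before and after the sub-layer; the residual wraps the whole sandwich."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.pre_norm = nn.RMSNorm(d_model)    # Pre-Norm: stabilizes what the sub-layer sees
        self.post_norm = nn.RMSNorm(d_model)   # Post-Norm: re-normalizes what it emits
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))

ffn = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
block = SandwichBlock(256, ffn)
print(block(torch.randn(2, 10, 256)).shape)    # torch.Size([2, 10, 256])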

5. Mistral Small 3.1: 24 B Beating 27 B with Smaller Tricks

  • Dropped sliding-window attention → uses vanilla GQA + FlashAttention for lower latency.
  • Shrunk tokenizer vocabulary → fewer tokens per sentence → faster generation.
  • Fewer layers → less serial computation.
  • Benchmark snapshot: beats Gemma 3 27 B on most tasks except math.

6. Llama 4: Meta’s Take on Mixture-of-Experts

| Metric | DeepSeek-V3 | Llama 4 Maverick |
| --- | --- | --- |
| Total Params | 671 B | 400 B |
| Active Params | 37 B | 17 B |
| Experts per layer | 256 (9 active) | 128 (2 active) |
| Expert Hidden Dim | 2048 | 8192 |
| MoE frequency | Every layer after the first 3 | Every other layer |
  • Key difference: Llama 4 alternates MoE and dense layers; DeepSeek goes all-in.
  • Impact: Still too early to call a winner, but both prove MoE is mainstream in 2025.

7. Qwen3: Dense vs MoE—Pick Your Flavor

7.1 Dense Models (0.6 B → 32 B)

  • 0.6 B checkpoint: smallest current-gen open model; runs on a laptop CPU.
  • Architecture: deeper and narrower than Llama 3 1 B → smaller memory footprint.

7.2 MoE Models

  • 30B-A3B & 235B-A22B

    • A22B = 22 B active parameters out of 235 B total.
    • No shared expert, unlike the earlier Qwen2.5-MoE models.
    • Use-case matrix

      • Dense: easier fine-tuning, predictable latency.
      • MoE: higher knowledge capacity at a fixed per-token inference cost (a quick parameter-count check follows this list).
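
A quick back-of-the-envelope check of that trade-off, using the Qwen3 235B-A22B numbers from above; the “2 × params” FLOPs-per-token rule and the 8-bit weight size are rough approximations.

total_params = 235e9    # every expert must be stored (downloads, weight memory)
active_params = 22e9    # but only ~22 B parameters touch each token

print(f"active fraction : {active_params / total_params:.1%}")   # ~9.4%
print(f"weights at 8 bit: ~{total_params / 1e9:.0f} GB")          # memory scales with TOTAL params
print(f"FLOPs per token : ~{2 * active_params:.1e}")              # compute scales with ACTIVE params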

8. SmolLM3: A 3 B Model That Drops Positional Embeddings

  • NoPE (No Positional Embedding)

    • Removes all explicit position signals; relies solely on causal mask order.
    • The NoPE paper reports better length generalization in small-scale experiments.
    • SmolLM3 plays it safe and applies NoPE only in every 4th layer (see the layer-pattern sketch after this list).
  • Sweet spot: 3 B parameters, sitting between Qwen3 1.7 B and Qwen3 4 B.
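
A minimal sketch of the layer pattern that implies, assuming 36 decoder layers and counting the 4th, 8th, … layers as NoPE; the exact depth and indexing in SmolLM3 may differ.

num_layers = 36   # assumed depth for illustration

for layer_idx in range(num_layers):
    if (layer_idx + 1) % 4 == 0:
        kind = "NoPE -> no explicit position signal; order comes only from the causal mask"
    else:
        kind = "RoPE -> rotary position embedding applied to queries and keys"
    print(f"layer {layer_idx + 1:2d}: {kind}")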

9. Kimi 2: Scaling DeepSeek-V3 to 1 T Parameters

  • Architecture: same blueprint as DeepSeek-V3

    • More experts (exact count undisclosed)
    • Fewer MLA heads
  • Optimizer: first production-scale use of Muon instead of AdamW.
  • Claim: matches proprietary giants (Gemini, Claude, GPT-4o) on public benchmarks.
  • Status: largest open-weight model to date at 1 T total params.

10. One Comparison Table to Rule Them All

| Model | Total / Active Params | VRAM (Inference) | Key Feature | When to Use |
| --- | --- | --- | --- | --- |
| DeepSeek-V3 | 671 B / 37 B | Medium | MLA + MoE | Production, open-source |
| Llama 4 Maverick | 400 B / 17 B | Low | MoE, alternating with dense layers | Existing Llama ecosystem |
| Gemma 3 27B | 27 B / 27 B | High | Sliding-window attention | 24 GB GPU at home |
| Mistral Small 3.1 | 24 B / 24 B | Low | Fast tokenizer | Low-latency APIs |
| Qwen3 0.6B | 0.6 B / 0.6 B | Tiny | Smallest open model | On-device / education |
| Qwen3 235B-A22B | 235 B / 22 B | Medium | MoE, no shared expert | High-throughput serving |
| SmolLM3 3B | 3 B / 3 B | Low | NoPE | Personal projects |
| Kimi 2 | 1000 B / ? | High | 1 T scale | Research, SOTA demos |

11. FAQ: Quick Answers to Common Reader Questions

Q1: What exactly is KV cache and why does everyone try to shrink it?
A: In autoregressive generation we cache the previously computed Keys and Values so we don’t recompute them for every new token. KV cache size ≈ 2 (K and V) × batch × seq_len × num_kv_heads × head_dim × layers × bytes per element. Anything that cuts the sequence length, the number of K/V heads, or the head dimension saves GPU RAM.
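
Plugging illustrative numbers into that formula (a hypothetical 48-layer model with 8 KV heads of dimension 128, serving one 32k-token sequence in bf16):

# KV cache bytes = 2 (K and V) x batch x seq_len x kv_heads x head_dim x layers x bytes per element
batch, seq_len, kv_heads, head_dim, layers = 1, 32_768, 8, 128, 48
bytes_per_element = 2                                    # bf16 / fp16

cache_bytes = 2 * batch * seq_len * kv_heads * head_dim * layers * bytes_per_element
print(f"{cache_bytes / 2**30:.1f} GiB")                  # ~6.0 GiB for a single sequence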

Q2: Does MoE make serving infrastructure harder?
A: You need a router and expert parallelism, but major frameworks (vLLM, TensorRT-LLM, DeepSpeed) already support it out-of-the-box.
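
For example, with vLLM’s offline API, loading a MoE checkpoint looks the same as loading a dense one; the routing and expert placement happen inside the engine. The model id below is an assumption based on the Qwen3 release.

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-30B-A3B")            # assumed Hugging Face id for the 30B-A3B MoE model
outputs = llm.generate(["Explain mixture-of-experts in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)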

Q3: Will Post-Norm bring back gradient explosions?
A: OLMo 2 and Gemma 3 add RMSNorm + QK-Norm to mitigate that; results show stable loss curves.

Q4: Is NoPE safe for large models?
A: Evidence exists only up to ~100 M params. SmolLM3 uses it selectively; at 100 B+ scale you may still want RoPE or ALiBi.

Q5: How do I run Gemma 3 27 B on my Mac Mini?
A: M2 Pro 32 GB unified memory + 8-bit quantization gives ~8 tokens/s—good enough for chat.
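
The rough arithmetic behind that answer (ignoring the KV cache and runtime overhead):

params = 27e9
print(f"bf16 weights : ~{params * 2 / 1e9:.0f} GB")   # ~54 GB, does not fit in 32 GB unified memory
print(f"8-bit weights: ~{params * 1 / 1e9:.0f} GB")   # ~27 GB, leaves a few GB for the KV cache and the OS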


Closing Thoughts

Seven years on, the Transformer backbone is still standing, but the knobs around it—how we compress KV cache, route experts, or place normalization—have become the real battleground. Whether you’re shipping to production or tinkering on a laptop, the 2025 lineup has a model that fits. Pick one, benchmark it, and keep iterating; the next breakthrough may just be another combination of the tricks above.