From GPT-2 to Kimi 2: A Visual Guide to 2025’s Leading Large Language Model Architectures

If you already use large language models but still get lost in technical jargon, this post is for you. In one long read you’ll learn:

  • Why DeepSeek-V3, with 671 B parameters, is cheaper to run than Llama 3’s 405 B
  • How sliding-window attention lets a 27 B model run on a Mac Mini
  • Which open-weight model to download for your next side project

Table of Contents

  1. Seven Years of the Same Backbone—What Actually Changed?
  2. DeepSeek-V3 / R1: MLA + MoE, the Memory-Saving Duo
  3. OLMo 2: Moving RMSNorm One Step Back for Stable Training
  4. Gemma 3: Sliding-Window Attention Shrinks the KV Cache
  5. Mistral Small 3.1: 24 B Beating 27 B with Smaller Tricks
  6. Llama 4: Meta’s Take on Mixture-of-Experts
  7. Qwen3: Dense vs MoE—Pick Your Flavor
  8. SmolLM3: A 3 B Model That Drops Positional Embeddings
  9. Kimi 2: Scaling DeepSeek-V3 to 1 T Parameters
  10. One Comparison Table to Rule Them All
  11. FAQ: Quick Answers to Common Reader Questions

1. Seven Years of the Same Backbone—What Actually Changed?

| Component | 2019 GPT-2 | 2025 Mainstream | One-Sentence Summary |
| --- | --- | --- | --- |
| Position Encoding | Fixed Absolute | RoPE (Rotary) | Longer context, less memory |
| Attention | Multi-Head | GQA / MLA | Fewer K/V heads, smaller cache |
| Activation | GELU | SwiGLU | Faster and slightly more accurate |
| Feed-Forward | Single Dense | MoE or Slimmer FFN | Train huge, infer small |
| Norm Placement | Post-LN | Pre-LN / Hybrid | More stable training dynamics |

2. DeepSeek-V3 / R1: MLA + MoE, the Memory-Saving Duo

2.1 Multi-Head Latent Attention (MLA)

  • Problem: The KV cache is the main GPU-memory bottleneck during inference.
  • Classic Fix (GQA): Group query heads so that several of them (e.g., two) share one K/V pair.
  • DeepSeek’s Fix (MLA): Compress K/V into a low-dimensional latent before caching, then decompress on the fly.

    • Cache size drops ≈ 50 %.
    • DeepSeek-V2 ablations show MLA outperforms vanilla MHA and GQA.
# Pseudocode for MLA decompression at inference time
compressed_kv = cache[token_id]         # low-dimensional latent stored in the cache, (batch, latent_dim)
key, value = up_project(compressed_kv)  # expanded back to full-size K and V on the fly
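
For a more concrete picture, here is a minimal runnable sketch of the idea in PyTorch. All module names (down_proj, up_proj_kv) and dimensions are illustrative assumptions, not DeepSeek’s actual implementation, and RoPE handling is omitted.

import torch
import torch.nn as nn

batch, d_model, latent_dim, n_heads, head_dim = 2, 1024, 128, 8, 64

# Compression path (cache-write time): project the token's hidden state
# down to a small latent and store only that in the KV cache.
down_proj = nn.Linear(d_model, latent_dim, bias=False)
# Decompression path (read time): expand the latent back to full-size K and V.
up_proj_kv = nn.Linear(latent_dim, 2 * n_heads * head_dim, bias=False)

hidden = torch.randn(batch, d_model)          # hidden state of the newest token
latent = down_proj(hidden)                    # (batch, latent_dim) -> this is what gets cached
k, v = up_proj_kv(latent).chunk(2, dim=-1)    # decompress on the fly
k = k.view(batch, n_heads, head_dim)
v = v.view(batch, n_heads, head_dim)
print(latent.shape, k.shape)                  # the cache holds 128 values per token instead of 2 * 8 * 64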

2.2 Mixture-of-Experts (MoE)

  • Concept: Replace each Feed-Forward block with 256 smaller expert FFNs, but let a router activate only 8 of them plus 1 shared expert per token (see the routing sketch after this list).
  • Numbers that matter

    • Total params: 671 B
    • Active params: 37 B (5.5 %)
  • Shared Expert Trick

    • Always-on expert for common patterns, freeing the others to specialize.
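
Here is a minimal sketch of that routing step, with a generic top-k router plus one always-on shared expert. The sizes (8 experts, top-2) and all names are illustrative assumptions rather than DeepSeek’s code, and the per-token Python loop trades speed for readability.

import torch
import torch.nn as nn

d_model, d_hidden, num_experts, top_k = 256, 512, 8, 2   # toy sizes; DeepSeek-V3 uses 256 experts, 8 active

def make_ffn():
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))

experts = nn.ModuleList([make_ffn() for _ in range(num_experts)])  # routed experts
shared_expert = make_ffn()                                         # always-on shared expert
router = nn.Linear(d_model, num_experts, bias=False)               # scores each expert per token

def moe_forward(x):                                   # x: (num_tokens, d_model)
    scores = router(x).softmax(dim=-1)                # routing probabilities
    weights, idx = scores.topk(top_k, dim=-1)         # keep only the top-k experts per token
    outputs = []
    for t in range(x.shape[0]):
        routed = sum(w * experts[e](x[t]) for w, e in zip(weights[t], idx[t].tolist()))
        outputs.append(shared_expert(x[t]) + routed)  # shared expert handles common patterns
    return torch.stack(outputs)

print(moe_forward(torch.randn(4, d_model)).shape)     # torch.Size([4, 256])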

3. OLMo 2: Moving RMSNorm One Step Back for Stable Training

| Placement | Original Transformer | GPT-2 / Llama 3 | OLMo 2 |
| --- | --- | --- | --- |
| Where is the norm? | After the block (Post-LN) | Before the block (Pre-LN) | After the block, but inside the residual |
  • Why change? Authors show Post-Norm + RMSNorm + QK-Norm yields smoother loss curves (see their Figure 9).
  • QK-Norm = an extra RMSNorm inside the attention layer, applied to queries and keys before RoPE (a minimal sketch follows this list).
  • Take-away: If your pre-training keeps diverging, try OLMo 2’s recipe.
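
To make QK-Norm concrete, here is a minimal sketch of the query/key path with RMSNorm applied per head before RoPE. It assumes PyTorch ≥ 2.4 for nn.RMSNorm; the module names and the omitted apply_rope step are illustrative, not OLMo 2’s actual code.

import torch
import torch.nn as nn

batch, seq_len, d_model, n_heads = 2, 16, 512, 8
head_dim = d_model // n_heads

w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
q_norm = nn.RMSNorm(head_dim)   # QK-Norm: RMSNorm over each head's query vector...
k_norm = nn.RMSNorm(head_dim)   # ...and key vector, applied before RoPE

x = torch.randn(batch, seq_len, d_model)
q = q_norm(w_q(x).view(batch, seq_len, n_heads, head_dim))
k = k_norm(w_k(x).view(batch, seq_len, n_heads, head_dim))
# q, k = apply_rope(q), apply_rope(k)   # rotary embedding would be applied here (omitted)
print(q.shape, k.shape)                 # torch.Size([2, 16, 8, 64]) each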

4. Gemma 3: Sliding-Window Attention Shrinks the KV Cache

4.1 Sliding-Window Attention

  • Local vs Global

    • Local: each token attends only to the most recent 1024 tokens (its sliding window).
    • Global: full self-attention.
  • Gemma 3 ratio: 5 local layers : 1 global layer.
  • Memory win: the KV cache is about 5× smaller with a negligible perplexity increase (see Gemma 3 paper, Figure 13); a minimal masking sketch follows this list.
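
The sketch below shows how a local (sliding-window) mask differs from a full causal mask, with a toy window of 4 tokens rather than Gemma 3’s 1024; the helper names are assumptions for illustration.

import torch

def causal_mask(seq_len):
    # Global attention: token i may attend to every earlier token j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len, window):
    # Local attention: token i may attend only to the last `window` tokens, itself included.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

print(causal_mask(8).sum(dim=-1))             # tensor([1, 2, 3, 4, 5, 6, 7, 8]) keys per query
print(sliding_window_mask(8, 4).sum(dim=-1))  # tensor([1, 2, 3, 4, 4, 4, 4, 4]) capped at the window
# For local layers the KV cache never needs more than `window` past tokens per sequence.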

4.2 Sandwich Norm

  • RMSNorm before and after the attention block.
  • Combines Pre-Norm stability with Post-Norm signal strength (a minimal sketch of the pattern follows below).
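
Here is a minimal sketch of the sandwich placement, assuming PyTorch’s nn.RMSNorm and a generic sub-layer (a feed-forward block in the demo); this illustrates the pattern, not Gemma 3’s actual module.

import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    """RMSNorm before and after the sub-layer; the residual wraps the whole sandwich."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.pre_norm = nn.RMSNorm(d_model)    # Pre-Norm: stabilizes what the sub-layer sees
        self.post_norm = nn.RMSNorm(d_model)   # Post-Norm: re-normalizes what it emits
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))

ffn = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
block = SandwichBlock(256, ffn)
print(block(torch.randn(2, 10, 256)).shape)    # torch.Size([2, 10, 256])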

5. Mistral Small 3.1: 24 B Beating 27 B with Smaller Tricks

  • Dropped sliding-window attention → uses vanilla GQA + FlashAttention for lower latency.
  • Shrunk tokenizer vocabulary → fewer tokens per sentence → faster generation.
  • Fewer layers → less serial computation.
  • Benchmark snapshot: beats Gemma 3 27 B on most tasks except math.

6. Llama 4: Meta’s Take on Mixture-of-Experts

| Metric | DeepSeek-V3 | Llama 4 Maverick |
| --- | --- | --- |
| Total Params | 671 B | 400 B |
| Active Params | 37 B | 17 B |
| Experts per layer | 256 (9 active) | 128 (2 active) |
| Expert Hidden Dim | 2048 | 8192 |
| MoE frequency | Every layer after the first 3 | Every other layer |
  • Key difference: Llama 4 alternates MoE and dense layers; DeepSeek goes all-in.
  • Impact: Still too early to call a winner, but both prove MoE is mainstream in 2025.

7. Qwen3: Dense vs MoE—Pick Your Flavor

7.1 Dense Models (0.6 B → 32 B)

  • 0.6 B checkpoint: smallest current-gen open model; runs on a laptop CPU.
  • Architecture: deeper and narrower than Llama 3 1 B → smaller memory footprint.

7.2 MoE Models

  • 30B-A3B & 235B-A22B

    • A22B = 22 B active parameters out of 235 B total.
    • No shared expert, unlike the earlier Qwen2.5-MoE models.
    • Use-case matrix

      • Dense: easier fine-tuning, predictable latency.
      • MoE: higher knowledge capacity at a fixed per-token inference cost (a quick parameter-count check follows this list).
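
A quick back-of-the-envelope check of that trade-off, using the Qwen3 235B-A22B numbers from above; the “2 × params” FLOPs-per-token rule and the 8-bit weight size are rough approximations.

total_params = 235e9    # every expert must be stored (downloads, weight memory)
active_params = 22e9    # but only ~22 B parameters touch each token

print(f"active fraction : {active_params / total_params:.1%}")   # ~9.4%
print(f"weights at 8 bit: ~{total_params / 1e9:.0f} GB")          # memory scales with TOTAL params
print(f"FLOPs per token : ~{2 * active_params:.1e}")              # compute scales with ACTIVE params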

8. SmolLM3: A 3 B Model That Drops Positional Embeddings

  • NoPE (No Positional Embedding)

    • Removes all explicit position signals; relies solely on causal mask order.
    • The NoPE paper reports better length generalization in small-scale experiments.
    • SmolLM3 plays it safe and applies NoPE only in every 4th layer (see the layer-pattern sketch after this list).
  • Sweet spot: 3 B parameters, sitting between Qwen3 1.7 B and Qwen3 4 B.
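
A minimal sketch of the layer pattern that implies, assuming 36 decoder layers and counting the 4th, 8th, … layers as NoPE; the exact depth and indexing in SmolLM3 may differ.

num_layers = 36   # assumed depth for illustration

for layer_idx in range(num_layers):
    if (layer_idx + 1) % 4 == 0:
        kind = "NoPE -> no explicit position signal; order comes only from the causal mask"
    else:
        kind = "RoPE -> rotary position embedding applied to queries and keys"
    print(f"layer {layer_idx + 1:2d}: {kind}")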

9. Kimi 2: Scaling DeepSeek-V3 to 1 T Parameters

  • Architecture: same blueprint as DeepSeek-V3

    • More experts (exact count undisclosed)
    • Fewer MLA heads
  • Optimizer: first production-scale use of Muon instead of AdamW.
  • Claim: matches proprietary giants (Gemini, Claude, GPT-4o) on public benchmarks.
  • Status: largest open-weight model to date at 1 T total params.

10. One Comparison Table to Rule Them All

| Model | Total / Active Params | VRAM (Inference) | Key Feature | When to Use |
| --- | --- | --- | --- | --- |
| DeepSeek-V3 | 671 B / 37 B | Medium | MLA + MoE | Production, open-source |
| Llama 4 Maverick | 400 B / 17 B | Low | MoE, alternating with dense layers | Existing Llama ecosystem |
| Gemma 3 27B | 27 B / 27 B | High | Sliding-window attention | 24 GB GPU at home |
| Mistral Small 3.1 | 24 B / 24 B | Low | Fast tokenizer | Low-latency APIs |
| Qwen3 0.6B | 0.6 B / 0.6 B | Tiny | Smallest open model | On-device / education |
| Qwen3 235B-A22B | 235 B / 22 B | Medium | MoE, no shared expert | High-throughput serving |
| SmolLM3 3B | 3 B / 3 B | Low | NoPE | Personal projects |
| Kimi 2 | 1000 B / ? | High | 1 T scale | Research, SOTA demos |

11. FAQ: Quick Answers to Common Reader Questions

Q1: What exactly is KV cache and why does everyone try to shrink it?
A: In autoregressive generation we cache the previously computed Keys and Values so we don’t recompute them for every new token. KV cache size ≈ 2 (K and V) × batch × seq_len × num_kv_heads × head_dim × layers × bytes per element. Anything that cuts the sequence length, the number of K/V heads, or the head dimension saves GPU RAM.
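
Plugging illustrative numbers into that formula (a hypothetical 48-layer model with 8 KV heads of dimension 128, serving one 32k-token sequence in bf16):

# KV cache bytes = 2 (K and V) x batch x seq_len x kv_heads x head_dim x layers x bytes per element
batch, seq_len, kv_heads, head_dim, layers = 1, 32_768, 8, 128, 48
bytes_per_element = 2                                    # bf16 / fp16

cache_bytes = 2 * batch * seq_len * kv_heads * head_dim * layers * bytes_per_element
print(f"{cache_bytes / 2**30:.1f} GiB")                  # ~6.0 GiB for a single sequence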

Q2: Does MoE make serving infrastructure harder?
A: You need a router and expert parallelism, but major frameworks (vLLM, TensorRT-LLM, DeepSpeed) already support it out-of-the-box.
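
For example, with vLLM’s offline API, loading a MoE checkpoint looks the same as loading a dense one; the routing and expert placement happen inside the engine. The model id below is an assumption based on the Qwen3 release.

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-30B-A3B")            # assumed Hugging Face id for the 30B-A3B MoE model
outputs = llm.generate(["Explain mixture-of-experts in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)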

Q3: Will Post-Norm bring back gradient explosions?
A: OLMo 2 and Gemma 3 add RMSNorm + QK-Norm to mitigate that; results show stable loss curves.

Q4: Is NoPE safe for large models?
A: Evidence exists only up to ~100 M params. SmolLM3 uses it selectively; at 100 B+ scale you may still want RoPE or ALiBi.

Q5: How do I run Gemma 3 27 B on my Mac Mini?
A: M2 Pro 32 GB unified memory + 8-bit quantization gives ~8 tokens/s—good enough for chat.
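
The rough arithmetic behind that answer (ignoring the KV cache and runtime overhead):

params = 27e9
print(f"bf16 weights : ~{params * 2 / 1e9:.0f} GB")   # ~54 GB, does not fit in 32 GB unified memory
print(f"8-bit weights: ~{params * 1 / 1e9:.0f} GB")   # ~27 GB, leaves a few GB for the KV cache and the OS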


Closing Thoughts

Seven years on, the Transformer backbone is still standing, but the knobs around it—how we compress KV cache, route experts, or place normalization—have become the real battleground. Whether you’re shipping to production or tinkering on a laptop, the 2025 lineup has a model that fits. Pick one, benchmark it, and keep iterating; the next breakthrough may just be another combination of the tricks above.