
How DeepSeek’s Engram Makes LLMs Cheaper & Smarter: The N-gram Lookup Table Breakthrough

Offload Memorization to a Lookup Table, Let the GPU Reason: How DeepSeek’s Engram Makes LLMs Both Cheaper and Smarter

「Bottom line up front」
Transformers burn layers reconstructing static facts that could be retrieved in one hop. Engram adds an O(1) N-gram lookup table beside the MoE experts, keeps the same parameter and FLOP budget, and immediately gains 3–5 pts on knowledge, reasoning, code and long-context benchmarks.


What this article will answer

  1. What exactly is Engram and is it a friend or foe to MoE?
  2. Why does a simple lookup table boost MMLU, BBH, HumanEval and even 32 k-needle tests at the same time?
  3. How do I run the public demo and plug a miniature Engram into my own small model?
  4. Where should the table live at inference and will my latency explode?
  5. Which design choices actually matter if I only have a single GPU?

1 The missing primitive: why Transformers simulate memory with computation

「Core question:」 “If MoE already sparsely activates experts, why do we need another sparsity axis?”

「One-sentence answer:」 Because MoE is conditional computation; there is no conditional memory primitive, so the model is forced to rebuild static patterns layer by layer.

Language modelling contains two qualitatively different workloads:

  • 「Compositional reasoning」 – needs deep, dynamic computation.
  • 「Static knowledge」 – named entities, formulaic phrases, local collocations that rarely change.

In standard Transformers the second workload is handled by the same attention + FFN stack. Ghandeharioun et al. showed that recognising “Diana, Princess of Wales” consumes layers 1–5, a clear waste of sequential depth. Engram turns that reconstruction into a hash-table walk:

| Phase | Classic MoE | Engram-augmented |
|---|---|---|
| Early layers | Rebuild the “Princess of Wales” pattern | Hash “princess-of-wales” → static vector |
| Late layers | Free to reason | Extra effective depth for real logic |

「Author’s reflection」
When I first saw the LogitLens curves I was shocked: layer-2 representations in Engram-27B are already as “prediction-ready” as layer-5 in the MoE baseline. We basically got three free layers without adding any FLOPs—just by stopping the model from doing rote memorisation twice.


2 Engram in one picture: hash, gate, convolve, add

(Figure: Engram overview)

「Summary:」 Suffix N-grams are compressed, hashed, retrieved, context-gated, lightly convolved and finally added back to the residual stream—deterministically, without dynamic routing.

2.1 Tokeniser compression

Sub-word tokenisers create superficial duplicates (“Apple” vs “ apple”). A learnt surjective map collapses them into canonical IDs, shrinking the effective vocabulary by 23 % and reducing hash collisions.
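
For intuition, here is one way such a canonical map could be approximated offline. The paper learns the mapping; this sketch simply collapses NFKC-normalised, lower-cased duplicates and assumes a Hugging Face-style tokenizer that exposes get_vocab():

    import unicodedata

    def build_canonical_map(tokenizer):
        """Map every token id to the id of its canonical (NFKC + lower-case) form."""
        first_seen = {}                       # canonical string → canonical token id
        canon = {}
        for tok, tok_id in tokenizer.get_vocab().items():
            key = unicodedata.normalize("NFKC", tok).strip().lower()
            first_seen.setdefault(key, tok_id)
            canon[tok_id] = first_seen[key]   # "Apple", " apple", "APPLE" → one id
        return canon                          # plug into canon_id() in Section 9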

2.2 Multi-head hashing

Each suffix N-gram (order 2 or 3) is hashed by k independent multiplicative-XOR functions into k separate embedding tables. Concatenating the k retrieved vectors yields the raw memory vector 「e」_t.
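
A rough PyTorch sketch of the idea; the head count, table size and mixing constants below are illustrative assumptions rather than the paper's exact values, and each N-gram key is assumed to be a single integer built from the canonical token ids:

    import torch
    import torch.nn as nn

    class MultiHeadNgramHash(nn.Module):
        def __init__(self, num_heads=8, table_size=2_000_003, dim=128, seed=0):
            super().__init__()
            g = torch.Generator().manual_seed(seed)
            # one odd multiplier per head for the multiplicative part of the hash
            mults = torch.randint(1, 2**31, (num_heads,), generator=g) * 2 + 1
            self.register_buffer("mults", mults)
            self.tables = nn.ModuleList(nn.Embedding(table_size, dim)
                                        for _ in range(num_heads))
            self.table_size = table_size

        def forward(self, keys):              # keys: [batch, seq] int64 N-gram keys
            vecs = []
            for mult, table in zip(self.mults, self.tables):
                h = (keys * mult) ^ (keys >> 16)          # multiplicative-XOR mixing
                vecs.append(table(h % self.table_size))   # per-head lookup
            return torch.cat(vecs, dim=-1)                # raw memory vector e_t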

2.3 Context-aware gate

The current hidden state 「h」_t queries 「e」_t through an RMSNorm, a dot product and a sigmoid, producing a gate α_t ∈ (0, 1). If the context contradicts the table entry, α_t → 0 and the mismatched memory is suppressed.

2.4 Depth-wise convolution

A causal 1-D convolution (kernel = 4, dilation = N) adds non-linearity and broadens the receptive field with only a negligible parameter overhead.

2.5 Residual injection

「Y」 = SiLU(Conv(Norm(α ⊙ 「e」))) + 「e」
「H」^(ℓ) ← 「H」^(ℓ) + 「Y」

The backbone attention and MoE layers remain untouched—Engram is simply another branch.
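
Putting the pieces together, a compact sketch of the branch might look like the following; the projection of the memory vector to the model width, the exact normalisation placement and the causal-padding trick are my assumptions (nn.RMSNorm requires PyTorch ≥ 2.4):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EngramBranch(nn.Module):
        def __init__(self, hidden_dim, mem_dim, kernel=4, dilation=2):
            super().__init__()
            self.proj   = nn.Linear(mem_dim, hidden_dim, bias=False)  # match widths
            self.h_norm = nn.RMSNorm(hidden_dim)
            self.m_norm = nn.RMSNorm(hidden_dim)
            pad = (kernel - 1) * dilation                  # left context only → causal
            self.conv = nn.Conv1d(hidden_dim, hidden_dim, kernel, dilation=dilation,
                                  groups=hidden_dim, padding=pad)   # depth-wise conv

        def forward(self, h, e):                           # h: [B, T, d_h], e: [B, T, d_m]
            e = self.proj(e)
            alpha = torch.sigmoid((self.h_norm(h) * e).sum(-1, keepdim=True))  # gate α_t
            x = self.m_norm(alpha * e).transpose(1, 2)     # [B, d_h, T] for Conv1d
            x = self.conv(x)[..., : h.size(1)].transpose(1, 2)  # crop future positions
            y = F.silu(x) + e                              # Y = SiLU(Conv(Norm(α ⊙ e))) + e
            return h + y                                   # residual injection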


3 Scaling law: how much capacity should you give to memory vs experts?

「Core question:」 “Given a fixed parameter and FLOP budget, what split between MoE experts and Engram slots minimises loss?”

「One-sentence answer:」 Roughly 75 % experts / 25 % memory hits the sweet spot; pure MoE or pure memory are both worse.

Experimental setup

  • Two compute iso-budgets: 2×10²⁰ and 6×10²⁰ FLOPs
  • Constant sparse ratio P_tot/P_act ≈ 10
  • Vary ρ = fraction of inactive parameters assigned to experts
| ρ | Val loss (6×10²⁰ budget) | Note |
|---|---|---|
| 1.0 (pure MoE) | 1.7248 | baseline |
| 0.8 | 1.7109 | ↓0.0139, 「optimal」 |
| 0.6 | 1.7130 | still better than baseline |
| 0.0 (pure memory) | 1.745 | lacks dynamic capacity |

「Take-away」
There is a clear U-shape. Off-loading 20-25 % of the inactive budget to Engram embedding slots buys the biggest gain; move further and the model starts to lack conditional computation horsepower.
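
To make the split concrete, the Engram-27B numbers in Section 4 line up with this rule almost exactly; a quick back-of-the-envelope check (not code from the paper):

    total, active = 26.7e9, 3.8e9       # Engram-27B parameter counts from Section 4
    inactive = total - active           # conditionally used (sparse) capacity
    memory   = 5.7e9                    # parameters given to the Engram table
    print(f"memory share of inactive budget: {memory / inactive:.0%}")   # ≈ 25 %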


4 Full-scale pre-training: same FLOPs, higher scores

「Core question:」 “Does the lab-scale law survive a 27 B production run?”

「One-sentence answer:」 Yes—Engram-27B strictly outperforms an iso-parameter iso-FLOPs MoE-27B on knowledge, reasoning, code, math and long-context suites.

Model lineup (262 B tokens, identical data order)

  • Dense-4B: 4.1 B total, 3.8 B active
  • MoE-27B: 26.7 B total, 3.8 B active (72 routed experts)
  • Engram-27B: 26.7 B total, 3.8 B active (55 experts + 5.7 B memory)
  • Engram-40B: 39.5 B total, same compute (memory scaled to 18.5 B)
| Benchmark | MoE-27B | Engram-27B | Δ |
|---|---|---|---|
| MMLU 5-shot | 57.4 | 60.4 | +3.0 |
| BBH 3-shot | 50.9 | 55.9 | +5.0 |
| HumanEval 0-shot | 37.8 | 40.8 | +3.0 |
| MATH 4-shot | 28.3 | 30.7 | +2.4 |
| DROP F1 | 55.7 | 59.0 | +3.3 |

Loss curves in Appendix B show the gap widening toward the end of training, indicating that the memory module has not yet saturated.


5 Long-context extension: attention finally has room for the big picture

「Core question:」 “Will off-loading local dependencies improve very long sequences?”

「One-sentence answer:」 Dramatically—Multi-Query NIAH jumps from 84.2 → 97.0 with 18 % less pre-training compute.

Protocol

  • Context window expanded to 32 k via YaRN (5 k steps)
  • Compare checkpoints of equal pre-training loss (iso-loss) to isolate architectural effect
| Setting | MoE-27B (50k) | Engram-27B (46k, iso-loss) |
|---|---|---|
| LongPPL (32k) | 4.38 | 4.19 |
| NIAH Multi-Query | 84.2 | 97.0 |
| Variable Track | 77.0 | 87.2 |

「Author’s reflection」
I used to think long-context was purely an attention-pattern problem. These numbers convinced me that memory locality is equally important—if the early layers are busy piecing together “Albert Einstein”, they literally can’t see the 30 k-token needle on the horizon.


6 Mechanistic evidence: Engram deepens the network without adding layers

「Core question:」 “Is the gain simply more parameters, or does lookup really free up depth?”

「One-sentence answer:」 LogitLens and CKA both show that Engram layers converge to the final prediction earlier: layer-5 Engram representations match layer-12 MoE representations, strong evidence that the effective depth has increased.

(Figure: CKA similarity map)
  • 「KL divergence vs layer」: Engram curves drop faster.
  • 「Soft alignment index a_j」 (centroid of top-5 similar layers): consistently a_j > j, i.e. shallow Engram ≈ deeper MoE.
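
The KL-vs-layer curves above come from a LogitLens-style probe. A minimal version of such a probe, assuming a Hugging Face-style model that returns per-layer hidden states and exposes its final norm and unembedding (the argument names here are placeholders):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def logitlens_kl(hidden_states, final_norm, unembed, final_logits):
        """KL between each layer's early-exit prediction and the final prediction."""
        p_final = F.log_softmax(final_logits, dim=-1)
        kls = []
        for h in hidden_states:                   # one hidden-state tensor per layer
            q = F.log_softmax(unembed(final_norm(h)), dim=-1)
            kl = F.kl_div(q, p_final, log_target=True, reduction="batchmean")
            kls.append(kl.item())
        return kls                                # drops earlier ⇒ more "prediction-ready"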

7 Ablation and sensitivity: what matters, what doesn’t

「Core question:」 “Which knobs can I turn without hurting the win?”

「One-sentence answer:」 Tokeniser compression, context gating and multi-branch fusion give the biggest lifts; convolution and 4-gram order are minor.

Single-variable ablation (1.6 B memory, 3 B MoE backbone)

| Component removed | Val loss | Δ |
|---|---|---|
| None (reference) | 1.768 | – |
| Tokeniser compression | 1.773 | +0.005 |
| Multi-branch fusion | 1.770 | +0.002 |
| Context gating | 1.770 | +0.002 |
| Convolution | 1.768 | +0.000 |

Layer sweep: inserting a single 1.6 B module at layer 2 gives 1.770; moving it deeper degrades the loss monotonically. Splitting the same budget into two smaller modules at layers 2 and 6 recovers the best score (1.768) and also makes the host-memory transfers easier to hide.


8 Inference with 100 B parameters on the host: does it lag?

「Core question:」 “Will PCIe transfer kill my serving QPS?”

「One-sentence answer:」 No—deterministic addressing allows prefetch; latency overhead <3 % on H800.

Setup

  • 100 B embedding table in host DRAM
  • nano-vLLM harness, 512 seq × 1 k tokens, H800 GPU
  • Prefetch starts right after input tokenisation, overlapping with layer-0 computation
| Backbone | Baseline tok/s | +Engram off-load | Δ |
|---|---|---|---|
| 4 B dense | 9 031 | 8 858 | −1.9 % |
| 8 B dense | 6 315 | 6 140 | −2.8 % |

Because only activated rows travel across PCIe, traffic scales with batch×seq×heads, not 100 B. A production system can further cache the hottest 5 % in GPU HBM and cut the penalty to <1 %.
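
For illustration, the prefetch/overlap pattern could look like this in PyTorch; the table shape, staging buffer and stream handling are my assumptions, not the paper's serving code:

    import torch

    DIM, SLOTS, MAX_ROWS = 1024, 2_000_003, 1 << 20            # assumed sizes
    host_table = torch.empty(SLOTS, DIM).pin_memory()          # table stays in host DRAM
    staging    = torch.empty(MAX_ROWS, DIM).pin_memory()       # pinned staging buffer
    copy_stream = torch.cuda.Stream()

    def prefetch(row_ids):                                     # hash indices, known right after tokenisation
        n = row_ids.numel()
        torch.index_select(host_table, 0, row_ids, out=staging[:n])   # gather only activated rows on CPU
        with torch.cuda.stream(copy_stream):
            return staging[:n].to("cuda", non_blocking=True)   # async H2D copy over PCIe

    def consume(gpu_rows):
        torch.cuda.current_stream().wait_stream(copy_stream)   # sync just before the Engram layer
        return gpu_rows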


9 Mini cookbook: plugging a 1.6 B Engram into a 3 B MoE in four steps

「Core question:」 “I have one RTX 4090 and 100 B tokens—can I feel the magic?”

「One-sentence answer:」 Yes—shrink embedding dim to 512, insert modules at layers 2 and 6, follow the learning-rate recipe, and you should see ~0.04 loss drop within 20 B tokens.

Step-by-step

  1. 「Vocabulary projection」
    # nfkc_lower_map: {token id → canonical id}, built once from the tokeniser
    # vocabulary (NFKC-normalised + lower-cased); unmapped ids pass through unchanged.
    def canon_id(tok_id):
        return nfkc_lower_map.get(tok_id, tok_id)
    
  2. 「Build 2 hash tables」 (prime sizes ~2 M slots each)
  3. 「Insert after layer-norm, before dropout」
    mem = engram_lookup(seq, layer_id=2)                             # retrieved memory vectors [seq, dim]
    alpha = sigmoid((rmsnorm(hidden) * mem).sum(-1, keepdim=True))   # per-token gate α_t (dot product, not matmul)
    hidden = hidden + silu(conv1d(alpha * mem))                      # causal depth-wise conv + residual add
    
  4. 「Optimizer」
    • Backbone: your usual Muon/AdamW
    • Engram embed: Adam, lr = 5×backbone, wd = 0
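
A possible way to wire step 4, collapsed into a single AdamW with per-group settings for simplicity (the recipe itself uses Muon/AdamW for the backbone and plain Adam for the table; the "engram" naming convention and the base learning rate are assumptions, and `model` is your backbone plus Engram module):

    import torch

    engram_params   = [p for n, p in model.named_parameters() if "engram" in n]
    backbone_params = [p for n, p in model.named_parameters() if "engram" not in n]

    base_lr = 3e-4                                    # assumed backbone learning rate
    optimizer = torch.optim.AdamW([
        {"params": backbone_params, "lr": base_lr,     "weight_decay": 0.1},
        {"params": engram_params,   "lr": 5 * base_lr, "weight_decay": 0.0},  # lr ×5, wd = 0
    ])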

「Author’s reflection」
My first attempt put the table at layer 10 because “deeper is better”; the loss barely moved. Moving it to layer 2 gave an instant 0.003 drop. Lesson: off-load early, not just anywhere.


10 Action checklist / implementation cheat-sheet

  • [ ] Compress tokeniser (canonical lower-case + NFKC)
  • [ ] Pick two early layers (≥layer 2, ≤layer 6)
  • [ ] Hash heads = 8, prime table size ≈ 2–3 M each
  • [ ] Dim = hidden_size/2 for first test
  • [ ] Gate with RMSNorm-dot-sigmoid
  • [ ] Conv1d kernel=4, dilation=max-n-gram
  • [ ] Embedding optimizer: Adam, lr ×5, wd=0
  • [ ] Host-offload table, enable prefetch overlap
  • [ ] Cache top-5 % hot rows in GPU memory
  • [ ] Evaluate on factual QA and long-context needle—expect +3–12 pt gains

One-page overview

Engram adds a deterministic N-gram lookup table beside MoE experts. It costs no extra FLOPs, frees early layers from memorising static facts, and effectively deepens the network. Under strict iso-parameter iso-FLOP training, the 27 B variant beats pure MoE on MMLU (+3.0), BBH (+5.0), HumanEval (+3.0) and 32 k-needle (+12.8) while running inference with <3 % latency penalty when the 100 B table lives in host memory. The optimal split is ~75 % experts / 25 % memory. Tokeniser compression, context gating and early-layer insertion are the critical design choices. A 1.6 B table already improves a 3 B MoE; code and long-context tasks benefit most.


FAQ

「Q1. Does Engram replace the embedding layer?」
No. Input and output embeddings stay unchanged; Engram is an additive memory branch.

「Q2. Will hash collisions inject wrong facts?」
Multi-head hashing + context gating keeps effective collision rate <0.1 % and suppresses mismatched memories.

「Q3. Can I retrofit an already trained MoE?」
Yes—freeze the backbone for a few hundred steps while the new table warms up, then continue normal training.
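
A minimal warm-up sketch for such a retrofit (the parameter naming, the 500-step warm-up length and the bare training loop are illustrative assumptions):

    # Train only the new Engram parameters for a short warm-up, then unfreeze everything.
    for name, p in model.named_parameters():
        p.requires_grad_("engram" in name)

    for step, batch in enumerate(warmup_loader):
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step == 500:                               # ~a few hundred steps
            for p in model.parameters():
                p.requires_grad_(True)                # resume normal training
            break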

「Q4. How large must the table be to see a gain?」
1.6 B parameters (≈ 2 M slots × 8 heads × 128 dim) is enough for a 3 B backbone; scaling continues linearly to at least 100 B.

「Q5. Does host off-loading hurt throughput?」
Measured penalty is <3 % on H800; with a two-level cache (GPU + host) it drops below 1 %.

「Q6. Is the module language-specific?」
The published experiments use Chinese and English; the hash works on byte sequences, so no language assumption is hard-coded.

「Q7. What if my model is not MoE?」
Engram is architecture-agnostic; replace the MoE feed-forward with a dense FFN and keep the same insertion protocol.

「Q8. Which ablation hurts the most?」
Removing tokeniser compression or context gating each costs ~0.005 loss; removing both wipes out half the benefit.
