Offload Memorization to a Lookup Table, Let the GPU Reason: How DeepSeek’s Engram Makes LLMs Both Cheaper and Smarter
❝
「Bottom line up front」
Transformers burn layers reconstructing static facts that could be retrieved in one hop. Engram adds an O(1) N-gram lookup table beside the MoE experts, keeps the same parameter and FLOP budget, and immediately gains 3–5 pts on knowledge, reasoning, code and long-context benchmarks.❞
What this article will answer
- What exactly is Engram and is it a friend or foe to MoE?
- Why does a simple lookup table boost MMLU, BBH, HumanEval and even 32 k-needle tests at the same time?
- How do I run the public demo and plug a miniature Engram into my own small model?
- Where should the table live at inference and will my latency explode?
- Which design choices actually matter if I only have a single GPU?
1 The missing primitive: why Transformers simulate memory with computation
「Core question:」 “If MoE already sparsely activates experts, why do we need another sparsity axis?”
「One-sentence answer:」 Because MoE is conditional computation; there is no conditional memory primitive, so the model is forced to rebuild static patterns layer by layer.
Language modelling contains two qualitatively different workloads:
- 「Compositional reasoning」 – needs deep, dynamic computation.
- 「Static knowledge」 – named entities, formulaic phrases, local collocations that rarely change.
In standard Transformers the second workload is handled by the same attention + FFN stack. Ghandeharioun et al. showed that recognising “Diana, Princess of Wales” consumes layers 1–5, a clear waste of sequential depth. Engram turns that reconstruction into a hash-table walk:
| Phase | Classic MoE | Engram-augmented |
|---|---|---|
| Early layers | Rebuild “Princess of Wales” pattern | Hash “princess-of-wales” → static vector |
| Late layers | Free to reason | Extra effective depth for real logic |
「Author’s reflection」
When I first saw the LogitLens curves I was shocked: layer-2 representations in Engram-27B are already as “prediction-ready” as layer-5 in the MoE baseline. We basically got three free layers without adding any FLOPs—just by stopping the model from doing rote memorisation twice.
2 Engram in one picture: hash, gate, convolve, add

「Summary:」 Suffix N-grams are compressed, hashed, retrieved, context-gated, lightly convolved and finally added back to the residual stream—deterministically, without dynamic routing.
2.1 Tokeniser compression
Sub-word tokenisers create superficial duplicates (“Apple” vs “ apple”). A learnt surjective map collapses them into canonical IDs, shrinking the effective vocabulary by 23 % and reducing hash collisions.
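The released map is learnt during training; for intuition, here is a minimal rule-based sketch (NFKC + lower-case + whitespace stripping) of how such a surjective ID map could be built from a tokeniser vocabulary. The function name and the rule set are illustrative, not the paper's recipe.

```python
import unicodedata

def build_canonical_map(vocab: dict[str, int]) -> dict[int, int]:
    """Collapse superficial variants ("Apple", " apple") onto one canonical token ID.
    Rule-based stand-in for the learnt surjective map described above."""
    canon_of: dict[str, int] = {}   # normalised surface form -> canonical ID
    id_map: dict[int, int] = {}     # raw token ID -> canonical ID (surjective)
    for piece, tok_id in vocab.items():
        key = unicodedata.normalize("NFKC", piece).strip().lower()
        canon_of.setdefault(key, tok_id)   # first variant seen becomes the canonical ID
        id_map[tok_id] = canon_of[key]
    return id_map
```

This is roughly the `nfkc_lower_map` that the cookbook's `canon_id` helper in Section 9 assumes.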
2.2 Multi-head hashing
Each suffix N-gram (order 2 or 3) is hashed by k independent multiplicative-XOR functions into k separate embedding tables. Concatenating the k retrieved vectors yields the raw memory vector 「e」_t.
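A minimal sketch of this lookup path, assuming 8 hash heads and ~2 M-slot prime-sized tables as in the cheat-sheet below; the multiplier constants and dimensions are illustrative, not the released configuration.

```python
import torch
import torch.nn as nn

K_HEADS, TABLE_SIZE, MEM_DIM = 8, 2_000_003, 128                 # prime-sized tables
MULTIPLIERS = [0x9E3779B1 + 2 * k for k in range(K_HEADS)]       # distinct odd constant per head
tables = nn.ModuleList(nn.Embedding(TABLE_SIZE, MEM_DIM) for _ in range(K_HEADS))

def hash_ngram(canon_ids: list[int], head: int) -> int:
    """Multiplicative-XOR hash of a canonical-ID N-gram into one head's table."""
    h = 0
    for tok in canon_ids:
        h = ((h ^ tok) * MULTIPLIERS[head]) & 0xFFFFFFFF
    return h % TABLE_SIZE

def lookup(canon_ids: list[int]) -> torch.Tensor:
    """Concatenate the k head embeddings into the raw memory vector e_t."""
    return torch.cat(
        [tables[k](torch.tensor(hash_ngram(canon_ids, k))) for k in range(K_HEADS)], dim=-1
    )
```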
2.3 Context-aware gate
「h」_t (the current hidden state) queries 「e」_t through an RMSNorm + dot-product + sigmoid pipeline, producing a gate α_t ∈ (0,1). If the context contradicts the table entry, α_t → 0 and the noisy memory is suppressed.
2.4 Depth-wise convolution
A causal depth-wise 1-D convolution (kernel = 4, dilation = N-gram order) adds non-linearity and broadens the receptive field without a meaningful increase in parameters.
2.5 Residual injection
「Y」 = SiLU(Conv(Norm(α⊙「e」))) + 「e」
「H」^(ℓ) ← 「H」^(ℓ) + 「Y」
The backbone attention and MoE layers remain untouched—Engram is simply another branch.
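Putting 2.3–2.5 together, here is a minimal PyTorch sketch of the branch, assuming the retrieved memory vector already has model width and using `nn.RMSNorm` (PyTorch ≥ 2.4); the kernel and dilation values are the ones quoted above, everything else is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EngramBranch(nn.Module):
    """Gate + depth-wise causal conv + residual injection (Secs. 2.3-2.5), as a sketch."""

    def __init__(self, d_model: int, kernel: int = 4, dilation: int = 2):
        super().__init__()
        self.q_norm = nn.RMSNorm(d_model)        # norm applied to the querying hidden state
        self.v_norm = nn.RMSNorm(d_model)        # the Norm(...) inside the Y equation
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              dilation=dilation, groups=d_model, bias=False)  # depth-wise
        self.left_pad = (kernel - 1) * dilation  # left-only padding keeps the conv causal

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # h: residual stream [B, T, D]; e: retrieved N-gram memory [B, T, D]
        alpha = torch.sigmoid((self.q_norm(h) * e).sum(-1, keepdim=True))   # context gate in (0, 1)
        x = self.v_norm(alpha * e).transpose(1, 2)                          # Norm(alpha ⊙ e) -> [B, D, T]
        y = F.silu(self.conv(F.pad(x, (self.left_pad, 0)))).transpose(1, 2) + e  # Y
        return h + y                                                        # backbone stays untouched
```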
3 Scaling law: how much capacity should you give to memory vs experts?
「Core question:」 “Given a fixed parameter and FLOP budget, what split between MoE experts and Engram slots minimises loss?”
「One-sentence answer:」 Roughly 75–80 % experts / 20–25 % memory hits the sweet spot; pure MoE and pure memory are both worse.
Experimental setup
- Two compute iso-budgets: 2×10²⁰ and 6×10²⁰ FLOPs
- Constant sparse ratio P_tot/P_act ≈ 10
- Vary ρ = fraction of inactive parameters assigned to experts
| ρ | Val loss (6×10²⁰ budget) | Δ vs baseline | Note |
|---|---|---|---|
| 1.0 (pure MoE) | 1.7248 | — | baseline |
| 0.8 | 1.7109 | −0.0139 | 「optimal」 |
| 0.6 | 1.7130 | −0.0118 | still better than baseline |
| 0.0 (pure memory) | 1.745 | +0.020 | lacks dynamic capacity |
「Take-away」
There is a clear U-shape. Off-loading 20-25 % of the inactive budget to Engram embedding slots buys the biggest gain; move further and the model starts to lack conditional computation horsepower.
4 Full-scale pre-training: same FLOPs, higher scores
「Core question:」 “Does the lab-scale law survive a 27 B production run?”
「One-sentence answer:」 Yes—Engram-27B strictly outperforms an iso-parameter iso-FLOPs MoE-27B on knowledge, reasoning, code, math and long-context suites.
Model lineup (262 B tokens, identical data order)
- Dense-4B: 4.1 B total, 3.8 B active
- MoE-27B: 26.7 B total, 3.8 B active (72 routed experts)
- Engram-27B: 26.7 B total, 3.8 B active (55 experts + 5.7 B memory; see the quick split check after this list)
- Engram-40B: 39.5 B total, same compute (memory scaled to 18.5 B)
| Benchmark | MoE-27B | Engram-27B | Δ |
|---|---|---|---|
| MMLU 5-shot | 57.4 | 60.4 | +3.0 |
| BBH 3-shot | 50.9 | 55.9 | +5.0 |
| HumanEval 0-shot | 37.8 | 40.8 | +3.0 |
| MATH 4-shot | 28.3 | 30.7 | +2.4 |
| DROP F1 | 55.7 | 59.0 | +3.3 |
Loss curves in Appendix B show the gap widening toward the end of training, indicating that the memory module has not yet saturated.
5 Long-context extension: attention finally has room for the big picture
「Core question:」 “Will off-loading local dependencies improve very long sequences?”
「One-sentence answer:」 Dramatically—Multi-Query NIAH jumps from 84.2 → 97.0 with 18 % less pre-training compute.
Protocol
- Context window expanded to 32 k via YaRN (5 k steps)
- Compare checkpoints of equal pre-training loss (iso-loss) to isolate the architectural effect
| Setting | MoE-27B (50k) | Engram-27B (46k, iso-loss) |
|---|---|---|
| LongPPL (32k) | 4.38 | 4.19 |
| NIAH Multi-Query | 84.2 | 97.0 |
| Variable Track | 77.0 | 87.2 |
「Author’s reflection」
I used to think long-context was purely an attention-pattern problem. These numbers convinced me that memory locality is equally important—if the early layers are busy piecing together “Albert Einstein”, they literally can’t see the 30 k-token needle on the horizon.
6 Mechanistic evidence: Engram deepens the network without adding layers
「Core question:」 “Is the gain simply more parameters, or does lookup really free up depth?”
「One-sentence answer:」 LogitLens and CKA both show that Engram layers converge to the final prediction earlier—layer-5 Engram representations already match layer-12 MoE representations, indicating a genuine increase in effective depth.

- 「KL divergence vs layer」: Engram curves drop faster.
- 「Soft alignment index a_j」 (centroid of the top-5 most similar layers): consistently a_j > j, i.e. shallow Engram ≈ deeper MoE.
7 Ablation and sensitivity: what matters, what doesn’t
「Core question:」 “Which knobs can I turn without hurting the win?”
「One-sentence answer:」 Tokeniser compression, context gating and multi-branch fusion give the biggest lifts; the convolution and the exact N-gram order matter less.
Single-variable ablation (1.6 B memory, 3 B MoE backbone)
| Component removed | Val loss | Δ |
|---|---|---|
| None (reference) | 1.768 | — |
| Tokeniser compression | 1.773 | +0.005 |
| Multi-branch fusion | 1.770 | +0.002 |
| Context gating | 1.770 | +0.002 |
| Convolution | 1.768 | +0.000 |
Layer sweep: inserting a single 1.6 B module at layer 2 gives 1.770; moving it deeper degrades loss monotonically. Splitting the same budget into two smaller modules at layers 2 and 6 recovers the best score (1.768) and makes it easier to hide host-memory lookups behind computation.
8 Inference with 100 B parameters on the host: does it lag?
「Core question:」 “Will PCIe transfer kill my serving QPS?”
「One-sentence answer:」 No—deterministic addressing allows prefetch; latency overhead <3 % on H800.
Setup
- 100 B embedding table in host DRAM
- nano-vLLM harness, 512 sequences × 1 k tokens, H800 GPU
- Prefetch starts right after input tokenisation, overlapping with layer-0 computation
| Backbone | Baseline tok/s | +Engram off-load | Δ |
|---|---|---|---|
| 4 B dense | 9 031 | 8 858 | −1.9 % |
| 8 B dense | 6 315 | 6 140 | −2.8 % |
Because only the activated rows travel across PCIe, traffic scales with batch × seq × hash heads, not with the 100 B table size. A production system can further cache the hottest 5 % of rows in GPU HBM and cut the penalty to under 1 %.
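Because the N-gram addresses are known as soon as the input is hashed, the row gather can be launched on a side stream before the backbone needs it. A minimal PyTorch sketch of that overlap, with illustrative names and shapes (nano-vLLM's real integration is more involved):

```python
import torch

copy_stream = torch.cuda.Stream()    # dedicated stream for host->GPU embedding traffic

def prefetch_engram_rows(table_cpu: torch.Tensor, row_ids: torch.Tensor) -> torch.Tensor:
    """Gather only the activated rows on the host and start an async copy to the GPU."""
    rows = table_cpu[row_ids]        # traffic grows with batch x seq x heads, not table size
    rows = rows.pin_memory()         # pinned buffer allows a truly asynchronous DMA
    with torch.cuda.stream(copy_stream):
        return rows.to("cuda", non_blocking=True)

# Call right after tokenisation + hashing, run layer 0 on the default stream, then
# torch.cuda.current_stream().wait_stream(copy_stream) before the Engram layer reads the rows.
```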
9 Mini cookbook: plugging a 1.6 B Engram into a 3 B MoE in four steps
「Core question:」 “I have one RTX 4090 and 100 B tokens—can I feel the magic?”
「One-sentence answer:」 Yes—shrink embedding dim to 512, insert modules at layers 2 and 6, follow the learning-rate recipe, and you should see ~0.04 loss drop within 20 B tokens.
Step-by-step
1. 「Vocabulary projection」 – `def canon_id(tok_id): return nfkc_lower_map.get(tok_id, tok_id)`
2. 「Build 2 hash tables」 – prime sizes, ~2 M slots each
3. 「Insert after layer-norm, before dropout」 –
   `mem = engram_lookup(seq, layer_id=2)`
   `gate = sigmoid((rmsnorm(hidden) * mem).sum(-1, keepdim=True))`
   `hidden = hidden + silu(conv1d(gate * mem))`
4. 「Optimizer」 (see the sketch after this list) –
   - Backbone: your usual Muon/AdamW
   - Engram embeddings: Adam, lr = 5× backbone, wd = 0
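A small sketch of the two-group optimiser from step 4; the `engram.` parameter-name prefix is an assumption about how you register the module, and AdamW stands in for whichever backbone optimiser you already use:

```python
import torch
from torch import nn

def build_optimizers(model: nn.Module, base_lr: float = 3e-4):
    """Backbone params get the usual optimiser; Engram embeddings get Adam, lr x5, wd = 0."""
    engram, backbone = [], []
    for name, p in model.named_parameters():
        (engram if name.startswith("engram") else backbone).append(p)
    backbone_opt = torch.optim.AdamW(backbone, lr=base_lr, weight_decay=0.1)  # or Muon
    engram_opt = torch.optim.Adam(engram, lr=5 * base_lr, weight_decay=0.0)
    return backbone_opt, engram_opt
```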
「Author’s reflection」
My first attempt put the table at layer 10 because “deeper is better”—loss barely moved. Moving it to layer 2 gave an instant 0.003 drop. Lesson: off-load early, not wherever.
10 Action checklist / implementation cheat-sheet
- [ ] Compress tokeniser (canonical lower-case + NFKC)
- [ ] Pick two early layers (≥ layer 2, ≤ layer 6)
- [ ] Hash heads = 8, prime table size ≈ 2–3 M each
- [ ] Dim = hidden_size/2 for the first test
- [ ] Gate with RMSNorm-dot-sigmoid
- [ ] Conv1d kernel = 4, dilation = max N-gram order
- [ ] Embedding optimizer: Adam, lr ×5, wd = 0
- [ ] Host-offload the table, enable prefetch overlap
- [ ] Cache the top 5 % hottest rows in GPU memory
- [ ] Evaluate on factual QA and long-context needle tests—expect +3–12 pt gains
One-page overview
Engram adds a deterministic N-gram lookup table beside MoE experts. It costs no extra FLOPs, frees early layers from memorising static facts, and effectively deepens the network. Under strict iso-parameter iso-FLOP training, the 27 B variant beats pure MoE on MMLU (+3.0), BBH (+5.0), HumanEval (+3.0) and 32 k-needle (+12.8) while running inference with <3 % latency penalty when the 100 B table lives in host memory. The optimal split is ~75 % experts / 25 % memory. Tokeniser compression, context gating and early-layer insertion are the critical design choices. A 1.6 B table already improves a 3 B MoE; code and long-context tasks benefit most.
FAQ
「Q1. Does Engram replace the embedding layer?」
No. Input and output embeddings stay unchanged; Engram is an additive memory branch.
「Q2. Will hash collisions inject wrong facts?」
Multi-head hashing + context gating keeps effective collision rate <0.1 % and suppresses mismatched memories.
「Q3. Can I retrofit an already trained MoE?」
Yes—freeze the backbone for a few hundred steps while the new table warms up, then continue normal training.
「Q4. How large must the table be to see a gain?」
1.6 B parameters (≈ 2 M slots × 8 heads × 128 dim) is enough for a 3 B backbone; scaling continues linearly to at least 100 B.
「Q5. Does host off-loading hurt throughput?」
Measured penalty is <3 % on H800; with a two-level cache (GPU + host) it drops below 1 %.
「Q6. Is the module language-specific?」
The published experiments use Chinese and English; the hash works on byte sequences, so no language assumption is hard-coded.
「Q7. What if my model is not MoE?」
Engram is architecture-agnostic; replace the MoE feed-forward with a dense FFN and keep the same insertion protocol.
「Q8. Which ablation hurts the most?」
Removing tokeniser compression or context gating each costs ~0.005 loss; removing both wipes out half the benefit.

