Context Engineering 2.0: Teaching AI to Read Between the Lines
What problem does context engineering solve?
Machines can’t “fill in the blanks” the way humans do; we must compress noisy reality into a clean signal they can trust.
This post walks through the 20-year arc of how we got here, the design loops that work today, and the next leaps already visible.
What exactly is context engineering—and how is it different from prompt tuning or RAG?
One-sentence answer:
Context engineering is the full-cycle discipline of collecting, storing, managing and selecting everything a machine needs to understand intent; prompt tuning and RAG are single knobs inside that cycle.
Summary:
Prompts, RAG and memory tricks are tactics. Context engineering is the strategy that decides which tactic to use, when, and with what data.
Reflection:
In 2021 I spent a week hand-crafting a 32-shot prompt for a summarisation task, only to watch performance drop when the test set topic drifted. The prompt wasn’t wrong; the context pipeline feeding it was starved. That was my first realisation that “better prompts” without better context logistics is lipstick on a pig.
20 years in four waves: where we are and what’s next
| Wave | Time | Machine skill | Human labour | Example stack |
|---|---|---|---|---|
| 1.0 Structured | 1990-2020 | sense & react | translate intent into menus | Context Toolkit, Cooltown |
| 2.0 Language-native | 2020-now | read & reason | speak naturally | ChatGPT, LangChain, Letta |
| 3.0 Human-level | near future | social cues, emotion | minimal | research prototypes |
| 4.0 Super-human | speculative | invents new context for us | learner side | AlphaGo-style teachers |
Summary:
Each jump in machine intelligence lowers the human effort required to bridge meaning, but the bridge itself—context engineering—never disappears; it just becomes invisible.
The formal definition (translated into English)
In symbols:
Context C = ∪ Char(e) for every relevant entity e
CE : (C, T) → f_context
Plain words:
- List every “thing” that matters (user, app, room, API, memory bank).
- Take a snapshot of each thing.
- Glue the snapshots together—that’s your raw context.
- Build a function that turns that raw blob into whatever shape the task T needs (tokens, vectors, JSON, you name it); a minimal code sketch follows.
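A toy sketch of that mapping in code; Entity objects with a snapshot() method and a Task with a format() method are illustrative stand-ins, not a real API:
def build_context(entities):
    # C = union of Char(e) over every relevant entity e
    return {e.name: e.snapshot() for e in entities}

def make_f_context(task):
    # CE : (C, T) -> f_context ; the returned function shapes raw context for this task
    return lambda raw_context: task.format(raw_context)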
Mini-case:
Gemini CLI treats the project folder as an entity. Its snapshot is the GEMINI.md file. When you type a request, the CLI unions that file with the terminal’s working directory and the chat history, then feeds the bundle to the model. No black magic—just a clean f_context.
Design loop 1: Collection—more signal, less noise
| Source | Modalities | Typical artefact |
|---|---|---|
| Phone / Laptop | keystroke, cursor, window title | “VSCode focused on App.tsx” |
| Smart-watch | heart-rate, accelerometer | “user relaxed, walking” |
| Smart-speaker | voice, pause pattern | “hesitant, 3 false starts” |
| VR controller | micro-gesture velocity | “precise vs broad motion” |
| EEG headset | α/β power ratio | “high cognitive load” |
Principles:
- Minimal sufficiency: collect just enough to support the next decision.
- Semantic continuity: keep meaning intact across sampling gaps.
Code snippet: 3-second volume sampler for “user energy”
import pyaudio, audioop, math   # audioop was removed in Python 3.13; pin an older Python or use the audioop-lts backport

CHUNK, RATE, SEC = 1024, 16000, 3        # 1024-frame buffers, 16 kHz mono, 3-second sample
p = pyaudio.PyAudio()
s = p.open(format=pyaudio.paInt16, channels=1, rate=RATE, input=True,
           frames_per_buffer=CHUNK)
# RMS per chunk; the final sqrt squashes raw int16 RMS into the small range the tags below assume
rms = [audioop.rms(s.read(CHUNK), 2) for _ in range(int(RATE / CHUNK * SEC))]
print("avg energy:", round(math.sqrt(sum(rms) / len(rms))))
s.stop_stream(); s.close(); p.terminate()
Use the number as a real-time tag: < 30 → “quiet”, 30-60 → “normal”, > 60 → “loud/engaged”.
Design loop 2: Storage—layered memory that forgets on purpose
Two-layer sketch (works in production today):
┌─ Short-term ──┐      trigger rule      ┌─ Long-term ───┐
│ last N turns  │  ── repeat ≥ 3 ──▶     │ summary,      │
│ + raw I/O     │     or star flag       │ vectors       │
└───────────────┘                        └───────────────┘
Definitions:
- Short-term = high temporal weight, kept in RAM or a fast KV store.
- Long-term = high importance, low temporal weight, compressed & written to disk/cloud.
- Transfer function decides promotion; can be as simple as “accessed count > 3”.
Letta example block (real format):
{"id":"m_042",
"created":"2025-11-06T10:12:00Z",
"type":"code_snippet",
"content":"def fib(n): return n if n<2 else fib(n-1)+fib(n-2)",
"embedding":[0.11,-0.05,...],
"importance":0.78,
"accessed":4}
Embedding enables vector search; importance & accessed drive the promotion rule.
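A sketch of one possible transfer function, assuming memory records shaped like the block above; should_promote, short_term and long_term are illustrative names, not Letta APIs:
def should_promote(mem, starred=False):
    # promote on repeated access, an explicit user star, or high modelled importance
    return mem["accessed"] >= 3 or starred or mem["importance"] >= 0.9

def promote(mem, short_term, long_term):
    if should_promote(mem):
        short_term.remove(mem)      # drop from the hot layer
        long_term.append(mem)       # in production: compress, re-embed, write to disk/cloud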
Design loop 3: Usage—select before you attend
Even 1-million-token windows drown in junk. Apply five filters before anything hits the model:
- Semantic similarity (vector nearest-neighbour)
- Logical dependency (must keep earlier tool outputs that later steps cite)
- Recency & frequency (time-decay + access count)
- User feedback (explicit thumbs)
- De-duplication (merge near-identical chunks)
Scoring sketch:
import math
import numpy as np

def score(mem, q_vec, now, dep_graph):
    # mem['created'] is assumed to be a datetime here (parse the stored ISO string first)
    v = np.asarray(mem['embedding'])
    sem = float(v @ q_vec / (np.linalg.norm(v) * np.linalg.norm(q_vec)))     # semantic similarity
    dep = 1.0 if mem['id'] in dep_graph else 0.3                             # logical dependency
    rec = math.exp(-(now - mem['created']).total_seconds() / 3600 / 24)      # recency decay over ~24 h
    return 0.5*sem + 0.2*dep + 0.2*rec + 0.1*min(mem['accessed']/10, 1.0)    # frequency, capped at 1
Keep the top-K chunks, choosing K so that total tokens stay ≤ 50 % of the model’s limit (an empirical sweet spot seen in coding agents).
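Putting the two together, a hedged selection sketch that reuses score() from above; count_tokens and model_limit stand in for whatever tokenizer and model window you actually use:
def select(memories, q_vec, now, dep_graph, model_limit, count_tokens):
    budget = model_limit // 2                      # keep total tokens <= 50 % of the window
    ranked = sorted(memories, key=lambda m: score(m, q_vec, now, dep_graph), reverse=True)
    picked, used = [], 0
    for m in ranked:
        cost = count_tokens(m["content"])
        if used + cost > budget:
            continue                               # skip chunks that would blow the budget
        picked.append(m)
        used += cost
    return picked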
Multi-modal fusion—turn images, audio and text into one bucket
Three battle-tested patterns:
- Project everything into a shared vector space, then concatenate (sketched just below).
- Run a single Transformer with mixed tokens (text & patch & audio frames).
- Cross-attention: let text queries attend to image regions on-the-fly.
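A toy version of pattern #1; the random projection matrices and the input widths (768/1024/512) are stand-ins for learned per-modality encoders:
import numpy as np

D = 256                                            # shared embedding width (illustrative)
rng = np.random.default_rng(0)
W_TEXT, W_IMAGE, W_AUDIO = (rng.normal(size=(d, D)) for d in (768, 1024, 512))

def fuse(text_vec, image_vec, audio_vec):
    parts = [text_vec @ W_TEXT, image_vec @ W_IMAGE, audio_vec @ W_AUDIO]
    parts = [p / np.linalg.norm(p) for p in parts]  # normalise so no modality dominates
    return np.concatenate(parts)                    # one flat context vector, ready to store or retrieve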
Reflection:
I once piped webcam frames into a pipeline that treated each JPG as a base64 string and simply prepended it to the prompt. Throughput tanked and bills tripled. Switching to pattern #2 (unified self-attention inside the model) cut latency by 40 % and cost by half—same accuracy.
Case study 1: Gemini CLI file inheritance (copy-paste ready)
Goal: Give the model project-level context without pasting 500 lines every turn.
Mechanism:
- GEMINI.md files live at any folder depth.
- Child folders override parent keys; siblings stay isolated.
- At start-up the CLI loads the chain of GEMINI.md files from root down to PWD.
- Mid-session summaries are appended to the deepest file, forming a self-baking loop.
Example hierarchy:
~/GEMINI.md ➜ global rules (TS, eslint, tests)
~/web/GEMINI.md ➜ React conventions
~/web/components/Button/ ➜ no extra file → inherits above
Outcome:
Component code generated inside Button/ automatically follows React + TS rules; a script running in ~/backend/ sees only the global rules, avoiding front-end noise.
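To make the mechanism concrete, here is an illustrative re-implementation of the inheritance walk, not Gemini CLI’s actual code:
from pathlib import Path

def load_context_chain(cwd: Path) -> str:
    cwd = cwd.resolve()                             # walk needs an absolute path
    chain = []
    for folder in [*reversed(cwd.parents), cwd]:    # filesystem root ... -> working directory
        f = folder / "GEMINI.md"
        if f.is_file():
            chain.append(f.read_text())
    return "\n\n".join(chain)                       # deepest file is appended last, matching child-overrides-parent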
Case study 2: Deep Research snapshot loop (break the 200 k wall)
Problem: Open-ended research can spawn 50+ search-call-observe cycles—far beyond any context window.
Solution lifecycle:
collect ➜ compress ➜ reason ➜ repeat
Snapshot prompt (abridged):
You are a compression agent.
Input: full history + current uncertainty list.
Output:
<Evidence> key findings, URLs, quotes </Evidence>
<Plan> top-3 remaining questions </Plan>
Max 800 tokens.
The downstream agent sees only the snapshot, not the raw crawl. In Tongyi DeepResearch this keeps runs inside an 8 k context while producing 30-page reports.
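A minimal sketch of that lifecycle; llm, search, and snapshot_prompt are placeholders for your model call, your search tool, and the compression prompt above:
def deep_research(question, llm, search, snapshot_prompt, cycles=10):
    snapshot = f"<Plan> {question} </Plan>"
    for _ in range(cycles):
        query = llm(snapshot + "\nPick the single most useful next search query.")
        observations = search(query)                                      # collect
        snapshot = llm(f"{snapshot_prompt}\n{snapshot}\n{observations}")  # compress to <= 800 tokens
    return llm(snapshot + "\nWrite the report from the evidence above.")  # reason over the snapshot only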
Case study 3: Multi-agent blackboard (no spaghetti messages)
Setup:
- 1 planner + 3 executors, async.
- Shared JSON file (“blackboard”) with sections: outline, sec1, sec2, refs.
- File-lock to avoid write clashes.
- Executors poll their section status; when allocated they write draft + mark “done”.
Blackboard snippet:
{"outline":"done","owner":"planner",
"sec1":"writing","owner":"exec-A",
"sec2":"todo","owner":null}
No direct socket chatter; the file is the protocol. Works across languages and reboots.
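An illustrative executor loop under those rules; the locking uses POSIX fcntl and the section layout matches the snippet above:
import json, time, fcntl           # fcntl is POSIX-only; use a library such as portalocker on Windows

def run_executor(board_path, section, agent, write_draft):
    while True:                                     # poll: the file is the whole protocol
        with open(board_path, "r+") as f:
            fcntl.flock(f, fcntl.LOCK_EX)           # exclusive lock prevents write clashes
            board = json.load(f)
            mine = board[section]["owner"] == agent and board[section]["status"] == "writing"
            if mine:
                board[section]["draft"] = write_draft()
                board[section]["status"] = "done"
                f.seek(0); f.truncate()
                json.dump(board, f, indent=2)
            fcntl.flock(f, fcntl.LOCK_UN)
        if mine:
            return
        time.sleep(1)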
Case study 4: Brain–computer interface for cognitive-load tagging
Hardware: 8-channel dry EEG.
Feature: α/β power ratio.
Mapping: ratio < 0.4 → high load.
Pipeline:
EEG stream → band-pass 1-50 Hz → ICA artefact removal → 1-second window FFT → ratio calculation → MQTT message → context store.
Usage rule:
High-load tag triggers “quiet mode”: AI sends bullet summaries instead of long paragraphs; Slack bot pauses non-critical channels. Pilot showed 25 % drop in self-reported frustration.
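A toy sketch of the ratio step only (band-pass filtering and ICA are assumed to have already run); window is one second of cleaned samples, and the 250 Hz default sample rate is an assumption:
import numpy as np

def alpha_beta_ratio(window, fs=250):
    freqs = np.fft.rfftfreq(len(window), d=1/fs)
    power = np.abs(np.fft.rfft(window)) ** 2
    alpha = power[(freqs >= 8) & (freqs < 13)].sum()    # alpha band, 8-13 Hz
    beta = power[(freqs >= 13) & (freqs < 30)].sum()    # beta band, 13-30 Hz
    return alpha / beta

def load_tag(ratio):
    return "high_load" if ratio < 0.4 else "normal"     # mapping from the case study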
Emerging ops: KV caching, tool sets, failure retention
KV caching—keep the hit-rate high
- Freeze the system prompt prefix byte-for-byte; timestamps at the top break the cache (see the sketch after this list).
- Append-only history; random edits invalidate the whole prefix.
- Warm-up: preload expected sessions during idle time.
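A small sketch of a cache-friendly message builder that follows those rules; the role/content shape is just the common chat format, not any specific vendor API:
SYSTEM_PREFIX = "You are a coding assistant. Follow the project conventions."  # frozen byte-for-byte, versioned in git

def build_messages(history, user_turn, now_iso):
    # anything dynamic (timestamps, user name) rides in the newest turn, never in the cached prefix
    history.append({"role": "user", "content": f"[{now_iso}] {user_turn}"})    # append-only history
    return [{"role": "system", "content": SYSTEM_PREFIX}, *history]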
Tool sets—smaller is safer
DeepSeek-v3 accuracy peaks near 30 tools; beyond 100 tools the success rate falls off a cliff.
Fix: load only domain-relevant tools per episode; mask invalid logits at decode time.
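An illustrative per-episode tool filter; the registry and domain tags below are made up:
TOOLS = [
    {"name": "run_tests",  "domains": {"code"}},
    {"name": "git_diff",   "domains": {"code"}},
    {"name": "web_search", "domains": {"research"}},
]

def tools_for(domain, cap=30):
    picked = [t for t in TOOLS if domain in t["domains"]][:cap]   # stay under the ~30-tool breakpoint
    allowed = {t["name"] for t in picked}
    return picked, allowed    # hand `allowed` to the decoder so invalid tool-name logits can be masked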
Failure retention—don’t hide mistakes
Keep wrong actions inside the context window. The model sees the error trace and learns corrective patterns. Shuffle serialization order slightly to avoid mechanical repetition.
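A sketch of the retention bookkeeping only (the slight shuffling of serialization order is left out), assuming a plain role/content history list:
def record_tool_result(history, call, result, error=None):
    history.append({"role": "assistant", "tool_call": call})
    if error is not None:
        history.append({"role": "tool", "content": f"ERROR: {error}"})   # keep the trace; do not scrub it
    else:
        history.append({"role": "tool", "content": result})
    return history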
Checklist: implement today without blowing the budget
- Pick two memory layers: hot RAM (last N) + cold DB (compressed).
- Write a 5-line scoring function; filter to ≤ 50 % context length.
- Store embeddings side-by-side with human-readable summary for debug.
- Lock shared blackboard when agents > 1.
- Version your system prompt; any byte change nukes KV cache.
- Strip dynamic timestamps from prefix.
- Keep tool count < 30 per call; mask unavailable choices.
- Log wrong actions—let the model read its own stack trace.
One-page overview
Context engineering compresses high-entropy reality into low-entropy signals machines can act on.
- 20-year arc: structured menus → natural language → social nuance → machine-invented context.
- Core loop: collect, store, select.
- Store: short-term RAM + long-term DB; promote by access count or user star.
- Select: semantic, dependency, recency, frequency, de-duplication.
- Fusion: project multimodal inputs into shared space; cross-attend if needed.
- Four case studies show file inheritance, snapshot compression, blackboard coordination and EEG tagging working in production.
- Ops: freeze KV prefix, limit tools, keep failures.
Follow the checklist to ship without surprises.
FAQ
Q1: Is context engineering just a fancy name for prompt tuning?
A: No. Prompt tuning is one knob inside the larger context pipeline.
Q2: How large should the short-term buffer be?
A: Enough to cover a single task episode—usually 4-12 k tokens for code, 2-4 k for chat.
Q3: Do I need a vector database at day one?
A: A flat JSON file + NumPy works up to ~50 MB of embeddings; switch to FAISS when search latency > 150 ms.
Q4: What promotes a chunk to long-term memory?
A: Repeated access (≥3) or explicit user feedback (thumbs-up/star).
Q5: Why keep wrong actions in the window?
A: Models learn corrective patterns only if they see the error trace.
Q6: Does EEG require medical-grade gear?
A: Dry consumer headsets give coarse load signals; good enough for quiet-mode toggles, not for mind-reading.
Q7: How many tools before reliability tanks?
A: Empirical breakpoint around 30 for DeepSeek-v3; mask unused tools to stay safe.

