Stop Feeding the Token Monster – 6 Battle-Tested Moves to Shrink 25k → 11k Context with LangGraph (and Keep Your LLM Sane)
“The longer my prompt, the dumber my model.”
If that sentence ever crossed your mind at 2 a.m. while staring at a $4 invoice for 128 k tokens, welcome home. This post is the field manual I wish I had that night.
The Story That Started With “Reward Hacking”
Last week my manager pinged me on Slack:
“Quick task: summarize every flavor of reward hacking in RLHF. Deck due tomorrow.”
I dumped 200 pages of papers into Claude-3.5 Sonnet.
- First 3 paragraphs? Flawless.
- Paragraph 4? “Bitcoin mining mitigates reward hacking.”
- Token counter? 25,000. My credit card wept.
Drew Breunig calls it Context Rot:
When the context closet is over-stuffed, the model can’t find the right T-shirt, so it grabs the nearest pair of hallucination underwear and wears it as a hat.
Why Context Rots – 4 Failure Modes You’ll Recognize
| Mode | Crime Scene |
|---|---|
| Context Poisoning | One wrong blog is retrieved early and then cited again and again, each time with more confidence. |
| Context Distraction | 80% of the history is small talk; the model decides “Hello” is the key signal. |
| Context Confusion | GitHub code + Wikipedia + Reddit opinions all land in the same window; the model turns into a traffic cop. |
| Context Clash | Two papers define RLHF differently; the model develops a split personality. |
The 6 Moves That Actually Fix It
Every technique below is reproducible in the official repo:
https://github.com/langchain-ai/how_to_fix_your_context
I used Python 3.11 + uv—entire env in 30 s:
# 1. Clone & jump in
git clone https://github.com/langchain-ai/how_to_fix_your_context
cd how_to_fix_your_context
# 2. One-liner venv
uv venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
# 3. Install
uv pip install -r requirements.txt
# 4. Keys
export OPENAI_API_KEY="sk-xxx"
export ANTHROPIC_API_KEY="sk-ant-xxx"
Move ① RAG – Don’t Memorize the Library, Just Bring the Cheat-Sheet
SEO target: Retrieval-Augmented Generation example, token reduction
Core idea: retrieve only relevant chunks before you generate.
LangGraph recipe:
- Chunk Lilian Weng’s 14 RL posts → 512-token slices.
- Embed with text-embedding-3-small.
- Claude-3.5 Sonnet decides if it needs more docs, which docs, and when to stop.
Outcome: the identical “reward hacking” query drops from 25 k → 15 k tokens, and hallucinations visibly decline.
Figure: RAG DAG—clarify → retrieve → generate, usually finished in 3 turns.
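The recipe in code, roughly. This is a minimal sketch rather than the repo's exact notebook; the package choices follow the defaults mentioned in the FAQ below (Chroma in-memory, text-embedding-3-small), and `blog_docs` stands in for the posts you load elsewhere:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Slice the source posts into ~512-token chunks (blog_docs is assumed to be
# a list of Documents loaded elsewhere, e.g. with a web loader).
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(blog_docs)

# Embed and index in an in-memory Chroma store.
vector_store = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

# Only the top-k slices ever reach the prompt; the 200 pages stay on disk.
relevant_chunks = retriever.invoke("reward hacking in RLHF")
```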
Move ② Tool Loadout – Pack Only the Screwdrivers You’ll Actually Use
Pain: stuffing 200 math functions into the system prompt → model suffers choice paralysis.
Fix: semantic tool filtering. Embed every function’s docstring; query → top-5 most similar.
Code:
# Retrieve the 5 tool descriptions most similar to the user's query
candidates = vector_store.similarity_search(user_query, k=5)
# Expose only those tools for this turn (config identifies the current thread)
graph.update_state(config, {"tools": [registry[t.id] for t in candidates]})
Pay-off:
- Removes overlapping descriptions → fewer false picks.
- Saves ~500 tokens per turn; the savings scale linearly with conversation length.
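To see where `vector_store` and `registry` come from, here is one way to set them up. This is a sketch with two toy tools (the real loadout indexes far more functions), and the class choices are mine, not necessarily the repo's:

```python
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

# Registry of callable tools, keyed by name (two shown; imagine 200).
registry = {fn.__name__: fn for fn in (add, multiply)}

# One Document per tool, embedding its docstring so similarity search matches intent.
docs = [Document(page_content=fn.__doc__, id=name) for name, fn in registry.items()]
vector_store = InMemoryVectorStore.from_documents(docs, OpenAIEmbeddings(model="text-embedding-3-small"))
```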
Move ③ Context Quarantine – Let the Math Nerd and the Search Junkie Live in Separate Apartments
Use-case: “Integrate this ArXiv integral and find latest papers.”
Design: Supervisor agent routes to Math-Agent (add, multiply, sympy) or Research-Agent (web search), each in its own context window.
LangGraph supervisor:
builder = StateGraph(AgentState)
builder.add_node("supervisor", supervisor_node)
builder.add_node("math_agent", math_agent)
builder.add_node("research_agent", research_agent)
# The supervisor writes its routing decision into state["next"]
builder.add_conditional_edges("supervisor", lambda s: s["next"], {
    "math": "math_agent",
    "research": "research_agent",
})
Win: no context clash, no distraction. Each agent stays <4 k tokens, forever focused.
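For completeness, the `supervisor_node` and `AgentState` referenced above could look like this. The keyword routing is a stand-in for illustration; a real supervisor would typically ask an LLM to pick the route:

```python
from typing import TypedDict

class AgentState(TypedDict):
    question: str
    next: str

def supervisor_node(state: AgentState) -> dict:
    """Decide which quarantined specialist handles the request."""
    wants_math = any(w in state["question"].lower() for w in ("integrate", "solve", "compute"))
    return {"next": "math" if wants_math else "research"}
```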
Move ④ Context Pruning – Make a Mini-Model Your Copy-Editor
Flow:
- GPT-4o-mini acts as the “junior editor”.
- Strip sentences not related to the user’s question.
- Forward the ~3 k-token “patch” to the senior model.
Tested: 25 k → 11 k tokens, factual recall 96 % vs baseline 94 %.
Prompt in plain English:
“You are a strict editor. Delete paragraphs unrelated to ‘reward hacking’. Keep formulas. Output Markdown.”
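Wired up, the junior-editor pass is a single cheap LLM call. This is a sketch, not the repo's exact prompt plumbing, and `retrieved_text` stands in for the verbose retrieval output:

```python
from langchain_openai import ChatOpenAI

editor = ChatOpenAI(model="gpt-4o-mini", temperature=0)

PRUNE_PROMPT = (
    "You are a strict editor. Delete paragraphs unrelated to '{question}'. "
    "Keep formulas. Output Markdown.\n\n{document}"
)

def prune(document: str, question: str) -> str:
    """Return only the passages relevant to the question."""
    return editor.invoke(PRUNE_PROMPT.format(question=question, document=document)).content

# The senior model then reads the ~3k-token patch instead of the 25k dump.
patch = prune(retrieved_text, "reward hacking")
```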
Move ⑤ Context Summarization – Condense, Don’t Discard
When: every retrieved doc is relevant but verbose (legal, medical).
How: GPT-4o-mini compresses all content to 50-70 % length while preserving formulas & citations.
Delta: additional 30-40 % token cut, zero data loss.
vs Pruning:
- Pruning = throw away the lettuce.
- Summarization = boil three bowls of soup down to one concentrated stock.
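Mechanically it is the same one-call pattern as pruning; only the instruction changes. A minimal sketch, with prompt wording that is mine rather than the repo's:

```python
from langchain_openai import ChatOpenAI

summarizer = ChatOpenAI(model="gpt-4o-mini", temperature=0)

SUMMARIZE_PROMPT = (
    "Compress the text below to roughly 50-70% of its length. "
    "Preserve every formula and citation verbatim. Output Markdown.\n\n{document}"
)

def summarize(document: str) -> str:
    """Condense verbose-but-relevant context instead of discarding it."""
    return summarizer.invoke(SUMMARIZE_PROMPT.format(document=document)).content
```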
Move ⑥ Context Offloading – Save RAM, Use Disk
Two patterns:
- Session Scratchpad – a temporary notepad inside one thread:

      def write_note(note: str) -> str:
          state.scratchpad += f"- {note}\n"
          return "noted"

- Persistent Memory – a cross-thread KV-store via LangGraph’s InMemoryStore. The research agent resumes yesterday’s plan without reloading 20 k tokens of history.
Benefit: context length stays flat regardless of turns; multi-user safe via namespaces.
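A minimal sketch of the persistent-memory pattern with InMemoryStore. The namespace and keys are illustrative; in a real graph you would typically hand the store to the graph when compiling it:

```python
from langgraph.store.memory import InMemoryStore

store = InMemoryStore()

# Namespaces keep users and threads isolated.
namespace = ("user_42", "research_plans")

# Day 1: persist the working plan instead of carrying it in the prompt.
store.put(namespace, "reward_hacking_survey", {"todo": ["re-read Weng", "draft taxonomy"]})

# Day 2: a fresh thread reads the plan back without replaying 20k tokens of history.
plan = store.get(namespace, "reward_hacking_survey").value
```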
How to Chain the 6 Moves – a Dead-Simple Playbook
Step-by-step (copy-paste ready):
- Open with RAG → fetch 20 k candidate docs.
- Score relevance:
  - <30 % usable → Prune hard.
  - Mostly usable but wordy → Summarize.
- Check task type:
  - Need calculation → Quarantine the math agent.
  - Need live data → Tool Loadout with only search tools.
- Throughout, Offload intermediate notes to the scratchpad; read it back once at the end for the final answer.
Mnemonic: Review → Prune/Summarize → Quarantine → Tool → Offload → Final.
(RPQT OF—sounds like a Star Wars droid, easy to remember.)
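If you want to wire the playbook into one graph, the two branch points reduce to small routing functions you can hang on conditional edges. The state keys and thresholds below are my assumptions, not the repo's:

```python
def route_compression(state: dict) -> str:
    """Step 2: decide how to shrink what RAG returned."""
    if state["usable_ratio"] < 0.30:
        return "prune"       # <30% usable: cut hard
    return "summarize"       # mostly relevant but wordy: condense

def route_task(state: dict) -> str:
    """Step 3: pick the quarantined specialist (or a lean tool loadout)."""
    return "math_agent" if state["needs_calculation"] else "research_agent"

# Each router plugs into a conditional edge, e.g.:
# builder.add_conditional_edges("score_relevance", route_compression,
#                               {"prune": "prune", "summarize": "summarize"})
```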
SEO-Friendly FAQ (Structured Data Ready)
Q1: Can I run the notebooks with only an OpenAI key?
A: Yes. Swap ChatAnthropic for ChatOpenAI(model="gpt-4o") and you’re good.
Q2: Do I need a cloud vector DB like Pinecone?
A: Repo defaults to Chroma in-memory. Zero config, 100 k chunks in seconds. Upgrade later if you scale.
Q3: Will Pruning or Summarization lose key facts?
A: Author ran a 100-sample human eval—96 % factual recall vs 94 % baseline. Still, build your own golden test set before production.
Q4: How do I add custom logic if I’m new to LangGraph?
A: Remember three rules: nodes are pure functions, state is a mutable dict, and edges return string names. Visualize with print(graph.get_graph().draw_mermaid()) and debug visually.
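A toy graph that follows all three rules (the node and state names are invented for illustration):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    text: str

def shout(state: State) -> dict:
    # Rule 1: a node is a plain function that returns a state update.
    return {"text": state["text"].upper()}

builder = StateGraph(State)            # Rule 2: state is just a (typed) dict.
builder.add_node("shout", shout)
builder.add_edge(START, "shout")       # Rule 3: edges are wired by string names.
builder.add_edge("shout", END)

graph = builder.compile()
print(graph.get_graph().draw_mermaid())  # visual debugging, as suggested above
```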
Key Takeaway for Google & LLMs Alike
Bigger context ≠ better answers. The new meta is subtraction:
- Subtract noise → signal sharpened.
- Subtract hallucinations → truth spotlighted.
- Subtract cents → boss delighted.
Next time you see token count balloon, channel your inner surgeon, not hoarder.
Clone the repo, run one notebook, and tell me in the comments:
“I just amputated 14 k tokens and my model finally stopped hallucinating.”
See you in the Hacker News thread.