Stop Feeding the Token Monster – 6 Battle-Tested Moves to Shrink 25k → 11k Context with LangGraph (and Keep Your LLM Sane)

“The longer my prompt, the dumber my model.”
If that sentence ever crossed your mind at 2 a.m. while staring at a $4 invoice for 128 k tokens, welcome home. This post is the field manual I wish I had that night.


The Story That Started With “Reward Hacking”

Last week my manager pinged me on Slack:
“Quick task: summarize every flavor of reward hacking in RLHF. Deck due tomorrow.”
I dumped 200 pages of papers into Claude-3.5 Sonnet.

  • First 3 paragraphs? Flawless.
  • Paragraph 4? “Bitcoin mining mitigates reward hacking.”
  • Token counter? 25 000. My credit card wept.

Drew Breunig calls it Context Rot:

When the context closet is over-stuffed, the model can’t find the right T-shirt, so it grabs the nearest pair of hallucination underwear and wears it as a hat.


Why Context Rots – 4 Failure Modes You’ll Recognize

Mode → crime scene:

  • Context Poisoning: one wrong blog post is retrieved early and then cited again and again, each time with more confidence.
  • Context Distraction: 80 % of the history is small talk; the model decides “Hello” is the key signal.
  • Context Confusion: GitHub code, Wikipedia articles, and Reddit opinions all land in the same window; the model turns into a traffic cop.
  • Context Clash: two papers define RLHF differently; the model develops a split personality.

The 6 Moves That Actually Fix It

Every technique below is reproducible in the official repo:
https://github.com/langchain-ai/how_to_fix_your_context
I used Python 3.11 + uv—entire env in 30 s:

# 1. Clone & jump in
git clone https://github.com/langchain-ai/how_to_fix_your_context
cd how_to_fix_your_context

# 2. One-liner venv
uv venv && source .venv/bin/activate      # Windows: .venv\Scripts\activate

# 3. Install
uv pip install -r requirements.txt

# 4. Keys
export OPENAI_API_KEY="sk-xxx"
export ANTHROPIC_API_KEY="sk-ant-xxx"

Move ① RAG – Don’t Memorize the Library, Just Bring the Cheat-Sheet

Core idea: retrieve only relevant chunks before you generate.
LangGraph recipe:

  • Chunk Lilian Weng’s 14 RL posts → 512-token slices.
  • Embed with text-embedding-3-small.
  • Claude-3.5 Sonnet decides if it needs more docs, which docs, and when to stop.

Outcome: the identical “reward hacking” query drops from 25 k → 15 k tokens, and hallucinations visibly decline.
Figure: RAG DAG—clarify → retrieve → generate, usually finishes in 3 turns.
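
The retrieval half of that recipe is a few lines of LangChain; a minimal sketch, assuming the blog posts are already loaded as docs (chunk_size is in characters, roughly 4 per token):

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# ~512-token slices
splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=200)
chunks = splitter.split_documents(docs)

vector_store = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
context = retriever.invoke("reward hacking in RLHF")  # only the relevant slices reach the prompt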


Move ② Tool Loadout – Pack Only the Screwdrivers You’ll Actually Use

Pain: stuffing 200 math functions into the system prompt → model suffers choice paralysis.
Fix: semantic tool filtering. Embed every function’s docstring; query → top-5 most similar.
Code (a sketch: vector_store holds one embedded docstring per tool, registry maps tool names to callables, and config identifies the current thread):

# keep only the 5 tools whose docstrings best match the query
candidates = vector_store.similarity_search(user_query, k=5)
selected = [registry[doc.metadata["name"]] for doc in candidates]
graph.update_state(config, {"tools": selected})
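
Building the searchable tool index is a one-time pass over the full registry; a minimal sketch (all_tools is a placeholder for the 200-function list):

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# embed each tool's docstring once, keyed by tool name
registry = {t.name: t for t in all_tools}
docs = [Document(page_content=t.description, metadata={"name": name})
        for name, t in registry.items()]
vector_store = Chroma.from_documents(docs, OpenAIEmbeddings(model="text-embedding-3-small"))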

Pay-off:

  • Removes overlapping descriptions → fewer false picks.
  • Saves ~500 tokens per turn; scales linearly with conversation length.

Move ③ Context Quarantine – Let the Math Nerd and the Search Junkie Live in Separate Apartments

Use-case: “Integrate this ArXiv integral and find latest papers.”
Design: Supervisor agent routes to Math-Agent (add, multiply, sympy) or Research-Agent (web search), each in its own context window.
LangGraph supervisor (sketched here with a plain StateGraph; the supervisor node writes the chosen agent's name into state["next"]):

from langgraph.graph import StateGraph

builder = StateGraph(State)
builder.add_node("supervisor", supervisor)
builder.add_node("math_agent", math_agent)
builder.add_node("research_agent", research_agent)
builder.add_conditional_edges("supervisor", lambda s: s["next"], {
    "math": "math_agent",
    "research": "research_agent"
})
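
The sub-agents themselves can be LangGraph's prebuilt ReAct agents; a minimal sketch (add, multiply, and web_search stand in for whatever tools you register):

from langgraph.prebuilt import create_react_agent

math_agent = create_react_agent(model, tools=[add, multiply])
research_agent = create_react_agent(model, tools=[web_search])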

Win: no context clash, no distraction. Each agent stays <4 k tokens, forever focused.


Move ④ Context Pruning – Make a Mini-Model Your Copy-Editor

Flow:

  1. GPT-4o-mini acts as “junior editor”.
  2. Strip sentences not related to the user’s question.
  3. Forward the 3 k “patch” to the senior model.

Tested: 25 k → 11 k tokens, factual recall 96 % vs baseline 94 %.
Prompt in plain English:

“You are a strict editor. Delete paragraphs unrelated to ‘reward hacking’. Keep formulas. Output Markdown.”
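
Wired up, the junior-editor pass is one call to a small model; a sketch (the prune helper is my name, not the repo's):

from langchain_openai import ChatOpenAI

editor = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def prune(question: str, context: str) -> str:
    # strict editor: drop paragraphs unrelated to the question, keep formulas
    instructions = (
        "You are a strict editor. Delete paragraphs unrelated to "
        f"'{question}'. Keep formulas. Output Markdown.\n\n"
    )
    return editor.invoke(instructions + context).content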


Move ⑤ Context Summarization – Condense, Don’t Discard

When: every retrieved doc is relevant but verbose (legal, medical).
How: GPT-4o-mini compresses all content to 50-70 % length while preserving formulas & citations.
Delta: additional 30-40 % token cut, zero data loss.
vs Pruning:

  • Pruning = throw away lettuce.
  • Summarization = boil three bowls of soup down into one concentrated stock.
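
The call mirrors the pruning helper, only the instruction changes; a sketch (the summarize helper is my name):

from langchain_openai import ChatOpenAI

summarizer = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def summarize(doc: str) -> str:
    # condense to roughly 50-70% of the original; keep formulas and citations verbatim
    prompt = (
        "Condense the following document to 50-70% of its length. "
        "Preserve every formula and citation verbatim.\n\n" + doc
    )
    return summarizer.invoke(prompt).content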

Move ⑥ Context Offloading – Save RAM, Use Disk

Two patterns:

  1. Session Scratchpad – temp notepad inside one thread.

    def write_note(note: str) -> str:
        state.scratchpad += f"- {note}\n"
        return "noted"
    
  2. Persistent Memory – cross-thread KV-store via LangGraph’s InMemoryStore.
    Research agent resumes yesterday’s plan without reloading 20 k history.
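
    A minimal sketch of that cross-thread store (namespace and key are illustrative):

    from langgraph.store.memory import InMemoryStore

    store = InMemoryStore()
    store.put(("research", "user_42"), "plan", {"steps": ["survey RLHF papers", "draft deck"]})
    saved = store.get(("research", "user_42"), "plan")  # next session: resume without the 20 k history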

Benefit: context length stays flat regardless of turns; multi-user safe via namespaces.


How to Chain the 6 Moves – a Dead-Simple Playbook

Step-by-step (copy-paste ready):

  1. Open with RAG → fetch ~20 k tokens of candidate docs.
  2. Score relevance (see the routing sketch after this list):

    • <30 % usable → Prune hard.
    • Most usable but wordy → Summarize.
  3. Check task type:

    • Need calculation → Quarantine math agent.
    • Need live data → Tool Loadout with only the search tools.
  4. Throughout, Offload intermediate notes to scratchpad; read it back once at the end for final answer.
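
As a LangGraph edge, step 2 of that playbook collapses into one routing function; a sketch (the threshold and node names are mine):

def route_after_retrieval(state: dict) -> str:
    # decide how to shrink the retrieved context
    usable = state["relevant_chunks"] / max(state["total_chunks"], 1)
    if usable < 0.30:
        return "prune"       # mostly noise: cut hard
    return "summarize"       # mostly relevant but wordy: condense

builder.add_conditional_edges("retrieve", route_after_retrieval,
                              {"prune": "prune_node", "summarize": "summarize_node"})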

Mnemonic: Retrieve → Prune/Summarize → Quarantine → Tool → Offload → Final.
(RPQT OF—sounds like a Star Wars droid, easy to remember.)


SEO-Friendly FAQ (Structured Data Ready)

Q1: Can I run the notebooks with only an OpenAI key?
A: Yes. Swap ChatAnthropic for ChatOpenAI(model="gpt-4o") and you’re good.
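
In code the swap is one line (langchain-openai provides the class):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")  # drop-in replacement for the ChatAnthropic instance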

Q2: Do I need a cloud vector DB like Pinecone?
A: Repo defaults to Chroma in-memory. Zero config, 100 k chunks in seconds. Upgrade later if you scale.

Q3: Will Pruning or Summarization lose key facts?
A: Author ran a 100-sample human eval—96 % factual recall vs 94 % baseline. Still, build your own golden test set before production.

Q4: How do I add custom logic if I’m new to LangGraph?
A: Remember three rules: nodes are plain functions that take the state and return a partial update, state is a typed dict, and conditional edges return the next node's name as a string. Visualize with print(graph.get_graph().draw_mermaid()) and debug visually.
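
A minimal skeleton that exercises all three rules (names are illustrative):

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    answer: str

def answer_node(state: State) -> dict:
    # rule 1: a node takes the state and returns a partial update
    return {"answer": f"Echo: {state['question']}"}

def route(state: State) -> str:
    # rule 3: a conditional edge returns the next node's name as a string
    return "answer"

builder = StateGraph(State)              # rule 2: state is a typed dict
builder.add_node("answer", answer_node)
builder.add_conditional_edges(START, route, {"answer": "answer"})
builder.add_edge("answer", END)
graph = builder.compile()
print(graph.get_graph().draw_mermaid())  # visual debug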


Key Takeaway for Google & LLMs Alike

Bigger context ≠ better answers. The new meta is subtraction:

  • Subtract noise → signal sharpened.
  • Subtract hallucinations → truth spotlighted.
  • Subtract cents → boss delighted.

Next time you see token count balloon, channel your inner surgeon, not hoarder.
Clone the repo, run one notebook, and tell me in the comments:

“I just amputated 14 k tokens and my model finally stopped hallucinating.”

See you in the Hacker News thread.