Why Your AI Agent Keeps Forgetting—and How to Give It a Human-Like Memory
> Audience: anyone with a basic college-level grasp of computer science or product management who wants to build AI agents that remember what users said last week and forget what is no longer useful.
> Reading time: ≈ 18 min (≈ 3,200 words)
> Take-away: a plain-language map of how “memory” really works inside stateless large language models, why the usual “just add more text” approach breaks, and the minimum toolkit you need to keep, update, and delete information without blowing up latency or cost.
1. The Amnesia Problem: A Fresh Start on Every Click
Large language models (LLMs) are stateless: each call is an isolated transaction.
Feed the model a prompt, get an answer, close the connection.
The model weights store parametric knowledge (facts seen during training), but not what the user told you five minutes ago.
If you want an agent that:
- recalls your dog’s name in the next session
- stops suggesting a product you already returned
- knows you moved from Beijing to Shanghai without asking again
you have to bolt on an external memory system and manage it yourself.
2. Vocabulary First: Memory vs. Memories vs. Agentic Memory
| Term | Meaning | Example |
|---|---|---|
| Memory | The whole “encode-store-retrieve” mechanism | A vector database + code that writes/reads it |
| Memories | Individual pieces of stored information | “The user prefers blue” saved as a row |
| Agentic Memory | The agent itself decides when to store/change/delete | Agent calls a write_memory(...) tool mid-chat |
> Think of Memory as the library, Memories as the books, and Agentic Memory as a librarian who can shelve or discard books without human help.
3. Two Ways to Slice the Elephant: Human-Centric vs. Code-Centric
3.1 Human-Centric Stack (CoALA Paper)
Inspired by cognitive science:
| Type | Holds | Human Example | Agent Example |
|---|---|---|---|
| Working | Current context | Words in a chat | Last 4k tokens in the prompt |
| Semantic | Facts | Water boils at 100 °C | User’s cat is called Mochi |
| Episodic | Events | First bike crash at 8 | Agent failed to add 1+1 last turn |
| Procedural | How-to | Tie shoelaces | “Always ask clarifying questions before answering” |
3.2 Code-Centric Stack (Letta Design)
Treats the LLM as a token-in/token-out function, not a brain:
| Module | Where | Purpose |
|---|---|---|
| Message Buffer | Context window | Raw last-N messages |
| Core Memory | Context window, editable | High-importance key/value pairs the agent can mutate |
| Recall Memory | External DB | Full conversation logs, searchable |
| Archival Memory | External DB | Condensed, structured facts (summaries, triplets, tables) |
Mapping the two views:
- CoALA Working ≈ Letta Buffer + Core
- CoALA long-term types map roughly to Letta Recall + Archival, but not one-to-one
- Letta keeps raw history in Recall, something CoALA does not explicitly include
Pick whichever metaphor helps you sleep at night; the code ends up doing the same four things: read, write, update, delete.
4. The Data Journey: Where Bits Live and How They Travel
1. User types a message
2. Message lands in short-term memory (the context window)
3. Agent decides: “Will I need this later?”
   - Yes → calls a tool → writes to the long-term DB
   - No → stays in the buffer; may be summarised or dropped later
4. Next session: a retriever fuses relevant memories back into the prompt
5. Repeat
> Memory management is traffic control between the small, fast RAM (the context window) and the large, slow disk (the external store).
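The traffic-control loop above can be sketched in a few lines. This is a minimal illustration, not a production design: the `needs_long_term` heuristic is a hypothetical stand-in for the LLM-driven decision, and plain lists stand in for the context window and the external store.

```python
# Minimal sketch of the short-term/long-term traffic loop.
short_term: list[str] = []   # context-window stand-in
long_term: list[str] = []    # external-DB stand-in
MAX_BUFFER = 4               # pretend token budget, counted in messages

def needs_long_term(msg: str) -> bool:
    # crude stand-in for "Will I need this later?" — a real agent
    # would ask the LLM or a classifier here
    return msg.lower().startswith("my ")

def handle(msg: str) -> None:
    short_term.append(msg)
    if needs_long_term(msg):
        long_term.append(msg)      # write-through to the long-term store
    while len(short_term) > MAX_BUFFER:
        short_term.pop(0)          # drop (or summarise) the oldest message

for m in ["hi", "my dog is called Rex", "ok", "thanks", "bye"]:
    handle(m)

print(short_term)  # only the most recent messages survive
print(long_term)   # durable facts persist across the buffer cutoff
```

The point of the sketch: the buffer forgets by construction, so anything worth keeping must be copied out before it scrolls away.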
5. Short-Term Tricks: Staying inside the Token Limit
| Technique | Pros | Cons |
|---|---|---|
| Naïve rolling window | Three lines of code | Cuts may remove system instructions |
| Summarisation loop | 70 % token savings | Errors accumulate |
| Hierarchical budget | Assign slices: 20 % system, 30 % memory, 50 % chat | Needs tuning per model |
Rule of thumb: keep only what improves the next answer; everything else is noise.
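A hedged sketch of the budgeted rolling window from the table above. Word counts stand in for tokens (swap in a real tokenizer in production), and the function name `trim_history` is illustrative:

```python
def trim_history(messages: list[str], budget: int, system: str) -> list[str]:
    """Keep the system prompt, then as many recent messages as fit the budget."""
    used = len(system.split())              # word count as a crude token proxy
    kept: list[str] = []
    for msg in reversed(messages):          # walk newest-first
        cost = len(msg.split())
        if used + cost > budget:
            break                           # oldest messages fall off first
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))  # restore chronological order

history = ["old small talk " * 10, "I moved to Shanghai", "what about rent?"]
prompt = trim_history(history, budget=20, system="You are a helpful agent.")
# system prompt always survives; the long small-talk message is dropped
```

Note how this window never cuts the system prompt — the failure mode the naïve version invites — because the system slice is reserved before any chat messages are admitted.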
6. Long-Term Housekeeping: ADD, UPDATE, DELETE, NO-OP
| Operation | When | Example |
|---|---|---|
| ADD | New fact | “I just got a Golden Retriever” |
| UPDATE | Fact changes | Address updated to “Shanghai, Pudong” |
| DELETE | Fact expires | User deletes account |
| NO-OP | Nothing useful | Generic chit-chat “haha ok” |
Implementation tips
- Use unique composite keys (user_id + fact_type) to avoid duplicates
- Add a timestamp and a confidence score; later you can TTL or re-confirm low-confidence rows
- Make DELETE a first-class API; GDPR and China’s PIPL both require real deletion, not soft flags
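The tips above can be sketched over SQLite. This is one possible schema, not a prescribed one: the composite primary key makes ADD vs. UPDATE a single UPSERT, and DELETE is a real row removal rather than a soft flag.

```python
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE memories (
    user_id    TEXT,
    fact_type  TEXT,
    value      TEXT,
    confidence REAL,
    updated_at REAL,
    PRIMARY KEY (user_id, fact_type))""")   # composite key blocks duplicates

def upsert(user_id: str, fact_type: str, value: str, confidence: float = 0.9):
    # ADD and UPDATE collapse into one statement thanks to the composite key
    db.execute("""INSERT INTO memories VALUES (?,?,?,?,?)
        ON CONFLICT(user_id, fact_type) DO UPDATE SET
            value = excluded.value,
            confidence = excluded.confidence,
            updated_at = excluded.updated_at""",
        (user_id, fact_type, value, confidence, time.time()))

def delete_user(user_id: str) -> None:
    # hard delete — what GDPR/PIPL-style requests actually require
    db.execute("DELETE FROM memories WHERE user_id = ?", (user_id,))

upsert("alice", "address", "Beijing")
upsert("alice", "address", "Shanghai, Pudong")   # UPDATE, not a second row
row = db.execute("SELECT value FROM memories WHERE user_id='alice'").fetchone()

delete_user("alice")
remaining = db.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
```

The timestamp and confidence columns are what later enable TTL-based pruning and re-confirmation of stale rows.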
7. Hot Path vs. Background: When Do You Write?
7.1 Hot Path (Explicit)
Agent calls a tool during the conversation.
- ✅ Immediate consistency
- ❌ Easy to spam the DB with low-value facts
7.2 Background (Implicit)
A job runs after the session or on a schedule.
- ✅ Allows heavier NLP (coreference resolution, contradiction checks)
- ❌ The user may come back before the job finishes → stale data
Hybrid pattern (used by most commercial bots)
- High-signal slots (email, phone, allergy, address) → hot path
- Soft interests (likes jazz, prefers blue) → background batch
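A minimal sketch of that routing decision. The fact-type names mirror the examples above, but the split and the queue are illustrative, not a fixed schema:

```python
from collections import deque

HOT_TYPES = {"email", "phone", "allergy", "address"}  # high-signal slots
hot_store: dict[str, str] = {}                        # written immediately
background_queue: deque = deque()                     # batched for a later job

def route_fact(fact_type: str, value: str) -> str:
    if fact_type in HOT_TYPES:
        hot_store[fact_type] = value             # hot path: consistent now
        return "written"
    background_queue.append((fact_type, value))  # soft interest: process later
    return "queued"

route_fact("allergy", "peanuts")   # safety-critical → written on the spot
route_fact("interest", "jazz")     # nice-to-have → background batch
```

In practice the background job would drain the queue on a schedule, run the heavier NLP (coreference, contradiction checks), and only then commit to the store.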
8. Storage Menu: Where Do You Actually Put Memories?
| Medium | Best For | Notes |
|---|---|---|
| Python list | Quick demo | Volatile, fine for single-turn |
| Text / Markdown | Persona files | Keep under version control |
| Relational DB | Structured facts | Use for exact lookups (email, order_id) |
| Vector DB | Similarity search | Good for open-ended interests |
| Graph DB | Multi-hop relations | Friends, family, supply-chain |
> You can mix: a relational table for addresses, a vector collection for taste descriptions, a graph for social links.
9. Mini Code Lab: A Runnable Sketch
The snippet is framework-agnostic; swap in mem0, Letta, or your own REST layer.
```python
from mem0 import Memory  # pip install mem0 — exact signatures may vary by version

m = Memory(user_id="alice")

# ---- HOT-PATH WRITE ----
def handle_user_message(text: str) -> str:
    # crude keyword rule; a real agent would let the LLM decide via a tool call
    if "my address" in text.lower():
        m.add(text, metadata={"type": "address"})
        return "Got it, saved your address."
    return "OK, noted."

# ---- RETRIEVE NEXT SESSION ----
def build_system_prompt() -> str:
    memories = m.search(query="address", top_k=2)
    snippet = "\n".join(mem["text"] for mem in memories)
    return f"Relevant facts:\n{snippet}\nAnswer politely."

# ---- quick test ----
if __name__ == "__main__":
    print(handle_user_message("My address is 5th Floor, 999 Nanjing Road, Shanghai"))
    print(build_system_prompt())
```
Latency hack: if your traffic is high, call build_system_prompt() asynchronously and cache the result for five minutes.
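One way to implement that five-minute cache — a sketch only, where `fetch_prompt` is a stand-in for your real `build_system_prompt()` and the cache is a plain process-local dict:

```python
import time

_cache: dict[str, tuple[float, str]] = {}  # user_id -> (stored_at, prompt)
TTL_SECONDS = 300                          # five-minute freshness window

def cached_prompt(user_id: str, fetch_prompt) -> str:
    now = time.monotonic()
    hit = _cache.get(user_id)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]                 # fresh enough: skip retrieval entirely
    prompt = fetch_prompt(user_id)    # slow path: vector search, reranking, ...
    _cache[user_id] = (now, prompt)
    return prompt

calls = []
def fake_fetch(uid: str) -> str:
    calls.append(uid)                 # count how often the slow path runs
    return f"facts for {uid}"

a = cached_prompt("alice", fake_fetch)
b = cached_prompt("alice", fake_fetch)  # second call served from cache
```

The trade-off is the one named in the hot-path section: a cached prompt can be up to five minutes stale, so invalidate the entry whenever a hot-path write lands for that user.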
10. Failure Stories: What Happens When You Skip UPDATE or DELETE
| Case | Symptom | Root Cause |
|---|---|---|
| E-commerce bot | Asks for address three times | Only INSERT, no UPDATE |
| Health coach | Recommends peanuts after allergy reported | No contradiction check |
| Companion bot | Becomes slower every day | Never summarises or deletes, context hits 32 k tokens |
> Memory bloat feels like “helpfulness” at first, then turns into sludge.
11. Metrics That Matter: How to Know Your Memory Works
| KPI | Definition | Target |
|---|---|---|
| Hit rate | Fraction of user queries answered using a memory | > 60 % |
| Accuracy | Retrieved memories are correct & current | > 95 % |
| P50 latency | Extra time spent on retrieval + ranking | < 600 ms |
| Compliance | User delete requests executed within SLA | 100 % |
| Cost ratio | Memory-related tokens ÷ total tokens | < 15 % |
Log false positives (used memory but wrong) and false negatives (needed memory but missed) weekly; they guide your next summarisation or embedding tweak.
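The first two KPIs fall straight out of a per-query log. A sketch with illustrative field names — your logging schema will differ:

```python
# One record per user query: did we retrieve a memory, and was it correct?
log = [
    {"used_memory": True,  "correct": True},
    {"used_memory": True,  "correct": False},  # false positive: used but wrong
    {"used_memory": False, "correct": None},   # possible false negative
    {"used_memory": True,  "correct": True},
]

# Hit rate: fraction of queries answered using a memory
hit_rate = sum(r["used_memory"] for r in log) / len(log)

# Accuracy: of the memories we did use, how many were correct and current
used = [r for r in log if r["used_memory"]]
accuracy = sum(r["correct"] for r in used) / len(used)

print(f"hit rate {hit_rate:.0%}, accuracy {accuracy:.0%}")
```

Records where `used_memory` is false are the pool to audit weekly for false negatives: queries that needed a memory but missed.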
12. Current Hardest Problems (2025)
1. Latency vs. accuracy
   - Vector search + a reranker gives quality but adds 200–800 ms
   - Mitigation: local cache, approximate search, async prefetch
2. Automated forgetting
   - Time-to-live is easy but blunt
   - Contradiction detection needs an extra model → cost and complexity
   - Regulatory pressure is rising; you can’t just “soft delete” anymore
3. Multi-user safety
   - Alice’s memories must never appear in Bob’s prompt
   - Row-level security + prompt-injection guardrails are mandatory
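The multi-user rule can be enforced with one invariant: every read path is parameterised by the caller's user_id, server-side, before anything reaches the prompt. A toy sketch with a dict standing in for a real database with row-level security:

```python
# (user_id, fact_type) -> value; a stand-in for a real store
store = {
    ("alice", "pet"): "dog named Rex",
    ("bob",   "pet"): "cat named Mochi",
}

def read_memories(user_id: str) -> list[str]:
    # the filter lives in code/the database, never in the prompt itself,
    # so a prompt injection cannot widen the scope
    return [v for (uid, _), v in store.items() if uid == user_id]

alice_view = read_memories("alice")  # Bob's rows are unreachable by construction
```

In a real system the same idea is row-level security in the database plus per-tenant API keys, so even a compromised retrieval query cannot cross user boundaries.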
13. Framework Cheat-Sheet (Open-Source)
| Name | Core Pitch | URL |
|---|---|---|
| mem0 | Drop-in memory layer, multi-user | mem0.ai |
| Letta | Explicit Core/Recall/Archival blocks | letta.com |
| Cognee | Data pipelines + auto-summaries | cognee.ai |
| Zep | Long-term + GDPR-compliant forgetting | getzep.com |
| LangGraph | Graph-based memory flows | docs.langchain.com |
| Google ADK | Official SDK for session + memory | google.github.io/adk-docs |
Selection checklist
- ✔ Hot-path tool-calling API
- ✔ Real UPDATE/DELETE, not just append
- ✔ Per-user isolation & encryption at rest
- ✔ Hosted EU/China nodes if you serve those regions
14. Roadmap: From Zero to Production-Grade
Week 1
- Run the ~30-line sketch above; store five fact types

Week 2
- Add UPDATE and DELETE endpoints; build a simple UI for “see / edit / erase”

Week 3
- Deploy a background job that summarises daily chats and prunes low-confidence rows

Week 4
- Instrument logging (hit rate, latency, error class); set alerts

Month 2
- A/B test hot path vs. background for your top two fact types
- Tune retrieval (top-k, rerank threshold, embedding model size)

Month 3
- Pen-test and run a compliance audit (GDPR, PIPL, CCPA)
- Document your retention schedule; regulators love paper trails
15. Key Takeaways (Print-and-Stick Version)
- A stateless LLM is not an amnesia sentence; memory is engineered, not magic.
- Separate short-term (fast, small) from long-term (slow, big) and define a transfer policy.
- Give users an explicit delete; it’s the law almost everywhere now.
- Measure hit rate, accuracy, latency, and cost; everything else is vanity.
- Start with a hybrid hot-path + background pipeline; you can always shift the knob later.
> Build agents that remember what matters and forget what doesn’t, and your users will finally stop asking, “Why can’t it remember I already told you that?”
