
Long-Term Memory for LLMs: How OpenMemory Solves the Goldfish Problem for Good

OpenMemory: Give Any AI a Private, Persistent & Explainable Long-Term Memory

In one line—OpenMemory is a self-hosted, MIT-licensed “memory engine” that turns LLMs from goldfish into elephants: they never forget user facts, yet can tell you exactly why they recalled something.


Core questions this post answers

  1. Why do vector DBs and chat-history caches fail at “getting smarter over time”?
  2. How does OpenMemory’s Hierarchical Memory Decomposition (HMD) work in plain English?
  3. Can you go from git clone to first recall in under 10 minutes?
  4. What does production look like for a personal assistant, an enterprise copilot and a LangGraph agent?
  5. Where do the 10 × cost savings come from without sacrificing latency or accuracy?

1. The goldfish problem: where today’s stacks drop the ball

| Pain | Symptom | Root cause in existing tools |
|---|---|---|
| Session amnesia | New chat → preferences gone | Context windows are short-lived |
| Vector glut | Same sentence stored 20×, key fact still missing | Flat embeddings, no structure |
| Black-box retrieval | “Why this chunk?”—no idea | No weights, no path, no explainability |
| Runaway cost | 2–3 USD per 1 M tokens | Hosted embedding + SaaS margin |

Personal anecdote: we once fed 30 days of support logs into Pinecone. When a user updated her shipping address, the bot returned three obsolete ones—cosine similarity ≠ business truth. That day I learned that “structure-free” is a feature until it isn’t.


2. OpenMemory’s brain map in one glance

Short answer: every memory is split into five “cognitive drawers”, linked by a sparse, biologically inspired graph. At query time four factors—similarity, salience, recency and link weight—are fused into a single score, so the engine is both fast and auditable.

2.1 The five drawers

| Drawer | Example | Embedding model |
|---|---|---|
| episodic | “User said he prefers dark roast last Wednesday” | E5-large |
| semantic | “Dark roast = low-acid coffee” | BGE-base |
| procedural | “Grind 18 g, 90 °C, 25 s pre-infuse” | OpenAI-3-small |
| emotional | “Customer angry about 40 min wait” | Gemini-text |
| reflective | “User likely values speed over small talk” | Ollama-nomic |

2.2 Single-waypoint graph

  • One canonical node per memory—zero duplication.
  • Directed edge = “activates next”; traversal stops at 1-hop → constant time (see the sketch after this list).
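
To make the idea concrete, here is a minimal sketch in Node.js of a single-waypoint graph with 1-hop recall. The names (MemoryNode, recallNeighbors) and the second node's id are illustrative assumptions, not OpenMemory's actual internals:

// Illustrative sketch only: one canonical node per memory, directed "activates"
// edges, and recall that follows at most one hop from the seed node.
class MemoryNode {
  constructor(id, content) {
    this.id = id;
    this.content = content;
    this.activates = new Map(); // targetId -> link weight
  }
}

function recallNeighbors(graph, seedId) {
  const seed = graph.get(seedId);
  if (!seed) return [];
  // 1-hop traversal: return directly activated nodes, strongest links first.
  return [...seed.activates.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([targetId, weight]) => ({ node: graph.get(targetId), weight }));
}

// Tiny usage example (hypothetical ids):
const graph = new Map();
const a = new MemoryNode('a7f83b', 'User prefers dark mode');
const b = new MemoryNode('c91d02', 'UI preference: dark themes');
a.activates.set(b.id, 0.7);
graph.set(a.id, a);
graph.set(b.id, b);
console.log(recallNeighbors(graph, 'a7f83b')); // one hop, one neighbour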

2.3 Four-factor ranking

Score = 0.6·cos_sim + 0.2·salience + 0.1·recency + 0.1·link_weight
Because the coefficients are baked into the response meta, you can explain any recall path to compliance teams—or curious users.
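
As a quick illustration, the fusion can be reproduced with the published coefficients. The function name and input shape below are my own (a sketch, not the engine's API); the four factors are the ones OpenMemory reports in its response meta:

// Sketch of the four-factor fusion using the coefficients from the formula above.
function fuseScore({ cosSim, salience, recency, linkWeight }) {
  return 0.6 * cosSim + 0.2 * salience + 0.1 * recency + 0.1 * linkWeight;
}

// Example: a close semantic match, high salience, fresh timestamp, moderate link.
console.log(fuseScore({ cosSim: 0.81, salience: 0.9, recency: 1.0, linkWeight: 0.5 }));
// ≈ 0.816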


3. Zero-to-recall in 6 commands

Short answer: install Node 20 → clone → tweak five env vars → npm run dev → curl to write → curl to search. Done.

3.1 Manual setup (dev favourite)

# 1. Clone
git clone https://github.com/caviraoss/openmemory.git
cd openmemory/backend
cp .env.example .env

# 2. Install
npm install

# 3. Switch to local embeddings (example: Ollama)
echo 'OM_EMBEDDINGS=ollama' >> .env
echo 'OLLAMA_URL=http://localhost:11434' >> .env

# 4. Start
npm run dev
# API now listens on http://localhost:8080

3.2 Write & retrieve

# Write a memory
curl -X POST http://localhost:8080/memory/add \
  -H "Content-Type: application/json" \
  -d '{"content":"User prefers dark mode"}'

# Query
curl -X POST http://localhost:8080/memory/query \
  -H "Content-Type: application/json" \
  -d '{"query":"UI preference"}'

Response

[
  {
    "id":"a7f83b",
    "content":"User prefers dark mode",
    "score":0.87,
    "path":"episodic→semantic"
  }
]

Lesson learned: I once set OM_MIN_SCORE=0.9 and got zero hits—cosine between “dark mode” and “UI preference” was 0.81. Dialing it back to 0.3 doubled recall overnight. Thresholds are knives—handle with care.
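
If you hit the same wall, the threshold is just another .env setting, tweaked the same way as the embedding config in step 3 above (0.3 is the value from the anecdote; tune it for your own data):

# Loosen the recall threshold
echo 'OM_MIN_SCORE=0.3' >> .env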


4. Three real-world patterns

Short answer: personal assistant remembers taste, enterprise copilot remembers SOP, LangGraph nodes auto-archive their own outputs—code snippets included.

4.1 Personal assistant—“never ask about cilantro again”

  • Write: detect negations (“hate / skip / no”) → save as episodic, salience 0.9.
  • Retrieve: before meal suggestions query="cilantro dislike" → filter menus.
  • Code (Node.js)
await fetch(`${OM_URL}/memory/add`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' }, // same header as the curl examples above
  body: JSON.stringify({
    content: 'User hates cilantro',
    sector: 'episodic',
    salience: 0.9
  })
});

4.2 Enterprise copilot—“onboard in 30 min”

  • Chunk 30-page expense PDF into procedural memories.
  • Retrieve: user types “file travel refund” → copilot queries the procedural sector → returns latest steps + template links (see the sketch below).
  • Result: average onboarding drops from 3 days to 30 minutes.
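
A possible retrieval call for this flow, reusing the /memory/query endpoint from section 3.2. Treat it as a sketch: whether the query body accepts a sector filter is an assumption on my part.

// Sketch: ask for the latest expense procedure. The `sector` filter in the
// query body is an assumption; the endpoint is the one shown in section 3.2.
const res = await fetch(`${OM_URL}/memory/query`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    query: 'file travel refund',
    sector: 'procedural'
  })
});
const steps = await res.json();
// Each hit carries content, score and recall path, ready to drop into the copilot prompt.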

4.3 LangGraph mode—agents that reflect on yesterday’s plan

Enable with env

OM_MODE=langgraph
OM_LG_NAMESPACE=finance_agent
OM_LG_MAX_CONTEXT=50
OM_LG_REFLECTIVE=true

Automatic mapping

| LangGraph node | Memory sector |
|---|---|
| observe | episodic |
| plan | semantic |
| reflect | reflective |
| act | procedural |
| emotion | emotional |

After a long-horizon task finishes, the memory layer already holds distilled lessons—swap prompts tomorrow, still profit.
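
Pulling those distilled lessons back into tomorrow's run could look like the sketch below. The sector filter and the query text are assumptions carried over from the earlier examples; OM_URL is the base used in section 4.1.

// Sketch: fetch yesterday's reflections before the next planning step.
const lessons = await fetch(`${OM_URL}/memory/query`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    query: 'lessons from the last long-horizon run',
    sector: 'reflective'
  })
}).then(r => r.json());
// Inject the recalled reflections into whichever prompt template you swap in tomorrow.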


5. Performance & cost: where the 10 × saving comes from

Short answer: local embeddings remove API tolls, zero-duplication keeps disks small, sparse graph traversal stays CPU-cheap, SQLite + blob store squeezes 1 M memories into ~15 GB.

| Metric | OpenMemory self-host | Zep Cloud | Supermemory | Mem0 |
|---|---|---|---|---|
| Query latency @ 100 k | 110–130 ms | 280–350 ms | 350–400 ms | 250 ms |
| Hosted embed cost / 1 M tokens | $0.30–0.40 | $2.0–2.5 | $2.50+ | $1.20 |
| Local models | ✅ Ollama/E5/BGE | partial | | |
| Monthly cost @ 100 k | $5–8 VPS | $80–150 | $60–120 | $25–40 |
| Explainable path | ✅ | | | |

After migrating 100 k memories off Zep, our monthly invoice fell to roughly $6.5 while latency improved—proof that architecture, not bargaining, is the biggest lever on cost.


6. Security & privacy—data never leaves your disk

  • Bearer token mandatory for write endpoints (example request after this list).
  • Optional AES-GCM field-level encryption.
  • Tenant isolation + physical DELETE /memory/:id.
  • Zero third-party clouds → GDPR & HIPAA paperwork shrinks.
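
In practice an authenticated write just carries the token in the Authorization header. This is a sketch: OM_TOKEN is my placeholder for wherever you store the bearer token, not necessarily the project's variable name.

// Sketch: authenticated write with a bearer token.
await fetch(`${OM_URL}/memory/add`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.OM_TOKEN}` // placeholder variable name
  },
  body: JSON.stringify({ content: 'User enabled two-factor auth' })
});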

7. Roadmap: shipping today, evolving tomorrow

| Release | Highlight | Status |
|---|---|---|
| v1.2 | React dashboard + metrics | in progress |
| v1.3 | Tiny transformer auto-sector | planned |
| v1.4 | Federated multi-node | planned |
| v1.5 | Pluggable pgvector / Weaviate | planned |

8. TL;DR checklist

  1. Grab any machine with Node 20 and 2 GB RAM.
  2. git clone → cp .env → npm install → npm run dev.
  3. Write one “user preference” memory, query to confirm path.
  4. Production: docker compose up -d, mount /data volume (see the compose sketch after this list).
  5. Turn on Bearer auth + schedule cron for decay pruning.
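
If you prefer to write your own compose file rather than use whatever the repo may ship, a minimal one could look like this. It is a sketch: the build path and the presence of a Dockerfile in backend/ are assumptions; the port and /data volume come from the sections above.

# Sketch only: paths and the backend Dockerfile are assumptions, not project facts.
services:
  openmemory:
    build: ./backend          # the directory you cd'd into during setup
    env_file: ./backend/.env
    ports:
      - "8080:8080"           # API port from section 3.1
    volumes:
      - ./data:/data          # persist memories outside the container
    restart: unless-stopped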

9. One-page summary

OpenMemory adds structured, explainable long-term memory to any LLM. Its five-drawer cognitive model plus single-waypoint graph delivers 110 ms recall at 1/10 the cost of cloud memory services. A built-in MCP server, LangGraph hooks and Docker one-liner make it production-ready for personal assistants, enterprise copilots and multi-agent systems—today.


10. FAQ

Q1: Do I have to use Ollama?
No—swap OM_EMBEDDINGS for openai, gemini, E5 or BGE.

Q2: Will SQLite choke at scale?
Benchmark shows <130 ms at 100 k memories; pgvector backend lands in v1.5 for tens of millions.

Q3: How do I prevent memory bloat?
Built-in decay scheduler prunes low-salience nodes automatically—no manual vacuuming.

Q4: Can I scale horizontally?
Today you can shard by sector manually; federated auto-scaling ships with v1.4.

Q5: What separates OpenMemory from Mem0?
Mem0 stores flat JSON; OpenMemory keeps a multi-sector graph with explainable recall and lower operational cost.

Q6: Is a GPU required?
Inference is CPU-only. Running a local 7 B embedding model benefits from 4 GB VRAM but is optional.

Q7: Does it ingest audio or images?
v1.1 accepts pdf, docx, txt and audio—transcribed before entering the memory pipeline.

Q8: License?
MIT—commercial use, closed-source forks and redistribution are all allowed.
