Evo-Memory: The streaming benchmark that forces LLM agents to learn at test time, not just remember
What makes an agent truly get better while it works?
A self-evolving memory that can retrieve, refine and reuse strategies across a never-ending task stream—Evo-Memory measures exactly that.
What problem is Evo-Memory trying to solve?
Core question: “Why do most LLM agents plateau even when they store every chat log?”
Short answer: Storing is not learning.
Static retrieval only replays facts; it never updates the policy. In long-horizon or goal-oriented streams the same type of sub-task appears again and again, but the agent treats each as brand new, burning tokens and steps.
Author’s reflection: After watching our household robot repeat “look for a cup” 20 times in one episode, we realized conversational recall is a bigger bottleneck than model size; hence Evo-Memory.
How Evo-Memory turns old datasets into a “skill curriculum”
Core question: “How do you convert static Q&A into a curriculum that rewards experience reuse?”
Summary: Re-order examples so earlier ones expose reusable tactics; force the agent to update memory after every single prediction; score not just correctness but efficiency and stability across orderings.
2.1 Formal setup
Agent = (F, U, R, C)
- F: frozen LLM
- R: retriever (dense or index)
- C: prompt builder
- U: memory updater (append, compress, replace)
At step t:
R_t ← R(M_t, x_t)        // search
C_t ← C(x_t, R_t)        // synthesis
ŷ_t ← F(C_t)             // predict
m_t ← h(x_t, ŷ_t, f_t)   // pack the experience tuple (f_t is the feedback signal)
M_{t+1} ← U(M_t, m_t)    // evolve
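To make the loop concrete, here is a minimal Python sketch of one streaming step. The (F, U, R, C) fields mirror the setup above; the callable signatures, field names and the `feedback_fn` hook are our own illustration, not the benchmark’s API.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Agent:
    F: Callable[[str], str]              # frozen LLM: context -> prediction
    U: Callable[[list, dict], list]      # memory updater: (M_t, m_t) -> M_{t+1}
    R: Callable[[list, str], list]       # retriever: (M_t, x_t) -> retrieved chunks
    C: Callable[[str, list], str]        # prompt builder: (x_t, R_t) -> context

def stream_step(agent: Agent, memory: list, x_t: str,
                feedback_fn: Callable[[str, str], Any]) -> list:
    """One search -> predict -> evolve step of the task stream."""
    r_t = agent.R(memory, x_t)                           # search
    c_t = agent.C(x_t, r_t)                              # synthesis
    y_hat = agent.F(c_t)                                 # predict with the frozen LLM
    f_t = feedback_fn(x_t, y_hat)                        # grader / environment feedback
    m_t = {"task": x_t, "trace": y_hat, "flag": f_t}     # experience tuple h(x_t, ŷ_t, f_t)
    return agent.U(memory, m_t)                          # evolve: M_{t+1}
```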
2.2 Streaming transformation example
Original AIME-24: 30 random algebra problems.
Evo-stream:
- quadratics with real roots → teach quadratic formula
- quadratics with complex roots → reuse formula, add discriminant check
- cubics disguised as quadratics → reuse formula, add substitution step
If the agent fails to write the formula into memory at step 1, steps 2-3 expose the penalty clearly.
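One plausible way to build such a stream from a static dataset is to cluster problems by embedding similarity and emit one cluster (one “skill block”) at a time, so that the first problem in a block seeds a tactic the rest can reuse. The sketch below uses k-means for illustration; the actual pipeline (cluster embedding + manual shuffle, per the checklist later) may differ, and all names here are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def evo_stream_order(problems: list[str], embeddings: np.ndarray,
                     n_strategies: int = 5) -> list[str]:
    """Reorder a static dataset so same-strategy problems arrive consecutively."""
    labels = KMeans(n_clusters=n_strategies, n_init=10,
                    random_state=0).fit_predict(embeddings)
    stream = []
    for c in range(n_strategies):                  # one skill block per cluster
        stream.extend(p for p, lab in zip(problems, labels) if lab == c)
    return stream                                  # early items seed tactics later items can reuse
```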
ExpRAG: the cheapest possible experience-reuse baseline
Core question: “What’s the absolute minimum you need to improve without touching the LLM weights?”
Summary: Log every solved task as a tiny narrative (input, output, feedback). At inference, retrieve k narratives and prepend them as in-context exemplars—nothing else changes.
3.1 Template for one experience chunk
[Task] Find a cool tomato and microwave it.
[Actions] go fridge → take tomato → microwave tomato
[Result] SUCCESS (temperature goal met)
3.2 Usage snippet (Python-like pseudo code)
import json  # needed for serialising experience records

# `memory`, `llm`, `system`, the solved task (x, y_hat, f), the new query
# `x_query`, and its eventual `feedback` flag are assumed to exist already.
memory.add(json.dumps({"task": x, "trace": y_hat, "flag": f}))    # log a solved task
topk = memory.sim_search(x_query, k=4)                            # retrieve k similar experiences
prompt = system + "\nSimilar past experiences:\n" + "\n".join(topk)
prompt += "\nNow solve:\n" + x_query
answer = llm.generate(prompt)
memory.add(json.dumps({"task": x_query, "trace": answer, "flag": feedback}))  # close the loop
3.3 Benchmark result snapshot
Gemini-2.5 Flash, single-turn average exact-match:
- No-memory baseline: 0.54
- ExpRAG (k=4): 0.60 (+11%)
AlfWorld step efficiency: 22.6 → 17.5 (−22%)
Author’s reflection: We open-sourced this one-page script first; within a week three startups told us it cut their support-bot step-count by 15–30%. Sometimes “dumb but explicit” is the best MVP.
ReMem: letting the agent edit its own memory at runtime
Core question: “How can the agent decide what to remember, compress or discard while it works?”
Summary: Add a third action—Refine—so the LLM can interleave “thinking about the task” and “thinking about its own memory”. This turns memory into a mutable object rather than static context.
4.1 Tri-action loop
- Think – produce internal reasoning traces
- Act – emit environment command or final answer
- Refine – retrieve, prune, reorder, compress memory entries
The agent can loop Think↔Refine arbitrarily before committing to Act, forming a lightweight MDP that shares the same LLM backbone.
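Below is a minimal sketch of how the tri-action loop can be wired around a frozen LLM. The `memory` and `env` interfaces, the prefix-based mode parsing, and the per-turn Refine budget (3, mirroring the checklist later on) are illustrative assumptions rather than the paper’s implementation.

```python
import json

MAX_REFINES = 3  # assumed per-turn Refine budget

def parse_mode(reply: str) -> tuple[str, str]:
    """Split a reply like 'Refine: {...}' into its mode tag and body."""
    mode, _, body = reply.partition(":")
    return mode.strip(), body.strip()

def remem_turn(llm, memory, env, observation: str, system_prompt: str):
    """Run one Think/Act/Refine turn and return the environment result."""
    scratchpad, refines = [], 0
    while True:
        reply = llm.generate(system_prompt, memory.search(observation, k=4),
                             scratchpad, observation)
        mode, body = parse_mode(reply)
        if mode == "Think":
            scratchpad.append(body)                  # keep the reasoning trace in context
        elif mode == "Refine" and refines < MAX_REFINES:
            memory.apply_edits(json.loads(body))     # {"keep": [...], "delete": [...], ...}
            refines += 1
        else:                                        # "Act", or Refine budget exhausted
            return env.step(body)                    # environment command or final answer
```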
4.2 Walk-through in ALFWorld episode
Goal: “put a hot apple in the fridge”
Step 1: Think → “need heat source”
Step 2: Refine → search memory for “microwave”; prune obsolete “stove” entry
Step 3: Act → go microwave
Step 4: Think → “now need to cool it”
Step 5: Refine → create new entry “hot→fridge = cooldown”
Outcome: success in 9 steps vs 19 for vanilla ReAct.
4.3 Key implementation detail
Refine mode uses the same generation budget: we simply swap the system prompt to:
You are in "memory-edit" mode. Given the current task and retrieved chunks,
output JSON: {"keep":[...], "delete":[...], "merge":[...], "add":[...]}
The JSON is parsed, applied, and the conversation continues; no extra model is needed.
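For concreteness, here is a sketch of what “parsed and applied” could look like over a simple id → text dict memory. The four edit fields follow the prompt above; the shapes of the merge/add elements and the helper name are assumptions.

```python
import json
import uuid

def apply_memory_edits(memory: dict[str, str], raw_reply: str) -> dict[str, str]:
    """Apply a {"keep", "delete", "merge", "add"} edit emitted in memory-edit mode."""
    edits = json.loads(raw_reply)
    kept = set(edits.get("keep", [])) or set(memory)       # empty "keep" = keep all non-deleted ids
    kept -= set(edits.get("delete", []))
    new_memory = {mid: memory[mid] for mid in kept if mid in memory}
    for group in edits.get("merge", []):                    # assumed shape: {"ids": [...], "text": "..."}
        for mid in group.get("ids", []):
            new_memory.pop(mid, None)
        new_memory[f"merged-{uuid.uuid4().hex[:8]}"] = group["text"]
    for text in edits.get("add", []):                       # assumed shape: list of new entry strings
        new_memory[f"added-{uuid.uuid4().hex[:8]}"] = text
    return new_memory
```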
Benchmark scoreboard: single-turn reasoning & tool use
Average across AIME-24, AIME-25, GPQA-Diamond, MMLU-Pro (Econ/Eng/Philos), ToolBench
| Method | Exact-Match ↑ | API Acc ↑ |
|---|---|---|
| Zero-shot | 0.54 | 0.61 |
| SelfRAG | 0.55 | 0.63 |
| ExpRAG | 0.60 | 0.72 |
| ReMem | 0.65 | 0.71 |
ReMem leads overall; ExpRAG already outperforms heavier workflows like Dynamic-Cheatsheet or AWM, suggesting that explicit task-level retrieval is seriously under-explored.
Benchmark scoreboard: multi-turn embodied environments
Claude-3.7-Sonnet results (Success / Progress)
| Environment | History | ReAct | ExpRAG | ReMem |
|---|---|---|---|---|
| AlfWorld | 0.50 / 0.73 | 0.51 / 0.75 | 0.74 / 0.89 | 0.92 / 0.96 |
| BabyAI | 0.48 / 0.66 | 0.57 / 0.72 | 0.62 / 0.72 | 0.73 / 0.83 |
| PDDL | 0.65 / 0.84 | 0.75 / 0.91 | 0.72 / 0.89 | 0.83 / 0.95 |
| ScienceWorld | 0.32 / 0.74 | 0.44 / 0.77 | 0.46 / 0.76 | 0.62 / 0.89 |
Step efficiency (median steps per solved task):
AlfWorld – History: 22.6 | ReMem: 11.5
ScienceWorld – History: 20.5 | ReMem: 14.0
Author’s reflection: When you see a 50% step reduction without fine-tuning, that’s not an algorithmic nicety—it’s a straight compute-cost win.
What happens when task order gets harder (or easier)?
Core question: “Does self-evolving memory still help if the curriculum suddenly spikes in difficulty?”
Summary: ReMem stays stable; brittle baselines drop up to 30 points when easy→hard. Memory refinement acts like a buffer that keeps transferable bits and discards narrow tricks.
| Sequence Direction | History S / P | ReMem S / P |
|---|---|---|
| Easy → Hard | 0.41 / 0.74 | 0.77 / 0.92 |
| Hard → Easy | 0.49 / 0.74 | 0.81 / 0.94 |
Take-away: Curriculum designers can safely mix difficulties; the agent will prune what doesn’t generalize.
Learning from failures: noise-robustness test
Core question: “If both good and bad trajectories enter memory, does performance crash?”
Setup: Feed agent its own successes and failures, no filter.
Outcome: Baseline methods drop 10–18 points; ReMem barely moves because Refine mode actively deletes misleading entries.
Practical note: In production you rarely have perfect success labels; a self-cleaning memory is essential.
Does gain vanish as the stream gets longer?
Cumulative success rate after 150 AlfWorld tasks:
- History baseline: plateaus at ~0.55
- ReMem: keeps climbing to 0.90
Same trend on BabyAI, PDDL, ScienceWorld.
Conclusion: continual reflection does not saturate over the horizon tested (≈150 tasks, 2k steps).
Action Checklist / Implementation Steps
- Pick your dataset → reorder by “strategy reuse” (cluster embedding + manual shuffle).
- Stand up a vector DB → insert each solved task as {query, answer, feedback} text.
- Build two prompt templates:
  - ExpRAG: prepend the top-4 similar rows as few-shot exemplars.
  - ReMem: add the “Think/Act/Refine” system prompt + JSON-based memory editor.
- Run the identical search–predict–evolve loop for both; log steps, success, tokens.
- Measure step efficiency first; it is the cheapest proxy for real-world API cost.
- Graduate to ReMem once ExpRAG plateaus; give the model a budget of 3 Refine calls per turn.
- Periodically dump memory → compute cluster similarity → prune the bottom 20% (see the sketch below).
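The final pruning step could look roughly like the sketch below, assuming each memory entry has already been embedded. Centroid similarity stands in for “cluster similarity” and the 20% cut follows the checklist; everything else (names, the brute-force NumPy math) is an illustrative choice.

```python
import numpy as np

def prune_memory(entries: list[str], embeddings: np.ndarray,
                 drop_fraction: float = 0.2) -> list[str]:
    """Drop the entries least similar to the memory centroid (bottom 20% by default)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # cosine-normalise rows
    centroid = emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = emb @ centroid                              # similarity of each entry to the centroid
    keep_n = max(1, int(len(entries) * (1 - drop_fraction)))
    keep_idx = np.argsort(sims)[::-1][:keep_n]         # most central entries survive
    return [entries[i] for i in sorted(keep_idx)]      # preserve original ordering
```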
One-page Overview
- Storing chat logs ≠ learning.
- Evo-Memory rewrites old benchmarks into skill streams; earlier tasks seed reusable tactics.
- ExpRAG = vector retrieve + few-shot; zero extra training, +11% accuracy, −22% steps.
- ReMem adds a “Refine” action letting the LLM edit its own memory; climbs to 0.92 success while halving steps.
- Gains correlate with task similarity; cluster your domain before investing in memory.
- Both methods run on frozen LLMs: no fine-tuning, no extra GPU.
FAQ
- Do I need to retrain the LLM?
  No. All experiments use frozen Gemini-2.5 or Claude checkpoints; only embeddings and memory change.
- What embedding model should I start with?
  The paper uses BAAI/bge-base-en-v1.5; any multilingual retriever of similar size works (a minimal retrieval sketch follows this FAQ).
- How big can the memory pool grow before latency explodes?
  The authors tested up to 100k entries; top-k retrieval stays under 50 ms on a single CPU.
- Is task reordering mandatory?
  Gains are larger with smart ordering, but ReMem still beats baselines on random streams.
- Can Refine mode hallucinate bad edits?
  Empirically this is rare; Refine decisions are constrained by the explicit JSON schema and similarity scores.
- Does this help smaller models?
  Yes. Gemma-2-9B with ReMem outperforms the Gemma-2-27B history baseline in their ablation.
- What if feedback signals are delayed or noisy?
  ReMem’s prune logic uses both the success flag and embedding distance, so delayed labels can be back-filled later.
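For reference, a minimal dense-retrieval sketch using the embedding model named above. The brute-force NumPy search is our own illustrative choice rather than the paper’s retrieval stack, but it is the kind of setup that keeps top-k lookups fast even over a pool of ~100k entries.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")

def build_index(memories: list[str]) -> np.ndarray:
    # Normalised embeddings: cosine similarity reduces to a dot product.
    return encoder.encode(memories, normalize_embeddings=True)

def top_k(query: str, index: np.ndarray, memories: list[str], k: int = 4) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                                   # one matrix-vector product over the pool
    kth = min(k, len(memories)) - 1
    best = np.argpartition(-scores, kth)[:k]             # unordered top-k candidates
    return [memories[i] for i in best[np.argsort(-scores[best])]]
```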

