Evo-Memory: The streaming benchmark that forces LLM agents to learn at test time, not just remember

What makes an agent truly get better while it works?
A self-evolving memory that can retrieve, refine and reuse strategies across a never-ending task stream—Evo-Memory measures exactly that.


What problem is Evo-Memory trying to solve?

Core question: “Why do most LLM agents plateau even when they store every chat log?”

Short answer: Storing is not learning.
Static retrieval only replays facts; it never updates the policy. In long-horizon or goal-oriented streams the same type of sub-task appears again and again, but the agent treats each as brand new, burning tokens and steps.

Author’s reflection: After watching our household robot repeat “look for a cup” 20 times in one episode, we realized that experience recall, not model size, was the bigger bottleneck; hence Evo-Memory.


How Evo-Memory turns old datasets into a “skill curriculum”

Core question: “How do you convert static Q&A into a curriculum that rewards experience reuse?”

Summary: Re-order examples so earlier ones expose reusable tactics; force the agent to update memory after every single prediction; score not just correctness but efficiency and stability across orderings.

2.1 Formal setup

Agent = (F, U, R, C)

  • F: frozen LLM
  • U: memory updater (append, compress, replace)
  • R: retriever (dense or index)
  • C: prompt builder

At step t:

Rt  ← R(Mt, xt)           // search  
Ct  ← C(xt, Rt)           // synthesis  
ŷt  ← F(Ct)               // predict  
mt  ← h(xt, ŷt, ft)       // experience tuple  
Mt+1 ← U(Mt, mt)          // evolve
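
A minimal Python sketch of one pass through this loop, with a toy lexical retriever standing in for R and an append-only updater for U (all names are illustrative, not the benchmark's API):

from dataclasses import dataclass, field

@dataclass
class Experience:
    task: str      # x_t
    trace: str     # ŷ_t
    flag: bool     # feedback f_t

@dataclass
class Memory:
    entries: list = field(default_factory=list)
    def search(self, query, k=4):
        # toy R: rank by word overlap; a real retriever would use dense embeddings
        overlap = lambda e: len(set(query.split()) & set(e.task.split()))
        return sorted(self.entries, key=overlap, reverse=True)[:k]
    def update(self, exp):
        self.entries.append(exp)   # simplest U: append; could compress or replace instead

def step(llm, grade, memory, x_t):
    retrieved = memory.search(x_t)                                # R(M_t, x_t)
    prompt = "\n".join(e.trace for e in retrieved) + "\n" + x_t   # C(x_t, R_t)
    y_hat = llm(prompt)                                           # F(C_t), frozen model
    memory.update(Experience(x_t, y_hat, grade(x_t, y_hat)))      # m_t, then U(M_t, m_t)
    return y_hat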

2.2 Streaming transformation example

Original AIME-24: 30 random algebra problems.
Evo-stream:

  1. quadratics with real roots → teach quadratic formula
  2. quadratics with complex roots → reuse formula, add discriminant check
  3. cubics disguised as quadratics → reuse formula, add substitution step

If the agent fails to write the formula into memory at step 1, steps 2-3 expose the penalty clearly.
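
The exact re-ordering recipe is not spelled out above, but a rough sketch of the idea (cluster problems by embedding so each "skill" arrives as a contiguous run) could look like this; bge-base is the retriever mentioned in the FAQ, while the ordering heuristic is ours:

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def make_evo_stream(problems, n_skills=5):
    """Group problems by latent skill and emit one cluster after another,
    so early items seed tactics that later items in the same cluster can reuse."""
    encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")
    emb = encoder.encode(problems, normalize_embeddings=True)
    labels = KMeans(n_clusters=n_skills, n_init=10, random_state=0).fit_predict(emb)
    stream = []
    for skill in range(n_skills):
        idx = np.where(labels == skill)[0]
        centroid = emb[idx].mean(axis=0)
        # within a cluster, surface the most "canonical" (centroid-near) problems first
        order = idx[np.argsort(-emb[idx] @ centroid)]
        stream.extend(problems[i] for i in order)
    return stream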


ExpRAG: the cheapest possible experience-reuse baseline

Core question: “What’s the absolute minimum you need to improve without touching the LLM weights?”

Summary: Log every solved task as a tiny narrative (input, output, feedback). At inference, retrieve k narratives and prepend them as in-context exemplars—nothing else changes.

3.1 Template for one experience chunk

[Task] Find a cool tomato and microwave it.
[Actions] go fridge → take tomato → microwave tomato
[Result] SUCCESS (temperature goal met)
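
A tiny helper for rendering an experience into this template (an illustrative formatter, not the benchmark's code) could be:

def to_chunk(task, actions, success, note=""):
    """Serialize one solved (or failed) task into the narrative format above."""
    result = "SUCCESS" if success else "FAILURE"
    if note:
        result += f" ({note})"
    return f"[Task] {task}\n[Actions] {' → '.join(actions)}\n[Result] {result}"

# e.g. to_chunk("Find a cool tomato and microwave it.",
#               ["go fridge", "take tomato", "microwave tomato"],
#               success=True, note="temperature goal met")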

3.2 Usage snippet (Python-like pseudo code)

import json

# `memory` (vector store with add / sim_search) and `llm` are assumed wrappers.
topk    = memory.sim_search(x_query, k=4)              # retrieve similar past experiences
prompt  = system + "\nSimilar past experiences:\n" + "\n".join(topk)
prompt += "\nNow solve:\n" + x_query
answer  = llm.generate(prompt)
memory.add(json.dumps({"task": x_query, "trace": answer, "flag": f}))   # close the loop (f = feedback)

3.3 Benchmark result snapshot

Gemini-2.5 Flash, single-turn average exact-match:

  • No-memory baseline: 0.54
  • ExpRAG (k=4): 0.60 (+11% relative)
  • ALFWorld step efficiency: 22.6 → 17.5 steps (−22%)

Author’s reflection: We open-sourced this one-page script first; within a week three startups told us it cut their support-bot step-count by 15–30%. Sometimes “dumb but explicit” is the best MVP.


ReMem: letting the agent edit its own memory at runtime

Core question: “How can the agent decide what to remember, compress or discard while it works?”

Summary: Add a third action—Refine—so the LLM can interleave “thinking about the task” and “thinking about its own memory”. This turns memory into a mutable object rather than static context.

4.1 Tri-action loop

  • Think – produce internal reasoning traces
  • Act – emit environment command or final answer
  • Refine – retrieve, prune, reorder, compress memory entries

The agent can loop Think↔Refine arbitrarily before committing to Act, forming a lightweight MDP that shares the same LLM backbone.
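
One way to wire this up with a single frozen LLM is to have it prefix each response with its chosen mode and parse that prefix in a small controller. The sketch below is ours: `llm`, `memory.as_context()`, and `memory.apply_edits()` are assumed helpers, and the Refine budget mirrors the 3-call cap suggested in the checklist later on.

import json

def remem_turn(llm, memory, observation, max_refines=3):
    """Loop Think/Refine until the model commits to Act; return the Act payload."""
    scratch = []                                      # Think traces accumulated this turn
    refines_left = max_refines
    while True:
        out = llm(observation, memory.as_context(), scratch)
        mode, _, body = out.partition(":")            # model emits "Think: ...", "Refine: ...", or "Act: ..."
        mode, body = mode.strip().lower(), body.strip()
        if mode == "act":
            return body                               # environment command or final answer
        if mode == "refine" and refines_left > 0:
            memory.apply_edits(json.loads(body))      # {"keep": [...], "delete": [...], "merge": [...], "add": [...]}
            refines_left -= 1
        else:
            scratch.append(body)                      # treat everything else as Think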

4.2 Walk-through of an ALFWorld episode

Goal: “put a hot apple in the fridge”
Step 1: Think → “need heat source”
Step 2: Refine → search memory for “microwave”; prune obsolete “stove” entry
Step 3: Act → go microwave
Step 4: Think → “now need to cool it”
Step 5: Refine → create new entry “hot→fridge = cooldown”
Outcome: success in 9 steps vs 19 for vanilla ReAct.

4.3 Key implementation detail

Refine mode uses the same generation budget: we simply swap the system prompt to:

You are in "memory-edit" mode. Given the current task and retrieved chunks,
output JSON: {"keep":[...], "delete":[...], "merge":[...], "add":[...]}

The JSON is parsed and applied, and the conversation continues; no extra model is needed.
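
Applying the edit is a few lines of bookkeeping. A minimal sketch, assuming memory entries live in a dict keyed by id (the field names come from the prompt above; everything else is illustrative):

import json

def apply_memory_edits(memory, raw_json):
    """Apply a {"keep", "delete", "merge", "add"} edit to a dict of id -> entry text."""
    try:
        edits = json.loads(raw_json)
    except json.JSONDecodeError:
        return memory                                  # ignore malformed edits instead of crashing
    for key in edits.get("delete", []):
        memory.pop(key, None)
    for group in edits.get("merge", []):               # e.g. ["id1", "id2"] collapses into one entry
        merged = " ; ".join(memory.pop(k) for k in list(group) if k in memory)
        if merged:
            memory["merged:" + group[0]] = merged
    for i, text in enumerate(edits.get("add", [])):
        memory[f"new:{len(memory)}:{i}"] = text
    # "keep" needs no action: anything not deleted or merged stays as-is
    return memory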


Benchmark scoreboard: single-turn reasoning & tool use

Average across AIME-24, AIME-25, GPQA-Diamond, MMLU-Pro (Econ/Eng/Philos), ToolBench

Method      Exact-Match ↑   API Acc ↑
Zero-shot   0.54            0.61
SelfRAG     0.55            0.63
ExpRAG      0.60            0.72
ReMem       0.65            0.71

ReMem leads overall; ExpRAG alone already outperforms heavier workflows such as Dynamic-Cheatsheet or AWM, suggesting that explicit task-level retrieval is still under-explored.


Benchmark scoreboard: multi-turn embodied environments

Claude-3.7-Sonnet results (Success / Progress)

Environment     History        ReAct          ExpRAG         ReMem
ALFWorld        0.50 / 0.73    0.51 / 0.75    0.74 / 0.89    0.92 / 0.96
BabyAI          0.48 / 0.66    0.57 / 0.72    0.62 / 0.72    0.73 / 0.83
PDDL            0.65 / 0.84    0.75 / 0.91    0.72 / 0.89    0.83 / 0.95
ScienceWorld    0.32 / 0.74    0.44 / 0.77    0.46 / 0.76    0.62 / 0.89

Step efficiency (median steps per solved task):
ALFWorld – History: 22.6 | ReMem: 11.5
ScienceWorld – History: 20.5 | ReMem: 14.0

Author’s reflection: When you see a 50% step reduction without fine-tuning, that’s not an algorithmic nicety—it’s a straight compute-cost win.


What happens when task order gets harder (or easier)?

Core question: “Does self-evolving memory still help if the curriculum suddenly spikes in difficulty?”

Summary: ReMem stays stable, while brittle baselines drop by up to 30 points on the easy→hard ordering. Memory refinement acts like a buffer that keeps transferable bits and discards narrow tricks.

Sequence direction   History S / P   ReMem S / P
Easy → Hard          0.41 / 0.74     0.77 / 0.92
Hard → Easy          0.49 / 0.74     0.81 / 0.94

Take-away: Curriculum designers can safely mix difficulties; the agent will prune what doesn’t generalize.


Learning from failures: noise-robustness test

Core question: “If both good and bad trajectories enter memory, does performance crash?”

Setup: Feed agent its own successes and failures, no filter.
Outcome: Baseline methods drop 10–18 points; ReMem barely moves because Refine mode actively deletes misleading entries.

Practical note: In production you rarely have perfect success labels; a self-cleaning memory is essential.


Do the gains vanish as the stream gets longer?

Cumulative success rate after 150 ALFWorld tasks:

  • History baseline: plateaus at ~0.55
  • ReMem: keeps climbing to 0.90

Same trend on BabyAI, PDDL, ScienceWorld.
Conclusion: continual reflection does not saturate over the horizon tested (≈150 tasks, 2k steps).


Action Checklist / Implementation Steps

  1. Pick your dataset → reorder it by “strategy reuse” (embedding clustering + manual ordering).
  2. Stand up a vector DB → insert each solved task as {query, answer, feedback} text.
  3. Build two prompt templates:
    • ExpRAG: prepend the top-4 similar rows as few-shot exemplars.
    • ReMem: add the “Think/Act/Refine” system prompt + JSON-based memory editor.
  4. Run the identical search–predict–evolve loop for both; log steps, successes, and tokens.
  5. Measure step efficiency first; it is the cheapest proxy for real-world API cost.
  6. Graduate to ReMem once ExpRAG plateaus; give the model a budget of 3 Refine calls per turn.
  7. Periodically dump memory → compute cluster similarity → prune the bottom 20% (a sketch follows this list).
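
A sketch of step 7, assuming memory entries are plain strings and reusing the bge embedder from the FAQ; treating "bottom 20%" as lowest similarity to the pool centroid is our reading of the step:

import numpy as np
from sentence_transformers import SentenceTransformer

def prune_memory(entries, keep_fraction=0.8):
    """Drop the entries least similar to the rest of the pool (roughly the bottom 20%)."""
    encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")
    emb = encoder.encode(entries, normalize_embeddings=True)
    centroid = emb.mean(axis=0)
    scores = emb @ centroid                              # cosine similarity to the pool centroid
    keep = np.argsort(-scores)[: int(len(entries) * keep_fraction)]
    return [entries[i] for i in sorted(keep)]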

One-page Overview

  • Storing chat logs ≠ learning.
  • Evo-Memory rewrites old benchmarks into skill streams; earlier tasks seed reusable tactics.
  • ExpRAG = vector retrieval + few-shot exemplars; zero extra training, +11% accuracy, −22% steps.
  • ReMem adds “Refine” action letting the LLM edit its own memory; climbs to 0.92 success while halving steps.
  • Gains correlate with task similarity; cluster your domain before investing in memory.
  • Both methods run on frozen LLMs—no fine-tuning, no extra GPU.

FAQ

  1. Do I need to retrain the LLM?
    No. All experiments use frozen Gemini-2.5 or Claude checkpoints; only embeddings & memory change.

  2. What embedding model should I start with?
    The paper uses BAAI/bge-base-en-v1.5; any multilingual retriever of similar size works.

  3. How big can the memory pool grow before latency explodes?
    Authors tested up to 100k entries; top-k retrieval stays under 50 ms on a single CPU.

  4. Is task reordering mandatory?
    Gains are larger with smart ordering, but ReMem still beats baselines on random streams.

  5. Can Refine mode hallucinate bad edits?
    Empirically rare—refine decisions are constrained by explicit JSON schema and similarity scores.

  6. Does this help smaller models?
    Yes. Gemma-2-9B with ReMem outperforms the Gemma-2-27B history baseline in their ablation.

  7. What if feedback signals are delayed or noisy?
    ReMem’s prune logic uses both success flag and embedding distance, so delayed labels can be back-filled later.