AgentEvolver: A Self-Evolving Agent Framework That Writes Its Own Homework, Study Notes, and Report Card
> Can a large language model train itself to use tools in a brand-new environment without human-made datasets, dense reward functions, or brute-force sampling?

Yes: AgentEvolver gives the model three "super-powers": write the questions, remember the mistakes, and grade every step. The 7 B version outscores a 14 B baseline on two public benchmarks while using 60 % fewer tokens.
1. Why Most RL Pipelines for Agents Are Too Expensive
| Pain Point | Symptom | Cost |
|---|---|---|
| No training tasks | Engineers hand-write hundreds of multi-step questions | $1–2 per label, weeks of back-and-forth |
| Sparse rewards | Only final 0/1 signal, no hint which of 30 steps helped | High gradient variance, slow convergence |
| Blind exploration | Random roll-outs repeat the same dead-ends | GPU hours burn, little learning value |
AgentEvolver replaces all three manual steps with an autonomous loop: Self-Questioning → Self-Navigating → Self-Attributing.
2. The Core Loop in One Picture
```
Sandbox (no rewards)
  ↓ Self-Questioning   ← curiosity + environment profile
Synthetic tasks + reference answers
  ↓ Self-Navigating    ← experience pool (ReMe)
Experience-guided roll-outs
  ↓ Self-Attributing   ← LLM judge scores every step
Dense rewards → GRPO update
```
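To make the data flow concrete, here is a minimal orchestration sketch in Python. Every name in it (agent.explore, agent.synthesize_tasks, agent.rollout, agent.attribute, agent.update, agent.summarize) is a hypothetical placeholder, not the framework's actual API.

```python
# Illustrative orchestration of the self-evolving loop.
# All module names and signatures below are hypothetical placeholders,
# not the actual AgentEvolver API.

def evolve(agent, sandbox, num_iterations=10):
    experience_pool = []  # ReMe-style experience store, simplified to a list
    for _ in range(num_iterations):
        # 1. Self-Questioning: explore the reward-free sandbox, synthesize tasks.
        profile = agent.explore(sandbox)               # environment profile
        tasks = agent.synthesize_tasks(profile)        # tasks + reference answers

        # 2. Self-Navigating: roll out with retrieved experiences injected.
        trajectories = agent.rollout(tasks, experience_pool)

        # 3. Self-Attributing: an LLM judge turns outcomes into step-level rewards.
        dense_rewards = agent.attribute(trajectories)

        # Policy update (GRPO) and experience-pool refresh.
        agent.update(trajectories, dense_rewards)
        experience_pool.extend(agent.summarize(trajectories))
```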
Author reflection: The first time we ran the loop overnight, the 7 B model woke up having invented 412 tasks we never taught it—like a student who finishes the textbook and writes extra chapters for fun.
3. Self-Questioning: How to Write an Exam the Model Can Already Pass
> How does an LLM create high-quality, diverse, and solvable tasks in an unknown environment?
3.1 Build an “Environment Profile” First
The agent performs high-temperature exploration for Nb steps, summarising:
| Entity | Attributes | Operations |
|---|---|---|
| traffic_light | status ∈ {red,green} | wait_and_cross |
| hospital | location, capacity | enter, query_bed |
Short summary: the profile acts as a "table of contents" so the model does not explore blindly.
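For readers who think in code, the profile can be modelled as a list of entity records. The dataclass below is an illustrative sketch of that idea, not the schema AgentEvolver actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class EntityProfile:
    """One row of the environment profile (hypothetical schema)."""
    name: str                                                 # e.g. "traffic_light"
    attributes: dict[str, str] = field(default_factory=dict)  # e.g. {"status": "red|green"}
    operations: list[str] = field(default_factory=list)       # e.g. ["wait_and_cross"]

profile = [
    EntityProfile("traffic_light", {"status": "red|green"}, ["wait_and_cross"]),
    EntityProfile("hospital", {"location": "str", "capacity": "int"}, ["enter", "query_bed"]),
]
```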
3.2 Two-Phase Exploration
- Breadth phase (Nb): visit many regions, collect diverse state-action pairs
- Depth phase (Nd): dive deep into promising sub-graphs to guarantee solvability
Code snippet (pseudo):

```python
# Breadth phase: high-temperature sampling conditioned on the environment profile
for step in range(Nb):
    action = llm.sample(profile, temperature=1.2)
    exec(action)  # execute the action in the sandbox

# Depth phase: lower temperature, conditioned on the last k observations
for step in range(Nd):
    action = llm.sample(last_k_obs, temperature=0.8)
    exec(action)
```
3.3 Task Factory Pipeline
- Trajectories → distilled action-observation pairs
- User preferences (difficulty, style) → prompt
- LLM outputs: (a) task description, (b) reference solution, (c) verification script
- Execute the reference solution in the sandbox; if it fails, discard the task
Example task created overnight:

> "Help user Alice book an appointment at the nearest hospital after 5 pm without passing any red light."
>
> Reference: [query_bed, wait_and_cross, enter]
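The pipeline above boils down to a synthesize-and-verify loop. The sketch below shows the shape of that loop under assumed interfaces (llm.generate_task, sandbox.run, and a result.success flag); none of these names come from the framework.

```python
def synthesize_tasks(llm, sandbox, trajectories, preferences, n_candidates=100):
    """Illustrative synthesize-and-verify loop; llm.generate_task and sandbox.run
    are hypothetical interfaces, not the framework's API."""
    valid_tasks = []
    for _ in range(n_candidates):
        # (a) task description, (b) reference solution, (c) verification script
        candidate = llm.generate_task(trajectories=trajectories, preferences=preferences)

        # Replay the reference solution in the sandbox; discard the task if it fails.
        result = sandbox.run(candidate["reference_solution"])
        if not result.success:
            continue

        valid_tasks.append(candidate)
    return valid_tasks
```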
Author reflection: We set the difficulty knob to “three entities + three attributes” and went for coffee; 200 valid tasks were waiting when we returned—no human touched a keyboard.
4. Self-Navigating: Turning Old Mistakes into New Shortcuts
> Once tasks exist, how can the agent avoid repeating the same failures?
4.1 Experience Format
Each record is two natural-language sentences:

- When to use: about to call an unverified delete API
- Content: call apis.api_docs.show_api_doc first to confirm delete semantics
The When part is embedded for retrieval; the Content part is injected into the prompt when similarity > threshold.
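A minimal sketch of such a store, assuming an arbitrary text-embedding function and cosine similarity against a fixed threshold; the real ReMe service adds summarisation and re-ranking on top.

```python
import numpy as np

class ExperiencePool:
    """Sketch of a ReMe-style store: embed the 'when to use' sentence,
    return the 'content' sentence when retrieval similarity exceeds a threshold."""

    def __init__(self, embed_fn, threshold=0.8):
        self.embed_fn = embed_fn      # any text -> vector function (assumption)
        self.threshold = threshold
        self.records = []             # list of (when_vector, content)

    def add(self, when_to_use: str, content: str):
        self.records.append((self.embed_fn(when_to_use), content))

    def retrieve(self, query: str, top_k=5):
        q = self.embed_fn(query)
        scored = []
        for vec, content in self.records:
            sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
            if sim > self.threshold:
                scored.append((sim, content))
        scored.sort(reverse=True)
        return [content for _, content in scored[:top_k]]
```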
4.2 Mixed Roll-out Strategy
For every batch of N trajectories:
| Type | Count | Prompt |
|---|---|---|
| Vanilla | Nv = N·(1-η) | system + query |
| Experience | Ne = N·η | system + … + query |
η = 0.5 gives the best long-term score in our grid search.
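In code, the split is a one-liner over the batch size; the prompt layout below is a simplified stand-in for whatever template the trainer actually uses.

```python
def build_rollout_prompts(system, query, experiences, n_total, eta=0.5):
    """Split a batch into vanilla and experience-injected roll-outs (illustrative sketch)."""
    n_exp = int(n_total * eta)         # Ne = N * eta
    n_vanilla = n_total - n_exp        # Nv = N * (1 - eta)

    vanilla_prompt = f"{system}\n\n{query}"
    exp_block = "\n".join(experiences)  # retrieved experience sentences
    experience_prompt = f"{system}\n\n{exp_block}\n\n{query}"

    return [vanilla_prompt] * n_vanilla + [experience_prompt] * n_exp
```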
4.3 Experience Stripping & Selective Boosting
- Stripping: remove the injected experience tokens before the gradient calculation; this prevents over-fitting to external text
- Selective boosting: raise the clipping bound ε̂_high from 0.28 to 0.6 for samples with positive advantage, letting helpful experiences write themselves into the weights (see the sketch below)
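Both tricks live at the loss level. The sketch below expresses them with token-level importance ratios, advantages, and an injected-token mask as tensors; it is a simplified stand-in, not the veRL implementation.

```python
import torch

def grpo_loss_with_stripping(ratio, advantage, injected_mask,
                             eps_low=0.2, eps_high=0.28, eps_high_boost=0.6):
    """Illustrative sketch: mask injected experience tokens out of the loss and
    widen the upper clipping bound only where the advantage is positive."""
    # Selective boosting: a larger upper bound for positive-advantage samples.
    upper = torch.where(advantage > 0,
                        torch.full_like(ratio, 1 + eps_high_boost),
                        torch.full_like(ratio, 1 + eps_high))
    clipped = torch.clamp(ratio, 1 - eps_low, upper)
    surrogate = torch.minimum(ratio * advantage, clipped * advantage)

    # Experience stripping: injected tokens contribute nothing to the gradient.
    keep = (~injected_mask).float()
    return -(surrogate * keep).sum() / keep.sum().clamp_min(1.0)
```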
Author reflection: Stripping felt counter-intuitive—like teaching with the textbook closed—but ablation drops 8 % absolute score, so the brain must learn the pattern, not the paragraph.
5. Self-Attributing: Giving Every Step Its Own Grade
> Terminal 0/1 rewards work, but how do we spot the single step that actually doomed the run?
5.1 LLM Judge Protocol
Input: full trajectory + final score
Output: per-step label GOOD (+1) or BAD (-1)
System prompt excerpt:
> If score > 0: GOOD = contributed positively, BAD = irrelevant or harmful.
> If score ≤ 0: GOOD only if the step actively fixed an error.
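In code, the judge is just one more LLM call plus a parser. The sketch below assumes a hypothetical llm.chat client and a line-per-step reply format; the actual protocol may differ.

```python
JUDGE_SYSTEM = (
    "You are grading each step of an agent trajectory.\n"
    "If the final score > 0: label a step GOOD if it contributed positively, else BAD.\n"
    "If the final score <= 0: label a step GOOD only if it actively fixed an error."
)

def judge_steps(llm, trajectory, final_score):
    """Ask an LLM judge for a GOOD/BAD label per step (illustrative sketch)."""
    user_msg = (f"Final score: {final_score}\nTrajectory:\n{trajectory}\n"
                "Return one label (GOOD or BAD) per step, one per line.")
    reply = llm.chat(system=JUDGE_SYSTEM, user=user_msg)  # hypothetical client
    return [+1 if line.strip().upper().startswith("GOOD") else -1
            for line in reply.splitlines() if line.strip()]
```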
5.2 Reward Construction
- Process reward: r_t^attr ∈ {+1, -1}
- Outcome reward: R^out ∈ {0, 1}, added at the final step
- Standardise both channels independently (z-score across the batch)
- Composite: r̂_t = α·r̂_t^attr + 1_{t=T}·r̂^out (α tuned in 0.1–0.2)
5.3 Advantage Calculation
Use undiscounted cumulative return (γ = 1) for simplicity:
A_t = Σ_{k=t}^T r̂_k
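The composite reward and the return-to-go are short enough to sketch in NumPy. For brevity, the batch-level z-scoring of the two channels is omitted here; only the composition and the γ = 1 cumulative sum are shown.

```python
import numpy as np

def composite_rewards(attr_labels, outcome, alpha=0.15):
    """Sketch of r_hat_t = alpha * r_t^attr + 1{t=T} * r^out (z-scoring omitted)."""
    r = alpha * np.asarray(attr_labels, dtype=float)  # per-step {+1, -1} attribution rewards
    r[-1] += float(outcome)                           # terminal {0, 1} outcome reward
    return r

def advantages(rewards):
    """Undiscounted return-to-go: A_t = sum_{k=t}^{T} r_hat_k (gamma = 1)."""
    return np.cumsum(rewards[::-1])[::-1]

# Example: 4 steps labelled GOOD, BAD, GOOD, GOOD and a successful outcome.
print(advantages(composite_rewards([+1, -1, +1, +1], outcome=1)))  # roughly [1.3, 1.15, 1.3, 1.15]
```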
Author reflection: Setting α too high feels like letting the judge live the student’s life—early grades sky-rocket, but the model overfits to the judge’s quirks and forgets the real exam.
6. Infrastructure: Services, Context Manager, and One-Command Launch
> How do all the pieces plug together without rewriting your RL stack?
6.1 Service-Oriented Architecture
- Environment Service (Ray actors): Gym-compatible, HTTP interface, scalable to 1 k parallel envs
- Task Manager: curriculum plug-ins, deduplication filters
- Experience Manager (ReMe): vector DB, summarisation, re-ranking
- Training Worker: veRL backend, GRPO loss, experience stripping & boosting built in
6.2 Context Manager Templates
| Template | Use-Case | Memory Growth | Edit Friendly |
|---|---|---|---|
| Basic Causal | Search RL | linear | ❌ |
| Reasoning-Augmented | Think-then-act | linear | ❌ |
| Sliding Window | Long horizon | constant | ✅ |
| Self-Context-Managing | Autonomous | self-regulated | ✅✅ |
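As a concrete example of the Sliding Window row, a context manager that keeps memory constant can be sketched in a few lines; the class below is illustrative, not the framework's template code.

```python
class SlidingWindowContext:
    """Keep only the most recent K (action, observation) turns so memory stays
    constant over long horizons (sketch of the 'Sliding Window' template)."""

    def __init__(self, system_prompt: str, window: int = 8):
        self.system_prompt = system_prompt
        self.window = window
        self.turns = []  # list of (action, observation) strings

    def append(self, action: str, observation: str):
        self.turns.append((action, observation))
        self.turns = self.turns[-self.window:]  # drop everything outside the window

    def render(self, query: str) -> str:
        history = "\n".join(f"ACTION: {a}\nOBSERVATION: {o}" for a, o in self.turns)
        return f"{self.system_prompt}\n\n{history}\n\n{query}"
```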
Author reflection: Letting the model delete its own memory felt scary—until TGC@8 jumped 12 %. Turns out forgetting is a feature, not a bug.
6.3 Launch Commands
Minimal:

```bash
conda activate agentevolver
python launcher.py --conf examples/basic.yaml --with-appworld
```

Full pipeline:

```bash
python launcher.py --conf examples/overall.yaml --with-appworld --with-reme
```
Dashboard at http://localhost:8265 (Ray dashboard).
7. Experiments: Numbers That Fit in a Tweet
> Does the framework actually move the needle?
7.1 Main Benchmarks
AppWorld & BFCL-v3, averaged over 8 roll-outs (avg@8):
| Model | Params | AppWorld avg@8 | BFCL avg@8 | Δ avg |
|---|---|---|---|---|
| Qwen2.5 | 7 B | 1.8 % | 29.8 % | — |
| +AgentEvolver | 7 B | 32.4 % | 57.9 % | +29.4 % |
| Qwen2.5 | 14 B | 18.0 % | 41.6 % | — |
| +AgentEvolver | 14 B | 48.7 % | 66.5 % | +27.8 % |
7.2 Ablations (14 B dev set)
| Variant | avg@4 | Drop vs Full |
|---|---|---|
| Full pipeline | 65.0 % | — |
| w/o Self-Questioning | 45.3 % | -19.7 % |
| w/o Self-Navigating | 56.7 % | -8.3 % |
| w/o Self-Attributing | 57.4 % | -7.6 % |
7.3 Sample Efficiency
To reach 90 % of baseline best performance:
| Benchmark | Baseline Steps | AgentEvolver Steps | Saving |
|---|---|---|---|
| AppWorld | 90 | 40 | 55 % |
| BFCL-v3 | 60 | 20 | 67 % |
Author reflection: Watching the 7 B curve pass the 14 B baseline with half the compute felt like seeing a Prius overtake a sports car—efficiency beats brute force.
8. Hands-On Walk-Through: Teaching an Agent to Book a Flight in 30 Minutes
Scenario: sandbox exposes three APIs—search_flight, book_ticket, cancel_order. No docs, no rewards.
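To make the walk-through concrete, a toy version of such a sandbox might look like the sketch below; the class and its behaviour are hypothetical and exist only to illustrate the implicit search-before-book pre-condition the agent has to discover.

```python
class FlightSandbox:
    """Toy environment exposing search_flight, book_ticket, cancel_order.
    Entirely hypothetical; real environments register with the Environment Service."""

    def __init__(self):
        self.last_search = None
        self.orders = []

    def search_flight(self, origin, destination, date):
        self.last_search = {"origin": origin, "destination": destination,
                            "date": date, "price": 199}
        return [self.last_search]

    def book_ticket(self, passenger_id):
        # Implicit pre-condition the agent must discover: search before book.
        if self.last_search is None:
            return {"error": "no search results; call search_flight first"}
        order = {"order_id": len(self.orders), "passenger_id": passenger_id,
                 **self.last_search}
        self.orders.append(order)
        return order

    def cancel_order(self, order_id):
        self.orders = [o for o in self.orders if o["order_id"] != order_id]
        return {"cancelled": order_id}
```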
1. Self-Questioning (10 min)
   - Nb = 30, Nd = 20 → 50-step exploration
   - Discovers the implicit pre-condition: must search before book
2. Task Synthesis (5 min)
   - Generates 80 tasks: "Book the cheapest round-trip next week", "Cancel the last order", …
   - The verification script filters out 9 hallucinated tasks → 71 clean tasks
3. Self-Navigating (10 min)
   - Cold-start experience pool: 4 roll-outs/task = 284 trajectories
   - Top-5 retrieval injected into 50 % of roll-outs → success rate 42 % → 68 %
4. Self-Attributing (5 min, 2 epochs)
   - LLM judge labels 1 200 steps; α = 0.15
   - The error "book without passenger_id" disappears after 1 200 steps
Result: zero-shot 12 % → 71 % task-completion; 1 800 GPU-seconds on a single A100.
9. Action Checklist / Implementation Steps
- [ ] git clone AgentEvolver and install with bash install.sh
- [ ] Pick an environment (AppWorld is pre-packed) and run its setup.sh
- [ ] (Optional) Install ReMe for the experience services
- [ ] Copy example.env → .env, add API keys and the conda path
- [ ] Quick test: python launcher.py --conf examples/basic.yaml --with-appworld
- [ ] Full pipeline: switch to overall.yaml and add --with-reme
- [ ] Watch the Ray dashboard; expect a 2–4× speed-up vs vanilla GRPO after 2 000 steps
10. One-Page Overview
AgentEvolver is an end-to-end, self-evolving framework for LLM agents. It removes the need for hand-written tasks, dense rewards, and massive random sampling by letting the model:
- Write homework (Self-Questioning) → high-diversity, solvable tasks
- Keep a diary of mistakes (Self-Navigating) → retrieve and reuse experiences
- Grade every move (Self-Attributing) → dense, step-level rewards
Built on modular micro-services (Ray + veRL), it trains a 7 B model to outperform a 14 B baseline while using 60 % fewer tokens. Installation is one script; launch is one command. The same loop works for booking flights, calling APIs, or any Gym-compatible sandbox.
11. FAQ
Q1: Do I need labelled data to start?
A: No—Self-Questioning bootstraps from an empty sandbox.
Q2: Which base models are compatible?
A: Any LLM with 8 k+ context and a causal policy head; tested on Qwen2.5-7/14 B.
Q3: How much GPU time for a quick pilot?
A: ~2 hours on a single A100 (80 GB) for 1 k tasks, 2 epochs.
Q4: Is the framework limited to AppWorld/BFCL?
A: No—any environment exposing a Gym interface can register with the Environment Service.
Q5: Does experience injection hurt exploration?
A: η = 0.5 is the sweet spot; higher values improve short-term reward but reduce long-term generalisation.
Q6: Can I turn off Self-Attributing?
A: Yes—set α = 0, but expect ~5 % drop in final score and slower convergence.
Q7: Is AgentEvolver production-ready?
A: Code is Apache-2.0; micro-service design supports Kubernetes/Slurm. Use larger models and longer curricula for production stress.
