
AgentEvolver: How a 7B LLM Outperforms 14B Models with Self-Training

AgentEvolver: A Self-Evolving Agent Framework That Writes Its Own Homework, Study Notes, and Report Card

Can a large language model train itself to use tools in a brand-new environment without human-made datasets, dense reward functions, or brute-force sampling?
Yes—AgentEvolver gives the model three “super-powers”: write the questions, remember the mistakes, and grade every step. The 7 B version outscores a 14 B baseline on two public benchmarks while using 60 % fewer tokens.


1. Why Most RL Pipelines for Agents Are Too Expensive

| Pain Point | Symptom | Cost |
| --- | --- | --- |
| No training tasks | Engineers hand-write hundreds of multi-step questions | $1–2 per label, weeks of back-and-forth |
| Sparse rewards | Only a final 0/1 signal, no hint which of 30 steps helped | High gradient variance, slow convergence |
| Blind exploration | Random roll-outs repeat the same dead-ends | GPU hours burn, little learning value |

AgentEvolver replaces all three manual steps with an autonomous loop: Self-Questioning → Self-Navigating → Self-Attributing.


2. The Core Loop in One Picture

Sandbox (no rewards)
   ↓  Self-Questioning  ← curiosity + environment profile
Synthetic tasks + reference answers
   ↓  Self-Navigating   ← experience pool (ReMe)
Experience-guided roll-outs
   ↓  Self-Attributing  ← LLM judge scores every step
Dense rewards → GRPO update

Author reflection: The first time we ran the loop overnight, the 7 B model woke up having invented 412 tasks we never taught it—like a student who finishes the textbook and writes extra chapters for fun.


3. Self-Questioning: How to Write an Exam the Model Can Already Pass

How does an LLM create high-quality, diverse, and solvable tasks in an unknown environment?

3.1 Build an “Environment Profile” First

The agent performs high-temperature exploration for Nb steps, summarising:

| Entity | Attributes | Operations |
| --- | --- | --- |
| traffic_light | status ∈ {red, green} | wait_and_cross |
| hospital | location, capacity | enter, query_bed |

Short summary: The profile acts as a “table of contents” so the model does not walk blindly.
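
For concreteness, here is a minimal sketch of what such a profile could look like in memory; the field names are illustrative assumptions, not AgentEvolver's actual schema:

# Hypothetical environment profile distilled from exploration (names are illustrative).
environment_profile = {
    "traffic_light": {
        "attributes": {"status": ["red", "green"]},
        "operations": ["wait_and_cross"],
    },
    "hospital": {
        "attributes": {"location": None, "capacity": None},  # values filled in as they are observed
        "operations": ["enter", "query_bed"],
    },
}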

3.2 Two-Phase Exploration

• Breadth phase (Nb): visit many regions, collect diverse state-action pairs
• Depth phase (Nd): dive deep into promising sub-graphs to guarantee solvability

Code snippet (pseudo):

# Breadth phase: high-temperature sampling guided by the environment profile
for step in range(Nb):
    action = llm.sample(profile, temperature=1.2)
    observation = env.step(action)          # env = the reward-free sandbox
# Depth phase: lower temperature, conditioned on the last k observations
for step in range(Nd):
    action = llm.sample(last_k_obs, temperature=0.8)
    observation = env.step(action)

3.3 Task Factory Pipeline

  1. Trajectories → distilled action-observation pairs
  2. User preferences (difficulty, style) → prompt
  3. LLM outputs: (a) task description, (b) reference solution, (c) verification script
  4. Execute reference in sandbox—fail ⇒ discard

Example task created overnight:

“Help user Alice book an appointment at the nearest hospital after 5 pm without passing any red light.”
Reference: [query_bed, wait_and_cross, enter]
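
For readers who want the pipeline in code form, here is a compressed sketch of the four stages; llm, sandbox, and the draft keys are illustrative stand-ins, not the framework's real API:

def task_factory(trajectories, preferences, llm, sandbox, num_tasks=100):
    """Illustrative Self-Questioning task factory; not the official implementation."""
    # 1. Distil trajectories into compact action-observation pairs.
    pairs = [(step["action"], step["observation"])
             for traj in trajectories for step in traj]
    tasks = []
    for _ in range(num_tasks):
        # 2-3. Prompt the LLM with the pairs and user preferences; expect a task
        # description, a reference solution, and a verification script.
        draft = llm(pairs=pairs, preferences=preferences)
        # 4. Keep the task only if its reference solution passes in the sandbox.
        if sandbox(actions=draft["reference"], verifier=draft["verifier"]):
            tasks.append(draft)
    return tasks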

Author reflection: We set the difficulty knob to “three entities + three attributes” and went for coffee; 200 valid tasks were waiting when we returned—no human touched a keyboard.


4. Self-Navigating: Turning Old Mistakes into New Shortcuts

Once tasks exist, how can the agent avoid repeating the same failures?

4.1 Experience Format

Each record is two natural-language sentences:

When to use: about to call an unverified delete API
Content: call apis.api_docs.show_api_doc first to confirm delete semantics

The When part is embedded for retrieval; the Content part is injected into the prompt when similarity > threshold.
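
A small sketch of that retrieval step, assuming an embed function that maps text to a vector; this mirrors the described behaviour rather than ReMe's actual interface:

import numpy as np

EXPERIENCES = [
    {"when": "about to call an unverified delete API",
     "content": "call apis.api_docs.show_api_doc first to confirm delete semantics"},
]

def retrieve(current_state, embed, threshold=0.8):
    """Return experience contents whose 'When' clause matches the current state."""
    q = np.asarray(embed(current_state), dtype=float)
    hits = []
    for exp in EXPERIENCES:
        e = np.asarray(embed(exp["when"]), dtype=float)
        similarity = float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-8))
        if similarity > threshold:   # inject only when the match clears the threshold
            hits.append(exp["content"])
    return hits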

4.2 Mixed Roll-out Strategy

For every batch of N trajectories:

| Type | Count | Prompt |
| --- | --- | --- |
| Vanilla | Nv = N·(1-η) | system + query |
| Experience | Ne = N·η | system + … + query |

η = 0.5 gives the best long-term score in our grid search.
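
A sketch of the η-split, assuming a one-argument retrieve(query) helper like the one above and plain string prompts (the real prompt templates may differ):

def build_rollout_prompts(queries, system_prompt, retrieve, eta=0.5):
    """Split a batch of N queries into N·η experience roll-outs and N·(1-η) vanilla ones."""
    n_experience = int(len(queries) * eta)
    prompts = []
    for i, query in enumerate(queries):
        if i < n_experience:
            # Experience roll-out: system + retrieved experience + query
            experience = "\n".join(retrieve(query))
            prompts.append(f"{system_prompt}\n{experience}\n{query}")
        else:
            # Vanilla roll-out: system + query
            prompts.append(f"{system_prompt}\n{query}")
    return prompts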

4.3 Experience Stripping & Selective Boosting

• Stripping: remove injected tokens before gradient calculation → prevents over-fitting to external text
• Selective boosting: raise clipping bound ε̂_high from 0.28 → 0.6 for samples with positive advantage, letting helpful experiences write themselves into the weights
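
The sketch below shows both tricks inside a clipped policy-gradient objective; injected_mask marks tokens copied from retrieved experience, and everything except the 0.28 → 0.6 bound is an assumption rather than the exact veRL/GRPO implementation:

import torch

def clipped_objective_with_stripping(ratio, advantage, injected_mask,
                                     eps_low=0.2, eps_high=0.28, eps_high_boost=0.6):
    """Simplified clipped objective with experience stripping and selective boosting.
    injected_mask: bool tensor flagging tokens that came from injected experience."""
    # Selective boosting: widen the upper clip bound only where the advantage is positive.
    upper = torch.where(advantage > 0,
                        torch.full_like(ratio, 1.0 + eps_high_boost),
                        torch.full_like(ratio, 1.0 + eps_high))
    clipped = torch.minimum(torch.clamp(ratio, min=1.0 - eps_low), upper)
    per_token = torch.minimum(ratio * advantage, clipped * advantage)
    # Experience stripping: injected tokens contribute nothing to the gradient.
    keep = (~injected_mask).float()
    return -(per_token * keep).sum() / keep.sum().clamp(min=1.0)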

Author reflection: Stripping felt counter-intuitive—like teaching with the textbook closed—but ablation drops 8 % absolute score, so the brain must learn the pattern, not the paragraph.


5. Self-Attributing: Giving Every Step Its Own Grade

Terminal 0/1 rewards work, but how do we spot the single step that actually doomed the run?

5.1 LLM Judge Protocol

Input: full trajectory + final score
Output: per-step label GOOD (+1) or BAD (-1)

System prompt excerpt:

If score > 0: GOOD = contributed positively, BAD = irrelevant or harmful.
If score ≤ 0: GOOD only if step actively fixed an error.
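
A minimal sketch of the judging call, assuming the judge returns one label per step (the real prompt and output format may differ):

def judge_steps(trajectory, final_score, llm):
    """Illustrative LLM-judge call for Self-Attributing."""
    system = (
        "Label each step GOOD or BAD.\n"
        "If score > 0: GOOD = contributed positively, BAD = irrelevant or harmful.\n"
        "If score <= 0: GOOD only if the step actively fixed an error."
    )
    labels = llm(system=system, user={"trajectory": trajectory, "final_score": final_score})
    # Assumed output: one label per step, e.g. ["GOOD", "BAD", ...]
    return [1 if label == "GOOD" else -1 for label in labels]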

5.2 Reward Construction

  1. Process reward: rt^attr ∈ {+1, -1}
  2. Outcome reward: R^out ∈ {0, 1} at final step
  3. Standardise both channels independently (z-score across batch)
  4. Composite: r̂_t = α·r̂_t^attr + 1_{t=T}·r̂^out (α tuned 0.1–0.2)

5.3 Advantage Calculation

Use undiscounted cumulative return (γ = 1) for simplicity:
A_t = Σ_{k=t}^T r̂_k
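
Putting 5.2 and 5.3 together, here is a single-trajectory sketch (batch-level z-scoring of the outcome channel is omitted for brevity; α = 0.15 is one example within the stated 0.1–0.2 range):

import numpy as np

def dense_rewards_and_advantages(step_labels, outcome, alpha=0.15):
    """step_labels: per-step attribution labels in {+1, -1}; outcome: terminal 0/1 reward."""
    attr = np.asarray(step_labels, dtype=float)
    # Standardise the attribution channel (the full pipeline z-scores across the batch).
    attr_z = (attr - attr.mean()) / (attr.std() + 1e-8)
    # Composite reward: alpha * attribution at every step, outcome added at the final step.
    r = alpha * attr_z
    r[-1] += float(outcome)
    # Undiscounted reward-to-go (gamma = 1): A_t = sum of r_k for k >= t.
    advantages = np.cumsum(r[::-1])[::-1]
    return r, advantages

# Example: three steps labelled GOOD, BAD, GOOD on a successful run
# rewards, advantages = dense_rewards_and_advantages([1, -1, 1], outcome=1)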

Author reflection: Setting α too high feels like letting the judge live the student’s life—early grades sky-rocket, but the model overfits to the judge’s quirks and forgets the real exam.


6. Infrastructure: Services, Context Manager, and One-Command Launch

How do all pieces plug together without rewriting your RL stack?

6.1 Service-Oriented Architecture

• Environment Service (Ray actors): Gym-compatible, HTTP interface, scalable to 1 k parallel envs
• Task Manager: curriculum plug-ins, deduplication filters
• Experience Manager (ReMe): vector DB, summarisation, re-ranking
• Training Worker: veRL backend, GRPO loss, experience stripping & boosting built-in
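
Section 8's flight sandbox is a convenient example of the Gym-compatible interface the Environment Service consumes; the toy class below follows the Gymnasium API and is a sketch, not AgentEvolver's actual registration code:

import gymnasium as gym

class FlightSandbox(gym.Env):
    """Toy text environment whose actions are tool-call strings (illustrative only)."""
    def __init__(self):
        self.action_space = gym.spaces.Text(max_length=256)
        self.observation_space = gym.spaces.Text(max_length=2048)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return "Tools: search_flight, book_ticket, cancel_order", {}

    def step(self, action: str):
        observation = f"executed: {action}"         # a real sandbox would call the tool here
        terminated = action.startswith("book_ticket")
        return observation, 0.0, terminated, False, {}   # reward stays 0: the sandbox has no rewards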

6.2 Context Manager Templates

| Template | Use-Case | Memory Growth | Edit Friendly |
| --- | --- | --- | --- |
| Basic Causal | Search RL | linear | |
| Reasoning-Augmented | Think-then-act | linear | |
| Sliding Window | Long horizon | constant | |
| Self-Context-Managing | Autonomous | self-regulated | ✅✅ |
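
As one concrete example, the Sliding Window template fits in a few lines; the window of 8 turns below is an arbitrary choice for illustration:

from collections import deque

class SlidingWindowContext:
    """Keep only the last `window` turns so the prompt stays constant-size on long horizons."""
    def __init__(self, system_prompt, window=8):
        self.system_prompt = system_prompt
        self.turns = deque(maxlen=window)    # older turns are dropped automatically

    def add(self, role, text):
        self.turns.append(f"{role}: {text}")

    def render(self):
        return "\n".join([self.system_prompt, *self.turns])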

Author reflection: Letting the model delete its own memory felt scary—until TGC@8 jumped 12 %. Turns out forgetting is a feature, not a bug.

6.3 Launch Commands

Minimal:

conda activate agentevolver
python launcher.py --conf examples/basic.yaml --with-appworld

Full pipeline:

python launcher.py --conf examples/overall.yaml --with-appworld --with-reme

Dashboard at http://localhost:8265 (Ray dashboard).


7. Experiments: Numbers That Fit in a Tweet

Does the framework actually move the needle?

7.1 Main Benchmarks

AppWorld & BFCL-v3, averaged over 8 roll-outs:

| Model | Params | AppWorld avg@8 | BFCL avg@8 | Δ avg |
| --- | --- | --- | --- | --- |
| Qwen2.5 | 7 B | 1.8 % | 29.8 % | |
| + AgentEvolver | 7 B | 32.4 % | 57.9 % | +29.4 % |
| Qwen2.5 | 14 B | 18.0 % | 41.6 % | |
| + AgentEvolver | 14 B | 48.7 % | 66.5 % | +27.8 % |

7.2 Ablations (14 B dev set)

| Variant | avg@4 | Drop vs Full |
| --- | --- | --- |
| Full pipeline | 65.0 % | |
| w/o Self-Questioning | 45.3 % | -19.7 % |
| w/o Self-Navigating | 56.7 % | -8.3 % |
| w/o Self-Attributing | 57.4 % | -7.6 % |

7.3 Sample Efficiency

To reach 90 % of baseline best performance:

| Benchmark | Baseline Steps | AgentEvolver Steps | Saving |
| --- | --- | --- | --- |
| AppWorld | 90 | 40 | 55 % |
| BFCL-v3 | 60 | 20 | 67 % |

Author reflection: Watching the 7 B curve pass the 14 B baseline with half the compute felt like seeing a Prius overtake a sports car—efficiency beats brute force.


8. Hands-On Walk-Through: Teaching an Agent to Book a Flight in 30 Minutes

Scenario: sandbox exposes three APIs—search_flight, book_ticket, cancel_order. No docs, no rewards.

  1. Self-Questioning (10 min)
     • Nb = 30, Nd = 20 → 50-step exploration
     • Discovers the implicit pre-condition: must search before book
  2. Task Synthesis (5 min)
     • Generates 80 tasks: “Book cheapest round-trip next week”, “Cancel last order”, …
     • Verification script filters out 9 hallucinated tasks → 71 clean
  3. Self-Navigating (10 min)
     • Cold-start experience pool: 4 roll-outs/task = 284 trajectories
     • Top-5 retrieval injected into 50 % of roll-outs → success rate 42 % → 68 %
  4. Self-Attributing (5 min, 2 epochs)
     • LLM judge labels 1 200 steps; α = 0.15
     • The error “book without passenger_id” disappears after 1 200 steps

Result: zero-shot 12 % → 71 % task-completion; 1 800 GPU-seconds on a single A100.


9. Action Checklist / Implementation Steps

• [ ] git clone AgentEvolver & install with bash install.sh
• [ ] Pick an environment (AppWorld comes pre-packed) and run its setup.sh
• [ ] (Optional) Install ReMe for experience services
• [ ] Copy example.env to .env, add API keys & conda path
• [ ] Quick test: python launcher.py --conf examples/basic.yaml --with-appworld
• [ ] Full pipeline: switch to overall.yaml and add --with-reme
• [ ] Watch the Ray dashboard; expect a 2–4× speed-up vs vanilla GRPO after 2 000 steps

10. One-Page Overview

AgentEvolver is an end-to-end, self-evolving framework for LLM agents. It removes the need for hand-written tasks, dense rewards, and massive random sampling by letting the model:

  1. Write homework (Self-Questioning) → high-diversity, solvable tasks
  2. Keep a diary of mistakes (Self-Navigating) → retrieve and reuse experiences
  3. Grade every move (Self-Attributing) → dense, step-level rewards

Built on modular micro-services (Ray + veRL), it trains a 7 B model to outperform a 14 B baseline while using 60 % fewer tokens. Installation is one script; launch is one command. The same loop works for booking flights, calling APIs, or any Gym-compatible sandbox.


11. FAQ

Q1: Do I need labelled data to start?
A: No—Self-Questioning bootstraps from an empty sandbox.

Q2: Which base models are compatible?
A: Any LLM with 8 k+ context and a causal policy head; tested on Qwen2.5-7/14 B.

Q3: How much GPU time for a quick pilot?
A: ~2 hours on a single A100 (80 GB) for 1 k tasks, 2 epochs.

Q4: Is the framework limited to AppWorld/BFCL?
A: No—any environment exposing a Gym interface can register with the Environment Service.

Q5: Does experience injection hurt exploration?
A: η = 0.5 is the sweet spot; higher values improve short-term reward but reduce long-term generalisation.

Q6: Can I turn off Self-Attributing?
A: Yes—set α = 0, but expect ~5 % drop in final score and slower convergence.

Q7: Is AgentEvolver production-ready?
A: Code is Apache-2.0; micro-service design supports Kubernetes/Slurm. Use larger models and longer curricula for production stress.
