
AgentEvolver: How a 7B LLM Outperforms 14B Models with Self-Training

AgentEvolver: A Self-Evolving Agent Framework That Writes Its Own Homework, Study Notes, and Report Card

Can a large language model train itself to use tools in a brand-new environment without human-made datasets, dense reward functions, or brute-force sampling?
Yes—AgentEvolver gives the model three “super-powers”: write the questions, remember the mistakes, and grade every step. The 7 B version outscores a 14 B baseline on two public benchmarks while using 60 % fewer tokens.


1. Why Most RL Pipelines for Agents Are Too Expensive

| Pain Point | Symptom | Cost |
| --- | --- | --- |
| No training tasks | Engineers hand-write hundreds of multi-step questions | $1–2 per label, weeks of back-and-forth |
| Sparse rewards | Only a final 0/1 signal, no hint which of 30 steps helped | High gradient variance, slow convergence |
| Blind exploration | Random roll-outs repeat the same dead-ends | GPU hours burn, little learning value |

AgentEvolver replaces all three manual steps with an autonomous loop: Self-Questioning → Self-Navigating → Self-Attributing.


2. The Core Loop in One Picture

Sandbox (no rewards)
   ↓  Self-Questioning  ← curiosity + environment profile
Synthetic tasks + reference answers
   ↓  Self-Navigating   ← experience pool (ReMe)
Experience-guided roll-outs
   ↓  Self-Attributing  ← LLM judge scores every step
Dense rewards → GRPO update

Author reflection: The first time we ran the loop overnight, the 7 B model woke up having invented 412 tasks we never taught it—like a student who finishes the textbook and writes extra chapters for fun.


3. Self-Questioning: How to Write an Exam the Model Can Already Pass

How does an LLM create high-quality, diverse, and solvable tasks in an unknown environment?

3.1 Build an “Environment Profile” First

The agent performs high-temperature exploration for Nb steps, summarising:

| Entity | Attributes | Operations |
| --- | --- | --- |
| traffic_light | status ∈ {red, green} | wait_and_cross |
| hospital | location, capacity | enter, query_bed |

Short summary: The profile acts as a “table of contents” so the model does not walk blindly.
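
For concreteness, here is a minimal sketch of what such a profile could look like in memory; the field names are illustrative assumptions, not AgentEvolver's actual schema:

# Hypothetical environment profile distilled from exploration (names are illustrative).
environment_profile = {
    "traffic_light": {
        "attributes": {"status": ["red", "green"]},
        "operations": ["wait_and_cross"],
    },
    "hospital": {
        "attributes": {"location": None, "capacity": None},  # values filled in as they are observed
        "operations": ["enter", "query_bed"],
    },
}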

3.2 Two-Phase Exploration

• Breadth phase (Nb): visit many regions, collect diverse state-action pairs
• Depth phase (Nd): dive deep into promising sub-graphs to guarantee solvability

Code snippet (pseudo):

# Breadth phase: high-temperature sampling guided by the environment profile
for step in range(Nb):
    action = llm.sample(profile, temperature=1.2)
    observation = env.step(action)          # env = the reward-free sandbox
# Depth phase: lower temperature, conditioned on the last k observations
for step in range(Nd):
    action = llm.sample(last_k_obs, temperature=0.8)
    observation = env.step(action)

3.3 Task Factory Pipeline

  1. Trajectories → distilled action-observation pairs
  2. User preferences (difficulty, style) → prompt
  3. LLM outputs: (a) task description, (b) reference solution, (c) verification script
  4. Execute reference in sandbox—fail ⇒ discard

Example task created overnight:

“Help user Alice book an appointment at the nearest hospital after 5 pm without passing any red light.”
Reference: [query_bed, wait_and_cross, enter]
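
For readers who want the pipeline in code form, here is a compressed sketch of the four stages; llm, sandbox, and the draft keys are illustrative stand-ins, not the framework's real API:

def task_factory(trajectories, preferences, llm, sandbox, num_tasks=100):
    """Illustrative Self-Questioning task factory; not the official implementation."""
    # 1. Distil trajectories into compact action-observation pairs.
    pairs = [(step["action"], step["observation"])
             for traj in trajectories for step in traj]
    tasks = []
    for _ in range(num_tasks):
        # 2-3. Prompt the LLM with the pairs and user preferences; expect a task
        # description, a reference solution, and a verification script.
        draft = llm(pairs=pairs, preferences=preferences)
        # 4. Keep the task only if its reference solution passes in the sandbox.
        if sandbox(actions=draft["reference"], verifier=draft["verifier"]):
            tasks.append(draft)
    return tasks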

Author reflection: We set the difficulty knob to “three entities + three attributes” and went for coffee; 200 valid tasks were waiting when we returned—no human touched a keyboard.


4. Self-Navigating: Turning Old Mistakes into New Shortcuts

Once tasks exist, how can the agent avoid repeating the same failures?

4.1 Experience Format

Each record is two natural-language sentences:

When to use: about to call an unverified delete API
Content: call apis.api_docs.show_api_doc first to confirm delete semantics

The When part is embedded for retrieval; the Content part is injected into the prompt when similarity > threshold.
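
A small sketch of that retrieval step, assuming an embed function that maps text to a vector; this mirrors the described behaviour rather than ReMe's actual interface:

import numpy as np

EXPERIENCES = [
    {"when": "about to call an unverified delete API",
     "content": "call apis.api_docs.show_api_doc first to confirm delete semantics"},
]

def retrieve(current_state, embed, threshold=0.8):
    """Return experience contents whose 'When' clause matches the current state."""
    q = np.asarray(embed(current_state), dtype=float)
    hits = []
    for exp in EXPERIENCES:
        e = np.asarray(embed(exp["when"]), dtype=float)
        similarity = float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-8))
        if similarity > threshold:   # inject only when the match clears the threshold
            hits.append(exp["content"])
    return hits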

4.2 Mixed Roll-out Strategy

For every batch of N trajectories:

| Type | Count | Prompt |
| --- | --- | --- |
| Vanilla | Nv = N·(1-η) | system + query |
| Experience | Ne = N·η | system + … + query |

η = 0.5 gives the best long-term score in our grid search.
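
A sketch of the η-split, assuming a one-argument retrieve(query) helper like the one above and plain string prompts (the real prompt templates may differ):

def build_rollout_prompts(queries, system_prompt, retrieve, eta=0.5):
    """Split a batch of N queries into N·η experience roll-outs and N·(1-η) vanilla ones."""
    n_experience = int(len(queries) * eta)
    prompts = []
    for i, query in enumerate(queries):
        if i < n_experience:
            # Experience roll-out: system + retrieved experience + query
            experience = "\n".join(retrieve(query))
            prompts.append(f"{system_prompt}\n{experience}\n{query}")
        else:
            # Vanilla roll-out: system + query
            prompts.append(f"{system_prompt}\n{query}")
    return prompts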

4.3 Experience Stripping & Selective Boosting

• Stripping: remove injected tokens before gradient calculation → prevents over-fitting to external text
• Selective boosting: raise clipping bound ε̂_high from 0.28 → 0.6 for samples with positive advantage, letting helpful experiences write themselves into the weights
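
The sketch below shows both tricks inside a clipped policy-gradient objective; injected_mask marks tokens copied from retrieved experience, and everything except the 0.28 → 0.6 bound is an assumption rather than the exact veRL/GRPO implementation:

import torch

def clipped_objective_with_stripping(ratio, advantage, injected_mask,
                                     eps_low=0.2, eps_high=0.28, eps_high_boost=0.6):
    """Simplified clipped objective with experience stripping and selective boosting.
    injected_mask: bool tensor flagging tokens that came from injected experience."""
    # Selective boosting: widen the upper clip bound only where the advantage is positive.
    upper = torch.where(advantage > 0,
                        torch.full_like(ratio, 1.0 + eps_high_boost),
                        torch.full_like(ratio, 1.0 + eps_high))
    clipped = torch.minimum(torch.clamp(ratio, min=1.0 - eps_low), upper)
    per_token = torch.minimum(ratio * advantage, clipped * advantage)
    # Experience stripping: injected tokens contribute nothing to the gradient.
    keep = (~injected_mask).float()
    return -(per_token * keep).sum() / keep.sum().clamp(min=1.0)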

Author reflection: Stripping felt counter-intuitive—like teaching with the textbook closed—but ablation drops 8 % absolute score, so the brain must learn the pattern, not the paragraph.


5. Self-Attributing: Giving Every Step Its Own Grade

Terminal 0/1 rewards work, but how do we spot the single step that actually doomed the run?

5.1 LLM Judge Protocol

Input: full trajectory + final score
Output: per-step label GOOD (+1) or BAD (-1)

System prompt excerpt:

If score > 0: GOOD = contributed positively, BAD = irrelevant or harmful.
If score ≤ 0: GOOD only if step actively fixed an error.
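
A minimal sketch of the judging call, assuming the judge returns one label per step (the real prompt and output format may differ):

def judge_steps(trajectory, final_score, llm):
    """Illustrative LLM-judge call for Self-Attributing."""
    system = (
        "Label each step GOOD or BAD.\n"
        "If score > 0: GOOD = contributed positively, BAD = irrelevant or harmful.\n"
        "If score <= 0: GOOD only if the step actively fixed an error."
    )
    labels = llm(system=system, user={"trajectory": trajectory, "final_score": final_score})
    # Assumed output: one label per step, e.g. ["GOOD", "BAD", ...]
    return [1 if label == "GOOD" else -1 for label in labels]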

5.2 Reward Construction

  1. Process reward: rt^attr ∈ {+1, -1}
  2. Outcome reward: R^out ∈ {0, 1} at final step
  3. Standardise both channels independently (z-score across batch)
  4. Composite: r̂_t = α·r̂_t^attr + 1_{t=T}·r̂^out (α tuned 0.1–0.2)

5.3 Advantage Calculation

Use undiscounted cumulative return (γ = 1) for simplicity:
A_t = Σ_{k=t}^T r̂_k
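
Putting 5.2 and 5.3 together, here is a single-trajectory sketch (batch-level z-scoring of the outcome channel is omitted for brevity; α = 0.15 is one example within the stated 0.1–0.2 range):

import numpy as np

def dense_rewards_and_advantages(step_labels, outcome, alpha=0.15):
    """step_labels: per-step attribution labels in {+1, -1}; outcome: terminal 0/1 reward."""
    attr = np.asarray(step_labels, dtype=float)
    # Standardise the attribution channel (the full pipeline z-scores across the batch).
    attr_z = (attr - attr.mean()) / (attr.std() + 1e-8)
    # Composite reward: alpha * attribution at every step, outcome added at the final step.
    r = alpha * attr_z
    r[-1] += float(outcome)
    # Undiscounted reward-to-go (gamma = 1): A_t = sum of r_k for k >= t.
    advantages = np.cumsum(r[::-1])[::-1]
    return r, advantages

# Example: three steps labelled GOOD, BAD, GOOD on a successful run
# rewards, advantages = dense_rewards_and_advantages([1, -1, 1], outcome=1)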

Author reflection: Setting α too high feels like letting the judge live the student’s life—early grades sky-rocket, but the model overfits to the judge’s quirks and forgets the real exam.


6. Infrastructure: Services, Context Manager, and One-Command Launch

How do all pieces plug together without rewriting your RL stack?

6.1 Service-Oriented Architecture

• Environment Service (Ray actors): Gym-compatible, HTTP interface, scalable to 1 k parallel envs
• Task Manager: curriculum plug-ins, deduplication filters
• Experience Manager (ReMe): vector DB, summarisation, re-ranking
• Training Worker: veRL backend, GRPO loss, experience stripping & boosting built-in
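
Section 8's flight sandbox is a convenient example of the Gym-compatible interface the Environment Service consumes; the toy class below follows the Gymnasium API and is a sketch, not AgentEvolver's actual registration code:

import gymnasium as gym

class FlightSandbox(gym.Env):
    """Toy text environment whose actions are tool-call strings (illustrative only)."""
    def __init__(self):
        self.action_space = gym.spaces.Text(max_length=256)
        self.observation_space = gym.spaces.Text(max_length=2048)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return "Tools: search_flight, book_ticket, cancel_order", {}

    def step(self, action: str):
        observation = f"executed: {action}"         # a real sandbox would call the tool here
        terminated = action.startswith("book_ticket")
        return observation, 0.0, terminated, False, {}   # reward stays 0: the sandbox has no rewards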

6.2 Context Manager Templates

| Template | Use-Case | Memory Growth | Edit Friendly |
| --- | --- | --- | --- |
| Basic Causal | Search RL | linear | |
| Reasoning-Augmented | Think-then-act | linear | |
| Sliding Window | Long horizon | constant | |
| Self-Context-Managing | Autonomous | self-regulated | ✅✅ |
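
As one concrete example, the Sliding Window template fits in a few lines; the window of 8 turns below is an arbitrary choice for illustration:

from collections import deque

class SlidingWindowContext:
    """Keep only the last `window` turns so the prompt stays constant-size on long horizons."""
    def __init__(self, system_prompt, window=8):
        self.system_prompt = system_prompt
        self.turns = deque(maxlen=window)    # older turns are dropped automatically

    def add(self, role, text):
        self.turns.append(f"{role}: {text}")

    def render(self):
        return "\n".join([self.system_prompt, *self.turns])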

Author reflection: Letting the model delete its own memory felt scary—until TGC@8 jumped 12 %. Turns out forgetting is a feature, not a bug.

6.3 Launch Commands

Minimal:

conda activate agentevolver
python launcher.py --conf examples/basic.yaml --with-appworld

Full pipeline:

python launcher.py --conf examples/overall.yaml --with-appworld --with-reme

Dashboard at http://localhost:8265 (Ray dashboard).


7. Experiments: Numbers That Fit in a Tweet

Does the framework actually move the needle?

7.1 Main Benchmarks

AppWorld & BFCL-v3, averaged over 8 roll-outs:

| Model | Params | AppWorld avg@8 | BFCL avg@8 | Δ avg |
| --- | --- | --- | --- | --- |
| Qwen2.5 | 7 B | 1.8 % | 29.8 % | |
| + AgentEvolver | 7 B | 32.4 % | 57.9 % | +29.4 % |
| Qwen2.5 | 14 B | 18.0 % | 41.6 % | |
| + AgentEvolver | 14 B | 48.7 % | 66.5 % | +27.8 % |

7.2 Ablations (14 B dev set)

| Variant | avg@4 | Drop vs Full |
| --- | --- | --- |
| Full pipeline | 65.0 % | |
| w/o Self-Questioning | 45.3 % | -19.7 % |
| w/o Self-Navigating | 56.7 % | -8.3 % |
| w/o Self-Attributing | 57.4 % | -7.6 % |

7.3 Sample Efficiency

To reach 90 % of baseline best performance:

| Benchmark | Baseline Steps | AgentEvolver Steps | Saving |
| --- | --- | --- | --- |
| AppWorld | 90 | 40 | 55 % |
| BFCL-v3 | 60 | 20 | 67 % |

Author reflection: Watching the 7 B curve pass the 14 B baseline with half the compute felt like seeing a Prius overtake a sports car—efficiency beats brute force.


8. Hands-On Walk-Through: Teaching an Agent to Book a Flight in 30 Minutes

Scenario: sandbox exposes three APIs—search_flight, book_ticket, cancel_order. No docs, no rewards.

  1. Self-Questioning (10 min)
     • Nb = 30, Nd = 20 → 50-step exploration
     • Discovers the implicit pre-condition: must search before book
  2. Task Synthesis (5 min)
     • Generates 80 tasks: “Book cheapest round-trip next week”, “Cancel last order”, …
     • Verification script filters out 9 hallucinated tasks → 71 clean
  3. Self-Navigating (10 min)
     • Cold-start experience pool: 4 roll-outs/task = 284 trajectories
     • Top-5 retrieval injected into 50 % of roll-outs → success rate 42 % → 68 %
  4. Self-Attributing (5 min, 2 epochs)
     • LLM judge labels 1 200 steps; α = 0.15
     • The error “book without passenger_id” disappears after 1 200 steps

Result: zero-shot 12 % → 71 % task-completion; 1 800 GPU-seconds on a single A100.


9. Action Checklist / Implementation Steps

• [ ] git clone AgentEvolver & install with bash install.sh
• [ ] Pick an environment (AppWorld comes pre-packed) and run its setup.sh
• [ ] (Optional) Install ReMe for experience services
• [ ] Copy example.env to .env, add API keys & conda path
• [ ] Quick test: python launcher.py --conf examples/basic.yaml --with-appworld
• [ ] Full pipeline: switch to overall.yaml and add --with-reme
• [ ] Watch the Ray dashboard; expect a 2–4× speed-up vs vanilla GRPO after 2 000 steps

10. One-Page Overview

AgentEvolver is an end-to-end, self-evolving framework for LLM agents. It removes the need for hand-written tasks, dense rewards, and massive random sampling by letting the model:

  1. Write homework (Self-Questioning) → high-diversity, solvable tasks
  2. Keep a diary of mistakes (Self-Navigating) → retrieve and reuse experiences
  3. Grade every move (Self-Attributing) → dense, step-level rewards

Built on modular micro-services (Ray + veRL), it trains a 7 B model to outperform a 14 B baseline while using 60 % fewer tokens. Installation is one script; launch is one command. The same loop works for booking flights, calling APIs, or any Gym-compatible sandbox.


11. FAQ

Q1: Do I need labelled data to start?
A: No—Self-Questioning bootstraps from an empty sandbox.

Q2: Which base models are compatible?
A: Any LLM with 8 k+ context and a causal policy head; tested on Qwen2.5-7/14 B.

Q3: How much GPU time for a quick pilot?
A: ~2 hours on a single A100 (80 GB) for 1 k tasks, 2 epochs.

Q4: Is the framework limited to AppWorld/BFCL?
A: No—any environment exposing a Gym interface can register with the Environment Service.

Q5: Does experience injection hurt exploration?
A: η = 0.5 is the sweet spot; higher values improve short-term reward but reduce long-term generalisation.

Q6: Can I turn off Self-Attributing?
A: Yes—set α = 0, but expect ~5 % drop in final score and slower convergence.

Q7: Is AgentEvolver production-ready?
A: Code is Apache-2.0; micro-service design supports Kubernetes/Slurm. Use larger models and longer curricula for production stress.
