Train Multi-Step Agents for Real-World Tasks with ART

An end-to-end guide for developers who hate writing reward functions


Reader profile: You already know Python, have played with an LLM API, and now want the model to do something useful across many steps—play 2048, solve Temporal Clue, retrieve the right e-mail—without spending nights hand-crafting a reward function.
This article explains exactly how the open-source Agent Reinforcement Trainer (ART) does that for you.


1. What problem does ART solve?

| Pain point | How ART fixes it |
| --- | --- |
| Writing a reward function is tedious and error-prone | RULER auto-scores trajectories with another LLM |
| GRPO training code is long and fragile | ART wraps it into two Docker-ready services |
| You need a GPU but only have a laptop | ART spins up an ephemeral cloud GPU when you call the client |
| Debugging RL is opaque | Native integrations with W&B, Langfuse, OpenPipe |

2. RULER: Zero-shot scoring without hand-crafted rewards

Instead of `if won: reward += 100`, describe the task in plain English inside the system prompt.
RULER then:

  1. Takes the entire conversation history (system / user / assistant messages).
  2. Feeds it to an LLM of your choice (openai/o3, claude-3, etc.).
  3. Returns a scalar reward between 0 and 1.

That is literally the whole reward function:

# before: 50-line heuristic
def complex_reward(traj): ...

# after: one line
score = await ruler_score_group(trajectory_group, "openai/o3")

ART’s paper shows this matches or beats hand-written rewards on three out of four tested benchmarks.
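
To make this concrete, here is a minimal sketch of scoring one group of rollouts. The `Trajectory` and `TrajectoryGroup` constructors and the `from art.rewards import ruler_score_group` import path are assumptions for illustration; check the ART docs for the exact names in your installed version.

import art
from art.rewards import ruler_score_group   # assumed import path; verify against your ART version

async def score_rollouts(rollouts):
    # Each element of `rollouts` is one episode's full message history.
    # The task description lives in the system prompt, not in a reward function.
    trajectories = [
        art.Trajectory(messages_and_choices=messages, reward=0.0)  # assumed constructor
        for messages in rollouts
    ]
    group = art.TrajectoryGroup(trajectories)  # assumed constructor

    # RULER reads every conversation and produces a 0-1 reward per trajectory.
    scored = await ruler_score_group(group, "openai/o3")
    return scored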


3. ART at a glance

3.1 Architecture

| Component | Runs where | Responsibility |
| --- | --- | --- |
| ART Client | Your laptop / CI box | Looks like an OpenAI client; sends prompts, receives completions |
| ART Server | Any GPU machine | vLLM inference + GRPO training + LoRA checkpointing |

The two pieces talk over HTTP, so you can debug locally and train on a 4090 in the cloud.

3.2 Supported models

Any causal LM that vLLM can load and Unsloth can LoRA-tune works today:

  • Qwen 2.5 (3B/7B/14B/32B)
  • Llama-3-8B-Instruct
  • Mistral-7B-v0.3

Gemma-3 is still unsupported; check Discord for updates.


4. Installation

pip install openpipe-art

No root, no CUDA on the client side.
The server pulls its own CUDA-enabled image the first time it starts.
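
As a quick sanity check (a hypothetical snippet, not part of the official docs), you can confirm the client installed cleanly without any GPU dependencies:

# Verify the client-side install; no CUDA required on this machine.
from importlib.metadata import version

import art  # the Python package installed by `pip install openpipe-art`

print("openpipe-art", version("openpipe-art"))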


5. Quickstart: Train an agent to play 2048 (15 minutes)

5.1 One-click Colab

  1. Open the 2048 notebook.
  2. Run every cell in order.

    • Cell 1 downloads the ART server image.
    • Cell 2 spins up 32 parallel rollouts (see the sketch after this list).
    • Cell 3 calls RULER to reward “reaching higher tiles”.
    • Cell 4 triggers GRPO.
  3. After ~15 minutes, the 3B Qwen agent consistently reaches the 512 tile.
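
The "32 parallel rollouts" step is plain asyncio fan-out. A minimal, self-contained sketch with a stubbed episode function (`rollout` here is a hypothetical placeholder, not the notebook's actual code):

import asyncio

async def rollout() -> list:
    # Placeholder episode: in the notebook this plays one game of 2048
    # against the ART server and returns the finished trajectory.
    return []

async def gather_rollouts(n: int = 32) -> list:
    # Launch n independent episodes concurrently; the ART server batches
    # their inference requests through vLLM.
    return await asyncio.gather(*(rollout() for _ in range(n)))

if __name__ == "__main__":
    trajectories = asyncio.run(gather_rollouts())
    print(f"collected {len(trajectories)} rollouts")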

5.2 Local GPU (optional)

Replace the notebook’s `remote_server_url` with `http://localhost:8000` and run:

python -m art.server --port 8000 --model Qwen/Qwen2.5-3B-Instruct
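
With the server running locally, the only client-side change is the base URL. A minimal sketch, reusing the `AsyncClient` shown in section 9.1:

from art import AsyncClient

# Point the client at the locally running ART server instead of a cloud GPU box.
client = AsyncClient(base_url="http://localhost:8000/v1")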

6. Anatomy of the training loop

6.1 Inference phase

  1. Your code calls

    response = client.chat.completions.create(
        model="lora:latest",
        messages=[...],
        extra_body={"art": {"trajectory_id": tid}}
    )
    
  2. The ART server returns the completion text and logs the full trajectory in memory.
  3. When the rollout ends, you (or RULER) assign

    trajectory.reward = score
    

6.2 Training phase

  1. The client sends

    await client.submit_trajectories([trajectory_group])
    
  2. The server blocks new inference, runs GRPO on the LoRA, and saves the new adapter to ./checkpoints.
  3. vLLM hot-swaps the LoRA; inference resumes automatically.

Repeat until max_iterations is reached.
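
Putting sections 6.1 and 6.2 together, one iteration of the loop might look like the sketch below. The client calls (`chat.completions.create` with a trajectory_id, `submit_trajectories`) are the ones shown above; the helper names, constants, `TrajectoryGroup` / `Trajectory` constructors, and the RULER import path are assumptions for illustration, and the 2048 environment code is omitted.

import asyncio
import art
from art import AsyncClient
from art.rewards import ruler_score_group   # assumed import path

MAX_ITERATIONS = 10          # illustrative values
ROLLOUTS_PER_ITERATION = 32
MAX_TURNS = 8

client = AsyncClient(base_url="http://gpu-box:8000/v1")

async def rollout(tid: str) -> list:
    # One multi-step episode tagged with a single trajectory_id.
    messages = [
        {"role": "system", "content": "You are an agent playing 2048."},
        {"role": "user", "content": "Here is the starting board. Choose a move."},
    ]
    for _ in range(MAX_TURNS):
        resp = await client.chat.completions.create(
            model="lora:latest",
            messages=messages,
            extra_body={"art": {"trajectory_id": tid}},
        )
        move = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": move})
        # Apply the move in your environment, append the new observation,
        # and break when the episode ends. (Environment code omitted.)
        messages.append({"role": "user", "content": "New board state ..."})
    return messages

async def train() -> None:
    for step in range(MAX_ITERATIONS):
        # 1. Inference phase: collect rollouts in parallel.
        histories = await asyncio.gather(
            *(rollout(f"step{step}-rollout{i}") for i in range(ROLLOUTS_PER_ITERATION))
        )
        group = art.TrajectoryGroup(                        # assumed constructor
            art.Trajectory(messages_and_choices=m, reward=0.0) for m in histories
        )

        # 2. RULER assigns 0-1 rewards from the conversations alone.
        scored = await ruler_score_group(group, "openai/o3")

        # 3. Training phase: server runs GRPO, saves a new LoRA, resumes inference.
        await client.submit_trajectories([scored])

asyncio.run(train())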


7. Example notebooks you can run today

| Task | Notebook | Model size | Performance snapshot |
| --- | --- | --- | --- |
| 2048 | Train | 3B | accuracy curve |
| Temporal Clue | Train | 7B | coming soon |
| Tic Tac Toe | Train | 3B | accuracy curve |
| Codenames | Train | 3B | win rate |

8. Case study: ART•E e-mail retrieval agent

The ART•E blog post shows how we trained Qwen 2.5-14B to outperform o3 on a real-world inbox.

8.1 Task definition (system prompt)

You are an email assistant.  
The user will describe an email they need.  
Reply only with the message_id.

8.2 Reward

Binary: 1 if the returned ID is correct, 0 otherwise.
RULER handles it without any extra code.
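
If you ever want to skip the LLM judge for a check this simple, the equivalent hand-written scorer is a single comparison. This is a hypothetical sketch; `predicted_id` and `target_id` stand for whatever your rollout and dataset provide.

def email_reward(predicted_id: str, target_id: str) -> float:
    # Exact-match check on the returned message_id: 1.0 if correct, else 0.0.
    return 1.0 if predicted_id.strip() == target_id else 0.0

# Usage inside a rollout, mirroring section 6.1:
#   trajectory.reward = email_reward(returned_id, gold_id)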

8.3 Results

| Metric | Before RL | After 300 GRPO steps |
| --- | --- | --- |
| Accuracy | 34 % | 82 % |
| Avg. tokens per episode | 312 | 178 |

The entire run cost less than three dollars on an A100 spot instance.


9. Integrating ART into your own project

9.1 Minimal client snippet

from art import AsyncClient

# The client speaks the OpenAI API shape; point it at your ART server.
client = AsyncClient(base_url="http://gpu-box:8000/v1")

async def rollout():
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Book a table for two at 7 pm"}
    ]
    resp = await client.chat.completions.create(
        model="lora:latest",
        messages=messages
    )
    return resp.trajectory  # already tagged with a trajectory id by the server

9.2 Launch the server

docker run --gpus all -p 8000:8000 \
  -e MODEL=Qwen/Qwen2.5-7B-Instruct \
  openpipe/art-server:latest

No other flags are required; the defaults are battle-tested.


10. Observability built-in

ART ships with adapters for:

  • Weights & Biases – live curves, artifact versioning
  • Langfuse – trace-level debugging for every LLM call
  • OpenPipe – cost & latency analytics

Enable in one line:

export WANDB_PROJECT=my_agents

11. Frequently asked questions

1. Do I need to understand GRPO?

No. ART hides the math; you only decide how many rollouts per iteration.

2. Is my data sent to OpenAI?

Only if you choose `openai/o3` as the RULER judge. Rollouts and training stay on your hardware.

3. Can I use a CPU?

Training a 3B model on CPU would take days. Colab’s free T4 is the minimum viable option.

4. How large are the LoRA checkpoints?

Roughly 20 MB for a 7B model (rank = 64).

5. What if the task cannot be described in one sentence?

Chain multiple RULER calls or fall back to a light hand-written scorer; ART supports both.

6. Windows support?

The client works everywhere. The server is Linux-first; use WSL2 on Windows.

7. Is fine-tuning destructive?

No. ART always keeps the base model frozen; only LoRA weights change.
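
Concretely, a LoRA adapter adds a low-rank update on top of each frozen weight matrix, and only the two small factor matrices are trained and checkpointed. A toy NumPy illustration of the standard LoRA arithmetic (not ART-specific code; shapes and scaling are illustrative):

import numpy as np

d_out, d_in, rank = 4096, 4096, 64
alpha = 64  # LoRA scaling factor (illustrative)

W_base = np.random.randn(d_out, d_in)      # frozen base weight, never modified
A = np.random.randn(rank, d_in) * 0.01     # trained, tiny
B = np.zeros((d_out, rank))                # trained, tiny (starts at zero)

# What the server effectively applies after a LoRA hot-swap:
W_effective = W_base + (alpha / rank) * (B @ A)

# Only A and B go into the checkpoint, which is why the adapter files
# are orders of magnitude smaller than the base model.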

8. How do I stop a run early?

`Ctrl-C` on the server gracefully saves the latest checkpoint.

9. Can I continue training later?

Yes. Point the server to the same checkpoint directory; it auto-resumes.

10. Licensing?

Apache-2.0 for the code; model licenses depend on the weights you choose.


12. Roadmap & how to contribute

We especially welcome:

  • New environment adapters (any turn-based game or API)
  • Additional judge models for RULER
  • LangChain / LlamaIndex integrations

13. Citation

If you use ART in your research:

@misc{hilton2025art,
  author = {Brad Hilton and Kyle Corbitt and David Corbitt and Saumya Gandhi and Angky William and Bohdan Kovalenskyi and Andie Jones},
  title = {ART: Agent Reinforcement Trainer},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/openpipe/art}}
}

14. Takeaway checklist

| Task | Done? |
| --- | --- |
| Install the client (`pip install openpipe-art`) | ☐ |
| Run the 2048 notebook (link) | ☐ |
| Swap the system prompt to your use case ✍️ | ☐ |
| Watch metrics in W&B 🚀 | ☐ |
| Ship the LoRA 📦 | ☐ |

With ART, the gap between “I have an idea” and “my model can do it reliably” is a single notebook.