Train Multi-Step Agents for Real-World Tasks with ART
An end-to-end guide for developers who hate writing reward functions
Reader profile: You already know Python, have played with an LLM API, and now want the model to do something useful across many steps—play 2048, solve Temporal Clue, retrieve the right e-mail—without spending nights hand-crafting a reward function.
This article explains exactly how the open-source Agent Reinforcement Trainer (ART) does that for you.
1. What problem does ART solve?
| Pain point | How ART fixes it |
|---|---|
| Writing a reward function is tedious and error-prone | RULER auto-scores trajectories with another LLM |
| GRPO training code is long and fragile | ART wraps it into two Docker-ready services |
| You need a GPU but only have a laptop | ART spins up an ephemeral cloud GPU when you call the client |
| Debugging RL is opaque | Native integrations with W&B, Langfuse, OpenPipe |
2. RULER: Zero-shot, hand-crafted-reward-free scoring
Instead of `if won: reward += 100`, describe the task in plain English inside the system prompt.
RULER then:

1. Takes the entire conversation history (system / user / assistant messages).
2. Feeds it to an LLM of your choice (`openai/o3`, `claude-3`, etc.).
3. Returns a scalar reward between 0 and 1.
That is literally the whole reward function:

```python
# before: a 50-line heuristic
def complex_reward(traj): ...

# after: one line
score = await ruler_score_group(trajectory_group, "openai/o3")
```
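In context, a scoring step might look like the sketch below. It assumes the `ruler_score_group` call shown above returns the group with each trajectory's `reward` filled in; treat the import path and the `.trajectories` attribute as assumptions rather than documented API.

```python
# Minimal sketch: score a batch of finished rollouts with RULER.
from art.rewards import ruler_score_group  # import path is an assumption

async def score_rollouts(trajectory_group):
    # One LLM call judges the whole group; no hand-written heuristic needed.
    scored_group = await ruler_score_group(trajectory_group, "openai/o3")
    for traj in scored_group.trajectories:  # .trajectories: assumed attribute
        print(f"reward={traj.reward:.2f}")  # scalar in [0, 1]
    return scored_group
```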
ART’s paper shows this matches or beats hand-written rewards on three out of four tested benchmarks.
3. ART at a glance
3.1 Architecture
| Component | Runs where | Responsibility |
|---|---|---|
| ART Client | Your laptop / CI box | Looks like an OpenAI client; sends prompts, receives completions |
| ART Server | Any GPU machine | vLLM inference + GRPO training + LoRA checkpointing |
The two pieces talk over HTTP, so you can debug locally and train on a 4090 in the cloud.
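Because the split is plain HTTP, switching between a local debug server and a cloud GPU is a one-line change. A minimal sketch, reusing the `AsyncClient` from section 9 (the base URLs are placeholders):

```python
from art import AsyncClient  # client API as shown in section 9

# Same client code either way; only the base URL changes.
local = AsyncClient(base_url="http://localhost:8000/v1")  # debug on your laptop
remote = AsyncClient(base_url="http://gpu-box:8000/v1")   # train on a cloud 4090
```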
3.2 Supported models
Any causal LM that vLLM can load and Unsloth can LoRA-tune works today:
- Qwen 2.5 (3B / 7B / 14B / 32B)
- Llama-3-8B-Instruct
- Mistral-7B-v0.3
Gemma-3 is still unsupported; check Discord for updates.
4. Installation
```bash
pip install openpipe-art
```
No root, no CUDA on the client side.
The server pulls its own CUDA-enabled image the first time it starts.
5. Quickstart: Train an agent to play 2048 (15 minutes)
5.1 One-click Colab
1. Open the 2048 notebook.
2. Run every cell in order:
   - Cell 1 downloads the ART server image.
   - Cell 2 spins up 32 parallel rollouts.
   - Cell 3 calls RULER to reward "reaching higher tiles".
   - Cell 4 triggers GRPO.
After ~15 min, the 3B Qwen agent consistently reaches the 512 tile.
5.2 Local GPU (optional)
Replace the notebook's `remote_server_url` with `http://localhost:8000` and run:
```bash
python -m art.server --port 8000 --model Qwen/Qwen2.5-3B-Instruct
```
6. Anatomy of the training loop
6.1 Inference phase
1. Your code calls:

   ```python
   response = client.chat.completions.create(
       model="lora:latest",
       messages=[...],
       extra_body={"art": {"trajectory_id": tid}},
   )
   ```

2. The ART server returns the completion and logs the full trajectory in memory.
3. When the rollout ends, you (or RULER) assign `trajectory.reward = score`.
6.2 Training phase
1. The client sends `await client.submit_trajectories([trajectory_group])`.
2. The server blocks new inference, runs GRPO on the LoRA, and saves the new adapter to `./checkpoints`.
3. vLLM hot-swaps the LoRA; inference resumes automatically.

Repeat until `max_iterations` is reached.
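Put together, a full iteration can be sketched as follows. It assumes the `AsyncClient`, `ruler_score_group`, and `submit_trajectories` calls shown elsewhere in this guide; `run_rollout` and `make_trajectory_group` are placeholder helpers you would supply, not ART functions.

```python
import asyncio
from art import AsyncClient
from art.rewards import ruler_score_group  # import path assumed

async def train(max_iterations: int = 50, rollouts_per_iter: int = 32):
    client = AsyncClient(base_url="http://gpu-box:8000/v1")
    for step in range(max_iterations):
        # Inference phase: gather a group of parallel rollouts.
        trajectories = await asyncio.gather(
            *(run_rollout(client) for _ in range(rollouts_per_iter))  # run_rollout: your env loop
        )
        group = make_trajectory_group(trajectories)  # hypothetical grouping helper
        # Reward phase: one RULER call scores the whole group.
        scored = await ruler_score_group(group, "openai/o3")
        # Training phase: the server runs GRPO and hot-swaps the LoRA.
        await client.submit_trajectories([scored])
```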
7. Example notebooks you can run today
| Task | Notebook | Model size | Performance snapshot |
|---|---|---|---|
| 2048 | Train | 3B | |
| Temporal Clue | Train | 7B | coming soon |
| Tic Tac Toe | Train | 3B | |
| Codenames | Train | 3B | |
8. Case study: ART•E e-mail retrieval agent
The accompanying blog post shows how we trained Qwen 2.5-14B to outperform o3 on a real-world inbox.
8.1 Task definition (system prompt)
```
You are an email assistant.
The user will describe an email they need.
Reply only with the message_id.
```
8.2 Reward
Binary: 1 if the returned ID is correct, 0 otherwise.
RULER handles it without any extra code.
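For comparison, the hand-written scorer that RULER replaces here would be a one-liner. A hypothetical sketch only: the `messages` layout and the `expected_id` field holding the ground-truth answer are assumptions, not ART attributes.

```python
def binary_reward(trajectory) -> float:
    # 1.0 if the agent replied with the correct message_id, else 0.0.
    returned_id = trajectory.messages[-1]["content"].strip()  # messages: assumed layout
    return 1.0 if returned_id == trajectory.expected_id else 0.0  # expected_id: assumed field
```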
8.3 Results
| Metric | Before RL | After 300 GRPO steps |
|---|---|---|
| Accuracy | 34% | 82% |
| Avg. tokens per episode | 312 | 178 |
The entire run cost less than three dollars on an A100 spot instance.
9. Integrating ART into your own project
9.1 Minimal client snippet
```python
from art import AsyncClient

client = AsyncClient(base_url="http://gpu-box:8000/v1")

async def rollout():
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Book a table for two at 7 pm"},
    ]
    resp = await client.chat.completions.create(
        model="lora:latest",
        messages=messages,
    )
    return resp.trajectory  # already tagged with an id
```
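A minimal driver for the snippet above; it assumes the `submit_trajectories` call from section 6 accepts a single-element group, which is an assumption rather than documented behavior:

```python
import asyncio

async def main():
    traj = await rollout()
    traj.reward = 1.0  # hand-set here; RULER can score a whole group instead (section 6)
    await client.submit_trajectories([traj])  # single-element group: an assumption

asyncio.run(main())
```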
9.2 Launch the server
```bash
docker run --gpus all -p 8000:8000 \
  -e MODEL=Qwen/Qwen2.5-7B-Instruct \
  openpipe/art-server:latest
```
No other flags required; defaults are battle-tested.
10. Observability built-in
ART ships with adapters for:
- Weights & Biases – live curves, artifact versioning
- Langfuse – trace-level debugging for every LLM call
- OpenPipe – cost & latency analytics
Enable in one line:
```bash
export WANDB_PROJECT=my_agents
```
11. Frequently asked questions
1. Do I need to understand GRPO?
No. ART hides the math; you only decide how many rollouts per iteration.
2. Is my data sent to OpenAI?
Only if you choose `openai/o3` as the RULER judge. Rollouts and training stay on your hardware.
3. Can I use a CPU?
Training a 3B model on CPU would take days. Colab’s free T4 is the minimum viable option.
4. How large are the LoRA checkpoints?
Roughly 20 MB for a 7B model (rank = 64).
5. What if the task cannot be described in one sentence?
Chain multiple RULER calls or fall back to a light hand-written scorer; ART supports both.
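For instance, a blended scorer might weight a RULER judgment against a cheap hand-written length penalty. A sketch only; the 0.8/0.2 weighting is illustrative and `traj.num_tokens` is an assumed field:

```python
from art.rewards import ruler_score_group  # import path assumed

async def blended_scores(group, max_tokens: int = 500):
    scored = await ruler_score_group(group, "openai/o3")  # LLM judge, as above
    for traj in scored.trajectories:  # .trajectories: assumed attribute
        # Hand-written component: mild penalty for long episodes.
        length_penalty = min(traj.num_tokens / max_tokens, 1.0)  # num_tokens: assumed field
        traj.reward = 0.8 * traj.reward + 0.2 * (1.0 - length_penalty)
    return scored
```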
6. Windows support?
Client works everywhere. Server is Linux-first; use WSL2 on Windows.
7. Is fine-tuning destructive?
No. ART always keeps the base model frozen; only LoRA weights change.
8. How do I stop a run early?
`Ctrl-C` on the server gracefully saves the latest checkpoint.
9. Can I continue training later?
Yes. Point the server to the same checkpoint directory; it auto-resumes.
10. Licensing?
Apache-2.0 for the code; model licenses depend on the weights you choose.
12. Roadmap & how to contribute
- GitHub: openpipe/art
- Discord: #art
- Contribution guide: CONTRIBUTING.md
We especially welcome:
- New environment adapters (any turn-based game or API)
- Additional judge models for RULER
- LangChain / LlamaIndex integrations
13. Citation
If you use ART in your research:
```bibtex
@misc{hilton2025art,
  author       = {Brad Hilton and Kyle Corbitt and David Corbitt and Saumya Gandhi and Angky William and Bohdan Kovalenskyi and Andie Jones},
  title        = {ART: Agent Reinforcement Trainer},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/openpipe/art}}
}
```
14. Takeaway checklist
| Task | Done? |
|---|---|
| Install the client | `pip install openpipe-art` |
| Run the 2048 notebook | link |
| Swap the system prompt to your use case | ✍️ |
| Watch metrics in W&B | 🚀 |
| Ship the LoRA | 📦 |
With ART, "I have an idea" and "my model can do it reliably" are one notebook apart.