Train Multi-Step Agents for Real-World Tasks with ART

An end-to-end guide for developers who hate writing reward functions


Reader profile: You already know Python, have played with an LLM API, and now want the model to do something useful across many steps—play 2048, solve Temporal Clue, retrieve the right e-mail—without spending nights hand-crafting a reward function.
This article explains exactly how the open-source Agent Reinforcement Trainer (ART) does that for you.


1. What problem does ART solve?

| Pain point | How ART fixes it |
| --- | --- |
| Writing a reward function is tedious and error-prone | RULER auto-scores trajectories with another LLM |
| GRPO training code is long and fragile | ART wraps it into two Docker-ready services |
| You need a GPU but only have a laptop | ART spins up an ephemeral cloud GPU when you call the client |
| Debugging RL is opaque | Native integrations with W&B, Langfuse, OpenPipe |

2. RULER: Zero-shot scoring without hand-crafted rewards

Instead of `if won: reward += 100`, describe the task in plain English inside the system prompt.
RULER then:

  1. Takes the entire conversation history (system / user / assistant messages).
  2. Feeds it to an LLM of your choice (openai/o3, claude-3, etc.).
  3. Returns a scalar reward between 0 and 1.

That is literally the whole reward function:

# before: 50-line heuristic
def complex_reward(traj): ...

# after: one line
score = await ruler_score_group(trajectory_group, "openai/o3")

ART’s paper shows this matches or beats hand-written rewards on three out of four tested benchmarks.
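
To make this concrete, here is a minimal sketch of scoring one group of rollouts. The `Trajectory` and `TrajectoryGroup` constructors and the `from art.rewards import ruler_score_group` import path are assumptions for illustration; check the ART docs for the exact names in your installed version.

import art
from art.rewards import ruler_score_group   # assumed import path; verify against your ART version

async def score_rollouts(rollouts):
    # Each element of `rollouts` is one episode's full message history.
    # The task description lives in the system prompt, not in a reward function.
    trajectories = [
        art.Trajectory(messages_and_choices=messages, reward=0.0)  # assumed constructor
        for messages in rollouts
    ]
    group = art.TrajectoryGroup(trajectories)  # assumed constructor

    # RULER reads every conversation and produces a 0-1 reward per trajectory.
    scored = await ruler_score_group(group, "openai/o3")
    return scored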


3. ART at a glance

3.1 Architecture

| Component | Runs where | Responsibility |
| --- | --- | --- |
| ART Client | Your laptop / CI box | Looks like an OpenAI client; sends prompts, receives completions |
| ART Server | Any GPU machine | vLLM inference + GRPO training + LoRA checkpointing |

The two pieces talk over HTTP, so you can debug locally and train on a 4090 in the cloud.

3.2 Supported models

Any causal LM that vLLM can load and Unsloth can LoRA-tune works today:

  • Qwen 2.5 (3B/7B/14B/32B)
  • Llama-3-8B-Instruct
  • Mistral-7B-v0.3

Gemma-3 is still unsupported; check Discord for updates.


4. Installation

pip install openpipe-art

No root, no CUDA on the client side.
The server pulls its own CUDA-enabled image the first time it starts.
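
As a quick sanity check (a hypothetical snippet, not part of the official docs), you can confirm the client installed cleanly without any GPU dependencies:

# Verify the client-side install; no CUDA required on this machine.
from importlib.metadata import version

import art  # the Python package installed by `pip install openpipe-art`

print("openpipe-art", version("openpipe-art"))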


5. Quickstart: Train an agent to play 2048 (15 minutes)

5.1 One-click Colab

  1. Open the 2048 notebook.
  2. Run every cell in order.

    • Cell 1 downloads the ART server image.
    • Cell 2 spins up 32 parallel rollouts (see the sketch after this list).
    • Cell 3 calls RULER to reward “reaching higher tiles”.
    • Cell 4 triggers GRPO.
  3. After ~15 minutes, the 3B Qwen agent consistently reaches the 512 tile.
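
The "32 parallel rollouts" step is plain asyncio fan-out. A minimal, self-contained sketch with a stubbed episode function (`rollout` here is a hypothetical placeholder, not the notebook's actual code):

import asyncio

async def rollout() -> list:
    # Placeholder episode: in the notebook this plays one game of 2048
    # against the ART server and returns the finished trajectory.
    return []

async def gather_rollouts(n: int = 32) -> list:
    # Launch n independent episodes concurrently; the ART server batches
    # their inference requests through vLLM.
    return await asyncio.gather(*(rollout() for _ in range(n)))

if __name__ == "__main__":
    trajectories = asyncio.run(gather_rollouts())
    print(f"collected {len(trajectories)} rollouts")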

5.2 Local GPU (optional)

Replace the notebook’s `remote_server_url` with `http://localhost:8000` and run:

python -m art.server --port 8000 --model Qwen/Qwen2.5-3B-Instruct
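
With the server running locally, the only client-side change is the base URL. A minimal sketch, reusing the `AsyncClient` shown in section 9.1:

from art import AsyncClient

# Point the client at the locally running ART server instead of a cloud GPU box.
client = AsyncClient(base_url="http://localhost:8000/v1")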

6. Anatomy of the training loop

6.1 Inference phase

  1. Your code calls

    response = client.chat.completions.create(
        model="lora:latest",
        messages=[...],
        extra_body={"art": {"trajectory_id": tid}}
    )
    
  2. The ART server returns the completion text and logs the full trajectory in memory.
  3. When the rollout ends, you (or RULER) assign

    trajectory.reward = score
    

6.2 Training phase

  1. The client sends

    await client.submit_trajectories([trajectory_group])
    
  2. The server blocks new inference, runs GRPO on the LoRA, and saves the new adapter to ./checkpoints.
  3. vLLM hot-swaps the LoRA; inference resumes automatically.

Repeat until max_iterations is reached.
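
Putting sections 6.1 and 6.2 together, one iteration of the loop might look like the sketch below. The client calls (`chat.completions.create` with a trajectory_id, `submit_trajectories`) are the ones shown above; the helper names, constants, `TrajectoryGroup` / `Trajectory` constructors, and the RULER import path are assumptions for illustration, and the 2048 environment code is omitted.

import asyncio
import art
from art import AsyncClient
from art.rewards import ruler_score_group   # assumed import path

MAX_ITERATIONS = 10          # illustrative values
ROLLOUTS_PER_ITERATION = 32
MAX_TURNS = 8

client = AsyncClient(base_url="http://gpu-box:8000/v1")

async def rollout(tid: str) -> list:
    # One multi-step episode tagged with a single trajectory_id.
    messages = [
        {"role": "system", "content": "You are an agent playing 2048."},
        {"role": "user", "content": "Here is the starting board. Choose a move."},
    ]
    for _ in range(MAX_TURNS):
        resp = await client.chat.completions.create(
            model="lora:latest",
            messages=messages,
            extra_body={"art": {"trajectory_id": tid}},
        )
        move = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": move})
        # Apply the move in your environment, append the new observation,
        # and break when the episode ends. (Environment code omitted.)
        messages.append({"role": "user", "content": "New board state ..."})
    return messages

async def train() -> None:
    for step in range(MAX_ITERATIONS):
        # 1. Inference phase: collect rollouts in parallel.
        histories = await asyncio.gather(
            *(rollout(f"step{step}-rollout{i}") for i in range(ROLLOUTS_PER_ITERATION))
        )
        group = art.TrajectoryGroup(                        # assumed constructor
            art.Trajectory(messages_and_choices=m, reward=0.0) for m in histories
        )

        # 2. RULER assigns 0-1 rewards from the conversations alone.
        scored = await ruler_score_group(group, "openai/o3")

        # 3. Training phase: server runs GRPO, saves a new LoRA, resumes inference.
        await client.submit_trajectories([scored])

asyncio.run(train())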


7. Example notebooks you can run today

| Task | Notebook | Model size | Performance snapshot |
| --- | --- | --- | --- |
| 2048 | Train | 3B | accuracy curve |
| Temporal Clue | Train | 7B | coming soon |
| Tic Tac Toe | Train | 3B | accuracy curve |
| Codenames | Train | 3B | win rate |

8. Case study: ART•E e-mail retrieval agent

The ART•E blog post shows how we trained Qwen 2.5-14B to outperform o3 on a real-world inbox.

8.1 Task definition (system prompt)

You are an email assistant.  
The user will describe an email they need.  
Reply only with the message_id.

8.2 Reward

Binary: 1 if the returned ID is correct, 0 otherwise.
RULER handles it without any extra code.
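
If you ever want to skip the LLM judge for a check this simple, the equivalent hand-written scorer is a single comparison. This is a hypothetical sketch; `predicted_id` and `target_id` stand for whatever your rollout and dataset provide.

def email_reward(predicted_id: str, target_id: str) -> float:
    # Exact-match check on the returned message_id: 1.0 if correct, else 0.0.
    return 1.0 if predicted_id.strip() == target_id else 0.0

# Usage inside a rollout, mirroring section 6.1:
#   trajectory.reward = email_reward(returned_id, gold_id)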

8.3 Results

| Metric | Before RL | After 300 GRPO steps |
| --- | --- | --- |
| Accuracy | 34 % | 82 % |
| Avg. tokens per episode | 312 | 178 |

The entire run cost less than three dollars on an A100 spot instance.


9. Integrating ART into your own project

9.1 Minimal client snippet

from art import AsyncClient

# The client speaks the OpenAI API shape; point it at your ART server.
client = AsyncClient(base_url="http://gpu-box:8000/v1")

async def rollout():
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Book a table for two at 7 pm"}
    ]
    resp = await client.chat.completions.create(
        model="lora:latest",
        messages=messages
    )
    return resp.trajectory  # already tagged with a trajectory id by the server

9.2 Launch the server

docker run --gpus all -p 8000:8000 \
  -e MODEL=Qwen/Qwen2.5-7B-Instruct \
  openpipe/art-server:latest

No other flags are required; the defaults are battle-tested.


10. Observability built-in

ART ships with adapters for:

  • Weights & Biases – live curves, artifact versioning
  • Langfuse – trace-level debugging for every LLM call
  • OpenPipe – cost & latency analytics

Enable in one line:

export WANDB_PROJECT=my_agents

11. Frequently asked questions

1. Do I need to understand GRPO?

No. ART hides the math; you only decide how many rollouts per iteration.

2. Is my data sent to OpenAI?

Only if you choose `openai/o3` as the RULER judge. Rollouts and training stay on your hardware.

3. Can I use a CPU?

Training a 3B model on CPU would take days. Colab’s free T4 is the minimum viable option.

4. How large are the LoRA checkpoints?

Roughly 20 MB for a 7B model (rank = 64).

5. What if the task cannot be described in one sentence?

Chain multiple RULER calls or fall back to a light hand-written scorer; ART supports both.

6. Windows support?

The client works everywhere. The server is Linux-first; use WSL2 on Windows.

7. Is fine-tuning destructive?

No. ART always keeps the base model frozen; only LoRA weights change.
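
Concretely, a LoRA adapter adds a low-rank update on top of each frozen weight matrix, and only the two small factor matrices are trained and checkpointed. A toy NumPy illustration of the standard LoRA arithmetic (not ART-specific code; shapes and scaling are illustrative):

import numpy as np

d_out, d_in, rank = 4096, 4096, 64
alpha = 64  # LoRA scaling factor (illustrative)

W_base = np.random.randn(d_out, d_in)      # frozen base weight, never modified
A = np.random.randn(rank, d_in) * 0.01     # trained, tiny
B = np.zeros((d_out, rank))                # trained, tiny (starts at zero)

# What the server effectively applies after a LoRA hot-swap:
W_effective = W_base + (alpha / rank) * (B @ A)

# Only A and B go into the checkpoint, which is why the adapter files
# are orders of magnitude smaller than the base model.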

8. How do I stop a run early?

`Ctrl-C` on the server gracefully saves the latest checkpoint.

9. Can I continue training later?

Yes. Point the server to the same checkpoint directory; it auto-resumes.

10. Licensing?

Apache-2.0 for the code; model licenses depend on the weights you choose.


12. Roadmap & how to contribute

We especially welcome:

  • New environment adapters (any turn-based game or API)
  • Additional judge models for RULER
  • LangChain / LlamaIndex integrations

13. Citation

If you use ART in your research:

@misc{hilton2025art,
  author = {Brad Hilton and Kyle Corbitt and David Corbitt and Saumya Gandhi and Angky William and Bohdan Kovalenskyi and Andie Jones},
  title = {ART: Agent Reinforcement Trainer},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/openpipe/art}}
}

14. Takeaway checklist

| Task | Done? |
| --- | --- |
| Install the client (`pip install openpipe-art`) | ☐ |
| Run the 2048 notebook (link) | ☐ |
| Swap the system prompt to your use case ✍️ | ☐ |
| Watch metrics in W&B 🚀 | ☐ |
| Ship the LoRA 📦 | ☐ |

With ART, the gap between “I have an idea” and “my model can do it reliably” is a single notebook.