Going Beyond Ten Clicks: How ASearcher Uses Asynchronous Reinforcement Learning to Push Open-Source Search Agents Past 40 Turns
Imagine you are asked to find the exact number of gold, silver, and bronze medals China won in the 2012 London Olympics as of 31 December 2024.
A quick search returns two conflicting totals: “38-27-22” and “39-31-22”.
A human researcher would open multiple official reports, cross-check doping appeals, and finally discover that one gold medal was later withdrawn.
That process can take dozens of web pages and many reasoning steps—far more than the ten-turn limit that most open-source language agents accept today.
ASearcher is the first fully open-source project that removes this barrier.
By combining a simple two-tool agent, an automatic question-building pipeline, and a fully asynchronous reinforcement-learning (RL) system, it routinely performs 40-plus search-and-browse actions and produces more than 150 000 tokens of reasoning before giving a final answer.
This post explains—in plain language—how it works, what it achieves, and how you can reproduce or adapt it for your own use case.
Table of contents

1. Why long-horizon search matters
2. Three building blocks of ASearcher
   2.1 Data that grows tougher at every step
   2.2 Training that never waits for the slowest query
   2.3 An agent with only two tools
3. Benchmark results in plain numbers
4. Hands-on: running ASearcher in three scenarios
   4.1 Reproducing the published scores
   4.2 Fine-tuning a 7-billion-parameter model
   4.3 Building your own question set
5. Frequently asked questions
1. Why long-horizon search matters
The takeaway is simple: complex questions need deep dives, and deep dives need long trajectories.
Traditional RL systems stop at ten steps because longer trajectories leave GPUs idle.
ASearcher solves the idle-GPU problem with fully asynchronous training, letting the agent search as long as necessary.
2. Three building blocks of ASearcher
2.1 Data that grows tougher at every step
How the synthetic writer works

1. Injection
   - Start with “When was Michael P. Hein born?”
   - Insert extra facts: “…the first Ulster County Executive who allowed the Catskill Mountain Railroad to keep running in 2016…”
   - The question is now harder because more conditions must be checked.
2. Fuzzing
   - Replace “2016” with “the 2016 United States House elections period”, or swap “Catskill Mountain Railroad” for “a historic mountain railway”.
   - Precision is lost, forcing the model to search instead of memorise.
3. Quality gates
   - A strong model (QwQ-32B) tries to answer without tools; if it succeeds, the question is discarded.
   - A second model confirms there is only one valid answer.
   - Human reviewers sample 5 % for final sanity checks.
This pipeline starts with 14 107 seed questions and finishes with 25 624 high-quality, tool-demanding items.
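The two automatic quality gates can be sketched as a simple filter. This is a minimal illustration, not the project's actual code: `closed_book_answer` and `count_valid_answers` are hypothetical stand-ins for calls to QwQ-32B (no tools) and the second checker model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CandidateQA:
    question: str
    answer: str

def quality_gate(
    qa: CandidateQA,
    closed_book_answer: Callable[[str], str],   # stand-in: QwQ-32B answering WITHOUT tools
    count_valid_answers: Callable[[str], int],  # stand-in: second model counting valid answers
) -> bool:
    """Return True if the candidate question survives both automatic gates."""
    # Gate 1: if a strong model answers correctly without searching,
    # the question does not actually demand tool use -- discard it.
    if closed_book_answer(qa.question).strip().lower() == qa.answer.strip().lower():
        return False
    # Gate 2: keep only questions with exactly one valid answer.
    if count_valid_answers(qa.question) != 1:
        return False
    return True
```

Running every injected/fuzzed candidate through such a gate (plus the 5 % human spot-check) is what shrinks a large raw pool down to the tool-demanding subset.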
2.2 Training that never waits for the slowest query
Under the hood, the sampler and trainer live in separate processes.
Even if one trajectory takes 50 tool calls and another takes 2, the trainer always has fresh data and the GPUs stay busy.
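The idea can be shown with a toy producer-consumer sketch using plain Python threads and a queue (AReaL's real system is far more sophisticated, but the decoupling is the same): samplers push finished trajectories as they come, and the trainer consumes whichever is ready first instead of waiting for the slowest rollout.

```python
import queue
import random
import threading
import time

# A shared queue decouples rollout workers (producers) from the trainer (consumer).
trajectory_queue: "queue.Queue[dict]" = queue.Queue()

def sampler(worker_id: int, n_episodes: int) -> None:
    """Stand-in rollout worker: some episodes take 2 tool calls, some 50."""
    for _ in range(n_episodes):
        n_tool_calls = random.choice([2, 50])
        time.sleep(0.001 * n_tool_calls)  # simulate rollout latency
        trajectory_queue.put({"worker": worker_id, "tool_calls": n_tool_calls})

def trainer(n_total: int) -> list:
    """Consumes trajectories as they arrive; never blocks on the slowest worker."""
    batch = []
    for _ in range(n_total):
        batch.append(trajectory_queue.get())  # returns as soon as ANY trajectory is ready
    return batch

workers = [threading.Thread(target=sampler, args=(i, 4)) for i in range(4)]
for w in workers:
    w.start()
finished = trainer(n_total=16)
for w in workers:
    w.join()
print(len(finished))  # → 16
```

Because the trainer pulls from the queue rather than joining a fixed batch of workers, a 50-call trajectory never stalls the GPUs while shorter ones are available.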
2.3 An agent with only two tools
No external LLM, no extra planner, no memory bank.
Everything—reasoning, summarising, and verifying—happens inside the same model.
For large reasoning models (LRMs) like QwQ-32B, only the last 25 000 characters of history are kept to fit the context window.
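A stripped-down sketch of such a two-tool loop, assuming a simple text protocol (`SEARCH:`, `BROWSE:`, `ANSWER:`); the `llm`, `search`, and `browse` callables are hypothetical stand-ins, and the prompt format here is illustrative, not ASearcher's actual one:

```python
MAX_HISTORY_CHARS = 25_000  # keep only the most recent history for LRMs like QwQ-32B

def run_agent(question, llm, search, browse, max_turns=64):
    """Single-model agent loop: the same model reasons, summarises, and verifies."""
    history = f"Question: {question}\n"
    for _ in range(max_turns):
        # Feed only the tail of the history so long trajectories fit the context window.
        action = llm(history[-MAX_HISTORY_CHARS:])
        history += f"\n{action}"
        if action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip()
        if action.startswith("SEARCH:"):
            history += f"\n<results>{search(action[len('SEARCH:'):].strip())}</results>"
        elif action.startswith("BROWSE:"):
            history += f"\n<page>{browse(action[len('BROWSE:'):].strip())}</page>"
    return None  # turn budget exhausted
```

Note how the context truncation lives inside the loop itself: there is no external memory bank, only a sliding window over the model's own transcript.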
3. Benchmark results in plain numbers
3.1 Standard multi-hop and single-hop tasks (local knowledge base)
3.2 Challenging web tasks (real-time search)
3.3 Improvement after RL
4. Hands-on: running ASearcher in three scenarios
4.1 Reproducing the published scores
1. Get the code and weights

   git clone https://github.com/inclusionAI/ASearcher.git
   cd ASearcher
   pip install -r requirements.txt

2. Download test data

   wget https://huggingface.co/datasets/inclusionAI/ASearcher-test-data/resolve/main/GAIA.tar.gz
   tar -xzf GAIA.tar.gz

3. Run evaluation

   cd evaluation/
   export SERPER_API_KEY="your_key"
   export JINA_API_KEY="your_key"
   python3 search_eval_async.py \
       --data_names GAIA,xbench-deepsearch,Frames \
       --model_name_or_path inclusionAI/ASearcher-Web-QwQ \
       --output_dir ./results \
       --llm_as_judge \
       --pass-at-k 4
After 30–60 minutes on an 8×A100 node, the console prints Avg@4 and Pass@4 scores that match Table 4 in the paper.
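For readers unfamiliar with the two metrics, here is how they are conventionally computed (assuming the standard definitions; the evaluation script's exact scoring may differ in details): Avg@k averages the score over all k sampled answers per question, while Pass@k counts a question as solved if at least one of its k answers is correct.

```python
def avg_at_k(scores_per_question):
    """Avg@k: mean score over all k attempts of every question."""
    return sum(sum(s) / len(s) for s in scores_per_question) / len(scores_per_question)

def pass_at_k(scores_per_question):
    """Pass@k: fraction of questions where at least one of the k attempts succeeds."""
    return sum(1.0 for s in scores_per_question if max(s) > 0) / len(scores_per_question)

# Three questions, k = 4 attempts each (1 = correct, 0 = wrong):
scores = [[1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
print(avg_at_k(scores))   # → 0.4166... (i.e. 1.25 / 3)
print(pass_at_k(scores))  # → 0.6666... (2 of 3 questions solved at least once)
```

Pass@k is always at least Avg@k, which is why the two numbers are reported side by side.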
4.2 Fine-tuning a 7-billion-parameter model
Option A: single-node (slow but cheap)
cd AReaL
export SERPER_API_KEY="your_key"
export JINA_API_KEY="your_key"
python3 -m areal.launcher.local ASearcher/train/asearcher.py \
--config ASearcher/configs/asearcher_web.yaml \
experiment_name=my_run \
trial_name=7b_local
Option B: 16-node cluster (recommended)
python3 -m areal.launcher.ray ASearcher/train/asearcher.py \
--config ASearcher/configs/asearcher_web_16nodes.yaml \
experiment_name=my_run \
trial_name=7b_cluster \
cluster.n_nodes=16 \
cluster.n_gpus_per_node=8
- Training on 35 k questions with a 32-turn limit takes ~48 hours on 128 A100s.
- Logs are written to logs/; TensorBoard shows reward, F1, and token-count curves.
4.3 Building your own question set
1. Prepare seed questions

   Any JSONL file with {"question": "...", "answer": "..."} works.

2. Start two SGLang servers

   - Port 30000 → QwQ-32B for generation
   - Port 30001 → Qwen2.5-72B-Instruct for quality checks

3. Launch synthesis

   python3 qa_synthesis/qa_synthesis_agent.py \
       --seed_path data/seed_qa.jsonl \
       --output_dir data/my_questions \
       --inject_rounds 4 \
       --fuzz_rounds 2
The script outputs filtered JSONL ready for training.
5. Frequently asked questions
Q1: How much GPU memory do I really need?
- 7 B model: 24 GB with ZeRO-3 offload.
- 32 B model: 80 GB × 8 GPUs is comfortable.
Q2: Can I replace Serper with my own enterprise search?
Yes. Modify search_client.py; the rest of the pipeline stays unchanged.
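The swap amounts to keeping the same call shape while changing the backend. The interface below is a hypothetical sketch (the real one lives in search_client.py and may differ); `EnterpriseSearchClient` uses a trivial in-memory keyword match as a stand-in for your search service.

```python
from typing import Protocol

class SearchClient(Protocol):
    """Assumed interface -- check search_client.py for the project's actual signature."""
    def search(self, query: str, top_k: int = 10) -> list: ...

class EnterpriseSearchClient:
    """Drop-in replacement backed by an internal index instead of Serper."""
    def __init__(self, index: dict):
        self.index = index  # stand-in for your enterprise search backend

    def search(self, query: str, top_k: int = 10) -> list:
        # Trivial keyword match; replace this body with a call to your search API.
        hits = [
            {"title": title, "snippet": text}
            for title, text in self.index.items()
            if any(tok in text.lower() for tok in query.lower().split())
        ]
        return hits[:top_k]
```

As long as the replacement returns results in the shape the agent expects, neither the agent loop nor the RL training code needs to change.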
Q3: My synthetic questions feel too easy.
Increase fuzz_rounds or lower the difficulty threshold in qa_synthesis_agent.py.
Q4: Why not use more tools (calculator, map, code executor)?
The paper keeps the design minimal to prove RL alone unlocks depth.
Feel free to add tools—just extend the action space and reward function.
Q5: Can I fine-tune smaller models like 3 B?
Technically yes, but the 7 B baseline already struggles with long-page summarisation; 3 B would likely fail to converge.