Going Beyond Ten Clicks: How ASearcher Uses Asynchronous Reinforcement Learning to Push Open-Source Search Agents Past 40 Turns

Imagine you are asked to find the exact number of gold, silver, and bronze medals China won in the 2012 London Olympics as of 31 December 2024.
A quick search returns two conflicting totals: “38-27-22” and “39-31-22”.
A human researcher would open multiple official reports, cross-check doping appeals, and finally discover that one gold medal was later withdrawn.
That process can take dozens of web pages and many reasoning steps—far more than the ten-turn limit that most open-source language agents accept today.

ASearcher is the first fully open-source project that removes this barrier.
By combining a simple two-tool agent, an automatic question-building pipeline, and a fully asynchronous reinforcement-learning (RL) system, it routinely performs 40-plus search-and-browse actions and produces more than 150 000 tokens of reasoning before giving a final answer.
This post explains—in plain language—how it works, what it achieves, and how you can reproduce or adapt it for your own use case.


Table of contents

  1. Why long-horizon search matters
  2. Three building blocks of ASearcher
    2.1 Data that grows tougher at every step
    2.2 Training that never waits for the slowest query
    2.3 An agent with only two tools
  3. Benchmark results in plain numbers
  4. Hands-on: running ASearcher in three scenarios
    4.1 Reproducing the published scores
    4.2 Fine-tuning a 7-billion-parameter model
    4.3 Building your own question set
  5. Frequently asked questions

1. Why long-horizon search matters

| Task | Typical human effort | Old open-source limit | ASearcher capability |
| --- | --- | --- | --- |
| Resolve conflicting medal tables | 15–20 tabs, 30 min | ≤10 tool calls, gives up early | 40–70 calls, finds the official correction |
| Find an animal mentioned across three unrelated papers | Repeated keyword tweaks | Cannot connect all sources | Cross-doc inference, confirms “mice” with citations |

The takeaway is simple: complex questions need deep dives, and deep dives need long trajectories.
Traditional RL systems stop at ten steps because longer trajectories leave GPUs idle.
ASearcher solves the idle-GPU problem with fully asynchronous training, letting the agent search as long as necessary.


2. Three building blocks of ASearcher

2.1 Data that grows tougher at every step

| Source | Size after filtering | What makes it hard |
| --- | --- | --- |
| Public multi-hop QA (HotpotQA, 2WikiMultiHopQA) | 16 000 questions | Model must retrieve ≥2 documents |
| Auto-generated questions | 25 624 questions | Average 4.3 “fact injections” + 2.1 “fuzzing” steps each |

How the synthetic writer works

  1. Injection
    Start with “When was Michael P. Hein born?”
    Insert extra facts: “…the first Ulster County Executive who allowed the Catskill Mountain Railroad to keep running in 2016…”
    The question is now harder because more conditions must be checked.

  2. Fuzzing
    Replace “2016” with “the 2016 United States House elections period”, or swap “Catskill Mountain Railroad” for “a historic mountain railway”.
    Precision is lost, forcing the model to search rather than recall a memorised fact.

  3. Quality gates

    • A strong model (QwQ-32B) tries to answer without tools; if it succeeds, the question is discarded.
    • A second model confirms there is only one valid answer.
    • Human reviewers sample 5 % for final sanity checks.

This pipeline starts with 14 107 seed questions and finishes with 25 624 high-quality, tool-demanding items.
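
In pseudocode, the whole pipeline fits in one short loop. The sketch below is an illustration, not the repository's implementation; the four callables are hypothetical stand-ins for prompts sent to QwQ-32B (generation) and Qwen2.5-72B-Instruct (checking):

# Sketch of the injection -> fuzzing -> quality-gate loop (illustrative only).
# The four callables stand in for LLM prompts; the real logic lives in
# qa_synthesis/qa_synthesis_agent.py.
def synthesize(seed_q, seed_a, inject_fact, fuzz_entity,
               answerable_without_tools, has_unique_answer,
               inject_rounds=4, fuzz_rounds=2):
    q = seed_q
    for _ in range(inject_rounds):
        q = inject_fact(q)                    # add one more verifiable condition
    for _ in range(fuzz_rounds):
        q = fuzz_entity(q)                    # blur a precise name/date into a description
    if answerable_without_tools(q, seed_a):   # gate 1: solvable from memory -> discard
        return None
    if not has_unique_answer(q, seed_a):      # gate 2: ambiguous answer -> discard
        return None
    return {"question": q, "answer": seed_a}  # tool-demanding, single-answer item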

2.2 Training that never waits for the slowest query

| Bottleneck in old systems | ASearcher fix |
| --- | --- |
| Batch generation waits for the longest trajectory | Decoupled rollout and training: every trajectory runs independently |
| 10-turn hard limit to keep GPUs busy | Relaxed limit of 128 turns; the agent stops when the task is solved |
| High variance in run time | Asynchronous sampler feeds the trainer as soon as each trajectory ends |

Under the hood, the sampler and trainer live in separate processes.
Even if one trajectory takes 50 tool calls and another takes 2, the trainer always has fresh data and the GPUs stay busy.
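
A toy version of this design (not the actual AReaL code) fits in a few lines: rollout workers push each finished trajectory into a queue the moment its episode ends, and the trainer consumes whatever is ready, so a 2-call trajectory never waits for a 50-call one.

# Toy illustration of decoupled rollout and training; run_agent and the
# print call are stand-ins for real episode generation and a gradient step.
import multiprocessing as mp
import random
import time

def run_agent(task):
    time.sleep(random.uniform(0.1, 2.0))      # episodes vary wildly in length
    return f"trajectory({task})"

def rollout_worker(tasks, trajs):
    while (task := tasks.get()) is not None:  # None is the shutdown signal
        trajs.put(run_agent(task))

def trainer(trajs, n_updates=4, batch_size=4):
    for step in range(n_updates):
        batch = [trajs.get() for _ in range(batch_size)]  # no global barrier
        print(f"update {step}: trained on {len(batch)} trajectories")

if __name__ == "__main__":
    tasks, trajs = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=rollout_worker, args=(tasks, trajs)) for _ in range(8)]
    for w in workers:
        w.start()
    for i in range(16):
        tasks.put(f"question-{i}")
    trainer(trajs)
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()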

2.3 An agent with only two tools

| Tool | Input | Output |
| --- | --- | --- |
| Search engine | Text question | Top-10 snippets + URLs |
| Web browser | URL | Full page content in Markdown |

No external LLM, no extra planner, no memory bank.
Everything—reasoning, summarising, and verifying—happens inside the same model.
For large reasoning models (LRMs) like QwQ-32B, only the last 25 000 characters of history are kept to fit the context window.
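
Schematically, the whole agent reduces to the loop below. The generate and tool callables are hypothetical placeholders, not ASearcher's real interfaces; the one ASearcher-specific detail shown is the 25 000-character history truncation.

# Schematic two-tool agent loop (callables are hypothetical placeholders).
MAX_HISTORY_CHARS = 25_000  # context budget kept for LRMs like QwQ-32B

def answer(question, generate, search_engine, web_browser, max_turns=128):
    history = question
    for _ in range(max_turns):
        action = generate(history[-MAX_HISTORY_CHARS:])  # model sees only the tail
        if action["type"] == "search":
            history += search_engine(action["query"])    # top-10 snippets + URLs
        elif action["type"] == "browse":
            history += web_browser(action["url"])        # full page as Markdown
        else:
            return action["answer"]                      # the model decided it is done
    return None                                          # hit the relaxed 128-turn cap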


3. Benchmark results in plain numbers

3.1 Standard multi-hop and single-hop tasks (local knowledge base)

| Model size | Method | Average F1 | Average LLM-as-Judge |
| --- | --- | --- | --- |
| 7 B | ASearcher-Local | 58.0 | 61.0 |
| 7 B | Previous best (Search-R1-7B) | 54.3 | 55.4 |
| 14 B | ASearcher-Local | 60.0 | 65.6 |
| 14 B | Previous best (Search-R1-14B) | 55.4 | 56.8 |

3.2 Challenging web tasks (real-time search)

| Benchmark | What it tests | ASearcher-Web-QwQ (Avg@4) | Previous best (32 B) |
| --- | --- | --- | --- |
| GAIA | Real-world planning & verification | 52.8 | 48.1 (Search-o1) |
| xBench-DeepSearch | Deep retrieval & cross-doc reasoning | 42.1 | 40.3 (Search-o1) |
| Frames | Long-document fact checking | 70.9 | 67.0 (SimpleDS) |

3.3 Improvement after RL

| Benchmark | Before RL | After RL | Gain |
| --- | --- | --- | --- |
| GAIA | 43.7 | 52.8 | +9.1 |
| xBench-DeepSearch | 28.7 | 42.1 | +13.4 |
| Frames | 58.9 | 70.9 | +12.0 |

4. Hands-on: running ASearcher in three scenarios

4.1 Reproducing the published scores

  1. Get the code and weights

    git clone https://github.com/inclusionAI/ASearcher.git
    cd ASearcher
    pip install -r requirements.txt
    
  2. Download test data

    wget https://huggingface.co/datasets/inclusionAI/ASearcher-test-data/resolve/main/GAIA.tar.gz
    tar -xzf GAIA.tar.gz
    
  3. Run evaluation

    cd evaluation/
    export SERPER_API_KEY="your_key"
    export JINA_API_KEY="your_key"
    
    python3 search_eval_async.py \
      --data_names GAIA,xbench-deepsearch,Frames \
      --model_name_or_path inclusionAI/ASearcher-Web-QwQ \
      --output_dir ./results \
      --llm_as_judge \
      --pass-at-k 4
    

    After 30–60 minutes on an 8×A100 node, the console prints Avg@4 and Pass@4 scores that match Table 4 in the paper.

4.2 Fine-tuning a 7-billion-parameter model

Option A: single-node (slow but cheap)

cd AReaL
export SERPER_API_KEY="your_key"
export JINA_API_KEY="your_key"

python3 -m areal.launcher.local ASearcher/train/asearcher.py \
  --config ASearcher/configs/asearcher_web.yaml \
  experiment_name=my_run \
  trial_name=7b_local

Option B: 16-node cluster (recommended)

python3 -m areal.launcher.ray ASearcher/train/asearcher.py \
  --config ASearcher/configs/asearcher_web_16nodes.yaml \
  experiment_name=my_run \
  trial_name=7b_cluster \
  cluster.n_nodes=16 \
  cluster.n_gpus_per_node=8

  • Training on 35 k questions with a 32-turn limit takes ~48 hours on 128 A100s.
  • Logs are written to logs/; TensorBoard shows reward, F1, and token-count curves.

4.3 Building your own question set

  1. Prepare seed questions
    Any JSONL file with {"question": "...", "answer": "..."} works.
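
    For example (the answer value is an illustrative placeholder):

    {"question": "When was Michael P. Hein born?", "answer": "<ground-truth answer>"}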

  2. Start two SGLang servers

    • Port 30000 → QwQ-32B for generation
    • Port 30001 → Qwen2.5-72B-Instruct for quality checks
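
    With a standard SGLang install, the launch commands look roughly like this (model paths and --tp values are assumptions; set --tp to match your GPUs):

    python3 -m sglang.launch_server --model-path Qwen/QwQ-32B --port 30000 --tp 4
    python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-72B-Instruct --port 30001 --tp 8
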
  3. Launch synthesis

    python3 qa_synthesis/qa_synthesis_agent.py \
      --seed_path data/seed_qa.jsonl \
      --output_dir data/my_questions \
      --inject_rounds 4 \
      --fuzz_rounds 2
    

    The script outputs filtered JSONL ready for training.


5. Frequently asked questions

Q1: How much GPU memory do I really need?

  • 7 B model: 24 GB with ZeRO-3 offload.
  • 32 B model: 80 GB × 8 GPUs is comfortable.

Q2: Can I replace Serper with my own enterprise search?
Yes. Modify search_client.py; the rest of the pipeline stays unchanged.
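
As a rough sketch, assuming your enterprise endpoint returns JSON with title/snippet/url fields (match whatever interface search_client.py actually defines):

import requests

# Hypothetical drop-in client: return the same fields the agent expects
# (title, snippet, URL per hit); the endpoint URL below is a placeholder.
def search(query: str, top_k: int = 10):
    resp = requests.get(
        "https://search.internal.example.com/api",
        params={"q": query, "k": top_k},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"title": h["title"], "snippet": h["snippet"], "url": h["url"]}
        for h in resp.json()["hits"]
    ]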

Q3: My synthetic questions feel too easy.
Increase fuzz_rounds or lower the difficulty threshold in qa_synthesis_agent.py.

Q4: Why not use more tools (calculator, map, code executor)?
The paper keeps the design minimal to prove RL alone unlocks depth.
Feel free to add tools—just extend the action space and reward function.

Q5: Can I fine-tune smaller models like 3 B?
Technically yes, but the 7 B baseline already struggles with long-page summarisation; 3 B would likely fail to converge.
