Going Beyond Ten Clicks: How ASearcher Uses Asynchronous Reinforcement Learning to Push Open-Source Search Agents Past 40 Turns

Imagine you are asked to find the exact number of gold, silver, and bronze medals China won in the 2012 London Olympics as of 31 December 2024.
A quick search returns two conflicting totals: “38-27-22” and “39-31-22”.
A human researcher would open multiple official reports, cross-check doping appeals, and finally discover that one gold medal was later withdrawn.
That process can take dozens of web pages and many reasoning steps—far more than the ten-turn limit that most open-source language agents accept today.

ASearcher is the first fully open-source project that removes this barrier.
By combining a simple two-tool agent, an automatic question-building pipeline, and a fully asynchronous reinforcement-learning (RL) system, it routinely performs 40-plus search-and-browse actions and produces more than 150 000 tokens of reasoning before giving a final answer.
This post explains—in plain language—how it works, what it achieves, and how you can reproduce or adapt it for your own use case.


Table of contents

  1. Why long-horizon search matters
  2. Three building blocks of ASearcher
    2.1 Data that grows tougher at every step
    2.2 Training that never waits for the slowest query
    2.3 An agent with only two tools
  3. Benchmark results in plain numbers
  4. Hands-on: running ASearcher in three scenarios
    4.1 Reproducing the published scores
    4.2 Fine-tuning a 7-billion-parameter model
    4.3 Building your own question set
  5. Frequently asked questions

1. Why long-horizon search matters

| Task | Typical human effort | Old open-source limit | ASearcher capability |
| --- | --- | --- | --- |
| Resolve conflicting medal tables | 15–20 tabs, 30 min | ≤10 tool calls, gives up early | 40–70 calls, finds the official correction |
| Find an animal mentioned across three unrelated papers | Repeated keyword tweaks | Cannot connect all sources | Cross-doc inference, confirms “mice” with citations |

The takeaway is simple: complex questions need deep dives, and deep dives need long trajectories.
Traditional RL systems stop at ten steps because longer trajectories leave GPUs idle.
ASearcher solves the idle-GPU problem with fully asynchronous training, letting the agent search as long as necessary.


2. Three building blocks of ASearcher

2.1 Data that grows tougher at every step

| Source | Size after filtering | What makes it hard |
| --- | --- | --- |
| Public multi-hop QA (HotpotQA, 2WikiMultiHopQA) | 16 000 questions | Model must retrieve ≥2 documents |
| Auto-generated questions | 25 624 questions | Average 4.3 “fact injections” + 2.1 “fuzzing” steps each |

How the synthetic writer works

  1. Injection
    Start with “When was Michael P. Hein born?”
    Insert extra facts: “…the first Ulster County Executive who allowed the Catskill Mountain Railroad to keep running in 2016…”
    The question is now harder because more conditions must be checked.

  2. Fuzzing
    Replace “2016” with “the 2016 United States House elections period”, or swap “Catskill Mountain Railroad” for “a historic mountain railway”.
    Precision is lost, forcing the model to search rather than recall a memorised fact.

  3. Quality gates

    • A strong model (QwQ-32B) tries to answer without tools; if it succeeds, the question is discarded.
    • A second model confirms there is only one valid answer.
    • Human reviewers sample 5 % for final sanity checks.

This pipeline starts with 14 107 seed questions and finishes with 25 624 high-quality, tool-demanding items.
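
In pseudocode, the whole pipeline fits in one short loop. The sketch below is an illustration, not the repository's implementation; the four callables are hypothetical stand-ins for prompts sent to QwQ-32B (generation) and Qwen2.5-72B-Instruct (checking):

# Sketch of the injection -> fuzzing -> quality-gate loop (illustrative only).
# The four callables stand in for LLM prompts; the real logic lives in
# qa_synthesis/qa_synthesis_agent.py.
def synthesize(seed_q, seed_a, inject_fact, fuzz_entity,
               answerable_without_tools, has_unique_answer,
               inject_rounds=4, fuzz_rounds=2):
    q = seed_q
    for _ in range(inject_rounds):
        q = inject_fact(q)                    # add one more verifiable condition
    for _ in range(fuzz_rounds):
        q = fuzz_entity(q)                    # blur a precise name/date into a description
    if answerable_without_tools(q, seed_a):   # gate 1: solvable from memory -> discard
        return None
    if not has_unique_answer(q, seed_a):      # gate 2: ambiguous answer -> discard
        return None
    return {"question": q, "answer": seed_a}  # tool-demanding, single-answer item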

2.2 Training that never waits for the slowest query

| Bottleneck in old systems | ASearcher fix |
| --- | --- |
| Batch generation waits for the longest trajectory | Decoupled rollout and training: every trajectory runs independently |
| 10-turn hard limit to keep GPUs busy | Relaxed limit of 128 turns; the agent stops when the task is solved |
| High variance in run time | Asynchronous sampler feeds the trainer as soon as each trajectory ends |

Under the hood, the sampler and trainer live in separate processes.
Even if one trajectory takes 50 tool calls and another takes 2, the trainer always has fresh data and the GPUs stay busy.
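
A toy version of this design (not the actual AReaL code) fits in a few lines: rollout workers push each finished trajectory into a queue the moment its episode ends, and the trainer consumes whatever is ready, so a 2-call trajectory never waits for a 50-call one.

# Toy illustration of decoupled rollout and training; run_agent and the
# print call are stand-ins for real episode generation and a gradient step.
import multiprocessing as mp
import random
import time

def run_agent(task):
    time.sleep(random.uniform(0.1, 2.0))      # episodes vary wildly in length
    return f"trajectory({task})"

def rollout_worker(tasks, trajs):
    while (task := tasks.get()) is not None:  # None is the shutdown signal
        trajs.put(run_agent(task))

def trainer(trajs, n_updates=4, batch_size=4):
    for step in range(n_updates):
        batch = [trajs.get() for _ in range(batch_size)]  # no global barrier
        print(f"update {step}: trained on {len(batch)} trajectories")

if __name__ == "__main__":
    tasks, trajs = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=rollout_worker, args=(tasks, trajs)) for _ in range(8)]
    for w in workers:
        w.start()
    for i in range(16):
        tasks.put(f"question-{i}")
    trainer(trajs)
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()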

2.3 An agent with only two tools

| Tool | Input | Output |
| --- | --- | --- |
| Search engine | Text question | Top-10 snippets + URLs |
| Web browser | URL | Full page content in Markdown |

No external LLM, no extra planner, no memory bank.
Everything—reasoning, summarising, and verifying—happens inside the same model.
For large reasoning models (LRMs) like QwQ-32B, only the last 25 000 characters of history are kept to fit the context window.
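
Schematically, the whole agent reduces to the loop below. The generate and tool callables are hypothetical placeholders, not ASearcher's real interfaces; the one ASearcher-specific detail shown is the 25 000-character history truncation.

# Schematic two-tool agent loop (callables are hypothetical placeholders).
MAX_HISTORY_CHARS = 25_000  # context budget kept for LRMs like QwQ-32B

def answer(question, generate, search_engine, web_browser, max_turns=128):
    history = question
    for _ in range(max_turns):
        action = generate(history[-MAX_HISTORY_CHARS:])  # model sees only the tail
        if action["type"] == "search":
            history += search_engine(action["query"])    # top-10 snippets + URLs
        elif action["type"] == "browse":
            history += web_browser(action["url"])        # full page as Markdown
        else:
            return action["answer"]                      # the model decided it is done
    return None                                          # hit the relaxed 128-turn cap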


3. Benchmark results in plain numbers

3.1 Standard multi-hop and single-hop tasks (local knowledge base)

| Model size | Method | Average F1 | Average LLM-as-Judge |
| --- | --- | --- | --- |
| 7 B | ASearcher-Local | 58.0 | 61.0 |
| 7 B | Previous best (Search-R1-7B) | 54.3 | 55.4 |
| 14 B | ASearcher-Local | 60.0 | 65.6 |
| 14 B | Previous best (Search-R1-14B) | 55.4 | 56.8 |

3.2 Challenging web tasks (real-time search)

| Benchmark | What it tests | ASearcher-Web-QwQ (Avg@4) | Previous best (32 B) |
| --- | --- | --- | --- |
| GAIA | Real-world planning & verification | 52.8 | 48.1 (Search-o1) |
| xBench-DeepSearch | Deep retrieval & cross-doc reasoning | 42.1 | 40.3 (Search-o1) |
| Frames | Long-document fact checking | 70.9 | 67.0 (SimpleDS) |

3.3 Improvement after RL

| Benchmark | Before RL | After RL | Gain |
| --- | --- | --- | --- |
| GAIA | 43.7 | 52.8 | +9.1 |
| xBench-DeepSearch | 28.7 | 42.1 | +13.4 |
| Frames | 58.9 | 70.9 | +12.0 |

4. Hands-on: running ASearcher in three scenarios

4.1 Reproducing the published scores

  1. Get the code and weights

    git clone https://github.com/inclusionAI/ASearcher.git
    cd ASearcher
    pip install -r requirements.txt
    
  2. Download test data

    wget https://huggingface.co/datasets/inclusionAI/ASearcher-test-data/resolve/main/GAIA.tar.gz
    tar -xzf GAIA.tar.gz
    
  3. Run evaluation

    cd evaluation/
    export SERPER_API_KEY="your_key"
    export JINA_API_KEY="your_key"
    
    python3 search_eval_async.py \
      --data_names GAIA,xbench-deepsearch,Frames \
      --model_name_or_path inclusionAI/ASearcher-Web-QwQ \
      --output_dir ./results \
      --llm_as_judge \
      --pass-at-k 4
    

    After 30–60 minutes on an 8×A100 node, the console prints Avg@4 and Pass@4 scores that match Table 4 in the paper.

4.2 Fine-tuning a 7-billion-parameter model

Option A: single-node (slow but cheap)

cd AReaL
export SERPER_API_KEY="your_key"
export JINA_API_KEY="your_key"

python3 -m areal.launcher.local ASearcher/train/asearcher.py \
  --config ASearcher/configs/asearcher_web.yaml \
  experiment_name=my_run \
  trial_name=7b_local

Option B: 16-node cluster (recommended)

python3 -m areal.launcher.ray ASearcher/train/asearcher.py \
  --config ASearcher/configs/asearcher_web_16nodes.yaml \
  experiment_name=my_run \
  trial_name=7b_cluster \
  cluster.n_nodes=16 \
  cluster.n_gpus_per_node=8

  • Training on 35 k questions with a 32-turn limit takes ~48 hours on 128 A100s.
  • Logs are written to logs/; TensorBoard shows reward, F1, and token-count curves.

4.3 Building your own question set

  1. Prepare seed questions
    Any JSONL file with {"question": "...", "answer": "..."} works.
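
    For example (the answer value is an illustrative placeholder):

    {"question": "When was Michael P. Hein born?", "answer": "<ground-truth answer>"}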

  2. Start two SGLang servers

    • Port 30000 → QwQ-32B for generation
    • Port 30001 → Qwen2.5-72B-Instruct for quality checks
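
    With a standard SGLang install, the launch commands look roughly like this (model paths and --tp values are assumptions; set --tp to match your GPUs):

    python3 -m sglang.launch_server --model-path Qwen/QwQ-32B --port 30000 --tp 4
    python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-72B-Instruct --port 30001 --tp 8
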
  3. Launch synthesis

    python3 qa_synthesis/qa_synthesis_agent.py \
      --seed_path data/seed_qa.jsonl \
      --output_dir data/my_questions \
      --inject_rounds 4 \
      --fuzz_rounds 2
    

    The script outputs filtered JSONL ready for training.


5. Frequently asked questions

Q1: How much GPU memory do I really need?

  • 7 B model: 24 GB with ZeRO-3 offload.
  • 32 B model: 80 GB × 8 GPUs is comfortable.

Q2: Can I replace Serper with my own enterprise search?
Yes. Modify search_client.py; the rest of the pipeline stays unchanged.
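
As a rough sketch, assuming your enterprise endpoint returns JSON with title/snippet/url fields (match whatever interface search_client.py actually defines):

import requests

# Hypothetical drop-in client: return the same fields the agent expects
# (title, snippet, URL per hit); the endpoint URL below is a placeholder.
def search(query: str, top_k: int = 10):
    resp = requests.get(
        "https://search.internal.example.com/api",
        params={"q": query, "k": top_k},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"title": h["title"], "snippet": h["snippet"], "url": h["url"]}
        for h in resp.json()["hits"]
    ]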

Q3: My synthetic questions feel too easy.
Increase fuzz_rounds or lower the difficulty threshold in qa_synthesis_agent.py.

Q4: Why not use more tools (calculator, map, code executor)?
The paper keeps the design minimal to prove RL alone unlocks depth.
Feel free to add tools—just extend the action space and reward function.

Q5: Can I fine-tune smaller models like 3 B?
Technically yes, but the 7 B baseline already struggles with long-page summarisation; 3 B would likely fail to converge.
