WebDancer: Breakthroughs in Autonomous Information-Seeking Agents

WebAgent for Information Seeking built by Tongyi Lab, Alibaba Group

Introduction: A New Paradigm for Complex Problem-Solving

Traditional AI systems often struggle with complex real-world problems due to shallow, single-step information retrieval. Yet humans solve intricate tasks through multi-step reasoning and deep exploration—like researchers cross-referencing studies or validating hypotheses. Alibaba’s Tongyi Lab now addresses this gap with WebDancer, an open-source framework for training end-to-end autonomous information-seeking agents that browse the web and reason like humans.

Key breakthrough: WebDancer achieves 61.1% Pass@3 accuracy on GAIA and 54.6% on WebWalkerQA benchmarks, outperforming GPT-4o in specific tasks.

Part 1: Four Core Challenges in Deep Information Retrieval

Building truly autonomous agents requires solving:

Shallow Datasets
Existing QA datasets (e.g., 2Wiki) contain ~80% 1-2 step queries, inadequate for multi-hop reasoning. Real-world tasks demand 5+ steps (e.g., “Locate invasive species records → Extract geographic data → Convert to ZIP codes”).
Dynamic Environments
Websites constantly change structure. Minor UI updates (tested in 2025) reduced agent performance by 37%, demanding environment adaptation.
Long-Trajectory Optimization
Reinforcement learning (RL) fails with sparse rewards beyond 10 steps. QwQ-32B showed 21% invalid actions in 20-step tasks.
Tool Coordination Failures
Agents hallucinate non-existent tools (e.g., “calculate”) or repeat actions redundantly during multi-tool workflows (search + parsing + analysis).
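
Both tool-coordination failure modes can be caught by a small guard that runs before any tool call is executed. The following is a minimal sketch, not WebDancer's actual implementation: the registry contents and the function name `check_tool_call` are assumptions made for illustration.

```python
import json

# Assumed tool registry; the names are illustrative, not WebDancer's actual API.
ALLOWED_TOOLS = {"search", "visit"}


def check_tool_call(call, history):
    """Return None if the call is acceptable, else a short rejection reason."""
    if call.get("name") not in ALLOWED_TOOLS:
        return f"unknown tool: {call.get('name')!r}"
    # Canonical JSON makes argument order irrelevant when spotting repeats.
    key = json.dumps(call, sort_keys=True)
    if key in history:
        return "redundant call: identical to an earlier step"
    history.add(key)
    return None


history = set()
print(check_tool_call({"name": "search", "query": "USGS clownfish"}, history))
print(check_tool_call({"name": "calculate", "expr": "2+2"}, history))
print(check_tool_call({"name": "search", "query": "USGS clownfish"}, history))
```

The first call passes, the second is rejected as a hallucinated tool, and the third is rejected as a repeat of the first.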

Part 2: WebDancer’s Four-Stage Architecture

Stage 1: Data Synthesis – Engineering Deep QA Pairs

Dataset	Method	Key Feature	Size
CRAWLQA	Recursive crawl of arXiv/GitHub/Wiki	Mimics human browsing	60K
E2HQA	Iterative complexity escalation	Controls step count (3-15)	40K

Example transformation:
Simple: “What species is Nemo from Finding Nemo?”
→ Complex: “Where did this species, popularized as pets by Finding Nemo, establish invasive populations pre-2020 per USGS? Output ZIP codes.”

Stage 2: Trajectory Sampling – High-Quality Reasoning Chains

Dual-path sampling using:

graph LR
A[Problem Q] --> B{Sampling Strategy}
B --> C[Short CoT: GPT-4o]
B --> D[Long CoT: QwQ-32B]
C --> E[4-6 step trajectories]
D --> F[15+ step trajectories]
E & F --> G[3-Stage Filtration]
G --> H[Validity: Format checks]
G --> I[Correctness: GPT-4o verification]
G --> J[Quality: Logic coherence]
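
The three-stage filtration can be sketched as a funnel that applies the cheapest check first. This is an illustrative sketch under stated assumptions: the tag names follow the trajectory format used in this article, while `is_correct` and `is_coherent` are hypothetical caller-supplied judges standing in for the model-based checks.

```python
import re

# At least one tool call wrapped in tags is expected in every trajectory.
TOOL_CALL = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)


def is_valid(trajectory):
    """Validity: balanced think tags and at least one tool call."""
    balanced = trajectory.count("<think>") == trajectory.count("</think>")
    return balanced and TOOL_CALL.search(trajectory) is not None


def filter_trajectories(samples, is_correct, is_coherent):
    """Keep samples that pass validity, correctness, and quality, in that order.

    is_correct and is_coherent are hypothetical judges supplied by the caller.
    """
    kept = []
    for sample in samples:
        if not is_valid(sample["trajectory"]):
            continue  # cheap format check first
        if not is_correct(sample):
            continue  # answer must match the reference
        if not is_coherent(sample):
            continue  # reasoning must hang together
        kept.append(sample)
    return kept


samples = [
    {"trajectory": "<think>a</think><tool_call>{}</tool_call>", "answer": "x", "gold": "x"},
    {"trajectory": "<think>a<tool_call>{}</tool_call>", "answer": "x", "gold": "x"},
    {"trajectory": "<think>a</think><tool_call>{}</tool_call>", "answer": "y", "gold": "x"},
]
kept = filter_trajectories(
    samples,
    is_correct=lambda s: s["answer"] == s["gold"],
    is_coherent=lambda s: True,
)
print(len(kept))
```

Ordering the stages from cheapest to most expensive means the model-based judges only see trajectories that already passed the format check.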

Stage 3: Supervised Fine-Tuning – Cold-Start Initialization

Convert trajectories into structured format:

<think>Analyze Florida invasive species records</think>
<tool_call>{"name":"search","query":"USGS Amphiprion ocellaris Florida"}</tool_call>
<tool_response>...Top 10 results...</tool_response>
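
A trajectory in this format can be split back into typed steps with a short parser. The sketch below assumes only the three tags shown above and that tool calls carry a JSON payload; the function name `parse_trajectory` is illustrative.

```python
import json
import re

# One step is an opening tag, a body, and the matching closing tag.
STEP = re.compile(r"<(think|tool_call|tool_response)>(.*?)</\1>", re.DOTALL)


def parse_trajectory(text):
    """Split a trajectory string into (tag, payload) steps.

    tool_call payloads are decoded from JSON; other payloads stay as text.
    """
    steps = []
    for tag, body in STEP.findall(text):
        payload = json.loads(body) if tag == "tool_call" else body.strip()
        steps.append((tag, payload))
    return steps


example = (
    "<think>Analyze Florida invasive species records</think>\n"
    '<tool_call>{"name":"search","query":"USGS Amphiprion ocellaris Florida"}</tool_call>\n'
    "<tool_response>...Top 10 results...</tool_response>"
)
for tag, payload in parse_trajectory(example):
    print(tag, payload)
```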

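Key innovation: mask the observation tokens out of the loss (Eq.2), so the model is trained only on its own thoughts and tool calls and keeps its reasoning capabilities:

$$L=-\frac{1}{\sum_{i}\mathbb{I}[x_i\neq o]}\sum_{i}\mathbb{I}[x_i\neq o]\cdot\log\pi_{\theta}(x_i\mid\mathbf{tc},x_{<i})$$

Here $x_i$ is the $i$-th token of the trajectory, $o$ denotes tokens coming from tool observations, and $\mathbf{tc}$ is the task context.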
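
Numerically, the masked loss is an average of negative log-probabilities taken only over the tokens the agent produced itself. The following sketch uses plain Python lists and toy numbers; in a real training loop the log-probabilities would come from the model, and the name `masked_nll` is an assumption.

```python
def masked_nll(token_logprobs, is_observation):
    """Average negative log-likelihood over non-observation tokens only.

    token_logprobs[i] is log pi_theta(x_i | tc, x_<i);
    is_observation[i] is True when x_i came from a tool response.
    """
    kept = [lp for lp, obs in zip(token_logprobs, is_observation) if not obs]
    if not kept:
        raise ValueError("trajectory has no trainable tokens")
    return -sum(kept) / len(kept)


# Toy numbers: the middle two tokens are tool output and do not count.
logprobs = [-0.5, -2.0, -3.0, -1.5]
mask = [False, True, True, False]
print(masked_nll(logprobs, mask))  # 1.0
```

Without the mask the same tokens would average to 1.75, so text the model merely read from a tool response would dominate the gradient.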

Stage 4: Reinforcement Learning – Dynamic Sampling Optimization

DAPO algorithm (Eq.3-4) maximizes data efficiency:

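from statistics import mean

def dynamic_sampling_step(unlabeled_data, policy, reward_model, samples=16):
    """Roll out each QA pair and update only on medium-difficulty ones."""
    for qa_pair in unlabeled_data:
        candidates = policy.rollout(qa_pair, samples=samples)
        rewards = reward_model(candidates)
        if max(rewards) < 0.2:  # every rollout is low quality: skip the pair
            continue
        if 0.2 < mean(rewards) < 0.8:  # prioritize medium-difficulty pairs
            policy.update(candidates)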

Reward design (Eq.5): Total Reward = 0.1 × format_score + 0.9 × answer_score
→ Answer correctness judged by Qwen-72B-as-a-judge.
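
The reward combination is a plain weighted sum. A minimal sketch, assuming both scores are normalized to [0, 1] (the article does not state their range) and using an illustrative function name:

```python
def total_reward(format_score, answer_score, w_format=0.1, w_answer=0.9):
    """Weighted sum of format and answer scores, both expected in [0, 1]."""
    for name, value in (("format_score", format_score), ("answer_score", answer_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return w_format * format_score + w_answer * answer_score


print(total_reward(1.0, 0.0))  # well-formatted but wrong
print(total_reward(1.0, 1.0))  # well-formatted and correct
```

Under these weights a perfectly formatted but wrong answer earns only about 0.1, so format compliance alone cannot compensate for an incorrect answer.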

Part 3: Critical Experimental Results

1. Benchmark Dominance

Model / Framework	GAIA (Avg.)	WebWalkerQA (Avg.)
GPT-4o (non-agentic)	17.5%	5.5%
QwQ-32B + RAG	32.0%	31.2%
WebDancer (QwQ-32B)	46.6%	43.2%

GAIA Level-3 (hardest) accuracy surged from 8.3% to 25.0%.

2. Long-Horizon Reasoning Gains

Metric	SFT Baseline	+RL Optimization	Relative Change
Pass@3	45.6%	61.1%	↑ 34%
Cons@3	30.0%	39.7%	↑ 32%
Invalid Actions	13.6%	0.97%	↓ 93%
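
The last column is a relative change, not a difference in percentage points (Pass@3 rises by 15.5 points, which is about 34% of the SFT baseline). The figures can be rechecked with a few lines; the function name is illustrative.

```python
def relative_change(before, after):
    """Percentage change from before to after, relative to before."""
    return (after - before) / before * 100.0


rows = {
    "Pass@3": (45.6, 61.1),
    "Cons@3": (30.0, 39.7),
    "Invalid Actions": (13.6, 0.97),
}
for metric, (sft, rl) in rows.items():
    print(f"{metric}: {relative_change(sft, rl):+.1f}%")
```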

3. Data Efficiency Comparison

E2HQA’s strict filtering improved performance in the low-data regime by 2.3×.

Part 4: Real-World Task Execution

Task:
“Find ZIP codes where Nemo’s fish species (released by pet owners) was reported invasive pre-2020 per USGS.”

WebDancer’s Trajectory:

1. THINK: Confirm species = Orange clownfish (Amphiprion ocellaris)
   → SEARCH: "USGS Amphiprion ocellaris nonnative locations before 2020"
   → OBSERVE: USGS species profile link

2. THINK: Extract locations from page
   → VISIT: https://nas.er.usgs.gov/queries/FactSheet.aspx?speciesID=3243
   → OBSERVE: "Fred Howard Park, Pinellas County, Florida (2018)"

3. THINK: Convert location to ZIP
   → SEARCH: "Fred Howard Park, Pinellas County ZIP code"
   → OBSERVE: 34689

4. ANSWER: 34689

Demonstrates hypothesis testing → gap filling → tool orchestration.
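
The think → act → observe cycle in this walkthrough can be sketched as a small control loop. Everything below is a stand-in for illustration: the tools return canned strings, and the scripted `policy` replays the steps above instead of calling a model.

```python
def run_agent(policy, tools, question, max_steps=10):
    """Alternate policy decisions and tool calls until the policy answers.

    policy(question, history) returns ("answer", text) or (tool_name, argument).
    Both policy and tools are hypothetical stand-ins supplied by the caller.
    """
    history = []
    for _ in range(max_steps):
        action, argument = policy(question, history)
        if action == "answer":
            return argument, history
        if action not in tools:
            observation = f"error: unknown tool {action!r}"
        else:
            observation = tools[action](argument)
        history.append((action, argument, observation))
    return None, history  # step budget exhausted without an answer


# Canned tools and a scripted policy that replay the walkthrough.
tools = {
    "search": lambda q: "34689" if "ZIP" in q else "USGS species profile link",
    "visit": lambda url: "Pinellas County, Florida (2018)",
}
script = [
    ("search", "USGS Amphiprion ocellaris nonnative locations before 2020"),
    ("visit", "https://nas.er.usgs.gov/queries/FactSheet.aspx?speciesID=3243"),
    ("search", "Fred Howard Park, Pinellas County ZIP code"),
    ("answer", "34689"),
]
policy = lambda question, history: script[len(history)]

answer, history = run_agent(policy, tools, "Find the ZIP codes")
print(answer, len(history))
```

The step budget (`max_steps`) is a design choice of this sketch: it bounds the cost of a run when the policy never commits to an answer.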

Part 5: Limitations & Future Work

Tool Expansion
The current tools (search/click) will be extended to include browser automation and Python API sandboxes.
Long-Form Reasoning
New reward functions needed for research tasks (e.g., paper writing).
Compute Efficiency
RL rollouts average 3 minutes/task (16 samples).
Hybrid Reasoning
Dynamic Short/Long CoT switching under development.

Conclusion: The Autonomous Agent Frontier

WebDancer pioneers a reproducible pipeline for agentic AI:

Data synthesis → Trajectory sampling → SFT → RL optimization
Open-source code (GitHub)
Model-agnostic design (supports Qwen, DeepSeek, etc.)
Live demo (WebDancer Sandbox)

As the paper concludes: “This provides the community actionable pathways to develop agents for real-world information-seeking challenges.” With expanded tooling and efficiency gains, autonomous agents will transform research, education, and decision-making.

Resources:

WebDancer: Autonomous Information-Seeking Agents Outperforming GPT-4o