Introduction
We live in an era where search is everywhere. From asking Google “What’s the weather like in Tokyo tomorrow?” to querying ChatGPT about “How to implement a vector database,” information retrieval shapes almost every decision we make.
But here’s the catch: most existing systems struggle when the question is complex, multi-step, or requires long reasoning. For example:
“List 19th-century female painters in Paris and identify which museums currently exhibit their works.”
That’s not a single keyword match. It’s a multi-hop reasoning task involving entity linking, temporal filtering, knowledge integration, and source verification.
Traditional search engines fail because they’re optimized for surface-level keyword retrieval, not structured reasoning. Even large language models (LLMs) like GPT-4o or Claude often stumble—they either hallucinate or stop after one or two reasoning steps.
This is where DeepDive comes in. It’s not just another LLM—it’s a framework for training deep search agents capable of complex, multi-turn reasoning and browsing.
What is DeepDive?
At its core, DeepDive is a research framework from THUDM that aims to make AI better at deep, multi-step search tasks.
Think of it as a “training ground” where agents learn not just to answer questions, but to search like a human researcher:
- Identify where to look,
- Follow multiple paths,
- Validate sources,
- And adjust strategies if the first attempt fails.
Why is this important?
Because real-world information needs are rarely simple. Doctors, lawyers, scientists, and journalists often need to combine multiple pieces of evidence before reaching conclusions. A system that can do this reliably unlocks massive potential for research automation, enterprise search, and scientific discovery.
Current project status
- ✅ Dataset released: 4,108 QA pairs and supervised fine-tuning (SFT) trajectories are available on HuggingFace.
- 🔜 Models coming soon: DeepDive-9B and DeepDive-32B are in development.
- 🔥 Paper published: arXiv preprint.
How DeepDive Works
DeepDive’s workflow has two major stages: automated data synthesis and multi-turn reinforcement learning (RL).
Stage 1: Automated Data Synthesis
Instead of relying on expensive human annotation, DeepDive generates its own training questions using knowledge graphs like KILT and AMiner.
The process has three steps:
1. Knowledge Graph Random Walks
   - Start at a node (e.g., Marie Curie).
   - Walk along 5–9 edges, collecting a path of related entities.
   - Example path: Scientist → Research Field → Co-author → Award → Institution.
   - The longer the path, the harder the reasoning required.
2. Entity Obfuscation
   - Instead of asking directly about Marie Curie, the system describes her indirectly: “A European female scientist who won two Nobel Prizes.”
   - This forces the model to resolve ambiguity through search instead of memorization.
3. Difficulty Filtering
   - Questions are tested on a strong model (GPT-4o).
   - If GPT-4o fails four times in a row, the question is considered “hard enough.”
   - This ensures the dataset only contains non-trivial, high-difficulty challenges.
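The steps above can be sketched as a toy pipeline. This is an illustration only, not DeepDive's actual code: the graph below is a handful of made-up edges, `strong_model` is a placeholder for a GPT-4o call, the obfuscation step (a description rewrite) is omitted, and real walks run over KILT/AMiner-scale graphs.

```python
import random

# Toy knowledge graph: entity -> list of (relation, neighbor) edges.
# The entities and edges are illustrative placeholders, not real KG data.
KG = {
    "Marie Curie": [("field", "Radioactivity"), ("award", "Nobel Prize in Physics")],
    "Radioactivity": [("studied_by", "Henri Becquerel")],
    "Henri Becquerel": [("award", "Nobel Prize in Physics")],
    "Nobel Prize in Physics": [("awarded_by", "Royal Swedish Academy of Sciences")],
}

def random_walk(start, min_hops=5, max_hops=9):
    """Step 1: walk 5-9 edges from a start node, collecting the entity path."""
    path, node = [start], start
    for _ in range(random.randint(min_hops, max_hops)):
        edges = KG.get(node, [])
        if not edges:
            break  # dead end in the toy graph: stop early
        _, node = random.choice(edges)
        path.append(node)
    return path

def is_hard_enough(question, gold_answer, strong_model, attempts=4):
    """Step 3: keep a question only if the strong model fails `attempts` times in a row."""
    return all(strong_model(question) != gold_answer for _ in range(attempts))
```

A question derived from a longer `random_walk` path requires more hops of reasoning to answer, which is exactly the difficulty knob the pipeline turns.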
Stage 2: Multi-Turn Reinforcement Learning
Once the dataset is ready, the agent is trained via multi-turn RL.
Here’s how it works:
1. At step t, the agent generates a chain of thought (its reasoning so far).
2. It executes a search/browse action (like clicking a link or issuing a query).
3. It observes the resulting content, then updates its reasoning.
This mimics how humans search: we think, act, observe, then adjust.
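The think-act-observe loop can be sketched as follows. The `llm` and `search` callables are placeholders for whatever model and browsing backend you plug in, and the "ANSWER:" convention is an assumption of this sketch, not DeepDive's real protocol:

```python
def run_agent(question, llm, search, max_turns=8):
    """Multi-turn search loop: think, act, observe, repeat.

    Placeholder interfaces (not DeepDive's actual API):
    - llm(history) -> (thought, action), where action is a search query,
      or the literal "ANSWER: ..." once the agent is confident.
    - search(query) -> observed text for that query.
    """
    history = [f"Question: {question}"]
    for _ in range(max_turns):
        thought, action = llm(history)            # step t: chain of thought + action
        history.append(f"Thought: {thought}")
        if action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip()
        observation = search(action)              # execute the search/browse action
        history.append(f"Action: {action}")
        history.append(f"Observation: {observation}")  # observe, then loop and re-reason
    return None  # ran out of turns without committing to an answer
```

The `max_turns` budget is the same knob the tool-call-scaling experiments later in this article vary: more turns means longer horizons.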
The Algorithm: GRPO
DeepDive uses Group Relative Policy Optimization (GRPO), which normalizes rewards across a group of trajectories to stabilize training.
Formula (simplified):
A_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}
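The normalization is simple enough to compute directly. A minimal sketch, assuming each group is just a list of scalar rewards (the population-vs-sample std choice and the zero-variance fallback are assumptions of this sketch):

```python
import statistics

def group_advantages(rewards):
    """GRPO-style advantages: A_i = (r_i - mean(r)) / std(r) over one group.

    If every trajectory in the group got the same reward, the std is 0 and
    there is no learning signal, so all advantages are set to 0.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the sampled group
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

With the binary reward described next, a group where only one of four trajectories succeeds yields one positive advantage and three negative ones, which is the relative signal GRPO trains on.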
The Reward: Strict Binary
- If the final answer is correct AND properly formatted → reward = 1
- Else → reward = 0
This strictness prevents “reward hacking” (gaming the scoring system without actually solving the task).
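As a sketch, the strict rule is a single conjunction. The "ANSWER:" prefix stands in for whatever format DeepDive actually requires, which this article does not specify:

```python
def reward(final_answer, gold_answer):
    """Strict binary reward: 1 only if the answer is both correct and
    properly formatted, 0 otherwise. No partial credit means there is
    nothing to game: a well-formatted wrong answer still scores 0."""
    if not final_answer.startswith("ANSWER:"):
        return 0  # bad format alone forfeits the reward
    extracted = final_answer[len("ANSWER:"):].strip()
    return 1 if extracted.lower() == gold_answer.strip().lower() else 0
```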
Dataset & Models
DeepDive’s dataset consists of 4,108 QA pairs, each with a supervised fine-tuning (SFT) trajectory, synthesized from knowledge graphs as described above.
Model Variants
- DeepDive-9B: lighter model, still competitive.
- DeepDive-32B: more powerful, best performance so far.
Performance on BrowseComp benchmark:
- DeepDive-9B → 6.3%
- DeepDive-32B → 14.8%
Not perfect, but significantly stronger than most open-source models.
Results & Benchmarks
DeepDive was tested against both proprietary and open-source baselines.
Highlights:
- Outperforms many open-source models like Qwen2.5, DeepSeek-V3, and GLM-Z1.
- Competitive with proprietary systems like Claude-4-Sonnet-Thinking.
- On simpler datasets (HotpotQA, WebWalker), DeepDive generalizes well, showing it’s not just tuned for one benchmark.
Test-Time Scaling
One of DeepDive’s coolest tricks is how it scales at inference time.
Tool Call Scaling
If you let the model make more search attempts (tool calls), accuracy improves:
- 8 calls → 8%
- 128 calls → 15%
This shows the model benefits from longer horizons.
Parallel Sampling
Instead of relying on one reasoning path, DeepDive runs 8 paths in parallel.
It then:
- Uses majority voting, or
- Picks the answer with the fewest tool calls.
The latter works best: accuracy jumped from 12% (single-shot) to 24.8%.
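Both selection strategies are a few lines each. A sketch, assuming each sampled trajectory is reduced to an `(answer, num_tool_calls)` pair:

```python
from collections import Counter

def majority_vote(samples):
    """Pick the most common final answer across parallel samples.
    Each sample is an (answer, num_tool_calls) pair."""
    return Counter(answer for answer, _ in samples).most_common(1)[0][0]

def fewest_tool_calls(samples):
    """Pick the answer from the trajectory that needed the fewest tool calls,
    on the intuition that an agent which stops searching early was confident."""
    return min(samples, key=lambda sample: sample[1])[0]
```

Note the two can disagree: if one short trajectory reached a minority answer, `fewest_tool_calls` will prefer it over the majority.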
Semi-Automated i.i.d. QA Extension
Beyond KG data, DeepDive also experiments with i.i.d. (independent & identically distributed) QA generation.
Results:
- 32B (KG only): 14.8% on BrowseComp
- 32B (i.i.d. added): 22.2%
Takeaway: mixing data sources makes the model stronger.
Practical Use Cases
So why should you care as a researcher, developer, or company?
- Researchers: build better evaluation frameworks for reasoning tasks.
- Developers: train smarter search agents for apps, bots, and assistants.
- Enterprises: supercharge internal knowledge search, especially across scientific, legal, or medical documents.
HowTo: Get Started with DeepDive Dataset
If you want to explore today, here’s how:
1. Visit HuggingFace
   Go to the DeepDive dataset page.
2. Install the library
   ```shell
   pip install datasets
   ```
3. Load the dataset
   ```python
   from datasets import load_dataset

   dataset = load_dataset("zai-org/DeepDive")
   print(dataset)
   ```
4. Explore QA pairs
   ```python
   print(dataset["train"][0])
   ```
5. Start experimenting
   Use the QA pairs for SFT or RL experiments in your own agent framework.
📝 Tip: The models are not yet released, but preparing with the dataset will help you hit the ground running when they arrive.
FAQ
❓ What is DeepDive in simple terms?
It’s a framework to train AI agents that can search like humans, using multiple steps and reasoning instead of single-shot answers.
❓ How is it different from Google or Bing?
Search engines retrieve documents. DeepDive retrieves, reasons, cross-checks, and outputs a final synthesized answer.
❓ Why knowledge graphs?
They provide structured, multi-hop data that’s ideal for generating challenging questions automatically.
❓ Can it handle Chinese?
Yes. DeepDive was evaluated on BrowseComp-ZH and showed strong results.
❓ How can I use it today?
Start with the open dataset. When the models drop, you’ll be able to fine-tune or integrate them directly.
Conclusion
DeepDive represents a major step forward in AI search and reasoning:
- Automated dataset generation from knowledge graphs.
- Multi-turn reinforcement learning that mimics human browsing.
- Scaling strategies like tool call expansion and parallel sampling.
While still early, DeepDive shows that AI can be trained to search deeply, reason carefully, and generalize across tasks.
The next frontier? Turning these agents into real-world research assistants, enterprise copilots, and educational tools.
Deep search isn’t just about finding answers. It’s about teaching machines to think with us, not just for us.