# Building an Expert-Level Medical Deep-Research Agent with Only 32 Billion Parameters
> A practical, end-to-end guide for developers, data scientists, and clinicians who want reproducible, high-quality medical reasoning.
## 1. Why do general “deep-research” tools stumble in medicine?
When ChatGPT, Gemini, or Claude first demonstrated multi-step web search, the demos looked magical.
Yet the moment we moved from “Who won the 2023 Nobel Prize in Chemistry?” to “What phase-II drugs target LMNA mutations in dilated cardiomyopathy?”, accuracy plunged.
| System | MedBrowseComp accuracy (50 questions) |
|---|---|
| o3-search | 19 % |
| Gemini-2.5-Pro deep-research | 25 % |
| MedResearcher-R1-32B | **27.5 % (new state of the art)** |
Two root causes surfaced:

1. **Sparse domain knowledge.** Rare diseases, off-label uses, and emerging trials sit below the 10⁻⁶ frequency line in general corpora.
2. **Tool mismatch.** Public web search rarely reaches FDA drug labels, EMA filings, or trial-registry protocols in a single hop.
The MedResearcher-R1 team solved both problems by re-building the training stack from the ground up instead of stacking more prompts on top of a general model.
## 2. The complete pipeline in one glance
```mermaid
graph LR
A[Rare medical entities] --> B[Sub-graph construction]
B --> C[Longest reasoning chain]
C --> D[Question-answer pairs]
D --> E[Masked trajectory]
E --> F[Supervised fine-tuning]
F --> G[Reinforcement learning]
G --> H[32 B model ready for API]
```
Three folders = three stages:

| Folder | Purpose |
|---|---|
| `KnowledgeGraphConstruction` | Generate challenging questions |
| `TrajectoryGenerationPipeline` | Turn Q-A pairs into multi-turn tool trajectories |
| `EvaluationPipeline` | Run benchmarks and live demos |
You can run the entire loop on a single 8-GPU node or rent four A100s for an afternoon.
## 3. Step-by-step reproduction

### 3.1 One-time setup (3 minutes)

| OS | Commands |
|---|---|
| Linux/macOS | `python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt` |
| Windows | `python -m venv .venv && .venv\Scripts\activate && pip install -r requirements.txt` |
| Conda (any OS) | `conda create -n med_r1 python=3.10 && conda activate med_r1 && pip install -r requirements.txt` |
> Python 3.10+ is required for modern `match-case` syntax.
### 3.2 Visual walk-through: see the knowledge graph before you trust it

```bash
python KnowledgeGraphConstruction/start_web.py
```

Open http://localhost:5000 in any browser. Upload a tiny CSV like:

```csv
entity
LMNA gene mutation
Dilated cardiomyopathy
ARRY-371797
```
The interactive force-directed graph shows entities as nodes and relationships as edges.
Click “max_chain” to let the system auto-extract the longest medically valid path and convert it into a natural-language question.
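Conceptually, “max_chain” is a longest-simple-path search over the entity sub-graph. The sketch below is our own toy illustration of that idea, not the repository’s actual algorithm; the entities and relations come from the demo CSV above.

```python
# Toy version of the "max_chain" idea: depth-first search for the longest
# simple (cycle-free) path in a tiny entity graph. Edge data is illustrative.
from collections import defaultdict

edges = [
    ("LMNA gene mutation", "Dilated cardiomyopathy", "causes"),
    ("Dilated cardiomyopathy", "ARRY-371797", "investigated treatment"),
]
adj = defaultdict(list)
for src, dst, rel in edges:
    adj[src].append(dst)

def longest_chain(adj):
    """Return the longest simple path found from any starting node."""
    best = []

    def dfs(node, path):
        nonlocal best
        if len(path) > len(best):
            best = path[:]
        for nxt in adj[node]:
            if nxt not in path:          # keep the path simple (no cycles)
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    nodes = set(adj) | {d for ds in adj.values() for d in ds}
    for start in nodes:
        dfs(start, [start])
    return best

print(longest_chain(adj))
# -> ['LMNA gene mutation', 'Dilated cardiomyopathy', 'ARRY-371797']
```

A question generator can then verbalize each edge’s relation along the winning path to produce a multi-hop question.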
### 3.3 Batch-generate questions (optional)

```bash
cd KnowledgeGraphConstruction
python batch_qa_cli.py \
  --seed-file demo_medical.csv \
  --output ../TrajectoryGenerationPipeline/dataset/qa.jsonl \
  --max-iterations 3
```
Each line in `qa.jsonl` looks like:

```json
{"id":"rare_002","question":"Which companies are developing PCSK9 inhibitors for familial hypercholesterolemia in 2025?","answer":"Regeneron (evinacumab), Novartis (XXXX-301)...","reasoning_chain":["Query PCSK9 FH indication → Filter ongoing trials → Map sponsor names"]}
```
### 3.4 Turn questions into training trajectories

Edit `TrajectoryGenerationPipeline/src/trajectory_generation/config.json`:

```json
{
  "llm_config": {
    "api_base": "https://api.openai.com/v1",
    "api_key_env": "OPENAI_API_KEY"
  },
  "generation": {
    "model": "gpt-4o-mini",
    "dataset": "qa.jsonl"
  }
}
```
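Note that the config names an environment variable (`api_key_env`) rather than embedding the key, so secrets stay out of version control. A sketch of how such a config is typically consumed (the helper name is ours, not the repository’s):

```python
# Parse the config and resolve the API key from the environment variable
# it names, so the key never lives in the file itself.
import json
import os

config_text = """
{
  "llm_config": {"api_base": "https://api.openai.com/v1",
                 "api_key_env": "OPENAI_API_KEY"},
  "generation": {"model": "gpt-4o-mini", "dataset": "qa.jsonl"}
}
"""

def resolve_llm_settings(config: dict) -> tuple[str, str]:
    """Return (api_base, api_key); an empty key means the env var is unset."""
    llm = config["llm_config"]
    return llm["api_base"], os.environ.get(llm["api_key_env"], "")

config = json.loads(config_text)
base, key = resolve_llm_settings(config)
print(base)  # -> https://api.openai.com/v1
```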
Run:

```bash
cd ../TrajectoryGenerationPipeline
python src/trajectory_generation/run_reasoning.py   # raw trajectories
python src/postprocessing/pipeline.py \
  --input_dir generation/gpt-4o-mini/qa \
  --mode eval_filter
python src/postprocessing/pipeline.py \
  --input_dir generation/gpt-4o-mini/qa \
  --mode rewrite
```
The final file `rewritten_results.jsonl` is ready for fine-tuning.
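In spirit, the `eval_filter` pass keeps only trajectories whose final answer agrees with the gold answer, so low-quality rollouts never reach training. The toy version below uses naive containment matching as a stand-in; the real pipeline’s matching logic may be considerably more sophisticated.

```python
# Hypothetical illustration of an eval_filter pass: drop trajectories whose
# final answer does not contain the gold answer (case-insensitive).
def eval_filter(trajectories: list[dict]) -> list[dict]:
    kept = []
    for traj in trajectories:
        predicted = traj["final_answer"].strip().lower()
        gold = traj["gold_answer"].strip().lower()
        if gold in predicted:            # loose containment match (assumption)
            kept.append(traj)
    return kept

demo = [
    {"final_answer": "Regeneron (evinacumab)", "gold_answer": "Regeneron"},
    {"final_answer": "Unknown sponsor", "gold_answer": "Regeneron"},
]
print(len(eval_filter(demo)))  # -> 1
```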
## 4. Training recipe: two stages, no tricks

| Stage | Goal | Data | Key safeguard |
|---|---|---|---|
| Supervised fine-tuning | Learn when to call which tool | 2,100 trajectories | 5 % random tool-failure injection |
| Reinforcement learning (GRPO) | Reward accuracy, penalize redundancy | Online rollouts | Reward = 1.0 × accuracy + 0.2 × expert-preference − 0.1 × redundant-calls |
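Written out, the GRPO reward from the table is a simple linear combination. The component ranges are our assumptions: accuracy and expert preference in [0, 1], redundant calls a non-negative count.

```python
# The reward formula from the table: 1.0*accuracy + 0.2*expert_preference
# - 0.1*redundant_calls. Component ranges are assumptions, not from the paper.
def reward(accuracy: float, expert_preference: float, redundant_calls: int) -> float:
    return 1.0 * accuracy + 0.2 * expert_preference - 0.1 * redundant_calls

# A fully correct rollout with one redundant tool call:
print(round(reward(accuracy=1.0, expert_preference=0.8, redundant_calls=1), 2))
# -> 1.06
```

The negative term means a rollout that reaches the right answer with fewer tool calls strictly dominates one that wanders, which is exactly the redundancy penalty the table describes.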
Hardware: 8×H200 finishes the 3 epochs in ~6 hours; 8×A100 needs one overnight job.
## 5. Deployment: from checkpoint to REST API

### 5.1 Launch the model

```bash
pip install "sglang[all]"
CUDA_VISIBLE_DEVICES=0,1 \
python -m sglang.launch_server \
  --model-path ./MedResearcher-R1-32B \
  --port 6001 --tp-size 2
```
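Once the server is up, sglang exposes an OpenAI-compatible HTTP API, so a minimal stdlib client can query it. The endpoint path and payload shape below follow the standard chat-completions convention; adjust the model name to whatever the server reports.

```python
# Minimal client for an OpenAI-compatible endpoint served on port 6001.
import json
import urllib.request

def build_payload(question: str, model: str = "MedResearcher-R1-32B") -> dict:
    """Chat-completions request body for a single user question."""
    return {"model": model,
            "messages": [{"role": "user", "content": question}]}

def ask(question: str, host: str = "http://localhost:6001") -> str:
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_payload(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# ask("Which phase-II drugs target LMNA mutations in dilated cardiomyopathy?")
```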
### 5.2 Single-question debugging

```bash
cd EvaluationPipeline
python eval_cli.py --mode interactive
```

You will see the model’s thought process, tool calls, and final answer in real time.
### 5.3 Batch benchmark

```bash
python eval_cli.py \
  --mode batch \
  --dataset medbrowsecomp \
  --workers 20
```
A `score.json` file is produced, reproducing the 27.5 % score reported in the paper.
## 6. FAQ

**Q1: Can I run this on a single RTX 4090?**
Yes. Reduce `--max-iterations` to 1 and use LoRA fine-tuning. The provided `open_data.jsonl` is small enough for 12 GB VRAM.
**Q2: Is the framework restricted to medicine?**
No. Replace the seed CSV with legal, finance, or industrial-safety entities; the rest of the pipeline remains identical.
**Q3: Why mask entities in trajectories?**
Masking prevents the model from memorizing answers. It must actively retrieve masked entities using tools, achieving true reasoning instead of pattern matching.
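As a rough illustration of what masking means in practice (the placeholder format and alias handling here are our assumptions, not the repository’s):

```python
# Toy trajectory masking: replace seed entities with numbered placeholders so
# the model must re-derive them via tool calls instead of recalling them.
def mask_entities(text: str, entities: list[str]) -> str:
    for i, entity in enumerate(entities, start=1):
        text = text.replace(entity, f"[ENTITY_{i}]")
    return text

traj = "ARRY-371797 is being tested for dilated cardiomyopathy."
print(mask_entities(traj, ["ARRY-371797", "dilated cardiomyopathy"]))
# -> [ENTITY_1] is being tested for [ENTITY_2].
```

A production version would also need to catch aliases, abbreviations, and case variants of each entity, otherwise partial leaks defeat the purpose.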
## 7. From technology to product mindset

The team’s original reflection (translated from Chinese):

> “We used to think: build a generic model, then add prompts for each use case. The breakthrough came when we flipped the order: build the entire product system first, then train the model inside that system. With four people and one month, two weeks went to infrastructure. Slow is fast.”
Today, the open-source release ships:

- A point-and-click web UI for knowledge-graph exploration
- One-command batch generation and filtering scripts
- A reproducible benchmark harness
Future work already on the roadmap:

- Multi-modal tools (radiology images, pathology slides, EHR)
- Clinician-in-the-loop RLHF
- Context-length expansion from 32 K to 128 K tokens
## 8. Citation

```bibtex
@article{medresearcher2025,
  title={MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework},
  author={Yu, Ailing and Yao, Lan and Liu, Jingnan and Chen, Zhe and Yin, Jiajun and Wang, Yuan and Liao, Xinhao and Ye, Zhiling and Li, Ji and Yue, Yun and Xiao, Hansong and Zhou, Hualei and Guo, Chunxiao and Wei, Peng and Gu, Jinjie},
  journal={arXiv preprint arXiv:2508.14880},
  year={2025}
}
```
If you have walked through the commands above, you now own an end-to-end, domain-agnostic Deep-Research factory. Swap in new seed entities, press “run,” and the same pipeline will climb the next knowledge mountain.