Building an Expert-Level Medical Deep-Research Agent with Only 32 Billion Parameters

A practical, end-to-end guide for developers, data scientists, and clinicians who want reproducible, high-quality medical reasoning.


1. Why do general “deep-research” tools stumble in medicine?

When ChatGPT, Gemini, or Claude first demonstrated multi-step web search, the demos looked magical.
Yet the moment we moved from “Who won the 2023 Nobel Prize in Chemistry?” to “What phase-II drugs target LMNA mutations in dilated cardiomyopathy?”, accuracy plunged.

System                         MedBrowseComp accuracy (50 questions)
o3-search                      19 %
Gemini-2.5-Pro deep-research   25 %
MedResearcher-R1-32B           27.5 % (new state-of-the-art)

Two root causes surfaced:

  1. Sparse domain knowledge
    Rare diseases, off-label uses, and emerging trials sit below the 10⁻⁶ frequency line in general corpora.

  2. Tool mismatch
    Public web search rarely reaches FDA drug labels, EMA filings, or trial-registry protocols in a single hop.

The MedResearcher-R1 team solved both problems by re-building the training stack from the ground up instead of stacking more prompts on top of a general model.


2. The complete pipeline in one glance

graph LR
    A[Rare medical entities] --> B[Sub-graph construction]
    B --> C[Longest reasoning chain]
    C --> D[Question-answer pairs]
    D --> E[Masked trajectory]
    E --> F[Supervised fine-tuning]
    F --> G[Reinforcement learning]
    G --> H[32 B model ready for API]

Three folders = three stages

Folder                         Purpose
KnowledgeGraphConstruction     Generate challenging questions
TrajectoryGenerationPipeline   Turn Q-A pairs into multi-turn tool trajectories
EvaluationPipeline             Run benchmarks and live demos

You can run the entire loop on a single 8-GPU node or rent four A100s for an afternoon.


3. Step-by-step reproduction

3.1 One-time setup (3 minutes)

OS               Commands
Linux/macOS      python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
Windows          python -m venv .venv && .venv\Scripts\activate && pip install -r requirements.txt
Conda (any OS)   conda create -n med_r1 python=3.10 && conda activate med_r1 && pip install -r requirements.txt

Python 3.10+ is required because the pipeline code uses the match-case syntax introduced in that release.

3.2 Visual walk-through: see the knowledge graph before you trust it

python KnowledgeGraphConstruction/start_web.py

Open http://localhost:5000 in any browser.
Upload a tiny CSV like:

entity
LMNA gene mutation
Dilated cardiomyopathy
ARRY-371797

The interactive force-directed graph shows entities as nodes and relationships as edges.
Click “max_chain” to let the system auto-extract the longest medically valid path and convert it into a natural-language question.
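
Conceptually, "max_chain" is a longest-path search over the relationship graph. The real KnowledgeGraphConstruction code also checks the medical validity of each hop; the sketch below only shows the graph-walk part, with illustrative edge labels:

```python
def longest_chain(edges):
    """Longest simple path in a small directed relationship graph.

    edges: iterable of (source_entity, relation, target_entity) triples.
    """
    graph = {}
    for src, _rel, dst in edges:
        graph.setdefault(src, []).append(dst)

    best = []

    def dfs(node, path):
        nonlocal best
        if len(path) > len(best):
            best = path[:]
        for nxt in graph.get(node, []):
            if nxt not in path:  # guard against cycles
                dfs(nxt, path + [nxt])

    for start in {s for s, _, _ in edges}:
        dfs(start, [start])
    return best

edges = [
    ("LMNA gene mutation", "causes", "Dilated cardiomyopathy"),
    ("ARRY-371797", "targets", "LMNA gene mutation"),
]
```

With the two demo edges above, the longest chain is ARRY-371797 → LMNA gene mutation → Dilated cardiomyopathy, which the pipeline would then verbalize as a question.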

3.3 Batch-generate questions (optional)

cd KnowledgeGraphConstruction
python batch_qa_cli.py \
  --seed-file demo_medical.csv \
  --output ../TrajectoryGenerationPipeline/dataset/qa.jsonl \
  --max-iterations 3

Each line in qa.jsonl looks like:

{"id":"rare_002","question":"Which companies are developing PCSK9 inhibitors for familial hypercholesterolemia in 2025?","answer":"Regeneron (evinacumab), Novartis (XXXX-301)...","reasoning_chain":["Query PCSK9 FH indication → Filter ongoing trials → Map sponsor names"]}
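
Before feeding the file downstream, it is worth validating each line against the fields the trajectory stage expects. A minimal checker, assuming the field names in the sample line above (extra fields are allowed):

```python
import json

REQUIRED = {"id", "question", "answer", "reasoning_chain"}

def validate_qa_line(line: str) -> dict:
    """Parse one qa.jsonl line and fail loudly on missing fields."""
    record = json.loads(line)
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record
```

Run it over the file once before starting trajectory generation; a malformed line caught here is much cheaper than a failed multi-hour generation run.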

3.4 Turn questions into training trajectories

Edit TrajectoryGenerationPipeline/src/trajectory_generation/config.json:

{
  "llm_config": {
    "api_base": "https://api.openai.com/v1",
    "api_key_env": "OPENAI_API_KEY"
  },
  "generation": {
    "model": "gpt-4o-mini",
    "dataset": "qa.jsonl"
  }
}
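
Note that api_key_env names an environment variable rather than embedding the key in the file. A sketch of how such a config might be consumed (the repo's actual loader may differ):

```python
import json
import os

def load_llm_config(config_text: str) -> dict:
    """Resolve the config above into a flat dict, reading the API key
    from the environment variable named by api_key_env."""
    cfg = json.loads(config_text)
    llm = cfg["llm_config"]
    return {
        "api_base": llm["api_base"],
        "api_key": os.environ.get(llm["api_key_env"], ""),
        "model": cfg["generation"]["model"],
        "dataset": cfg["generation"]["dataset"],
    }
```

Keeping the key out of config.json means the file can be committed to version control safely; export OPENAI_API_KEY in your shell before running the pipeline.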

Run:

cd ../TrajectoryGenerationPipeline
python src/trajectory_generation/run_reasoning.py        # raw trajectories
python src/postprocessing/pipeline.py \
  --input_dir generation/gpt-4o-mini/qa \
  --mode eval_filter
python src/postprocessing/pipeline.py \
  --input_dir generation/gpt-4o-mini/qa \
  --mode rewrite

The final file rewritten_results.jsonl is ready for fine-tuning.


4. Training recipe: two stages, no tricks

Stage 1: Supervised fine-tuning
  Goal: learn when to call which tool
  Data: 2,100 trajectories
  Key safeguard: 5 % random tool-failure injection

Stage 2: Reinforcement learning (GRPO)
  Goal: reward accuracy and penalize redundancy
  Data: online rollouts
  Key safeguard: reward = 1.0 × accuracy + 0.2 × expert preference − 0.1 × redundant calls
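
The RL reward is a simple weighted sum, directly from the formula above:

```python
def grpo_reward(accuracy: float, expert_pref: float, redundant_calls: int) -> float:
    """Reward for one rollout:
    1.0 * accuracy + 0.2 * expert preference - 0.1 * redundant tool calls."""
    return 1.0 * accuracy + 0.2 * expert_pref - 0.1 * redundant_calls
```

A fully correct answer with full expert preference and two redundant tool calls scores 1.0 + 0.2 − 0.2 = 1.0, so redundancy directly erodes the preference bonus.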

Hardware: 8×H200 GPUs finish the 3 training epochs in about 6 hours;
8×A100 GPUs need one overnight job.


5. Deployment: from checkpoint to REST API

5.1 Launch the model

pip install sglang[all]
CUDA_VISIBLE_DEVICES=0,1 \
python -m sglang.launch_server \
  --model-path ./MedResearcher-R1-32B \
  --port 6001 --tp-size 2
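
SGLang serves an OpenAI-compatible HTTP API, so a chat request is a plain JSON POST. A minimal client sketch using only the standard library; the model name and port match the launch command above, and the temperature value is an illustrative choice:

```python
import json
import urllib.request

def build_request(question: str, base: str = "http://localhost:6001/v1"):
    """Build an OpenAI-style chat-completion request for the local server."""
    payload = {
        "model": "MedResearcher-R1-32B",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# resp = json.load(urllib.request.urlopen(build_request("What phase-II drugs target LMNA mutations?")))
```

Any OpenAI-compatible client library can be pointed at http://localhost:6001/v1 in the same way.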

5.2 Single-question debugging

cd EvaluationPipeline
python eval_cli.py --mode interactive

You will see the model’s thought-process, tool calls, and final answer in real time.

5.3 Batch benchmark

python eval_cli.py \
  --mode batch \
  --dataset medbrowsecomp \
  --workers 20

A score.json file is produced that reproduces the 27.5 % score reported in the paper.
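
The aggregation behind such a score file is straightforward; a sketch under the assumption that each question yields a boolean "correct" flag (the actual score.json schema may differ):

```python
def summarize(results: list) -> dict:
    """Aggregate per-question correctness into a benchmark summary."""
    correct = sum(1 for r in results if r["correct"])
    return {
        "total": len(results),
        "correct": correct,
        "accuracy_pct": round(100.0 * correct / len(results), 1),
    }
```

This makes it easy to diff runs: re-run the benchmark after any pipeline change and compare the summaries side by side.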


6. FAQ

Q1: Can I run this on a single RTX 4090?
Yes. Reduce --max-iterations to 1 and use LoRA fine-tuning. The provided open_data.jsonl is small enough for 12 GB VRAM.

Q2: Is the framework restricted to medicine?
No. Replace the seed CSV with legal, finance, or industrial-safety entities; the rest of the pipeline remains identical.

Q3: Why mask entities in trajectories?
Masking prevents the model from memorizing answers. It must actively retrieve masked entities using tools, achieving true reasoning instead of pattern matching.
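
An illustrative version of that masking step: known answer entities in a trajectory are replaced by placeholder tokens so the model must re-derive them with tools. The placeholder format here is an assumption, not the repo's exact scheme:

```python
import re

def mask_entities(text: str, entities: list) -> str:
    """Replace each entity with a numbered placeholder, longest first
    so that substrings of longer entities are not masked twice."""
    for i, ent in enumerate(sorted(entities, key=len, reverse=True), 1):
        text = re.sub(re.escape(ent), f"[ENTITY_{i}]", text, flags=re.IGNORECASE)
    return text
```

During training the model sees only the masked text plus the tool outputs, so reproducing the answer requires actual retrieval rather than recall.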


7. From technology to product mindset

The team’s original reflection (translated from Chinese):

“We used to think: build a generic model, then add prompts for each use case.
The breakthrough came when we flipped the order: build the entire product system first, then train the model inside that system.
With a team of four and one month of time, two weeks went to infrastructure. Slow is fast.”

Today, the open-source release ships:

  • A point-and-click web UI for knowledge-graph exploration
  • One-command batch generation and filtering scripts
  • A reproducible benchmark harness

Future work already on the roadmap:

  1. Multi-modal tools (radiology images, pathology slides, EHR)
  2. Clinician-in-the-loop RLHF
  3. Context length expansion from 32 K to 128 K tokens

8. Citation

@article{medresearcher2025,
  title={MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework},
  author={Yu, Ailing and Yao, Lan and Liu, Jingnan and Chen, Zhe and Yin, Jiajun and Wang, Yuan and Liao, Xinhao and Ye, Zhiling and Li, Ji and Yue, Yun and Xiao, Hansong and Zhou, Hualei and Guo, Chunxiao and Wei, Peng and Gu, Jinjie},
  journal={arXiv preprint arXiv:2508.14880},
  year={2025}
}

If you have walked through the commands above, you now own an end-to-end, domain-agnostic Deep-Research factory. Swap in new seed entities, press “run,” and the same pipeline will climb the next knowledge mountain.