Building an Expert-Level Medical Deep-Research Agent with Only 32 Billion Parameters

A practical, end-to-end guide for developers, data scientists, and clinicians who want reproducible, high-quality medical reasoning.


1. Why do general “deep-research” tools stumble in medicine?

When ChatGPT, Gemini, or Claude first demonstrated multi-step web search, the demos looked magical.
Yet the moment we moved from “Who won the 2023 Nobel Prize in Chemistry?” to “What phase-II drugs target LMNA mutations in dilated cardiomyopathy?”, accuracy plunged.

System                         MedBrowseComp accuracy (50 questions)
o3-search                      19 %
Gemini-2.5-Pro deep-research   25 %
MedResearcher-R1-32B           27.5 % (new state-of-the-art)

Two root causes surfaced:

  1. Sparse domain knowledge
    Rare diseases, off-label uses, and emerging trials sit below the 10⁻⁶ frequency line in general corpora.

  2. Tool mismatch
    Public web search rarely reaches FDA drug labels, EMA filings, or trial-registry protocols in a single hop.

The MedResearcher-R1 team solved both problems by re-building the training stack from the ground up instead of stacking more prompts on top of a general model.


2. The complete pipeline in one glance

graph LR
    A[Rare medical entities] --> B[Sub-graph construction]
    B --> C[Longest reasoning chain]
    C --> D[Question-answer pairs]
    D --> E[Masked trajectory]
    E --> F[Supervised fine-tuning]
    F --> G[Reinforcement learning]
    G --> H[32 B model ready for API]

Three folders = three stages

Folder                         Purpose
KnowledgeGraphConstruction     Generate challenging questions
TrajectoryGenerationPipeline   Turn Q-A pairs into multi-turn tool trajectories
EvaluationPipeline             Run benchmarks and live demos

You can run the entire loop on a single 8-GPU node or rent four A100s for an afternoon.


3. Step-by-step reproduction

3.1 One-time setup (3 minutes)

OS               Commands
Linux/macOS      python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
Windows          python -m venv .venv && .venv\Scripts\activate && pip install -r requirements.txt
Conda (any OS)   conda create -n med_r1 python=3.10 && conda activate med_r1 && pip install -r requirements.txt

Python 3.10+ is required because the pipeline code uses the match-case syntax introduced in that release.

3.2 Visual walk-through: see the knowledge graph before you trust it

python KnowledgeGraphConstruction/start_web.py

Open http://localhost:5000 in any browser.
Upload a tiny CSV like:

entity
LMNA gene mutation
Dilated cardiomyopathy
ARRY-371797

The interactive force-directed graph shows entities as nodes and relationships as edges.
Click “max_chain” to let the system auto-extract the longest medically valid path and convert it into a natural-language question.
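
Conceptually, "max_chain" is a longest-path search over the relationship graph. The real KnowledgeGraphConstruction code also checks the medical validity of each hop; the sketch below only shows the graph-walk part, with illustrative edge labels:

```python
def longest_chain(edges):
    """Longest simple path in a small directed relationship graph.

    edges: iterable of (source_entity, relation, target_entity) triples.
    """
    graph = {}
    for src, _rel, dst in edges:
        graph.setdefault(src, []).append(dst)

    best = []

    def dfs(node, path):
        nonlocal best
        if len(path) > len(best):
            best = path[:]
        for nxt in graph.get(node, []):
            if nxt not in path:  # guard against cycles
                dfs(nxt, path + [nxt])

    for start in {s for s, _, _ in edges}:
        dfs(start, [start])
    return best

edges = [
    ("LMNA gene mutation", "causes", "Dilated cardiomyopathy"),
    ("ARRY-371797", "targets", "LMNA gene mutation"),
]
```

With the two demo edges above, the longest chain is ARRY-371797 → LMNA gene mutation → Dilated cardiomyopathy, which the pipeline would then verbalize as a question.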

3.3 Batch-generate questions (optional)

cd KnowledgeGraphConstruction
python batch_qa_cli.py \
  --seed-file demo_medical.csv \
  --output ../TrajectoryGenerationPipeline/dataset/qa.jsonl \
  --max-iterations 3

Each line in qa.jsonl looks like:

{"id":"rare_002","question":"Which companies are developing PCSK9 inhibitors for familial hypercholesterolemia in 2025?","answer":"Regeneron (evinacumab), Novartis (XXXX-301)...","reasoning_chain":["Query PCSK9 FH indication → Filter ongoing trials → Map sponsor names"]}
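
Before feeding the file downstream, it is worth validating each line against the fields the trajectory stage expects. A minimal checker, assuming the field names in the sample line above (extra fields are allowed):

```python
import json

REQUIRED = {"id", "question", "answer", "reasoning_chain"}

def validate_qa_line(line: str) -> dict:
    """Parse one qa.jsonl line and fail loudly on missing fields."""
    record = json.loads(line)
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record
```

Run it over the file once before starting trajectory generation; a malformed line caught here is much cheaper than a failed multi-hour generation run.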

3.4 Turn questions into training trajectories

Edit TrajectoryGenerationPipeline/src/trajectory_generation/config.json:

{
  "llm_config": {
    "api_base": "https://api.openai.com/v1",
    "api_key_env": "OPENAI_API_KEY"
  },
  "generation": {
    "model": "gpt-4o-mini",
    "dataset": "qa.jsonl"
  }
}
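
Note that api_key_env names an environment variable rather than embedding the key in the file. A sketch of how such a config might be consumed (the repo's actual loader may differ):

```python
import json
import os

def load_llm_config(config_text: str) -> dict:
    """Resolve the config above into a flat dict, reading the API key
    from the environment variable named by api_key_env."""
    cfg = json.loads(config_text)
    llm = cfg["llm_config"]
    return {
        "api_base": llm["api_base"],
        "api_key": os.environ.get(llm["api_key_env"], ""),
        "model": cfg["generation"]["model"],
        "dataset": cfg["generation"]["dataset"],
    }
```

Keeping the key out of config.json means the file can be committed to version control safely; export OPENAI_API_KEY in your shell before running the pipeline.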

Run:

cd ../TrajectoryGenerationPipeline
python src/trajectory_generation/run_reasoning.py        # raw trajectories
python src/postprocessing/pipeline.py \
  --input_dir generation/gpt-4o-mini/qa \
  --mode eval_filter
python src/postprocessing/pipeline.py \
  --input_dir generation/gpt-4o-mini/qa \
  --mode rewrite

The final file rewritten_results.jsonl is ready for fine-tuning.


4. Training recipe: two stages, no tricks

Stage 1: Supervised fine-tuning
  Goal: learn when to call which tool
  Data: 2,100 trajectories
  Key safeguard: 5 % random tool-failure injection

Stage 2: Reinforcement learning (GRPO)
  Goal: reward accuracy and penalize redundancy
  Data: online rollouts
  Key safeguard: reward = 1.0 × accuracy + 0.2 × expert preference − 0.1 × redundant calls
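
The RL reward is a simple weighted sum, directly from the formula above:

```python
def grpo_reward(accuracy: float, expert_pref: float, redundant_calls: int) -> float:
    """Reward for one rollout:
    1.0 * accuracy + 0.2 * expert preference - 0.1 * redundant tool calls."""
    return 1.0 * accuracy + 0.2 * expert_pref - 0.1 * redundant_calls
```

A fully correct answer with full expert preference and two redundant tool calls scores 1.0 + 0.2 − 0.2 = 1.0, so redundancy directly erodes the preference bonus.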

Hardware: 8×H200 GPUs finish the 3 training epochs in about 6 hours;
8×A100 GPUs need one overnight job.


5. Deployment: from checkpoint to REST API

5.1 Launch the model

pip install sglang[all]
CUDA_VISIBLE_DEVICES=0,1 \
python -m sglang.launch_server \
  --model-path ./MedResearcher-R1-32B \
  --port 6001 --tp-size 2
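
SGLang serves an OpenAI-compatible HTTP API, so a chat request is a plain JSON POST. A minimal client sketch using only the standard library; the model name and port match the launch command above, and the temperature value is an illustrative choice:

```python
import json
import urllib.request

def build_request(question: str, base: str = "http://localhost:6001/v1"):
    """Build an OpenAI-style chat-completion request for the local server."""
    payload = {
        "model": "MedResearcher-R1-32B",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# resp = json.load(urllib.request.urlopen(build_request("What phase-II drugs target LMNA mutations?")))
```

Any OpenAI-compatible client library can be pointed at http://localhost:6001/v1 in the same way.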

5.2 Single-question debugging

cd EvaluationPipeline
python eval_cli.py --mode interactive

You will see the model’s thought-process, tool calls, and final answer in real time.

5.3 Batch benchmark

python eval_cli.py \
  --mode batch \
  --dataset medbrowsecomp \
  --workers 20

A score.json file is produced that reproduces the 27.5 % score reported in the paper.
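
The aggregation behind such a score file is straightforward; a sketch under the assumption that each question yields a boolean "correct" flag (the actual score.json schema may differ):

```python
def summarize(results: list) -> dict:
    """Aggregate per-question correctness into a benchmark summary."""
    correct = sum(1 for r in results if r["correct"])
    return {
        "total": len(results),
        "correct": correct,
        "accuracy_pct": round(100.0 * correct / len(results), 1),
    }
```

This makes it easy to diff runs: re-run the benchmark after any pipeline change and compare the summaries side by side.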


6. FAQ

Q1: Can I run this on a single RTX 4090?
Yes. Reduce --max-iterations to 1 and use LoRA fine-tuning. The provided open_data.jsonl is small enough for 12 GB VRAM.

Q2: Is the framework restricted to medicine?
No. Replace the seed CSV with legal, finance, or industrial-safety entities; the rest of the pipeline remains identical.

Q3: Why mask entities in trajectories?
Masking prevents the model from memorizing answers. It must actively retrieve masked entities using tools, achieving true reasoning instead of pattern matching.
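
An illustrative version of that masking step: known answer entities in a trajectory are replaced by placeholder tokens so the model must re-derive them with tools. The placeholder format here is an assumption, not the repo's exact scheme:

```python
import re

def mask_entities(text: str, entities: list) -> str:
    """Replace each entity with a numbered placeholder, longest first
    so that substrings of longer entities are not masked twice."""
    for i, ent in enumerate(sorted(entities, key=len, reverse=True), 1):
        text = re.sub(re.escape(ent), f"[ENTITY_{i}]", text, flags=re.IGNORECASE)
    return text
```

During training the model sees only the masked text plus the tool outputs, so reproducing the answer requires actual retrieval rather than recall.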


7. From technology to product mindset

The team’s original reflection (translated from Chinese):

“We used to think: build a generic model, then add prompts for each use case.
The breakthrough came when we flipped the order: build the entire product system first, then train the model inside that system.
With a team of four and one month of time, two weeks went to infrastructure. Slow is fast.”

Today, the open-source release ships:

  • A point-and-click web UI for knowledge-graph exploration
  • One-command batch generation and filtering scripts
  • A reproducible benchmark harness

Future work already on the roadmap:

  1. Multi-modal tools (radiology images, pathology slides, EHR)
  2. Clinician-in-the-loop RLHF
  3. Context length expansion from 32 K to 128 K tokens

8. Citation

@article{medresearcher2025,
  title={MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework},
  author={Yu, Ailing and Yao, Lan and Liu, Jingnan and Chen, Zhe and Yin, Jiajun and Wang, Yuan and Liao, Xinhao and Ye, Zhiling and Li, Ji and Yue, Yun and Xiao, Hansong and Zhou, Hualei and Guo, Chunxiao and Wei, Peng and Gu, Jinjie},
  journal={arXiv preprint arXiv:2508.14880},
  year={2025}
}

If you have walked through the commands above, you now own an end-to-end, domain-agnostic Deep-Research factory. Swap in new seed entities, press “run,” and the same pipeline will climb the next knowledge mountain.