For years, the conversation around artificial intelligence in medicine has centered on one question: “Can it pass the test?” Large language models (LLMs) like GPT and Claude have dazzled us by acing the US Medical Licensing Exam (USMLE), proving they possess an encyclopedic knowledge of medical facts. But passing a written exam is only the first hurdle. The true, and far more critical, challenge is this: Can AI reliably do the job?

Imagine an AI not just telling you the treatment for pneumonia, but actually logging into a hospital’s electronic health record (EHR) system, checking the patient’s specific allergies and kidney function, calculating the correct drug dosage, and then safely entering that order for the doctor’s final approval. This is the difference between a brilliant student and a competent, trustworthy colleague. Stanford University researchers have recognized this gap and built the solution: MedAgentBench. It’s not another quiz; it’s a fully simulated hospital environment designed to test whether AI can function as a practical, day-to-day assistant for healthcare professionals.

Why Knowledge Isn’t Enough: The Rise of the AI “Doer”

Before we dive into what MedAgentBench is, we need to understand the fundamental shift it represents. It’s crucial to distinguish between two types of AI:

  • The Chatbot (The “Sayer”): This is the AI you’re probably familiar with. You ask it a question—“What are the symptoms of appendicitis?”—and it gives you a well-researched, text-based answer. Its job is to inform and converse. It’s a powerful tool, but it’s passive.
  • The AI Agent (The “Doer”): This is the next evolution. An AI agent doesn’t just answer; it acts. Given a high-level instruction like, “Prepare a discharge summary for patient 12345,” it can autonomously break that down into steps: pull the patient’s diagnosis and treatment history from the EHR, summarize key lab results, list prescribed medications, and draft the document—all by interacting with different digital systems. It’s an active participant in the workflow; a minimal sketch of such an agent loop follows right after this list.
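
For readers who want to see the mechanics, here is a minimal sketch of that agent loop: the model repeatedly proposes an action, the surrounding system executes it against the EHR, and the resulting observation is fed back until the model declares the task finished. Both call_llm and call_ehr_api are hypothetical stand-ins, not part of any real product or of MedAgentBench itself.

```python
# Hedged sketch of a generic agent loop. The two helpers below are stubs so the
# snippet runs on its own; a real agent would query an LLM and a live EHR API.

def call_llm(history: list[str]) -> dict:
    """Stub: a real agent would ask the language model to choose the next action."""
    return {"type": "finish", "answer": "(draft discharge summary would go here)"}

def call_ehr_api(action: dict) -> str:
    """Stub: a real agent would translate the chosen action into an EHR API call."""
    return "(EHR response)"

def run_agent(instruction: str, max_steps: int = 8) -> str:
    history = [f"Task: {instruction}"]
    for _ in range(max_steps):
        action = call_llm(history)              # model proposes the next step
        if action["type"] == "finish":
            return action["answer"]             # model reports its final result
        observation = call_ehr_api(action)      # the (simulated) EHR responds
        history.append(f"Action: {action} -> Observation: {observation}")
    return "Stopped: step limit reached"

print(run_agent("Prepare a discharge summary for patient 12345"))
```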

In a hospital, doctors and nurses spend a staggering amount of time—studies suggest up to 73%—on administrative tasks: documenting visits, ordering tests, reviewing lab results, writing referrals, and managing medications. These tasks are essential but repetitive and time-consuming, contributing significantly to the global crisis of clinician burnout. AI agents hold the promise of taking over this “digital housekeeping,” freeing up medical staff to focus on what truly matters: direct patient care and complex decision-making.

However, letting an AI loose in a real EHR system is incredibly risky. A single mistyped number or a misinterpreted instruction could lead to a wrong medication being ordered or a critical test being missed. We need a safe, standardized, and realistic training ground to rigorously test these AI agents before they ever touch a real patient’s record. Existing AI benchmarks, designed for general tasks, simply don’t cut it for the high-stakes, complex, and highly regulated world of healthcare. This is the void that MedAgentBench fills.

Inside the MedAgentBench Lab: A Hospital in a Box

MedAgentBench isn’t a spreadsheet of questions. It’s a sophisticated, interactive simulation of a real hospital’s digital infrastructure. Think of it as a flight simulator for AI pilots in the medical field. It’s built on three core pillars to ensure its tests are as realistic and meaningful as possible.

Pillar 1: 300 Real-World Tasks, Written by Doctors

The heart of MedAgentBench is its collection of 300 distinct clinical tasks. These weren’t dreamed up by computer scientists; they were meticulously crafted by two practicing Stanford physicians, Dr. Kameron Black and Dr. Jonathan Chen, based on their daily routines. The tasks are designed to mirror the actual, often messy, work that happens in both inpatient (hospital) and outpatient (clinic) settings.

The tasks span 10 broad categories covering the full spectrum of routine clinical work. Here are seven of them, each with an example instruction:

  1. Finding Patient Information: “What is the medical record number (MRN) for the patient named John Doe, born on January 1, 1980?”
  2. Retrieving Lab Results: “What was patient 12345’s most recent magnesium level within the last 24 hours?”
  3. Recording Patient Data: “I just took patient 12345’s blood pressure, and it’s 118/77 mmHg. Please document this in their chart.”
  4. Ordering Tests: “Check when patient 12345 last had a Hemoglobin A1C test. If it was over a year ago, order a new one.”
  5. Managing Medications: “Review patient 12345’s latest potassium level. If it’s below 3.5 mmol/L, calculate and order the appropriate potassium replacement dose.”
  6. Placing Referrals: “Order a referral to orthopedic surgery for patient 12345, with the following clinical notes…”
  7. Aggregating Data: “What is the average blood glucose level for patient 12345 over the past 24 hours?”

On average, completing each task requires 2 to 3 distinct steps, forcing the AI to plan and execute a sequence of actions, just like a human would. For instance, to order a medication, the AI might first need to retrieve the patient’s latest lab result, interpret it against a clinical threshold, calculate a dose, and then finally submit the order.
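
To make those steps concrete, here is a hedged sketch of how the potassium task from the list above might decompose. The two helper functions are stubs standing in for the FHIR calls described in the next pillar, and the dosing note is purely illustrative, not clinical guidance.

```python
# Hedged sketch of a single MedAgentBench-style task broken into explicit steps.
# Stubs keep the snippet self-contained; a real agent would call the FHIR API instead.

def get_latest_potassium(patient_id: str) -> float:
    """Stub for step 1: retrieve the most recent potassium result (a FHIR GET in practice)."""
    return 3.1  # mmol/L, canned value for illustration

def submit_order(patient_id: str, medication: str, note: str) -> str:
    """Stub for step 3: draft the order for clinician approval (a FHIR POST in practice)."""
    return f"Order drafted for patient {patient_id}: {medication} ({note})"

def potassium_replacement_task(patient_id: str, threshold: float = 3.5) -> str | None:
    latest_k = get_latest_potassium(patient_id)            # step 1: look up the lab value
    if latest_k >= threshold:                              # step 2: interpret it against the threshold
        return None                                        # nothing to order
    return submit_order(patient_id, "potassium chloride",  # step 3: act
                        note=f"latest K+ {latest_k} mmol/L is below {threshold}")

print(potassium_replacement_task("12345"))
```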

Pillar 2: 100 Virtual Patients with Real Medical Histories

A test is only as good as its data. MedAgentBench doesn’t use fake or synthetic patient profiles. Instead, it leverages anonymized data from Stanford’s STARR (Stanford Medicine Research Data Repository). Researchers extracted the records of 100 real patients, carefully removing all personally identifiable information (like names and addresses) and slightly adjusting timestamps to protect privacy while preserving the clinical reality.

This virtual patient cohort is incredibly rich, comprising over 785,000 individual data points. This includes:

  • Hundreds of thousands of laboratory test results.
  • Detailed vital sign records (heart rate, blood pressure, oxygen levels, etc.).
  • Comprehensive lists of diagnoses and medical conditions.
  • Records of procedures and surgeries performed.
  • Full histories of medication orders.

This means the AI isn’t operating in a clean, theoretical environment. It’s dealing with the same kind of complex, longitudinal, and sometimes incomplete data that clinicians encounter every day. This is crucial for testing an AI’s ability to handle real-world ambiguity.

Pillar 3: A FHIR-Compliant Digital Hospital

The most critical aspect of MedAgentBench is its technical foundation. It’s built to mimic how real hospitals exchange data using the FHIR (Fast Healthcare Interoperability Resources) standard. FHIR is the global language that allows different healthcare software systems—like EHRs, lab systems, and pharmacy databases—to talk to each other securely.

In the MedAgentBench environment, AI agents don’t get special treatment. They must interact with the system using the same standard API (Application Programming Interface) calls that real software uses: GET requests to retrieve information and POST requests to add or modify data (like entering a new order). This design is intentional. It means that an AI agent that learns to operate successfully within MedAgentBench can, in theory, be integrated into a real hospital’s EHR system with minimal modification. The benchmark is built for direct, real-world translation.
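
As an illustration of what those two interaction patterns look like, here is a hedged sketch using Python’s requests library against an assumed local FHIR server. The base URL, LOINC code, and payload fields are illustrative choices of mine, not MedAgentBench’s actual configuration.

```python
import requests

FHIR_BASE = "http://localhost:8080/fhir"   # assumed local FHIR server, for illustration only

# GET: read-only retrieval, e.g. recent glucose observations for a patient
resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "12345", "code": "2345-7",   # 2345-7 = LOINC code for serum glucose
            "_sort": "-date", "_count": 5},
    timeout=10,
)
resp.raise_for_status()
observations = resp.json().get("entry", [])

# POST: a state-changing write, e.g. documenting a new blood-pressure reading of 118/77 mmHg
new_vital = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"text": "Blood pressure"},
    "subject": {"reference": "Patient/12345"},
    "component": [
        {"code": {"text": "Systolic"}, "valueQuantity": {"value": 118, "unit": "mmHg"}},
        {"code": {"text": "Diastolic"}, "valueQuantity": {"value": 77, "unit": "mmHg"}},
    ],
}
resp = requests.post(f"{FHIR_BASE}/Observation", json=new_vital, timeout=10)
resp.raise_for_status()
```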

The Report Card: How Did the AI Students Perform?

Stanford’s researchers put 12 of the world’s most advanced LLMs through their paces in MedAgentBench. The lineup included industry leaders like OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini series, and top open-source models like Meta’s Llama 3.3 and DeepSeek-V3.

The grading was intentionally harsh. Instead of allowing multiple attempts (a common practice in AI testing called “pass@k”), MedAgentBench uses a “pass@1” metric. This means the AI only gets one shot to complete the task correctly. Why? Because in a real hospital, there’s no “undo” button for a wrong medication order. Safety and reliability are non-negotiable, and the benchmark reflects that.
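
In code terms, pass@1 is simply the share of tasks completed correctly on the single permitted attempt. A minimal sketch; the 209-of-300 figure below just reproduces the top score in the table that follows:

```python
def pass_at_1(first_attempt_success: list[bool]) -> float:
    """Fraction of tasks solved correctly on the one and only allowed attempt."""
    return sum(first_attempt_success) / len(first_attempt_success)

# 209 successes out of 300 tasks -> 69.67%, the top score reported below
print(f"{pass_at_1([True] * 209 + [False] * 91):.2%}")
```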

Here’s how the top performers stacked up:

| AI Model | Overall Success Rate | Information Tasks | Action Tasks |
|---|---|---|---|
| Claude 3.5 Sonnet v2 | 69.67% | 85.33% | 54.00% |
| GPT-4o | 64.00% | 72.00% | 56.00% |
| DeepSeek-V3 (Open Source) | 62.67% | 70.67% | 54.67% |
| Gemini 1.5 Pro | 62.00% | 52.67% | 71.33% |
| GPT-4o-mini | 56.33% | 59.33% | 53.33% |
| Llama 3.3 (Open Source) | 46.33% | 50.00% | 42.67% |
| Mistral v0.3 (Open Source) | 4.00% | 8.00% | 0.00% |

(Source: MedAgentBench research paper, Stanford HAI)

Key Takeaways from the Results

  1. Claude Leads the Pack: Claude 3.5 Sonnet v2 emerged as the overall champion with a 69.67% success rate. It was particularly strong at information retrieval tasks (85.33%), which involve finding and reporting data.
  2. The “Doing” Gap is Real: The most striking observation is the performance gap between information tasks and action tasks. Almost every model, including Claude, performed significantly worse when asked to modify the system (like placing an order) compared to simply retrieving information. This highlights that “doing” is much harder than “knowing” or “finding.” Even the best models failed nearly half the time when asked to take action.
  3. Open Source is Closing In: It’s encouraging to see that open-source models like DeepSeek-V3 (62.67%) and Llama 3.3 (46.33%) are competitive, with DeepSeek-V3 outperforming several proprietary models. This democratizes development and allows more researchers to contribute to solving this critical problem.
  4. Room for Improvement is Vast: A 70% success rate sounds good on a school test, but it’s unacceptable in a clinical setting. It means that for every 10 tasks, 3 would fail. This underscores that while the technology is promising, we are still far from deploying fully autonomous AI agents in live patient care. Human oversight remains essential.

Where Do AI Agents Trip Up? Common Mistakes Revealed

By analyzing the failures, researchers identified two major categories of errors that plagued the AI agents:

  1. Not Following Instructions (The “Rebel” Error): This was the most common pitfall. The AI would understand the task but fail to execute it in the precise way the system required. For example:

    • The system might require a response in a strict JSON format, but the AI would return a full, verbose sentence.
    • It might send a POST request with incorrectly formatted data, causing the system to reject it.
    • One model, Gemini 2.0 Flash, failed 54% of its tasks simply because it didn’t follow the basic API call instructions.
  2. Wrong Output Format (The “Over-Explainer” Error): Even when the AI got the right answer, it often presented it incorrectly. For instance:

    • The task might ask for a single number (e.g., a lab value like “5.4”), but the AI would respond with a structured object like {"value": 5.4}.
    • It might add unnecessary explanatory text when only a raw value was requested, as illustrated in the sketch after this list.
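
To see how unforgiving that can be, here is a hedged sketch of the kind of strict output check involved; the checker itself is an illustration, not the benchmark’s actual grading code.

```python
def check_single_number(raw_reply: str, expected: float, tol: float = 1e-6) -> bool:
    """Pass only if the reply is exactly one parseable number matching the expected answer."""
    try:
        return abs(float(raw_reply.strip()) - expected) <= tol
    except ValueError:
        return False  # JSON objects, prose, or units all fail, even if the number inside is right

print(check_single_number("5.4", 5.4))                        # True: raw value, as requested
print(check_single_number('{"value": 5.4}', 5.4))             # False: right number, wrong shape
print(check_single_number("The level is 5.4 mmol/L.", 5.4))   # False: over-explained
```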

These errors are not about medical knowledge; they’re about precision, reliability, and the ability to follow complex, technical protocols—skills that are absolutely critical in a healthcare environment. They reveal that current AI agents, while intelligent, still lack the meticulous attention to detail required for unsupervised clinical work.

The Bigger Picture: What MedAgentBench Means for the Future of Healthcare

The creation of MedAgentBench is more than just an academic exercise. It’s a foundational step toward a future where AI genuinely alleviates the crushing administrative burden on healthcare workers.

  • A New North Star for Developers: For AI companies and researchers, MedAgentBench provides a clear, standardized target. Instead of optimizing for abstract knowledge tests, they can now focus on building models that excel at real-world, interactive tasks. It directs innovation toward practical utility.
  • A Safe Sandbox for Innovation: Hospitals and clinics can use MedAgentBench as a risk-free testing ground. Before deploying any AI tool in their live EHR, they can rigorously evaluate its performance in this simulated environment, identifying and fixing potential failures without endangering patients.
  • A Path to Reducing Burnout: The ultimate goal is human-centric. By automating the tedious, screen-based work that consumes so much of a clinician’s day, AI agents could give doctors and nurses the greatest gift: time. Time to listen to patients, time to think critically, time to provide compassionate care. As Dr. Kameron Black, a co-author of the study, put it: “Working on this project convinced me that AI won’t replace doctors anytime soon. It’s more likely to augment our clinical workforce.”

The vision is not of AI replacing the human touch, but of AI handling the digital drudgery, allowing the human professionals to return to the bedside.

Looking Ahead: Challenges and the Road to the Clinic

While MedAgentBench is a monumental leap forward, the researchers are the first to acknowledge its current limitations. Understanding these is key to charting the path forward.

  • Data Scope: All patient data comes from a single institution—Stanford. While rich, it may not fully represent the diversity of patient populations, diseases, or EHR system configurations found in hospitals worldwide. Future versions of the benchmark will need to incorporate broader, more diverse datasets.
  • Task Complexity: The initial 300 tasks are a robust start, but they focus primarily on core medical record interactions. They don’t yet encompass the full complexity of healthcare, such as interpreting free-text clinical notes, coordinating care across multiple specialists, or handling emergency, time-sensitive scenarios.
  • The “Team Sport” Problem: Medicine is a team effort. A real clinical task often requires communication and coordination between doctors, nurses, pharmacists, and social workers. The current MedAgentBench environment tests an AI in isolation, not as part of a dynamic, multi-agent team.

The Stanford team is already working on these challenges. Their research indicates that newer model versions are showing rapid improvement. With deliberate design, rigorous safety protocols, and careful integration, they believe AI agents could start handling basic, well-defined “housekeeping” tasks in clinical settings sooner than many expect.

Frequently Asked Questions (FAQ)

To make this complex topic more accessible, here are answers to some common questions you might have.

Q: Where can I access MedAgentBench? Is it free to use?
A: Yes, MedAgentBench is an open-source project. The code, the virtual environment, and the task datasets are all publicly available on GitHub at https://github.com/stanfordmlgroup/MedAgentBench. This allows researchers and developers worldwide to use, test, and contribute to its development.

Q: I’ve heard of FHIR, but what exactly is it and why is it so important for this project?
A: FHIR (pronounced “fire”) stands for Fast Healthcare Interoperability Resources. It’s a set of international standards that define how healthcare information should be formatted and exchanged between different computer systems. Think of it as the common language that allows your hospital’s EHR to talk to the lab system, the pharmacy, and the billing department. By building MedAgentBench on FHIR, the Stanford team ensured that any AI agent that learns to work in their simulated environment can, in principle, work in any real hospital that uses FHIR-compliant systems, which is now the global norm.

Q: What’s the practical difference between an AI “chatbot” and an AI “agent” in a hospital?
A: It’s the difference between a reference librarian and an administrative assistant.

  • A chatbot is like a librarian. You ask it a question (“What’s the normal range for sodium?”), and it gives you an answer. It’s great for information, but it doesn’t do anything with that information.
  • An AI agent is like an administrative assistant. You give it a task (“Check Mr. Smith’s sodium level from this morning and if it’s low, order a repeat test for tomorrow”), and it will go into the system, find the result, make the judgment call based on your instructions, and place the order—all without you having to do each step manually. It’s about action and automation.

Q: Why is the “pass@1” scoring so strict? Isn’t that unfair to the AI?
A: It’s not about being unfair; it’s about being realistic and safe. In a clinical environment, mistakes can have serious, even life-threatening, consequences. You can’t tell a patient, “Sorry, the AI messed up your medication order, but let’s give it another try.” The “pass@1” standard forces developers to create AI systems that are reliable and precise on the first attempt, which is the only standard that matters in healthcare. It’s a reflection of the zero-tolerance policy for errors in patient care.

Q: The best AI only got a 70% score. Does that mean this technology is useless?
A: Absolutely not. A 70% success rate in such a complex, real-world simulation is actually a very promising starting point. It shows the technology has significant potential. However, it also clearly shows that human oversight is non-negotiable. The most viable near-term application is not for AI to work alone, but for it to work with clinicians. For example, the AI could draft orders or pull up relevant patient data, and the doctor would then review and approve it. This “human-in-the-loop” approach leverages AI’s efficiency while maintaining the essential human judgment and safety net. The 70% score is a baseline, not a ceiling, and rapid improvements are expected.

MedAgentBench represents a pivotal moment in the journey of AI in healthcare. It moves the conversation beyond theoretical knowledge and into the realm of practical, actionable assistance. It provides the tools to rigorously test and improve these systems, ensuring that when they do enter the clinic, they are not just smart, but safe, reliable, and truly helpful partners to the medical professionals on the front lines. The goal is clear: to build AI that doesn’t replace the healer, but empowers them.