From GPT-4 to GPT-5: Advancements and Challenges in Medical AI
Introduction
The rapid evolution of artificial intelligence (AI) has transformed healthcare, with large language models (LLMs) such as the GPT series playing a pivotal role. A 2025 report from Stanford's CRFM introduces MedHELM, a benchmark designed to evaluate AI's medical capabilities. This article breaks down the key findings on GPT-5's performance, highlighting its strengths, limitations, and implications for clinical practice.
What is MedHELM?
MedHELM is a comprehensive testing framework that evaluates AI models across eight critical medical tasks:
| Task | Purpose | Example |
|---|---|---|
| MedCalc-Bench | Numerical calculations | Drug dosage, lab value analysis |
| Medec | Error detection in medical records | Identifying charting mistakes |
| HeadQA | Cross-disciplinary reasoning | Solving complex, multi-specialty cases |
| Medbullets | Medical knowledge recall | Recalling clinical guidelines |
| PubMedQA | Evidence-based question answering | Applying research findings to patient care |
| EHRSQL | Structured data processing | Extracting info from electronic health records (EHRs) |
| RaceBias | Fairness evaluation | Avoiding racial disparities in recommendations |
| MedHallu | Hallucination resistance | Preventing fabricated medical claims |
Key Findings: How Did GPT-5 Perform?
1. Strengths: Where GPT-5 Shines
A. Advanced Numerical Calculations
- MedCalc-Bench: Tied for first place with DeepSeek R1 (35% accuracy).
- Improvement: Outperformed GPT-4o by 16% on tasks like acid-base calculations.
- Real-world impact: Stronger support for dosing adjustments and lab interpretation (a worked example of this kind of calculation follows below).
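To show the flavor of a MedCalc-Bench item (this specific formula illustrates the task type and is not an item quoted from the report), here is a minimal Python sketch of the Cockcroft-Gault creatinine clearance estimate, a standard renal-dosing calculation:

```python
def cockcroft_gault(age_years: float, weight_kg: float,
                    creatinine_mg_dl: float, female: bool) -> float:
    """Estimate creatinine clearance (mL/min) via Cockcroft-Gault:
    CrCl = (140 - age) * weight / (72 * serum creatinine), * 0.85 if female."""
    crcl = (140 - age_years) * weight_kg / (72 * creatinine_mg_dl)
    return 0.85 * crcl if female else crcl

# Example: 70-year-old woman, 60 kg, serum creatinine 1.2 mg/dL
print(f"{cockcroft_gault(70, 60, 1.2, female=True):.1f} mL/min")  # ~41.3
```

Getting calculations like this right is exactly the kind of multi-step arithmetic that MedCalc-Bench scores.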
B. Cross-Domain Knowledge Integration
- HeadQA: Achieved 93% accuracy, a new benchmark high.
- Example: Solving cases requiring knowledge of endocrinology + cardiology + pharmacology.
C. Broader Medical Knowledge
- Medbullets: 89% accuracy (an 8% gain over GPT-4).
- Strength: Excels at recalling low-frequency, niche medical facts.
2. Weaknesses: Where GPT-5 Struggles
A. Structured Data Limitations
- EHRSQL: 18% accuracy (a 14% drop from GPT-4).
- Common errors (illustrated in the sketch below):
  - Misinterpreting field names (e.g., "systolic BP" vs. "BP")
  - Incomplete SQL queries (missing WHERE clauses)
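To make these failure modes concrete, here is a hypothetical illustration (the table, columns, and queries below are invented for this article, not drawn from the benchmark):

```python
# Hypothetical EHR table: vitals(patient_id, systolic_bp, diastolic_bp, recorded_at)

# Error type 1: misinterpreting field names. "bp" is not a real column;
# the schema stores systolic and diastolic readings separately.
bad_column = "SELECT patient_id, bp FROM vitals"

# Error type 2: incomplete query. The WHERE clause restricting results
# to hypertensive readings is missing entirely.
missing_filter = "SELECT patient_id, systolic_bp FROM vitals"

# A schema-faithful query answering "which patients had systolic BP >= 140?":
correct = "SELECT patient_id, systolic_bp FROM vitals WHERE systolic_bp >= 140"
```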
B. Fairness Concerns
- RaceBias: 72% accuracy (20% below the leader).
- Risk: Potential bias in recommendations based on patient demographics (a simple probe for this is sketched below).
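A common way to probe this risk is a counterfactual test: change only the stated demographic in an otherwise identical vignette and check whether the recommendation changes. Below is a minimal sketch (not the RaceBias protocol itself; `query_model` is a placeholder for any model call):

```python
VIGNETTE = ("A 54-year-old {race} patient presents with crushing chest pain "
            "radiating to the left arm. What is the next step in management?")

def counterfactual_bias_check(query_model,
                              races=("white", "Black", "Asian", "Hispanic")):
    """Return the races whose answer differs from the first race's answer."""
    answers = {race: query_model(VIGNETTE.format(race=race)) for race in races}
    baseline = next(iter(answers.values()))
    return {race: ans for race, ans in answers.items() if ans != baseline}

# An empty result is the desired outcome: identical symptoms should yield
# identical recommendations regardless of the stated race.
```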
C. Evidence Application Gaps
- PubMedQA: 67% accuracy (7% below the best model).
- Issue: Over-reliance on common answer patterns rather than nuanced evidence.
Efficiency Analysis: Speed vs. Accuracy
| Task | GPT-5 Time (s) | Leader Time (s) | Speed Ratio (GPT-5 ÷ leader) |
|---|---|---|---|
| MedCalc-Bench | 22.06 | 43.75 | 0.50x (faster) |
| EHRSQL | 30.94 | 3.83 | 8.08x (slower) |
- Long tasks: Faster on complex calculations (e.g., MedCalc-Bench).
- Short tasks: Slower on structured queries (e.g., EHRSQL), compounding cost concerns (see the ratio check below).
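For clarity, the speed ratio above is GPT-5's latency divided by the leader's, so values below 1 mean faster and above 1 mean slower; a quick check reproduces both rows:

```python
for task, gpt5_s, leader_s in [("MedCalc-Bench", 22.06, 43.75),
                               ("EHRSQL", 30.94, 3.83)]:
    ratio = gpt5_s / leader_s  # GPT-5 time divided by leader time
    print(f"{task}: {ratio:.2f}x ({'faster' if ratio < 1 else 'slower'})")
# MedCalc-Bench: 0.50x (faster)
# EHRSQL: 8.08x (slower)
```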
Applications: When to Use GPT-5 in Healthcare
✅ Suitable Use Cases:
- Clinical decision support: For numerical calculations or multi-specialty reasoning.
- Medical education: As a knowledge resource for students.
- Literature reviews: Extracting key findings from research papers.
⚠️ High-Risk Use Cases:
- EHR data analysis: Due to structured data limitations.
- Bias-sensitive decisions: Where fairness is critical.
- Fact-critical reporting: Risk of generating unsupported claims.
Future Directions: What Needs Improvement?
1. Technical Enhancements
- Schema grounding: Improve structured data handling, e.g., SQL generation (see the sketch after this list).
- Bias mitigation: Address fairness regressions via targeted training.
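As a rough sketch of what schema grounding could look like (the schema and helper below are invented for illustration, not a specific system's API), the idea is to place the exact table definitions in the prompt and reject generated queries that reference anything else:

```python
SCHEMA = """Table vitals: patient_id INT, systolic_bp INT, diastolic_bp INT, recorded_at DATE
Table patients: patient_id INT, birth_date DATE, sex TEXT"""

def build_grounded_prompt(question: str) -> str:
    """Prepend the exact schema so generated SQL can only use real columns."""
    return ("You may only reference the tables and columns listed below.\n"
            f"{SCHEMA}\n"
            f"Question: {question}\n"
            "SQL:")

prompt = build_grounded_prompt("Which patients had systolic BP over 140 last week?")
# The prompt is then sent to the model; a post-hoc validator can reject any
# generated query naming a column that does not appear in SCHEMA.
```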
2. Evaluation Upgrades
- Stress-test structured tasks: Expand EHR query benchmarks.
- Fine-grained error analysis: Categorize hallucination types and bias mechanisms.
FAQs: Common Questions About GPT-5 in Healthcare
Q1: What’s GPT-5’s biggest medical breakthrough?
A: Mastery of complex numerical calculations (e.g., drug dosing), now matching top models.
Q2: When should clinicians avoid AI tools?
A: For tasks requiring precise structured data extraction (e.g., EHR queries) or where bias is a concern.
Q3: Has GPT-5 solved the “hallucination” problem?
A: No—accuracy trails the leader by 5%, so human verification remains essential.
Q4: Why did GPT-5 perform worse on EHR tasks?
A: Struggles with schema constraints (e.g., confusing medical terms or omitting query logic).
Q5: What’s the biggest fairness concern?
A: Recommendations varying by race, even with identical symptoms.
Conclusion
GPT-5 represents progress in medical AI, particularly in calculations and knowledge recall. However, structured data handling and fairness remain critical challenges. As healthcare increasingly adopts AI, rigorous evaluation frameworks like MedHELM will ensure models are both powerful and responsible.
Based on Stanford CRFM’s 2025 MedHELM report. Technical terms simplified for clarity.