From GPT-4 to GPT-5: Advancements and Challenges in Medical AI

Introduction

The rapid evolution of artificial intelligence (AI) has transformed healthcare, with large language models (LLMs) like GPT playing a pivotal role. A 2025 report by Stanford’s Center for Research on Foundation Models (CRFM) introduces MedHELM, a benchmark designed to evaluate AI models’ medical capabilities. This article breaks down the report’s key findings on GPT-5, highlighting its strengths, limitations, and implications for clinical practice.


What is MedHELM?

MedHELM is a comprehensive testing framework that evaluates AI models across eight critical medical tasks:

| Task | Purpose | Example |
| --- | --- | --- |
| MedCalc-Bench | Numerical calculations | Drug dosage, lab value analysis |
| Medec | Error detection in medical records | Identifying charting mistakes |
| HeadQA | Cross-disciplinary reasoning | Solving complex, multi-specialty cases |
| Medbullets | Medical knowledge recall | Recalling clinical guidelines |
| PubMedQA | Evidence-based question answering | Applying research findings to patient care |
| EHRSQL | Structured data processing | Extracting info from electronic health records (EHRs) |
| RaceBias | Fairness evaluation | Avoiding racial disparities in recommendations |
| MedHallu | Hallucination resistance | Preventing fabricated medical claims |

Key Findings: How Did GPT-5 Perform?

1. Strengths: Where GPT-5 Shines

A. Advanced Numerical Calculations

  • MedCalc-Bench: Tied for first place with DeepSeek R1 (35% accuracy).
  • Improvement: Outperformed GPT-4o by 16% in tasks like acid-base calculations.
  • Real-world impact: More dependable for dosing adjustments or lab interpretation than its predecessors, though 35% accuracy still demands clinician verification (see the sketch below).
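
For context on what these tasks involve, here is a minimal sketch of a MedCalc-Bench-style calculation. The Cockcroft-Gault creatinine clearance formula is a standard clinical equation, used here purely as an illustration rather than an item taken from the benchmark:

```python
def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, female: bool) -> float:
    """Estimate creatinine clearance (mL/min) via Cockcroft-Gault.

    Illustrative only: MedCalc-Bench asks models to perform
    calculations like this from a free-text patient description.
    """
    crcl = ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return crcl * 0.85 if female else crcl

# Example: 67-year-old, 72 kg male with serum creatinine 1.4 mg/dL
print(f"{cockcroft_gault(67, 72, 1.4, female=False):.1f} mL/min")  # ≈ 52.1
```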

B. Cross-Domain Knowledge Integration

  • HeadQA: Achieved 93% accuracy, a new benchmark high.
  • Example: Solving cases requiring knowledge of endocrinology + cardiology + pharmacology.

C. Broader Medical Knowledge

  • Medbullets: 89% accuracy (8% gain over GPT-4).
  • Strength: Excels in recalling low-frequency, niche medical facts.

2. Weaknesses: Where GPT-5 Struggles

A. Structured Data Limitations

  • EHRSQL: 18% accuracy (14% drop from GPT-4).
  • Common errors:

    • Misinterpreting field names (e.g., “systolic BP” vs. “BP”)
    • Incomplete SQL queries (missing WHERE clauses)
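
To make these failure modes concrete, here is a sketch against a hypothetical vitals table; the schema and column names are invented for this example:

```python
# Hypothetical table: vitals(patient_id, measured_at, systolic_bp, diastolic_bp)

# Intended query: this patient's systolic readings above 140, newest first.
correct = """
SELECT measured_at, systolic_bp
FROM vitals
WHERE patient_id = :pid AND systolic_bp > 140
ORDER BY measured_at DESC;
"""

# Failure mode 1 -- misinterpreted field name: the table has no generic
# "bp" column, only systolic_bp and diastolic_bp.
wrong_column = "SELECT measured_at, bp FROM vitals;"

# Failure mode 2 -- missing WHERE clause: syntactically valid, but it
# returns every patient's vitals instead of the one being asked about.
missing_filter = "SELECT measured_at, systolic_bp FROM vitals ORDER BY measured_at DESC;"
```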

B. Fairness Concerns

  • RaceBias: 72% accuracy (20% below the leader).
  • Risk: Potential bias in recommendations based on patient demographics.
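
One common way to probe this kind of regression is a counterfactual test: hold the vignette fixed, vary only the demographic attribute, and compare outputs. A minimal sketch, where query_model is a hypothetical stand-in for whatever inference API is under test:

```python
VIGNETTE = (
    "A 54-year-old {race} patient presents with exertional chest pain "
    "radiating to the left arm. What is the recommended next step?"
)

def query_model(prompt: str) -> str:
    """Hypothetical placeholder for the model call under evaluation."""
    raise NotImplementedError

def counterfactual_probe(races=("white", "Black", "Asian", "Hispanic")) -> dict:
    # Same symptoms every time; only the demographic token varies.
    answers = {race: query_model(VIGNETTE.format(race=race)) for race in races}
    baseline = answers[races[0]]
    # Flag demographics whose recommendation diverges from the baseline.
    return {race: ans for race, ans in answers.items() if ans != baseline}
```

Exact string comparison is deliberately naive; a production evaluation would score semantic equivalence instead. The principle is the same: inputs identical up to demographics should yield equivalent recommendations.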

C. Evidence Application Gaps

  • PubMedQA: 67% accuracy (7% below best model).
  • Issue: Over-reliance on common answer patterns rather than nuanced evidence.

Efficiency Analysis: Speed vs. Accuracy

| Task | GPT-5 Time (s) | Leader Time (s) | Time Ratio (GPT-5 ÷ Leader) |
| --- | --- | --- | --- |
| MedCalc-Bench | 22.06 | 43.75 | 0.50× (faster) |
| EHRSQL | 30.94 | 3.83 | 8.08× (slower) |

  • Long tasks: Faster on complex calculations (e.g., MedCalc-Bench).
  • Short tasks: Slower on structured queries (e.g., EHRSQL), compounding per-query cost concerns.

Applications: When to Use GPT-5 in Healthcare

✅ Suitable Use Cases:

  • Clinical decision support: For numerical calculations or multi-specialty reasoning.
  • Medical education: As a knowledge resource for students.
  • Literature reviews: Extracting key findings from research papers.

⚠️ High-Risk Use Cases:

  • EHR data analysis: Due to structured data limitations.
  • Bias-sensitive decisions: Where fairness is critical.
  • Fact-critical reporting: Risk of generating unsupported claims.

Future Directions: What Needs Improvement?

1. Technical Enhancements

  • Schema grounding: Improve structured data handling (e.g., SQL generation); see the sketch after this list.
  • Bias mitigation: Address fairness regressions via targeted training.
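
A rough sketch of what schema grounding could look like in practice: supply the table schema alongside the question, then reject any generated query that references identifiers outside it. Everything here (the SCHEMA dict, the validator) is hypothetical and simplified:

```python
import re

# Hypothetical EHR schema the generator is grounded against.
SCHEMA = {"vitals": {"patient_id", "measured_at", "systolic_bp", "diastolic_bp"}}

SQL_KEYWORDS = {"select", "from", "where", "and", "or", "not",
                "order", "by", "desc", "asc", "limit"}

def validate_identifiers(sql: str, schema: dict) -> bool:
    """Reject queries referencing tables or columns outside the schema.

    Crude word-level check for illustration; a real validator would
    parse the SQL properly.
    """
    known = set(schema) | {col for cols in schema.values() for col in cols}
    words = set(re.findall(r"[a-z_]+", sql.lower()))
    return (words - SQL_KEYWORDS) <= known

# The guessed column "bp" is not in the schema, so the query is rejected.
print(validate_identifiers("SELECT bp FROM vitals", SCHEMA))           # False
print(validate_identifiers("SELECT systolic_bp FROM vitals", SCHEMA))  # True
```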

2. Evaluation Upgrades

  • Stress-test structured tasks: Expand EHR query benchmarks.
  • Fine-grained error analysis: Categorize hallucination types or bias mechanisms.

FAQs: Common Questions About GPT-5 in Healthcare

Q1: What’s GPT-5’s biggest medical breakthrough?

A: Its gains in complex numerical calculations (e.g., drug dosing), where it now matches the top models.

Q2: When should clinicians avoid AI tools?

A: For tasks requiring precise structured data extraction (e.g., EHR queries) or where bias is a concern.

Q3: Has GPT-5 solved the “hallucination” problem?

A: No—accuracy trails the leader by 5%, so human verification remains essential.
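
One reason verification stays manual: automated checks remain blunt. A minimal sketch of a verification gate, with a naive substring match standing in for a real entailment or citation check (all names here are hypothetical):

```python
def flag_unsupported(answer: str, sources: list[str]) -> list[str]:
    """Return the sentences in a model answer that no retrieved source
    supports, so a human can review them before the answer is used.

    Substring matching is a deliberately naive stand-in for a real
    entailment model or citation verifier.
    """
    corpus = " ".join(sources).lower()
    return [s for s in answer.split(". ")
            if s and s.lower().strip(".") not in corpus]
```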

Q4: Why did GPT-5 perform worse on EHR tasks?

A: Struggles with schema constraints (e.g., confusing medical terms or omitting query logic).

Q5: What’s the biggest fairness concern?

A: Recommendations varying by race, even with identical symptoms.


Conclusion

GPT-5 represents progress in medical AI, particularly in calculations and knowledge recall. However, structured data handling and fairness remain critical challenges. As healthcare increasingly adopts AI, rigorous evaluation frameworks like MedHELM will help ensure models are both powerful and responsible.


Based on Stanford CRFM’s 2025 MedHELM report. Technical terms simplified for clarity.