From GPT-4 to GPT-5: Advancements and Challenges in Medical AI

Introduction

The rapid evolution of artificial intelligence (AI) has transformed healthcare, with large language models (LLMs) like GPT playing a pivotal role. A 2025 report by Stanford’s Center for Research on Foundation Models (CRFM) introduces MedHELM, a benchmark designed to evaluate AI models’ medical capabilities. This article breaks down the report’s key findings on GPT-5, highlighting its strengths, limitations, and implications for clinical practice.


What is MedHELM?

MedHELM is a comprehensive testing framework that evaluates AI models across eight critical medical tasks:

| Task | Purpose | Example |
| --- | --- | --- |
| MedCalc-Bench | Numerical calculations | Drug dosage, lab value analysis |
| Medec | Error detection in medical records | Identifying charting mistakes |
| HeadQA | Cross-disciplinary reasoning | Solving complex, multi-specialty cases |
| Medbullets | Medical knowledge recall | Recalling clinical guidelines |
| PubMedQA | Evidence-based question answering | Applying research findings to patient care |
| EHRSQL | Structured data processing | Extracting info from electronic health records (EHRs) |
| RaceBias | Fairness evaluation | Avoiding racial disparities in recommendations |
| MedHallu | Hallucination resistance | Preventing fabricated medical claims |

Key Findings: How Did GPT-5 Perform?

1. Strengths: Where GPT-5 Shines

A. Advanced Numerical Calculations

  • MedCalc-Bench: Tied for first place with DeepSeek R1 (35% accuracy).
  • Improvement: Outperformed GPT-4o by 16% in tasks like acid-base calculations.
  • Real-world impact: More dependable for dosing adjustments or lab interpretation than its predecessors, though 35% accuracy still demands clinician verification (see the sketch below).
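
For context on what these tasks involve, here is a minimal sketch of a MedCalc-Bench-style calculation. The Cockcroft-Gault creatinine clearance formula is a standard clinical equation, used here purely as an illustration rather than an item taken from the benchmark:

```python
def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, female: bool) -> float:
    """Estimate creatinine clearance (mL/min) via Cockcroft-Gault.

    Illustrative only: MedCalc-Bench asks models to perform
    calculations like this from a free-text patient description.
    """
    crcl = ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return crcl * 0.85 if female else crcl

# Example: 67-year-old, 72 kg male with serum creatinine 1.4 mg/dL
print(f"{cockcroft_gault(67, 72, 1.4, female=False):.1f} mL/min")  # ≈ 52.1
```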

B. Cross-Domain Knowledge Integration

  • HeadQA: Achieved 93% accuracy, a new benchmark high.
  • Example: Solving cases requiring knowledge of endocrinology + cardiology + pharmacology.

C. Broader Medical Knowledge

  • Medbullets: 89% accuracy (8% gain over GPT-4).
  • Strength: Excels in recalling low-frequency, niche medical facts.

2. Weaknesses: Where GPT-5 Struggles

A. Structured Data Limitations

  • EHRSQL: 18% accuracy (14% drop from GPT-4).
  • Common errors:

    • Misinterpreting field names (e.g., “systolic BP” vs. “BP”)
    • Incomplete SQL queries (missing WHERE clauses)
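
To make these failure modes concrete, here is a sketch against a hypothetical vitals table; the schema and column names are invented for this example:

```python
# Hypothetical table: vitals(patient_id, measured_at, systolic_bp, diastolic_bp)

# Intended query: this patient's systolic readings above 140, newest first.
correct = """
SELECT measured_at, systolic_bp
FROM vitals
WHERE patient_id = :pid AND systolic_bp > 140
ORDER BY measured_at DESC;
"""

# Failure mode 1 -- misinterpreted field name: the table has no generic
# "bp" column, only systolic_bp and diastolic_bp.
wrong_column = "SELECT measured_at, bp FROM vitals;"

# Failure mode 2 -- missing WHERE clause: syntactically valid, but it
# returns every patient's vitals instead of the one being asked about.
missing_filter = "SELECT measured_at, systolic_bp FROM vitals ORDER BY measured_at DESC;"
```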

B. Fairness Concerns

  • RaceBias: 72% accuracy (20% below the leader).
  • Risk: Potential bias in recommendations based on patient demographics.
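
One common way to probe this kind of regression is a counterfactual test: hold the vignette fixed, vary only the demographic attribute, and compare outputs. A minimal sketch, where query_model is a hypothetical stand-in for whatever inference API is under test:

```python
VIGNETTE = (
    "A 54-year-old {race} patient presents with exertional chest pain "
    "radiating to the left arm. What is the recommended next step?"
)

def query_model(prompt: str) -> str:
    """Hypothetical placeholder for the model call under evaluation."""
    raise NotImplementedError

def counterfactual_probe(races=("white", "Black", "Asian", "Hispanic")) -> dict:
    # Same symptoms every time; only the demographic token varies.
    answers = {race: query_model(VIGNETTE.format(race=race)) for race in races}
    baseline = answers[races[0]]
    # Flag demographics whose recommendation diverges from the baseline.
    return {race: ans for race, ans in answers.items() if ans != baseline}
```

Exact string comparison is deliberately naive; a production evaluation would score semantic equivalence instead. The principle is the same: inputs identical up to demographics should yield equivalent recommendations.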

C. Evidence Application Gaps

  • PubMedQA: 67% accuracy (7% below best model).
  • Issue: Over-reliance on common answer patterns rather than nuanced evidence.

Efficiency Analysis: Speed vs. Accuracy

| Task | GPT-5 Time (s) | Leader Time (s) | Time Ratio (GPT-5 ÷ Leader) |
| --- | --- | --- | --- |
| MedCalc-Bench | 22.06 | 43.75 | 0.50× (faster) |
| EHRSQL | 30.94 | 3.83 | 8.08× (slower) |

  • Long tasks: Faster on complex calculations (e.g., MedCalc-Bench).
  • Short tasks: Slower on structured queries (e.g., EHRSQL), compounding per-query cost concerns.

Applications: When to Use GPT-5 in Healthcare

✅ Suitable Use Cases:

  • Clinical decision support: For numerical calculations or multi-specialty reasoning.
  • Medical education: As a knowledge resource for students.
  • Literature reviews: Extracting key findings from research papers.

⚠️ High-Risk Use Cases:

  • EHR data analysis: Due to structured data limitations.
  • Bias-sensitive decisions: Where fairness is critical.
  • Fact-critical reporting: Risk of generating unsupported claims.

Future Directions: What Needs Improvement?

1. Technical Enhancements

  • Schema grounding: Improve structured data handling (e.g., SQL generation); see the sketch after this list.
  • Bias mitigation: Address fairness regressions via targeted training.
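
A rough sketch of what schema grounding could look like in practice: supply the table schema alongside the question, then reject any generated query that references identifiers outside it. Everything here (the SCHEMA dict, the validator) is hypothetical and simplified:

```python
import re

# Hypothetical EHR schema the generator is grounded against.
SCHEMA = {"vitals": {"patient_id", "measured_at", "systolic_bp", "diastolic_bp"}}

SQL_KEYWORDS = {"select", "from", "where", "and", "or", "not",
                "order", "by", "desc", "asc", "limit"}

def validate_identifiers(sql: str, schema: dict) -> bool:
    """Reject queries referencing tables or columns outside the schema.

    Crude word-level check for illustration; a real validator would
    parse the SQL properly.
    """
    known = set(schema) | {col for cols in schema.values() for col in cols}
    words = set(re.findall(r"[a-z_]+", sql.lower()))
    return (words - SQL_KEYWORDS) <= known

# The guessed column "bp" is not in the schema, so the query is rejected.
print(validate_identifiers("SELECT bp FROM vitals", SCHEMA))           # False
print(validate_identifiers("SELECT systolic_bp FROM vitals", SCHEMA))  # True
```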

2. Evaluation Upgrades

  • Stress-test structured tasks: Expand EHR query benchmarks.
  • Fine-grained error analysis: Categorize hallucination types or bias mechanisms.

FAQs: Common Questions About GPT-5 in Healthcare

Q1: What’s GPT-5’s biggest medical breakthrough?

A: Its gains in complex numerical calculations (e.g., drug dosing), where it now matches the top models.

Q2: When should clinicians avoid AI tools?

A: For tasks requiring precise structured data extraction (e.g., EHR queries) or where bias is a concern.

Q3: Has GPT-5 solved the “hallucination” problem?

A: No—accuracy trails the leader by 5%, so human verification remains essential.
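
One reason verification stays manual: automated checks remain blunt. A minimal sketch of a verification gate, with a naive substring match standing in for a real entailment or citation check (all names here are hypothetical):

```python
def flag_unsupported(answer: str, sources: list[str]) -> list[str]:
    """Return the sentences in a model answer that no retrieved source
    supports, so a human can review them before the answer is used.

    Substring matching is a deliberately naive stand-in for a real
    entailment model or citation verifier.
    """
    corpus = " ".join(sources).lower()
    return [s for s in answer.split(". ")
            if s and s.lower().strip(".") not in corpus]
```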

Q4: Why did GPT-5 perform worse on EHR tasks?

A: Struggles with schema constraints (e.g., confusing medical terms or omitting query logic).

Q5: What’s the biggest fairness concern?

A: Recommendations varying by race, even with identical symptoms.


Conclusion

GPT-5 represents progress in medical AI, particularly in calculations and knowledge recall. However, structured data handling and fairness remain critical challenges. As healthcare increasingly adopts AI, rigorous evaluation frameworks like MedHELM will help ensure models are both powerful and responsible.


Based on Stanford CRFM’s 2025 MedHELM report. Technical terms simplified for clarity.