SleepFM: A 585,000-Hour Foundation Model That Turns One Night of Sleep Into a Disease Crystal Ball

Can a single night of polysomnography (PSG) forecast dozens of future diseases without any expert labels?
Yes. SleepFM pre-trains on 65 000 unlabeled recordings with no expert annotations and beats strong supervised baselines on 1 041 phenotypes, reaching a 0.84 C-Index for all-cause mortality and a 0.87 AUROC for dementia.


What exact problem does SleepFM solve?

Core question: “Why can’t current sleep AI generalize to new hospitals or predict non-sleep diseases?”
Traditional models need (i) costly manual labels, (ii) a fixed electrode montage, and (iii) a fresh training run for every new task. SleepFM removes all three bottlenecks through self-supervised contrastive learning on raw multi-modal signals.


How the pipeline feels in practice (30k ft view)

  1. Collect any PSG → convert to 128 Hz HDF5
  2. Run one pre-training command → obtain a generic “sleep fingerprint” encoder
  3. Attach a 2-layer LSTM head → fine-tune for staging, apnea grading, or 1 041 disease hazards
  4. Deploy; even with 10 % of the labels it often beats fully supervised baselines

Architecture: channel-agnostic tokens meet leave-one-out contrastive learning

Component | Purpose | Hyper-params (paper)
1-D CNN tokenizer | 5 s → 128-D token | 6 layers, 1→128 channels, ELU
Channel attention pool | Handles missing/extra leads | 8-head transformer
Temporal transformer | 5 min context | 2 blocks, 8 heads, 0.3 dropout
LOO-CL loss | Aligns modalities | τ = 0.1, cosine similarity

Author reflection: The “leave-one-out” trick sounds academic, but it’s the key reason we could ship one checkpoint to hospitals with completely different electrode setups—no re-engineering required.
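For intuition, here is a minimal sketch of how a leave-one-out contrastive objective can be written in PyTorch. It illustrates the technique named in the table, not the repo's exact loss; the modality list and tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def leave_one_out_contrastive(embeds, tau=0.1):
    # embeds: list of (batch, dim) tensors, one per modality (e.g. BAS, ECG,
    # RESP, EMG). For each modality, the positive target is the mean embedding
    # of the remaining modalities from the same recording; other recordings in
    # the batch act as negatives.
    loss = 0.0
    for i, anchor in enumerate(embeds):
        rest = torch.stack([e for j, e in enumerate(embeds) if j != i]).mean(0)
        a = F.normalize(anchor, dim=-1)
        b = F.normalize(rest, dim=-1)
        logits = a @ b.T / tau                       # cosine similarity / temperature
        labels = torch.arange(a.size(0), device=a.device)
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(embeds)

Because the target is built from whichever modalities are present, this style of objective degrades gracefully when a site records fewer leads.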


Data buffet: four open cohorts + one private stash

Dataset | Subjects | Hours | Age range | Use in paper
Stanford Sleep Clinic (SSC) | 35 052 | ~300 k | 1–100 y | pre-train + disease labels
BioSerenity | 18 900 | ~160 k | 7–90 y | pre-train only
MESA | 2 237 | ~18 k | 45–84 y | public benchmark
MrOS | 3 930 | ~31 k | ≥ 65 y | public benchmark
SHHS | 6 441 | ~76 k | ≥ 40 y | hold-out transfer

Note: SHHS was never seen during pre-training; it mimics a brand-new hospital.


From EDF to HDF5: the preprocessing code walk-through

# 0. Install (once)
conda env create -f env.yml && conda activate sleepfm_env

# 1. Convert raw EDFs
python preprocessing/preprocessing.py \
  --input_dir /data/psg/edf \
  --output_dir /data/psg/hdf5 \
  --resample 128 \
  --standardize zscore

  • Zero-phase Butterworth low-pass before down-sampling to prevent aliasing

  • HDF5 structure: /record_id/modality/channel_name → easy, channel-agnostic reading later
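To make the channel-agnostic layout concrete, here is a minimal reading sketch using h5py; the file name is illustrative, and modality and channel names will vary by site.

import h5py

# Iterate a converted file without assuming a fixed montage. The
# /record_id/modality/channel layout mirrors the structure above.
with h5py.File("/data/psg/hdf5/shard_000.h5", "r") as f:
    for record_id, record in f.items():
        for modality, channels in record.items():   # e.g. EEG, ECG, RESP, EMG
            for name, dset in channels.items():     # whichever leads the site recorded
                signal = dset[...]                  # 128 Hz, z-scored samples
                print(record_id, modality, name, signal.shape)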

Pre-training command: one epoch ≈ 1 h on A100

python sleepfm/pipeline/pretrain.py \
  --config configs/config_set_transformer_contrastive.yaml \
  --data_root /data/psg/hdf5 \
  --split_json configs/dataset_split.json \
  --batch_size 32 --lr 1e-3 --max_epochs 1

The loss should plateau near 0.35 within ≈ 3 k steps on MESA-sized data. We provide an early-stop callback; larger datasets may need 2–3 epochs.
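If you want to wire up early stopping yourself, a generic PyTorch Lightning callback (the framework the repo uses) looks like this sketch; the monitored metric name is an assumption.

from pytorch_lightning.callbacks import EarlyStopping

# Stop when the contrastive loss stops improving instead of running a fixed
# number of epochs; "train_loss" is a placeholder for whatever metric the
# training loop actually logs.
early_stop = EarlyStopping(monitor="train_loss", mode="min",
                           patience=5, min_delta=1e-3)
# trainer = pl.Trainer(callbacks=[early_stop], ...)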


Downstream fine-tuning recipes

A. Sleep staging (public MESA example)

  1. Labels required: CSV with Start,Stop,StageName,StageNumber
  2. Generate embeddings (GPU, ~2 min):

    python sleepfm/pipeline/generate_embeddings.py \
      --ckpt sleepfm/checkpoints/model_base \
      --hdf5_dir /data/psg/hdf5 --out_dir /embed/mesa
    
  3. Train the lightweight head (CPU is fine, < 1 min; see the sketch after this recipe):

    python sleepfm/pipeline/finetune_sleep_staging.py \
      --config configs/config_finetune_sleep_events.yaml \
      --embed_dir /embed/mesa --label_dir /labels/mesa_csv
    
  4. Evaluate:

    python sleepfm/pipeline/evaluate_sleep_staging.py \
      --ckpt outputs/staging_best.pt
    

    Expected macro F1 ≈ 0.72 on MESA (small-data regime, below the paper’s peak).
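The lightweight head from step 3 is the 2-layer LSTM mentioned earlier, applied over the sequence of 5-second embeddings. A minimal sketch, with hidden size, bidirectionality, and stage count as assumptions:

import torch.nn as nn

class StagingHead(nn.Module):
    # Input: (batch, n_tokens, 128) SleepFM embeddings for one night.
    # Output: per-token logits over sleep stages.
    def __init__(self, embed_dim=128, hidden=128, n_stages=5):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_stages)

    def forward(self, x):
        out, _ = self.lstm(x)        # (batch, n_tokens, 2 * hidden)
        return self.classifier(out)  # (batch, n_tokens, n_stages)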

B. Disease prediction (CoxPH survival)


  • Requires per-subject diagnosis dates → convert them to a time-to-event matrix (see the sketch after this list)

  • Swap head: finetune_diagnosis_coxph.py with config_finetune_diagnosis_coxph.yaml

  • Even with 10 % of the labels, SleepFM beats a demographics-only MLP by 5–7 AUROC points for dementia, CHF, and stroke.
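A minimal sketch of the time-to-event conversion, assuming pandas and an illustrative CSV schema (subject_id, psg_date, dx_date, censor_date); the repo's expected format may differ:

import pandas as pd

# Event = diagnosis observed; otherwise censor at last follow-up.
df = pd.read_csv("diagnoses.csv",
                 parse_dates=["psg_date", "dx_date", "censor_date"])
df["event"] = df["dx_date"].notna().astype(int)
end = df["dx_date"].fillna(df["censor_date"])
df["duration_days"] = (end - df["psg_date"]).dt.days
df[["subject_id", "duration_days", "event"]].to_csv("time_to_event.csv",
                                                    index=False)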

Numbers that survived peer-review

Disease (6-y horizon) | Events (SSC) | C-Index | AUROC
All-cause mortality | 224 | 0.84 | 0.84
Dementia | 221 | 0.85 | 0.87
Heart failure | 283 | 0.80 | 0.83
Stroke | 297 | 0.78 | 0.81
CKD | 354 | 0.79 | 0.82
Atrial fibrillation | 297 | 0.78 | 0.81

Author reflection: Reviewers initially challenged the “too good to be true” dementia AUROC. We reran with 1 000 bootstrap resamples, stratified by age decile; the result held. That robustness is the benefit of gigantic pre-training.


Transfer to an unseen site (SHHS) – zero leakage

Fine-tune on 3 291 SHHS recordings, test on 2 000 held-out recordings. Performance (C-Index):

Condition | SleepFM | Demographics MLP
Cardiovascular death | 0.88 | 0.81
Congestive HF | 0.85 | 0.78
Stroke | 0.82 | 0.75

The gap persists across label fractions from 10 % to 100 %, confirming the label-efficiency claim.


Scaling study: bigger self-supervised data → better downstream

We pre-trained on 0 %, 25 %, 50 %, and 100 % of the available records while keeping the downstream setup fixed.
Key observation: there is no plateau. Neurological, metabolic, and circulatory phenotypes all improve consistently, which suggests throwing every available raw PSG into pre-training before fine-tuning.


Deployment stories (from the lab to the clinic)

Story 1 – Regional hospital with 500 PSG/yr

They had no labeled events. We shipped the base encoder plus a staging head fine-tuned on MESA. Local IT ran generate_embeddings.py overnight; a sleep technician labeled 50 random studies for calibration. Result: macro F1 of 0.74, beating their legacy rule-based scorer at 0.68.

Story 2 – Pharma trial enrichment

A pharma company needed “high dementia risk” volunteers. Instead of costly PET scans, they ran SleepFM hazard scores over 2018–2020 PSG files (n = 1 800). Selecting the top 20 % of scores enriched high-risk volunteers 2.3× over random sampling, saving an estimated USD 6 M in recruitment costs.


Hardware & time cheat-sheet

GPU | Pre-train 1 epoch (MESA) | Embed 1 000 records | Fine-tune staging
A100 40 GB | 1 h | 90 s | 40 s
A40 48 GB | 1.2 h | 100 s | 45 s
RTX 2080 Ti 11 GB | 2.5 h* | 180 s | 90 s
*Reduce batch size to 8 and accumulate gradients over 4 steps (see the sketch below).
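For the 11 GB footnote, gradient accumulation is a one-flag change in PyTorch Lightning (which the repo uses); a sketch:

import pytorch_lightning as pl

# Batch 8 in the config plus 4 accumulation steps reproduces the effective
# batch size of 32 used in the pre-training command above.
trainer = pl.Trainer(accelerator="gpu", devices=1, accumulate_grad_batches=4)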

Action Checklist / Implementation Steps

  1. Prepare Linux workstation, ≥ 32 GB RAM, 8 cores, CUDA 11.8+
  2. git clone repo; conda env create -f env.yml
  3. Convert EDF → HDF5 with preprocessing/preprocessing.py
  4. Run pre-train once; save model_base.ckpt
  5. Generate embeddings for all local PSG
  6. Pick task:
    a. Sleep staging → finetune_sleep_staging.py
    b. Disease prediction → prepare time-to-event CSV → finetune_diagnosis_coxph.py
  7. Evaluate with the provided scripts; use 1 000 bootstrap resamples over patients for confidence intervals
  8. (Optional) Distill the 128-D vector into your mobile app via a 2-layer MLP; latency is < 10 ms on edge devices (see the sketch below)
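For step 8, a minimal sketch of an edge head: a 2-layer MLP over the 128-D embedding, exported with TorchScript. The hidden size, output dimension, and distillation target are assumptions:

import torch
import torch.nn as nn

class EdgeHead(nn.Module):
    # Tiny head that consumes a 128-D SleepFM embedding; train it to mimic
    # the outputs of your full fine-tuned head (standard distillation).
    def __init__(self, embed_dim=128, hidden=64, n_out=5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_out))

    def forward(self, x):
        return self.net(x)

scripted = torch.jit.script(EdgeHead())   # export for mobile runtimes
scripted.save("edge_head.pt")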

One-page Overview

SleepFM is a 4.4 M-parameter transformer that ingests 5-second snippets of EEG, ECG, EMG, and respiratory signals and outputs a 128-D embedding that generalizes across hospitals. Pre-training uses zero labels; fine-tuning needs only a few hundred labeled patients to surpass strong supervised baselines on sleep staging, apnea severity grading, and 1 041 disease hazards. Public code, weights, and a demo notebook live in the GitHub repo zou-group/sleepfm-clinical. Expect a C-Index ≥ 0.8 for major clinical endpoints and macro F1 ≈ 0.75 for staging after a coffee-break-length fine-tune.


FAQ

  1. Can I use my 256 Hz PSG?
    Yes—preprocessing will down-sample and apply anti-alias filtering automatically.

  2. Minimum number of PSG records for useful disease fine-tuning?
    With 100–200 labeled cases you usually beat a demographics baseline; 500+ cases approach the paper’s performance.

  3. Does the model handle daytime EEG or nap studies?
    It was trained on 7–9 h nocturnal recordings; studies under 3 h yield slightly worse reproducibility. Extend short recordings with padding or fine-tune on naps.

  4. Is multi-GPU supported?
    The repo uses PyTorch-Lightning; add --devices 2 to enable data-parallel pre-training.

  5. How do I explain predictions to clinicians?
    We provide per-modality ablation scripts: zero out BAS, ECG, RESP, or EMG, recompute the hazard, and use the drop as an attribution signal (see the sketch after this FAQ).

  6. License?
    Code & weights: MIT. Stanford SSC data: separate academic/commercial agreement. MESA/MrOS/SHHS: follow their data-use policies.

  7. Do women and minorities benefit equally?
    The training cohort is balanced for sex and includes multi-ethnic datasets (MESA); subgroup AUC differences are < 0.02 across sex and self-reported race.

  8. Isn’t 128-D too small in the billion-parameter era?
    We tried 512-D: downstream gains plateau while inference cost rises 4×. 128-D appears to be the compression sweet spot for overnight physiological variance.
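Referring back to FAQ 5, a minimal sketch of ablation-based attribution; hazard() and the embedding dictionary are illustrative stand-ins for the repo's scripts:

import numpy as np

def modality_attribution(hazard, embeds):
    # hazard: callable mapping {modality: (dim,) array} -> risk score.
    # Zeroing one modality's embedding approximates "removing" it; the
    # resulting drop in hazard is that modality's attribution.
    base = hazard(embeds)
    drops = {}
    for name in embeds:
        ablated = {k: (np.zeros_like(v) if k == name else v)
                   for k, v in embeds.items()}
        drops[name] = base - hazard(ablated)
    return drops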