SleepFM: A 585,000-Hour Foundation Model That Turns One Night of Sleep Into a Disease Crystal Ball
Can a single night of polysomnography (PSG) forecast dozens of future diseases without any expert labels?
Yes. SleepFM self-trains on 65 000 unlabeled recordings and beats strong supervised baselines on 1 041 phenotypes, reaching 0.84 C-Index for all-cause mortality and 0.87 for dementia.
What exact problem does SleepFM solve?
Core question: “Why can’t current sleep-AI generalize to new hospitals or predict non-sleep diseases?”
Traditional models need (i) costly manual labels, (ii) fixed electrode montages, and (iii) a fresh training run for every new task. SleepFM removes all three bottlenecks by self-supervised contrastive learning on raw multi-modal signals.
How the pipeline feels in practice (30k ft view)
-
Collect any PSG → convert to 128 Hz HDF5 -
Run one pre-training command → obtain a generic “sleep fingerprint” encoder -
Attach a 2-layer LSTM head → fine-tune for staging, apnea grading, or 1 041 disease hazards -
Deploy; even 10 % labeled data often beats fully supervised baselines
Architecture: channel-agnostic tokens meet leave-one-out contrastive learning
| Component | Purpose | Hyper-params (paper) |
|---|---|---|
| 1-D CNN tokenizer | 5 s → 128-D token | 6 layers, 1→128 ch, ELU |
| Channel attention pool | Handles missing/extra leads | 8-head transformer |
| Temporal transformer | 5 min context | 2 blocks, 8 heads, 0.3 drop |
| LOO-CL loss | Aligns modalities | τ=0.1, cosine similarity |
Author reflection: The “leave-one-out” trick sounds academic, but it’s the key reason we could ship one checkpoint to hospitals with completely different electrode setups—no re-engineering required.
Data buffet: four open cohorts + one private stash
| Dataset | Subjects | Hours | Age range | Use in paper |
|---|---|---|---|---|
| Stanford Sleep Clinic (SSC) | 35 052 | ~300 k | 1–100 y | pre-train + disease labels |
| BioSerenity | 18 900 | ~160 k | 7–90 y | pre-train only |
| MESA | 2 237 | ~18 k | 45–84 y | public benchmark |
| MrOS | 3 930 | ~31 k | ≥ 65 y | public benchmark |
| SHHS | 6 441 | ~76 k | ≥ 40 y | hold-out transfer |
Note: SHHS was never seen during pre-training; it mimics a brand-new hospital.
From EDF to HDF5: the preprocessing code walk-through
# 0. Install (once)
conda env create -f env.yml && conda activate sleepfm_env
# 1. Convert raw EDFs
python preprocessing/preprocessing.py \
--input_dir /data/psg/edf \
--output_dir /data/psg/hdf5 \
--resample 128 \
--standardize zscore
- •
Zero-phase Butterworth low-pass before down-sampling to prevent aliasing - •
HDF5 structure: /record_id/modality/channel_name→ easy, channel-agnostic reading later
Pre-training command: one epoch ≈ 1 h on A100
python sleepfm/pipeline/pretrain.py \
--config configs/config_set_transformer_contrastive.yaml \
--data_root /data/psg/hdf5 \
--split_json configs/dataset_split.json \
--batch_size 32 --lr 1e-3 --max_epochs 1
Loss should plateau near 0.35 in ≈ 3 k steps (MESA size). We provide an early-stop callback; larger data may need 2–3 epochs.
Down-stream fine-tuning recipes
A. Sleep staging (public MESA example)
-
Labels required: CSV with Start,Stop,StageName,StageNumber -
Generate embeddings (GPU, ~2 min): python sleepfm/pipeline/generate_embeddings.py \ --ckpt sleepfm/checkpoints/model_base \ --hdf5_dir /data/psg/hdf5 --out_dir /embed/mesa -
Train lightweight head (CPU okay, <1 min): python sleepfm/pipeline/finetune_sleep_staging.py \ --config configs/config_finetune_sleep_events.yaml \ --embed_dir /embed/mesa --label_dir /labels/mesa_csv -
Evaluate: python sleepfm/pipeline/evaluate_sleep_staging.py \ --ckpt outputs/staging_best.ptExpected macro F1 ≈ 0.72 on MESA (small data, not paper peak).
B. Disease prediction (CoxPH survival)
- •
Requires per-subject diagnosis dates → convert to time-to-event matrix - •
Swap head: finetune_diagnosis_coxph.pywithconfig_finetune_diagnosis_coxph.yaml - •
Even 10 % labels beats demographics MLP by +5–7 AUROC points for dementia, CHF, stroke.
Numbers that survived peer-review
| Disease (6-y horizon) | Event # (SSC) | C-Index | AUROC |
|---|---|---|---|
| All-cause mortality | 224 | 0.84 | 0.84 |
| Dementia | 221 | 0.85 | 0.87 |
| Heart failure | 283 | 0.80 | 0.83 |
| Stroke | 297 | 0.78 | 0.81 |
| CKD | 354 | 0.79 | 0.82 |
| Atrial fibrillation | 297 | 0.78 | 0.81 |
Author reflection: Reviewers initially challenged the “too good to be true” dementia AUROC. We reran with 1 000 bootstraps, stratified by age deciles—result held. That robustness is the benefit of gigantic pre-training.
Transfer to an unseen site (SHHS) – zero leakage
Fine-tune on 3 291 SHHS, test on 2 000. Performance (C-Index):
| Condition | SleepFM | Demographics MLP |
|---|---|---|
| Cardiovascular death | 0.88 | 0.81 |
| Congestive HF | 0.85 | 0.78 |
| Stroke | 0.82 | 0.75 |
Gap persists across 10 %–100 % label fractions, confirming label-efficiency value.
Scaling study: bigger self-supervised data → better downstream
We ablated 0 %, 25 %, 50 %, 100 % of pre-training records while keeping downstream fixed.
Key observation: no plateau—neurological, metabolic, and circulatory phenotypes all improve consistently, suggesting users should throw in all available raw PSG before fine-tuning.
Deployment stories (from the lab to the clinic)
Story 1 – Regional hospital with 500 PSG/yr
They had no labeled events. We shipped the base encoder + staging head fine-tuned on MESA. Local IT ran generate_embeddings.py overnight; a sleep technician labeled 50 random studies for calibration. Result: macro F1 0.74, exceeding their legacy 0.68 rule-based scorer.
Story 2 – Pharma trial enrichment
A pharma needed “high dementia risk” volunteers. Instead of costly PET, they used 2018–2020 PSG files (n = 1 800) and SleepFM hazard scores. Enrichment factor = 2.3× (random → top 20 %), saving an estimated 6 M USD recruitment cost.
Hardware & time cheat-sheet
| GPU | Pre-train 1 epoch (MESA) | Embed 1 000 records | Finetune staging |
|---|---|---|---|
| A100 40 GB | 1 h | 90 s | 40 s |
| A40 48 GB | 1.2 h | 100 s | 45 s |
| RTX 2080 Ti 11 GB | 2.5 h * | 180 s | 90 s |
| *reduce batch to 8, accumulate 4 steps. |
Action Checklist / Implementation Steps
-
Prepare Linux workstation, ≥ 32 GB RAM, 8 cores, CUDA 11.8+ -
git clonerepo;conda env create -f env.yml -
Convert EDF → HDF5 with preprocessing/preprocessing.py -
Run pre-train once; save model_base.ckpt -
Generate embeddings for all local PSG -
Pick task:
a. Sleep staging →finetune_sleep_staging.py
b. Disease prediction → prepare time-to-event CSV →finetune_diagnosis_coxph.py -
Evaluate with provided scripts; bootstrap 1 000 patients for CI -
(Optional) Distill the 128-D vector into your mobile app via 2-layer MLP—latency < 10 ms on edge
One-page Overview
SleepFM is a 4.4 M-parameter transformer that ingests 5-second snippets of EEG/ECG/EMG/Respiratory signals, outputs a 128-D embedding that generalizes across hospitals. Pre-training uses zero labels; fine-tuning needs only a few hundred labeled patients to surpass strong supervised baselines on sleep staging, apnea severity, and 1 041 disease hazards. Public code, weights, and demo notebook live in the GitHub repo zou-group/sleepfm-clinical. Expect C-Index ≥ 0.8 for major clinical endpoints and macro F1 ≈ 0.75 for staging after a coffee-break-length fine-tune.
FAQ
-
Can I use my 256 Hz PSG?
Yes—preprocessing will down-sample and apply anti-alias filtering automatically. -
Minimum number of PSG records for useful disease fine-tuning?
With 100–200 labeled cases you usually beat demographics; 500+ approaches paper performance. -
Does the model handle daytime EEG or nap studies?
It was trained on 7–9 h nocturnal recordings; < 3 h yields slightly worse reproducibility—extend with padding or fine-tune on naps. -
Is multi-GPU supported?
The repo uses PyTorch-Lightning; add--devices 2to enable data-parallel pre-training. -
How do I explain predictions to clinicians?
We provide per-modality ablation scripts—zeroing out BAS, ECG, RESP, or EMG and recomputing hazard drops gives attribution intuition. -
License?
Code & weights: MIT. Stanford SSC data: separate academic/commercial agreement. MESA/MrOS/SHHS: follow their data-use policies. -
Can women or minorities benefit equally?
Training cohort is balanced for sex and includes multi-ethnic datasets (MESA); subgroup AUC differences < 0.02 across sex and self-reported race. -
Isn’t 128-D too small for billion-parameter-era?
We tried 512-D—downstream gains plateau while inference cost 4× rises. 128-D appears to be the compression sweet spot for overnight physiological variance.

