SleepFM: A 585,000-Hour Foundation Model That Turns One Night of Sleep Into a Disease Crystal Ball

Can a single night of polysomnography (PSG) forecast dozens of future diseases without any expert labels?
Yes. SleepFM pre-trains on 65 000 unlabeled recordings with no expert annotations and beats strong supervised baselines on 1 041 phenotypes, reaching a 0.84 C-Index for all-cause mortality and a 0.87 AUROC for dementia.


What exact problem does SleepFM solve?

Core question: “Why can’t current sleep AI generalize to new hospitals or predict non-sleep diseases?”
Traditional models need (i) costly manual labels, (ii) a fixed electrode montage, and (iii) a fresh training run for every new task. SleepFM removes all three bottlenecks through self-supervised contrastive learning on raw multi-modal signals.


How the pipeline feels in practice (30k ft view)

  1. Collect any PSG → convert to 128 Hz HDF5
  2. Run one pre-training command → obtain a generic “sleep fingerprint” encoder
  3. Attach a 2-layer LSTM head → fine-tune for staging, apnea grading, or 1 041 disease hazards
  4. Deploy; even with 10 % of the labels it often beats fully supervised baselines

Architecture: channel-agnostic tokens meet leave-one-out contrastive learning

Component | Purpose | Hyper-params (paper)
1-D CNN tokenizer | 5 s → 128-D token | 6 layers, 1→128 channels, ELU
Channel attention pool | Handles missing/extra leads | 8-head transformer
Temporal transformer | 5 min context | 2 blocks, 8 heads, 0.3 dropout
LOO-CL loss | Aligns modalities | τ = 0.1, cosine similarity

Author reflection: The “leave-one-out” trick sounds academic, but it’s the key reason we could ship one checkpoint to hospitals with completely different electrode setups—no re-engineering required.
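For intuition, here is a minimal sketch of how a leave-one-out contrastive objective can be written in PyTorch. It illustrates the technique named in the table, not the repo's exact loss; the modality list and tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def leave_one_out_contrastive(embeds, tau=0.1):
    # embeds: list of (batch, dim) tensors, one per modality (e.g. BAS, ECG,
    # RESP, EMG). For each modality, the positive target is the mean embedding
    # of the remaining modalities from the same recording; other recordings in
    # the batch act as negatives.
    loss = 0.0
    for i, anchor in enumerate(embeds):
        rest = torch.stack([e for j, e in enumerate(embeds) if j != i]).mean(0)
        a = F.normalize(anchor, dim=-1)
        b = F.normalize(rest, dim=-1)
        logits = a @ b.T / tau                       # cosine similarity / temperature
        labels = torch.arange(a.size(0), device=a.device)
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(embeds)

Because the target is built from whichever modalities are present, this style of objective degrades gracefully when a site records fewer leads.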


Data buffet: four open cohorts + one private stash

Dataset | Subjects | Hours | Age range | Use in paper
Stanford Sleep Clinic (SSC) | 35 052 | ~300 k | 1–100 y | pre-train + disease labels
BioSerenity | 18 900 | ~160 k | 7–90 y | pre-train only
MESA | 2 237 | ~18 k | 45–84 y | public benchmark
MrOS | 3 930 | ~31 k | ≥ 65 y | public benchmark
SHHS | 6 441 | ~76 k | ≥ 40 y | hold-out transfer

Note: SHHS was never seen during pre-training; it mimics a brand-new hospital.


From EDF to HDF5: the preprocessing code walk-through

# 0. Install (once)
conda env create -f env.yml && conda activate sleepfm_env

# 1. Convert raw EDFs
python preprocessing/preprocessing.py \
  --input_dir /data/psg/edf \
  --output_dir /data/psg/hdf5 \
  --resample 128 \
  --standardize zscore

  • Zero-phase Butterworth low-pass before down-sampling to prevent aliasing

  • HDF5 structure: /record_id/modality/channel_name → easy, channel-agnostic reading later
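To make the channel-agnostic layout concrete, here is a minimal reading sketch using h5py; the file name is illustrative, and modality and channel names will vary by site.

import h5py

# Iterate a converted file without assuming a fixed montage. The
# /record_id/modality/channel layout mirrors the structure above.
with h5py.File("/data/psg/hdf5/shard_000.h5", "r") as f:
    for record_id, record in f.items():
        for modality, channels in record.items():   # e.g. EEG, ECG, RESP, EMG
            for name, dset in channels.items():     # whichever leads the site recorded
                signal = dset[...]                  # 128 Hz, z-scored samples
                print(record_id, modality, name, signal.shape)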

Pre-training command: one epoch ≈ 1 h on A100

python sleepfm/pipeline/pretrain.py \
  --config configs/config_set_transformer_contrastive.yaml \
  --data_root /data/psg/hdf5 \
  --split_json configs/dataset_split.json \
  --batch_size 32 --lr 1e-3 --max_epochs 1

The loss should plateau near 0.35 within ≈ 3 k steps on MESA-sized data. We provide an early-stop callback; larger datasets may need 2–3 epochs.
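If you want to wire up early stopping yourself, a generic PyTorch Lightning callback (the framework the repo uses) looks like this sketch; the monitored metric name is an assumption.

from pytorch_lightning.callbacks import EarlyStopping

# Stop when the contrastive loss stops improving instead of running a fixed
# number of epochs; "train_loss" is a placeholder for whatever metric the
# training loop actually logs.
early_stop = EarlyStopping(monitor="train_loss", mode="min",
                           patience=5, min_delta=1e-3)
# trainer = pl.Trainer(callbacks=[early_stop], ...)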


Downstream fine-tuning recipes

A. Sleep staging (public MESA example)

  1. Labels required: CSV with Start,Stop,StageName,StageNumber
  2. Generate embeddings (GPU, ~2 min):

    python sleepfm/pipeline/generate_embeddings.py \
      --ckpt sleepfm/checkpoints/model_base \
      --hdf5_dir /data/psg/hdf5 --out_dir /embed/mesa
    
  3. Train the lightweight head (CPU is fine, < 1 min; see the sketch after this recipe):

    python sleepfm/pipeline/finetune_sleep_staging.py \
      --config configs/config_finetune_sleep_events.yaml \
      --embed_dir /embed/mesa --label_dir /labels/mesa_csv
    
  4. Evaluate:

    python sleepfm/pipeline/evaluate_sleep_staging.py \
      --ckpt outputs/staging_best.pt
    

    Expected macro F1 ≈ 0.72 on MESA (small-data regime, below the paper’s peak).
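The lightweight head from step 3 is the 2-layer LSTM mentioned earlier, applied over the sequence of 5-second embeddings. A minimal sketch, with hidden size, bidirectionality, and stage count as assumptions:

import torch.nn as nn

class StagingHead(nn.Module):
    # Input: (batch, n_tokens, 128) SleepFM embeddings for one night.
    # Output: per-token logits over sleep stages.
    def __init__(self, embed_dim=128, hidden=128, n_stages=5):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_stages)

    def forward(self, x):
        out, _ = self.lstm(x)        # (batch, n_tokens, 2 * hidden)
        return self.classifier(out)  # (batch, n_tokens, n_stages)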

B. Disease prediction (CoxPH survival)


  • Requires per-subject diagnosis dates → convert them to a time-to-event matrix (see the sketch after this list)

  • Swap head: finetune_diagnosis_coxph.py with config_finetune_diagnosis_coxph.yaml

  • Even with 10 % of the labels, SleepFM beats a demographics-only MLP by 5–7 AUROC points for dementia, CHF, and stroke.
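A minimal sketch of the time-to-event conversion, assuming pandas and an illustrative CSV schema (subject_id, psg_date, dx_date, censor_date); the repo's expected format may differ:

import pandas as pd

# Event = diagnosis observed; otherwise censor at last follow-up.
df = pd.read_csv("diagnoses.csv",
                 parse_dates=["psg_date", "dx_date", "censor_date"])
df["event"] = df["dx_date"].notna().astype(int)
end = df["dx_date"].fillna(df["censor_date"])
df["duration_days"] = (end - df["psg_date"]).dt.days
df[["subject_id", "duration_days", "event"]].to_csv("time_to_event.csv",
                                                    index=False)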

Numbers that survived peer-review

Disease (6-y horizon) | Events (SSC) | C-Index | AUROC
All-cause mortality | 224 | 0.84 | 0.84
Dementia | 221 | 0.85 | 0.87
Heart failure | 283 | 0.80 | 0.83
Stroke | 297 | 0.78 | 0.81
CKD | 354 | 0.79 | 0.82
Atrial fibrillation | 297 | 0.78 | 0.81

Author reflection: Reviewers initially challenged the “too good to be true” dementia AUROC. We reran with 1 000 bootstrap resamples, stratified by age decile; the result held. That robustness is the benefit of gigantic pre-training.


Transfer to an unseen site (SHHS) – zero leakage

Fine-tune on 3 291 SHHS recordings, test on 2 000 held-out recordings. Performance (C-Index):

Condition | SleepFM | Demographics MLP
Cardiovascular death | 0.88 | 0.81
Congestive HF | 0.85 | 0.78
Stroke | 0.82 | 0.75

The gap persists across label fractions from 10 % to 100 %, confirming the label-efficiency claim.


Scaling study: bigger self-supervised data → better downstream

We pre-trained on 0 %, 25 %, 50 %, and 100 % of the available records while keeping the downstream setup fixed.
Key observation: there is no plateau. Neurological, metabolic, and circulatory phenotypes all improve consistently, which suggests throwing every available raw PSG into pre-training before fine-tuning.


Deployment stories (from the lab to the clinic)

Story 1 – Regional hospital with 500 PSG/yr

They had no labeled events. We shipped the base encoder plus a staging head fine-tuned on MESA. Local IT ran generate_embeddings.py overnight; a sleep technician labeled 50 random studies for calibration. Result: macro F1 of 0.74, beating their legacy rule-based scorer at 0.68.

Story 2 – Pharma trial enrichment

A pharma company needed “high dementia risk” volunteers. Instead of costly PET scans, they ran SleepFM hazard scores over 2018–2020 PSG files (n = 1 800). Selecting the top 20 % of scores enriched high-risk volunteers 2.3× over random sampling, saving an estimated USD 6 M in recruitment costs.


Hardware & time cheat-sheet

GPU | Pre-train 1 epoch (MESA) | Embed 1 000 records | Fine-tune staging
A100 40 GB | 1 h | 90 s | 40 s
A40 48 GB | 1.2 h | 100 s | 45 s
RTX 2080 Ti 11 GB | 2.5 h* | 180 s | 90 s
*Reduce batch size to 8 and accumulate gradients over 4 steps (see the sketch below).
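For the 11 GB footnote, gradient accumulation is a one-flag change in PyTorch Lightning (which the repo uses); a sketch:

import pytorch_lightning as pl

# Batch 8 in the config plus 4 accumulation steps reproduces the effective
# batch size of 32 used in the pre-training command above.
trainer = pl.Trainer(accelerator="gpu", devices=1, accumulate_grad_batches=4)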

Action Checklist / Implementation Steps

  1. Prepare Linux workstation, ≥ 32 GB RAM, 8 cores, CUDA 11.8+
  2. git clone repo; conda env create -f env.yml
  3. Convert EDF → HDF5 with preprocessing/preprocessing.py
  4. Run pre-train once; save model_base.ckpt
  5. Generate embeddings for all local PSG
  6. Pick task:
    a. Sleep staging → finetune_sleep_staging.py
    b. Disease prediction → prepare time-to-event CSV → finetune_diagnosis_coxph.py
  7. Evaluate with the provided scripts; use 1 000 bootstrap resamples over patients for confidence intervals
  8. (Optional) Distill the 128-D vector into your mobile app via a 2-layer MLP; latency is < 10 ms on edge devices (see the sketch below)
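For step 8, a minimal sketch of an edge head: a 2-layer MLP over the 128-D embedding, exported with TorchScript. The hidden size, output dimension, and distillation target are assumptions:

import torch
import torch.nn as nn

class EdgeHead(nn.Module):
    # Tiny head that consumes a 128-D SleepFM embedding; train it to mimic
    # the outputs of your full fine-tuned head (standard distillation).
    def __init__(self, embed_dim=128, hidden=64, n_out=5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_out))

    def forward(self, x):
        return self.net(x)

scripted = torch.jit.script(EdgeHead())   # export for mobile runtimes
scripted.save("edge_head.pt")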

One-page Overview

SleepFM is a 4.4 M-parameter transformer that ingests 5-second snippets of EEG, ECG, EMG, and respiratory signals and outputs a 128-D embedding that generalizes across hospitals. Pre-training uses zero labels; fine-tuning needs only a few hundred labeled patients to surpass strong supervised baselines on sleep staging, apnea severity grading, and 1 041 disease hazards. Public code, weights, and a demo notebook live in the GitHub repo zou-group/sleepfm-clinical. Expect a C-Index ≥ 0.8 for major clinical endpoints and macro F1 ≈ 0.75 for staging after a coffee-break-length fine-tune.


FAQ

  1. Can I use my 256 Hz PSG?
    Yes—preprocessing will down-sample and apply anti-alias filtering automatically.

  2. Minimum number of PSG records for useful disease fine-tuning?
    With 100–200 labeled cases you usually beat a demographics baseline; 500+ cases approach the paper’s performance.

  3. Does the model handle daytime EEG or nap studies?
    It was trained on 7–9 h nocturnal recordings; studies under 3 h yield slightly worse reproducibility. Extend short recordings with padding or fine-tune on naps.

  4. Is multi-GPU supported?
    The repo uses PyTorch-Lightning; add --devices 2 to enable data-parallel pre-training.

  5. How do I explain predictions to clinicians?
    We provide per-modality ablation scripts: zero out BAS, ECG, RESP, or EMG, recompute the hazard, and use the drop as an attribution signal (see the sketch after this FAQ).

  6. License?
    Code & weights: MIT. Stanford SSC data: separate academic/commercial agreement. MESA/MrOS/SHHS: follow their data-use policies.

  7. Do women and minorities benefit equally?
    The training cohort is balanced for sex and includes multi-ethnic datasets (MESA); subgroup AUC differences are < 0.02 across sex and self-reported race.

  8. Isn’t 128-D too small in the billion-parameter era?
    We tried 512-D: downstream gains plateau while inference cost rises 4×. 128-D appears to be the compression sweet spot for overnight physiological variance.
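Referring back to FAQ 5, a minimal sketch of ablation-based attribution; hazard() and the embedding dictionary are illustrative stand-ins for the repo's scripts:

import numpy as np

def modality_attribution(hazard, embeds):
    # hazard: callable mapping {modality: (dim,) array} -> risk score.
    # Zeroing one modality's embedding approximates "removing" it; the
    # resulting drop in hazard is that modality's attribution.
    base = hazard(embeds)
    drops = {}
    for name in embeds:
        ablated = {k: (np.zeros_like(v) if k == name else v)
                   for k, v in embeds.items()}
        drops[name] = base - hazard(ablated)
    return drops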