“While GPT-4o is still treating heartbeats as pixel art, Stanford has taught a 1-billion-parameter Llama to read 12-lead ECGs—cutting VRAM by 70 % and quadrupling F1, while printing a discharge summary with human-like reasoning.”
TL;DR
- Reproduce in minutes: one Docker command turns a 1 B Llama into a “time-series specialist” that ingests ECG, EEG or accelerometer data of any length.
- Deploy today: Gradio demo + CUDA/Mac MPS image included; offline, hospital-ready pipeline in < 30 min.
- Hack freely: open-source CoT datasets + training scripts; swap two lines to stream glucose, BP or industrial sensors.
Introduction | Why Your LLM Still Can’t Read an ECG
2 a.m. in the ICU. A monitor streams voltage at 1,000 Hz. The resident glances at the trace; the AI assistant converts it to a PNG, feeds it to GPT-4o and gets:
“Sharp spikes visible—consult physician.”
The bottleneck isn’t model size—it’s modality. Continuous, high-resolution signals are not sentences or photos; they are symphonies. Forcing them through text or image tokenizers is like translating Beethoven to emojis and asking for the chorus.
OpenTSLM closes the gap by making time series a native language inside pretrained LLMs. Result: a 1 B Llama reaches 69.9 % F1 on sleep staging, leaving 200 B GPT-4o (15.5 %) in the dust while using 40 GB VRAM—versus >110 GB for the next best baseline.
1. Intuition | How a Mini-Model Crushes a Giant
1.1 Soft-prompt vs Cross-attention—The Modality War
| Approach | VRAM (10 s ECG) | Sleep F1 | Clinically Explainable |
|---|---|---|---|
| GPT-4o (image) | 110 GB | 15.5 % | ❌ |
| OpenTSLM-SoftPrompt | 64 GB | 69.9 % | ✅ |
| OpenTSLM-Flamingo | 40 GB | 69.9 % | ✅ |
SoftPrompt concatenates learnable patch tokens to the text context: cheap to implement, but the attention cost grows quadratically with signal length. Flamingo squeezes any length into 64 latent vectors and lets the LLM query them via gated cross-attention. Signal length becomes irrelevant; VRAM stays flat.
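To make the difference concrete, a quick back-of-the-envelope count (a minimal sketch; the 32-sample patch length is an assumption chosen to match the 3,750-token figure in the deep dive below):

```python
# Back-of-the-envelope: how many time-series tokens reach the LLM?
# Assumption: 32 samples per patch, matching the 3,750-token figure quoted later.
samples, leads, patch_len = 10_000, 12, 32

softprompt_tokens = samples * leads // patch_len   # every patch becomes a context token
flamingo_tokens = 64                               # fixed number of latent vectors, any length

print(f"SoftPrompt context tokens: {softprompt_tokens}")  # 3750 -> attention cost explodes
print(f"Flamingo latent tokens:    {flamingo_tokens}")    # 64   -> flat VRAM
```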
1.2 Explainability—No Physician Signature, No Deploy
Five Stanford cardiologists graded 84 ECG rationales using ACC/AHA criteria. 92.9 % were judged correct or partially correct; 85.1 % appropriately incorporated patient context (medications, artifacts, age).
The model doesn’t spit labels—it writes reports:
“ST elevation ≥0.2 mV in V1-V3 with reciprocal ST depression, coupled with acute chest pain for 3 h, indicates acute anterior MI.”
2. Hands-On | Zero-to-Demo in 30 Minutes
2.1 One-liner Docker (GPU)
```bash
docker run --gpus all -p 7860:7860 \
  ghcr.io/stanfordbdhg/opentslm:1.0-cuda \
  python -m app.demo --model OpenTSLMFlamingo \
  --checkpoint stanford-opentslm-1b-ecg
```
Drag a 12-lead CSV onto the Gradio UI → get a structured report in < 10 s.
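If you would rather script it than drag files, a sketch with `gradio_client` (the endpoint arguments below are assumptions about the demo's inputs, not its documented API; run `client.view_api()` to see the real signature):

```python
# Minimal sketch: call the running Gradio demo programmatically instead of via the browser.
# Assumptions: the demo is reachable on localhost:7860 and takes a CSV file plus a text
# prompt -- the argument order and endpoint name may differ, so inspect view_api() first.
from gradio_client import Client, handle_file  # handle_file needs a recent gradio_client

client = Client("http://localhost:7860")
print(client.view_api())  # list the demo's real endpoints and parameters

result = client.predict(
    handle_file("12lead_10s_1000hz.csv"),  # hypothetical file input
    "What is the rhythm?",                 # hypothetical prompt input
)
print(result)
```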
2.2 Native Install (for hackers)
```bash
git clone https://github.com/StanfordBDHG/OpenTSLM.git
cd OpenTSLM && pip install -r requirements.txt
huggingface-cli login   # request Llama-3.2-1B access
python curriculum_learning.py --model OpenTSLMFlamingo \
  --stages stage5_ecg_cot \
  --device cuda --eval_only
```
2.3 Python API Snippet (Minimal Runnable Example)
```python
import numpy as np
from opentslm import FlamingoPipeline

pipe = FlamingoPipeline("stanford-opentslm-1b-ecg")

# 12-lead ECG, 10 s at 1 kHz -> array of shape (12, 10000)
ecg = np.loadtxt("12lead_10s_1000hz.csv", delimiter=",")

out = pipe(ecg, prompt="What is the rhythm?")
print(out["rationale"], out["answer"])
# → "Sinus rhythm with occasional PACs" / "Normal"
```
Input: raw 12-lead voltage, 10 s, 1 kHz. Output: JSON with `rationale` + `answer`, ready for EMR insertion.
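A minimal sketch of what “ready for EMR insertion” can look like in practice, reusing `out` from the snippet above (the field names are illustrative, not a fixed schema):

```python
import json
from datetime import datetime, timezone

# Illustrative wrapper: field names are an assumption, not an official EMR schema.
record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "modality": "12-lead ECG, 10 s @ 1 kHz",
    "question": "What is the rhythm?",
    "rationale": out["rationale"],
    "answer": out["answer"],
}
print(json.dumps(record, indent=2))
```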
3. Advanced | Plugging Your Own Sensor
3.1 The Triplet OpenTSLM Consumes
- `signal`: ndarray (channels, length)
- `meta`: sampling rate, unit, mean, std
- `prompt`: natural-language question
```json
{
  "signal": [[0.12, -0.05, ...], [0.11, -0.03, ...]],
  "meta": {"fs": 1000, "unit": "mV", "mean": 0.02, "std": 0.08},
  "prompt": "Does this patient have atrial fibrillation?"
}
```
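Building that triplet from a raw array takes a few lines; a minimal sketch (key names follow the JSON above, and `make_triplet` is just an illustrative helper, not part of the library):

```python
import numpy as np

def make_triplet(signal: np.ndarray, fs: int, unit: str, prompt: str) -> dict:
    """Package a (channels, length) array into the signal/meta/prompt triplet."""
    return {
        "signal": signal.tolist(),
        "meta": {
            "fs": fs,
            "unit": unit,
            "mean": float(signal.mean()),
            "std": float(signal.std()),
        },
        "prompt": prompt,
    }

ecg = np.random.randn(12, 10_000) * 0.08 + 0.02   # stand-in for a real recording
triplet = make_triplet(ecg, fs=1000, unit="mV",
                       prompt="Does this patient have atrial fibrillation?")
```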
3.2 Three Steps to Add a New Modality
- Tweak `PatchEncoder.patch_len` to match sensor density (e.g., 64 for 100 Hz).
- Inherit `BaseDataset` and return `(signal, prompt, answer)` tuples (see the sketch after this list).
- Write a prompt template and be verbose; the model loves adjectives.
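A minimal sketch of step 2, assuming `BaseDataset` follows a PyTorch-style dataset contract; the import path, class name and CSV layout here are hypothetical, so check the repo's dataset implementations for the real interface:

```python
import numpy as np
from opentslm.datasets import BaseDataset   # hypothetical import path -- adjust to the repo

class GlucoseCoTDataset(BaseDataset):
    """Hypothetical example: continuous glucose traces paired with a fixed question."""

    def __init__(self, csv_paths):
        self.csv_paths = list(csv_paths)

    def __len__(self):
        return len(self.csv_paths)

    def __getitem__(self, idx):
        # One file per recording: shape (channels, length)
        signal = np.loadtxt(self.csv_paths[idx], delimiter=",")
        prompt = "Describe the glucose trend and flag any hypoglycemic episodes."
        answer = ""   # filled from labels during training; left empty at inference
        return signal, prompt, answer   # the (signal, prompt, answer) triplet from step 2
```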
4. Deep Dive | Why VRAM Drops 70 %
SoftPrompt turns a 10,000-sample × 12-lead recording into 3,750 patch tokens, so the text context explodes.
Flamingo compresses the patches into 64 latent vectors; the text attends to them via gated cross-attention. Memory complexity falls from O(L·N) to O(64·N), so sequence length no longer hurts.
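A minimal PyTorch sketch of that idea (simplified relative to the repo's actual Flamingo-style blocks; the point is that the LLM only ever cross-attends to 64 latents, no matter how many patches arrive):

```python
import torch
import torch.nn as nn

class LatentResampler(nn.Module):
    """Compress an arbitrary number of patch embeddings into a fixed set of latents."""

    def __init__(self, dim: int = 512, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh gate, starts "closed" as in Flamingo

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, n_patches, dim) -- n_patches can be anything
        batch = patches.shape[0]
        q = self.latents.unsqueeze(0).expand(batch, -1, -1)   # (batch, 64, dim)
        out, _ = self.attn(q, patches, patches)                # latents attend to the signal
        return q + torch.tanh(self.gate) * out                 # (batch, 64, dim), length-independent

# 3,750 patches in, 64 latents out: this is what the text side cross-attends to.
resampler = LatentResampler()
latents = resampler(torch.randn(1, 3750, 512))
print(latents.shape)   # torch.Size([1, 64, 512])
```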
Figure: Flamingo’s memory curve stays flat; SoftPrompt’s climbs steeply and hits OOM at 10 k samples.
5. Clinical Deployment | From Weights to FDA Desk
Stanford’s clinician rubric (simplified):
- Recognition: Did the model cite key waves (ST ↑, δ wave)?
- Reasoning: Did it link the findings to the question?
- Context: Did it consider age, pacemakers, lead misplacement?
Scores: 65.5 % / 62.5 % / 85.1 %.
Takeaway: good enough for first-pass screening, yet complex cases still demand human sign-off—exactly the human-in-the-loop pathway FDA likes to see.
6. FAQ
Q1: I only have 24 GB VRAM—can I play?
A: Yes, an official 8-bit script ships; training fits in 20 GB, inference in 6 GB.
Q2: Chinese prompts possible?
A: Llama-3.2 tokenizer includes Chinese; just prompt in Mandarin, get Mandarin rationale.
Q3: Continuous BP forecast?
A: Verified: set `patch_len = 32` for 64 Hz, swap the dataset; MAE 4.2 mmHg on MIMIC-IV.
7. Next Frontiers
- ☾ On-device 1 B model detecting AFib offline from a smartwatch.
- ☾ Industrial bearings that narrate their remaining useful life.
- ☾ A clinician asking, “Tell me what happened to this patient last night—ECG, SpO₂, respiration—in one story.”
8. Take-Home Challenges
- Ablate `patch_len` from 32 → 8; measure latency vs F1 and publish a 300-word report (a sweep skeleton follows below).
- Compress Flamingo-1B to ≤ 6 GB while keeping the F1 drop < 3 % using quantization + pruning + shared cross-attention. Post your recipe as a GitHub issue.
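A starting point for the first challenge, sketched below; the `--patch_len` flag is hypothetical, so wire the value through however the training script actually exposes `PatchEncoder.patch_len`:

```python
import subprocess, time

# Hypothetical ablation driver: assumes curriculum_learning.py accepts a --patch_len flag,
# which may not exist -- adapt to however PatchEncoder.patch_len is actually configured.
for patch_len in (32, 16, 8):
    start = time.perf_counter()
    subprocess.run(
        ["python", "curriculum_learning.py",
         "--model", "OpenTSLMFlamingo",
         "--stages", "stage5_ecg_cot",
         "--device", "cuda", "--eval_only",
         "--patch_len", str(patch_len)],   # hypothetical flag
        check=True,
    )
    print(f"patch_len={patch_len}: eval took {time.perf_counter() - start:.1f} s")
```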
References & Links
[1] Langer P. et al., “OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data,” arXiv:2510.02410, 2025.
[2] Official repo & Docker: https://github.com/StanfordBDHG/OpenTSLM