Health Predictions from Your Wrist: Why “Behavior” Beats Raw Sensor Data

Smartwatches and fitness trackers now sit on more than a billion wrists, quietly logging heartbeats, footsteps, and sleep minutes. Most research still focuses on the millisecond-level waveforms these devices produce—PPG, ECG, accelerometer streams. A new large-scale study led by Apple and the American Heart Association flips the script. It shows that higher-level behavior metrics—things like daily step counts, resting heart-rate trends, and six-minute walk distance—can be turned into a foundation model that outperforms or complements traditional biosignal models on fifty-seven real-world health tasks.

Below you will find a pragmatic, end-to-end walk-through of the paper Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions (ICML 2025). All numbers, methods, and caveats come directly from the paper; code snippets are illustrative sketches rather than the authors' implementation.


1. The Data Lake Behind the Study

| Statistic | Value |
| --- | --- |
| Dataset | Apple Heart & Movement Study (AHMS) |
| Participants | 161,855 U.S. residents with ≥90 days of follow-up |
| Observation span | Up to 5 years |
| Granularity | Hourly summaries |
| Total volume | 15.14 million participant-weeks ≈ 2.5 billion hours |

Data collection was opt-in via the Apple Research app. Participants could pause or delete data at any time. The cohort skews toward higher income, White, and male users—important context when interpreting generalizability.


2. From Raw Sensors to 27 Human-Readable Signals

Instead of feeding raw waveforms into the model, the authors used 27 validated HealthKit quantities. Each quantity is computed by Apple’s on-device algorithms and synced hourly.

| Category | Examples | Typical Sampling | Notes |
| --- | --- | --- | --- |
| Activity | Step count (watch & phone), active calories, flights climbed, exercise minutes | Hourly | Sums for cumulative metrics |
| Cardiovascular | Resting heart rate, walking heart-rate average, heart-rate variability (SDNN) | Daily or hourly | HRV uses overnight windows |
| Vital signs | Overnight respiratory rate, wrist temperature, blood oxygen | Overnight or infrequent | SpO₂ only on Series 6+ |
| Gait & mobility | Walking speed, step length, double-support percentage, fall count | Every walk or opportunistic | Falls logged by <3% of users |
| Body measurements | BMI, body mass | Manual entry | Sparsest variables |
| Functional capacity | VO₂max, six-minute walk distance | Weekly or after workouts | Require outdoor GPS workouts |

These metrics are irregularly sampled: some users log 1,000+ hourly observations per week, others fewer than 300. Missingness is handled explicitly in the model design.


3. Model Design: WBM (Wearable Behavior Model)

3.1 Pre-processing Pipeline

  1. Weekly windows: each sample = 168 consecutive hours starting Monday 00:00.
  2. Inclusion criteria: ≥5 days of heart-rate data in the week, ≥5 qualifying weeks per user.
  3. Z-scores: every variable normalized to zero mean, unit variance across the full training set.
  4. Missingness mask: one binary flag per variable per hour, appended as extra channels (27 → 54).

Resulting input tensor: 168 × 54.
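The four pre-processing steps above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code; `preprocess_week` and the toy statistics are assumptions, but the mask-then-impute order and the 168×54 output shape follow the paper.

```python
import numpy as np

HOURS, VARS = 168, 27  # one week of hourly data, 27 HealthKit variables

def preprocess_week(week, train_mean, train_std):
    """Z-score a (168, 27) weekly window and append a binary
    missingness mask, yielding the (168, 54) model input."""
    mask = (~np.isnan(week)).astype(np.float32)   # 1 = observed, 0 = missing
    z = (week - train_mean) / train_std           # training-set statistics
    z = np.nan_to_num(z, nan=0.0)                 # global-mean imputation = 0 after z-scoring
    return np.concatenate([z, mask], axis=-1)     # (168, 54)

# Toy example: a week with a 10-hour gap in variable 0
rng = np.random.default_rng(0)
week = rng.normal(60.0, 5.0, size=(HOURS, VARS))
week[10:20, 0] = np.nan
x = preprocess_week(week, train_mean=60.0, train_std=5.0)
print(x.shape)  # (168, 54)
```

Note that the mask is computed before imputation; filling gaps first would make every hour look observed.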

3.2 Tokenization & Architecture Search

The team ran a grid search over three tokenizers and three backbones. TST + Mamba-2 won.

| Tokenizer | Idea | Why It Did (or Didn't) Work |
| --- | --- | --- |
| TST (dense) | Zero-fill gaps → 168×54 matrix → linear patch embedding | Surprisingly robust; global-mean imputation beat subject-level imputation |
| mTAN | Multi-time attention with learned masks | Good on paper, slightly worse in practice |
| Tuple | One token per observation (time, variable, value) | Handles sparsity natively, but memory grows with sequence length |

| Backbone | Key Property | Outcome |
| --- | --- | --- |
| Self-attention Transformer | Absolute positional encodings | Competitive but memory-hungry |
| Rotary Transformer | Relative RoPE encodings | Slight edge on some tasks |
| Mamba-2 | Bidirectional selective state-space | Best trade-off of speed, memory, and accuracy |

Final hyper-parameters (all ablated):

  • 24 Mamba-2 layers
  • Hidden size 256
  • 2.7 % dropout
  • 23.3 % token-drop augmentation
  • InfoNCE + KoLeo regularization (λ = 0.21)
  • Trained 6 epochs on 8×A100 GPUs (≈16 hours)
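The token-drop augmentation can be illustrated with a small NumPy sketch. The paper does not spell out the exact masking scheme, so dropping whole hourly tokens is an assumption here; only the 23.3% rate comes from the hyper-parameter list above.

```python
import numpy as np

def token_drop(tokens, drop_rate=0.233, rng=None):
    """Randomly zero out whole hourly tokens during pre-training.
    tokens: (T, d) array of per-hour embeddings."""
    rng = rng or np.random.default_rng()
    keep = rng.random(tokens.shape[0]) >= drop_rate  # True = keep this hour
    out = tokens.copy()
    out[~keep] = 0.0                                 # dropped hours become zero vectors
    return out, keep

rng = np.random.default_rng(42)
tokens = rng.normal(size=(168, 256))                 # one week of 256-d hourly tokens
augmented, keep = token_drop(tokens, drop_rate=0.233, rng=rng)
```

Dropping roughly a quarter of the hours forces the encoder to rely on weekly context rather than any single observation.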

4. Downstream Evaluation: 57 Tasks, Two Families

4.1 Static (Inter-subject) Tasks

Predict traits that rarely change.

| Task | Metric | Baseline | WBM | PPG | WBM+PPG |
| --- | --- | --- | --- | --- | --- |
| Age | MAE ↓ | 7.89 | 3.67 | 2.89 | 2.46 |
| Biological sex | AUROC ↑ | 0.931 | 0.999 | 0.997 | 0.999 |

The baseline uses the simple per-variable mean and standard deviation of the 27 variables, plus age, sex, and BMI where the task permits them as inputs.
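That baseline takes only a few lines to reproduce. A hedged sketch, assuming NumPy arrays with NaNs marking missing hours (`baseline_features` is an illustrative name, not from the paper):

```python
import numpy as np

def baseline_features(weeks):
    """weeks: (N, 168, 27) array with NaNs for missing hours.
    Returns (N, 54): per-variable mean and std over observed hours."""
    mean = np.nanmean(weeks, axis=1)
    std = np.nanstd(weeks, axis=1)
    return np.concatenate([np.nan_to_num(mean), np.nan_to_num(std)], axis=-1)

# Synthetic batch of 8 weekly windows, one with a gap
rng = np.random.default_rng(1)
weeks = rng.normal(size=(8, 168, 27))
weeks[0, :50, 0] = np.nan
feats = baseline_features(weeks)
```

Feeding these 54 summary features to a linear model gives the "Baseline" column in the table above.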

Medical history & medications (47 tasks)

  • WBM alone beats the baseline in 39 of 47 cases.
  • The combination (WBM+PPG) wins in 42 of 47 cases with a median AUROC gain of 0.009 over the best single model.

4.2 Dynamic (Intra-subject) Tasks

Detect time-varying states within the same person.

| Task | # Weeks | WBM AUROC | PPG AUROC | WBM+PPG AUROC | Notes |
| --- | --- | --- | --- | --- | --- |
| Pregnancy | 24k | 0.864 | 0.873 | 0.921 | Large behavioral + physiological shift |
| Respiratory infection | 96k | 0.749 | 0.730 | 0.763 | Captures activity dips & HRV changes |
| Diabetes (HbA1c) | 116k | 0.826 | 0.866 | 0.872 | PPG alone already strong |
| Injury | 26k | 0.680 | 0.673 | 0.688 | Gait metrics give extra signal |
| Sleep efficiency | 671k | 0.424 (vs 0.182 baseline) | 0.182 | 0.438 | Overnight inactivity captured by behavior |

5. Practical Take-aways for Product Teams

5.1 When Behavior Data Wins

  • Lifestyle-driven conditions (sleep disorders, pregnancy, orthopedic injuries)
  • 24 × 7 coverage required (PPG gaps at night, during charging)
  • Resource-constrained devices (behavior metrics already computed on watch)

5.2 When Raw Biosignals Still Matter

  • Physiology-centered diseases (diabetes, atrial fibrillation)
  • Short-term anomalies (arrhythmia bursts)
  • Regulatory pathways where PPG/ECG algorithms have already been cleared by the FDA

5.3 Deployment Checklist

| Step | Recommendation |
| --- | --- |
| Data readiness | Ensure ≥5 days of watch wear per week per user |
| Feature parity | Replicate Apple's 27 signals or map to your ecosystem's equivalents |
| Model choice | Start with ridge-probed embeddings; fine-tune only if labels >10k |
| Fairness audit | Benchmark across age, sex, and race subgroups (the paper provides baselines) |
| Privacy | On-device computation, plus differential privacy if cloud aggregation is needed |
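The "ridge-probed embeddings" recommendation amounts to fitting a regularized linear model on frozen encoder outputs. A minimal scikit-learn sketch, with a synthetic array standing in for WBM embeddings:

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1,000 frozen 256-d embeddings and a binary label
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 256))
w = rng.normal(size=256)
y = (emb @ w + 0.5 * rng.normal(size=1000)) > 0

# Ridge probe: the encoder stays frozen, only this linear head is fit
X_tr, X_te, y_tr, y_te = train_test_split(emb, y, random_state=0)
probe = RidgeClassifier(alpha=1.0).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
```

Because only a linear head is trained, probing is cheap enough to run across all 57 tasks; fine-tuning the full encoder is worth it only once labels are plentiful.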

6. Limitations the Authors Highlight

  1. Selection bias: iPhone + Apple Watch users ≠ general population.
  2. Label quality: Self-reported medical history may misclassify.
  3. Hardware lock-in: Algorithms tuned for Apple sensors; transfer unknown.
  4. Static windows: Weekly snapshots miss sub-week events (e.g., arrhythmia lasting minutes).
  5. Ethical risk: Could widen health-equity gaps if only high-income users benefit.

7. FAQ: Quick Answers from the Paper

Q1: Could I run this on Android or Garmin data?
A: Conceptually yes, but you must re-map the 27 variables and retrain. Gait metrics, VO₂max, and HRV algorithms differ across vendors.

Q2: How much GPU budget do I need?
A: A reproduction with 256-d embeddings and batch size 192 converged in under 20 GPU-hours on A100s. A consumer-grade RTX 4090 (24 GB) can fit a 12-layer model.

Q3: Is the model interpretable?
A: Linear probe coefficients give per-feature importance. Sparse variables (VO₂max, 6-min walk) often carry outsized weight despite low frequency.

Q4: Does it forecast future events?
A: No. WBM produces snapshot embeddings; forecasting would require an autoregressive decoder or a predictor head.
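For illustration only, such a predictor head might look like the following PyTorch sketch. `EventPredictorHead` is hypothetical and not part of the paper; it simply maps a frozen weekly embedding to a next-week event probability.

```python
import torch
import torch.nn as nn

class EventPredictorHead(nn.Module):
    """Hypothetical head (not from the paper): map this week's frozen
    WBM embedding to the probability of an event in the next week."""
    def __init__(self, d_model=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, emb):                       # (B, d_model)
        return torch.sigmoid(self.net(emb)).squeeze(-1)  # (B,) probabilities

head = EventPredictorHead()
probs = head(torch.randn(4, 256))                 # 4 synthetic embeddings
```

Training such a head would need labels shifted one week forward, which the paper's snapshot evaluation does not provide.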


8. Minimal Reproduction Recipe (Pseudocode)

# 1. Prepare weekly tensors
# Each row: user_id, week_start, variable, hour, value
import polars as pl
import torch.nn as nn

df = pl.read_parquet('hourly_behavior.parquet')  # pivot needs an eager frame

# 2. Pivot to a 168×54 matrix per week (27 variables + 27 masks).
# The missingness mask must be computed BEFORE imputation,
# otherwise every entry looks observed.
wide = df.pivot(on='variable',
                index=['user_id', 'week_start', 'hour'],
                values='value')
mask = wide.select(pl.exclude('user_id', 'week_start', 'hour')
                     .is_not_null().cast(pl.UInt8).name.suffix('_mask'))
tensor = pl.concat([wide.fill_null(0), mask], how='horizontal')
# fill_null(0) assumes values are already z-scored, so 0 = global mean

# 3. Tokenize: linear projection of each hour
class TSTTokenizer(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.proj = nn.Linear(54, d_model)

    def forward(self, x):      # (B, 168, 54)
        return self.proj(x)    # (B, 168, d_model)

# 4. Mamba-2 encoder (the paper's encoder is bidirectional;
# mamba_ssm's Mamba layer is causal, so this stack is a simplification)
from mamba_ssm import Mamba
encoder = nn.Sequential(*[Mamba(d_model=256) for _ in range(24)])

# 5. InfoNCE + KoLeo loss (see paper Appendix A.4.1)
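The combined objective can be approximated as follows. This is a generic InfoNCE plus Kozachenko-Leonenko (KoLeo) formulation with λ = 0.21 taken from the hyper-parameter list; the paper's exact formulation (Appendix A.4.1) may differ in details such as temperature and normalization.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """Contrastive loss between two L2-normalized views of the same batch;
    matched rows (the diagonal) are the positive pairs."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def koleo(z, eps=1e-8):
    """KoLeo regularizer: spread embeddings out by maximizing the
    log distance to each point's nearest neighbor."""
    d = np.linalg.norm(z[:, None] - z[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # ignore self-distances
    return -np.mean(np.log(d.min(axis=1) + eps))

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(16, 32)), rng.normal(size=(16, 32))
loss = info_nce(z1, z2) + 0.21 * koleo(z1)
```

InfoNCE pulls two augmented views of the same week together; the KoLeo term discourages the embeddings from collapsing onto a few points.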

9. Closing Thoughts

The study reframes wearable analytics: start with human-level behavior, not millisecond physiology. For most everyday health insights—sleep quality, pregnancy detection, injury recovery—aggregated behavior signals are simpler, cheaper, and surprisingly powerful. Raw PPG still rules for arrhythmia or glucose risk, but combining both modalities yields the best of both worlds.

If you are building next-gen wellness apps, remote patient monitoring, or risk-scoring engines, treat this paper as a blueprint: gather high-quality behavior streams, respect missingness, and let a lightweight encoder do the heavy lifting.


Full citation:
Erturk E., Kamran F. et al. Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions. Proc. ICML 2025.