Health Predictions from Your Wrist: Why “Behavior” Beats Raw Sensor Data

Smartwatches and fitness trackers now sit on more than a billion wrists, quietly logging heartbeats, footsteps, and sleep minutes. Most research still focuses on the millisecond-level waveforms these devices produce—PPG, ECG, accelerometer streams. A new large-scale study led by Apple and the American Heart Association flips the script. It shows that higher-level behavior metrics—things like daily step counts, resting heart-rate trends, and six-minute walk distance—can be turned into a foundation model that outperforms or complements traditional biosignal models on fifty-seven real-world health tasks.

Below you will find a pragmatic, end-to-end walk-through of the paper Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions (ICML 2025). All numbers, methods, and caveats come directly from the paper; code snippets are illustrative sketches rather than the authors' implementation.


1. The Data Lake Behind the Study

| Statistic | Value |
| --- | --- |
| Dataset | Apple Heart & Movement Study (AHMS) |
| Participants | 161,855 U.S. residents with ≥90 days of follow-up |
| Observation span | Up to 5 years |
| Granularity | Hourly summaries |
| Total volume | 15.14 million participant-weeks ≈ 2.5 billion hours |

Data collection was opt-in via the Apple Research app. Participants could pause or delete data at any time. The cohort skews toward higher income, White, and male users—important context when interpreting generalizability.


2. From Raw Sensors to 27 Human-Readable Signals

Instead of feeding raw waveforms into the model, the authors used 27 validated HealthKit quantities. Each quantity is computed by Apple’s on-device algorithms and synced hourly.

| Category | Examples | Typical Sampling | Notes |
| --- | --- | --- | --- |
| Activity | Step count (watch & phone), active calories, flights climbed, exercise minutes | Hourly | Sums for cumulative metrics |
| Cardiovascular | Resting heart rate, walking heart-rate average, heart-rate variability (SDNN) | Daily or hourly | HRV uses overnight windows |
| Vital signs | Overnight respiratory rate, wrist temperature, blood oxygen | Overnight or infrequent | SpO₂ only on Series 6+ |
| Gait & mobility | Walking speed, step length, double-support percentage, fall count | Every walk or opportunistic | Falls logged by <3% of users |
| Body measurements | BMI, body mass | Manual entry | Sparsest variables |
| Functional capacity | VO₂max, six-minute walk distance | Weekly or after workouts | Require outdoor GPS workouts |

These metrics are irregularly sampled: some users log 1,000+ hourly observations per week, others fewer than 300. Missingness is handled explicitly in the model design.


3. Model Design: WBM (Wearable Behavior Model)

3.1 Pre-processing Pipeline

  1. Weekly windows: each sample = 168 consecutive hours starting Monday 00:00.
  2. Inclusion criteria: ≥5 days of heart-rate data in the week, ≥5 qualifying weeks per user.
  3. Z-scores: every variable normalized to zero mean, unit variance across the full training set.
  4. Missingness mask: one binary flag per variable per hour, appended as extra channels (27 → 54).

Resulting input tensor: 168 × 54.
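The four pre-processing steps above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code; `preprocess_week` and the toy statistics are assumptions, but the mask-then-impute order and the 168×54 output shape follow the paper.

```python
import numpy as np

HOURS, VARS = 168, 27  # one week of hourly data, 27 HealthKit variables

def preprocess_week(week, train_mean, train_std):
    """Z-score a (168, 27) weekly window and append a binary
    missingness mask, yielding the (168, 54) model input."""
    mask = (~np.isnan(week)).astype(np.float32)   # 1 = observed, 0 = missing
    z = (week - train_mean) / train_std           # training-set statistics
    z = np.nan_to_num(z, nan=0.0)                 # global-mean imputation = 0 after z-scoring
    return np.concatenate([z, mask], axis=-1)     # (168, 54)

# Toy example: a week with a 10-hour gap in variable 0
rng = np.random.default_rng(0)
week = rng.normal(60.0, 5.0, size=(HOURS, VARS))
week[10:20, 0] = np.nan
x = preprocess_week(week, train_mean=60.0, train_std=5.0)
print(x.shape)  # (168, 54)
```

Note that the mask is computed before imputation; filling gaps first would make every hour look observed.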

3.2 Tokenization & Architecture Search

The team ran a grid search over three tokenizers and three backbones. TST + Mamba-2 won.

| Tokenizer | Idea | Why It Did (or Didn't) Work |
| --- | --- | --- |
| TST (dense) | Zero-fill gaps → 168×54 matrix → linear patch embedding | Surprisingly robust; global-mean imputation beat subject-level imputation |
| mTAN | Multi-time attention with learned masks | Good on paper, slightly worse in practice |
| Tuple | One token per observation (time, variable, value) | Handles sparsity natively, but memory grows with sequence length |

| Backbone | Key Property | Outcome |
| --- | --- | --- |
| Self-attention Transformer | Absolute positional encodings | Competitive but memory-hungry |
| Rotary Transformer | Relative RoPE encodings | Slight edge on some tasks |
| Mamba-2 | Bidirectional selective state-space | Best trade-off of speed, memory, and accuracy |

Final hyper-parameters (all ablated):

  • 24 Mamba-2 layers
  • Hidden size 256
  • 2.7 % dropout
  • 23.3 % token-drop augmentation
  • InfoNCE + KoLeo regularization (λ = 0.21)
  • Trained 6 epochs on 8×A100 GPUs (≈16 hours)
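The token-drop augmentation can be illustrated with a small NumPy sketch. The paper does not spell out the exact masking scheme, so dropping whole hourly tokens is an assumption here; only the 23.3% rate comes from the hyper-parameter list above.

```python
import numpy as np

def token_drop(tokens, drop_rate=0.233, rng=None):
    """Randomly zero out whole hourly tokens during pre-training.
    tokens: (T, d) array of per-hour embeddings."""
    rng = rng or np.random.default_rng()
    keep = rng.random(tokens.shape[0]) >= drop_rate  # True = keep this hour
    out = tokens.copy()
    out[~keep] = 0.0                                 # dropped hours become zero vectors
    return out, keep

rng = np.random.default_rng(42)
tokens = rng.normal(size=(168, 256))                 # one week of 256-d hourly tokens
augmented, keep = token_drop(tokens, drop_rate=0.233, rng=rng)
```

Dropping roughly a quarter of the hours forces the encoder to rely on weekly context rather than any single observation.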

4. Downstream Evaluation: 57 Tasks, Two Families

4.1 Static (Inter-subject) Tasks

Predict traits that rarely change.

| Task | Metric | Baseline | WBM | PPG | WBM+PPG |
| --- | --- | --- | --- | --- | --- |
| Age | MAE ↓ | 7.89 | 3.67 | 2.89 | 2.46 |
| Biological sex | AUROC ↑ | 0.931 | 0.999 | 0.997 | 0.999 |

The baseline uses the simple per-variable mean and standard deviation of the 27 variables, plus age, sex, and BMI where the task permits them as inputs.
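That baseline takes only a few lines to reproduce. A hedged sketch, assuming NumPy arrays with NaNs marking missing hours (`baseline_features` is an illustrative name, not from the paper):

```python
import numpy as np

def baseline_features(weeks):
    """weeks: (N, 168, 27) array with NaNs for missing hours.
    Returns (N, 54): per-variable mean and std over observed hours."""
    mean = np.nanmean(weeks, axis=1)
    std = np.nanstd(weeks, axis=1)
    return np.concatenate([np.nan_to_num(mean), np.nan_to_num(std)], axis=-1)

# Synthetic batch of 8 weekly windows, one with a gap
rng = np.random.default_rng(1)
weeks = rng.normal(size=(8, 168, 27))
weeks[0, :50, 0] = np.nan
feats = baseline_features(weeks)
```

Feeding these 54 summary features to a linear model gives the "Baseline" column in the table above.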

Medical history & medications (47 tasks)

  • WBM alone beats the baseline in 39 of 47 cases.
  • The combination (WBM+PPG) wins in 42 of 47 cases with a median AUROC gain of 0.009 over the best single model.

4.2 Dynamic (Intra-subject) Tasks

Detect time-varying states within the same person.

| Task | # Weeks | WBM AUROC | PPG AUROC | WBM+PPG AUROC | Notes |
| --- | --- | --- | --- | --- | --- |
| Pregnancy | 24k | 0.864 | 0.873 | 0.921 | Large behavioral + physiological shift |
| Respiratory infection | 96k | 0.749 | 0.730 | 0.763 | Captures activity dips & HRV changes |
| Diabetes (HbA1c) | 116k | 0.826 | 0.866 | 0.872 | PPG alone already strong |
| Injury | 26k | 0.680 | 0.673 | 0.688 | Gait metrics give extra signal |
| Sleep efficiency | 671k | 0.424 (vs 0.182 baseline) | 0.182 | 0.438 | Overnight inactivity captured by behavior |

5. Practical Take-aways for Product Teams

5.1 When Behavior Data Wins

  • Lifestyle-driven conditions (sleep disorders, pregnancy, orthopedic injuries)
  • 24 × 7 coverage required (PPG gaps at night, during charging)
  • Resource-constrained devices (behavior metrics already computed on watch)

5.2 When Raw Biosignals Still Matter

  • Physiology-centered diseases (diabetes, atrial fibrillation)
  • Short-term anomalies (arrhythmia bursts)
  • Regulatory pathways where PPG/ECG algorithms have already been cleared by the FDA

5.3 Deployment Checklist

| Step | Recommendation |
| --- | --- |
| Data readiness | Ensure ≥5 days of watch wear per week per user |
| Feature parity | Replicate Apple's 27 signals or map to your ecosystem's equivalents |
| Model choice | Start with ridge-probed embeddings; fine-tune only if labels >10k |
| Fairness audit | Benchmark across age, sex, and race subgroups (the paper provides baselines) |
| Privacy | On-device computation, plus differential privacy if cloud aggregation is needed |
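The "ridge-probed embeddings" recommendation amounts to fitting a regularized linear model on frozen encoder outputs. A minimal scikit-learn sketch, with a synthetic array standing in for WBM embeddings:

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1,000 frozen 256-d embeddings and a binary label
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 256))
w = rng.normal(size=256)
y = (emb @ w + 0.5 * rng.normal(size=1000)) > 0

# Ridge probe: the encoder stays frozen, only this linear head is fit
X_tr, X_te, y_tr, y_te = train_test_split(emb, y, random_state=0)
probe = RidgeClassifier(alpha=1.0).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
```

Because only a linear head is trained, probing is cheap enough to run across all 57 tasks; fine-tuning the full encoder is worth it only once labels are plentiful.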

6. Limitations the Authors Highlight

  1. Selection bias: iPhone + Apple Watch users ≠ general population.
  2. Label quality: Self-reported medical history may misclassify.
  3. Hardware lock-in: Algorithms tuned for Apple sensors; transfer unknown.
  4. Static windows: Weekly snapshots miss sub-week events (e.g., arrhythmia lasting minutes).
  5. Ethical risk: Could widen health-equity gaps if only high-income users benefit.

7. FAQ: Quick Answers from the Paper

Q1: Could I run this on Android or Garmin data?
A: Conceptually yes, but you must re-map the 27 variables and retrain. Gait metrics, VO₂max, and HRV algorithms differ across vendors.

Q2: How much GPU budget do I need?
A: A reproduction with 256-d embeddings and batch size 192 converged in under 20 GPU-hours on A100s. A consumer-grade RTX 4090 (24 GB) can fit a 12-layer model.

Q3: Is the model interpretable?
A: Linear probe coefficients give per-feature importance. Sparse variables (VO₂max, 6-min walk) often carry outsized weight despite low frequency.

Q4: Does it forecast future events?
A: No. WBM produces snapshot embeddings; forecasting would require an autoregressive decoder or a predictor head.
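For illustration only, such a predictor head might look like the following PyTorch sketch. `EventPredictorHead` is hypothetical and not part of the paper; it simply maps a frozen weekly embedding to a next-week event probability.

```python
import torch
import torch.nn as nn

class EventPredictorHead(nn.Module):
    """Hypothetical head (not from the paper): map this week's frozen
    WBM embedding to the probability of an event in the next week."""
    def __init__(self, d_model=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, emb):                       # (B, d_model)
        return torch.sigmoid(self.net(emb)).squeeze(-1)  # (B,) probabilities

head = EventPredictorHead()
probs = head(torch.randn(4, 256))                 # 4 synthetic embeddings
```

Training such a head would need labels shifted one week forward, which the paper's snapshot evaluation does not provide.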


8. Minimal Reproduction Recipe (Pseudocode)

# 1. Prepare weekly tensors
# Each row: user_id, week_start, variable, hour, value
import polars as pl
import torch.nn as nn

df = pl.read_parquet('hourly_behavior.parquet')  # pivot needs an eager frame

# 2. Pivot to a 168×54 matrix per week (27 variables + 27 masks).
# The missingness mask must be computed BEFORE imputation,
# otherwise every entry looks observed.
wide = df.pivot(on='variable',
                index=['user_id', 'week_start', 'hour'],
                values='value')
mask = wide.select(pl.exclude('user_id', 'week_start', 'hour')
                     .is_not_null().cast(pl.UInt8).name.suffix('_mask'))
tensor = pl.concat([wide.fill_null(0), mask], how='horizontal')
# fill_null(0) assumes values are already z-scored, so 0 = global mean

# 3. Tokenize: linear projection of each hour
class TSTTokenizer(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.proj = nn.Linear(54, d_model)

    def forward(self, x):      # (B, 168, 54)
        return self.proj(x)    # (B, 168, d_model)

# 4. Mamba-2 encoder (the paper's encoder is bidirectional;
# mamba_ssm's Mamba layer is causal, so this stack is a simplification)
from mamba_ssm import Mamba
encoder = nn.Sequential(*[Mamba(d_model=256) for _ in range(24)])

# 5. InfoNCE + KoLeo loss (see paper Appendix A.4.1)
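The combined objective can be approximated as follows. This is a generic InfoNCE plus Kozachenko-Leonenko (KoLeo) formulation with λ = 0.21 taken from the hyper-parameter list; the paper's exact formulation (Appendix A.4.1) may differ in details such as temperature and normalization.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """Contrastive loss between two L2-normalized views of the same batch;
    matched rows (the diagonal) are the positive pairs."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def koleo(z, eps=1e-8):
    """KoLeo regularizer: spread embeddings out by maximizing the
    log distance to each point's nearest neighbor."""
    d = np.linalg.norm(z[:, None] - z[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # ignore self-distances
    return -np.mean(np.log(d.min(axis=1) + eps))

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(16, 32)), rng.normal(size=(16, 32))
loss = info_nce(z1, z2) + 0.21 * koleo(z1)
```

InfoNCE pulls two augmented views of the same week together; the KoLeo term discourages the embeddings from collapsing onto a few points.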

9. Closing Thoughts

The study reframes wearable analytics: start with human-level behavior, not millisecond physiology. For most everyday health insights—sleep quality, pregnancy detection, injury recovery—aggregated behavior signals are simpler, cheaper, and surprisingly powerful. Raw PPG still rules for arrhythmia or glucose risk, but combining both modalities yields the best of both worlds.

If you are building next-gen wellness apps, remote patient monitoring, or risk-scoring engines, treat this paper as a blueprint: gather high-quality behavior streams, respect missingness, and let a lightweight encoder do the heavy lifting.


Full citation:
Erturk E., Kamran F. et al. Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions. Proc. ICML 2025.