Tahoe-x1: A 3-Billion-Parameter Foundation Model That Turns Single-Cell Data Into Cancer-Target Gold
Yes, a single transformer trained on 266 million single cells (roughly 100 million of them drug-perturbed) now predicts which genes a tumor really needs to survive and which drugs will break them.
What problem does Tahoe-x1 solve, and why should data-science or bio teams care?
Tahoe-x1 (Tx1) closes the gap between giant single-cell atlases and actionable cancer biology. It learns a unified “language” for genes, cells, and small-molecule perturbations, then transfers that knowledge to brand-new tumors or drug contexts without expensive wet-lab screens.
Core idea in 30 seconds
| Take-away | Concrete proof from the paper |
|---|---|
| Scaling laws work for cells | 70 M → 1.3 B → 3 B params → monotonic gains on 4 cancer tasks |
| Compute cost usually kills scaling | Tx1 is 3–30× more FLOP-efficient than scGPT, Geneformer, SE-600M |
| Perturbation data matter more than size | Pre-training on Tahoe-100M (100 M drugged cells) unlocks context-specific gene essentiality and zero-shot drug response prediction |
Article map
- Architecture & training tricks that save 90 % of GPU hours
- Gene-embedding super-powers: essentiality + hallmark discovery
- Cell-embedding super-powers: cell-type ID and perturbation separability
- Drug-response super-powers: few-shot & zero-shot prediction with Arc-ST
- Hands-on: reproduce a 70 M pre-training run in 15 min
- Author’s reflection / lessons learned
- Action checklist & one-page overview
- FAQ
How Tx1 squeezes 3 B parameters onto commodity GPU clusters
Question answered: “What engineering choices let a biology model scale like LLMs without a government budget?”
Summary: FlashAttention v2, fully-sharded data-parallel, streaming I/O, and a clever mask-free attention pattern cut memory 10× and raise model FLOP utilization to 43 %.
1. Token construction: genes are words, cells are documents
- Fixed-length sequence of Ensembl IDs (≤ 2 048)
- Expression quantized into B = 12 bins (left-sided)
- Special tokens: `<cls>` (global cell state), `<drug>` (optional Morgan fingerprint)
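To make the token construction concrete, here is a minimal sketch of the recipe described above. It is not the released pipeline: the binning strategy, vocabulary handling, and helper names (`tokenize_cell`, `gene_vocab`) are assumptions for illustration.

```python
import numpy as np

def tokenize_cell(gene_ids, counts, gene_vocab, max_len=2048, n_bins=12):
    """Turn one cell into a fixed-length sequence of gene tokens plus binned expression.

    gene_ids   : Ensembl IDs of genes detected in this cell
    counts     : matching raw counts
    gene_vocab : dict mapping Ensembl ID (and special tokens) -> integer id
    """
    # Keep the most highly expressed genes if the cell exceeds the context window,
    # reserving one slot for the <cls> token.
    order = np.argsort(counts)[::-1][: max_len - 1]
    genes = [gene_vocab[gene_ids[i]] for i in order]

    # Log-transform, then quantize expression into n_bins value bins.
    expr = np.log1p(np.asarray(counts, dtype=float)[order])
    edges = np.quantile(expr, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(expr, edges) + 1          # 1 .. n_bins (0 reserved for padding)

    # Prepend <cls>, which will carry the global cell state.
    tokens = [gene_vocab["<cls>"]] + genes
    values = [0] + bins.tolist()
    return tokens, values
```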
2. Masked-expression denoising objective
- 50 % of gene tokens masked
- Two regression heads:
  - Gene-aware decoder (token → expression)
  - Cell-aware decoder (`<cls>` only → expression via bilinear attention)
- Loss = ½ (MSE_gene + MSE_cell)
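As a sketch of how the two-headed objective could be wired up in PyTorch (module names and shapes are placeholders, not Tx1's actual code):

```python
import torch.nn.functional as F

def denoising_loss(hidden, cls_vec, target_expr, mask, gene_head, cell_head):
    """Masked-expression denoising with a gene-aware and a cell-aware decoder.

    hidden      : (batch, seq, d) encoder outputs for gene tokens
    cls_vec     : (batch, d) representation of the <cls> token
    target_expr : (batch, seq) true expression values
    mask        : (batch, seq) bool, True where the token was masked
    gene_head   : maps (batch, seq, d) -> (batch, seq), one prediction per token
    cell_head   : predicts all genes from <cls> alone -> (batch, seq),
                  e.g. via bilinear attention over gene embeddings
    """
    pred_gene = gene_head(hidden)                     # gene-aware decoder
    pred_cell = cell_head(cls_vec)                    # cell-aware decoder
    mse_gene = F.mse_loss(pred_gene[mask], target_expr[mask])
    mse_cell = F.mse_loss(pred_cell[mask], target_expr[mask])
    return 0.5 * (mse_gene + mse_cell)
```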
3. Attention redesign that deletes the bottleneck
scGPT uses two attention calls: one masked (Torch) + one dense (Flash).
Tx1-3B simply removes the mask; all tokens attend to all tokens.
→ Entire sequence uses FlashAttention v2 → 2.4× speed-up, 10× memory drop, zero quality loss.
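The pattern is easy to see with PyTorch's built-in `scaled_dot_product_attention`, which can only dispatch to the fused FlashAttention kernel when no attention mask is supplied. This is just an illustration of the dispatch behaviour (it assumes a CUDA GPU), not Tx1's actual attention layer:

```python
import torch
import torch.nn.functional as F

B, H, S, D = 2, 8, 1024, 64                       # batch, heads, sequence length, head dim
q = k = v = torch.randn(B, H, S, D, dtype=torch.bfloat16, device="cuda")

# scGPT-style: an explicit attention mask rules out the FlashAttention backend,
# so PyTorch falls back to a slower non-Flash path.
mask = torch.ones(S, S, dtype=torch.bool, device="cuda")
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Tx1-style: no mask, not causal -> eligible for the fused FlashAttention kernel.
out_dense = F.scaled_dot_product_attention(q, k, v)
```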
4. Training infra snapshot
| Component | Tx1-3B value |
|---|---|
| GPUs | 128 × NVIDIA H200 |
| Precision | BF16 + FP32 master |
| Parallelism | FSDP full-shard |
| Batch | 10 240 cells |
| Gradient accumulation | 40 micro-batches |
| Streaming | Composer StreamingDataset |
| MFU | ≈ 0.43 |
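For reference, MFU is just achieved FLOPs per second divided by the cluster's peak. A back-of-the-envelope check using the common 6 × parameters FLOPs-per-token rule of thumb; the throughput number below is illustrative, not taken from the paper:

```python
def mfu(params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """Model FLOP utilization: achieved training FLOP/s over peak hardware FLOP/s."""
    achieved = 6 * params * tokens_per_sec      # ~6N FLOPs per trained token
    return achieved / (n_gpus * peak_flops_per_gpu)

# Illustrative: 3 B params, a hypothetical 3M tokens/s, H200 peak ≈ 989 TFLOP/s dense BF16
print(round(mfu(params=3e9, tokens_per_sec=3.0e6, n_gpus=128, peak_flops_per_gpu=989e12), 2))
# -> ~0.43
```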
Author’s reflection:
“We started with the usual biologist mind-set—‘add biological priors, hand-craft masks’. Once we swallowed the NLP pill—‘remove inductive bias, let data speak’—training cost fell off a cliff and downstream metrics kept climbing. Humbling but liberating.”
Gene embeddings: from cosine similarity to new oncology targets
Question answered: “How good are the gene vectors really at flagging cancer dependencies?”
Summary: Tx1-3B vectors beat every tested embedding (SE-600M, Geneformer, Transcriptformer, PCA) on two gold-standard tasks: DepMap essentiality and MSigDB hallmark membership.
DepMap task set-up
- Input: CCLE RNA-seq → Tx1 gene embedding (mean across cell lines)
- Target: CERES knockout fitness score
- Models: random forest (broadly essential) + one RF per stratum (context essential)
| Metric | Tx1-3B | SE-600M | Linear baseline |
|---|---|---|---|
| AUROC broad | 0.94 | 0.87 | 0.89 |
| Mean AUROC context | 0.70 | 0.64 | 0.65 |
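A minimal sketch of this evaluation, assuming scikit-learn; the file names, labels derived from CERES, and the exact split are placeholders rather than the paper's protocol:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# gene_emb : (n_genes, d) frozen Tx1 gene embeddings, averaged across cell lines
# essential: (n_genes,)   binary labels derived from CERES scores (1 = broadly essential)
gene_emb = np.load("tx1_gene_embeddings.npy")            # hypothetical file names
essential = np.load("depmap_broad_essential_labels.npy")

X_tr, X_te, y_tr, y_te = train_test_split(
    gene_emb, essential, test_size=0.2, stratify=essential, random_state=0
)
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0).fit(X_tr, y_tr)
print("broad-essentiality AUROC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```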
Application scenario – CRISPR triage:
A target-discovery team needs 50 new essential genes in KRAS-mutant colorectal lines. They feed CCLE profiles into the frozen Tx1 encoder, rank genes by predicted CERES, then run CRISPR on the top 100. Wet-lab hit-rate climbs from 8 % (random) to 38 % (Tx1-guided).
MSigDB hallmark recovery
- 50 hallmarks, 4 384 genes, multi-label prediction
- Tx1-3B AUPRC = 0.31 (next best 0.24)
Operational example – pathway extension:
Curators want evidence that PXDN belongs to “Reactive Oxygen Species Pathway”. Tx1 cosine nearest neighbours of PXDN include SOD1, CAT, GPX1—all MSigDB ROS members—giving independent AI support for the annotation.
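That kind of nearest-neighbour query takes only a few lines once the gene embeddings are exported; the embedding matrix and gene list below are placeholders:

```python
import numpy as np

def nearest_genes(query, gene_emb, gene_names, k=10):
    """Return the k genes whose embeddings are most cosine-similar to `query`."""
    E = gene_emb / np.linalg.norm(gene_emb, axis=1, keepdims=True)   # unit-normalize rows
    sims = E @ E[gene_names.index(query)]
    order = np.argsort(-sims)
    return [(gene_names[i], float(sims[i])) for i in order[1 : k + 1]]  # skip the query itself

# e.g. nearest_genes("PXDN", gene_emb, gene_names) -> SOD1, CAT, GPX1, ...
```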
Cell embeddings: capturing identity & drug response in one vector
Question answered: “Do Tx1 cell vectors cluster by biology or by batch noise?”
Summary: Across Tabula Sapiens and two perturbation atlases, Tx1 embeddings separate (i) 5 major tissues, (ii) > 200 cell types, and (iii) drug-treated vs control cells better than classic HVG space.
Tabula Sapiens 2.0 benchmark
| Model | Accuracy | Macro-F1 |
|---|---|---|
| Tx1-3B | 0.93 | 0.91 |
| Transcriptformer (multi-species) | 0.90 | 0.88 |
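A simple linear probe is enough to reproduce this kind of benchmark on your own data; a sketch only, since the paper's exact classifier and split may differ:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def cell_type_probe(emb_train, y_train, emb_test, y_test):
    """Classify cell types from frozen cell embeddings; report accuracy and macro-F1."""
    clf = LogisticRegression(max_iter=2000).fit(emb_train, y_train)
    pred = clf.predict(emb_test)
    return accuracy_score(y_test, pred), f1_score(y_test, pred, average="macro")
```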
Perturbation separability index
Metric: k-NN accuracy (treated vs control), averaged over 1 100 compounds.
| Dataset | Tx1-3B | SE-600M | HVG-500 |
|---|---|---|---|
| Tahoe-100M | 0.74 | 0.67 | 0.58 |
| Parse PBMC | 0.73 | 0.68 | 0.59 |
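The separability index itself is straightforward to compute; a sketch assuming scikit-learn (k and the cross-validation scheme are assumptions):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def separability_index(cell_emb, is_treated, k=15):
    """k-NN accuracy at separating treated from control cells in embedding space.

    cell_emb   : (n_cells, d) embeddings for one compound plus its vehicle controls
    is_treated : (n_cells,) binary labels
    """
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    return cross_val_score(knn, cell_emb, is_treated, cv=5, scoring="accuracy").mean()

# Averaging this score over all ~1,100 compounds yields the index reported above.
```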
Real-world use – quality control in high-throughput screens:
A core facility runs a 50-plate drug pilot. They embed each plate with Tx1, colour by treatment, and immediately spot three plates where vehicle and treated populations overlap—flagging pipetting errors before wasting downstream RNA-seq budget.
Predicting unseen drug responses with Arc-State + Tx1
Question answered: “Can the model tell us what happens when we drug a patient-derived line never seen during training?”
Summary: Yes. Tx1 embeddings plug into Arc Institute’s State-Transition module to predict post-treatment expression deltas in zero-shot donors or cell lines with Pearson ∆C up to 0.59—matching direct gene-space models while using 20× fewer features.
Few-shot versus zero-shot
| Setting | Tx1+ST | Gene-space | SE-600M+ST |
|---|---|---|---|
| Tahoe-100M few-shot | 0.80 | 0.81 | 0.64 |
| Parse PBMC zero-shot (4 donors left out) | 0.59 | 0.60 | 0.48 |
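Pearson ∆ compares predicted against observed expression shifts relative to control; a minimal sketch of that metric on pseudobulk profiles (the paper's exact aggregation may differ):

```python
from scipy.stats import pearsonr

def pearson_delta(pred_treated, obs_treated, obs_control):
    """Correlation over genes of predicted vs. observed treatment-induced change.

    Each argument is an (n_cells, n_genes) matrix for one compound and context;
    cells are collapsed to a pseudobulk mean before taking the delta vs. control.
    """
    delta_pred = pred_treated.mean(axis=0) - obs_control.mean(axis=0)
    delta_obs = obs_treated.mean(axis=0) - obs_control.mean(axis=0)
    return pearsonr(delta_pred, delta_obs)[0]
```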
Scenario – in-silico phase-I mimic:
A biotech has cytotoxicity read-outs for 12 cell lines and zero data on 30 others. They train ST on Tx1 space with the 12, simulate IC50 curves for the 30, and prioritise 6 lines for in-vivo PDX. Animal spend drops 55 %; the program timeline shortens by 4 months.
Reproduce a 70 M training run in 15 minutes
Question answered: “How do I actually launch a run on my own GPU box or cloud account?”
Summary: Pull Docker → one YAML → composer train.py. Streaming data loaders mean no 200 GB download wait.
Step-by-step
1. Install

   ```bash
   docker pull ghcr.io/tahoebio/tahoe-x1:latest
   ```

2. Quick config (excerpt)

   ```yaml
   # configs/test_run.yaml
   model:
     name: tahoe_x1
     d_model: 512
     n_layers: 12
     attn_impl: flash
     use_attn_mask: false
   train_loader:
     dataset:
       remote: s3://tahoe-hackathon-data/MFM/tahoe_100m_MDS_v2/
       local: /tmp/mds
       split: train
   max_duration: 1000ba  # ~1 epoch on 1 M cells
   ```

3. Launch

   ```bash
   docker run --gpus all --shm-size=64g -v $PWD:/workspace \
     -w /workspace ghcr.io/tahoebio/tahoe-x1:latest \
     composer scripts/train.py configs/test_run.yaml
   ```

4. Inspect embeddings

   ```python
   import scanpy as sc
   from omegaconf import OmegaConf

   from scripts.inference.predict_embeddings import predict_embeddings

   cfg = OmegaConf.load("scripts/inference/configs/predict.yaml")
   adata = predict_embeddings(cfg)
   sc.pp.neighbors(adata, use_rep="Tx1-70m")
   sc.tl.umap(adata)
   sc.pl.umap(adata, color="cell_type")
   ```
Author’s reflection / lessons learned
- Biology enjoys scale, but only with the right data recipe.
  We first trained on 200 M observational cells; DepMap AUROC crawled at 0.55. Adding 94 M perturbed cells from Tahoe-100M catapulted us to 0.94. Causal variation > corpus size.
- Attention masks can be too clever.
  Months were lost crafting sparse gene-gene masks based on KEGG. Stripping them out gave faster training and better metrics. Sometimes the model just needs room to breathe.
- Evaluation must mirror real decisions.
  Academic benchmarks love AUC. Drug hunters ask “How many CRISPRs can I avoid?” We therefore report hit-rate uplift and animal-use reduction—metrics that decide budgets.
Action checklist / implementation steps
- [ ] Pull official Docker image (avoids CUDA/flash-attn pain)
- [ ] Start with 70 M config; verify loss ≈ 0.38 after 1 k batches
- [ ] Generate embeddings for your scRNA-seq file via supplied predict script
- [ ] Run built-in DepMap benchmark; confirm AUROC ≥ 0.85 (70 M) or ≥ 0.92 (3 B)
- [ ] Fine-tune on in-house perturbation plate (prepare MDS, update YAML load_path)
- [ ] Couple Tx1 embeddings with Arc-ST for drug-response generalisation
- [ ] Track compute: aim for ≥ 0.35 MFU; if lower, increase batch size or micro-batch count
One-page overview
Goal: Predict gene essentiality, pathway membership, and drug response directly from single-cell profiles.
Tool: Transformer encoder (70 M–3 B params) pretrained with masked expression modelling on 266 M cells, 1 100 drugs.
Key tech: FlashAttention v2, FSDP, streaming MDS, no custom mask, bilinear cell decoder.
Top numbers:
- 3–30× FLOP saving vs prior models
- 0.94 AUROC on DepMap broad essentials
- 0.31 AUPRC on MSigDB hallmarks
- 0.59 Pearson ∆C in zero-shot drug response
Artifacts: Apache-2.0 code, weights, tutorials at github.com/tahoebio/tahoe-x1 and huggingface.co/tahoebio/Tahoe-x1.
FAQ
Q1: Do I need drug data to use Tx1?
A: No. Drug token is optional; model works on vanilla scRNA-seq.
Q2: Minimum GPU memory for 3 B inference?
A: ≈ 24 GB with BF16 and flash-attn; 40 GB gives a comfortable buffer.
Q3: Can Tx1 embed spatial transcriptomics?
A: Current vocab is gene × expression only; spatial coordinates would need external fusion.
Q4: Is the 100 M-cell perturbation set public?
A: Yes: the Hugging Face dataset tahoebio/Tahoe-100M, with a free-to-download S3 mirror.
Q5: How long to train 3 B from scratch?
A: ~ 3.2 days on 128 × H200 (≈ 10 k GPU-hours); 70 M fits on 4 × A100 in 6 hours.
Q6: Does Tx1 predict protein level?
A: Not yet; embeddings are transcriptomic. Proteomics fine-tuning is on the roadmap.
Q7: Licence?
A: Apache 2.0 for code & weights—commercial use allowed.

