Tahoe-x1: A 3-Billion-Parameter Foundation Model That Turns Single-Cell Data Into Cancer-Target Gold
Yes, a single transformer trained on 266 million single cells (roughly 100 million of them drug-perturbed) now predicts which genes a tumor really needs to survive and which drugs will break them.
What problem does Tahoe-x1 solve, and why should data-science or bio teams care?
Tahoe-x1 (Tx1) closes the gap between giant single-cell atlases and actionable cancer biology. It learns a unified “language” for genes, cells, and small-molecule perturbations, then transfers that knowledge to brand-new tumors or drug contexts without expensive wet-lab screens.
Core idea in 30 seconds
| Take-away | Concrete proof from the paper |
|---|---|
| Scaling laws work for cells | 70 M → 1.3 B → 3 B params → monotonic gains on 4 cancer tasks |
| Compute cost usually kills scaling | Tx1 is 3–30× more FLOP-efficient than scGPT, Geneformer, SE-600M |
| Perturbation data matter more than size | Pre-training on Tahoe-100M (100 M drugged cells) unlocks context-specific gene essentiality and zero-shot drug response prediction |
Article map
- Architecture & training tricks that save 90 % of GPU hours
- Gene-embedding super-powers: essentiality + hallmark discovery
- Cell-embedding super-powers: cell-type ID and perturbation separability
- Drug-response super-powers: few-shot & zero-shot prediction with Arc-ST
- Hands-on: reproduce a 70 M pre-training run in 15 min
- Author’s reflection / lessons learned
- Action checklist & one-page overview
- FAQ
How Tx1 squeezes 3 B parameters onto commodity GPU clusters
Question answered: “What engineering choices let a biology model scale like LLMs without a government budget?”
Summary: FlashAttention v2, fully-sharded data-parallel, streaming I/O, and a clever mask-free attention pattern cut memory 10× and raise model FLOP utilization to 43 %.
1. Token construction: genes are words, cells are documents
- Fixed-length sequence of Ensembl IDs (≤ 2 048)
- Expression quantized into B = 12 bins (left-sided)
- Special tokens: `<cls>` (global cell state), `<drug>` (optional Morgan fingerprint)
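To make the token construction concrete, here is a minimal sketch of the recipe described above. It is not the released pipeline: the binning strategy, vocabulary handling, and helper names (`tokenize_cell`, `gene_vocab`) are assumptions for illustration.

```python
import numpy as np

def tokenize_cell(gene_ids, counts, gene_vocab, max_len=2048, n_bins=12):
    """Turn one cell into a fixed-length sequence of gene tokens plus binned expression.

    gene_ids   : Ensembl IDs of genes detected in this cell
    counts     : matching raw counts
    gene_vocab : dict mapping Ensembl ID (and special tokens) -> integer id
    """
    # Keep the most highly expressed genes if the cell exceeds the context window,
    # reserving one slot for the <cls> token.
    order = np.argsort(counts)[::-1][: max_len - 1]
    genes = [gene_vocab[gene_ids[i]] for i in order]

    # Log-transform, then quantize expression into n_bins value bins.
    expr = np.log1p(np.asarray(counts, dtype=float)[order])
    edges = np.quantile(expr, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(expr, edges) + 1          # 1 .. n_bins (0 reserved for padding)

    # Prepend <cls>, which will carry the global cell state.
    tokens = [gene_vocab["<cls>"]] + genes
    values = [0] + bins.tolist()
    return tokens, values
```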
2. Masked-expression denoising objective
- 50 % of gene tokens masked
- Two regression heads:
  - Gene-aware decoder (token → expression)
  - Cell-aware decoder (`<cls>` only → expression via bilinear attention)
- Loss = ½ (MSE_gene + MSE_cell)
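As a sketch of how the two-headed objective could be wired up in PyTorch (module names and shapes are placeholders, not Tx1's actual code):

```python
import torch.nn.functional as F

def denoising_loss(hidden, cls_vec, target_expr, mask, gene_head, cell_head):
    """Masked-expression denoising with a gene-aware and a cell-aware decoder.

    hidden      : (batch, seq, d) encoder outputs for gene tokens
    cls_vec     : (batch, d) representation of the <cls> token
    target_expr : (batch, seq) true expression values
    mask        : (batch, seq) bool, True where the token was masked
    gene_head   : maps (batch, seq, d) -> (batch, seq), one prediction per token
    cell_head   : predicts all genes from <cls> alone -> (batch, seq),
                  e.g. via bilinear attention over gene embeddings
    """
    pred_gene = gene_head(hidden)                     # gene-aware decoder
    pred_cell = cell_head(cls_vec)                    # cell-aware decoder
    mse_gene = F.mse_loss(pred_gene[mask], target_expr[mask])
    mse_cell = F.mse_loss(pred_cell[mask], target_expr[mask])
    return 0.5 * (mse_gene + mse_cell)
```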
3. Attention redesign that deletes the bottleneck
scGPT uses two attention calls: one masked (Torch) + one dense (Flash).
Tx1-3B simply removes the mask; all tokens attend to all tokens.
→ Entire sequence uses FlashAttention v2 → 2.4× speed-up, 10× memory drop, zero quality loss.
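The pattern is easy to see with PyTorch's built-in `scaled_dot_product_attention`, which can only dispatch to the fused FlashAttention kernel when no attention mask is supplied. This is just an illustration of the dispatch behaviour (it assumes a CUDA GPU), not Tx1's actual attention layer:

```python
import torch
import torch.nn.functional as F

B, H, S, D = 2, 8, 1024, 64                       # batch, heads, sequence length, head dim
q = k = v = torch.randn(B, H, S, D, dtype=torch.bfloat16, device="cuda")

# scGPT-style: an explicit attention mask rules out the FlashAttention backend,
# so PyTorch falls back to a slower non-Flash path.
mask = torch.ones(S, S, dtype=torch.bool, device="cuda")
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Tx1-style: no mask, not causal -> eligible for the fused FlashAttention kernel.
out_dense = F.scaled_dot_product_attention(q, k, v)
```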
4. Training infra snapshot
| Component | Tx1-3B value |
|---|---|
| GPUs | 128 × NVIDIA H200 |
| Precision | BF16 + FP32 master |
| Parallelism | FSDP full-shard |
| Batch | 10 240 cells |
| Gradient accumulation | 40 micro-batches |
| Streaming | Composer StreamingDataset |
| MFU | ≈ 0.43 |
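For reference, MFU is just achieved FLOPs per second divided by the cluster's peak. A back-of-the-envelope check using the common 6 × parameters FLOPs-per-token rule of thumb; the throughput number below is illustrative, not taken from the paper:

```python
def mfu(params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """Model FLOP utilization: achieved training FLOP/s over peak hardware FLOP/s."""
    achieved = 6 * params * tokens_per_sec      # ~6N FLOPs per trained token
    return achieved / (n_gpus * peak_flops_per_gpu)

# Illustrative: 3 B params, a hypothetical 3M tokens/s, H200 peak ≈ 989 TFLOP/s dense BF16
print(round(mfu(params=3e9, tokens_per_sec=3.0e6, n_gpus=128, peak_flops_per_gpu=989e12), 2))
# -> ~0.43
```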
Author’s reflection:
“We started with the usual biologist mind-set—‘add biological priors, hand-craft masks’. Once we swallowed the NLP pill—‘remove inductive bias, let data speak’—training cost fell off a cliff and downstream metrics kept climbing. Humbling but liberating.”
Gene embeddings: from cosine similarity to new oncology targets
Question answered: “How good are the gene vectors really at flagging cancer dependencies?”
Summary: Tx1-3B vectors beat every tested embedding (SE-600M, Geneformer, Transcriptformer, PCA) on two gold-standard tasks: DepMap essentiality and MSigDB hallmark membership.
DepMap task set-up
- Input: CCLE RNA-seq → Tx1 gene embedding (mean across cell lines)
- Target: CERES knockout fitness score
- Models: random forest (broadly essential) + one RF per stratum (context essential)
| Metric | Tx1-3B | SE-600M | Linear baseline |
|---|---|---|---|
| AUROC broad | 0.94 | 0.87 | 0.89 |
| Mean AUROC context | 0.70 | 0.64 | 0.65 |
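A minimal sketch of this evaluation, assuming scikit-learn; the file names, labels derived from CERES, and the exact split are placeholders rather than the paper's protocol:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# gene_emb : (n_genes, d) frozen Tx1 gene embeddings, averaged across cell lines
# essential: (n_genes,)   binary labels derived from CERES scores (1 = broadly essential)
gene_emb = np.load("tx1_gene_embeddings.npy")            # hypothetical file names
essential = np.load("depmap_broad_essential_labels.npy")

X_tr, X_te, y_tr, y_te = train_test_split(
    gene_emb, essential, test_size=0.2, stratify=essential, random_state=0
)
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0).fit(X_tr, y_tr)
print("broad-essentiality AUROC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```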
Application scenario – CRISPR triage:
A target-discovery team needs 50 new essential genes in KRAS-mutant colorectal lines. They feed CCLE profiles into the frozen Tx1 encoder, rank genes by predicted CERES, then run CRISPR on the top 100. Wet-lab hit-rate climbs from 8 % (random) to 38 % (Tx1-guided).
MSigDB hallmark recovery
- 50 hallmarks, 4 384 genes, multi-label prediction
- Tx1-3B AUPRC = 0.31 (next best 0.24)
Operational example – pathway extension:
Curators want evidence that PXDN belongs to “Reactive Oxygen Species Pathway”. Tx1 cosine nearest neighbours of PXDN include SOD1, CAT, GPX1—all MSigDB ROS members—giving independent AI support for the annotation.
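That kind of nearest-neighbour query takes only a few lines once the gene embeddings are exported; the embedding matrix and gene list below are placeholders:

```python
import numpy as np

def nearest_genes(query, gene_emb, gene_names, k=10):
    """Return the k genes whose embeddings are most cosine-similar to `query`."""
    E = gene_emb / np.linalg.norm(gene_emb, axis=1, keepdims=True)   # unit-normalize rows
    sims = E @ E[gene_names.index(query)]
    order = np.argsort(-sims)
    return [(gene_names[i], float(sims[i])) for i in order[1 : k + 1]]  # skip the query itself

# e.g. nearest_genes("PXDN", gene_emb, gene_names) -> SOD1, CAT, GPX1, ...
```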
Cell embeddings: capturing identity & drug response in one vector
Question answered: “Do Tx1 cell vectors cluster by biology or by batch noise?”
Summary: Across Tabula Sapiens and two perturbation atlases, Tx1 embeddings separate (i) 5 major tissues, (ii) > 200 cell types, and (iii) drug-treated vs control cells better than classic HVG space.
Tabula Sapiens 2.0 benchmark
| Model | Accuracy | Macro-F1 |
|---|---|---|
| Tx1-3B | 0.93 | 0.91 |
| Transcriptformer (multi-species) | 0.90 | 0.88 |
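A simple linear probe is enough to reproduce this kind of benchmark on your own data; a sketch only, since the paper's exact classifier and split may differ:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def cell_type_probe(emb_train, y_train, emb_test, y_test):
    """Classify cell types from frozen cell embeddings; report accuracy and macro-F1."""
    clf = LogisticRegression(max_iter=2000).fit(emb_train, y_train)
    pred = clf.predict(emb_test)
    return accuracy_score(y_test, pred), f1_score(y_test, pred, average="macro")
```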
Perturbation separability index
Metric: k-NN accuracy (treated vs control), averaged over 1 100 compounds.
| Dataset | Tx1-3B | SE-600M | HVG-500 |
|---|---|---|---|
| Tahoe-100M | 0.74 | 0.67 | 0.58 |
| Parse PBMC | 0.73 | 0.68 | 0.59 |
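The separability index itself is straightforward to compute; a sketch assuming scikit-learn (k and the cross-validation scheme are assumptions):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def separability_index(cell_emb, is_treated, k=15):
    """k-NN accuracy at separating treated from control cells in embedding space.

    cell_emb   : (n_cells, d) embeddings for one compound plus its vehicle controls
    is_treated : (n_cells,) binary labels
    """
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    return cross_val_score(knn, cell_emb, is_treated, cv=5, scoring="accuracy").mean()

# Averaging this score over all ~1,100 compounds yields the index reported above.
```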
Real-world use – quality control in high-throughput screens:
A core facility runs a 50-plate drug pilot. They embed each plate with Tx1, colour by treatment, and immediately spot three plates where vehicle and treated populations overlap—flagging pipetting errors before wasting downstream RNA-seq budget.
Predicting unseen drug responses with Arc-State + Tx1
Question answered: “Can the model tell us what happens when we drug a patient-derived line never seen during training?”
Summary: Yes. Tx1 embeddings plug into Arc Institute’s State-Transition module to predict post-treatment expression deltas in zero-shot donors or cell lines with Pearson ∆C up to 0.59—matching direct gene-space models while using 20× fewer features.
Few-shot versus zero-shot
| Setting | Tx1+ST | Gene-space | SE-600M+ST |
|---|---|---|---|
| Tahoe-100M few-shot | 0.80 | 0.81 | 0.64 |
| Parse PBMC zero-shot (4 donors left out) | 0.59 | 0.60 | 0.48 |
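Pearson ∆ compares predicted against observed expression shifts relative to control; a minimal sketch of that metric on pseudobulk profiles (the paper's exact aggregation may differ):

```python
from scipy.stats import pearsonr

def pearson_delta(pred_treated, obs_treated, obs_control):
    """Correlation over genes of predicted vs. observed treatment-induced change.

    Each argument is an (n_cells, n_genes) matrix for one compound and context;
    cells are collapsed to a pseudobulk mean before taking the delta vs. control.
    """
    delta_pred = pred_treated.mean(axis=0) - obs_control.mean(axis=0)
    delta_obs = obs_treated.mean(axis=0) - obs_control.mean(axis=0)
    return pearsonr(delta_pred, delta_obs)[0]
```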
Scenario – in-silico phase-I mimic:
A biotech has cytotoxicity read-outs for 12 cell lines and zero data on 30 others. They train ST on Tx1 space with the 12, simulate IC50 curves for the 30, and prioritise 6 lines for in-vivo PDX. Animal spend drops 55 %; the program timeline shortens by 4 months.
Reproduce a 70 M training run in 15 minutes
Question answered: “How do I actually launch a run on my own GPU box or cloud account?”
Summary: Pull Docker → one YAML → composer train.py. Streaming data loaders mean no 200 GB download wait.
Step-by-step
1. Install

   ```bash
   docker pull ghcr.io/tahoebio/tahoe-x1:latest
   ```

2. Quick config (excerpt)

   ```yaml
   # configs/test_run.yaml
   model:
     name: tahoe_x1
     d_model: 512
     n_layers: 12
     attn_impl: flash
     use_attn_mask: false
   train_loader:
     dataset:
       remote: s3://tahoe-hackathon-data/MFM/tahoe_100m_MDS_v2/
       local: /tmp/mds
       split: train
   max_duration: 1000ba  # ~1 epoch on 1 M cells
   ```

3. Launch

   ```bash
   docker run --gpus all --shm-size=64g -v $PWD:/workspace \
     -w /workspace ghcr.io/tahoebio/tahoe-x1:latest \
     composer scripts/train.py configs/test_run.yaml
   ```

4. Inspect embeddings

   ```python
   import scanpy as sc
   from omegaconf import OmegaConf

   from scripts.inference.predict_embeddings import predict_embeddings

   cfg = OmegaConf.load("scripts/inference/configs/predict.yaml")
   adata = predict_embeddings(cfg)
   sc.pp.neighbors(adata, use_rep="Tx1-70m")
   sc.tl.umap(adata)
   sc.pl.umap(adata, color="cell_type")
   ```
Author’s reflection / lessons learned
- Biology enjoys scale, but only with the right data recipe.
  We first trained on 200 M observational cells; DepMap AUROC crawled at 0.55. Adding 94 M perturbed cells from Tahoe-100M catapulted us to 0.94. Causal variation > corpus size.
- Attention masks can be too clever.
  Months were lost crafting sparse gene-gene masks based on KEGG. Stripping them out gave faster training and better metrics. Sometimes the model just needs room to breathe.
- Evaluation must mirror real decisions.
  Academic benchmarks love AUC. Drug hunters ask “How many CRISPRs can I avoid?” We therefore report hit-rate uplift and animal-use reduction—metrics that decide budgets.
Action checklist / implementation steps
- [ ] Pull official Docker image (avoids CUDA/flash-attn pain)
- [ ] Start with 70 M config; verify loss ≈ 0.38 after 1 k batches
- [ ] Generate embeddings for your scRNA-seq file via supplied predict script
- [ ] Run built-in DepMap benchmark; confirm AUROC ≥ 0.85 (70 M) or ≥ 0.92 (3 B)
- [ ] Fine-tune on in-house perturbation plate (prepare MDS, update YAML load_path)
- [ ] Couple Tx1 embeddings with Arc-ST for drug-response generalisation
- [ ] Track compute: aim for ≥ 0.35 MFU; if lower, increase batch size or micro-batch count
One-page overview
Goal: Predict gene essentiality, pathway membership, and drug response directly from single-cell profiles.
Tool: Transformer encoder (70 M–3 B params) pretrained with masked expression modelling on 266 M cells, 1 100 drugs.
Key tech: FlashAttention v2, FSDP, streaming MDS, no custom mask, bilinear cell decoder.
Top numbers:
- 3–30× FLOP saving vs prior models
- 0.94 AUROC on DepMap broad essentials
- 0.31 AUPRC on MSigDB hallmarks
- 0.59 Pearson ∆C in zero-shot drug response
Artifacts: Apache-2.0 code, weights, tutorials at github.com/tahoebio/tahoe-x1 and huggingface.co/tahoebio/Tahoe-x1.
FAQ
Q1: Do I need drug data to use Tx1?
A: No. Drug token is optional; model works on vanilla scRNA-seq.
Q2: Minimum GPU memory for 3 B inference?
A: ≈ 24 GB with BF16 and flash-attn; 40 GB gives a comfortable buffer.
Q3: Can Tx1 embed spatial transcriptomics?
A: Current vocab is gene × expression only; spatial coordinates would need external fusion.
Q4: Is the 100 M-cell perturbation set public?
A: Yes: the Hugging Face dataset tahoebio/Tahoe-100M, with a free-to-download S3 mirror.
Q5: How long to train 3 B from scratch?
A: ~ 3.2 days on 128 × H200 (≈ 10 k GPU-hours); 70 M fits on 4 × A100 in 6 hours.
Q6: Does Tx1 predict protein level?
A: Not yet; embeddings are transcriptomic. Proteomics fine-tuning is on the roadmap.
Q7: Licence?
A: Apache 2.0 for code & weights—commercial use allowed.

