From Data Chaos to Tissue Atlases: How SpaSEG Makes Spatial Transcriptomics Simple
1. Why Spatial Transcriptomics Matters (and Where It Hurts)
Imagine cutting a thin slice of brain or tumor tissue and asking, “Which genes are where?”
Spatial transcriptomics (often abbreviated SRT, for spatially resolved transcriptomics) does exactly that. Instead of grinding tissue into single-cell soup, it keeps every cell in its original neighborhood and records gene activity in situ.
The payoff: you can see immune cells swarming around a tumor margin, or layer-specific neurons sitting exactly where they should.
The pain: a single experiment can produce half a million data points—each carrying thousands of gene counts. Traditional tools choke on size, lose spatial context, or refuse to work across different SRT platforms (10x Visium, Stereo-seq, MERFISH, etc.).
2. Meet SpaSEG: A Four-in-One Toolkit
SpaSEG is an unsupervised deep-learning framework built by BGI-Research and published in Genome Biology (2025).
In one pipeline it does:
- Spatial domain identification – finds tissue regions with similar gene patterns.
- Multi-section alignment – stitches neighboring slices into a 3-D map.
- Spatially-variable gene (SVG) discovery – genes that switch on/off between regions.
- Cell–cell interaction inference – guesses who is talking to whom, based on ligand–receptor pairs.
The trick: SpaSEG treats every spot as a pixel in a multi-channel image and runs a lightweight convolutional neural network (CNN).
No manual tuning, no platform-specific hacks.
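To make the "spot as pixel" idea concrete, here is a minimal sketch that scatters per-spot feature vectors onto a dense grid and leaves zeros wherever no spot was measured. The function name `spots_to_image` and the assumption that coordinates are already integer row/column indices are illustrative only; they are not taken from the SpaSEG code base.

```python
import numpy as np

def spots_to_image(rows, cols, features):
    """Scatter per-spot features onto a (channels, height, width) grid.

    rows, cols : integer grid indices of each spot (an assumption of this sketch)
    features   : array of shape (n_spots, n_channels), e.g. PCA scores
    """
    rows = np.asarray(rows) - np.min(rows)   # shift so the grid starts at 0
    cols = np.asarray(cols) - np.min(cols)
    features = np.asarray(features, dtype=np.float32)
    height, width = rows.max() + 1, cols.max() + 1
    image = np.zeros((features.shape[1], height, width), dtype=np.float32)
    image[:, rows, cols] = features.T        # pixels without a spot stay zero
    return image

# Three spots with two feature channels each.
rows = [0, 0, 1]
cols = [0, 2, 1]
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
image = spots_to_image(rows, cols, features)
print(image.shape)
```

Once the data has this shape, any image-style CNN can slide its filters over it, which is what lets one model serve platforms with very different spot layouts.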
3. How It Works in Plain English
Real-world step | SpaSEG analogy |
---|---|
Remove low-quality spots & genes | Crop and clean the image |
PCA + z-score normalization | Compress color channels |
CNN with 3×3 filters | Look at local neighborhoods |
Edge-strength loss | Keep boundaries smooth, not pixelated |
Two-stage training | “Preview” mode → “polish” mode |
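The "compress color channels" row can be illustrated with a short NumPy sketch. It assumes PCA is applied first and each component is then z-scored; the table does not state the exact order, and the helper name `pca_zscore` is hypothetical.

```python
import numpy as np

def pca_zscore(counts, n_comps=50):
    """Project spots onto principal components, then z-score each component."""
    x = np.asarray(counts, dtype=np.float64)
    x = x - x.mean(axis=0)                        # center each gene
    # Rows of vt are principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    n_comps = min(n_comps, vt.shape[0])
    scores = x @ vt[:n_comps].T                   # shape (n_spots, n_comps)
    std = scores.std(axis=0)
    std[std == 0] = 1.0                           # guard against constant components
    return (scores - scores.mean(axis=0)) / std

# Simulated count matrix: 200 spots x 30 genes.
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(200, 30))
embedding = pca_zscore(counts, n_comps=10)
print(embedding.shape)
```

Each column of `embedding` then plays the role of one image channel with zero mean and unit variance.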
3.1 Two-Stage Training Cheat-Sheet
Stage | Epochs | Loss | Purpose |
---|---|---|---|
Warm-up | 400 | Mean-squared error | Initialize sensible weights |
Refinement | ≤2 000 | α × cross-entropy + β × edge-strength | Final clusters with crisp edges |
Recommended weights
- Single slice: α = 0.4, β = 0.7
- Multiple slices: α = 0.2, β = 0.4
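A rough sketch of how the two refinement terms might combine is given here. It makes two assumptions that the cheat-sheet does not spell out: the cross-entropy targets are pseudo-labels taken as each pixel's argmax cluster, and edge strength is the mean squared difference between adjacent pixels. It is written in plain NumPy to show what is computed, not how SpaSEG trains.

```python
import numpy as np

def refinement_loss(logits, alpha=0.4, beta=0.7):
    """alpha * cross-entropy + beta * edge-strength for one image.

    logits : array of shape (n_clusters, height, width), raw network output.
    Both terms are illustrative assumptions, not SpaSEG's exact definitions.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Log-softmax over the cluster axis (numerically stable form).
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # Pseudo-labels: each pixel's currently most likely cluster.
    labels = logits.argmax(axis=0)
    picked = np.take_along_axis(log_probs, labels[None], axis=0)[0]
    cross_entropy = -picked.mean()
    # Edge strength: squared differences between vertical and horizontal neighbors.
    dy = logits[:, 1:, :] - logits[:, :-1, :]
    dx = logits[:, :, 1:] - logits[:, :, :-1]
    edge_strength = (dy ** 2).mean() + (dx ** 2).mean()
    return alpha * cross_entropy + beta * edge_strength

flat = np.zeros((3, 4, 4))
flat[0] = 5.0                       # every pixel confidently in cluster 0
noisy = flat.copy()
noisy[1, ::2, ::2] = 9.0            # isolated pixels flip to cluster 1
print(refinement_loss(flat) < refinement_loss(noisy))
```

The speckled map is penalized more heavily than the uniform one, which is the behavior the β term is meant to encourage.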
4. Quick Installation & Mini-Workflow
Environment
-
Python ≥ 3.9 -
PyTorch ≥ 1.12 (GPU optional but recommended)
One-line install
pip install stereopy
Five-line starter notebook
import stereo as st  # the stereopy package is imported under the module name "stereo"
data = st.io.read_10x_h5('my_visium_file.h5') # 1. load
st.pp.normalize_total(data) # 2. normalize
st.pp.pca(data, n_comps=50) # 3. reduce
st.tl.spa_seg(data, n_domains=6) # 4. segment
st.pl.domain(data, color='spa_seg') # 5. visualize
5. Benchmark Highlights (What You Actually Get)
Dataset | Platform | Spots | Speed-up vs. SpaGCN | Memory peak |
---|---|---|---|---|
Human DLPFC | 10x Visium | 3,000 | ~3× | < 2 GB |
Mouse whole brain | Stereo-seq | 526,716 | 26× | 9 GB |
Mouse embryo | seqFISH | 6,400 | 30× | < 1 GB |
Breast IDC | 10x Visium | 4,000 | 5× | < 2 GB |
6. Tutorial 1: Identify Tissue Layers in Human DLPFC
Goal: reproduce the famous six-layer cortex + white-matter map.
- Download spatialLIBD sample 151673.
- Run the 5-line starter above.
- Compare to manual labels:
  - ARI = 0.554 (higher than BayesSpace, SpaGCN, Leiden)
  - Layers 2–6 clearly separated; layer 4 slightly fuzzy (known issue).
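For readers who want to see what the ARI comparison computes, here is a self-contained NumPy sketch of the adjusted Rand index. The label values are toy data, not the DLPFC annotations, and in practice a library routine such as scikit-learn's `adjusted_rand_score` would normally be used instead.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same spots."""
    _, true_ids = np.unique(labels_true, return_inverse=True)
    _, pred_ids = np.unique(labels_pred, return_inverse=True)
    # Contingency table: spots shared by each (manual layer, predicted domain) pair.
    table = np.zeros((true_ids.max() + 1, pred_ids.max() + 1), dtype=np.int64)
    np.add.at(table, (true_ids, pred_ids), 1)

    def pairs(n):
        return n * (n - 1) / 2.0     # number of unordered pairs, "n choose 2"

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(table.sum())
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:        # degenerate case, e.g. one cluster in both labelings
        return 1.0
    return float((sum_cells - expected) / (max_index - expected))

manual = ["L1", "L1", "L2", "L2"]    # toy manual annotation
predicted = [0, 0, 1, 2]             # toy predicted domains
print(round(adjusted_rand_index(manual, predicted), 3))  # 0.571
```

A score of 1 means the two labelings agree perfectly up to renaming of clusters, and a score near 0 means agreement no better than chance.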
7. Tutorial 2: Half-Million-Spot Mouse Brain Without Tears
Goal: handle Stereo-seq Bin20 (10 µm spots) on a single GPU.
- Pre-binning: aggregate DNB counts into 10 µm bins → 526 k spots.
- PCA: 50 components (explains >80 % variance).
- SpaSEG finishes in 8 minutes; SpaGCN runs out of memory; Leiden takes 20 minutes and smears boundaries.
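The pre-binning step amounts to summing counts from all DNBs that fall in the same square tile. Here is a minimal NumPy sketch, assuming integer DNB coordinates and a dense count matrix; real Stereo-seq data is sparse and is normally binned by the platform's own tooling, so the function `bin_counts` is purely illustrative.

```python
import numpy as np

def bin_counts(x, y, counts, bin_size=20):
    """Sum per-DNB count vectors into square bins of bin_size x bin_size DNBs.

    x, y   : integer DNB coordinates
    counts : array of shape (n_dnbs, n_genes)
    Returns (unique_bins, binned); unique_bins holds one (x, y) bin index per row.
    """
    counts = np.asarray(counts)
    bin_xy = np.stack([np.asarray(x) // bin_size, np.asarray(y) // bin_size], axis=1)
    # Map every DNB to the row of its bin in the output matrix.
    unique_bins, inverse = np.unique(bin_xy, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)    # flatten; the shape differs across NumPy versions
    binned = np.zeros((len(unique_bins), counts.shape[1]), dtype=counts.dtype)
    np.add.at(binned, inverse, counts)
    return unique_bins, binned

# Five DNBs, two genes; the first three fall in bin (0, 0), the rest in bin (1, 0).
x = [0, 5, 19, 20, 39]
y = [0, 3, 19, 0, 10]
counts = np.array([[1, 0], [2, 1], [0, 4], [3, 3], [1, 1]])
unique_bins, binned = bin_counts(x, y, counts)
print(unique_bins.tolist(), binned.tolist())
```

Larger bins give fewer, less sparse spots at the cost of spatial resolution, so the bin size is a trade-off rather than a fixed constant.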
8. Tutorial 3: Stitch Four Adjacent Slices into 3-D
Goal: align mouse olfactory-bulb sections without external alignment software.
- Load four consecutive Stereo-seq slices.
- Concatenate into one AnnData object; add batch_key='slice_id'.
- Run multi-slice SpaSEG (alpha=0.2, beta=0.4).
- Granular cell layer (GCL) and subependymal zone (SEZ) line up automatically; F1_LISI score +25 % over Harmony/LIGER.
9. Tutorial 4: Find Region-Specific Genes
Goal: discover genes that turn on only in a specific tissue region.
After segmentation:
svg = st.tl.spatial_variable_genes(data, domain_key='spa_seg')
st.pl.gene(data, genes=['Nnat','Krt10','Ibsp'])
Gene | Domain | Known role |
---|---|---|
Nnat | Brain | Neuron development |
Krt10 | Epidermis | Keratinization |
Ibsp | Cartilage | Bone formation |
All hits pass:
- log2FC > 1.5
- in-domain expression ratio > 75 %
- FDR < 0.05
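The three thresholds can be applied as a simple boolean filter. The sketch below assumes the FDR comes from a Benjamini-Hochberg correction of per-gene p-values, which the text does not confirm, and it uses made-up statistics; the function names are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in input order."""
    p = np.asarray(pvalues, dtype=np.float64)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

def select_svgs(log2fc, in_domain_ratio, pvalues,
                min_log2fc=1.5, min_ratio=0.75, max_fdr=0.05):
    """Boolean mask of genes that pass all three thresholds."""
    fdr = benjamini_hochberg(pvalues)
    return ((np.asarray(log2fc) > min_log2fc)
            & (np.asarray(in_domain_ratio) > min_ratio)
            & (fdr < max_fdr))

# Hypothetical per-gene statistics for one domain.
log2fc = [2.3, 1.8, 0.9, 2.0]
ratio = [0.90, 0.60, 0.95, 0.80]
pvals = [0.001, 0.002, 0.0005, 0.20]
mask = select_svgs(log2fc, ratio, pvals)
print(mask.tolist())
```

Only the first gene survives here: the second fails the expression-ratio cut, the third the fold-change cut, and the fourth the FDR cut.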
10. Tutorial 5: Map Who Talks to Whom
Goal: predict ligand–receptor pairs that drive tumor-immune crosstalk.
Workflow:
- Spatial domains → SpaSEG
- Cell fractions → cell2location deconvolution
- L-R list → Squidpy curates CellPhoneDB + OmniPath pairs
- Score per spot → geometric mean co-expression × neighbor entropy
- Validation → correlate spot score with downstream gene expression
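The scoring step can be sketched for a single ligand-receptor pair as follows. This covers only the geometric-mean co-expression part; the neighbor-entropy weighting is left out because the workflow above does not define it. The neighbor lists and the function `lr_coexpression` are assumptions made for illustration.

```python
import numpy as np

def lr_coexpression(ligand, receptor, neighbors):
    """Per-spot ligand-receptor score from geometric-mean co-expression.

    ligand, receptor : expression of one ligand and one receptor per spot
    neighbors        : list where neighbors[i] holds the indices adjacent to spot i
    """
    ligand = np.asarray(ligand, dtype=np.float64)
    receptor = np.asarray(receptor, dtype=np.float64)
    scores = np.zeros(len(ligand))
    for i, nbrs in enumerate(neighbors):
        if len(nbrs) == 0:
            continue                 # isolated spot: no partner, score stays 0
        # Geometric mean of the sender's ligand and each neighbor's receptor, averaged.
        scores[i] = np.sqrt(ligand[i] * receptor[list(nbrs)]).mean()
    return scores

# Three spots in a row: 0 - 1 - 2
ligand = [4.0, 0.0, 1.0]
receptor = [0.0, 9.0, 0.0]
neighbors = [[1], [0, 2], [1]]
scores = lr_coexpression(ligand, receptor, neighbors)
print(scores.tolist())
```

Using a geometric rather than arithmetic mean means the score drops to zero whenever either partner is absent, which suits an interaction that needs both molecules.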
Example from breast IDC:
- CXCL12–CXCR4 between CAFs and T cells
- LTB–LTBR at tumor border
Spearman correlation 0.78 vs. known downstream targets.
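The validation step boils down to a rank correlation between interaction scores and downstream gene expression. Here is a dependency-light sketch of Spearman correlation with tie handling; the input values are invented, and `scipy.stats.spearmanr` would be the usual choice in a real analysis.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks inside each group of ties by the group mean.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return (rank_sums / counts)[inverse]

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Hypothetical per-spot interaction scores and downstream target expression.
spot_scores = [0.1, 0.4, 0.4, 0.9, 1.5]
target_expr = [1.0, 2.0, 3.5, 3.0, 8.0]
rho = spearman(spot_scores, target_expr)
print(round(rho, 2))
```

Because only ranks matter, the measure is insensitive to the scale of either quantity, which helps when scores and expression values live on very different ranges.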
11. FAQ – Troubleshooting in Real Projects
Q1: I only have 8 GB of RAM. Can I still run half-million-spot data?
Yes. Reduce batch_size or switch to CPU mode. Runtime increases ~2× but stays within hours.
Q2: How do I choose the number of spatial domains?
Start with anatomical knowledge (e.g., 6 cortical layers).
Check NMI/ARI elbow plot; SpaSEG merges over-clustered regions automatically after 2 000 epochs.
Q3: My Stereo-seq file is not a perfect grid—will accuracy suffer?
SpaSEG rescales coordinates to [0, 1] and zero-pads empty pixels. Empirical ARI loss < 0.02.
Q4: Can I combine Visium and MERFISH in one run?
Not yet. Cross-platform batch correction is on the roadmap. For now, analyze separately and compare SVG lists.
12. Limitations & Roadmap
- H&E images: not used in current release; multimodal version planned.
- Sparse matrices: PCA denoising is default; more aggressive imputation in testing.
- Cross-platform batch: manual harmonization required today.
13. When to Choose SpaSEG – Decision Table
Need | Recommendation |
---|---|
Stereo-seq >100 k spots | Use SpaSEG for speed |
Multi-section 3-D atlas | Use multi-slice mode |
Clinical tumor heterogeneity | Use SVG + L-R pipeline |
Teaching demo | 5-line notebook is enough |
14. Key Takeaway
SpaSEG turns gigabytes of chaotic spot-level counts into interpretable tissue maps—all with a few dozen lines of Python.
Whether you study brain layers, tumor margins, or embryonic development, you get:
- Speed: minutes instead of hours
- Accuracy: highest reported ARI/NMI across 12 benchmark datasets
- Simplicity: one package, one function call per task
Try the notebook today and spend your saved time on biology, not code.
Quick Links
- Paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03697-1
- Docs & tutorials: https://stereopy.readthedocs.io/en/v1.6.0/Tutorials(Multi-sample)/SpaSEG.html