From Data Chaos to Tissue Atlases: How SpaSEG Makes Spatial Transcriptomics Simple


1. Why Spatial Transcriptomics Matters (and Where It Hurts)

Imagine cutting a thin slice of brain or tumor tissue and asking, “Which genes are where?”
Spatially resolved transcriptomics (SRT) does exactly that. Instead of grinding tissue into single-cell soup, it keeps every measurement in its original neighborhood and records gene activity in situ.

The payoff: you can see immune cells swarming around a tumor margin, or layer-specific neurons sitting exactly where they should.
The pain: a single experiment can produce half a million data points, each carrying thousands of gene counts. Traditional tools choke on the size, lose spatial context, or refuse to work across different SRT platforms (10x Visium, Stereo-seq, MERFISH, etc.).


2. Meet SpaSEG: A Four-in-One Toolkit

SpaSEG is an unsupervised deep-learning framework built by BGI-Research and published in Genome Biology (2025).
In one pipeline it does:

  1. Spatial domain identification – finds tissue regions with similar gene patterns.
  2. Multi-section alignment – stitches neighboring slices into a 3-D map.
  3. Spatially-variable gene (SVG) discovery – genes that switch on/off between regions.
  4. Cell–cell interaction inference – guesses who is talking to whom, based on ligand–receptor pairs.

The trick: SpaSEG treats every spot as a pixel in a multi-channel image and runs a lightweight convolutional neural network (CNN).
No manual tuning, no platform-specific hacks.
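
To make the "spot as pixel" idea concrete, here is a minimal sketch of how spots could be laid out as a multi-channel image. It assumes each spot already has integer grid coordinates and a short feature vector (for example, a few principal components). The function name `spots_to_image` and the toy values are illustrative and are not part of SpaSEG.

```python
def spots_to_image(spots, n_channels):
    """Place each spot's feature vector at its (row, col) pixel.

    spots: list of (row, col, features) with integer grid coordinates.
    Pixels that hold no spot stay at zero (zero-padding).
    """
    rows = [r for r, _, _ in spots]
    cols = [c for _, c, _ in spots]
    r0, c0 = min(rows), min(cols)
    height = max(rows) - r0 + 1
    width = max(cols) - c0 + 1
    # height x width x channels, initialised to zero
    image = [[[0.0] * n_channels for _ in range(width)] for _ in range(height)]
    for r, c, feats in spots:
        image[r - r0][c - c0] = list(feats)
    return image


# Three hypothetical spots, two channels each
spots = [(10, 5, (0.3, -1.2)), (10, 7, (1.1, 0.4)), (11, 6, (-0.5, 0.9))]
image = spots_to_image(spots, n_channels=2)
print(len(image), len(image[0]), len(image[0][0]))  # height, width, channels
```

Once the data has this shape, a 3×3 convolution filter sees each spot together with its eight nearest grid neighbors, which is where the spatial context comes from.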


3. How It Works in Plain English

SpaSEG step                         Image-processing analogy
----------------------------------  -------------------------------------
Remove low-quality spots & genes    Crop and clean the image
PCA + z-score normalization         Compress color channels
CNN with 3×3 filters                Look at local neighborhoods
Edge-strength loss                  Keep boundaries smooth, not pixelated
Two-stage training                  “Preview” mode → “polish” mode

3.1 Two-Stage Training Cheat-Sheet

Stage        Epochs     Loss                                    Purpose
-----------  ---------  --------------------------------------  -------------------------------
Warm-up      400        Mean-squared error                      Initialize sensible weights
Refinement   ≤ 2,000    α × cross-entropy + β × edge-strength   Final clusters with crisp edges

Recommended weights

  • Single slice: α = 0.4, β = 0.7
  • Multiple slices: α = 0.2, β = 0.4
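
The refinement loss can be sketched on a toy 2×2 "image" to show how the two terms pull in different directions. This is an illustration under stated assumptions, not SpaSEG's actual code: cross-entropy is computed against each pixel's own argmax pseudo-label, and edge strength is taken as the mean absolute difference between horizontally and vertically adjacent pixels. All function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of `label` for one pixel."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]


def edge_strength(response):
    """Mean absolute difference between adjacent pixels (assumed definition)."""
    h, w = len(response), len(response[0])
    total, count = 0.0, 0
    for i in range(h):
        for j in range(w):
            for di, dj in ((1, 0), (0, 1)):  # neighbor below, neighbor to the right
                ni, nj = i + di, j + dj
                if ni < h and nj < w:
                    total += sum(abs(a - b) for a, b in zip(response[i][j], response[ni][nj]))
                    count += 1
    return total / count


def refinement_loss(response, alpha, beta):
    """alpha * cross-entropy + beta * edge-strength, as in the cheat-sheet."""
    pixels = [p for row in response for p in row]
    # Pseudo-label of a pixel = index of its strongest output (assumption)
    ce = sum(cross_entropy(p, p.index(max(p))) for p in pixels) / len(pixels)
    return alpha * ce + beta * edge_strength(response)


smooth = [[[2.0, 0.0], [2.0, 0.0]], [[2.0, 0.0], [2.0, 0.0]]]  # one uniform domain
noisy = [[[2.0, 0.0], [0.0, 2.0]], [[0.0, 2.0], [2.0, 0.0]]]   # checkerboard
loss_smooth = refinement_loss(smooth, alpha=0.4, beta=0.7)
loss_noisy = refinement_loss(noisy, alpha=0.4, beta=0.7)
print(loss_smooth < loss_noisy)
```

Both toy maps are equally confident per pixel, so the cross-entropy term is identical. Only the edge-strength term penalizes the checkerboard, which is why a larger β favors smoother domains.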

4. Quick Installation & Mini-Workflow

Environment

  • Python ≥ 3.9
  • PyTorch ≥ 1.12 (GPU optional but recommended)

One-line install

pip install stereopy

Five-line starter notebook

import stereo as st  # the stereopy package is imported under the name "stereo"

data = st.io.read_10x_h5('my_visium_file.h5')  # 1. load
st.pp.normalize_total(data)                    # 2. normalize
st.pp.pca(data, n_comps=50)                    # 3. reduce
st.tl.spa_seg(data, n_domains=6)               # 4. segment
st.pl.domain(data, color='spa_seg')            # 5. visualize

5. Benchmark Highlights (What You Actually Get)

Dataset            Platform     Spots     Speed-up vs. SpaGCN   Memory peak
-----------------  -----------  --------  --------------------  -----------
Human DLPFC        10x Visium   3,000     ~3×                   < 2 GB
Mouse whole brain  Stereo-seq   526,716   26×                   9 GB
Mouse embryo       seqFISH      6,400     30×                   < 1 GB
Breast IDC         10x Visium   4,000     not given             < 2 GB

6. Tutorial 1: Identify Tissue Layers in Human DLPFC

Goal: reproduce the famous six-layer cortex + white-matter map.

  1. Download spatialLIBD sample 151673.
  2. Run the 5-line starter above with n_domains=7 (six layers plus white matter).
  3. Compare to manual labels:

    • ARI = 0.554 (higher than BayesSpace, SpaGCN, Leiden)
    • Layers 2–6 clearly separated; layer 4 slightly fuzzy (known issue).
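
For readers who want to see what the ARI comparison actually computes, here is a small standard-library sketch of the adjusted Rand index. The label lists are toy values, not the DLPFC data. In practice you would more likely call `sklearn.metrics.adjusted_rand_score`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labelings."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


manual = ["L1", "L1", "L2", "L2", "WM", "WM"]  # toy manual annotation
same_up_to_names = [2, 2, 0, 0, 1, 1]          # same partition, different cluster ids
partly_wrong = [0, 0, 0, 1, 1, 1]              # one L2 spot merged into L1, one into WM

print(adjusted_rand_index(manual, same_up_to_names))
print(round(adjusted_rand_index(manual, partly_wrong), 3))
```

ARI ignores the cluster names and only asks whether pairs of spots are grouped together in both labelings, so relabelled but identical partitions score 1.0.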

7. Tutorial 2: Half-Million-Spot Mouse Brain Without Tears

Goal: handle Stereo-seq Bin20 (10 µm spots) on a single GPU.

  1. Pre-binning: aggregate DNA nanoball (DNB) counts into 10 µm bins → 526 k spots.
  2. PCA: 50 components (explains >80 % variance).
  3. SpaSEG finishes in 8 minutes; SpaGCN runs out of memory; Leiden takes 20 minutes and smears boundaries.
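
The pre-binning step can be sketched in a few lines. This assumes the raw records are (x, y, gene, count) tuples with coordinates in DNB units, so that Bin20 means 20 × 20 DNBs per bin (which, at 10 µm per bin, implies a 0.5 µm pitch). The function name `bin_counts` and the toy records are hypothetical. Real pipelines use the platform's own readers for this.

```python
from collections import defaultdict


def bin_counts(records, bin_size=20):
    """Sum per-gene counts of all DNBs that fall into the same square bin.

    records: iterable of (x, y, gene, count) at DNB resolution.
    Returns {(bin_x, bin_y): {gene: total_count}}.
    """
    binned = defaultdict(lambda: defaultdict(int))
    for x, y, gene, count in records:
        key = (x // bin_size, y // bin_size)  # integer division picks the bin
        binned[key][gene] += count
    return {key: dict(genes) for key, genes in binned.items()}


# Toy records; gene names are placeholders
records = [
    (0, 0, "geneA", 1),
    (19, 19, "geneA", 2),  # same bin as (0, 0)
    (3, 4, "geneB", 1),
    (20, 0, "geneA", 5),   # first DNB of the next bin along x
]
bins = bin_counts(records)
print(bins)
```

Each resulting bin then plays the role of one "spot" in the later steps.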

8. Tutorial 3: Stitch Four Adjacent Slices into 3-D

Goal: align mouse olfactory-bulb sections without external alignment software.

  1. Load four consecutive Stereo-seq slices.
  2. Concatenate into one AnnData object; add batch_key='slice_id'.
  3. Run multi-slice SpaSEG (alpha=0.2, beta=0.4).
  4. Granule cell layer (GCL) and subependymal zone (SEZ) line up automatically; the F1_LISI score is 25 % higher than with Harmony or LIGER.

9. Tutorial 4: Find Region-Specific Genes

Goal: discover genes that turn on only in one spatial domain (for example, only in the brain).

After segmentation:

svg = st.tl.spatial_variable_genes(data, domain_key='spa_seg')
st.pl.gene(data, genes=['Nnat','Krt10','Ibsp'])

Gene    Domain      Known role
------  ----------  ------------------
Nnat    Brain       Neuron development
Krt10   Epidermis   Keratinization
Ibsp    Cartilage   Bone formation

All hits pass:

  • log2FC > 1.5
  • in-domain expression ratio > 75 %
  • FDR < 0.05
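
The three thresholds above can be applied as a simple filter. The sketch below uses made-up per-gene statistics with placeholder gene names and assumes the FDR is obtained with the Benjamini-Hochberg procedure, which the article does not specify.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical statistics for one domain
candidates = [
    {"gene": "geneA", "log2fc": 2.3, "ratio": 0.91, "p": 0.001},
    {"gene": "geneB", "log2fc": 1.8, "ratio": 0.60, "p": 0.04},  # expressed in too few spots
    {"gene": "geneC", "log2fc": 2.0, "ratio": 0.80, "p": 0.03},  # fails after FDR correction
    {"gene": "geneD", "log2fc": 0.4, "ratio": 0.95, "p": 0.5},   # fold change too small
]

fdr = benjamini_hochberg([c["p"] for c in candidates])
hits = [
    c["gene"]
    for c, q in zip(candidates, fdr)
    if c["log2fc"] > 1.5 and c["ratio"] > 0.75 and q < 0.05
]
print(hits)
```

Note that geneC has a raw p-value below 0.05 but is still dropped, because the multiple-testing correction pushes its adjusted value above the cut-off.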

10. Tutorial 5: Map Who Talks to Whom

Goal: predict ligand–receptor pairs that drive tumor-immune crosstalk.

Workflow:

  1. Spatial domains → SpaSEG
  2. Cell fractions → cell2location deconvolution
  3. L-R list → Squidpy curates CellPhoneDB + OmniPath pairs
  4. Score per spot → geometric mean co-expression × neighbor entropy
  5. Validation → correlate spot score with downstream gene expression

Example from breast IDC:

  • CXCL12–CXCR4 between CAFs and T cells
  • LTB–LTBR at tumor border
    Spearman correlation with known downstream targets: 0.78.
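
Steps 4 and 5 of the workflow can be illustrated with a stripped-down sketch. It covers only the geometric-mean co-expression part of the score (the neighbor-entropy factor is left out because its exact definition is not given here) and a plain Spearman correlation for validation. The expression values are invented, and ligand and receptor are assumed to be measured at the same spot.

```python
import math


def geometric_mean_score(ligand, receptor):
    """Per-spot co-expression: sqrt(ligand * receptor)."""
    return [math.sqrt(l * r) for l, r in zip(ligand, receptor)]


def average_ranks(values):
    """1-based ranks, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy values for five spots
ligand = [0.0, 1.0, 4.0, 9.0, 2.0]
receptor = [5.0, 1.0, 4.0, 1.0, 8.0]
downstream = [0.1, 0.5, 2.0, 1.5, 2.2]  # hypothetical target-gene expression

scores = geometric_mean_score(ligand, receptor)
rho = spearman(scores, downstream)
print(scores, round(rho, 2))
```

The geometric mean is zero whenever either partner is absent, so a spot only scores high when both ligand and receptor are expressed.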

11. FAQ – Troubleshooting in Real Projects

Q1: I only have 8 GB of RAM. Can I still run half-million-spot data?
Yes. Reduce batch_size or switch to CPU mode. Runtime increases ~2× but stays within hours.

Q2: How do I choose the number of spatial domains?
Start with anatomical knowledge (e.g., 6 cortical layers).
Check the NMI/ARI elbow plot; SpaSEG merges over-clustered regions automatically after 2,000 epochs.

Q3: My Stereo-seq file is not a perfect grid—will accuracy suffer?
SpaSEG rescales coordinates to [0, 1] and zero-pads empty pixels. Empirical ARI loss < 0.02.
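
As a rough picture of what rescaling and zero-padding mean for an irregular layout, the sketch below maps arbitrary coordinates to [0, 1] and then onto a small pixel grid. The grid size and helper names are illustrative choices and do not reflect SpaSEG internals.

```python
def rescale_to_unit(coords):
    """Min-max rescale (x, y) coordinates to the unit square."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    x0, y0 = min(xs), min(ys)
    span_x = (max(xs) - x0) or 1.0  # avoid division by zero for a single column
    span_y = (max(ys) - y0) or 1.0
    return [((x - x0) / span_x, (y - y0) / span_y) for x, y in coords]


def to_grid(unit_coords, size):
    """Map unit-square coordinates to (row, col) indices on a size x size grid."""
    return [
        (min(int(v * size), size - 1), min(int(u * size), size - 1))
        for u, v in unit_coords
    ]


coords = [(100.0, 200.0), (150.0, 260.0), (300.0, 500.0)]  # irregular toy positions
unit = rescale_to_unit(coords)
pixels = to_grid(unit, size=4)

# Occupancy mask: 1 where a spot landed, 0 where the pixel is zero-padded
mask = [[0] * 4 for _ in range(4)]
for row, col in pixels:
    mask[row][col] = 1
print(pixels)
print(sum(map(sum, mask)), "of 16 pixels occupied")
```

Everything outside the occupied pixels simply stays at zero, so gaps in the tissue do not need special handling downstream.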

Q4: Can I combine Visium and MERFISH in one run?
Not yet. Cross-platform batch correction is on the roadmap. For now, analyze separately and compare SVG lists.


12. Limitations & Roadmap

  • H&E images: not used in current release; multimodal version planned.
  • Sparse matrices: PCA denoising is default; more aggressive imputation in testing.
  • Cross-platform batch: manual harmonization required today.

13. When to Choose SpaSEG – Decision Table

Need                           Recommendation
-----------------------------  -------------------------
Stereo-seq >100 k spots        Use SpaSEG for speed
Multi-section 3-D atlas        Use multi-slice mode
Clinical tumor heterogeneity   Use SVG + L-R pipeline
Teaching demo                  5-line notebook is enough

14. Key Takeaway

SpaSEG turns gigabytes of chaotic spot-level counts into interpretable tissue maps, all with a few dozen lines of Python.
Whether you study brain layers, tumor margins, or embryonic development, you get:

  • Speed: minutes instead of hours
  • Accuracy: highest reported ARI/NMI across 12 benchmark datasets
  • Simplicity: one package, one function call per task

Try the notebook today and spend your saved time on biology, not code.


Quick Links

  • Paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03697-1
  • Docs & tutorials: https://stereopy.readthedocs.io/en/v1.6.0/Tutorials(Multi-sample)/SpaSEG.html