A runnable, honest walkthrough of the core Perturb-seq analysis pipeline on a real CRISPR screen —
from raw guide counts to perturbation effect sizes — using pertpy
and scanpy. Each script is a single, self-contained step that prints
its result and the methodological lesson behind it.
The point isn't the dataset; it's the decisions that make or break a Perturb-seq analysis: how you call guides, how you tell real knockouts from cells that escaped editing, why per-cell differential expression lies to you, and how to summarize a perturbation's effect as one defensible number.
Papalexi et al. 2021, Nat Genet — an ECCITE-seq
CRISPR screen: knockouts of 25 genes (+ non-targeting controls) in THP-1 cells under IFN-γ
stimulation, with simultaneous RNA, surface protein (ADT), hashing (HTO), and raw guide counts
(GDO). pertpy ships it as a MuData; the first run downloads it (~590 MB) to ./data/.
The screen targets the IFN-γ / JAK-STAT axis, so we know the biological ground truth: STAT1, JAK2, IFNGR1/2, IRF1, SMAD4 should be the strongest perturbations. That makes it an ideal harness for checking whether a pipeline actually recovers known biology.
| Script | Question | Method |
|---|---|---|
run_guide_assignment.py |
Which perturbation is each cell carrying? | argmax over raw guide counts + threshold → target; validate vs paper labels |
run_mixscape.py |
Is a guide-bearing cell actually perturbed? | Mixscape: perturbation signature (kNN-NT correction) + per-target GMM → KO vs NP |
run_de.py |
Which genes change — and how do we avoid fooling ourselves? | per-cell Wilcoxon vs pseudobulk-across-replicates (the pseudoreplication trap) |
run_edistance.py |
How strong is each perturbation, and is it real? | E-distance (energy distance) ranking + permutation E-test |
1. Guide assignment — assign each cell to its dominant guide above a count threshold, map guide → target gene, and check against the paper's labels. At threshold 5, ~100% of cells are assigned with ~100% target concordance. Lesson: this upstream step's threshold and "one dominant guide" rule shape everything downstream — it isn't free.
2. Mixscape — not every cell carrying a guide is edited. Mixscape computes a per-cell perturbation signature (expression minus its nearest NT neighbors, in PCA space) then fits a 2-component mixture per target to split KO (truly perturbed) from NP (non-perturbed / escaping). Result: of ~18k guide-bearing cells, only 1,740 are called KO — and the KO rate ranges from SMAD4 55% down to 0% for many targets. Lesson: "carries a guide" ≠ "is perturbed"; filtering to KO cells sharpens every downstream test.
3. Differential expression — the pseudoreplication trap — for STAT1 vs NT:
| Test | Significant genes | Replication unit |
|---|---|---|
| per-cell Wilcoxon | 438 (FDR<0.05) | each cell (wrong — cells aren't independent) |
| pseudobulk t-test | 0 (FDR<0.05), 162 at nominal p | each biological replicate (right) |
Lesson: per-cell DE treats thousands of cells as independent replicates and inflates significance by
orders of magnitude. Pseudobulk tests across the true (here, 3) replicates — far more conservative.
The signal is visible at nominal p, but a naïve per-gene t-test on 3 replicates barely survives FDR,
which is exactly why production pipelines use DESeq2/edgeR's shared-variance models (e.g.
pt.tl.PyDESeq2 / pt.tl.EdgeR) on pseudobulk counts.
4. E-distance — one multivariate effect size per target, plus a permutation test. The ranking is exactly right — STAT1 > JAK2 > IFNGR2 > SMAD4 > IFNGR1 (the core IFN-γ/JAK-STAT pathway). But effect size ≠ significance: 14/25 targets pass the E-test, and the largest-shift target isn't always the most significant. STAT1 has the largest distance yet fails the E-test — it has the smallest cell count (444) and highest within-group spread (the same escaper heterogeneity Mixscape flagged as 25% KO), which widens its permutation null. Meanwhile NFKBIA has a tiny distance but passes, because it's large-n and tight. Lesson: report distance for "how strong" and the E-test for "is it real" — they answer different questions.
These steps are deliberately consistent: STAT1's low Mixscape-KO fraction (step 2) is the same heterogeneity that sinks its E-test significance (step 4). A coherent pipeline tells one story.
The scanpy/pertpy stack is happiest on Python 3.11.
cd perturbseq-starter
python3.11 -m venv .venv # or: conda create -p .venv python=3.11
.venv/bin/pip install -r requirements.txt
# pertpy's Mixscape GMM needs scikit-learn < 1.6:
.venv/bin/pip install "scikit-learn<1.6".venv/bin/python run_guide_assignment.py --threshold 5
.venv/bin/python run_mixscape.py
.venv/bin/python run_de.py --target STAT1
.venv/bin/python run_edistance.py --n-perms 1000The first script triggers the one-time ~590 MB download into ./data/ (gitignored).
src/data.py # load Papalexi MuData; RNA + GDO accessors; standard scanpy preprocess
run_guide_assignment.py
run_mixscape.py
run_de.py
run_edistance.py
requirements.txt
- The pseudobulk DE here is a plain logCPM Welch t-test to make the pseudoreplication point with no extra dependencies; for real work use DESeq2/edgeR on raw pseudobulk counts.
- Guide assignment is done with an explicit numpy argmax rather than
pt.pp.GuideAssignment(pertpy 1.0.3's AnnData dispatch forassign_to_max_guidehas a bug); the logic is identical and clearer. - E-distance is computed in PCA space (the common default); the representation and #PCs affect the absolute numbers but not the ranking.
Part of a small single-cell / computational-biology portfolio — each a runnable, honestly-documented pipeline:
- perturbseq-starter — Perturb-seq / CRISPR-screen analysis: guide assignment → Mixscape → pseudobulk-vs-per-cell DE → E-distance. (this repo)
- scatac-starter — single-cell ATAC: TSS-enrichment QC → tile/spectral clustering → gene activity → MACS3 peaks + TF-motif enrichment.
- scrna-workflow — the same multi-sample scRNA pipeline as a reproducible DAG in both Snakemake and Nextflow.