perturbseq-starter

A runnable, honest walkthrough of the core Perturb-seq analysis pipeline on a real CRISPR screen — from raw guide counts to perturbation effect sizes — using pertpy and scanpy. Each script is a single, self-contained step that prints its result and the methodological lesson behind it.

The point isn't the dataset; it's the decisions that make or break a Perturb-seq analysis: how you call guides, how you tell real knockouts from cells that escaped editing, why per-cell differential expression lies to you, and how to summarize a perturbation's effect as one defensible number.

Dataset

Papalexi et al. 2021, Nat Genet — an ECCITE-seq CRISPR screen: knockouts of 25 genes (+ non-targeting controls) in THP-1 cells under IFN-γ stimulation, with simultaneous RNA, surface protein (ADT), hashing (HTO), and raw guide counts (GDO). pertpy ships it as a MuData; the first run downloads it (~590 MB) to ./data/.

The screen targets the IFN-γ / JAK-STAT axis, so we know the biological ground truth: STAT1, JAK2, IFNGR1/2, IRF1, SMAD4 should be the strongest perturbations. That makes it an ideal harness for checking whether a pipeline actually recovers known biology.

The four steps

Script	Question	Method
`run_guide_assignment.py`	Which perturbation is each cell carrying?	argmax over raw guide counts + threshold → target; validate vs paper labels
`run_mixscape.py`	Is a guide-bearing cell actually perturbed?	Mixscape: perturbation signature (kNN-NT correction) + per-target GMM → KO vs NP
`run_de.py`	Which genes change — and how do we avoid fooling ourselves?	per-cell Wilcoxon vs pseudobulk-across-replicates (the pseudoreplication trap)
`run_edistance.py`	How strong is each perturbation, and is it real?	E-distance (energy distance) ranking + permutation E-test

What each step actually shows (verified output)

1. Guide assignment — assign each cell to its dominant guide above a count threshold, map guide → target gene, and check against the paper's labels. At threshold 5, ~100% of cells are assigned with ~100% target concordance. Lesson: this upstream step's threshold and "one dominant guide" rule shape everything downstream — it isn't free.

2. Mixscape — not every cell carrying a guide is edited. Mixscape computes a per-cell perturbation signature (expression minus its nearest NT neighbors, in PCA space) then fits a 2-component mixture per target to split KO (truly perturbed) from NP (non-perturbed / escaping). Result: of ~18k guide-bearing cells, only 1,740 are called KO — and the KO rate ranges from SMAD4 55% down to 0% for many targets. Lesson: "carries a guide" ≠ "is perturbed"; filtering to KO cells sharpens every downstream test.

3. Differential expression — the pseudoreplication trap — for STAT1 vs NT:

Test	Significant genes	Replication unit
per-cell Wilcoxon	438 (FDR<0.05)	each cell (wrong — cells aren't independent)
pseudobulk t-test	0 (FDR<0.05), 162 at nominal p	each biological replicate (right)

Lesson: per-cell DE treats thousands of cells as independent replicates and inflates significance by orders of magnitude. Pseudobulk tests across the true (here, 3) replicates — far more conservative. The signal is visible at nominal p, but a naïve per-gene t-test on 3 replicates barely survives FDR, which is exactly why production pipelines use DESeq2/edgeR's shared-variance models (e.g. pt.tl.PyDESeq2 / pt.tl.EdgeR) on pseudobulk counts.

4. E-distance — one multivariate effect size per target, plus a permutation test. The ranking is exactly right — STAT1 > JAK2 > IFNGR2 > SMAD4 > IFNGR1 (the core IFN-γ/JAK-STAT pathway). But effect size ≠ significance: 14/25 targets pass the E-test, and the largest-shift target isn't always the most significant. STAT1 has the largest distance yet fails the E-test — it has the smallest cell count (444) and highest within-group spread (the same escaper heterogeneity Mixscape flagged as 25% KO), which widens its permutation null. Meanwhile NFKBIA has a tiny distance but passes, because it's large-n and tight. Lesson: report distance for "how strong" and the E-test for "is it real" — they answer different questions.

These steps are deliberately consistent: STAT1's low Mixscape-KO fraction (step 2) is the same heterogeneity that sinks its E-test significance (step 4). A coherent pipeline tells one story.

Setup

The scanpy/pertpy stack is happiest on Python 3.11.

cd perturbseq-starter
python3.11 -m venv .venv        # or: conda create -p .venv python=3.11
.venv/bin/pip install -r requirements.txt
# pertpy's Mixscape GMM needs scikit-learn < 1.6:
.venv/bin/pip install "scikit-learn<1.6"

Run

.venv/bin/python run_guide_assignment.py --threshold 5
.venv/bin/python run_mixscape.py
.venv/bin/python run_de.py --target STAT1
.venv/bin/python run_edistance.py --n-perms 1000

The first script triggers the one-time ~590 MB download into ./data/ (gitignored).

Layout

src/data.py            # load Papalexi MuData; RNA + GDO accessors; standard scanpy preprocess
run_guide_assignment.py
run_mixscape.py
run_de.py
run_edistance.py
requirements.txt

Notes / honest caveats

The pseudobulk DE here is a plain logCPM Welch t-test to make the pseudoreplication point with no extra dependencies; for real work use DESeq2/edgeR on raw pseudobulk counts.
Guide assignment is done with an explicit numpy argmax rather than pt.pp.GuideAssignment (pertpy 1.0.3's AnnData dispatch for assign_to_max_guide has a bug); the logic is identical and clearer.
E-distance is computed in PCA space (the common default); the representation and #PCs affect the absolute numbers but not the ranking.

Related projects

Part of a small single-cell / computational-biology portfolio — each a runnable, honestly-documented pipeline:

perturbseq-starter — Perturb-seq / CRISPR-screen analysis: guide assignment → Mixscape → pseudobulk-vs-per-cell DE → E-distance. (this repo)
scatac-starter — single-cell ATAC: TSS-enrichment QC → tile/spectral clustering → gene activity → MACS3 peaks + TF-motif enrichment.
scrna-workflow — the same multi-sample scRNA pipeline as a reproducible DAG in both Snakemake and Nextflow.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

perturbseq-starter

Dataset

The four steps

What each step actually shows (verified output)

Setup

Run

Layout

Notes / honest caveats

Related projects

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
run_de.py		run_de.py
run_edistance.py		run_edistance.py
run_guide_assignment.py		run_guide_assignment.py
run_mixscape.py		run_mixscape.py

Folders and files

Latest commit

History

Repository files navigation

perturbseq-starter

Dataset

The four steps

What each step actually shows (verified output)

Setup

Run

Layout

Notes / honest caveats

Related projects

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages