Skip to content

princello/perturbseq-starter

Repository files navigation

perturbseq-starter

A runnable, honest walkthrough of the core Perturb-seq analysis pipeline on a real CRISPR screen — from raw guide counts to perturbation effect sizes — using pertpy and scanpy. Each script is a single, self-contained step that prints its result and the methodological lesson behind it.

The point isn't the dataset; it's the decisions that make or break a Perturb-seq analysis: how you call guides, how you tell real knockouts from cells that escaped editing, why per-cell differential expression lies to you, and how to summarize a perturbation's effect as one defensible number.

Dataset

Papalexi et al. 2021, Nat Genet — an ECCITE-seq CRISPR screen: knockouts of 25 genes (+ non-targeting controls) in THP-1 cells under IFN-γ stimulation, with simultaneous RNA, surface protein (ADT), hashing (HTO), and raw guide counts (GDO). pertpy ships it as a MuData; the first run downloads it (~590 MB) to ./data/.

The screen targets the IFN-γ / JAK-STAT axis, so we know the biological ground truth: STAT1, JAK2, IFNGR1/2, IRF1, SMAD4 should be the strongest perturbations. That makes it an ideal harness for checking whether a pipeline actually recovers known biology.

The four steps

Script Question Method
run_guide_assignment.py Which perturbation is each cell carrying? argmax over raw guide counts + threshold → target; validate vs paper labels
run_mixscape.py Is a guide-bearing cell actually perturbed? Mixscape: perturbation signature (kNN-NT correction) + per-target GMM → KO vs NP
run_de.py Which genes change — and how do we avoid fooling ourselves? per-cell Wilcoxon vs pseudobulk-across-replicates (the pseudoreplication trap)
run_edistance.py How strong is each perturbation, and is it real? E-distance (energy distance) ranking + permutation E-test

What each step actually shows (verified output)

1. Guide assignment — assign each cell to its dominant guide above a count threshold, map guide → target gene, and check against the paper's labels. At threshold 5, ~100% of cells are assigned with ~100% target concordance. Lesson: this upstream step's threshold and "one dominant guide" rule shape everything downstream — it isn't free.

2. Mixscape — not every cell carrying a guide is edited. Mixscape computes a per-cell perturbation signature (expression minus its nearest NT neighbors, in PCA space) then fits a 2-component mixture per target to split KO (truly perturbed) from NP (non-perturbed / escaping). Result: of ~18k guide-bearing cells, only 1,740 are called KO — and the KO rate ranges from SMAD4 55% down to 0% for many targets. Lesson: "carries a guide" ≠ "is perturbed"; filtering to KO cells sharpens every downstream test.

3. Differential expression — the pseudoreplication trap — for STAT1 vs NT:

Test Significant genes Replication unit
per-cell Wilcoxon 438 (FDR<0.05) each cell (wrong — cells aren't independent)
pseudobulk t-test 0 (FDR<0.05), 162 at nominal p each biological replicate (right)

Lesson: per-cell DE treats thousands of cells as independent replicates and inflates significance by orders of magnitude. Pseudobulk tests across the true (here, 3) replicates — far more conservative. The signal is visible at nominal p, but a naïve per-gene t-test on 3 replicates barely survives FDR, which is exactly why production pipelines use DESeq2/edgeR's shared-variance models (e.g. pt.tl.PyDESeq2 / pt.tl.EdgeR) on pseudobulk counts.

4. E-distance — one multivariate effect size per target, plus a permutation test. The ranking is exactly right — STAT1 > JAK2 > IFNGR2 > SMAD4 > IFNGR1 (the core IFN-γ/JAK-STAT pathway). But effect size ≠ significance: 14/25 targets pass the E-test, and the largest-shift target isn't always the most significant. STAT1 has the largest distance yet fails the E-test — it has the smallest cell count (444) and highest within-group spread (the same escaper heterogeneity Mixscape flagged as 25% KO), which widens its permutation null. Meanwhile NFKBIA has a tiny distance but passes, because it's large-n and tight. Lesson: report distance for "how strong" and the E-test for "is it real" — they answer different questions.

These steps are deliberately consistent: STAT1's low Mixscape-KO fraction (step 2) is the same heterogeneity that sinks its E-test significance (step 4). A coherent pipeline tells one story.

Setup

The scanpy/pertpy stack is happiest on Python 3.11.

cd perturbseq-starter
python3.11 -m venv .venv        # or: conda create -p .venv python=3.11
.venv/bin/pip install -r requirements.txt
# pertpy's Mixscape GMM needs scikit-learn < 1.6:
.venv/bin/pip install "scikit-learn<1.6"

Run

.venv/bin/python run_guide_assignment.py --threshold 5
.venv/bin/python run_mixscape.py
.venv/bin/python run_de.py --target STAT1
.venv/bin/python run_edistance.py --n-perms 1000

The first script triggers the one-time ~590 MB download into ./data/ (gitignored).

Layout

src/data.py            # load Papalexi MuData; RNA + GDO accessors; standard scanpy preprocess
run_guide_assignment.py
run_mixscape.py
run_de.py
run_edistance.py
requirements.txt

Notes / honest caveats

  • The pseudobulk DE here is a plain logCPM Welch t-test to make the pseudoreplication point with no extra dependencies; for real work use DESeq2/edgeR on raw pseudobulk counts.
  • Guide assignment is done with an explicit numpy argmax rather than pt.pp.GuideAssignment (pertpy 1.0.3's AnnData dispatch for assign_to_max_guide has a bug); the logic is identical and clearer.
  • E-distance is computed in PCA space (the common default); the representation and #PCs affect the absolute numbers but not the ranking.

Related projects

Part of a small single-cell / computational-biology portfolio — each a runnable, honestly-documented pipeline:

  • perturbseq-starter — Perturb-seq / CRISPR-screen analysis: guide assignment → Mixscape → pseudobulk-vs-per-cell DE → E-distance. (this repo)
  • scatac-starter — single-cell ATAC: TSS-enrichment QC → tile/spectral clustering → gene activity → MACS3 peaks + TF-motif enrichment.
  • scrna-workflow — the same multi-sample scRNA pipeline as a reproducible DAG in both Snakemake and Nextflow.

About

Runnable Perturb-seq analysis pipeline (guide assignment, Mixscape, pseudobulk-vs-per-cell DE, E-distance) on the Papalexi 2021 ECCITE-seq CRISPR screen — pertpy + scanpy

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages