wisp builds population count masks from BAM/CRAM alignments or all-sites VCFs. It is a companion tool for pixy, though it works equally well on its own.
Population count masks let you correctly compute the denominators of π, dxy, Watterson's θ, and Tajima's D when working from a variants-only VCF — callable sites are counted per population rather than collapsed into a single cohort-wide pass/fail.
Wisp is inspired by mop, clam, and makes heavy use of mosdepth.
Note:
wispis pre-release (alpha) and pending validation. It will be distributed on Bioconda aswisp-maskonce stable.
mamba create -n wisp python=3.11 pip git -c conda-forge
mamba activate wisp
mamba install -c conda-forge samtools bcftools htslib mosdepth
python -m pip install "git+https://github.com/samuk-lab/wisp.git"wisp runs mosdepth on each sample and collapses per-sample pass intervals into a population count mask:
wisp from-alignments \
--samples samples.tsv \
--variants-vcf variants.vcf.gz \
--out results \
--threads 4 \
--jobs 2The sample table is tab-delimited with an optional header:
sample_id population alignment
sample_1 popA /path/sample_1.bam
sample_2 popB /path/sample_2.cram
If BAM/CRAM read groups include sample names, they must match the corresponding sample_id.
When --variants-vcf is provided, wisp uses the variants-only VCF to fill in omitted
alignment thresholds: --min-dp/--max-dp from per-sample FORMAT/DP and --min-mapq
from INFO/MQ when those fields are available. Manually supplied threshold flags take
precedence. The same VCF is also scanned for indels, symbolic structural variants, breakends,
and multi-nucleotide polymorphisms; those reference spans are removed from every sample pass BED,
so they are omitted from the sparse final BED and interpreted as zero passing samples in every
population.
wisp from-vcf \
--all-sites-vcf all_sites.vcf.gz \
--popfile populations.tsv \
--min-dp 10 \
--max-dp 80 \
--out resultsThe population file is tab-delimited with an optional header:
sample_id population
sample_1 popA
sample_2 popB
A few things to know about VCF mode:
- Every sample in the population file must appear in the VCF. VCF samples absent from the population file are ignored (with a warning).
- Records may carry any
FILTERvalue — the input is assumed to have been filtered as desired before runningwisp. - A sample passes a site when
FORMAT/DP >= --min-dp, if suppliedFORMAT/DP <= --max-dp, and, whenFORMAT/GTis present, the genotype is not missing. - At duplicate
CHROM:POSrecords, a sample passes a site if any duplicate passes the depth thresholds. Duplicates must be contiguous, as in a coordinate-sorted VCF. - By default, all record types are used (SNPs, indels, symbolic alleles, invariant sites). Pass
--snps-onlyto exclude indel sites while retaining invariant sites.
wisp writes two files to --out:
| File | Description |
|---|---|
wisp.bed.gz |
bgzip-compressed, tabix-indexed population count mask |
wisp.bed.gz.tbi |
tabix index |
Use --output-prefix to change the filename stem; .bed.gz is always appended.
The population count mask is sparse: intervals absent from the output carry zero passing samples across all populations. Two comment-prefixed header lines precede the data:
- A JSON metadata line recording run parameters, output format, column names, and per-population sample counts.
- A column header line, e.g.
#chrom start end GBR YRI.
Fixture-generation scripts using 1000 Genomes data live in tests/test_data/scripts/:
bash tests/test_data/scripts/download_1000g_10sample_chr20_highcov_fixture.sh
bash tests/test_data/scripts/download_1000g_20sample_4chrom_highcov_fixture.shFull documentation lives in docs/.
