This is the chipseq pipeline from the Sequana project.
| Overview: | ChIP-seq pipeline from raw reads to peaks, IDR statistics, and functional annotation |
|---|---|
| Input: | Paired or single-end FastQ files and a CSV experimental design file |
| Output: | HTML summary report, narrow/broad peak files, IDR statistics, bigwig tracks, annotation tables, and IGV session file |
| Status: | Production |
| Citation: | Cokelaer et al. (2017), 'Sequana': a Set of Snakemake NGS pipelines, Journal of Open Source Software, 2(16), 352, JOSS DOI https://doi.org/10.21105/joss.00352 |
pip install sequana_chipseq --upgrade
You will also need the third-party tools listed under Requirements below.
1. Prepare a design file design.csv:
type,condition,replicat,sample_name
IP,EXP1,1,IP_EXP1_rep1
IP,EXP1,2,IP_EXP1_rep2
Input,EXP1,1,Input_EXP1
- `type` must be `IP` (immunoprecipitated) or `Input` (control).
- `sample_name` must match the prefix of the corresponding FastQ file (e.g. `IP_EXP1_rep1` matches `IP_EXP1_rep1_R1_.fastq.gz`).
- At least two IP replicates per condition are required for IDR analysis.
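The rules above can be checked before launching the pipeline. The following is a minimal sketch, not part of sequana_chipseq itself: `check_design` is a hypothetical helper name, and it only covers the two rules that can be verified from the CSV alone (allowed `type` values and the number of IP replicates per condition).

```python
import csv
import io
from collections import Counter

ALLOWED_TYPES = {"IP", "Input"}


def check_design(handle):
    """Return a list of problems found in a design file (empty if none).

    Hypothetical helper: the pipeline performs its own validation.
    """
    problems = []
    ip_replicates = Counter()
    for row in csv.DictReader(handle):
        if row["type"] not in ALLOWED_TYPES:
            problems.append(
                f"{row['sample_name']}: type must be IP or Input, got {row['type']!r}"
            )
        elif row["type"] == "IP":
            ip_replicates[row["condition"]] += 1
    # IDR needs at least two IP replicates for each condition that has IP samples.
    for condition, count in sorted(ip_replicates.items()):
        if count < 2:
            problems.append(
                f"{condition}: only {count} IP replicate(s), IDR needs at least two"
            )
    return problems


example = """type,condition,replicat,sample_name
IP,EXP1,1,IP_EXP1_rep1
IP,EXP1,2,IP_EXP1_rep2
Input,EXP1,1,Input_EXP1
"""
print(check_design(io.StringIO(example)))
```

For a real file, pass an open handle instead, for example `check_design(open("design.csv", newline=""))`.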
2. Prepare a genome directory named after the genome, containing:
- `<name>.fa`: reference genome FASTA
- `<name>.gff` or `<name>.gff3`: gene annotation
Example:
ecoli_MG1655/
├── ecoli_MG1655.fa
└── ecoli_MG1655.gff
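The naming convention can be verified with a few lines of Python. This is a sketch under the assumption that the file prefix is exactly the directory name, as in the example layout; `find_genome_files` is a hypothetical helper, not a sequana_chipseq function.

```python
import tempfile
from pathlib import Path


def find_genome_files(genome_dir):
    """Return (fasta, annotation) paths, or raise if the layout is wrong."""
    genome_dir = Path(genome_dir)
    name = genome_dir.name  # files are expected to share the directory name
    fasta = genome_dir / f"{name}.fa"
    if not fasta.is_file():
        raise FileNotFoundError(f"missing {fasta}")
    for suffix in (".gff", ".gff3"):
        annotation = genome_dir / f"{name}{suffix}"
        if annotation.is_file():
            return fasta, annotation
    raise FileNotFoundError(f"missing {name}.gff or {name}.gff3 in {genome_dir}")


# Demo on a throwaway directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    genome = Path(tmp) / "ecoli_MG1655"
    genome.mkdir()
    (genome / "ecoli_MG1655.fa").write_text(">chr\nACGT\n")
    (genome / "ecoli_MG1655.gff").write_text("##gff-version 3\n")
    fasta, annotation = find_genome_files(genome)
    print(fasta.name, annotation.name)
```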
3. Set up the pipeline:
sequana_chipseq \
--input-directory DATAPATH \
--genome-directory /path/to/ecoli_MG1655 \
--design-file design.csv
4. Run the pipeline:
cd chipseq
bash chipseq.sh
sequana_chipseq --help
Key pipeline-specific options:
- `--genome-directory`: Path to the genome directory (must contain `<name>.fa` and `<name>.gff`).
- `--design-file`: CSV experimental design file (see Quick Start above).
- `--aligner-choice`: Aligner to use. Currently only `bowtie2` is supported.
- `--blacklist-file`: BED3 file of genomic regions to exclude from analysis (tab-separated: chromosome, start, end).
- `--genome-size`: Effective genome size for macs3 peak calling. Automatically computed from the FASTA file if not provided; override with a plain integer.
- `--do-fingerprints`: Enable `plotFingerprint` QC to assess ChIP enrichment quality.
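To pick an integer for `--genome-size` by hand, one option is to count bases in the FASTA file. The sketch below assumes the value of interest is the number of non-N bases; how the pipeline itself derives its default is not described here and may differ. `genome_size_from_fasta` is a hypothetical helper.

```python
import io


def genome_size_from_fasta(handle):
    """Count non-N bases across all sequences of a FASTA stream.

    Assumption: non-N length is used as a stand-in for the effective
    genome size; the pipeline's own computation may differ.
    """
    total = 0
    for line in handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += sum(1 for base in line if base not in "Nn")
    return total


fasta = io.StringIO(">chr1\nACGTNNAC\nGGTT\n>chr2\nacgn\n")
size = genome_size_from_fasta(fasta)
print(size)  # 13
```

The resulting integer could then be passed as `--genome-size 13` (with the real value for your genome).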
Run on a SLURM cluster:
cd chipseq
sbatch chipseq.sh
Or drive Snakemake directly:
snakemake -s chipseq.rules --cores 4 --stats stats.txt
Run every tool inside pre-built containers — no local tool installation needed:
sequana_chipseq \
--input-directory DATAPATH \
--genome-directory /path/to/genome \
--design-file design.csv \
--apptainer-prefix ~/.sequana/apptainers
Then run as usual:
cd chipseq
bash chipseq.sh
The following tools must be available (install via conda/bioconda):
mamba env create -f environment.yml
- bowtie2: read alignment
- fastp: adapter trimming and quality filtering
- fastqc: per-read quality control
- samtools: BAM sorting, indexing, and flagstat
- bedtools: bedGraph generation from BAM files (`genomeCoverageBed`)
- ucsc-bedgraphtobigwig: bedGraph to bigWig conversion (`bedGraphToBigWig`)
- deeptools: fingerprint QC (`plotFingerprint`) and multi-sample bigwig summary (`multiBigwigSummary`)
- macs3: narrow and broad peak calling
- homer: peak annotation (`annotatePeaks.pl`)
- idr: Irreproducible Discovery Rate between replicates (installed from the sequana/idr fork via pip; the upstream bioconda package is Python 3.10-only)
- multiqc: aggregated QC report
- Trimming: fastp removes low-quality reads and adapters.
- QC: FastQC on raw and cleaned reads.
- Alignment: bowtie2 maps reads to the reference genome.
- [Optional] Mark duplicates: Picard marks PCR duplicates.
- [Optional] Blacklist removal: bedtools removes artefact-prone regions.
- bigwig: per-sample coverage tracks for genome browsers (bedtools `genomeCoverageBed` → UCSC `bedGraphToBigWig`); an IGV session file (`igv.xml`) is generated to preload all tracks.
- [Optional] Fingerprints: `plotFingerprint` QC to assess ChIP enrichment.
- Phantom peak: strand cross-correlation analysis (NSC, RSC, Qtag scores).
- Peak calling: macs3 detects narrow and broad peaks for each IP vs Input pair.
- FRiP: Fraction of Reads in Peaks per sample and comparison.
- IDR: Irreproducible Discovery Rate on true replicates, pseudo-replicates, and self-pseudo-replicates.
- Annotation: homer annotates peaks relative to genomic features.
- MultiQC: aggregated QC across all samples.
- HTML report: summary with phantom peaks, FRiP plots, IDR tables, and annotation plots.
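As an illustration of the FRiP step, the metric is the number of reads overlapping at least one peak divided by the total number of reads. The sketch below is a toy version on in-memory intervals; it assumes half-open BED-style coordinates and counts any overlap of 1 bp or more, which may not match the exact criteria used by the pipeline. `frip` is a hypothetical function name.

```python
def frip(reads, peaks):
    """Fraction of reads overlapping at least one peak.

    reads, peaks: iterables of (chrom, start, end), half-open intervals.
    """
    peaks_by_chrom = {}
    for chrom, start, end in peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))

    reads = list(reads)
    if not reads:
        return 0.0

    in_peaks = 0
    for chrom, start, end in reads:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < p_end and p_start < end
               for p_start, p_end in peaks_by_chrom.get(chrom, [])):
            in_peaks += 1
    return in_peaks / len(reads)


peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
reads = [
    ("chr1", 90, 140),   # overlaps the first peak
    ("chr1", 200, 250),  # starts exactly where the first peak ends: no overlap
    ("chr1", 550, 600),  # inside the second peak
    ("chr2", 100, 150),  # no peaks on chr2
]
print(frip(reads, peaks))  # 0.5
```

The linear scan over peaks is fine for a toy example; on real data an interval index or a dedicated tool would be more appropriate.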
Here is the latest documented configuration file. Key sections:
- `general`: aligner choice and genome directory path
- `fastp`: trimming options (length, quality, adapters)
- `fastqc`: FastQC options and threads
- `bowtie2_mapping` / `bowtie2_index`: mapping options, threads, memory
- `macs3`: peak calling parameters (genome size, bandwidth, q-value, broad cutoff)
- `idr`: IDR thresholds, rank metric, number of pseudo-replicates
- `fingerprints`: enable/disable and number of bins
- `mark_duplicates`: enable/disable PCR duplicate marking
- `remove_blacklist`: enable/disable and path to BED blacklist
- `trimming`: enable/disable read trimming and choice of trimming tool
- `phantom`: use SPP (`use_spp: true`) instead of the built-in sequana phantom-peak detection
- `igv`: enable/disable generation of the IGV session file (`igv.xml`)
- `multiqc`: MultiQC options
| Version | Description |
|---|---|
| 0.13.0 | |
| 0.12.0 | |
| 0.11.0 | |
| 0.10.0 | |
| 0.9.1 | |
| 0.9.0 | |
| 0.8.0 | First release. |
To contribute to this project, please take a look at the Contributing Guidelines first. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

