Welcome to the OmniSplice README.
OmniSplice is a tool that categorizes all reads present at exon extremities. For more information, see the paper: https://www.biorxiv.org/content/10.1101/2025.04.06.647416v1
This file is organised as follows:
- Quick uses
- Installation
- Input Files
- Command details
- Output Files
- Context and discussion
- Algorithm overview
Don't hesitate to open an issue if you run into a problem or have a suggestion.
A read is counted once for every feature it overlaps, even when several features share the same position (for example, different transcripts of the same gene).
omni_splice --gtf <gtf file> --input <indexBamFile> -o <outFilePrefix> --anchor 3
# specify the library type for unstranded data
omni_splice --gtf <gtf file> --input <indexBamFile> -o <outFilePrefix> --libtype Unstranded
# extend the overhang to 5 bp
omni_splice --gtf <gtf file> --input <indexBamFile> -o <outFilePrefix> --overhang 5
# for backsplicing you need to first extract the clipped reads
omni_splice --gtf <gtf file> --input <indexBamFile> -o <outFilePrefix> --read-to-write soft-clipped
# then you need Bowtie 2 on your PATH, plus a Bowtie 2 reference for your genome
backsplicing -i <omnisplice_out.clipped> -o <outputPrefix> -b <bowtie2 ref> -g <gtf> -m <min clipped size, default 20>
# for comparison
compare_conditions run-all --control-files control1.junction control2.junction --treatment-files treatment1.junction treatment2.junction --min-read 10 --outfile-prefix <out file>
# choose which categories count as failed splicing and which as successful splicing
compare_conditions run --control-files control1.junction control2.junction --treatment-files treatment1.junction treatment2.junction --min-read 10 --outfile-prefix <out file> --splicing-fail Unspliced Clipped Exon_other --splicing-ok Spliced
# quick plot of genes (barplot as in manuscript)
coming out soon!
python gene_visualisation.py --group1 control_S1.junctions control_S2.junctions --group1_name Control --group2 mutant_S1.junctions mutant_S2.junctions --group2_name mutant --gene_list gene_id1 gene_id2 --outfile_prefix outname
To use OmniSplice you need access to a Linux machine (macOS has never been tested). You need Rust; to install it, go to https://www.rust-lang.org/tools/install. You also need CMake (on macOS: brew install cmake).
# Clone this repository using git
> git clone <####>
> cd omnisplice
> cargo build --release
# this is a compilation command; it prints a lot of text, including some warnings,
# and should finish with: "Finished `release` profile [optimized] target(s) "
The OmniSplice executables will then be in /omnisplice/target/release/ (for example /omnisplice/target/release/omni_splice). To have them on your PATH, add the release directory in your ~/.bashrc, for example: export PATH=${PATH}:"/omnisplice/target/release/"
If you encounter an error, have a question, or want to propose an improvement, let us know by opening a new issue on this GitHub repository.
To check whether Rust is installed, type "cargo" on the command line. If the terminal returns a line with "command not found", you need to install it.
It is very easy, just follow this link:
https://www.rust-lang.org/tools/install
If Rust is already installed, you may need to update it:
rustup update
- gtf: The GTF must have a valid "gene_id" and "transcript_id" for every feature annotated as "exon".
OmniSplice looks at every exon in this file. To limit the search to a subset of genes or exons,
only include the ones you are interested in (this is particularly helpful with the "--read-to-write" option, as it limits the size of the output).
- input: a valid, position-sorted, indexed **bam file**
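Subsetting the GTF can be done with any tool you like. As a minimal sketch (not part of OmniSplice; the function name `subset_gtf` is made up for this example), the following keeps only the lines whose `gene_id` attribute is in a list of genes of interest, assuming a standard tab-separated GTF with the attributes in the ninth column:

```python
import re

# Matches the gene_id attribute in the 9th GTF column, e.g. gene_id "FBgn0000001";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def subset_gtf(lines, wanted_gene_ids):
    """Yield the GTF lines whose gene_id is in wanted_gene_ids.

    Comment lines (starting with '#') are kept so the header survives.
    """
    wanted = set(wanted_gene_ids)
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record, skip it
        match = GENE_ID.search(fields[8])
        if match and match.group(1) in wanted:
            yield line
```

For example, `open("subset.gtf", "w").writelines(subset_gtf(open("annotation.gtf"), ["gene_id1", "gene_id2"]))` writes a reduced annotation that can then be passed to `--gtf`.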
required:
--input -i <sortedIndexedBamFile>
--output-file-prefix -o <outFilePrefix>
--gtf -g <gtfFile>
optional:
--mapq <minimum MAPQ, default 13>
--libtype <RNA-seq library type, default frFirstStrand>
--overhang <default 1>
--flag-in <BAM flag a read must have, default 0>
--flag-out <BAM flag a read must not have, default 3840>
--read-to-write <used to output reads of a specific category>
--unspliced-def <used for splicing efficiency, default unspliced>
--spliced-def <used for splicing efficiency, default spliced>
❯ omni_splice -h
Usage:
omni_splice [OPTIONS] --input <INPUT> --output-file-prefix <OUTPUT_FILE_PREFIX> --gtf <GTF>
Options:
-i, --input <INPUT>
Name of Input file
-o, --output-file-prefix <OUTPUT_FILE_PREFIX>
Prefix name to be used for Output file
-g, --gtf <GTF>
Name of GTF Input file define the feature to look at (v1) only consider
feature annotated as exon if you use output_write_read with the whole genome
the output can be very large, you may want to subset genes / features you are
interested in
--overhang <OVERHANG>
size of overhang [default: 1]
--output-write-read <OUTPUT_WRITE_READ>
path to a file (must not exist) Uses if you want to output the reads with their category.
by default output all reads, this behaviour can be change using flags: clipped...
--flag-in <FLAG_IN>
[default: 0]
--flag-out <FLAG_OUT>
[default: 3840]
--mapq <MAPQ>
[default: 13]
--read-to-write <READ_TO_WRITE>...
space separated list of the annotated read you want to extract; i.e.
all clipped read or all spliced read ... [possible values: read-through,
read-junction, unexpected, fail-pos-filter, wrong-strand, fail-qc,
empty-pileup, skipped, soft-clipped, overhang-fail, empty, all]
--unspliced-def <UNSPLICED_DEF>...
space separated list the column to use for "unspliced" for the splicing defect table.
you can regenrate this using the splicing_efficiency exe What to consider as unspliced?
unspliced: 10, clipped: 11, exon_intron: 12, exon_other: 13, skipped: 14, wrong_strand:15,
isoform:16
by default only use "-u 10" -> unspliced (readthrough) reads \n to use unspliced and clipped : "-u 10 11" [default: 10]
--spliced-def <SPLICED_DEF>...
What to consider as unspliced? unspliced: 10, clipped: 11, exon_intron: 12, exon_other: 13, skipped: 14,
wrong_strand:15, isoform: 16
by default only use "-u 9" -> spliced (readthrough) reads \n to use spliced and isform : "-u 9 16" [default: 9]
--libtype <LIBTYPE>
Librairy types used for the RNAseq most modern stranded RNAseq are frFirstStrand which is the default value.
acceptable value: frFirstStrand, frSecondStrand, fFirstStrand, fSecondStrand, ffFirstStrand, ffSecondStrand,
rfFirstStrand, rfSecondStrand, rFirstStrand, rSecondStrand, Unstranded, PairedUnstranded [default: frFirstStrand]
-h, --help
Print help
-V, --version
Print version
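The default `--flag-out` value of 3840 is easier to read once decoded into its SAM flag bits. The sketch below uses the bit definitions from the SAM specification. The filter function assumes the flags are applied the way samtools applies `-f` and `-F` (every `flag_in` bit required, any `flag_out` bit rejects the read); that behaviour is an assumption, it is not stated in the help text above.

```python
# SAM flag bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def passes_flag_filter(flag, flag_in=0, flag_out=3840):
    """Assumed samtools-like semantics: all flag_in bits set, no flag_out bit set."""
    return (flag & flag_in) == flag_in and (flag & flag_out) == 0


print(decode_flag(3840))
# ['secondary', 'qc_fail', 'duplicate', 'supplementary']
```

In other words, with the defaults a read is dropped if it is a secondary or supplementary alignment, a duplicate, or failed QC.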
By default, OmniSplice expects paired-end reads from a reverse-stranded library (frFirstStrand).
The libtype argument can be confusing, but it describes your library layout and, where relevant, helps determine the strandedness of your reads.
- fr implies that your paired reads are oriented towards each other (most modern layouts use this)
  --> <--
- ff
  --> -->
- rf
  <-- -->
- FirstStrand implies that your sequencing reads are reverse complemented compared to the reference genome
- SecondStrand implies that your sequencing reads are NOT reverse complemented compared to the reference genome
- Unstranded (for single-end layout)
- PairedUnstranded (for paired-end layout)
OmniSplice uses the flag from the alignment plus the layout information (libtype argument) to determine the strandedness of a read. It does so using a library I wrote; you can check the code here: https://github.com/rLannes/BAMstrandSpecifier
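To make the idea concrete, here is a small sketch of how the transcript strand can be inferred from a SAM flag for the two `fr` stranded layouts. It is not the BAMstrandSpecifier code; it follows the usual convention that in a frFirstStrand library read 1 aligns to the strand opposite the transcript, and in a frSecondStrand library read 1 aligns to the same strand. The function name is made up for this example, and the other libtype values are deliberately not handled.

```python
REVERSE = 0x10         # SAM flag bit: read aligned to the reverse strand
SECOND_IN_PAIR = 0x80  # SAM flag bit: read is the second mate of the pair


def transcript_strand(flag, libtype="frFirstStrand"):
    """Infer the transcript strand ('+' or '-') of a paired-end read.

    Illustrative only: handles frFirstStrand and frSecondStrand.
    """
    if libtype not in ("frFirstStrand", "frSecondStrand"):
        raise ValueError(f"libtype not handled in this sketch: {libtype}")
    read_is_reverse = bool(flag & REVERSE)
    is_read2 = bool(flag & SECOND_IN_PAIR)
    # In an fr pair the two mates point towards each other, so read 2 sits on
    # the opposite strand from read 1. Flip it to reason in read-1 terms.
    read1_is_reverse = read_is_reverse != is_read2
    if libtype == "frSecondStrand":
        # read 1 has the same orientation as the transcript
        return "-" if read1_is_reverse else "+"
    # frFirstStrand: read 1 is the reverse complement of the transcript
    return "+" if read1_is_reverse else "-"


print(transcript_strand(83))   # read 1, reverse strand -> '+'
print(transcript_strand(163))  # read 2, forward strand -> '+'
```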
OmniSplice does not consume many resources. Execution time depends on the size of the BAM file and on the number of chromosomes in the GTF. Memory usage depends on the number of "exon" features in the GTF.
category file (.tsv file):
Chromosome | Position of the feature | gene ID | transcript ID | Strand | ExonType | ReadCategory | Read Support
For each feature there are as many lines as there are ReadCategory values found for that feature.
With:
ExonType: [Donnor | Acceptor] indicates whether the exon extremity is a junction donor or acceptor.
Note: the start of the gene is Acceptor, the end of the gene is Donnor.
ReadCategory:
* ReadJunction(n, m) -> junction read from n to m
* Unspliced -> the read is unspliced
* Skipped(n, m) -> the read has a junction from n to m that spans the feature.
* SoftClipped -> the read is clipped at the junction.
* Empty -> no read found at this junction
* WrongStrand -> the read comes from a fragment whose strand differs from that of the feature
* FailPosFilter -> the position filter the algorithm uses to reduce the number of comparisons can let through some false positives, which are reported here. You can discard them; they are not biologically meaningful.
* OverhangFail -> the read fails the overhang test.
* Unexpected -> if you see this, please open a bug report.
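As an illustration of how the category file can be post-processed, the sketch below sums the read support per category for each exon extremity, merging all `ReadJunction(n, m)` and `Skipped(n, m)` lines under their base name. It assumes the file is tab-separated with exactly the eight columns listed above; lines that do not fit (for example a header, if there is one) are skipped. The function and column names are made up for this example.

```python
from collections import Counter, defaultdict

# Column order as described for the category file.
COLUMNS = [
    "chromosome", "position", "gene_id", "transcript_id",
    "strand", "exon_type", "read_category", "read_support",
]


def tally_categories(lines):
    """Sum read support per base category for each exon extremity."""
    totals = defaultdict(Counter)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS) or not fields[-1].isdigit():
            continue  # header or malformed line
        row = dict(zip(COLUMNS, fields))
        # "ReadJunction(100, 201)" -> "ReadJunction"
        base_category = row["read_category"].split("(", 1)[0]
        key = (row["gene_id"], row["transcript_id"], row["position"], row["exon_type"])
        totals[key][base_category] += int(row["read_support"])
    return totals
```

Calling `tally_categories(open("sample.tsv"))` (file name is a placeholder) returns a dictionary keyed by (gene ID, transcript ID, position, ExonType).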
Both the exon file and the junction file describe events, but the exon file gives the detail per exon end, whereas the junction file gives the results per junction.
the exon file (.tsv file):
contig | gene_name | transcript_name | exon_number | ambiguous | strand | pos | next | exon_type | spliced | unspliced | clipped | exon_other | skipped | wrong_strand | e_isoform
the junction file (.tsv file):
contig | gene_name | transcript_name | intron_number | strand | ambiguous | Donnor | Acceptor | spliced | unspliced | clipped | exon_other | skipped | wrong_strand | e_isoform
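The junction file lends itself to simple per-intron summaries. The sketch below computes, for each junction, the fraction of reads counted as unspliced among the reads counted as spliced or unspliced. It is only an illustration: it assumes the file is tab-separated and starts with a header row using the column names listed above, and the ratio is one plausible definition, not necessarily the one used by compare_conditions. The `min_reads` threshold simply mirrors the idea of `--min-read`.

```python
import csv


def unspliced_fraction(handle, unspliced=("unspliced",), spliced=("spliced",), min_reads=10):
    """Yield (gene, transcript, intron, fraction) for each junction with enough reads.

    `unspliced` and `spliced` list the columns to add up on each side,
    e.g. unspliced=("unspliced", "clipped").
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        failed = sum(int(row[column]) for column in unspliced)
        ok = sum(int(row[column]) for column in spliced)
        total = failed + ok
        if total < min_reads:
            continue  # too few reads to say anything
        yield (row["gene_name"], row["transcript_name"], row["intron_number"], failed / total)
```

For example, `list(unspliced_fraction(open("control1.junction")))` gives one tuple per junction supported by at least 10 reads.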
For each read, find all the exon extremities it overlaps.
for each read R:
    for each extremity E:
        Test if the read fails the position filter.
        Test if the strand matches. -> if False, WrongStrand
        Test if the read skipped the junction.
        Test if the read fails the overhang test.
        Test if the read is unspliced at the feature specifically.
        Test if the read is SoftClipped.
        Test if the read is a junction read.
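The order of the tests above can be illustrated with a toy classifier. This is not the OmniSplice implementation: only the order of the tests comes from the overview, while the definition of each test, the read representation (a dict with aligned blocks in 1-based inclusive coordinates), the "first matching test wins" rule and the function name are simplifications invented for this example. It only handles the donor end of an exon on the + strand.

```python
def classify_donor(read, exon_end, feature_strand="+", overhang=1):
    """Toy classification of one read at the donor end of a + strand exon.

    read: {"strand": "+" or "-",
           "blocks": sorted list of (start, end) aligned blocks,
           "clipped_right": True if the read is soft clipped after its last block}
    """
    blocks = read["blocks"]
    # 1. position filter: the read must span the exon end
    if blocks[0][0] > exon_end or blocks[-1][1] < exon_end:
        return "FailPosFilter"
    # 2. strand
    if read["strand"] != feature_strand:
        return "WrongStrand"
    # 3. skipped: the exon end falls inside a gap between two aligned blocks
    for (_, left_end), (right_start, _) in zip(blocks, blocks[1:]):
        if left_end < exon_end < right_start:
            return f"Skipped({left_end}, {right_start})"
    # From here on, exactly one block contains the exon end.
    index = next(i for i, (s, e) in enumerate(blocks) if s <= exon_end <= e)
    start, end = blocks[index]
    is_last = index == len(blocks) - 1
    exon_side = exon_end - start + 1  # aligned bases inside the exon
    past = end - exon_end             # aligned bases running into the intron
    # 4. overhang (toy definition): enough bases on both sides of the extremity
    stops_at_end = past == 0 and is_last and not read["clipped_right"]
    if exon_side < overhang or 0 < past < overhang or stops_at_end:
        return "OverhangFail"
    # 5. unspliced: the alignment continues into the intron
    if past > 0:
        return "Unspliced"
    # 6. soft clipped exactly at the exon end
    if is_last:
        return "SoftClipped"
    # 7. junction read: the next block starts after a gap
    return f"ReadJunction({exon_end}, {blocks[index + 1][0]})"


read = {"strand": "+", "blocks": [(90, 100), (201, 250)], "clipped_right": False}
print(classify_donor(read, exon_end=100))  # ReadJunction(100, 201)
```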