jibeomko/PanIsoGuard

PanIsoGuard

Caller-agnostic adjudication of long-read RNA-seq novel isoforms.

License: MIT AND BSL-1.0

PanIsoGuard is a post-processing / decision layer that ingests the novel isoform calls produced by long-read isoform callers (FLAIR, IsoQuant, Bambu, ESPRESSO, TALON, …) and/or SQANTI3 output, together with a BAM and optional evidence inputs (short-read SJ.tab, personalized haplotype FASTA), and re-classifies each novel call into a confidence class with a machine-readable mechanistic attribution and a provenance / circularity flag.

PanIsoGuard does not recompute SQANTI3 QC descriptors. It consumes them as priors and integrates them with independent path-level evidence from caller consensus, short-read junction support, long-read mapping, personalized haplotypes, and pangenome-supported junctions. It does not re-derive TSS/TTS, ORF/NMD, polyA, or splice motifs, and it is not a SQANTI3-style QC filter (see docs/relationship_to_sqanti3.md). Its contribution is the adjudication logic: how those orthogonal evidence axes are integrated into a transparent, auditable verdict. Thresholds live in a runtime config file, config/rules.default.toml, and each verdict carries a rule_trace (see docs/decision_engine.md).

PanIsoGuard is an independent project, not affiliated with or endorsed by the SQANTI3 authors. It interoperates with SQANTI3 by reading its output file only (no SQANTI3 code is bundled or linked; SQANTI3's GPL-3.0 does not reach PanIsoGuard's MIT code). If you use SQANTI3 in your pipeline, please cite it — see docs/relationship_to_sqanti3.md.

Equivalently, each candidate isoform can be viewed as a path through a gene-local splice graph — novel junctions are edges absent from the known (reference) graph — and the same deterministic verdict can be read in those terms. This is a framing of the existing engine (no graph model or new algorithm); each verdict additionally carries a machine-readable graph_trace (novel-edge count, edge support, graph distance). See docs/method_graph.md.
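
To make the graph framing concrete, here is a minimal Python sketch. It is not PanIsoGuard's implementation (the tool is C++). It assumes a junction is a (chrom, donor, acceptor, strand) tuple and that the reference graph is simply the set of annotated junctions. The function name novel_edges and all coordinates are illustrative.

```python
from typing import List, Set, Tuple

# Assumption: one splice-graph edge = one junction tuple.
Junction = Tuple[str, int, int, str]  # (chrom, donor, acceptor, strand)


def novel_edges(isoform_path: List[Junction],
                reference_edges: Set[Junction]) -> List[Junction]:
    """Return the junctions on an isoform path that are absent from the reference graph."""
    return [j for j in isoform_path if j not in reference_edges]


# Toy reference graph with two annotated junctions.
reference_edges = {
    ("chr22", 1000, 2000, "+"),
    ("chr22", 2100, 3000, "+"),
}

# Candidate isoform: first junction is known, second uses an unannotated acceptor.
candidate = [
    ("chr22", 1000, 2000, "+"),
    ("chr22", 2100, 3500, "+"),
]

novel = novel_edges(candidate, reference_edges)
print(len(novel), novel)
```

The novel-edge count is only one of the quantities the graph_trace is described as carrying. Edge support and graph distance are not modeled in this sketch.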

At a glance

PanIsoGuard overview: long-read novel isoform calls (real or artifact?) are checked against four kinds of evidence — SQANTI QC priors, short-read junctions, long-read mapping, and variants/reference bias — and sorted into plain-language verdicts: real novel, reference-bias rescued, uncertain (held), or artifact.

Consumes caller + SQANTI3 output (does not replace them); adds an orthogonal, auditable verdict layer, each call carrying a machine-readable rule_trace. The four output classes shown are a simplified grouping — the full set is listed below. Vector source: docs/figures/overview.svg.

Status: alpha. Four evidence axes (SQANTI priors, short-read junctions, BAM read-level mapping, variant/reference-bias) plus a file-based pangenome reference-bias tier are implemented and tested, along with the adjudicate / benchmark / ablate / combine subcommands.

Validation. The adjudication logic is validated against ground truth (SQANTI-SIM AUPRC 0.970 vs a 0.831 baseline, and well-calibrated — ECE/Brier ≤ 0.013 on the run truth sets; see docs/validation.md).
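
For readers unfamiliar with the two calibration metrics named above, this sketch shows the standard definitions of the Brier score and the expected calibration error (ECE) with equal-width bins. How PanIsoGuard maps its confidence classes to scores, and which binning it uses, is not stated here, so the scores, labels, and bin count below are assumptions for illustration only.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted score and 0/1 truth."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs: Sequence[float],
                               labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Weighted mean |accuracy - confidence| over equal-width score bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - confidence)
    return ece


# Toy data: two true isoforms scored 0.9, two artifacts scored 0.1.
probs = [0.9, 0.9, 0.1, 0.1]
labels = [1, 1, 0, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
print(round(brier, 3), round(ece, 3))
```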

Thresholds. A SQANTI-SIM (v49 chr22) threshold sweep finds AUPRC robust (0.969–0.970) across the grid with the shipped default within 1e-4 of grid-best — the conservative defaults are near-optimal there (benchmark/results/sqanti_sim/sweep.tsv), though not yet swept on additional datasets.

Pangenome. The file-based pangenome reference-bias rescue is validated on the real HPRC v1.1 chr22 graph (GATE-1): 0 false rescues on real FLAIR novel junctions, correct rescue on real population deletions, firewall holding under circular-risk provenance (benchmark/pangenome). Validated at chr22 scale; the in-process GBZ traversal remains future work. Treat the confidence classes as calibrated ordinal evidence integration, not a tuned probability.

When to use PanIsoGuard

Use it when you have novel long-read isoform calls and need to decide which to trust:

  • You ran more than one isoform caller (FLAIR / IsoQuant / Bambu / ESPRESSO / TALON, …) and have several disagreeing novel-isoform sets. PanIsoGuard integrates them caller-agnostically by splice chain and stratifies each novel call by cross-caller agreement: single-caller novels are mostly artifacts, whereas multi-caller agreement is a strong, matcher-robust confidence signal (benchmark/multicaller).
  • You want a transparent confidence class per novel call, not a flat GTF — each verdict is one of 7 classes with a machine-readable rule_trace (and an optional PDF report), so you can filter HIGH/MEDIUM_CONF_NOVEL and audit the rest instead of eyeballing reads.
  • You have a personalized haplotype or a pangenome and want to catch reference-bias false novelty — a junction that looks novel only because the sample differs from the linear reference. The rescue is a high-specificity guardrail (it never over-promotes; a circularity firewall blocks rescues that would rest on the sample's own RNA), most useful for non-reference / personalized-genome samples (benchmark/hg002).

It is not a caller or a QC re-implementation. It sits above the callers and consumes SQANTI3 QC as priors — it does not re-derive TSS/TTS, ORF/NMD, polyA, or splice motifs (docs/relationship_to_sqanti3.md), and its combine step is a clean re-implementation of gffcompare -i, not a new merge (docs/relationship_to_merge_tools.md).

Subcommands

| Command | Purpose |
| --- | --- |
| adjudicate | classify novel isoforms → confidence class + mechanistic attribution + provenance (3 output files) |
| benchmark | non-redundancy vs SQANTI3 on novel isoforms (2×2, Jaccard, McNemar) |
| ablate | per-evidence-axis class-change (which axis drives which calls) |
| combine | integrate multiple callers' isoforms by intron-chain fingerprint → caller-support matrix |
| version | version, linked htslib, compiled-in capabilities |

panisoguard <command> --help for options. Input formats: docs/input_formats.md.

Usage

adjudicate is the main entry point. The only hard requirements are a SQANTI3 classification and an output prefix — every evidence input below is optional and simply switches on another axis (see Evidence tiers). PanIsoGuard is caller- and organism-agnostic; the paths below are placeholders for your own caller output, reference, and reads (any long-read caller, any genome build).

Baseline — SQANTI priors only (no short/long-read evidence; novel calls are flagged or held, never positively confirmed):

panisoguard adjudicate \
  --classification classification.txt \
  --isoforms-bed   isoforms.bed \
  --ref-gtf        annotation.gtf \
  --out-prefix     out/sample

Recommended — add short-read junctions (--sj-tab) and the long-read BAM (--bam), the two axes that let a novel isoform be confirmed or rejected on evidence:

panisoguard adjudicate \
  --classification classification.txt \
  --isoforms-bed   isoforms.bed \
  --ref-gtf        annotation.gtf \
  --sj-tab         SJ.out.tab \
  --bam            aligned.bam \
  --reference      genome.fa \
  --out-prefix     out/sample
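
As background on what the short-read axis can draw from, the sketch below reads a STAR SJ.out.tab and looks up unique-read support for a junction. The column layout (chromosome, 1-based intron start and end, strand code, motif, annotation flag, unique reads, multi-mapped reads, max overhang) follows the STAR manual. The function name load_sj_tab and the min_unique threshold are hypothetical, and PanIsoGuard's actual reader and thresholds (config/rules.default.toml) may differ.

```python
import io
from typing import Dict, TextIO, Tuple

# STAR encodes strand in column 4 as 0 (undefined), 1 (+), 2 (-).
STRAND = {"0": ".", "1": "+", "2": "-"}

JunctionKey = Tuple[str, int, int, str]


def load_sj_tab(handle: TextIO) -> Dict[JunctionKey, int]:
    """Map (chrom, intron_start, intron_end, strand) to uniquely mapped read count."""
    support: Dict[JunctionKey, int] = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed or empty lines
        key = (fields[0], int(fields[1]), int(fields[2]), STRAND[fields[3]])
        support[key] = int(fields[6])  # column 7: uniquely mapping reads
    return support


# Two toy junctions; the second has multi-mapped reads only.
example = (
    "chr22\t1001\t2000\t1\t1\t1\t42\t3\t38\n"
    "chr22\t2101\t3500\t1\t1\t0\t0\t5\t12\n"
)
support = load_sj_tab(io.StringIO(example))

min_unique = 2  # hypothetical threshold, not the shipped default
confirmed = sorted(k for k, n in support.items() if n >= min_unique)
print(confirmed)
```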

Reference-bias rescue — add a personalized haplotype FASTA. Provenance gates the circularity firewall: wgs/external may promote to a rescue verdict, while rna_derived/unknown are held as AMBIGUOUS:

panisoguard adjudicate \
  --classification classification.txt \
  --isoforms-bed   isoforms.bed \
  --ref-gtf        annotation.gtf \
  --reference      genome.fa \
  --reference-haplotype haplotype1.fa \
  --reference-haplotype haplotype2.fa \
  --haplotype-provenance wgs \
  --out-prefix     out/sample

Pangenome reference-bias rescue (file-based; validated on HPRC v1.1 chr22) — supply graph-supported splice junctions (pre-extracted from a pangenome graph such as HPRC with vg/rpvg). An isoform whose novel junctions are all realizable on a graph haplotype path is rescued as reference bias. This is independent evidence only if the junction set comes from population assemblies, so it is gated by --pangenome-provenance (the same circularity firewall as the variant axis): population/external promote, while the default unknown (or sample_derived) is held AMBIGUOUS:

panisoguard adjudicate \
  --classification classification.txt \
  --isoforms-bed   isoforms.bed \
  --ref-gtf        annotation.gtf \
  --pangenome-junctions pangenome_junctions.tsv \
  --pangenome-provenance population \
  --out-prefix     out/sample
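
The rescue condition and its provenance gate can be sketched as follows. This is an illustration of the behaviour described above, not the shipped rule engine. It assumes each graph haplotype path is available in memory as a set of junction tuples, and it reads "all realizable on a graph haplotype path" as "all on one and the same path". The function name pangenome_verdict is hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple

Junction = Tuple[str, int, int, str]  # (chrom, donor, acceptor, strand)

# Provenance values described as allowed to promote; anything else is held.
PROMOTING = {"population", "external"}


def pangenome_verdict(novel_junctions: Iterable[Junction],
                      haplotype_paths: List[Set[Junction]],
                      provenance: str = "unknown") -> Optional[str]:
    """Return a rescue class, AMBIGUOUS (held by the firewall), or None if no rescue applies."""
    novel = set(novel_junctions)
    if not novel:
        return None  # nothing novel to explain
    # Assumption: every novel junction must lie on a single haplotype path.
    if not any(novel <= path for path in haplotype_paths):
        return None
    if provenance in PROMOTING:
        return "PAN_REF_RESCUED_FALSE_NOVEL"
    return "AMBIGUOUS"  # e.g. unknown or sample_derived


paths = [
    {("chr22", 1000, 2000, "+"), ("chr22", 2100, 3500, "+")},
    {("chr22", 1000, 2000, "+")},
]
novel = [("chr22", 2100, 3500, "+")]

print(pangenome_verdict(novel, paths, "population"))
print(pangenome_verdict(novel, paths))  # default provenance is unknown
```

Keeping the gate as the last step means the evidence check and the circularity check stay separate: the same junction evidence yields a rescue or a hold depending only on where the junction set came from.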

Combine several callers first (optional) — merge isoforms by intron-chain fingerprint into a caller-support matrix, then feed the union to adjudicate:

panisoguard combine \
  --gtf flair:flair.gtf \
  --gtf isoquant:isoquant.gtf \
  --gtf bambu:bambu.gtf \
  --ref-gtf annotation.gtf \
  --out caller_support_matrix.tsv

combine is a clean re-implementation of the established N-way intron-chain comparison (it reproduces gffcompare -i exactly; multi-caller consensus is shared practice, not a PanIsoGuard invention). Its value is feeding caller agreement into the adjudicator as one auditable evidence axis. PanIsoGuard's differentiator is the reference-bias rescue + circularity firewall — see docs/relationship_to_merge_tools.md.
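
The idea of matching isoforms by intron chain can be sketched in a few lines. This is a simplified illustration rather than the combine implementation: it uses the (chrom, strand, introns) tuple itself as the fingerprint, takes exons as already-parsed (start, end) pairs with arbitrary toy coordinates, and ignores mono-exonic transcripts, which have an empty chain. The names intron_chain and calls are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]
Chain = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def intron_chain(chrom: str, strand: str, exons: List[Exon]) -> Chain:
    """Derive the ordered introns between consecutive exons; transcript ends are ignored."""
    ordered = sorted(exons)
    introns = tuple(
        (ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)
    )
    return (chrom, strand, introns)


# One toy transcript per caller.
calls = {
    "flair":    [("chr22", "+", [(100, 200), (300, 400), (500, 600)])],
    "isoquant": [("chr22", "+", [(90, 200), (300, 400), (500, 650)])],   # same introns, different ends
    "bambu":    [("chr22", "+", [(100, 200), (350, 400), (500, 600)])],  # different first intron
}

# Caller-support matrix: fingerprint -> set of callers reporting that chain.
matrix: Dict[Chain, Set[str]] = defaultdict(set)
for caller, transcripts in calls.items():
    for chrom, strand, exons in transcripts:
        matrix[intron_chain(chrom, strand, exons)].add(caller)

for chain, callers in matrix.items():
    print(chain, sorted(callers))
```

Because only introns enter the key, the FLAIR and IsoQuant transcripts collapse to one row supported by two callers, while the Bambu transcript stays a single-caller call.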

Outputs

adjudicate writes three files at <out-prefix>:

| File | Contents |
| --- | --- |
| <prefix>.adjudicated.tsv | one row per isoform: confidence class, primary mechanism, novel-junction support counts |
| <prefix>.attribution.jsonl | per-isoform rule_trace (every rule that fired, in order) + graph_trace (splice-graph view; see docs/method_graph.md) + bio_flags (SQANTI3 QC descriptors passed through, verdict-neutral; see docs/relationship_to_sqanti3.md) |
| <prefix>.provenance.log | which axes were active + circularity status of the run |

Benchmarking & ablation

# non-redundancy vs the SQANTI3 filter on novel isoforms (2x2, Jaccard, McNemar)
panisoguard benchmark [adjudicate options] --out bench/

# per-axis contribution: which calls change when an axis is removed
panisoguard ablate    [adjudicate options] --axes short_read,mapping,variant --out abl/
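
The three statistics named for the benchmark subcommand have standard definitions, sketched here on two keep/drop decisions over the same isoforms. Whether PanIsoGuard applies a continuity correction or an exact test for McNemar is not stated, so this sketch uses the uncorrected chi-square form as an assumption. The input vectors are toy data.

```python
import math
from typing import Sequence, Tuple


def two_by_two(a_keep: Sequence[bool], b_keep: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Count (both keep, only A keeps, only B keeps, neither keeps)."""
    pairs = list(zip(a_keep, b_keep))
    both = sum(1 for a, b in pairs if a and b)
    a_only = sum(1 for a, b in pairs if a and not b)
    b_only = sum(1 for a, b in pairs if b and not a)
    neither = sum(1 for a, b in pairs if not a and not b)
    return both, a_only, b_only, neither


def jaccard(both: int, a_only: int, b_only: int) -> float:
    """Overlap of the two kept sets: |A and B| / |A or B|."""
    union = both + a_only + b_only
    return both / union if union else 1.0


def mcnemar(a_only: int, b_only: int) -> Tuple[float, float]:
    """Uncorrected McNemar chi-square (1 df) on the discordant counts, with its p-value."""
    n = a_only + b_only
    if n == 0:
        return 0.0, 1.0
    chi2 = (a_only - b_only) ** 2 / n
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, 1 df
    return chi2, p_value


# Toy decisions for eight novel isoforms (1 = kept).
adjudicator_keep = [1, 1, 1, 0, 0, 1, 0, 1]
filter_keep      = [1, 0, 1, 0, 1, 1, 0, 0]

both, a_only, b_only, neither = two_by_two(adjudicator_keep, filter_keep)
j = jaccard(both, a_only, b_only)
chi2, p_value = mcnemar(a_only, b_only)
print((both, a_only, b_only, neither), j, round(chi2, 3))
```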

Quick example

A tiny text-only dataset with checked-in expected outputs is available under examples/tiny/. It exercises one known isoform, one short-read-supported novel isoform, and one unsupported artifact call.

cmake --build build -j
cd examples/tiny
./run.sh

The script writes output/sample.{adjudicated.tsv,attribution.jsonl,provenance.log} and compares them against examples/tiny/expected/.
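
Once an adjudicated.tsv exists, downstream filtering by confidence class is a few lines of Python. The column names isoform_id and confidence_class used here are hypothetical placeholders, as are the isoform IDs; check the header of your own output (and docs/function_io.md) for the real names before adapting this.

```python
import csv
import io

# Stand-in for the contents of <prefix>.adjudicated.tsv; header names are assumed.
example_tsv = (
    "isoform_id\tconfidence_class\n"
    "PB.1.1\tHIGH_CONF_NOVEL\n"
    "PB.2.1\tARTIFACT\n"
    "PB.3.1\tMEDIUM_CONF_NOVEL\n"
    "PB.4.1\tAMBIGUOUS\n"
)

KEEP = {"HIGH_CONF_NOVEL", "MEDIUM_CONF_NOVEL"}

rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))

# Isoforms to carry forward, and the held calls worth a manual audit.
kept = [r["isoform_id"] for r in rows if r["confidence_class"] in KEEP]
to_audit = [r["isoform_id"] for r in rows if r["confidence_class"] == "AMBIGUOUS"]

print(kept, to_audit)
```

For a real run, replace io.StringIO(example_tsv) with an open file handle on the TSV.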

Confidence classes

HIGH_CONF_KNOWN · HIGH_CONF_NOVEL · MEDIUM_CONF_NOVEL · LOW_CONF_PARTIAL · PAN_REF_RESCUED_FALSE_NOVEL · AMBIGUOUS · ARTIFACT — emitted as a deterministic projection of a 2-axis evidence grid (novelty-support × artifact-mechanism). PAN_REF_RESCUED_FALSE_NOVEL is reached when a novel junction is explained by reference bias — either a personalized haplotype (variant axis, --reference-haplotype) or a pangenome graph path (pangenome axis, --pangenome-junctions).
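
"Deterministic projection of a 2-axis grid" means the class is a pure function of the cell an isoform lands in. The sketch below illustrates that idea only. The axis levels and the cell-to-class mapping are invented for illustration and cover just five of the seven classes; the shipped grid and rescue precedence are documented in docs/decision_engine.md.

```python
from itertools import product

# Hypothetical axis levels; the real axes are defined by the decision engine.
SUPPORT_LEVELS = ("none", "moderate", "strong")      # novelty-support axis
MECHANISMS = ("none", "reference_bias", "artifact")  # artifact-mechanism axis


def project(support: str, mechanism: str) -> str:
    """Map one grid cell to a class; no randomness, so equal inputs give equal outputs."""
    if mechanism == "artifact":
        return "ARTIFACT"
    if mechanism == "reference_bias":
        return "PAN_REF_RESCUED_FALSE_NOVEL"
    by_support = {
        "strong": "HIGH_CONF_NOVEL",
        "moderate": "MEDIUM_CONF_NOVEL",
        "none": "AMBIGUOUS",
    }
    return by_support[support]


# Enumerate the full grid once; the result is an auditable lookup table.
GRID = {cell: project(*cell) for cell in product(SUPPORT_LEVELS, MECHANISMS)}

for cell, verdict in sorted(GRID.items()):
    print(cell, "->", verdict)
```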

Evidence tiers

| Tier | Input | Required? | Mechanism |
| --- | --- | --- | --- |
| 0 | SQANTI3 classification (priors; required) + STAR SJ.tab | SJ.tab recommended | short-read junction corroboration |
| 1 | BAM (HiFi/ONT) | recommended | read-level mapping (low-MAPQ / supplementary spanning-read fractions; soft-clip / indel-near also reported) |
| 2 | personalized haplotype FASTA (--reference-haplotype) | optional | variant-created/destroyed splice-site motif |
| 3 | pangenome graph junctions (--pangenome-junctions) | optional | all novel junctions realizable on a graph haplotype path → reference bias (provenance-gated; validated on HPRC v1.1 chr22) |
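
The tier 1 read-level features can be illustrated without a BAM library. This sketch assumes the reads spanning one junction have already been extracted as (MAPQ, FLAG) pairs. The supplementary bit 0x800 comes from the SAM specification; the low_mapq cutoff of 10 and the function name spanning_read_fractions are hypothetical, not the shipped defaults.

```python
from typing import Dict, List, Tuple

SUPPLEMENTARY = 0x800  # SAM FLAG bit for a supplementary alignment


def spanning_read_fractions(reads: List[Tuple[int, int]],
                            low_mapq: int = 10) -> Dict[str, float]:
    """Summarize (mapq, flag) pairs for the reads spanning one junction."""
    n = len(reads)
    if n == 0:
        return {"n": 0, "low_mapq_frac": 0.0, "supplementary_frac": 0.0}
    low = sum(1 for mapq, _ in reads if mapq < low_mapq)
    supp = sum(1 for _, flag in reads if flag & SUPPLEMENTARY)
    return {"n": n, "low_mapq_frac": low / n, "supplementary_frac": supp / n}


# Five toy spanning reads: two poorly mapped, two supplementary.
reads = [(60, 0), (60, 16), (3, 0), (0, 2048), (60, 2064)]
features = spanning_read_fractions(reads)
print(features)
```

A junction supported mostly by low-MAPQ or supplementary alignments is the kind of pattern this axis is described as reporting; how those fractions are thresholded is left to the rule configuration.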

Documentation

| Doc | Contents |
| --- | --- |
| docs/architecture.md | The five layers (CLI → readers → core data model → evidence → decision) and data flow. |
| docs/algorithm.md | The per-isoform adjudication algorithm and the decision projection. |
| docs/function_io.md | Module-by-module input → output contracts and key data types. |
| docs/decision_engine.md | The 2-axis grid, rescue precedence, and the circularity firewall. |
| docs/method_graph.md | The splice-graph framing: isoform = path, novel junction = edge, novelty = graph distance, and the graph_trace. |
| docs/relationship_to_sqanti3.md | What PanIsoGuard consumes from SQANTI3 vs does not recompute; the bio_flags pass-through. |
| docs/relationship_to_merge_tools.md | How combine relates to gffcompare / TAMA / Bambu-NDR, and where PanIsoGuard is actually differentiated. |
| docs/input_formats.md | Every input file format and its options. |
| docs/validation.md | What is verified and the truth-based validation plan. |
| CHANGELOG.md · docs/releasing.md | Changelog, and the release / bioconda runbook. |

Repository layout

PanIsoGuard/
|-- README.md                       # quick-start, repository map, and architecture map
|-- CMakeLists.txt                  # C++17/CMake build, install target, test wiring
|-- cmake/FindHTSlib.cmake          # htslib discovery for source and conda builds
|-- config/rules.default.toml        # default thresholds and rule gates
|-- include/panisoguard/             # typed module interfaces
|   |-- types.hpp                    # Junction, IntronChain, Transcript, Evidence primitives
|   |-- adjudicator.hpp              # per-isoform evidence collection and decision API
|   |-- rules.hpp, verdict.hpp       # rule configuration, classes, mechanisms, rule traces
|   |-- consensus.hpp, fingerprint.hpp, interval_index.hpp
|   |-- gtf.hpp, bed12.hpp, sqanti.hpp, sj_tab.hpp, pangenome.hpp
|   `-- bam_features.hpp, variant_motif.hpp, result_writer.hpp
|-- src/
|   |-- cli/                         # subcommands: adjudicate, combine, benchmark, ablate
|   |-- io/                          # SQANTI/GTF/BED12/SJ/pangenome readers + result writer
|   |-- evidence/                    # htslib-backed BAM and FASTA/faidx evidence axes
|   `-- core/                        # consensus, catalog, rule engine, verdict projection
|-- tests/
|   |-- unit/                        # Catch2 tests, module by module
|   `-- data/tiny/                   # minimal BAM/GTF/BED/SQANTI/SJ/FASTA fixtures
|-- docs/                            # architecture, algorithm, input contracts, validation plan
|-- docs/figures/                    # overview figure source and rendered README image
|-- workflow/                        # optional Snakemake orchestration around PanIsoGuard
|-- benchmark/                       # synthetic axes, truth sets, SIRV, HG002, calibration notes
|-- examples/tiny/                   # 5-minute dataset with expected adjudicate outputs
|-- recipes/bioconda/                # Bioconda meta.yaml and build.sh
|-- thirdparty/                      # vendored single-header components and licenses
|-- LICENSE
`-- THIRDPARTY.txt

Architecture map

%%{init: {"theme": "base", "themeCSS": "svg { background: #ffffff; }", "themeVariables": {"background": "#ffffff", "mainBkg": "#ffffff", "fontSize": "17px", "fontFamily": "Arial, sans-serif", "primaryTextColor": "#111827", "lineColor": "#334155", "arrowheadColor": "#334155"}, "flowchart": {"htmlLabels": true, "curve": "linear", "nodeSpacing": 24, "rankSpacing": 32}}}%%
flowchart LR
  CLI["<b>CLI</b><br/>adjudicate | combine<br/>benchmark | ablate"]
  Iso["<b>Isoform inputs</b><br/>GTF/BED12<br/>SQANTI3 classification"]
  Context["<b>Evidence context</b><br/>reference GTF | SJ.tab<br/>BAM/CRAM | FASTA | graph TSV"]
  Rules["<b>Rule gates</b><br/>rules.default.toml"]

  Normalize["<b>1. Normalize</b><br/>src/io readers<br/>typed transcript + junction models"]
  Consensus["<b>2. Merge callers</b><br/>intron-chain fingerprints<br/>caller support matrix"]
  Evidence["<b>3. Build evidence</b><br/>SQANTI3 QC<br/>short‑read SJ<br/>Long Read BAM mapping<br/>Variant/Haplotype<br/>Pangenome"]
  Decide["<b>4. Decide</b><br/>EvidenceVector to RuleEngine<br/>class + mechanism + trace"]
  Outputs["<b>Outputs</b><br/>*.adjudicated.tsv<br/>*.attribution.jsonl | *.provenance.log<br/>caller_support_matrix.tsv"]

  CLI --> Normalize
  Iso --> Normalize
  Normalize --> Consensus
  Consensus --> Evidence
  Evidence --> Decide
  Decide --> Outputs
  Context --> Evidence
  Rules --> Decide
  Consensus -.-> Outputs

  classDef command fill:#fff7e6,stroke:#b7791f,stroke-width:1.8px,color:#3a2500,font-size:17px;
  classDef input fill:#edf6ff,stroke:#2f6fa8,stroke-width:1.8px,color:#0f2438,font-size:17px;
  classDef process fill:#eefaf1,stroke:#2f855a,stroke-width:1.8px,color:#102a16,font-size:17px;
  classDef evidence fill:#f5f0ff,stroke:#6b46c1,stroke-width:2px,color:#241447,font-size:17px;
  classDef decision fill:#fff1f1,stroke:#c53030,stroke-width:2.2px,color:#3b0d0d,font-size:17px;
  classDef output fill:#edfafa,stroke:#2c7a7b,stroke-width:1.8px,color:#0f2f2f,font-size:17px;

  class CLI command;
  class Iso,Context,Rules input;
  class Normalize,Consensus process;
  class Evidence evidence;
  class Decide decision;
  class Outputs output;
  linkStyle default stroke:#334155,stroke-width:3.5px;

Build

Requires a C++17 compiler, CMake ≥ 3.20, and htslib ≥ 1.18.

git clone <repo> && cd PanIsoGuard
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build            # unit suite
./build/panisoguard --version

htslib is discovered from $CONDA_PREFIX; override with -DCMAKE_PREFIX_PATH=/prefix.

Install (conda)

A bioconda recipe is provided under recipes/bioconda/. Once released:

conda install -c bioconda -c conda-forge panisoguard

Container

Build a self-contained image (the C++ binary + the optional PDF report tool) — works today without waiting on the conda release:

docker build -t panisoguard .
docker run --rm panisoguard panisoguard --version
docker run --rm -v "$PWD":/data -w /data panisoguard \
    adjudicate --classification cls.txt --isoforms-gtf iso.gtf --ref-gtf ref.gtf --out-prefix run

Try it (no downloads)

examples/multi_caller/run.sh    # combine 3 callers -> consensus verdict, on tiny fixtures (<1 s)
examples/tiny/run.sh            # single-caller adjudication

The optional PDF report (SQANTI3-style) is a Python companion — pip install ./python, then panisoguard-report --prefix run (see python/).

Runtime & memory

Single-threaded, on a whole-genome isoform set (GRCh38 + GENCODE v49):

| Step | Wall | Peak RAM |
| --- | --- | --- |
| reference catalog (GENCODE v49) | ~3 s | ~0.34 GB |
| adjudicate (SQANTI priors + short-read SJ) | ~5 s | ~0.55 GB |
| + variant axis (faidx motif) | ~40 s | ~0.6 GB |
| + BAM mapping axis | ~1.6 min | ~0.6 GB |

The upstream callers + alignment dominate end-to-end time; PanIsoGuard's own adjudication is the fast tail.

Orchestration (optional)

workflow/ provides a Snakemake pipeline that runs the upstream callers in parallel over a single shared alignment and pipes into PanIsoGuard (combine + adjudicate). It is a thin convenience wrapper, not the core tool.

Validation

See docs/validation.md for what is verified and the truth-based validation plan (SQANTI-SIM, HG002/HPRC, LRGASP).

License

PanIsoGuard is MIT-licensed — see LICENSE.

It bundles three third-party single-header components under thirdparty/, with their license texts included: cgranges (IITree.h, MIT), toml++ (MIT), and Catch2 (Boost Software License 1.0, test-only). See THIRDPARTY.txt for attribution. The effective combined license of the redistributed source is MIT AND BSL-1.0. htslib is a dynamically-linked dependency, not vendored.
