Skip to content

External sumstats jk#230

Open
juhis wants to merge 15 commits into
masterfrom
external_sumstats_jk
Open

External sumstats jk#230
juhis wants to merge 15 commits into
masterfrom
external_sumstats_jk

Conversation

@juhis

@juhis juhis commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Performance work on the report task and one fix

  • LD step: parallel tabix fetching (memory-bounded by design) + narrow variant1-indexed fetch, de-quadratic greedy grouping, asTuple parsing, CS-mode batched fetch
  • Annotation: batch large phenotypes by 3 Mb regions (lowered variant_switch to 5,000) instead of one remote tabix fetch per variant
  • Bug fix: off-by-one in generate_chrom_ranges silently dropped ~1 variant per 3 Mb chunk from annotation in large phenotypes (showed up as NA lead stats). Now correct + regression test added
  • Net: T2D prod report task compute ~5.5 h → ~25 min; massive speedup through LD optimizations on many-signal non-FinnGen sumstats with no credible sets
  • Adds --pval-is-mlog10p / --finngen-variants-only, a local benchmark driver, [STAGE] timing logs, and equivalence tests

Backward compatibility: existing WDL inputs and CLI commands work unchanged (all new knobs optional with prior-behavior defaults)

juhis and others added 14 commits May 20, 2026 12:30
…rt input files in varied formats; add an option for mlog10p values instead of p-values in input
LD handling was a major bottleneck on inputs with many signals: the grouping
loop fetched LD serially, one blocking call per peak, and each TabixLD.get_range
split every line in the window only to discard ~all of them.

Parsing: get_range now rejects non-matching lines with a cheap substring test
before str.split, parses only matched rows, skips object creation for rows below
the r2 threshold, and avoids the per-row chrX->chr23 replace. An opt-in
--ld-assume-variant1-indexed flag narrows the fetch to a 1bp window at the lead
position (valid when the file is indexed by variant1 pos), eliminating the N*M
line scan; default off keeps the correct-for-any-layout wide fetch.

Parallelization: LD for a lead depends only on (lead, range, threshold), never on
grouping state, so add LDAccess.get_ranges() to prefetch LD for all candidate
leads up front. TabixLD parallelizes it across worker processes (pysam files
aren't picklable, so each worker re-opens its own via a Pool initializer). The
greedy grouping loop then reads from the cache, preserving identical output.
ld_grouping, credible_grouping, the CS branch, and post_process_hits all use it.
New --ld-workers arg (default cpu_count, 1 = serial).

Adds testing/test_tabixld.py with a synthetic indexed-by-variant1 fixture
covering parsing, threshold, chrX normalization, narrow-vs-wide equivalence, and
serial-vs-parallel equivalence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
the three per-row parse loops in grouping.py (simple/ld/filter) and load_region
in load_tabix.py did a header->index dict lookup for every column on every row.
hoist them to integer locals once before the loop. also defer the beta float()
parse until after the p2/p1 threshold check, since most genome-wide rows fail it
and never need beta. output is unchanged (a row with unparseable beta that passes
the threshold is still skipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
let pysam split each line into columns in C rather than calling l.split("\t")
per row in python. add TabixResource.fetch_all_tuples() for the whole-file
grouping loops and use parser=pysam.asTuple() in load_region. verified
byte-identical parsing vs the old split path on the test resource (132 rows).
the LD parse loop in linkage.py is intentionally left on raw strings — it relies
on a cheap substring reject before splitting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
the greedy loop rebuilt p1_pile and p2_pile with a fresh list comprehension on
every iteration and rescanned all of p2_pile to find each lead's partners, giving
~O(loci * pile size) behavior that dominates on many-signal phenotypes. extract
the loop into a pure _greedy_ld_group() that indexes p2 by variant id once and
tracks consumed variants in a set, so per-locus work is O(#partners) instead of
O(pile size). partner order, r2 values and lead selection are preserved.

add test_grouping_equivalence.py: 600 randomized scenarios (overlap, non-overlap,
varying r2 thresholds) assert the new path matches a faithful copy of the old
list-rebuild loop exactly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
credible-set grouping fetched pval/beta one tabix region per variant: once per cs
lead in credible_grouping and once per cs variant in form_groups. add
_load_variants_by_chrom, which sorts each chromosome's variants and fetches one
region per cluster of nearby variants (split on gaps > max_gap), then filters the
result to the requested set. this collapses N per-variant fetches into a handful
per chromosome while a gap cap keeps a chromosome-wide spread from loading the
whole chromosome. verified identical to the per-variant path on the test resource.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
add a timed() context manager to time_decorator and use it to log per-stage wall
time in main (form_groups / annotate / variant report / top report) and inside
ld_grouping (sumstat parse / LD prefetch / greedy grouping). wall-clock logging
works across multiprocessing, unlike cProfile which only sees the main process.
also log p1/p2 candidate counts before the LD prefetch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
benchmarks/run_local_ibd.py reproduces a single WDL report-task run locally on a
chromosome/region slice of an external sumstat (mirrors autoreporting_external.json),
with cProfile and stage timing. it resolves inputs from the WDL json + input tsv,
tabix-slices the chrom from GCS, and runs main.py on the slice with the LD panel
left remote. modes: group (parse+LD+report) and full (+ annotations & catalog).
includes README documenting the chr21 finding (narrow LD fetch ~300x on the LD step,
byte-identical reports on the real finngen_r12 panel).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
profiling the wide ±range fetch on a chr21 IBD slice showed it iterates ~94M tabix
rows across just 5 leads (130s); the 1bp narrow fetch does the same work in 0.29s
and produces byte-identical reports on the real finngen_r12 panel (whose pos column
equals variant1's position). flip assume_variant1_indexed to default True in TabixLD
and expose --ld-assume-variant1-indexed / --no-ld-assume-variant1-indexed (argparse
BooleanOptionalAction) in main.py and post_process_hits.py so a non-variant1-indexed
panel can still opt out. update README and the benchmark runner accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ld_grouping prefetched LD for every genome-wide lead into one dict before
grouping; on many-signal sumstats (IBD: ~22k leads, mostly MHC) that exceeds
the 6GB WDL task and OOM-kills the parent, surfacing as BrokenPipeError in the
Pool workers. Fetch LD lazily one batch of leads at a time (LD_FETCH_BATCH=500)
in significance order via a reusable LDFetcher whose worker pool stays alive
across batches. Peak memory is bounded to one batch regardless of genome size,
and leads consumed as partners before their turn are never fetched (chr6/MHC:
6007 of 20988 fetched, peak RSS 1.68GB vs prior OOM). Output is unchanged
(600-scenario equivalence + chr21 byte-identical report content).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…riants

The 3Mb subrange split (commit e4cf30f) started each wrapped subrange at
region_start = v.pos. tabix fetch(chrom,start,end) is half-open [start,end) in
0-based coords, so a variant at 1-based pos (0-based pos-1) is only returned
when start < pos <= end; starting at pos excludes the variant that begins the
subrange, so it gets no annotation. Only the first subrange per chromosome
(which used pos-1) was unaffected.

Effect: in phenotypes large enough to hit the region-split branch (variant
count > variant_switch), ~1 variant per 3Mb chunk per chromosome was silently
dropped from annotation. On real T2D inputs this dropped 212 of 90,509
variants; 4 happened to be group leads, surfacing as NA lead_mlogp/af/sebeta/
previous_release in the top report (pval stayed populated as it comes from
lead.pval, not the annotation). ExtraCol/PreviousRelease (switch 50k) hit it
first; FG/gnomAD/functional (switch 100k) break too above ~100k variants.

Fix: start wrapped subranges at max(v.pos-1, 0). Verified 212 -> 0 dropped on
the T2D set. The existing test_generate_ranges had encoded the buggy output as
expected (and was stale, missing chrom 3); updated to correct values and added
test_ranges_cover_all_variants asserting the half-open coverage invariant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…enotypes

TabixAnnotation chose one tabix fetch per variant at/below variant_switch and
batched 3Mb-region fetches above it. The base was 100k with per-subclass
overrides up to 1M (functional/finngen) and 500k (gnomad4), so any realistic
phenotype stayed on the per-variant path: one remote GCS round-trip per
variant. On T2D (90,509 variants) the annotate stage took ~5.5h, all of it
serial per-variant fetches (functional/finngen/gnomad ~1.5h each), and it did
not benefit from more CPUs since annotation is single-threaded.

Benchmarked both branches on the real annotation files over the T2D set:
batched scales with genomic spread (~235 regions), not variant count, so it is
near-flat in N. Full-set batched vs per-variant-extrapolated:
  functional 138s vs ~141min (~61x); finngen 242s vs ~179min (~44x);
  gnomad_4 663s vs ~168min (~15x). Annotate stage ~5.5h -> ~17min.
Crossover is ~1.3k (functional) to ~5k (gnomad, densest); 5000 sits at the
worst-case crossover, so it routes every real phenotype to batched while
leaving tiny phenotypes on per-variant. Lowering is monotonic: phenotypes
below the threshold are unchanged.

Verified batched == per-variant on 3000 real variants for finngen (0 diffs)
and gnomad (0 real diffs; 3 apparent diffs were nan!=nan artifacts, values
identical). This is safe now that the generate_chrom_ranges off-by-one
(ad363bb) is fixed; the high overrides had effectively disabled the batched
path. Removed the per-subclass overrides so all annotations share the base.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@juhis juhis requested review from Lipastomies and juhaa June 11, 2026 14:20
@Lipastomies

Copy link
Copy Markdown
Collaborator

I have yet to look through this completely not to mention testing it, but noticed that a new timing function was introduced. The existing time_func should be removed if a new one is introduced, let's not create multiple ways of doing the same thing. This is in ‎Scripts/time_decorator.py

@juhis

juhis commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

Thanks! They're actually two different mechanisms. cProfile only sees the main process so the new approach was needed to catch multiprocessing timings. So I'd prefer to keep both. The new [STAGE] logging was on by default, now gated the timing output to --stage-timing

@Lipastomies Lipastomies left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of things I commented in this PR, but my overall impression is that this will not be merged as is.

The PR has multiple great points, mainly:

  • assume tabixed LD is indexed on variant1. That was a major time saver, and will be incorporated as its own PR.
  • fixing the range in annotation generate ranges. That will be incorporated as its own PR, today.
  • adding mlogp support. That will be added as its own PR.
  • optimizing the switch between annotation fetch per variant and go through the whole file.

Things that will not be implemented, at least in their current form:

  • LDFetcher mixes up interface with implementation, and seems overall like a sloppy mess. Since resources like LDAccess are by their nature singletons, there is no need to make globals, I think.
  • I don't really like the placement of many of the helper functions in this PR, they seem haphazard.

Performance testing notes:

  • Using LD grouping, with the single edit of fetching only single variant from tabixLD seems to be much faster than lazy pooled LD fetching. For example R13 K11_IBD_STRICT being 1m40s realtime on single thread and 35min realtime on this branch on 2 threads. I think most of this is due to the fact that LD grouping is in most cases quite "fast-converging" - the leads limit the amount of fetches so fast that prefetching LD means most of the fetched data ends up not being used. I've yet to test this on extremely large meta-analyzed endpoints, but testing on R13 T2D yielded similar results (single variant fetching <3min, this branch > 10 minutes, yet to finish.) I'll comment on the branch when I get results back from a really large kanta endpoint, which should simulate meta-analysis results quite well.

HERE = os.path.dirname(os.path.abspath(__file__))
REPO = os.path.dirname(HERE)
MAIN = os.path.join(REPO, "Scripts", "main.py")
DEFAULT_JSON = "/home/jkarjala/suite/genetics-results-munge/wdl/autoreporting_external.json"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's a default json, then please include it in the repo. Otherwise it is dead weight for everyone but you.

Comment thread Scripts/grouping.py
Comment on lines +22 to +31
def _passes_threshold(value:float, threshold:float, pval_is_mlog10p:bool)->bool:
"""Check if value passes significance threshold.
For raw pval: value < threshold (smaller = more significant).
For mlog10p: value > threshold (larger = more significant).
"""
return value > threshold if pval_is_mlog10p else value < threshold

def _passes_threshold_eq(value:float, threshold:float, pval_is_mlog10p:bool)->bool:
"""Like _passes_threshold but inclusive (<= or >=)."""
return value >= threshold if pval_is_mlog10p else value <= threshold

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that there are two implementations of these with an epsilon-sized difference :) What is the reasoning for using Let's use _passes_threshold_eq for all of the times where _passes_threshold would be used, it does differ from earlier behaviour but makes more sense in my opinion.

Comment thread Scripts/grouping.py
Comment on lines +40 to +69
def _load_variants_by_chrom(summstat_resource: TabixResource, variants, data_columns:List[str],
max_gap:int=1_000_000) -> Dict[Variant,List[str]]:
"""Load data_columns for many variants with one tabix fetch per cluster of nearby
variants instead of one fetch per variant. Variants on a chromosome are sorted and split
into clusters wherever the gap between consecutive positions exceeds max_gap, bounding
each fetch span so a chromosome-wide spread doesn't pull the whole chromosome into memory.
Returns {Variant: [cols...]} for the requested variants found in the file.
"""
by_chrom: Dict[str,List[Variant]] = {}
for v in variants:
by_chrom.setdefault(v.chrom, []).append(v)
out: Dict[Variant,List[str]] = {}
for chrom, vs in by_chrom.items():
want = set(vs)
positions = sorted(set(v.pos for v in vs))
clusters = []
start = prev = positions[0]
for p in positions[1:]:
if p - prev > max_gap:
clusters.append((start, prev))
start = p
prev = p
clusters.append((start, prev))
for lo, hi in clusters:
region = summstat_resource.load_region(chrom, lo, hi+1, data_columns)
for var, data in region.items():
if var in want:
out[var] = data
return out

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as generating chromosome ranges, then loading variants in those ranges. I would rather we reuse the group_annotation.generate_chrom_ranges function than roll a copy of that here. We could change that to always produce the optimal clusters (as long gaps between clusters as possible i.e. minimizing the amount of fetched data) with a parameter for max cluster length.
A greedy algorithm for that should be:

for chrom in chroms:
    go over variants, noting down the distances between them. Preferably in a priority queue or similar. This results in one large range per chromosome, consisting of multiple smaller pieces (the distances between single variants).
    while there are ranges larger than the maximum allowed  range, remove its longest distance between variants, creating two ranges with a section removed in between them. Continue until all ranges are smaller than maximum range length.

Runtime is going to be worse than the current linear algorithm, but would result in ideal ranges. Whether that makes any difference would need to be tested.

Comment thread Scripts/data_access/db.py
def get_ranges(self, variants: List[Variant], bp_range: int,
thresholds: Optional[List[float]] = None,
bp_ranges: Optional[List[int]] = None,
workers: int = 1) -> Dict[Variant, List[LDData]]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather workers is a implementation-specific option (since it is currently only used in TabixLD) that is set when building TabixLD. That way we don't have to propagate this options here, and muddle up the LD access interface with implementation-specific options.

Comment on lines +66 to +69
# start the new subrange at pos-1: tabix fetch is half-open [start,end),
# so a variant at 1-based pos sits at 0-based pos-1 and must be included
# (using pos here dropped every subrange-starting variant from annotation)
region_start = max(v.pos-1,0)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good catch, would drop comments on lines 67-68 and change 66 to be

# start the new subrange at pos-1 as tabix fetch is half-open [start,end)

@Lipastomies

Copy link
Copy Markdown
Collaborator

Addition to the benchmarking:
Comparing LD grouping on a massive endpoint between current master with single variant LD fetching vs this branch, I got:
master + single variant LD fetching:

real    12m44.881s
user    7m25.315s
sys     0m23.998s

This branch:

real    60m45.919s
user    83m24.678s
sys     3m48.553s

I assume this is due to the amount of LD prefetching.

@Lipastomies

Copy link
Copy Markdown
Collaborator

@juhis if you have the time, could you isolate the changes necessary for external summary statistics from this PR, and make a PR of your own? I've separated the majority of performance increases in a much smaller PR #231, so with the external summary stat support those should cover the scope of this PR quite well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants