External sumstats jk#230
Conversation
…rt input files in varied formats; add an option for mlog10p values instead of p-values in input
LD handling was a major bottleneck on inputs with many signals: the grouping loop fetched LD serially, one blocking call per peak, and each TabixLD.get_range split every line in the window only to discard ~all of them. Parsing: get_range now rejects non-matching lines with a cheap substring test before str.split, parses only matched rows, skips object creation for rows below the r2 threshold, and avoids the per-row chrX->chr23 replace. An opt-in --ld-assume-variant1-indexed flag narrows the fetch to a 1bp window at the lead position (valid when the file is indexed by variant1 pos), eliminating the N*M line scan; default off keeps the correct-for-any-layout wide fetch. Parallelization: LD for a lead depends only on (lead, range, threshold), never on grouping state, so add LDAccess.get_ranges() to prefetch LD for all candidate leads up front. TabixLD parallelizes it across worker processes (pysam files aren't picklable, so each worker re-opens its own via a Pool initializer). The greedy grouping loop then reads from the cache, preserving identical output. ld_grouping, credible_grouping, the CS branch, and post_process_hits all use it. New --ld-workers arg (default cpu_count, 1 = serial). Adds testing/test_tabixld.py with a synthetic indexed-by-variant1 fixture covering parsing, threshold, chrX normalization, narrow-vs-wide equivalence, and serial-vs-parallel equivalence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
the three per-row parse loops in grouping.py (simple/ld/filter) and load_region in load_tabix.py did a header->index dict lookup for every column on every row. hoist them to integer locals once before the loop. also defer the beta float() parse until after the p2/p1 threshold check, since most genome-wide rows fail it and never need beta. output is unchanged (a row with unparseable beta that passes the threshold is still skipped). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
let pysam split each line into columns in C rather than calling l.split("\t")
per row in python. add TabixResource.fetch_all_tuples() for the whole-file
grouping loops and use parser=pysam.asTuple() in load_region. verified
byte-identical parsing vs the old split path on the test resource (132 rows).
the LD parse loop in linkage.py is intentionally left on raw strings — it relies
on a cheap substring reject before splitting.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
the greedy loop rebuilt p1_pile and p2_pile with a fresh list comprehension on every iteration and rescanned all of p2_pile to find each lead's partners, giving ~O(loci * pile size) behavior that dominates on many-signal phenotypes. extract the loop into a pure _greedy_ld_group() that indexes p2 by variant id once and tracks consumed variants in a set, so per-locus work is O(#partners) instead of O(pile size). partner order, r2 values and lead selection are preserved. add test_grouping_equivalence.py: 600 randomized scenarios (overlap, non-overlap, varying r2 thresholds) assert the new path matches a faithful copy of the old list-rebuild loop exactly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
credible-set grouping fetched pval/beta one tabix region per variant: once per cs lead in credible_grouping and once per cs variant in form_groups. add _load_variants_by_chrom, which sorts each chromosome's variants and fetches one region per cluster of nearby variants (split on gaps > max_gap), then filters the result to the requested set. this collapses N per-variant fetches into a handful per chromosome while a gap cap keeps a chromosome-wide spread from loading the whole chromosome. verified identical to the per-variant path on the test resource. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
add a timed() context manager to time_decorator and use it to log per-stage wall time in main (form_groups / annotate / variant report / top report) and inside ld_grouping (sumstat parse / LD prefetch / greedy grouping). wall-clock logging works across multiprocessing, unlike cProfile which only sees the main process. also log p1/p2 candidate counts before the LD prefetch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
benchmarks/run_local_ibd.py reproduces a single WDL report-task run locally on a chromosome/region slice of an external sumstat (mirrors autoreporting_external.json), with cProfile and stage timing. it resolves inputs from the WDL json + input tsv, tabix-slices the chrom from GCS, and runs main.py on the slice with the LD panel left remote. modes: group (parse+LD+report) and full (+ annotations & catalog). includes README documenting the chr21 finding (narrow LD fetch ~300x on the LD step, byte-identical reports on the real finngen_r12 panel). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
profiling the wide ±range fetch on a chr21 IBD slice showed it iterates ~94M tabix rows across just 5 leads (130s); the 1bp narrow fetch does the same work in 0.29s and produces byte-identical reports on the real finngen_r12 panel (whose pos column equals variant1's position). flip assume_variant1_indexed to default True in TabixLD and expose --ld-assume-variant1-indexed / --no-ld-assume-variant1-indexed (argparse BooleanOptionalAction) in main.py and post_process_hits.py so a non-variant1-indexed panel can still opt out. update README and the benchmark runner accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ld_grouping prefetched LD for every genome-wide lead into one dict before grouping; on many-signal sumstats (IBD: ~22k leads, mostly MHC) that exceeds the 6GB WDL task and OOM-kills the parent, surfacing as BrokenPipeError in the Pool workers. Fetch LD lazily one batch of leads at a time (LD_FETCH_BATCH=500) in significance order via a reusable LDFetcher whose worker pool stays alive across batches. Peak memory is bounded to one batch regardless of genome size, and leads consumed as partners before their turn are never fetched (chr6/MHC: 6007 of 20988 fetched, peak RSS 1.68GB vs prior OOM). Output is unchanged (600-scenario equivalence + chr21 byte-identical report content). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…riants The 3Mb subrange split (commit e4cf30f) started each wrapped subrange at region_start = v.pos. tabix fetch(chrom,start,end) is half-open [start,end) in 0-based coords, so a variant at 1-based pos (0-based pos-1) is only returned when start < pos <= end; starting at pos excludes the variant that begins the subrange, so it gets no annotation. Only the first subrange per chromosome (which used pos-1) was unaffected. Effect: in phenotypes large enough to hit the region-split branch (variant count > variant_switch), ~1 variant per 3Mb chunk per chromosome was silently dropped from annotation. On real T2D inputs this dropped 212 of 90,509 variants; 4 happened to be group leads, surfacing as NA lead_mlogp/af/sebeta/ previous_release in the top report (pval stayed populated as it comes from lead.pval, not the annotation). ExtraCol/PreviousRelease (switch 50k) hit it first; FG/gnomAD/functional (switch 100k) break too above ~100k variants. Fix: start wrapped subranges at max(v.pos-1, 0). Verified 212 -> 0 dropped on the T2D set. The existing test_generate_ranges had encoded the buggy output as expected (and was stale, missing chrom 3); updated to correct values and added test_ranges_cover_all_variants asserting the half-open coverage invariant. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…enotypes TabixAnnotation chose one tabix fetch per variant at/below variant_switch and batched 3Mb-region fetches above it. The base was 100k with per-subclass overrides up to 1M (functional/finngen) and 500k (gnomad4), so any realistic phenotype stayed on the per-variant path: one remote GCS round-trip per variant. On T2D (90,509 variants) the annotate stage took ~5.5h, all of it serial per-variant fetches (functional/finngen/gnomad ~1.5h each), and it did not benefit from more CPUs since annotation is single-threaded. Benchmarked both branches on the real annotation files over the T2D set: batched scales with genomic spread (~235 regions), not variant count, so it is near-flat in N. Full-set batched vs per-variant-extrapolated: functional 138s vs ~141min (~61x); finngen 242s vs ~179min (~44x); gnomad_4 663s vs ~168min (~15x). Annotate stage ~5.5h -> ~17min. Crossover is ~1.3k (functional) to ~5k (gnomad, densest); 5000 sits at the worst-case crossover, so it routes every real phenotype to batched while leaving tiny phenotypes on per-variant. Lowering is monotonic: phenotypes below the threshold are unchanged. Verified batched == per-variant on 3000 real variants for finngen (0 diffs) and gnomad (0 real diffs; 3 apparent diffs were nan!=nan artifacts, values identical). This is safe now that the generate_chrom_ranges off-by-one (ad363bb) is fixed; the high overrides had effectively disabled the batched path. Removed the per-subclass overrides so all annotations share the base. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
I have yet to look through this completely not to mention testing it, but noticed that a new timing function was introduced. The existing time_func should be removed if a new one is introduced, let's not create multiple ways of doing the same thing. This is in Scripts/time_decorator.py |
|
Thanks! They're actually two different mechanisms. cProfile only sees the main process so the new approach was needed to catch multiprocessing timings. So I'd prefer to keep both. The new [STAGE] logging was on by default, now gated the timing output to --stage-timing |
Lipastomies
left a comment
There was a problem hiding this comment.
There are a number of things I commented in this PR, but my overall impression is that this will not be merged as is.
The PR has multiple great points, mainly:
- assume tabixed LD is indexed on variant1. That was a major time saver, and will be incorporated as its own PR.
- fixing the range in annotation generate ranges. That will be incorporated as its own PR, today.
- adding mlogp support. That will be added as its own PR.
- optimizing the switch between annotation fetch per variant and go through the whole file.
Things that will not be implemented, at least in their current form:
- LDFetcher mixes up interface with implementation, and seems overall like a sloppy mess. Since resources like LDAccess are by their nature singletons, there is no need to make globals, I think.
- I don't really like the placement of many of the helper functions in this PR, they seem haphazard.
Performance testing notes:
- Using LD grouping, with the single edit of fetching only single variant from tabixLD seems to be much faster than lazy pooled LD fetching. For example R13 K11_IBD_STRICT being 1m40s realtime on single thread and 35min realtime on this branch on 2 threads. I think most of this is due to the fact that LD grouping is in most cases quite "fast-converging" - the leads limit the amount of fetches so fast that prefetching LD means most of the fetched data ends up not being used. I've yet to test this on extremely large meta-analyzed endpoints, but testing on R13 T2D yielded similar results (single variant fetching <3min, this branch > 10 minutes, yet to finish.) I'll comment on the branch when I get results back from a really large kanta endpoint, which should simulate meta-analysis results quite well.
| HERE = os.path.dirname(os.path.abspath(__file__)) | ||
| REPO = os.path.dirname(HERE) | ||
| MAIN = os.path.join(REPO, "Scripts", "main.py") | ||
| DEFAULT_JSON = "/home/jkarjala/suite/genetics-results-munge/wdl/autoreporting_external.json" |
There was a problem hiding this comment.
If there's a default json, then please include it in the repo. Otherwise it is dead weight for everyone but you.
| def _passes_threshold(value:float, threshold:float, pval_is_mlog10p:bool)->bool: | ||
| """Check if value passes significance threshold. | ||
| For raw pval: value < threshold (smaller = more significant). | ||
| For mlog10p: value > threshold (larger = more significant). | ||
| """ | ||
| return value > threshold if pval_is_mlog10p else value < threshold | ||
|
|
||
| def _passes_threshold_eq(value:float, threshold:float, pval_is_mlog10p:bool)->bool: | ||
| """Like _passes_threshold but inclusive (<= or >=).""" | ||
| return value >= threshold if pval_is_mlog10p else value <= threshold |
There was a problem hiding this comment.
I see that there are two implementations of these with an epsilon-sized difference :) What is the reasoning for using Let's use _passes_threshold_eq for all of the times where _passes_threshold would be used, it does differ from earlier behaviour but makes more sense in my opinion.
| def _load_variants_by_chrom(summstat_resource: TabixResource, variants, data_columns:List[str], | ||
| max_gap:int=1_000_000) -> Dict[Variant,List[str]]: | ||
| """Load data_columns for many variants with one tabix fetch per cluster of nearby | ||
| variants instead of one fetch per variant. Variants on a chromosome are sorted and split | ||
| into clusters wherever the gap between consecutive positions exceeds max_gap, bounding | ||
| each fetch span so a chromosome-wide spread doesn't pull the whole chromosome into memory. | ||
| Returns {Variant: [cols...]} for the requested variants found in the file. | ||
| """ | ||
| by_chrom: Dict[str,List[Variant]] = {} | ||
| for v in variants: | ||
| by_chrom.setdefault(v.chrom, []).append(v) | ||
| out: Dict[Variant,List[str]] = {} | ||
| for chrom, vs in by_chrom.items(): | ||
| want = set(vs) | ||
| positions = sorted(set(v.pos for v in vs)) | ||
| clusters = [] | ||
| start = prev = positions[0] | ||
| for p in positions[1:]: | ||
| if p - prev > max_gap: | ||
| clusters.append((start, prev)) | ||
| start = p | ||
| prev = p | ||
| clusters.append((start, prev)) | ||
| for lo, hi in clusters: | ||
| region = summstat_resource.load_region(chrom, lo, hi+1, data_columns) | ||
| for var, data in region.items(): | ||
| if var in want: | ||
| out[var] = data | ||
| return out | ||
|
|
There was a problem hiding this comment.
This is the same as generating chromosome ranges, then loading variants in those ranges. I would rather we reuse the group_annotation.generate_chrom_ranges function than roll a copy of that here. We could change that to always produce the optimal clusters (as long gaps between clusters as possible i.e. minimizing the amount of fetched data) with a parameter for max cluster length.
A greedy algorithm for that should be:
for chrom in chroms:
go over variants, noting down the distances between them. Preferably in a priority queue or similar. This results in one large range per chromosome, consisting of multiple smaller pieces (the distances between single variants).
while there are ranges larger than the maximum allowed range, remove its longest distance between variants, creating two ranges with a section removed in between them. Continue until all ranges are smaller than maximum range length.
Runtime is going to be worse than the current linear algorithm, but would result in ideal ranges. Whether that makes any difference would need to be tested.
| def get_ranges(self, variants: List[Variant], bp_range: int, | ||
| thresholds: Optional[List[float]] = None, | ||
| bp_ranges: Optional[List[int]] = None, | ||
| workers: int = 1) -> Dict[Variant, List[LDData]]: |
There was a problem hiding this comment.
I would rather workers is a implementation-specific option (since it is currently only used in TabixLD) that is set when building TabixLD. That way we don't have to propagate this options here, and muddle up the LD access interface with implementation-specific options.
| # start the new subrange at pos-1: tabix fetch is half-open [start,end), | ||
| # so a variant at 1-based pos sits at 0-based pos-1 and must be included | ||
| # (using pos here dropped every subrange-starting variant from annotation) | ||
| region_start = max(v.pos-1,0) |
There was a problem hiding this comment.
That's a good catch, would drop comments on lines 67-68 and change 66 to be
# start the new subrange at pos-1 as tabix fetch is half-open [start,end)
|
Addition to the benchmarking: This branch: I assume this is due to the amount of LD prefetching. |
|
@juhis if you have the time, could you isolate the changes necessary for external summary statistics from this PR, and make a PR of your own? I've separated the majority of performance increases in a much smaller PR #231, so with the external summary stat support those should cover the scope of this PR quite well. |
Performance work on the report task and one fix
Backward compatibility: existing WDL inputs and CLI commands work unchanged (all new knobs optional with prior-behavior defaults)