fix: resolve pruning pipeline bugs #265

Open
nh13 wants to merge 1 commit into broadinstitute:main from nh13:nh/fix-pruning-pipeline


@nh13 nh13 commented Feb 28, 2026

Collaborator

Summary

Fixes several bugs in the dataset pruning pipeline (prune_dataset.py), plus an incorrect CLI description in edit_dataset.py.

  • Vestigial fold loop: `calculate_pruning_thresholds` wrapped its entire body in `for fold in range(NUM_FOLDS)` but returned unconditionally on the first iteration, so the loop was effectively dead code and misleading. Removed it.
  • Wrong loader argument: `make_data_loader` was passed `training_params.num_epochs` where `num_workers` is expected, which could launch hundreds of worker processes.
  • Operator precedence: the cyclic validation-fold index was `pruning_fold + 1 % len(fold_datasets)`, which parses as `pruning_fold + (1 % len(fold_datasets))` and indexes out of range on the final fold. Fixed to `(pruning_fold + 1) % len(fold_datasets)`.
  • Division-by-zero guards: added informative `ValueError`s when no data pass the confidence thresholds, and when the summed error rates reach 1.0 (which would otherwise divide by zero and violate the rank-pruning assumptions).
  • Return annotation: `generated_pruned_data_for_fold` is a generator; corrected the annotation from `List[int]` to `Generator[Datum, None, None]`.
  • Wrong CLI descriptions: both prune_dataset.py and edit_dataset.py had argparse descriptions reading "train the Mutect3 artifact model".
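
To make the operator-precedence item concrete, here is a minimal standalone sketch of the two expressions. It is not the code from prune_dataset.py: the helper names and the fold count of 3 are assumptions chosen for illustration.

```python
def next_fold_buggy(pruning_fold: int, num_folds: int) -> int:
    # `%` binds tighter than `+`, so this evaluates as
    # pruning_fold + (1 % num_folds).
    return pruning_fold + 1 % num_folds


def next_fold_fixed(pruning_fold: int, num_folds: int) -> int:
    # Parenthesized so the index wraps around to 0 after the final fold.
    return (pruning_fold + 1) % num_folds


num_folds = 3  # hypothetical fold count, for illustration only
buggy = [next_fold_buggy(fold, num_folds) for fold in range(num_folds)]
fixed = [next_fold_fixed(fold, num_folds) for fold in range(num_folds)]

print(buggy)  # [1, 2, 3] -> index 3 is out of range for a list of 3 folds
print(fixed)  # [1, 2, 0] -> the final fold wraps around to fold 0
```

With three folds, the buggy form only fails on the last fold, which is why the error is easy to miss when testing early folds.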

Test plan

  • ruff format / ruff check clean; both modules compile and import.
  • Note: the test_prune_dataset integration test requires an EXPERIMENTAL_MODEL checkpoint that is not committed to the repo, so it cannot run in CI. The pruning logic was verified by inspection and targeted checks (generator semantics, cyclic fold indexing, guard conditions).
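
A targeted check of generator semantics might look like the sketch below. Everything in it is a stand-in: the `Datum` dataclass, the `generate_kept_data` function, and its `keep` mask are hypothetical and do not reproduce the real function in prune_dataset.py.

```python
import inspect
from dataclasses import dataclass
from typing import Generator, List


@dataclass
class Datum:
    # Hypothetical stand-in for the project's Datum type.
    label: int


def generate_kept_data(data: List[Datum], keep: List[bool]) -> Generator[Datum, None, None]:
    # Yields data lazily, so the accurate annotation is Generator[Datum, None, None],
    # not a List type.
    for datum, keep_it in zip(data, keep):
        if keep_it:
            yield datum


data = [Datum(0), Datum(1), Datum(2)]
result = generate_kept_data(data, [True, False, True])

# The function is a generator function, and calling it returns a generator object.
assert inspect.isgeneratorfunction(generate_kept_data)
assert inspect.isgenerator(result)

# Consuming the generator produces Datum objects, not ints.
kept = list(result)
assert [d.label for d in kept] == [0, 2]
```

A check like this needs no model checkpoint, so it can run in CI even though the integration test cannot.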

@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 3fc2b19 to 62e3fb4 Compare March 4, 2026 23:44
@nh13 nh13 changed the base branch from main to nh/fix-broken-tests March 4, 2026 23:45
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from 04da066 to 21a66ec Compare March 4, 2026 23:51
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 62e3fb4 to 14b83a6 Compare March 4, 2026 23:51
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from 21a66ec to 6708c39 Compare March 4, 2026 23:56
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 14b83a6 to 00de1c3 Compare March 4, 2026 23:56
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from 6708c39 to 4da02e3 Compare March 5, 2026 01:22
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 00de1c3 to abc7026 Compare March 5, 2026 01:22
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from 4da02e3 to e76c0e1 Compare March 5, 2026 16:10
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from abc7026 to 7258bee Compare March 5, 2026 16:11
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from e76c0e1 to 4047e6c Compare March 5, 2026 16:45
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 7258bee to 40eb370 Compare March 5, 2026 16:45
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from 4047e6c to a71a6b7 Compare March 10, 2026 17:17
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 40eb370 to 1b28cde Compare March 10, 2026 17:17
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from a71a6b7 to 3c4f160 Compare March 10, 2026 17:20
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 1b28cde to f50802c Compare March 10, 2026 17:20
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from 3c4f160 to 1f4bffd Compare March 12, 2026 04:38
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from f50802c to ac1c6dc Compare March 12, 2026 04:38
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from 1f4bffd to 78a2b91 Compare March 12, 2026 04:39
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from ac1c6dc to e9a3cc1 Compare March 12, 2026 04:39
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from 78a2b91 to f35061b Compare March 14, 2026 03:16
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from e9a3cc1 to 9af8e28 Compare March 14, 2026 03:16
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from f35061b to 04c6545 Compare March 14, 2026 03:19
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 9af8e28 to cf00273 Compare March 14, 2026 03:19
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from 04c6545 to 0ee302e Compare March 16, 2026 18:37
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from cf00273 to 6d78ec5 Compare March 16, 2026 18:37
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from 0ee302e to 617749c Compare March 16, 2026 20:20
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 6d78ec5 to 706cd36 Compare March 16, 2026 20:20
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from 617749c to a7c5dfa Compare March 16, 2026 20:34
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 706cd36 to c4dd9c4 Compare March 16, 2026 20:34
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from a7c5dfa to bf0ece0 Compare March 16, 2026 20:36
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from c4dd9c4 to a35feda Compare March 16, 2026 20:36
@nh13 nh13 force-pushed the nh/fix-broken-tests branch from bf0ece0 to 1321324 Compare March 17, 2026 17:59
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from a35feda to 2b5ad28 Compare March 17, 2026 17:59
@nh13 nh13 changed the base branch from nh/fix-broken-tests to main March 18, 2026 04:59
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 2b5ad28 to 0137d52 Compare March 18, 2026 05:00
@nh13 nh13 marked this pull request as draft March 18, 2026 05:01
- Remove vestigial `for fold in range(NUM_FOLDS)` loop in
  calculate_pruning_thresholds that always returned on the first iteration
- Fix num_epochs passed where num_workers expected in make_data_loader
- Fix operator precedence in the cyclic fold index: (pruning_fold + 1) % n
  (was pruning_fold + 1 % n, which indexed out of range on the final fold)
- Guard error-rate calculations against division by zero with informative
  ValueError messages
- Fix return annotation: List[int] -> Generator[Datum, None, None]
- Fix wrong argparse descriptions in prune_dataset.py and edit_dataset.py
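
The division-by-zero guards described in the commit message could take roughly the following shape. This is a sketch under stated assumptions, not the PR's code: the function name, parameter names, and error messages are hypothetical, and the returned denominator only illustrates why summed error rates of 1.0 or more must be rejected.

```python
def rank_pruning_denominator(
    num_confident: int, error_rate_a: float, error_rate_b: float
) -> float:
    """Return 1 - (error_rate_a + error_rate_b) after validating the inputs.

    Hypothetical helper: names and messages are illustrative assumptions.
    """
    if num_confident == 0:
        # Any rate estimated from zero confident data would divide by zero.
        raise ValueError("No data passed the confidence thresholds.")
    total = error_rate_a + error_rate_b
    if total >= 1.0:
        # A denominator of zero (or a negative one) violates the assumption
        # that the summed error rates stay below 1.
        raise ValueError(f"Summed error rates must be below 1.0, got {total}.")
    return 1.0 - total


print(rank_pruning_denominator(100, 0.25, 0.25))  # 0.5

try:
    rank_pruning_denominator(100, 0.5, 0.5)
except ValueError as err:
    print(f"rejected: {err}")
```

Raising `ValueError` with a message that names the failing condition gives a clearer failure than a bare `ZeroDivisionError` deep inside the calculation.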
@nh13 nh13 marked this pull request as ready for review June 17, 2026 08:24
@nh13 nh13 requested a review from davidbenjamin June 17, 2026 08:24
@nh13 nh13 force-pushed the nh/fix-pruning-pipeline branch from 0137d52 to de07b2c Compare June 17, 2026 08:28
