feat(load_dataset): report entry-removal counts via verbose flag by breimanntools · Pull Request #275 · breimanntools/aaanalysis

breimanntools · 2026-06-25T17:59:43Z

Summary

aa.load_dataset silently dropped rows in three places — the min_len filter, the max_len filter, and _adjust_non_canonical_aa under the default non_canonical_aa='remove' — surfacing nothing unless a filter removed every row. This makes the previously silent filtering observable.

A new verbose: bool = False parameter reports how many entries each removal step drops, via ut.print_out (gated on verbose, emitted only when the count is non-zero). The returned data is byte-identical when verbose is off. The retain-everything opt-out is the existing non_canonical_aa='keep'.

import aaanalysis as aa

aa.load_dataset(name="SEQ_CAPSID", verbose=True)
# -> 'SEQ_CAPSID': removed 73 sequence(s) containing non-canonical amino acids.

aa.load_dataset(name="SEQ_CAPSID", non_canonical_aa="keep")   # retains every entry

Design

verbose validated through ut.check_verbose (validates via check_bool and honors the global aa.options['verbose']), matching the house pattern used by TreeModel / EmbeddingPreprocessor.
Per-step n_removed = n_before - n_after computed at each removal site; printed only when verbose and n_removed > 0 (and non_canonical_aa == 'remove' for the non-canonical step).
No separate opt-out flag added — retaining entries is the existing non_canonical_aa='keep' (and min_len/max_len already default to None). Keeps the API slim.
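
The count-and-report pattern in the list above can be sketched roughly as follows. This is a standard-library illustration only, not the code in _load_dataset.py: the names filter_sequences, report, and CANONICAL_AA, the list-of-strings data, and the wording of the length messages are all assumptions made for the example, and plain print stands in for ut.print_out.

```python
# Hypothetical sketch of the per-step count-and-report pattern.
# Not the aaanalysis implementation; all names here are illustrative.
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def report(name, n_before, n_after, reason, verbose=False):
    """Print a removal message only if verbose is on and something was removed."""
    n_removed = n_before - n_after
    if verbose and n_removed > 0:
        print(f"'{name}': removed {n_removed} sequence(s) {reason}.")
    return n_removed


def filter_sequences(name, seqs, min_len=None, max_len=None,
                     non_canonical_aa="remove", verbose=False):
    """Apply the three removal steps, counting each step separately."""
    if not isinstance(verbose, bool):  # stand-in for the real validation
        raise ValueError("'verbose' should be bool")
    if min_len is not None:
        n_before = len(seqs)
        seqs = [s for s in seqs if len(s) >= min_len]
        report(name, n_before, len(seqs), f"shorter than min_len={min_len}", verbose)
    if max_len is not None:
        n_before = len(seqs)
        seqs = [s for s in seqs if len(s) <= max_len]
        report(name, n_before, len(seqs), f"longer than max_len={max_len}", verbose)
    if non_canonical_aa == "remove":
        n_before = len(seqs)
        seqs = [s for s in seqs if set(s) <= CANONICAL_AA]
        report(name, n_before, len(seqs),
               "containing non-canonical amino acids", verbose)
    return seqs


# Made-up input: one sequence is too short, one contains the non-canonical 'X'.
demo = ["ACDE", "ACXE", "AC", "ACDEFGHIK"]
kept = filter_sequences("DEMO", demo, min_len=3, verbose=True)
```

Because the message is emitted inside each step rather than once at the end, a caller can see which filter was responsible for a drop, and a step that removes nothing stays silent.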

Acceptance criteria (issue KPIs)

On a dataset with non-canonical sequences, verbose=True emits a message whose reported count equals n_before - n_after exactly (asserted per step via capsys).
Opt-out (non_canonical_aa='keep') removes 0 entries — len(df_seq) equals the raw on-disk row count.
Default call (verbose off) returns a df_seq byte-identical to current master (pd.testing.assert_frame_equal).
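
A rough, self-contained sketch of how these three criteria could be checked is given below. It is not the PR's test code: toy_load, RAW, and captured are hypothetical stand-ins, contextlib.redirect_stdout replaces the capsys fixture, and plain list equality replaces pd.testing.assert_frame_equal. The third check compares a quiet run against a verbose run, since a toy example has no master branch to compare with.

```python
import contextlib
import io

# Toy stand-in for a loader (hypothetical); only the non-canonical step is modelled.
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")
RAW = ["ACDE", "ACXE", "MKUV", "WYAC"]  # made-up "on-disk" rows


def toy_load(non_canonical_aa="remove", verbose=False):
    seqs = list(RAW)
    if non_canonical_aa == "remove":
        n_before = len(seqs)
        seqs = [s for s in seqs if set(s) <= CANONICAL_AA]
        n_removed = n_before - len(seqs)
        if verbose and n_removed > 0:
            print(f"removed {n_removed} sequence(s) containing "
                  "non-canonical amino acids.")
    return seqs


def captured(**kwargs):
    """Run toy_load and return (result, text printed to stdout)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        result = toy_load(**kwargs)
    return result, buf.getvalue()


# 1. The reported count equals n_before - n_after exactly.
kept, out = captured(verbose=True)
assert f"removed {len(RAW) - len(kept)} sequence(s)" in out

# 2. The opt-out removes nothing: the length equals the raw row count.
kept_all, out_keep = captured(non_canonical_aa="keep", verbose=True)
assert len(kept_all) == len(RAW) and out_keep == ""

# 3. With verbose off the call is silent and returns the same data.
quiet, out_quiet = captured()
assert out_quiet == "" and quiet == kept
```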

Ripple

Code + Validate block, numpydoc (versionchanged:: 1.1.0 + param docs), 8 new tests, example notebook (load_dataset.ipynb, re-executed with outputs), release notes (Unreleased → Changed).
No __all__ change (additive parameter on an already-public function).

Notes

Frontend/backend: validation in the # Check input block; backend untouched.
No print() (uses ut.print_out); bare ValueError guards unchanged.

Closes #76

load_dataset silently dropped rows in three places (min_len, max_len, and non_canonical_aa='remove'), telling the user only when a filter removed every row. Add a verbose flag (default False) that reports the exact count each removal step drops; the returned data is byte-identical when verbose is off. Retaining every entry stays available via non_canonical_aa='keep'.

- verbose validated through ut.check_verbose (honors aa.options['verbose'])
- per-step ut.print_out, gated on verbose and emitted only when count > 0
- tests: exact count == n_before-n_after per step, silent-when-off, opt-out==raw row count, default output byte-identical (assert_frame_equal)
- docstring (versionchanged + param docs), example notebook cell, release notes

Closes #76

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-25T19:27:41Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.14%. Comparing base (c9127da) to head (129df42).
⚠️ Report is 10 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #275   +/-   ##
=======================================
  Coverage   96.14%   96.14%           
=======================================
  Files         176      176           
  Lines       16739    16752   +13     
  Branches     2856     2859    +3     
=======================================
+ Hits        16093    16106   +13     
  Misses        364      364           
  Partials      282      282

Files with missing lines	Coverage Δ
aaanalysis/data_handling/_load_dataset.py	`100.00% <100.00%> (ø)`

Components	Coverage Δ
cpp_core	`94.95% <ø> (ø)`


breimanntools marked this pull request as ready for review June 25, 2026 22:53

breimanntools merged commit ae64fe2 into master Jun 25, 2026
16 checks passed

breimanntools deleted the feat/load-dataset-verbose branch June 25, 2026 22:54
