feat(load_dataset): report entry-removal counts via verbose flag #275

Merged
breimanntools merged 1 commit into master from feat/load-dataset-verbose on Jun 25, 2026


@breimanntools

Owner

Summary

aa.load_dataset silently dropped rows in three places: the min_len filter, the max_len filter, and _adjust_non_canonical_aa under the default non_canonical_aa='remove'. It reported nothing unless a filter removed every row. This PR makes the previously silent filtering observable.

A new verbose: bool = False parameter reports how many entries each removal step drops, via ut.print_out (gated on verbose, emitted only when the count is non-zero). The returned data is byte-identical when verbose is off. The retain-everything opt-out is the existing non_canonical_aa='keep'.

aa.load_dataset(name="SEQ_CAPSID", verbose=True)
# -> 'SEQ_CAPSID': removed 73 sequence(s) containing non-canonical amino acids.

aa.load_dataset(name="SEQ_CAPSID", non_canonical_aa="keep")   # retains every entry

Design

  • verbose validated through ut.check_verbose (validates via check_bool and honors the global aa.options['verbose']), matching the house pattern used by TreeModel / EmbeddingPreprocessor.
  • Per-step n_removed = n_before - n_after computed at each removal site; printed only when verbose and n_removed > 0 (and non_canonical_aa == 'remove' for the non-canonical step).
  • No separate opt-out flag added — retaining entries is the existing non_canonical_aa='keep' (and min_len/max_len already default to None). Keeps the API slim.
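The design points above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in _load_dataset.py: `OPTIONS`, `check_verbose`, `drop_and_report`, and `load_sequences` are hypothetical stand-ins, sequences are a plain list instead of a DataFrame, the `"off"` sentinel for the global option is assumed, and the `out` callable plays the role of ut.print_out.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in for the global aa.options['verbose'] setting.
# Assumption of this sketch: "off" means "no global override".
OPTIONS = {"verbose": "off"}

CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")


def check_verbose(verbose: bool) -> bool:
    """Validate the flag as bool, then let the global option override it."""
    if not isinstance(verbose, bool):
        raise ValueError(f"'verbose' ({verbose!r}) should be bool.")
    if OPTIONS["verbose"] != "off":
        return bool(OPTIONS["verbose"])
    return verbose


def drop_and_report(seqs: List[str], keep: Callable[[str], bool], name: str,
                    reason: str, verbose: bool,
                    out: Callable[[str], None]) -> List[str]:
    """Apply one removal step and report n_before - n_after if non-zero."""
    n_before = len(seqs)
    kept = [s for s in seqs if keep(s)]
    n_removed = n_before - len(kept)
    if verbose and n_removed > 0:
        out(f"'{name}': removed {n_removed} sequence(s) {reason}.")
    return kept


def load_sequences(seqs: List[str], name: str = "DEMO",
                   min_len: Optional[int] = None,
                   max_len: Optional[int] = None,
                   non_canonical_aa: str = "remove",
                   verbose: bool = False,
                   out: Callable[[str], None] = print) -> List[str]:
    """Toy loader with the three removal sites named in the PR."""
    verbose = check_verbose(verbose)
    kept = list(seqs)
    if min_len is not None:
        kept = drop_and_report(kept, lambda s: len(s) >= min_len, name,
                               f"shorter than min_len={min_len}", verbose, out)
    if max_len is not None:
        kept = drop_and_report(kept, lambda s: len(s) <= max_len, name,
                               f"longer than max_len={max_len}", verbose, out)
    if non_canonical_aa == "remove":
        kept = drop_and_report(kept, lambda s: set(s) <= CANONICAL_AA, name,
                               "containing non-canonical amino acids",
                               verbose, out)
    return kept


if __name__ == "__main__":
    # "A" fails min_len=2, "ACX" contains the non-canonical letter X.
    print(load_sequences(["ACDE", "ACX", "A"], min_len=2, verbose=True))
```

Because the count is taken around each step rather than once at the end, every message is attributable to exactly one filter, and a step that removes nothing stays silent.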

Acceptance criteria (issue KPIs)

  • On a dataset with non-canonical sequences, verbose=True emits a message whose reported count equals n_before - n_after exactly (asserted per step via capsys).
  • Opt-out (non_canonical_aa='keep') removes 0 entries — len(df_seq) equals the raw on-disk row count.
  • Default call (verbose off) returns a df_seq byte-identical to current master (pd.testing.assert_frame_equal).
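The three criteria above could be checked along the following lines. This is a hedged, standard-library-only sketch: `toy_load`, `RAW`, `captured`, and `check_acceptance` are made up for illustration, `contextlib.redirect_stdout` stands in for pytest's capsys fixture, and a list comparison stands in for pd.testing.assert_frame_equal. The third check compares verbose on against verbose off as a proxy, since the sketch has no earlier master to compare with.

```python
import io
import re
from contextlib import redirect_stdout

CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")
# Made-up rows standing in for the raw on-disk data.
RAW = ["ACDE", "ACXE", "MKV", "BZ"]


def toy_load(non_canonical_aa="remove", verbose=False):
    """Hypothetical loader used only for this sketch; NOT aa.load_dataset."""
    rows = list(RAW)
    if non_canonical_aa == "remove":
        n_before = len(rows)
        rows = [s for s in rows if set(s) <= CANONICAL_AA]
        n_removed = n_before - len(rows)
        if verbose and n_removed > 0:
            print(f"removed {n_removed} sequence(s) "
                  "containing non-canonical amino acids.")
    return rows


def captured(**kwargs):
    """Run toy_load and return (rows, everything it printed)."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        rows = toy_load(**kwargs)
    return rows, buffer.getvalue()


def check_acceptance():
    # 1. The reported count equals n_before - n_after exactly.
    rows, text = captured(verbose=True)
    reported = int(re.search(r"removed (\d+) ", text).group(1))
    assert reported == len(RAW) - len(rows)

    # 2. The opt-out removes 0 entries and prints nothing.
    rows_keep, text_keep = captured(non_canonical_aa="keep", verbose=True)
    assert len(rows_keep) == len(RAW)
    assert text_keep == ""

    # 3. verbose changes only the output, never the returned data.
    rows_quiet, text_quiet = captured()
    assert rows_quiet == rows
    assert text_quiet == ""
    return reported
```

Parsing the count out of the message, rather than matching a hard-coded string, keeps the first check tied to the n_before - n_after invariant instead of to one dataset's numbers.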

Ripple

  • Code + Validate block, numpydoc (versionchanged:: 1.1.0 + param docs), 8 new tests, example notebook (load_dataset.ipynb, re-executed with outputs), release notes (Unreleased → Changed).
  • No __all__ change (additive parameter on an already-public function).

Notes

  • Frontend/backend: validation in the # Check input block; backend untouched.
  • No print() (uses ut.print_out); bare ValueError guards unchanged.

Closes #76

load_dataset silently dropped rows in three places (min_len, max_len, and
non_canonical_aa='remove'), telling the user only when a filter removed every
row. Add a verbose flag (default False) that reports the exact count each
removal step drops; the returned data is byte-identical when verbose is off.
Retaining every entry stays available via non_canonical_aa='keep'.

- verbose validated through ut.check_verbose (honors aa.options['verbose'])
- per-step ut.print_out, gated on verbose and emitted only when count > 0
- tests: exact count == n_before-n_after per step, silent-when-off,
  opt-out==raw row count, default output byte-identical (assert_frame_equal)
- docstring (versionchanged + param docs), example notebook cell, release notes

Closes #76

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.14%. Comparing base (c9127da) to head (129df42).
⚠️ Report is 10 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #275   +/-   ##
=======================================
  Coverage   96.14%   96.14%           
=======================================
  Files         176      176           
  Lines       16739    16752   +13     
  Branches     2856     2859    +3     
=======================================
+ Hits        16093    16106   +13     
  Misses        364      364           
  Partials      282      282           

| Files with missing lines | Coverage Δ |
|---|---|
| aaanalysis/data_handling/_load_dataset.py | 100.00% <100.00%> (ø) |

| Components | Coverage Δ |
|---|---|
| cpp_core | 94.95% <ø> (ø) |

@breimanntools breimanntools marked this pull request as ready for review June 25, 2026 22:53
@breimanntools breimanntools merged commit ae64fe2 into master Jun 25, 2026
16 checks passed
@breimanntools breimanntools deleted the feat/load-dataset-verbose branch June 25, 2026 22:54
Development

Successfully merging this pull request may close these issues.

Make entry removal explicit + add opt-out flag
