feat(load_dataset): report entry-removal counts via verbose flag #275
Merged
load_dataset silently dropped rows in three places (min_len, max_len, and non_canonical_aa='remove'), telling the user only when a filter removed every row. Add a verbose flag (default False) that reports the exact count each removal step drops; the returned data is byte-identical when verbose is off. Retaining every entry stays available via non_canonical_aa='keep'.

- verbose validated through ut.check_verbose (honors aa.options['verbose'])
- per-step ut.print_out, gated on verbose and emitted only when count > 0
- tests: exact count == n_before-n_after per step, silent-when-off, opt-out==raw row count, default output byte-identical (assert_frame_equal)
- docstring (versionchanged + param docs), example notebook cell, release notes

Closes #76

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #275 +/- ##
=======================================
Coverage 96.14% 96.14%
=======================================
Files 176 176
Lines 16739 16752 +13
Branches 2856 2859 +3
=======================================
+ Hits 16093 16106 +13
Misses 364 364
Partials 282 282
Summary

`aa.load_dataset` silently dropped rows in three places (the `min_len` filter, the `max_len` filter, and `_adjust_non_canonical_aa` under the default `non_canonical_aa='remove'`), surfacing nothing unless a filter removed every row. This PR makes the previously silent filtering observable.

A new `verbose: bool = False` parameter reports how many entries each removal step drops, via `ut.print_out` (gated on `verbose`, emitted only when the count is non-zero). The returned data is byte-identical when `verbose` is off. The retain-everything opt-out is the existing `non_canonical_aa='keep'`.

Design

- `verbose` is validated through `ut.check_verbose` (validates via `check_bool` and honors the global `aa.options['verbose']`), matching the house pattern used by `TreeModel`/`EmbeddingPreprocessor`.
- `n_removed = n_before - n_after` is computed at each removal site and printed only when `verbose and n_removed > 0` (and `non_canonical_aa == 'remove'` for the non-canonical step).
- The opt-out is the existing `non_canonical_aa='keep'` (and `min_len`/`max_len` already default to `None`). Keeps the API slim.

Acceptance criteria (issue KPIs)

- `verbose=True` emits a message whose reported count equals `n_before - n_after` exactly (asserted per step via `capsys`).
- Opt-out (`non_canonical_aa='keep'`) removes 0 entries: `len(df_seq)` equals the raw on-disk row count.
- Default (`verbose` off) returns a `df_seq` byte-identical to current `master` (`pd.testing.assert_frame_equal`).

Ripple

- Docstring (`versionchanged:: 1.1.0` + param docs), 8 new tests, example notebook (`load_dataset.ipynb`, re-executed with outputs), release notes (Unreleased → Changed).
- No `__all__` change (additive parameter on an already-public function).

Notes

- `# Check input` block; backend untouched.
- No direct `print()` (uses `ut.print_out`); bare `ValueError` guards unchanged.

Closes #76
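The acceptance criteria above can be expressed as checks along these lines. This is a sketch under stated assumptions: `drop_short` is a hypothetical single-step stand-in for `load_dataset`, `run_captured` is an illustrative helper, and stdout is captured with `contextlib.redirect_stdout` rather than the pytest `capsys` fixture that the PR's tests use.

```python
import contextlib
import io


def drop_short(seqs, min_len=None, verbose=False):
    """Hypothetical one-step filter that reports its removal count."""
    kept = [s for s in seqs if min_len is None or len(s) >= min_len]
    n_removed = len(seqs) - len(kept)
    if verbose and n_removed > 0:
        print(f"min_len: removed {n_removed} entries")
    return kept


def run_captured(func, *args, **kwargs):
    """Call func and return (result, text printed to stdout)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        result = func(*args, **kwargs)
    return result, buf.getvalue()


raw = ["AC", "ACDE", "ACDEFG", "A"]

# 1. The reported count equals n_before - n_after exactly.
kept, out = run_captured(drop_short, raw, min_len=3, verbose=True)
assert str(len(raw) - len(kept)) in out.split()

# 2. Silent when verbose is off, and the flag does not change the result.
kept_quiet, out_quiet = run_captured(drop_short, raw, min_len=3)
assert out_quiet == "" and kept_quiet == kept

# 3. With no filter active, every entry is kept and nothing is printed.
kept_all, out_all = run_captured(drop_short, raw, verbose=True)
assert kept_all == raw and out_all == ""
```

Comparing plain lists here plays the role that `pd.testing.assert_frame_equal` plays for `df_seq` in the actual tests.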