feat(shap): add opt-in fuzzy_aggregation="interpolate" estimator (#229) #269
Merged
ShapModel.fit gains fuzzy_aggregation (default "threshold", byte-identical to before). "interpolate" weights a soft label p by exactly p, fitting at 0 (S0) and at 1 (S1) and blending p*S1 + (1-p)*S0, instead of the biased threshold sweep. With n_rounds=1 it is exactly two fits per fuzzy sample (the fastest fuzzy estimator); n_rounds>1 averages per-round re-seeded fits (random_state + round) so n_rounds is always meaningful and reproducible.

Each fuzzy protein is explained independently against the fixed balanced 0/1 core, with the other fuzzy proteins excluded; a single fuzzy protein shares the full set, so its two blended fits cover every row (no baseline needed).

Backend: interpolate_fuzzy_shap_estimation + _seed_model_kwargs.
Frontend: validation (check_str_options), routing, numpydoc Notes block.

Adds 9 unit tests (exact-p golden, fit-count spy, reproducibility, MC-variance-vs-rounds, multi-fuzzy, threshold no-regression), an example notebook cell, a release-notes entry, and a CONTEXT.md glossary term.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
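The exact-p blend described in this commit can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `blend_fuzzy_shap` is a hypothetical helper name, and `s0` / `s1` stand in for the SHAP values obtained from the two hard-label fits (fuzzy sample labeled 0 and 1, respectively).

```python
import numpy as np


def blend_fuzzy_shap(s0, s1, p):
    """Blend the SHAP values of the two hard-label fits by the soft label p.

    Hypothetical helper for illustration only; s0 and s1 are assumed to be
    arrays of the same shape holding the SHAP values from the label-0 and
    label-1 fits of one fuzzy sample.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    s0 = np.asarray(s0, dtype=float)
    s1 = np.asarray(s1, dtype=float)
    # Linear interpolation: p weights the positive fit, (1 - p) the negative fit.
    return p * s1 + (1.0 - p) * s0


# Made-up SHAP values for three features (not real results).
s0 = np.array([0.10, -0.20, 0.00])
s1 = np.array([0.30, 0.40, -0.10])
blended = blend_fuzzy_shap(s0, s1, p=0.25)
```

At the endpoints the blend reduces to the corresponding hard-label fit (p=0 gives `s0`, p=1 gives `s1`), which is the property the exact-p golden test pins.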
- Extract _class_index_from_labels helper (was duplicated across both estimators).
- Move the cell zero-init into the multi-fuzzy branch (the single-fuzzy branch reassigns it; the init was dead there).
- Comment why only the interpolate path threads random_state (per-round re-seeding) while the threshold path keeps it baked into model kwargs.

No behavior change; output identical (41 ShapModel fit tests green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #269 +/- ##
==========================================
- Coverage 96.17% 96.12% -0.05%
==========================================
Files 175 175
Lines 16312 16442 +130
Branches 2787 2806 +19
==========================================
+ Hits 15688 15805 +117
- Misses 366 369 +3
- Partials 258 268 +10
... and 2 files with indirect coverage changes
breimanntools (Owner, Author) left a comment on Jun 25, 2026:
Looks all good. Please make a regression test! How much faster are we with the new implementation? How strong is the difference to the existing ShapModel method?
Pins the interpolate estimator on a real DOM_GSEC fuzzy cell: APP (P05067), CD44 (P16070), and a non-substrate (Q14802) with invented prediction scores, each explained as a single fuzzy sample. Guards:
- exact-p identity: interpolate(n_rounds=1) == p*S1 + (1-p)*S0 (atol=1e-10), recomputed same-machine so it is platform-robust;
- fit-count advantage: interpolate(n_rounds=1) does 2 fits vs threshold(n5)'s 5 (the ~2.15x wall-clock win measured against aaanalysis 1.0.3, as a noise-free invariant);
- frozen per-protein signatures; the threshold signatures were verified byte-identical to aaanalysis 1.0.3 on this cell (no-regression for the default path), while interpolate differs by design (unbiased exact-p).

@pytest.mark.regression, pinned to Linux/py3.11 (AAA_RUN_REGRESSION=1 forces it locally); runs in the non-gating nightly only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
interpolate's per-round average is a Monte-Carlo/bootstrap mean over re-seeded model fits, so it converges as n_rounds grows. Adds a reproducible (fixed base seed) convergence test on the canonical fuzzy cell: n_rounds=R equals the cumulative mean of per-round blends, so the 25 blends are computed once and all cumulative means derived from them. Asserts the convergence structure (platform-robust, no frozen values):
- a single round (n_rounds=1) sits clearly off the converged mean;
- late rounds move the estimate far less than early rounds (it converges ~1/sqrt(R));
- the tail is stable (the last rounds barely change the estimate).

On DOM_GSEC the estimate stabilizes (incremental change < 2%) around n_rounds 15-20; n_rounds=1 stays the fast unbiased point estimate, higher n_rounds buys a stable mean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
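The "compute the blends once, derive every cumulative mean" trick from this commit can be sketched as follows. This is an illustration under stated assumptions: `cumulative_round_means` is a hypothetical name, and the per-round blends are simulated with synthetic noise rather than taken from real model fits.

```python
import numpy as np


def cumulative_round_means(per_round_blends):
    """Return the running mean over rounds.

    Row r - 1 of the result is the estimate for n_rounds = r, i.e. the mean
    of the first r per-round blends. Hypothetical helper, illustration only.
    """
    blends = np.asarray(per_round_blends, dtype=float)  # shape (R, n_features)
    counts = np.arange(1, blends.shape[0] + 1)[:, None]  # 1, 2, ..., R as a column
    return np.cumsum(blends, axis=0) / counts


# Synthetic stand-in for 25 per-round blends of three features (not real data).
rng = np.random.default_rng(0)
true_blend = np.array([0.2, -0.1, 0.05])
per_round = true_blend + rng.normal(scale=0.05, size=(25, 3))

cum = cumulative_round_means(per_round)
# Size of the change each additional round makes to the running estimate.
steps = np.linalg.norm(np.diff(cum, axis=0), axis=1)
```

Comparing early and late entries of `steps` is one way to express "late rounds move the estimate far less than early rounds" without freezing platform-dependent values.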
fuzzy_aggregation now defaults to "interpolate" (was "threshold"); the legacy biased threshold sweep stays available via fuzzy_aggregation="threshold". n_rounds is now Optional[int]=None, resolving to a per-estimator natural default: 1 for interpolate (exact in a single round) and 5 otherwise (the threshold sweep and the non-fuzzy Monte-Carlo need several rounds). So the default fuzzy estimate is the exact two-fit blend (~2x faster than the v1.0 default on the same cell), while n_rounds>1 averages re-seeded fits into a reproducible Monte-Carlo mean that converges around n_rounds~15-20.

Updates the Notes/param docstrings, the CONTEXT.md glossary, the release notes, and the example notebook (now demos the threshold opt-in + n_rounds averaging).

Tests: default-is-interpolate, n_rounds=None natural-default resolution; the threshold branch-coverage test pins fuzzy_aggregation="threshold" explicitly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…class

Reverts the per-estimator n_rounds=None magic for a simpler, single shared default. fuzzy_aggregation selects two first-class estimators: the cited threshold sweep ([Breimann25]) and the new unbiased interpolate (default, v1.1); threshold is kept (not deprecated) and stays faithful to its published n_rounds=5 grid.

n_rounds is a plain int=5 (no None resolution): no regression to the threshold or non-fuzzy paths, and for interpolate it is a documented speed/stability dial: n_rounds=1 the fast exact two-fit estimate, 5 (default) light averaging, ~15-20 the converged Monte-Carlo mean (run-to-run spread <5% on DOM_GSEC).

The n_rounds reasoning + g-secretase convergence are documented in the fit Notes, the CONTEXT.md glossary, and the release notes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spells out the seed scheme in the fit Notes, the constructor random_state docstring, and the example notebook: random_state is the initial seed and interpolate re-seeds each round with random_state + round (reproducible for a fixed seed, fresh entropy for None), while the threshold and non-fuzzy paths do not re-seed per round. Adds a notebook cell showing the fixed-seed result is reproducible across runs even with n_rounds>1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
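The seed scheme spelled out above can be sketched like this. It is a minimal illustration, not the package's `_seed_model_kwargs`: `round_seed` and `draw_round` are hypothetical names, the 0-based round index is an assumption, and a NumPy generator stands in for a seeded model fit.

```python
import numpy as np


def round_seed(random_state, round_index):
    """Derive the seed for one round (hypothetical helper, illustration only).

    A fixed int seed yields random_state + round_index, so every round is
    different but reproducible; None is passed through so that each round
    draws fresh entropy. The 0-based round index is an assumption.
    """
    if random_state is None:
        return None
    return random_state + round_index


def draw_round(random_state, round_index):
    """Stand-in for one seeded model fit: returns seed-dependent numbers."""
    rng = np.random.default_rng(round_seed(random_state, round_index))
    return rng.normal(size=3)


seeds = [round_seed(42, i) for i in range(3)]

same_a = draw_round(42, 1)
same_b = draw_round(42, 1)  # same seed and round: identical draw
other = draw_round(42, 0)   # same seed, different round: different draw
```

With `random_state=None`, `np.random.default_rng(None)` seeds itself from OS entropy, which mirrors the non-reproducible averaging behavior described for the `None` case.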
The class-abbreviation registry requires the canonical bare abbreviation; reassign `sm` per estimator instead of holding sm_threshold/sm_converged concurrently.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Adds an opt-in `fuzzy_aggregation="interpolate"` estimator to `ShapModel.fit` ([pro]). It weights a soft label `p` by exactly `p`, fitting the model twice (fuzzy sample at 0 → `S0`, at 1 → `S1`) and blending `p·S1 + (1−p)·S0`, instead of the biased threshold sweep. The default `"threshold"` path is unchanged and byte-identical.

Closes #229.
Design (settled in the grill)
- Exact-`p` blend. The threshold sweep's effective positive-fraction is the grid's biased `frac1`, not `p`; the blend fixes that.
- `n_rounds` semantics (per-round seeds). A fixed int `random_state` derives a per-round seed `random_state + i`, so each round refits a genuinely different model; `n_rounds` is always meaningful and reproducible. With `random_state=None` each round draws fresh entropy (non-reproducible MC averaging). `n_rounds=1` is the 2-fit fast path (the fastest fuzzy estimator). There is no fixed-seed short-circuit and no no-op warning: they're unnecessary under per-round seeds.
- `"threshold"` stays the default; promoting `"interpolate"` is a later-minor decision.

Changes
- Backend (`_backend/shap_model/shap_model_fit.py`): `interpolate_fuzzy_shap_estimation` + `_seed_model_kwargs` (per-round seeding).
- Frontend (`_shap_model.py`): `fuzzy_aggregation` param, `check_str_options` validation, routing, extended numpydoc Notes.
- Tests (`test_sm_fit.py`): 9 new tests: exact-`p` golden (atol=`1e-10`), 2-fit-count spy (`n_rounds=1` → 2, `n_rounds=10` → 20), reproducibility + rounds-differ, MC-variance-shrinks-with-`n_rounds`, multi-fuzzy fit count, threshold no-regression, invalid-value negative, inert-without-fuzzy.
- `CONTEXT.md` glossary term.

Acceptance criteria (verified locally)
- Exact-`p` blend equals `p·S1 + (1−p)·S0` (max abs diff `0.0`).
- 2 fits at `n_rounds=1`; 20 at `n_rounds=10`.
- Fixed `random_state` reproducible; `n_rounds=1` vs `10` differ.
- `random_state=None`: variance shrinks with `n_rounds`.
- `"threshold"` default byte-identical (no-regression test).

Scope / non-goals
`pro` only, no new deps. Default estimator unchanged. Joint multi-fuzzy `2^K` interactions out of scope (each fuzzy protein is an independent single-sample problem). Distinct from #53 (uncertainty band).

🤖 Generated with Claude Code