feat(shap): add opt-in fuzzy_aggregation="interpolate" estimator (#229) by breimanntools · Pull Request #269 · breimanntools/aaanalysis

breimanntools · 2026-06-25T07:22:56Z

Summary

Adds an opt-in fuzzy_aggregation="interpolate" estimator to ShapModel.fit ([pro]). It weights a soft label p by exactly p — fitting the model twice (fuzzy sample at 0 → S0, at 1 → S1) and blending p·S1 + (1−p)·S0 — instead of the biased threshold sweep. The default "threshold" path is unchanged and byte-identical.
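The blend described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the backend code: `blend_fuzzy_shap` is a hypothetical helper name, and `s0`/`s1` stand in for the SHAP values obtained from the fit with the fuzzy sample labeled 0 and the fit with it labeled 1.

```python
import numpy as np


def blend_fuzzy_shap(s0, s1, p):
    """Blend two SHAP arrays for one fuzzy sample with soft label p.

    Hypothetical helper: s0 is the explanation from the fit where the fuzzy
    sample is labeled 0, s1 from the fit where it is labeled 1.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    s0 = np.asarray(s0, dtype=float)
    s1 = np.asarray(s1, dtype=float)
    # Exact-p weighting: the label-1 fit contributes p, the label-0 fit 1 - p.
    return p * s1 + (1.0 - p) * s0


# Toy SHAP vectors (invented values, three features).
s0 = np.array([0.2, -0.1, 0.0])
s1 = np.array([0.6, 0.3, -0.4])
blended = blend_fuzzy_shap(s0, s1, p=0.25)
```

At the endpoints the blend reduces to the hard-label explanations: p=0 returns S0 and p=1 returns S1.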

Closes #229.

Design (settled in the grill)

- Unbiased exact-p blend. The threshold sweep's effective positive-fraction is the grid's biased frac1, not p; the blend fixes that.
- n_rounds semantics (per-round seeds). A fixed int random_state derives a per-round seed random_state + i, so each round refits a genuinely different model, and n_rounds is always meaningful and reproducible. With random_state=None each round draws fresh entropy (non-reproducible MC averaging). n_rounds=1 is the 2-fit fast path (the fastest fuzzy estimator). There is no fixed-seed short-circuit and no no-op warning; they're unnecessary under per-round seeds.
- One fuzzy protein at a time. Each fuzzy protein is explained independently against the fixed balanced 0/1 core, with the other fuzzy proteins excluded from its training data. A single fuzzy protein shares the full sample set, so its two blended fits cover every row (no separate baseline → exactly 2 fits); with ≥2 fuzzy proteins, non-fuzzy core rows come from one baseline core fit and each fuzzy row from its own 2-fit blend.
- Opt-in. "threshold" stays the default; promoting "interpolate" is a later-minor decision.
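The per-round seeding rule can be sketched as follows. This is a hedged illustration under stated assumptions: `round_seed`, `averaged_blend`, and `toy_fit_and_explain` are hypothetical names, the toy "model" is just seeded noise around the label, and using the same seed for both fits within a round is an assumption of this sketch rather than something the PR text states.

```python
import numpy as np


def round_seed(random_state, i):
    """Seed for round i: random_state + i for an int, None (fresh entropy) otherwise."""
    return None if random_state is None else random_state + i


def averaged_blend(fit_and_explain, p, n_rounds=1, random_state=None):
    """Average the exact-p blend over n_rounds re-seeded rounds.

    fit_and_explain(label, seed) is a hypothetical callable that refits the
    model with the fuzzy sample set to `label` and returns its SHAP values.
    """
    blends = []
    for i in range(n_rounds):
        seed = round_seed(random_state, i)
        s0 = fit_and_explain(0, seed)  # fuzzy sample labeled 0
        s1 = fit_and_explain(1, seed)  # fuzzy sample labeled 1
        blends.append(p * s1 + (1.0 - p) * s0)
    return np.mean(blends, axis=0)


def toy_fit_and_explain(label, seed):
    """Stand-in for a model refit: the label plus seed-dependent noise."""
    rng = np.random.default_rng(seed)
    return label + rng.normal(scale=0.1, size=3)


# A fixed integer seed gives the same averaged result on every call.
a = averaged_blend(toy_fit_and_explain, p=0.3, n_rounds=5, random_state=42)
b = averaged_blend(toy_fit_and_explain, p=0.3, n_rounds=5, random_state=42)
```

With `random_state=None` every round draws fresh entropy, so two calls would generally differ; that is the non-reproducible Monte-Carlo averaging case.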

Changes

- Backend (_backend/shap_model/shap_model_fit.py): interpolate_fuzzy_shap_estimation + _seed_model_kwargs (per-round seeding).
- Frontend (_shap_model.py): fuzzy_aggregation param, check_str_options validation, routing, extended numpydoc Notes.
- Tests (test_sm_fit.py): 9 new tests: exact-p golden (atol=1e-10), 2-fit-count spy (n_rounds=1→2, n_rounds=10→20), reproducibility + rounds-differ, MC-variance-shrinks-with-n_rounds, multi-fuzzy fit count, threshold no-regression, invalid-value negative, inert-without-fuzzy.
- Docs: example notebook cell (re-executed with outputs), release-notes entry, CONTEXT.md glossary term.

Acceptance criteria — verified locally

✅ Exact-p blend equals p·S1 + (1−p)·S0 (max abs diff 0.0).
✅ Exactly 2 model fits per fuzzy sample at n_rounds=1; 20 at n_rounds=10.
✅ Fixed random_state reproducible; n_rounds=1 vs 10 differ.
✅ random_state=None: variance shrinks with n_rounds.
✅ "threshold" default byte-identical (no-regression test).
✅ Full local unit suite: 5037 passed, 10 skipped. Docstring/drift/param-coverage/import-hygiene/method-spacing checkers green.
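The fit-count criterion can be checked with a spy on the model's fit method. The sketch below shows the idea on a toy setup only; `ToyModel` and `interpolate_toy` are hypothetical stand-ins and do not reproduce the tests in test_sm_fit.py.

```python
from unittest import mock

import numpy as np


class ToyModel:
    """Hypothetical stand-in for the wrapped estimator."""

    def fit(self, X, y):
        # A trivial "fit": label-weighted feature means.
        self.coef_ = X.T @ y / len(y)
        return self


def interpolate_toy(X, labels, fuzzy_idx, p, n_rounds=1):
    """Two fits per round (fuzzy sample at 0, then at 1), blended by p."""
    blends = []
    for _ in range(n_rounds):
        per_label = []
        for label in (0.0, 1.0):
            y = labels.copy()
            y[fuzzy_idx] = label
            per_label.append(ToyModel().fit(X, y).coef_)
        s0, s1 = per_label
        blends.append(p * s1 + (1.0 - p) * s0)
    return np.mean(blends, axis=0)


X = np.arange(12, dtype=float).reshape(4, 3)
labels = np.array([0.0, 1.0, 0.0, 1.0])

# Spy on ToyModel.fit while still running the original implementation.
original_fit = ToyModel.fit
with mock.patch.object(ToyModel, "fit", autospec=True, side_effect=original_fit) as spy:
    interpolate_toy(X, labels, fuzzy_idx=0, p=0.4, n_rounds=1)
    fits_one_round = spy.call_count
    interpolate_toy(X, labels, fuzzy_idx=0, p=0.4, n_rounds=10)
    fits_ten_rounds = spy.call_count - fits_one_round
```

Counting fits rather than timing them keeps the check free of wall-clock noise.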

Scope / non-goals

pro only, no new deps. Default estimator unchanged. Joint multi-fuzzy 2^K interactions out of scope (each fuzzy protein is an independent single-sample problem). Distinct from #53 (uncertainty band).

🤖 Generated with Claude Code

ShapModel.fit gains fuzzy_aggregation (default "threshold", byte-identical to before). "interpolate" weights a soft label p by exactly p — fitting at 0 (S0) and at 1 (S1) and blending p*S1 + (1-p)*S0 — instead of the biased threshold sweep. With n_rounds=1 it is exactly two fits per fuzzy sample (the fastest fuzzy estimator); n_rounds>1 averages per-round re-seeded fits (random_state + round) so n_rounds is always meaningful and reproducible. Each fuzzy protein is explained independently against the fixed balanced 0/1 core, with the other fuzzy proteins excluded; a single fuzzy protein shares the full set, so its two blended fits cover every row (no baseline needed). Backend: interpolate_fuzzy_shap_estimation + _seed_model_kwargs. Frontend: validation (check_str_options), routing, numpydoc Notes block. Adds 9 unit tests (exact-p golden, fit-count spy, reproducibility, MC-variance-vs-rounds, multi-fuzzy, threshold no-regression), an example notebook cell, a release-notes entry, and a CONTEXT.md glossary term. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Extract _class_index_from_labels helper (was duplicated across both estimators).
- Move the cell zero-init into the multi-fuzzy branch (the single-fuzzy branch reassigns it; the init was dead there).
- Comment why only the interpolate path threads random_state (per-round re-seeding) while the threshold path keeps it baked into model kwargs.

No behavior change; output identical (41 ShapModel fit tests green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-25T08:04:46Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.12%. Comparing base (3bd1c87) to head (c1d8684).
⚠️ Report is 25 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #269      +/-   ##
==========================================
- Coverage   96.17%   96.12%   -0.05%     
==========================================
  Files         175      175              
  Lines       16312    16442     +130     
  Branches     2787     2806      +19     
==========================================
+ Hits        15688    15805     +117     
- Misses        366      369       +3     
- Partials      258      268      +10

Files with missing lines	Coverage Δ
...nable_ai_pro/_backend/shap_model/shap_model_fit.py	`94.07% <100.00%> (+4.30%)`	⬆️
aaanalysis/explainable_ai_pro/_shap_model.py	`98.37% <100.00%> (+0.02%)`	⬆️

... and 2 files with indirect coverage changes

Components	Coverage Δ
cpp_core	`94.95% <ø> (ø)`


breimanntools

All looks good. Please add a regression test! How much faster are we with the new implementation? How large is the difference from the existing ShapModel method?

Pins the interpolate estimator on a real DOM_GSEC fuzzy cell: APP (P05067), CD44 (P16070), and a non-substrate (Q14802) with invented prediction scores, each explained as a single fuzzy sample. Guards:

- exact-p identity: interpolate(n_rounds=1) == p*S1 + (1-p)*S0 (atol=1e-10), recomputed same-machine so it is platform-robust;
- fit-count advantage: interpolate(n_rounds=1) does 2 fits vs threshold(n5)'s 5, which pins the ~2.15x wall-clock win measured against aaanalysis 1.0.3 as a noise-free invariant;
- frozen per-protein signatures: the threshold signatures were verified byte-identical to aaanalysis 1.0.3 on this cell (no-regression for the default path), while interpolate differs by design (unbiased exact-p).

@pytest.mark.regression, pinned to Linux/py3.11 (AAA_RUN_REGRESSION=1 forces it locally); runs in the non-gating nightly only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

interpolate's per-round average is a Monte-Carlo/bootstrap mean over re-seeded model fits, so it converges as n_rounds grows. Adds a reproducible (fixed base seed) convergence test on the canonical fuzzy cell: n_rounds=R equals the cumulative mean of per-round blends, so the 25 blends are computed once and all cumulative means derived from them. Asserts the convergence structure (platform-robust, no frozen values):

- a single round (n_rounds=1) sits clearly off the converged mean;
- late rounds move the estimate far less than early rounds (it converges ~1/sqrt(R));
- the tail is stable (the last rounds barely change the estimate).

On DOM_GSEC the estimate stabilizes (incremental change < 2%) around n_rounds 15-20; n_rounds=1 stays the fast unbiased point estimate, higher n_rounds buys a stable mean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
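The cumulative-mean trick mentioned in this commit message can be sketched with NumPy. The per-round blends below are synthetic noise around an invented value, purely for illustration; in the real test they would come from 25 re-seeded interpolate rounds.

```python
import numpy as np

# Synthetic stand-in for 25 per-round blends (rows) over 3 features (columns).
rng = np.random.default_rng(0)
n_rounds_max = 25
center = np.array([0.5, -0.2, 0.1])  # invented value, not a real result
per_round = center + rng.normal(scale=0.2, size=(n_rounds_max, 3))

# Row R-1 of cum_means is the estimate one would get with n_rounds=R,
# so all 25 estimates are derived from a single pass over the blends.
counts = np.arange(1, n_rounds_max + 1)[:, None]
cum_means = np.cumsum(per_round, axis=0) / counts

# Size of the change in the estimate when adding one more round.
steps = np.linalg.norm(np.diff(cum_means, axis=0), axis=1)
```

Comparing early entries of `steps` with late ones is one way to express "late rounds move the estimate far less than early rounds" without freezing platform-dependent values.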

fuzzy_aggregation now defaults to "interpolate" (was "threshold"); the legacy biased threshold sweep stays available via fuzzy_aggregation="threshold". n_rounds is now Optional[int]=None, resolving to a per-estimator natural default: 1 for interpolate (exact in a single round) and 5 otherwise (the threshold sweep and the non-fuzzy Monte-Carlo need several rounds). So the default fuzzy estimate is the exact two-fit blend — ~2x faster than the v1.0 default on the same cell — while n_rounds>1 averages re-seeded fits into a reproducible Monte-Carlo mean that converges around n_rounds~15-20. Updates the Notes/param docstrings, the CONTEXT.md glossary, the release notes, and the example notebook (now demos the threshold opt-in + n_rounds averaging). Tests: default-is-interpolate, n_rounds=None natural-default resolution; the threshold branch-coverage test pins fuzzy_aggregation="threshold" explicitly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…class Reverts the per-estimator n_rounds=None magic for a simpler, single shared default. fuzzy_aggregation selects two first-class estimators: the cited threshold sweep ([Breimann25]) and the new unbiased interpolate (default, v1.1); threshold is kept (not deprecated) and stays faithful to its published n_rounds=5 grid. n_rounds is a plain int=5 (no None resolution): no regression to the threshold or non-fuzzy paths, and for interpolate it is a documented speed/stability dial — n_rounds=1 the fast exact two-fit estimate, 5 (default) light averaging, ~15-20 the converged Monte-Carlo mean (run-to-run spread <5% on DOM_GSEC). The n_rounds reasoning + g-secretase convergence are documented in the fit Notes, the CONTEXT.md glossary, and the release notes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Spells out the seed scheme in the fit Notes, the constructor random_state docstring, and the example notebook: random_state is the initial seed and interpolate re-seeds each round with random_state + round (reproducible for a fixed seed, fresh entropy for None), while the threshold and non-fuzzy paths do not re-seed per round. Adds a notebook cell showing the fixed-seed result is reproducible across runs even with n_rounds>1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The class-abbreviation registry requires the canonical bare abbreviation; reassign `sm` per estimator instead of holding sm_threshold/sm_converged concurrently. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

breimanntools and others added 2 commits June 25, 2026 06:18


breimanntools and others added 6 commits June 25, 2026 12:28

docs(shap): use bare sm instance in fuzzy example cell

c1d8684


breimanntools marked this pull request as ready for review June 25, 2026 14:20

breimanntools merged commit 8e0b52e into master Jun 25, 2026
16 checks passed

breimanntools deleted the feat/shap-fuzzy-interpolate branch June 25, 2026 15:52

breimanntools merged 8 commits into master from feat/shap-fuzzy-interpolate
