feat(shap): add opt-in fuzzy_aggregation="interpolate" estimator (#229) #269

Merged
breimanntools merged 8 commits into master from feat/shap-fuzzy-interpolate
Jun 25, 2026

@breimanntools

Owner

Summary

Adds an opt-in fuzzy_aggregation="interpolate" estimator to ShapModel.fit ([pro]). It weights a soft label p by exactly p — fitting the model twice (fuzzy sample at 0 → S0, at 1 → S1) and blending p·S1 + (1−p)·S0 — instead of the biased threshold sweep. The default "threshold" path is unchanged and byte-identical.
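
The blend in the summary can be sketched in a few lines. This is a minimal illustration, not the code in shap_model_fit.py: the function name `blend_exact_p` and the SHAP values are invented for the example, and `s0` / `s1` stand for the fuzzy sample's per-feature SHAP rows from the fit at label 0 and the fit at label 1.

```python
from typing import List, Sequence


def blend_exact_p(s0: Sequence[float], s1: Sequence[float], p: float) -> List[float]:
    """Blend the two per-feature SHAP rows, weighting the label-1 fit by exactly p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if len(s0) != len(s1):
        raise ValueError("S0 and S1 must have the same length")
    return [p * b + (1.0 - p) * a for a, b in zip(s0, s1)]


# Hypothetical SHAP rows for one fuzzy sample (values invented for illustration).
s0 = [0.10, -0.20, 0.00]  # fit with the fuzzy sample labeled 0
s1 = [0.30, 0.20, -0.40]  # fit with the fuzzy sample labeled 1
blended = blend_exact_p(s0, s1, p=0.25)
```

At p=0 the blend returns S0 and at p=1 it returns S1, so hard labels are the two endpoints of the same formula.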

Closes #229.

Design (settled in the grill)

  • Unbiased exact-p blend. The threshold sweep's effective positive-fraction is the grid's biased frac1, not p; the blend fixes that.
  • n_rounds semantics (per-round seeds). A fixed int random_state derives a per-round seed random_state + i, so each round refits a genuinely different model — n_rounds is always meaningful and reproducible. With random_state=None each round draws fresh entropy (non-reproducible MC averaging). n_rounds=1 is the 2-fit fast path (the fastest fuzzy estimator). There is no fixed-seed short-circuit and no no-op warning — they're unnecessary under per-round seeds.
  • One fuzzy protein at a time. Each fuzzy protein is explained independently against the fixed balanced 0/1 core, with the other fuzzy proteins excluded from its training data. A single fuzzy protein shares the full sample set, so its two blended fits cover every row (no separate baseline → exactly 2 fits); with ≥2 fuzzy proteins, non-fuzzy core rows come from one baseline core fit and each fuzzy row from its own 2-fit blend.
  • Opt-in. "threshold" stays the default; promoting "interpolate" is a later-minor decision.
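
The per-round seed rule from the design notes can be sketched as follows. This is an assumption-laden toy, not the `_seed_model_kwargs` helper itself: `round_seeds`, `toy_round_estimate`, and `averaged_estimate` are hypothetical names, and a seeded Gaussian draw stands in for one round's blended SHAP estimate.

```python
import random
from typing import List, Optional


def round_seeds(random_state: Optional[int], n_rounds: int) -> List[Optional[int]]:
    """Round i gets random_state + i; None stays None so each round draws fresh entropy."""
    if random_state is None:
        return [None] * n_rounds
    return [random_state + i for i in range(n_rounds)]


def toy_round_estimate(seed: Optional[int]) -> float:
    """Stand-in for one round's blended estimate (hypothetical noise model)."""
    rng = random.Random(seed)  # Random(None) seeds from system entropy
    return 1.0 + rng.gauss(0.0, 0.1)


def averaged_estimate(random_state: Optional[int], n_rounds: int) -> float:
    """Average the per-round estimates over the derived seeds."""
    seeds = round_seeds(random_state, n_rounds)
    return sum(toy_round_estimate(s) for s in seeds) / n_rounds
```

With a fixed integer seed, two calls with the same `n_rounds` return the same value, while changing `n_rounds` changes the set of seeds and therefore the average.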

Changes

  • Backend (_backend/shap_model/shap_model_fit.py): interpolate_fuzzy_shap_estimation + _seed_model_kwargs (per-round seeding).
  • Frontend (_shap_model.py): fuzzy_aggregation param, check_str_options validation, routing, extended numpydoc Notes.
  • Tests (test_sm_fit.py): 9 new tests — exact-p golden (atol=1e-10), 2-fit-count spy (n_rounds=1→2, n_rounds=10→20), reproducibility + rounds-differ, MC-variance-shrinks-with-n_rounds, multi-fuzzy fit count, threshold no-regression, invalid-value negative, inert-without-fuzzy.
  • Docs: example notebook cell (re-executed with outputs), release-notes entry, CONTEXT.md glossary term.

Acceptance criteria — verified locally

  • ✅ Exact-p blend equals p·S1 + (1−p)·S0 (max abs diff 0.0).
  • ✅ Exactly 2 model fits per fuzzy sample at n_rounds=1; 20 at n_rounds=10.
  • ✅ Fixed random_state reproducible; n_rounds=1 vs 10 differ.
  • random_state=None: variance shrinks with n_rounds.
  • "threshold" default byte-identical (no-regression test).
  • ✅ Full local unit suite: 5037 passed, 10 skipped. Docstring/drift/param-coverage/import-hygiene/method-spacing checkers green.
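
The fit-count criterion lends itself to a spy-style check. The sketch below shows the pattern with `unittest.mock` on a toy loop; `ToyModel`, `toy_interpolate`, and `count_fits` are hypothetical stand-ins, not the test in test_sm_fit.py or the real estimator.

```python
from unittest import mock


class ToyModel:
    """Hypothetical stand-in for the wrapped estimator."""

    def fit(self, X, y):
        return self


def toy_interpolate(X, y_core, n_rounds):
    """Toy loop: per round, fit once with the fuzzy label at 0 and once at 1."""
    for _ in range(n_rounds):
        for fuzzy_label in (0, 1):
            ToyModel().fit(X, y_core + [fuzzy_label])


def count_fits(n_rounds):
    """Replace ToyModel.fit with a mock and report how often it was called."""
    X = [[0.0], [1.0], [0.5]]  # two core rows plus one fuzzy row
    y_core = [0, 1]
    with mock.patch.object(ToyModel, "fit") as spy:
        toy_interpolate(X, y_core, n_rounds)
        return spy.call_count
```

Counting calls rather than timing them gives a check that does not depend on machine load.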

Scope / non-goals

pro only, no new deps. Default estimator unchanged. Joint multi-fuzzy 2^K interactions out of scope (each fuzzy protein is an independent single-sample problem). Distinct from #53 (uncertainty band).
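
To make the "independent single-sample problem" concrete, here is a small sketch of how the per-protein training sets could be assembled. The function name and the sample identifiers are hypothetical; the only rule taken from this PR is that each fuzzy protein is paired with the fixed core while the other fuzzy proteins are left out.

```python
from typing import Dict, List, Sequence


def per_fuzzy_training_ids(
    core_ids: Sequence[str], fuzzy_ids: Sequence[str]
) -> Dict[str, List[str]]:
    """Map each fuzzy sample to its training set: the fixed core plus that sample only."""
    return {fid: list(core_ids) + [fid] for fid in fuzzy_ids}


# Hypothetical identifiers for a balanced 0/1 core and two fuzzy samples.
core = ["core_pos_1", "core_neg_1"]
fuzzy = ["fuzzy_a", "fuzzy_b"]
training = per_fuzzy_training_ids(core, fuzzy)
```

Because no training set contains two fuzzy samples, the joint 2^K label combinations never arise.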

🤖 Generated with Claude Code

breimanntools and others added 2 commits June 25, 2026 06:18
ShapModel.fit gains fuzzy_aggregation (default "threshold", byte-identical
to before). "interpolate" weights a soft label p by exactly p — fitting at 0
(S0) and at 1 (S1) and blending p*S1 + (1-p)*S0 — instead of the biased
threshold sweep. With n_rounds=1 it is exactly two fits per fuzzy sample
(the fastest fuzzy estimator); n_rounds>1 averages per-round re-seeded fits
(random_state + round) so n_rounds is always meaningful and reproducible.

Each fuzzy protein is explained independently against the fixed balanced 0/1
core, with the other fuzzy proteins excluded; a single fuzzy protein shares
the full set, so its two blended fits cover every row (no baseline needed).

Backend: interpolate_fuzzy_shap_estimation + _seed_model_kwargs.
Frontend: validation (check_str_options), routing, numpydoc Notes block.
Adds 9 unit tests (exact-p golden, fit-count spy, reproducibility,
MC-variance-vs-rounds, multi-fuzzy, threshold no-regression), an example
notebook cell, a release-notes entry, and a CONTEXT.md glossary term.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Extract _class_index_from_labels helper (was duplicated across both
  estimators).
- Move the cell zero-init into the multi-fuzzy branch (the single-fuzzy
  branch reassigns it; the init was dead there).
- Comment why only the interpolate path threads random_state (per-round
  re-seeding) while the threshold path keeps it baked into model kwargs.

No behavior change; output identical (41 ShapModel fit tests green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.12%. Comparing base (3bd1c87) to head (c1d8684).
⚠️ Report is 25 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #269      +/-   ##
==========================================
- Coverage   96.17%   96.12%   -0.05%     
==========================================
  Files         175      175              
  Lines       16312    16442     +130     
  Branches     2787     2806      +19     
==========================================
+ Hits        15688    15805     +117     
- Misses        366      369       +3     
- Partials      258      268      +10     
| Files with missing lines | Coverage Δ |
|---|---|
| ...nable_ai_pro/_backend/shap_model/shap_model_fit.py | 94.07% <100.00%> (+4.30%) ⬆️ |
| aaanalysis/explainable_ai_pro/_shap_model.py | 98.37% <100.00%> (+0.02%) ⬆️ |

... and 2 files with indirect coverage changes

Components Coverage Δ
cpp_core 94.95% <ø> (ø)

@breimanntools breimanntools left a comment

Owner Author

Looks all good. Please add a regression test! How much faster are we with the new implementation? How strong is the difference to the existing ShapModel method?

breimanntools and others added 6 commits June 25, 2026 12:28
Pins the interpolate estimator on a real DOM_GSEC fuzzy cell — APP (P05067),
CD44 (P16070), and a non-substrate (Q14802) with invented prediction scores,
each explained as a single fuzzy sample. Guards:

- exact-p identity: interpolate(n_rounds=1) == p*S1 + (1-p)*S0 (atol=1e-10),
  recomputed same-machine so it is platform-robust;
- fit-count advantage: interpolate(n_rounds=1) does 2 fits vs threshold(n5)'s 5
  — the ~2.15x wall-clock win measured against aaanalysis 1.0.3 as a noise-free
  invariant;
- frozen per-protein signatures; the threshold signatures were verified
  byte-identical to aaanalysis 1.0.3 on this cell (no-regression for the default
  path), while interpolate differs by design (unbiased exact-p).

@pytest.mark.regression, pinned to Linux/py3.11 (AAA_RUN_REGRESSION=1 forces it
locally); runs in the non-gating nightly only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
interpolate's per-round average is a Monte-Carlo/bootstrap mean over re-seeded
model fits, so it converges as n_rounds grows. Adds a reproducible (fixed base
seed) convergence test on the canonical fuzzy cell: n_rounds=R equals the
cumulative mean of per-round blends, so the 25 blends are computed once and all
cumulative means derived from them. Asserts the convergence structure
(platform-robust, no frozen values):

- a single round (n_rounds=1) sits clearly off the converged mean;
- late rounds move the estimate far less than early rounds (it converges ~1/sqrt(R));
- the tail is stable (the last rounds barely change the estimate).

On DOM_GSEC the estimate stabilizes (incremental change < 2%) around n_rounds 15-20;
n_rounds=1 stays the fast unbiased point estimate, higher n_rounds buys a stable mean.
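
The cumulative-mean trick described above can be sketched like this: the per-round values are computed once and every n_rounds=R estimate is read off the running mean. The Gaussian draws are invented stand-ins for the 25 per-round blends, and `cumulative_means` is a hypothetical helper name.

```python
import random
from itertools import accumulate
from typing import List, Sequence


def cumulative_means(values: Sequence[float]) -> List[float]:
    """Entry R-1 is the mean of the first R values, i.e. the n_rounds=R estimate."""
    return [total / (i + 1) for i, total in enumerate(accumulate(values))]


# Hypothetical per-round blends: a fixed true value plus seeded noise.
rng = random.Random(0)
per_round = [1.0 + rng.gauss(0.0, 0.2) for _ in range(25)]

means = cumulative_means(per_round)
# Change in the estimate when one more round is added.
steps = [abs(b - a) for a, b in zip(means, means[1:])]
```

Inspecting `steps` shows how much each extra round still moves the estimate, which is the quantity the convergence assertions compare between early and late rounds.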

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fuzzy_aggregation now defaults to "interpolate" (was "threshold"); the legacy
biased threshold sweep stays available via fuzzy_aggregation="threshold".

n_rounds is now Optional[int]=None, resolving to a per-estimator natural default:
1 for interpolate (exact in a single round) and 5 otherwise (the threshold sweep
and the non-fuzzy Monte-Carlo need several rounds). So the default fuzzy estimate
is the exact two-fit blend — ~2x faster than the v1.0 default on the same cell —
while n_rounds>1 averages re-seeded fits into a reproducible Monte-Carlo mean that
converges around n_rounds~15-20.

Updates the Notes/param docstrings, the CONTEXT.md glossary, the release notes,
and the example notebook (now demos the threshold opt-in + n_rounds averaging).
Tests: default-is-interpolate, n_rounds=None natural-default resolution; the
threshold branch-coverage test pins fuzzy_aggregation="threshold" explicitly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…class

Reverts the per-estimator n_rounds=None magic for a simpler, single shared
default. fuzzy_aggregation selects two first-class estimators: the cited
threshold sweep ([Breimann25]) and the new unbiased interpolate (default, v1.1);
threshold is kept (not deprecated) and stays faithful to its published
n_rounds=5 grid.

n_rounds is a plain int=5 (no None resolution): no regression to the threshold
or non-fuzzy paths, and for interpolate it is a documented speed/stability dial
— n_rounds=1 the fast exact two-fit estimate, 5 (default) light averaging,
~15-20 the converged Monte-Carlo mean (run-to-run spread <5% on DOM_GSEC). The
n_rounds reasoning + g-secretase convergence are documented in the fit Notes,
the CONTEXT.md glossary, and the release notes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spells out the seed scheme in the fit Notes, the constructor random_state
docstring, and the example notebook: random_state is the initial seed and
interpolate re-seeds each round with random_state + round (reproducible for a
fixed seed, fresh entropy for None), while the threshold and non-fuzzy paths do
not re-seed per round. Adds a notebook cell showing the fixed-seed result is
reproducible across runs even with n_rounds>1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The class-abbreviation registry requires the canonical bare abbreviation; reassign
`sm` per estimator instead of holding sm_threshold/sm_converged concurrently.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@breimanntools breimanntools marked this pull request as ready for review June 25, 2026 14:20
@breimanntools breimanntools merged commit 8e0b52e into master Jun 25, 2026
16 checks passed
@breimanntools breimanntools deleted the feat/shap-fuzzy-interpolate branch June 25, 2026 15:52

Development

Successfully merging this pull request may close these issues.

ShapModel.fit: add unbiased probability-interpolation estimator for fuzzy labeling (p-weighted 0/1 blend per round)

1 participant