Merged
4 changes: 4 additions & 0 deletions CONTEXT.md
@@ -561,6 +561,10 @@ _Avoid_: feature importance (unsigned, group-level), SHAP value (the raw per-fea
The uniform boolean toggle on the `CPPPlot` family (`profile`, `heatmap`, `ranking`, `feature_map`) selecting **CPP analysis** (`False`, group-level **feature importance**, `feat_importance` / `mean_dif`) versus **CPP-SHAP analysis** (`True`, sample-level **feature impact**, `feat_impact_'name'` / `mean_dif_'name'`). `True` switches color encoding to signed red/blue and the colorbar to the diverging SHAP colormap. It selects the *interpretation level*; it does not itself run SHAP (that is `ShapModel`). In `feature_map(shap_plot=True)` the cumulative bars stack the per-feature impact in one direction colored by sign; a `mean_dif_'name'` `col_val` keeps the mean-difference heatmap with those bars, while a `feat_impact_'name'` `col_val` moves the impact into the heatmap cells and switches the bars off.
_Avoid_: shap_mode, use_shap, sample_plot.

**fuzzy aggregation** (`fuzzy_aggregation`):
The strategy `ShapModel.fit` selects to turn a soft label `p` ∈ (0, 1) into a SHAP estimate when **fuzzy labeling** is active. `"interpolate"` (default, new in v1.1) fits the model twice (fuzzy sample at 0 → `S0`, at 1 → `S1`) and blends `p·S1 + (1−p)·S0` — the **unbiased** exact-`p` estimate. `"threshold"` (the `Breimann25` sweep) hard-labels the fuzzy sample `1` across a non-uniform `n_rounds × n_selection` grid and averages — a **biased** approximation whose effective positive-fraction is the grid's `frac1`, not `p`; kept as a first-class option. Each fuzzy protein is explained independently against the fixed balanced 0/1 **core**, with the other fuzzy proteins excluded from that run's training data. `n_rounds` (default `5`) is interpolate's speed/stability dial: `1` = fast exact two-fit estimate, `5` = light averaging, `≈15–20` = converged Monte-Carlo mean (run-to-run spread <5% on `DOM_GSEC`).
_Avoid_: fuzzy mode, blend mode, soft-label aggregation.
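
As a minimal numeric sketch of the `"interpolate"` blend described above, assuming the two per-feature SHAP rows for one fuzzy sample have already been computed. The values of `s0`, `s1` and `p` are made up for illustration, and `blend_shap` is a hypothetical helper, not part of the package:

```python
def blend_shap(s0, s1, p):
    """Blend per-feature attributions from the fit at 0 (s0) and the fit at 1 (s1)."""
    if not 0 < p < 1:
        raise ValueError("soft label p must lie strictly between 0 and 1")
    return [p * v1 + (1 - p) * v0 for v0, v1 in zip(s0, s1)]


s0 = [-0.20, 0.10, 0.00]   # attributions with the fuzzy sample labeled 0 (made-up values)
s1 = [0.40, 0.30, -0.10]   # attributions with the fuzzy sample labeled 1 (made-up values)
blended = blend_shap(s0, s1, p=0.75)
print(blended)  # approximately [0.25, 0.25, -0.075]
```

The blend is linear in `p`, so a soft label near 1 reproduces `S1` almost exactly and a soft label near 0 reproduces `S0`.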

### Scale-set vocabulary

**explainable scale set** (`top_explain_n`):
102 changes: 100 additions & 2 deletions aaanalysis/explainable_ai_pro/_backend/shap_model/shap_model_fit.py
@@ -47,6 +47,12 @@ def _get_shap_values(shap_output, class_index=1):
return shap_values


def _class_index_from_labels(labels, label_target_class=1):
    """Map the target class label to its index among the integer (non-fuzzy) classes."""
    label_classes = sorted(list(dict.fromkeys([x for x in labels if x == int(x)])))
    return label_classes.index(label_target_class)


def _compute_shap_values(X, labels, model_class=None, model_kwargs=None,
explainer_class=None, explainer_kwargs=None,
class_index=1, n_background_data=None):
@@ -105,6 +111,23 @@ def _aggregate_shap_values(X, labels=None, list_model_classes=None, list_model_k
return shap_values, exp_val


def _seed_model_kwargs(list_model_kwargs, random_state=None, round_idx=0):
    """Derive per-round-seeded copies of the model kwargs.

    With a fixed ``random_state`` each round uses ``random_state + round_idx`` so the rounds
    differ (Monte-Carlo averaging) yet stay reproducible. With ``random_state=None`` the kwargs
    are returned unchanged (``random_state`` already ``None``), so every fit re-draws fresh entropy.
    """
    if random_state is None:
        return [dict(model_kwargs) for model_kwargs in list_model_kwargs]
    seeded = []
    for model_kwargs in list_model_kwargs:
        model_kwargs = dict(model_kwargs)
        model_kwargs["random_state"] = random_state + round_idx
        seeded.append(model_kwargs)
    return seeded
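
The per-round seeding can be checked in isolation. The sketch below copies the helper's logic under a public name (`seed_model_kwargs`, chosen here for illustration) and feeds it made-up model kwargs:

```python
def seed_model_kwargs(list_model_kwargs, random_state=None, round_idx=0):
    """Standalone copy of the helper above: per-round-seeded copies of the model kwargs."""
    if random_state is None:
        return [dict(model_kwargs) for model_kwargs in list_model_kwargs]
    seeded = []
    for model_kwargs in list_model_kwargs:
        model_kwargs = dict(model_kwargs)
        model_kwargs["random_state"] = random_state + round_idx
        seeded.append(model_kwargs)
    return seeded


# Made-up kwargs for two models; both start with the same seed baked in.
base_kwargs = [{"n_estimators": 10, "random_state": 42}, {"C": 1.0, "random_state": 42}]
seeds_per_round = [[kw["random_state"] for kw in seed_model_kwargs(base_kwargs, random_state=42, round_idx=i)]
                   for i in range(3)]
print(seeds_per_round)  # [[42, 42], [43, 43], [44, 44]]
print(base_kwargs[0]["random_state"])  # 42, the input dicts are copied, not mutated
```

All models within one round share the same seed, and only the round index shifts it, which is what makes a fixed `random_state` reproduce the same seed sequence on every run.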


# II Main Functions
@ut.catch_backend_processing_error()
def monte_carlo_shap_estimation(X, labels=None, list_model_classes=None, list_model_kwargs=None,
@@ -113,8 +136,7 @@ def monte_carlo_shap_estimation(X, labels=None, list_model_classes=None, list_mo
label_target_class=1, n_background_data=None):
"""Compute Monte Carlo estimates of SHAP values for multiple models and feature selections."""
# Get class index
label_classes = sorted(list(dict.fromkeys([x for x in labels if x == int(x)])))
class_index = label_classes.index(label_target_class)
class_index = _class_index_from_labels(labels, label_target_class)
# Create empty SHAP value matrix
n_samples, n_features = X.shape
n_selection_rounds = len(is_selected)
@@ -150,3 +172,79 @@ def monte_carlo_shap_estimation(X, labels=None, list_model_classes=None, list_mo
shap_values = np.mean(mc_shap_values, axis=(2, 3))
exp_val = np.mean(list_expected_value)
return shap_values, exp_val


@ut.catch_backend_processing_error()
def interpolate_fuzzy_shap_estimation(X, labels=None, list_model_classes=None, list_model_kwargs=None,
                                      explainer_class=None, explainer_kwargs=None, n_rounds=5,
                                      is_selected=None, verbose=False, label_target_class=1,
                                      n_background_data=None, random_state=None):
    """Compute unbiased exact-``p`` SHAP estimates for fuzzy labels by interpolating between 0/1 fits.

    Each fuzzy sample with soft label ``p`` is weighted by exactly ``p``: the model is fit twice
    (fuzzy sample at 0 -> ``S0``, at 1 -> ``S1``) and the per-feature attributions are blended as
    ``p * S1 + (1 - p) * S0``. Each fuzzy protein is explained independently against the fixed
    balanced 0/1 core, with the other fuzzy proteins excluded from that run's training data. With
    ``n_rounds=1`` this is exactly two fits per fuzzy sample; ``n_rounds > 1`` averages per-round
    re-seeded fits (reproducible for a fixed ``random_state``).
    """
    labels = list(labels)
    # Get class index (fuzzy float labels are excluded; classes come from the 0/1 core)
    class_index = _class_index_from_labels(labels, label_target_class)
    n_samples, n_features = X.shape
    n_selection_rounds = len(is_selected)
    n_cells = n_rounds * n_selection_rounds
    # Partition into the fixed 0/1 core and the fuzzy samples explained one at a time
    fuzzy_idx = [i for i, label in enumerate(labels) if label not in (0, 1)]
    core_idx = [i for i, label in enumerate(labels) if label in (0, 1)]
    core_labels = [labels[i] for i in core_idx]
    # A single fuzzy protein shares the full sample set, so the two blended fits already cover
    # every row (no separate baseline needed) -> exactly two fits per round and selection.
    single_fuzzy = len(fuzzy_idx) == 1
    acc_shap_values = np.zeros(shape=(n_samples, n_features))
    list_expected_value = []
    if verbose:
        ut.print_start_progress(start_message=f"ShapModel starts interpolation estimation of SHAP values over {n_rounds} rounds.")
    for i in range(n_rounds):
        _list_model_kwargs = _seed_model_kwargs(list_model_kwargs, random_state=random_state, round_idx=i)
        for j, selected_features in enumerate(is_selected):
            if verbose:
                pct_progress = j / len(is_selected)
                add_new_line = explainer_class in LIST_VERBOSE_shap_modelS
                ut.print_progress(i=i+pct_progress, n_total=n_rounds, add_new_line=add_new_line)
            X_selected = X[:, selected_features]
            args = dict(list_model_classes=list_model_classes, list_model_kwargs=_list_model_kwargs,
                        explainer_class=explainer_class, explainer_kwargs=explainer_kwargs,
                        class_index=class_index, n_background_data=n_background_data)
            if single_fuzzy:
                f = fuzzy_idx[0]
                p = labels[f]
                labels_0 = [0 if k == f else labels[k] for k in range(n_samples)]
                labels_1 = [1 if k == f else labels[k] for k in range(n_samples)]
                shap_0, exp_0 = _aggregate_shap_values(X_selected, labels=labels_0, **args)
                shap_1, exp_1 = _aggregate_shap_values(X_selected, labels=labels_1, **args)
                cell = p * shap_1 + (1 - p) * shap_0
                list_expected_value.append(p * exp_1 + (1 - p) * exp_0)
            else:
                cell = np.zeros(shape=(n_samples, X_selected.shape[1]))
                # Non-fuzzy core rows come from a single baseline fit on the core
                shap_core, exp_core = _aggregate_shap_values(X_selected[core_idx], labels=core_labels, **args)
                cell[core_idx] = shap_core
                list_expected_value.append(exp_core)
                # Each fuzzy protein is explained against core + itself (others excluded)
                for f in fuzzy_idx:
                    p = labels[f]
                    sub_idx = core_idx + [f]
                    X_sub = X_selected[sub_idx]
                    shap_0, exp_0 = _aggregate_shap_values(X_sub, labels=core_labels + [0], **args)
                    shap_1, exp_1 = _aggregate_shap_values(X_sub, labels=core_labels + [1], **args)
                    cell[f] = p * shap_1[-1] + (1 - p) * shap_0[-1]
                    list_expected_value.append(p * exp_1 + (1 - p) * exp_0)
            full_cell = np.zeros(shape=(n_samples, n_features))
            full_cell[:, selected_features] = cell
            acc_shap_values += full_cell
    if verbose:
        ut.print_end_progress(end_message="ShapModel finished interpolation estimation and saved results.")
    shap_values = acc_shap_values / n_cells
    exp_val = np.mean(list_expected_value)
    return shap_values, exp_val
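
The single-fuzzy branch (relabel the fuzzy sample to 0, then to 1, fit twice, blend) can be sketched without the package. In the sketch below, `toy_fit_and_explain` is a hypothetical stand-in for "fit a model and run an explainer": it is not SHAP, and it only exists so the relabel-and-blend flow runs end to end. The data are made up.

```python
def toy_fit_and_explain(X, labels):
    """Hypothetical stand-in for 'fit a model, then run a SHAP explainer' (NOT real SHAP)."""
    n_features = len(X[0])
    pos = [row for row, y in zip(X, labels) if y == 1]
    neg = [row for row, y in zip(X, labels) if y == 0]
    # Per-feature weight: difference of the class means.
    weights = [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
               for j in range(n_features)]
    means = [sum(r[j] for r in X) / len(X) for j in range(n_features)]
    attributions = [[w * (x - m) for w, x, m in zip(weights, row, means)] for row in X]
    expected_value = len(pos) / len(X)
    return attributions, expected_value


def interpolate_single_fuzzy(X, labels):
    """Blend the fit-at-0 and fit-at-1 attributions for the one fuzzy sample in labels."""
    fuzzy = [i for i, y in enumerate(labels) if y not in (0, 1)]
    if len(fuzzy) != 1:
        raise ValueError("this sketch handles exactly one fuzzy sample")
    f = fuzzy[0]
    p = labels[f]
    labels_0 = [0 if k == f else y for k, y in enumerate(labels)]
    labels_1 = [1 if k == f else y for k, y in enumerate(labels)]
    shap_0, exp_0 = toy_fit_and_explain(X, labels_0)
    shap_1, exp_1 = toy_fit_and_explain(X, labels_1)
    shap = [[p * b + (1 - p) * a for a, b in zip(r0, r1)] for r0, r1 in zip(shap_0, shap_1)]
    return shap, p * exp_1 + (1 - p) * exp_0


X = [[0.0, 1.0], [0.2, 0.8], [0.9, 0.1], [1.0, 0.0], [0.6, 0.5]]
labels = [0, 0, 1, 1, 0.7]   # the last sample is fuzzy with soft label p = 0.7
shap, exp_val = interpolate_single_fuzzy(X, labels)
print(len(shap), len(shap[0]), round(exp_val, 6))  # 5 2 0.54
```

Because the single fuzzy sample shares the full sample set with the core, both fits return attributions for every row, so the blend covers the core rows as well as the fuzzy one.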
84 changes: 71 additions & 13 deletions aaanalysis/explainable_ai_pro/_shap_model.py
@@ -13,7 +13,8 @@
from aaanalysis.template_classes import Wrapper
from ._backend.check_models import (check_match_labels_X,
check_match_X_is_selected)
from ._backend.shap_model.shap_model_fit import monte_carlo_shap_estimation
from ._backend.shap_model.shap_model_fit import (monte_carlo_shap_estimation,
interpolate_fuzzy_shap_estimation)
from ._backend.shap_model.sm_add_feat_impact import (comp_shap_feature_importance,
insert_shap_feature_importance,
comp_shap_feature_impact,
@@ -379,7 +380,9 @@ def __init__(self,
If ``True``, verbose outputs are enabled.
random_state : int, optional
The seed used by the random number generator. If a positive integer, results of stochastic processes are
consistent, enabling reproducibility. If ``None``, stochastic processes will be truly random.
consistent, enabling reproducibility. If ``None``, stochastic processes will be truly random. For
``fuzzy_aggregation='interpolate'`` it is the initial seed and each round re-seeds with
``random_state + round`` (see :meth:`ShapModel.fit` Notes).

Notes
-----
@@ -472,6 +475,7 @@ def fit(self,
n_rounds: int = 5,
is_selected: Optional[ut.ArrayLike2D] = None,
fuzzy_labeling: bool = False,
fuzzy_aggregation: str = "interpolate",
n_background_data: Optional[int] = None,
df_seq: Optional[pd.DataFrame] = None,
fuzzy_labels: Optional[dict] = None,
@@ -499,11 +503,22 @@ def fit(self,
For binary classification, '0' represents the negative class and '1' the positive class.
n_rounds : int, default=5
The number of rounds (>=1) to fit the models and obtain the SHAP values by explainer.
For ``fuzzy_aggregation='interpolate'`` each round re-seeds the fit, so ``n_rounds`` is a
speed/stability dial (see Notes): ``1`` is the fast exact two-fit estimate, the default
``5`` adds Monte-Carlo averaging, and a stable mean is reached around ``15-20``.
is_selected : array-like, shape (n_selection_round, n_features)
2D boolean arrays indicating different feature selections.
fuzzy_labeling : bool, default=False
If ``True``, fuzzy labeling is applied to approximate SHAP values for samples with uncertain/partial
memberships (e.g., soft labels strictly between 0 and 1 in binary classification).
fuzzy_aggregation : str, default='interpolate'
Strategy to turn a soft label ``p`` into a SHAP estimate when fuzzy labeling is active (see Notes):

- ``'interpolate'`` (default, new in 1.1): blend ``p * S1 + (1 - p) * S0`` from a fit at
0 and at 1 (unbiased, exact ``p``; with ``n_rounds=1`` only two fits per fuzzy sample).
- ``'threshold'``: hard-label the fuzzy sample over a threshold grid and average — the
biased sweep of [Breimann25]_; kept for backward-compatible results.

n_background_data : None or int, optional
The number of samples (< 'n_samples') in the background dataset used for the ``KernelExplainer`` to reduce
computation time. The dataset is obtained by k-means clustering. If ``None``, the full dataset 'X' is used.
@@ -530,6 +545,39 @@ def fit(self,
* Idea: Adjusts label thresholds dynamically in Monte Carlo estimation to better represent label uncertainties.
* Background: Inspired by fuzzy logic, replacing binary true/false with degrees of truth.

**Fuzzy aggregation strategies**

The ``fuzzy_aggregation`` parameter selects between two estimators:

* ``'interpolate'`` (default): The fuzzy sample is weighted by *exactly* ``p`` by fitting the model twice (fuzzy
sample at 0 -> ``S0``, at 1 -> ``S1``) and blending ``p * S1 + (1 - p) * S0`` (the ``exp_value`` is blended the
same way). This is *unbiased*. Each fuzzy protein is explained independently against the fixed balanced 0/1
core, with the other fuzzy proteins excluded from its training data.
* ``'threshold'``: Over an ``n_rounds`` x ``n_selection`` grid, the fuzzy sample is hard-labeled ``1`` whenever the
per-cell threshold is ``<= p``, and the per-cell SHAP matrices are averaged (the sweep of [Breimann25]_). Because
the grid is non-uniform on (0, 1], the effective positive-fraction is a *biased* approximation of ``p``.
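
To see why a non-uniform threshold grid biases the estimate, consider the following rough sketch. The grid values are hypothetical (the package's actual grid is not shown in this diff), and the sketch ignores model variance by assuming that every cell labeled 1 yields the same attribution ``s1`` and every cell labeled 0 yields ``s0``:

```python
def effective_positive_fraction(thresholds, p):
    """Fraction of grid cells in which a fuzzy sample with soft label p is hard-labeled 1."""
    return sum(1 for t in thresholds if t <= p) / len(thresholds)


# Hypothetical non-uniform threshold grid on (0, 1]; NOT the package's real grid.
thresholds = [0.1, 0.2, 0.3, 0.5, 1.0]
p = 0.6
frac1 = effective_positive_fraction(thresholds, p)
print(frac1)  # 0.8

# Made-up attributions of one feature under the two hard labels.
s0, s1 = -0.2, 0.4
threshold_estimate = frac1 * s1 + (1 - frac1) * s0   # weights the sample by frac1
interpolate_estimate = p * s1 + (1 - p) * s0         # weights the sample by exactly p
print(round(threshold_estimate, 6), round(interpolate_estimate, 6))  # 0.28 0.16
```

With a uniform grid the two weights would roughly coincide; the gap here comes entirely from the grid placing more thresholds below ``p`` than its share of the interval.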

**Per-round seeding (interpolate only)**

The constructor ``random_state`` is the initial seed, and ``'interpolate'`` re-seeds **each round** with
``random_state + round`` (round 0 -> ``random_state``, round 1 -> ``random_state + 1``, ...). So every round
fits a *different* model and ``n_rounds`` averages a Monte-Carlo mean over model variance, yet a fixed
``random_state`` gives the identical seed sequence and therefore an exactly reproducible result;
``random_state=None`` draws fresh entropy each round (truly random, non-reproducible). The ``'threshold'``
estimator and the non-fuzzy Monte-Carlo path do **not** re-seed per round — they bake ``random_state`` in once,
so their per-round variation comes from the threshold grid, not from the model seed.

**Choosing n_rounds for 'interpolate'**

Because each round re-seeds, ``n_rounds`` is a speed/stability dial:

* ``n_rounds=1`` -- the exact two-fit point estimate; fastest, but a single model draw (run-to-run spread ~20%
across seeds).
* ``n_rounds=5`` (default) -- adds light averaging (spread ~10%).
* ``n_rounds≈15-20`` -- the averaged estimate stabilizes (run-to-run spread and distance to the converged mean
fall below ~5% on the bundled ``DOM_GSEC`` gamma-secretase data, ~1/sqrt(n_rounds) decay). Use this for a
stable mean; with a fixed ``random_state`` any single run is exactly reproducible regardless.
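
The trade-off between ``n_rounds`` and stability can be illustrated with a small simulation. This is only a sketch: ``noisy_estimate`` is a hypothetical stand-in for one re-seeded fit (a true value of 1.0 plus Gaussian noise with an arbitrary standard deviation of 0.2), so the printed spreads say nothing about the percentages quoted above for real data.

```python
import random
import statistics


def noisy_estimate(seed):
    """Hypothetical stand-in for one re-seeded fit: true value 1.0 plus seed-dependent noise."""
    return 1.0 + random.Random(seed).gauss(0.0, 0.2)


def averaged_estimate(random_state, n_rounds):
    """Average n_rounds estimates seeded with random_state + round."""
    return statistics.fmean(noisy_estimate(random_state + i) for i in range(n_rounds))


def run_to_run_spread(n_rounds, n_runs=200):
    """Standard deviation of the averaged estimate across runs with different initial seeds."""
    # Space the initial seeds so the per-round seed ranges of different runs do not overlap.
    return statistics.stdev(averaged_estimate(1000 * run, n_rounds) for run in range(n_runs))


for n in (1, 5, 20):
    print(n, round(run_to_run_spread(n), 3))

# A fixed initial seed reproduces the same averaged estimate exactly.
print(averaged_estimate(42, 5) == averaged_estimate(42, 5))  # True
```

In this toy setting the spread shrinks roughly like ``1/sqrt(n_rounds)``, which is the same decay the note above reports, while reproducibility for a fixed seed holds at any ``n_rounds``.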

**Setting soft labels**

There are two equivalent ways to provide soft labels, both enabling fuzzy labeling:
@@ -554,6 +602,8 @@ def fit(self,
n_samples, n_feat = X.shape
ut.check_X_unique_samples(X=X, min_n_unique_samples=2)
ut.check_bool(name="fuzzy_labeling", val=fuzzy_labeling)
ut.check_str_options(name="fuzzy_aggregation", val=fuzzy_aggregation,
list_str_options=["threshold", "interpolate"])
if fuzzy_labels is not None:
# Entry-keyed soft labels override 'labels' and enable fuzzy labeling
check_match_df_seq_X(df_seq=df_seq, X=X)
@@ -571,17 +621,25 @@ def fit(self,
ut.check_number_range(name="n_background_data", val=n_background_data, min_val=1, just_int=True, accept_none=True)
check_match_n_background_data_X(n_background_data=n_background_data, X=X)
# Compute SHAP values
shap_values, exp_val = monte_carlo_shap_estimation(X, labels=labels,
list_model_classes=self._list_model_classes,
list_model_kwargs=self._list_model_kwargs,
explainer_class=self._explainer_class,
explainer_kwargs=self._explainer_kwargs,
is_selected=is_selected,
fuzzy_labeling=fuzzy_labeling,
n_rounds=n_rounds,
verbose=self._verbose,
label_target_class=label_target_class,
n_background_data=n_background_data)
backend_args = dict(list_model_classes=self._list_model_classes,
list_model_kwargs=self._list_model_kwargs,
explainer_class=self._explainer_class,
explainer_kwargs=self._explainer_kwargs,
is_selected=is_selected,
n_rounds=n_rounds,
verbose=self._verbose,
label_target_class=label_target_class,
n_background_data=n_background_data)
if fuzzy_labeling and fuzzy_aggregation == "interpolate":
# Only the interpolate path threads 'random_state' explicitly: it re-seeds per round
# (random_state + round). The threshold path keeps the seed baked into the model kwargs.
shap_values, exp_val = interpolate_fuzzy_shap_estimation(X, labels=labels,
random_state=self._random_state,
**backend_args)
else:
shap_values, exp_val = monte_carlo_shap_estimation(X, labels=labels,
fuzzy_labeling=fuzzy_labeling,
**backend_args)
self.shap_values = shap_values
self.exp_value = exp_val
return self
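
The branch above can be summarized as a small decision function. This is a standalone sketch: ``choose_backend`` is a hypothetical name that only returns which backend the branch would call, and it takes the effective value of ``fuzzy_labeling`` after validation.

```python
def choose_backend(fuzzy_labeling, fuzzy_aggregation="interpolate"):
    """Return the name of the backend that the dispatch in fit would call."""
    if fuzzy_aggregation not in ("threshold", "interpolate"):
        raise ValueError("'fuzzy_aggregation' should be 'threshold' or 'interpolate'")
    if fuzzy_labeling and fuzzy_aggregation == "interpolate":
        return "interpolate_fuzzy_shap_estimation"
    return "monte_carlo_shap_estimation"


print(choose_backend(fuzzy_labeling=True))                                  # interpolate_fuzzy_shap_estimation
print(choose_backend(fuzzy_labeling=True, fuzzy_aggregation="threshold"))   # monte_carlo_shap_estimation
print(choose_backend(fuzzy_labeling=False))                                 # monte_carlo_shap_estimation
```

Note that the ``'interpolate'`` default only takes effect once fuzzy labeling is active; without it, the non-fuzzy Monte-Carlo path is used regardless of ``fuzzy_aggregation``.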
9 changes: 9 additions & 0 deletions docs/source/index/release_notes.rst
@@ -85,6 +85,15 @@ Added
``add_feat_impact`` / ``add_sample_mean_dif`` accept ``df_seq`` and a ``samples``
parameter taking row positions or entry names. The array-``labels`` path is unchanged;
``sample_positions`` is a deprecated alias for ``samples`` (removed in 1.2.0).
- **ShapModel — unbiased fuzzy estimator, now the default** (``[pro]``): ``fit`` gains
``fuzzy_aggregation``, defaulting to the new ``'interpolate'`` estimator. It weights a
soft label by *exactly* ``p`` — fitting at 0 (``S0``) and at 1 (``S1``) and blending
``p * S1 + (1 - p) * S0`` — the unbiased alternative to the biased threshold sweep, which
stays available as a first-class option via ``fuzzy_aggregation='threshold'``. For
``interpolate``, ``n_rounds`` (default ``5``) is a speed/stability dial: ``1`` is the fast
exact two-fit estimate (~2x faster than the threshold default on the same cell), ``5`` adds
light Monte-Carlo averaging, and the mean converges (run-to-run spread below ~5%) around
``n_rounds ≈ 15–20``; a fixed ``random_state`` keeps every run reproducible.

**Sequence Analysis**
