Explore symbolic-regression surrogate for interpretable model-fitness rules (XAI) #265

## Problem

The models that drive ML-guided design — `TreeModel` / `ShapModel` and the forthcoming `SeqOpt` fitness (model `delta_pred`) — are effectively black boxes at the *score* level. CPP gives feature-level interpretability (which `PART-SPLIT-SCALE` features matter, with `mean_dif` / `feat_importance`), but there is **no compact, human-readable rule** that says *how* a sequence's feature values combine into the predicted score. For rational design and trust, a short symbolic expression — e.g. `score ≈ 0.7·hydrophobicity(TMD) − 0.3·charge(JMD_C) + …` — would let a user reason about and hand-tune a sequence without running the optimizer.

This is a distinct paradigm from sequence optimization: **genetic programming / symbolic regression evolves a *formula/program*, not a sequence**. It surfaced while scoping `SeqOpt` (#261) as the only place `deap.gp` could tangentially appear, and was explicitly kept out of that optimizer.

## Goal

Provide an **optional, interpretable symbolic surrogate**: a compact formula over a handful of CPP features that approximates a fitted model's predicted score (or `delta_pred`) and is reported together with a measured held-out fidelity score. It is an **explainability add-on**, independent of `SeqOpt`'s sequence optimization.

## Requirements

- [ ] A symbolic-regression fitter (e.g. `gplearn.SymbolicRegressor` or `deap.gp`) that learns a compact expression mapping `X` (CPP feature matrix from `SequenceFeature.feature_matrix`) to `y` (`model.predict_proba` target-class score, or `delta_pred`). Likely home: `explainable_ai_pro/` (heavy/optional dep).
- [ ] **Complexity bound for readability** — cap expression depth / number of terms / primitive set (`+ − × ÷`, maybe `min/max`), so the output is human-legible, not a bloated tree.
- [ ] Return the expression (string + structured form), the features it uses, a **fidelity score** (R² / correlation to the model output on held-out data), and a parsimony measure.
- [ ] Reproducibility: full `random_state`/`seed` threading (no global RNG state).
- [ ] A plotting/format helper to render the formula and its fit (predicted-vs-model scatter).
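
To make the requirements above concrete, here is a minimal, dependency-free sketch of the intended contract. It is a toy seeded *random search* over depth-capped expression trees, standing in for `gplearn.SymbolicRegressor` or `deap.gp`; it is not genetic programming and not a proposed implementation. All names (`fit_surrogate`, `SurrogateResult`, the feature names, the synthetic target) are hypothetical, and real CPP feature names would need mapping to plain identifiers.

```python
import random
from dataclasses import dataclass

def pdiv(a, b):
    # Protected division (common GP convention): avoid dividing by ~0.
    return a / b if abs(b) > 1e-9 else 1.0

# Restricted primitive set keeps expressions legible.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": pdiv}

def random_expr(rng, n_features, max_depth):
    """Grow a tree of ('op', left, right), ('x', index) or ('c', value) nodes."""
    if max_depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_features))
        return ("c", round(rng.uniform(-1, 1), 2))
    op = rng.choice(sorted(OPS))
    return (op, random_expr(rng, n_features, max_depth - 1),
            random_expr(rng, n_features, max_depth - 1))

def evaluate(expr, row):
    if expr[0] == "x":
        return row[expr[1]]
    if expr[0] == "c":
        return expr[1]
    return OPS[expr[0]](evaluate(expr[1], row), evaluate(expr[2], row))

def to_string(expr, names):
    if expr[0] == "x":
        return names[expr[1]]
    if expr[0] == "c":
        return repr(expr[1])
    return f"({to_string(expr[1], names)} {expr[0]} {to_string(expr[2], names)})"

def used_features(expr, names):
    if expr[0] == "x":
        return {names[expr[1]]}
    if expr[0] == "c":
        return set()
    return used_features(expr[1], names) | used_features(expr[2], names)

def r2_score(y_true, y_pred):
    # Assumes y_true is not constant (otherwise ss_tot would be zero).
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) * (y - mean) for y in y_true)
    ss_res = sum((y - p) * (y - p) for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

@dataclass
class SurrogateResult:
    expression: str     # human-readable formula
    features: list      # feature names the formula actually uses
    r2_holdout: float   # fidelity to the model score on held-out rows

def fit_surrogate(X_train, y_train, X_test, y_test, names,
                  max_depth=3, n_candidates=2000, seed=0):
    rng = random.Random(seed)  # local RNG: no global random state is touched
    best, best_r2 = None, float("-inf")
    for _ in range(n_candidates):
        cand = random_expr(rng, len(names), max_depth)
        r2 = r2_score(y_train, [evaluate(cand, row) for row in X_train])
        if r2 > best_r2:
            best, best_r2 = cand, r2
    holdout = r2_score(y_test, [evaluate(best, row) for row in X_test])
    return SurrogateResult(to_string(best, names),
                           sorted(used_features(best, names)), holdout)

# Synthetic stand-in for (CPP feature matrix, model score); not real data.
data_rng = random.Random(1)
NAMES = ["hydrophobicity_TMD", "charge_JMD_C", "volume_JMD_N"]
X = [[data_rng.random() for _ in NAMES] for _ in range(120)]
y = [0.7 * row[0] - 0.3 * row[1] for row in X]
X_tr, X_te, y_tr, y_te = X[:80], X[80:], y[:80], y[80:]

result = fit_surrogate(X_tr, y_tr, X_te, y_te, NAMES, seed=42)
print(result.expression, result.features, round(result.r2_holdout, 3))
```

The depth cap is enforced by construction in `random_expr`, and fidelity is always scored on rows the search never saw. A real fitter would swap the random search for evolutionary operators while keeping the same return shape and seed handling.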

## KPIs / Acceptance criteria

- [ ] On the canonical dataset, the surrogate reproduces the model's target-class score with **held-out R² ≥ 0.80** (or a documented band) — measurably faithful, not decorative.
- [ ] The returned expression is bounded to **≤ N terms and depth ≤ D** (e.g. ≤ 8 terms, depth ≤ 4) — verified programmatically.
- [ ] Deterministic: two fits with the same `random_state` produce the identical expression.
- [ ] Covered by ≥1 end-to-end test; one example notebook with embedded outputs.
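
The "verified programmatically" criterion could be checked along these lines. This is a sketch under stated assumptions: the surrogate's expression is available as a Python-parsable string, "terms" means top-level additive terms, and "depth" means operator nesting depth. Those definitions, the helper names, and the placeholder feature names are illustrative choices, not settled parts of the design.

```python
import ast

ALLOWED_OPS = (ast.Add, ast.Sub, ast.Mult, ast.Div)

def expression_depth(node):
    """Operator nesting depth; a bare name or constant has depth 0."""
    if isinstance(node, ast.BinOp):
        if not isinstance(node.op, ALLOWED_OPS):
            raise ValueError(f"operator not in primitive set: {type(node.op).__name__}")
        return 1 + max(expression_depth(node.left), expression_depth(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return expression_depth(node.operand)  # a leading minus adds no depth
    if isinstance(node, (ast.Name, ast.Constant)):
        return 0
    raise ValueError(f"unsupported syntax: {type(node).__name__}")

def count_terms(node):
    """Number of top-level additive terms, e.g. a*b + c - d has 3."""
    if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Sub)):
        return count_terms(node.left) + count_terms(node.right)
    return 1

def check_bounds(expr, max_terms=8, max_depth=4):
    """Parse the formula and report whether it respects the readability caps."""
    tree = ast.parse(expr, mode="eval").body
    n_terms, depth = count_terms(tree), expression_depth(tree)
    return {"n_terms": n_terms, "depth": depth,
            "ok": n_terms <= max_terms and depth <= max_depth}

# Placeholder formula in the spirit of the example in the Problem section.
report = check_bounds("0.7 * hydrophobicity_TMD - 0.3 * charge_JMD_C + 0.1")
print(report)  # {'n_terms': 3, 'depth': 3, 'ok': True}
```

The determinism criterion can be tested the same way: fit twice with the same `random_state` and assert that the two expression strings are equal.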

## Scope / non-goals

- **Not** part of `SeqOpt` (#261) — that optimizes *sequences*; this distills an *interpretable rule* from a fixed model. Different representation, different operators.
- Heavy/optional dependency (`gplearn` or `deap`) → **`pro` or a dedicated extra**; the dependency + placement decision is **CONFIRM-FIRST** (`pyproject.toml`). May instead belong **downstream in ProtXplain** if it reads as an agent/explanation concern rather than a core library primitive — settle in the grill.
- Exploratory / low priority: validate that a faithful *and* compact surrogate is even achievable on real CPP features before committing to a public class.

## Dependencies

- Relates to #57 (model-aware design) and #261 (`SeqOpt` fitness): provides an interpretable surrogate of the *same* model score and is independent of both (does not block / is not blocked).
