Problem
The models that drive ML-guided design, TreeModel / ShapModel and the forthcoming SeqOpt fitness (model delta_pred), are effectively black boxes at the score level. CPP gives feature-level interpretability (which PART-SPLIT-SCALE features matter, with mean_dif / feat_importance), but there is no compact, human-readable rule that says how a sequence's feature values combine into the predicted score. For rational design and trust, a short symbolic expression, e.g. score ≈ 0.7·hydrophobicity(TMD) − 0.3·charge(JMD_C) + …, would let a user reason about and hand-tune a sequence without running the optimizer.
This is a distinct paradigm from sequence optimization: genetic programming / symbolic regression evolves a formula/program, not a sequence. The idea surfaced while scoping SeqOpt (#261) as the only place deap.gp could tangentially appear, and was explicitly kept out of that optimizer.
Goal
Provide an optional, interpretable symbolic surrogate — a compact formula over a handful of CPP features that approximates a fitted model's predicted score (or delta_pred), with a stated fidelity guarantee — as an explainability add-on, independent of SeqOpt's sequence optimization.
Requirements
A symbolic-regression fitter (e.g. gplearn.SymbolicRegressor or deap.gp) that learns a compact expression mapping X (CPP feature matrix from SequenceFeature.feature_matrix) to y (model.predict_proba target-class score, or delta_pred). Likely home: explainable_ai_pro/ (heavy/optional dep).
Complexity bound for readability — cap expression depth / number of terms / primitive set (+ − × ÷, maybe min/max), so the output is human-legible, not a bloated tree.
Return the expression (string + structured form), the features it uses, a fidelity score (R² or correlation with the model output on held-out data), and a parsimony measure.
Reproducibility: full random_state/seed threading (no global RNG state).
A plotting/format helper to render the formula and its fit (predicted-vs-model scatter).
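As a rough illustration of the requirements above, here is a minimal sketch of a depth-bounded, seeded surrogate fitter. It uses plain random search over expression trees as a stand-in for genetic programming (it is not gplearn or deap.gp). The nested-tuple expression format, the function names, the placeholder feature names and the synthetic data are all assumptions made for this example, not part of any existing API.

```python
import random

# Deliberately small primitive set (no division, so no zero-division handling).
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}
OP_NAMES = sorted(OPS)  # fixed order of candidate operators


def random_expr(rng, n_features, max_depth):
    """Random tree as nested tuples: ("x", i), ("c", value) or (op, left, right)."""
    if max_depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_features))
        return ("c", round(rng.uniform(-1.0, 1.0), 2))
    op = rng.choice(OP_NAMES)
    left = random_expr(rng, n_features, max_depth - 1)
    right = random_expr(rng, n_features, max_depth - 1)
    return (op, left, right)


def evaluate(expr, row):
    """Evaluate an expression on one row of feature values."""
    kind = expr[0]
    if kind == "x":
        return row[expr[1]]
    if kind == "c":
        return expr[1]
    return OPS[kind](evaluate(expr[1], row), evaluate(expr[2], row))


def depth(expr):
    """Nesting depth of operators; a bare leaf has depth 0."""
    if expr[0] in ("x", "c"):
        return 0
    return 1 + max(depth(expr[1]), depth(expr[2]))


def features_used(expr):
    """Set of feature indices referenced by the expression."""
    if expr[0] == "x":
        return {expr[1]}
    if expr[0] == "c":
        return set()
    return features_used(expr[1]) | features_used(expr[2])


def to_string(expr, names):
    """Human-readable string form of the structured expression."""
    kind = expr[0]
    if kind == "x":
        return names[expr[1]]
    if kind == "c":
        return repr(expr[1])
    return f"({to_string(expr[1], names)} {kind} {to_string(expr[2], names)})"


def mse(expr, X, y):
    """Mean squared error between the expression and the target scores."""
    return sum((evaluate(expr, r) - t) ** 2 for r, t in zip(X, y)) / len(y)


def fit_surrogate(X, y, max_depth=3, n_candidates=2000, random_state=0):
    """Seeded random search: keep the candidate with the lowest MSE."""
    rng = random.Random(random_state)  # local RNG, global state is untouched
    best, best_err = None, float("inf")
    for _ in range(n_candidates):
        cand = random_expr(rng, len(X[0]), max_depth)
        err = mse(cand, X, y)
        if err < best_err:
            best, best_err = cand, err
    return best, best_err


# Synthetic stand-in for (X, y): y plays the role of the model score.
data_rng = random.Random(0)
X = [[data_rng.random() for _ in range(3)] for _ in range(40)]
y = [0.7 * row[0] - 0.3 * row[1] for row in X]
names = ["f0", "f1", "f2"]  # placeholder feature names

expr, err = fit_surrogate(X, y, max_depth=3, n_candidates=2000, random_state=42)
print(to_string(expr, names), "| depth:", depth(expr), "| train MSE:", round(err, 4))
```

Because all randomness flows through a local random.Random instance, two calls with the same random_state return the same expression, and the depth cap is enforced by construction rather than by post-hoc pruning. A real implementation would likely swap the random search for an evolutionary loop and fit constants more carefully.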
KPIs / Acceptance criteria
On the canonical dataset, the surrogate reproduces the model's target-class score with held-out R² ≥ 0.80 (or a documented band) — measurably faithful, not decorative.
The returned expression is bounded to ≤ N terms and depth ≤ D (e.g. ≤ 8 terms, depth ≤ 4) — verified programmatically.
Deterministic: two fits with the same random_state produce the identical expression.
Covered by ≥1 end-to-end test; one example notebook with embedded outputs.
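The bound and fidelity criteria could be verified programmatically along the following lines. This is only a sketch: the nested-tuple expression format, the choice to count leaves as "terms", and the function names are assumptions made for illustration, and the score vectors are made-up numbers rather than results from any dataset.

```python
def n_terms(expr):
    """Count leaves (feature references and constants); assumed meaning of 'terms'."""
    if expr[0] in ("x", "c"):
        return 1
    return n_terms(expr[1]) + n_terms(expr[2])


def depth(expr):
    """Nesting depth of operators; a bare leaf has depth 0."""
    if expr[0] in ("x", "c"):
        return 0
    return 1 + max(depth(expr[1]), depth(expr[2]))


def r2(y_true, y_pred):
    """Coefficient of determination of y_pred against y_true."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant target")
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def acceptance_report(expr, y_model, y_surrogate,
                      max_terms=8, max_depth=4, min_r2=0.80):
    """Check the size bounds and the held-out fidelity threshold in one place."""
    fidelity = r2(y_model, y_surrogate)
    terms, d = n_terms(expr), depth(expr)
    return {
        "n_terms": terms,
        "depth": d,
        "r2": fidelity,
        "within_bounds": terms <= max_terms and d <= max_depth,
        "faithful": fidelity >= min_r2,
    }


# Illustrative expression: (0.7 * x0) - (0.3 * x1)
expr = ("-", ("*", ("c", 0.7), ("x", 0)), ("*", ("c", 0.3), ("x", 1)))

# Made-up held-out scores: model output vs. surrogate output.
y_model = [0.10, 0.40, 0.35, 0.80]
y_surrogate = [0.12, 0.38, 0.30, 0.82]

report = acceptance_report(expr, y_model, y_surrogate)
print(report)
```

A test suite could assert on within_bounds and faithful directly, and cover determinism separately by fitting twice with the same random_state and comparing the returned expressions for equality.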
Scope / non-goals
SeqOpt (Add SeqOpt: multi-objective ML-guided directed-evolution optimizer (NSGA-II reimplementation), #261) optimizes sequences; this distills an interpretable rule from a fixed model. Different representation, different operators.
Heavy/optional dependency (gplearn or deap) → pro or a dedicated extra; the dependency + placement decision is CONFIRM-FIRST (pyproject.toml). May instead belong downstream in ProtXplain if it reads as an agent/explanation concern rather than a core library primitive; settle in the grill.
Exploratory / low priority: validate that a faithful and compact surrogate is even achievable on real CPP features before committing to a public class.
Dependencies
… SeqOpt fitness): provides an interpretable surrogate of the same model score; independent of both (does not block / is not blocked).