Explore symbolic-regression surrogate for interpretable model-fitness rules (XAI) #265

Description

@breimanntools

Problem

The models that drive ML-guided design — TreeModel / ShapModel and the forthcoming SeqOpt fitness (model delta_pred) — are effectively black boxes at the score level. CPP gives feature-level interpretability (which PART-SPLIT-SCALE features matter, with mean_dif / feat_importance), but there is no compact, human-readable rule that says how a sequence's feature values combine into the predicted score. For rational design and trust, a short symbolic expression — e.g. score ≈ 0.7·hydrophobicity(TMD) − 0.3·charge(JMD_C) + … — would let a user reason about and hand-tune a sequence without running the optimizer.

This is a distinct paradigm from sequence optimization: genetic programming / symbolic regression evolves a formula/program, not a sequence. It surfaced while scoping SeqOpt (#261) as the only place deap.gp could tangentially appear, and was explicitly kept out of that optimizer.

Goal

Provide an optional, interpretable symbolic surrogate — a compact formula over a handful of CPP features that approximates a fitted model's predicted score (or delta_pred), with a stated fidelity guarantee — as an explainability add-on, independent of SeqOpt's sequence optimization.

Requirements

  • A symbolic-regression fitter (e.g. gplearn.SymbolicRegressor or deap.gp) that learns a compact expression mapping X (CPP feature matrix from SequenceFeature.feature_matrix) to y (model.predict_proba target-class score, or delta_pred). Likely home: explainable_ai_pro/ (heavy/optional dep).
  • Complexity bound for readability — cap expression depth / number of terms / primitive set (+ − × ÷, maybe min/max), so the output is human-legible, not a bloated tree.
  • Return the expression (string + structured form), the features it uses, a fidelity score (R² / correlation to the model output on held-out data), and a parsimony measure.
  • Reproducibility: full random_state/seed threading (no global RNG state).
  • A plotting/format helper to render the formula and its fit (predicted-vs-model scatter).
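
To make the "string + structured form" and complexity-bound requirements concrete, here is a minimal sketch of one possible expression representation. It is illustrative only: the class and function names (`Feature`, `Const`, `BinOp`, `within_bounds`, ...) are hypothetical and not part of AAanalysis, gplearn, or deap; the feature names are taken from the example formula in the Problem section; and counting "terms" as top-level additive terms is an assumption about how the bound would be defined.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping, Union


@dataclass(frozen=True)
class Feature:
    name: str  # e.g. a CPP feature label (hypothetical naming)


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class BinOp:
    op: str  # one of "+", "-", "*", "/"
    left: "Expr"
    right: "Expr"


Expr = Union[Feature, Const, BinOp]


def depth(e: Expr) -> int:
    """Number of nested operators on the longest path (leaves have depth 0)."""
    if isinstance(e, BinOp):
        return 1 + max(depth(e.left), depth(e.right))
    return 0


def n_terms(e: Expr) -> int:
    """Count additive terms by splitting on top-level + and -."""
    if isinstance(e, BinOp) and e.op in ("+", "-"):
        return n_terms(e.left) + n_terms(e.right)
    return 1


def features_used(e: Expr) -> set[str]:
    if isinstance(e, Feature):
        return {e.name}
    if isinstance(e, BinOp):
        return features_used(e.left) | features_used(e.right)
    return set()


def to_string(e: Expr) -> str:
    if isinstance(e, Feature):
        return e.name
    if isinstance(e, Const):
        return f"{e.value:g}"
    return f"({to_string(e.left)} {e.op} {to_string(e.right)})"


def evaluate(e: Expr, row: Mapping[str, float]) -> float:
    if isinstance(e, Feature):
        return row[e.name]
    if isinstance(e, Const):
        return e.value
    a, b = evaluate(e.left, row), evaluate(e.right, row)
    if e.op == "+":
        return a + b
    if e.op == "-":
        return a - b
    if e.op == "*":
        return a * b
    if e.op == "/":
        # Protected division (assumed convention): avoid blow-ups near zero.
        return a / b if abs(b) > 1e-9 else 1.0
    raise ValueError(f"unknown operator: {e.op!r}")


def within_bounds(e: Expr, max_terms: int = 8, max_depth: int = 4) -> bool:
    """Programmatic check of the readability bound (defaults mirror the KPI example)."""
    return n_terms(e) <= max_terms and depth(e) <= max_depth


# score ≈ 0.7·hydrophobicity(TMD) − 0.3·charge(JMD_C)
expr = BinOp(
    "-",
    BinOp("*", Const(0.7), Feature("hydrophobicity(TMD)")),
    BinOp("*", Const(0.3), Feature("charge(JMD_C)")),
)
print(to_string(expr), depth(expr), n_terms(expr), sorted(features_used(expr)))
```

A real fitter would produce its own tree type; a thin adapter into a structure like this would keep the bound check and the rendering independent of the chosen backend.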

KPIs / Acceptance criteria

  • On the canonical dataset, the surrogate reproduces the model's target-class score with held-out R² ≥ 0.80 (or a documented band) — measurably faithful, not decorative.
  • The returned expression is bounded to ≤ N terms and depth ≤ D (e.g. ≤ 8 terms, depth ≤ 4) — verified programmatically.
  • Deterministic: two fits with the same random_state produce the identical expression.
  • Covered by ≥1 end-to-end test; one example notebook with embedded outputs.
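
The fidelity and determinism criteria above could be checked along the following lines. This is a hedged sketch, not the proposed implementation: `fit_toy_surrogate` is a hypothetical stand-in (a seeded random search over sparse linear formulas, not gplearn or deap), and the data are synthetic "model scores" generated from a known two-feature rule rather than real CPP features. What it illustrates is the shape of the test: a local RNG threaded from `random_state`, a held-out R², and an equality check between two fits.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination of y_pred against y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def predict(formula, X):
    """Evaluate a sparse linear formula (feature indices, coefficients) on rows of X."""
    idx, coefs = formula
    return [sum(c * row[j] for c, j in zip(coefs, idx)) for row in X]


def fit_toy_surrogate(X, y, n_terms=2, n_iter=3000, random_state=0):
    """Toy stand-in for a symbolic-regression fitter (hypothetical helper).

    Random search over formulas with `n_terms` features and coefficients on a
    0.1 grid. All randomness comes from a local RNG, so the global `random`
    state is never touched.
    """
    rng = random.Random(random_state)
    best, best_err = None, float("inf")
    for _ in range(n_iter):
        idx = tuple(sorted(rng.sample(range(len(X[0])), n_terms)))
        coefs = tuple(round(rng.uniform(-1.0, 1.0), 1) for _ in idx)
        pred = predict((idx, coefs), X)
        err = sum((p - t) ** 2 for p, t in zip(pred, y))
        if err < best_err:
            best, best_err = (idx, coefs), err
    return best


# Synthetic stand-in for (feature matrix, model score); the rule is an assumption.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(300)]
y = [0.7 * row[0] - 0.3 * row[2] + data_rng.gauss(0.0, 0.02) for row in X]
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

# Two fits with the same random_state should yield the identical formula.
formula_a = fit_toy_surrogate(X_train, y_train, random_state=7)
formula_b = fit_toy_surrogate(X_train, y_train, random_state=7)

# Fidelity is measured on held-out rows only.
fidelity = r2_score(y_test, predict(formula_a, X_test))

deterministic = formula_a == formula_b
faithful = fidelity >= 0.80
print(formula_a, round(fidelity, 3), deterministic, faithful)
```

In an end-to-end test the synthetic `y` would be replaced by the fitted model's target-class score (or delta_pred), and the toy fitter by whichever backend is chosen; the assertions themselves would stay the same.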

Scope / non-goals

  • Not part of SeqOpt (#261, "Add SeqOpt: multi-objective ML-guided directed-evolution optimizer (NSGA-II reimplementation)"): that optimizes sequences; this distills an interpretable rule from a fixed model. Different representation, different operators.
  • Heavy/optional dependency (gplearn or deap) → pro or a dedicated extra; the dependency + placement decision is CONFIRM-FIRST (pyproject.toml). May instead belong downstream in ProtXplain if it reads as an agent/explanation concern rather than a core library primitive — settle in the grill.
  • Exploratory / low priority: validate that a faithful and compact surrogate is even achievable on real CPP features before committing to a public class.

Dependencies

Metadata

Assignees

No one assigned

    Labels

    prio:3 (Still important), topic:XAI (Explainability methods integrated into AAanalysis), type:feature (Implementation of feature)
