Add a paper-fidelity model-training engine (nested-CV Monte-Carlo + ensembles) behind predict_samples #276

Description

@breimanntools

Problem

aap.predict_samples is a deliberately thin multi-model comparison harness: for each
(feature set × model) it runs one cross_validate + refit over the estimators the user brings.
It does not reproduce the full machine-learning training protocol from the AAanalysis
γ-secretase paper (Breimann et al.) — that protocol is a substantial training engine with
install-fragile dependencies, which the golden-pipeline "thin wrapper, no own algorithm"
convention keeps out of a one-call pipeline. Users who want paper-faithful predictions (robust
aggregated prediction scores, tuned hyperparameters, ensembles) must currently hand-wire it.

Goal

Provide a core training engine that faithfully reproduces the paper's protocol, which
predict_samples can optionally delegate to via an opt-in flag — so the thin default stays
core-sklearn and light, while paper-fidelity is one argument away.
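
A minimal sketch of what the opt-in delegation could look like. Every name here except `predict_samples` and the `engine="paper"` value is hypothetical (`predict_samples_sketch`, `_run_thin`, `_run_paper`, `_ENGINES`, `n_rounds`), and the two runners are placeholders rather than real implementations:

```python
from typing import Callable, Dict


def _run_thin(X, y, random_state=None):
    # Placeholder for the existing path: one cross_validate + refit
    # per (feature set x model). Hypothetical helper.
    return {"engine": "thin", "random_state": random_state}


def _run_paper(X, y, random_state=None, n_rounds=25):
    # Placeholder for the proposed paper-fidelity engine. Hypothetical helper.
    return {"engine": "paper", "random_state": random_state, "n_rounds": n_rounds}


# Registry of available engines; "thin" stays the default.
_ENGINES: Dict[str, Callable] = {"thin": _run_thin, "paper": _run_paper}


def predict_samples_sketch(X, y, engine="thin", random_state=None, **engine_kwargs):
    """Dispatch to the selected engine; the default path is untouched."""
    if engine not in _ENGINES:
        raise ValueError(f"engine must be one of {sorted(_ENGINES)}, got {engine!r}")
    return _ENGINES[engine](X, y, random_state=random_state, **engine_kwargs)


default_result = predict_samples_sketch([], [])
paper_result = predict_samples_sketch([], [], engine="paper", random_state=0)
```

Keeping the dispatch in a small registry means the thin path never imports the heavy engine unless the caller asks for it.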

Requirements

  • 10 model types: random forest, extra trees, xgboost, catboost, LDA, logistic
    regression, SVM, MLP, plus voting and stacking (SVM meta-model) ensembles.
  • Monte-Carlo training: N independent rounds (default 25), each with its own balanced 80/20 train/test split.
  • Nested cross-validation: inner 5-fold for feature selection + GridSearchCV hyperparameter
    tuning; outer 20% hold-out for independent per-round scoring.
  • Two-stage feature pre-selection: Pearson-correlation filter (top-k × threshold grid) →
    random-forest-importance stepwise elimination by 5-fold F1 down to a floor (~25–50 features).
  • Aggregated prediction score: mean predicted probability across models × rounds, with its std.
  • Metrics: balanced accuracy (headline), accuracy, F1, precision, recall, TNR.
  • Class imbalance is handled via balanced splits (+ optional resampling); balanced accuracy is the
    imbalance-aware metric.
  • xgboost / catboost gated behind a new optional extra (needs maintainer approval — touches
    pyproject.toml) so the core install stays light.
  • random_state threaded through end to end (reproducibility contract).
  • predict_samples gains an opt-in path (e.g. engine="paper") that delegates to this engine;
    the thin default path is unchanged.
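
The Monte-Carlo, nested-CV, and aggregated-score requirements above can be sketched with core scikit-learn as follows. This is an illustration under stated assumptions, not the paper's implementation: `monte_carlo_nested_cv` is a hypothetical name, a stratified split stands in for the "balanced" split, the two-stage feature pre-selection is omitted, the hyperparameter grids are placeholders (not those from Supplementary Data 10), and the population std (`ddof=0`) is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split


def monte_carlo_nested_cv(X, y, X_new, models, n_rounds=25, random_state=0):
    """Return per-(round, model) balanced accuracies and the aggregated
    prediction score (mean and std of P(class 1)) for each row of X_new."""
    rng = np.random.RandomState(random_state)
    round_scores, new_probas = [], []
    for _ in range(n_rounds):
        # One seed per round, derived from the top-level random_state.
        seed = int(rng.randint(0, 2**31 - 1))
        # Outer 80/20 hold-out (stratified here; an assumption, see lead-in).
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed
        )
        # Inner 5-fold CV used only for hyperparameter tuning.
        inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        for estimator, grid in models:
            search = GridSearchCV(estimator, grid, cv=inner_cv, scoring="balanced_accuracy")
            search.fit(X_tr, y_tr)
            # Independent per-round score on the untouched 20%.
            round_scores.append(balanced_accuracy_score(y_te, search.predict(X_te)))
            # Probability of the positive class for the samples to predict.
            new_probas.append(search.predict_proba(X_new)[:, 1])
    new_probas = np.vstack(new_probas)  # shape: (n_rounds * n_models, n_new)
    return np.array(round_scores), new_probas.mean(axis=0), new_probas.std(axis=0)


# Toy data and placeholder grids, for illustration only.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)
models = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
    (RandomForestClassifier(n_estimators=25, random_state=0), {"max_depth": [2, None]}),
]
scores, agg_mean, agg_std = monte_carlo_nested_cv(
    X, y, X[:5], models, n_rounds=3, random_state=42
)
```

Because every split and fold seed is derived from the single `random_state`, and the estimators that use randomness are seeded, rerunning with the same arguments should give the same aggregated score.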

KPIs / Acceptance criteria

  • On DOM_GSEC, the engine reproduces the paper's headline performance within a documented
    tolerance on the matched dataset/annotation.
  • The aggregated prediction score is reproducible for a fixed random_state (same seed → same
    score ± std).
  • Per-round and aggregate df_eval carry all six metrics as mean ± std.
  • ≥30 unit tests for the new primitive (per the testing standard), a reproducibility test, and an
    executed example notebook that passes nbmake.
  • pyright clean on the new public surface; numpydoc docstrings with an example include.
  • The new extra is documented in pyproject.toml, _EXTRA_MODULES, and the install docs.
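
For the six-metric criterion, a dependency-free sketch of how per-round metrics and their mean ± std could be computed. The function names and the dict layout are hypothetical (the real df_eval would presumably be a DataFrame), labels are assumed to be 0/1 with 1 as the test class, undefined ratios are set to 0.0 by assumption, and the population std is again an arbitrary choice:

```python
from statistics import mean, pstdev


def binary_metrics(y_true, y_pred):
    """Compute the six metrics from the binary confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Assumption: report 0.0 when a ratio is undefined.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)   # true positive rate
    tnr = ratio(tn, tn + fp)      # true negative rate
    precision = ratio(tp, tp + fp)
    return {
        "balanced_accuracy": (recall + tnr) / 2,
        "accuracy": ratio(tp + tn, len(pairs)),
        "f1": ratio(2 * precision * recall, precision + recall),
        "precision": precision,
        "recall": recall,
        "tnr": tnr,
    }


def aggregate_rounds(per_round):
    """Collapse a list of per-round metric dicts into {metric: (mean, std)}."""
    return {
        name: (mean([r[name] for r in per_round]), pstdev([r[name] for r in per_round]))
        for name in per_round[0]
    }


y_true = [1, 1, 0, 0]
round_1 = binary_metrics(y_true, [1, 0, 0, 0])
round_2 = binary_metrics(y_true, [1, 1, 0, 1])
summary = aggregate_rounds([round_1, round_2])
```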

Scope / non-goals

  • Not in the thin predict_samples default path (stays core-sklearn, no heavy deps) — the
    engine is strictly opt-in.
  • No multiclass / regression targets — binary test-vs-reference, matching predict_samples.
  • No agent/MCP/tool contracts — those live downstream in ProtXplain.

Dependencies

  • Builds on predict_samples (the thin comparison harness) and find_features feature sets.
  • Requires approval for the new xgboost / catboost extra (touches pyproject.toml).
  • The exact per-model hyperparameter grids live in Supplementary Data 10 of the paper (an external
    spreadsheet, not in the manuscript text) — they must be transcribed when implementing the grids.
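
One possible way to gate xgboost / catboost behind the extra is a lazy import check, sketched below. The helper `require_optional`, the mapping `_OPTIONAL_MODULES`, and the extra name `paper` are hypothetical; the actual `_EXTRA_MODULES` structure and extra name are for the maintainer to decide:

```python
import importlib
import importlib.util

# Hypothetical mapping from top-level module name to the extra that ships it.
_OPTIONAL_MODULES = {"xgboost": "paper", "catboost": "paper"}


def require_optional(module_name, extra=None):
    """Import an optional top-level module, or raise with an install hint."""
    if importlib.util.find_spec(module_name) is None:
        extra = extra or _OPTIONAL_MODULES.get(module_name, "<extra>")
        raise ImportError(
            f"'{module_name}' is required for this model type. "
            f"Install it via: pip install aaanalysis[{extra}]"
        )
    return importlib.import_module(module_name)


# A stdlib module stands in for an installed optional dependency here.
json_module = require_optional("json")
```

Calling the check only when a model type that needs the module is requested keeps the core install and the thin default path free of the heavy dependencies.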

Standards checklist

  • New extra approved + added to pyproject.toml and _EXTRA_MODULES
  • Core primitive (Wrapper / Tool template) + thin opt-in delegation from predict_samples
  • ≥30 tests, reproducibility test, executed nbmake notebook
  • numpydoc docstrings with an example include; no internal decision-doc references in code/GitHub
  • df_eval schema documented
