Problem
aap.predict_samples is a deliberately thin multi-model comparison harness: for each
(feature set × model) it runs one cross_validate + refit over the estimators the user brings.
It does not reproduce the full machine-learning training protocol from the AAanalysis
γ-secretase paper (Breimann et al.) — that protocol is a substantial training engine with
install-fragile dependencies, which the golden-pipeline "thin wrapper, no own algorithm"
convention keeps out of a one-call pipeline. Users who want paper-faithful predictions (robust
aggregated prediction scores, tuned hyperparameters, ensembles) must currently hand-wire it.
Goal
Provide a core training engine that faithfully reproduces the paper's protocol, which
predict_samples can optionally delegate to via an opt-in flag — so the thin default stays
core-sklearn and light, while paper-fidelity is one argument away.
Requirements
- 10 model types: random forest, extra trees, xgboost, catboost, LDA, logistic
regression, SVM, MLP, plus voting and stacking (SVM meta-model) ensembles.
- Monte-Carlo training: N independent rounds (default 25) of balanced 80/20 train/test split.
- Nested cross-validation: inner 5-fold for feature selection + GridSearchCV hyperparameter
tuning; outer 20% hold-out for independent per-round scoring.
- Two-stage feature pre-selection: Pearson-correlation filter (top-k × threshold grid) →
random-forest-importance stepwise elimination by 5-fold F1 down to a floor (~25–50 features).
- Aggregated prediction score: mean predicted probability across models × rounds, with its std.
- Metrics: balanced accuracy (headline), accuracy, F1, precision, recall, TNR.
- Class imbalance via balanced splits (+ optional resampling); balanced accuracy is the
imbalance-aware metric.
- xgboost / catboost gated behind a new optional extra (needs maintainer approval, since it touches
pyproject.toml) so the core install stays light.
- random_state threaded through end to end (reproducibility contract).
- predict_samples gains an opt-in path (e.g. engine="paper") that delegates to this engine;
the thin default path is unchanged.
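The Monte-Carlo, nested-tuning, and aggregated-score requirements above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the paper's protocol: `run_rounds` is a hypothetical helper, a stratified 80/20 split stands in for the "balanced" split (whose exact definition is not given here), the grids are placeholders, and feature pre-selection is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split


def run_rounds(X, y, X_query, models, n_rounds=25, random_state=0):
    """Hypothetical helper: Monte-Carlo rounds with inner 5-fold tuning."""
    rng = np.random.RandomState(random_state)  # one seed drives every round
    probas, rows = [], []
    for i in range(n_rounds):
        seed = int(rng.randint(0, 2**31 - 1))
        # Outer 20% hold-out (assumption: stratified, not resampled).
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed
        )
        for name, (est, grid) in models.items():
            # Inner 5-fold grid search on the training part only.
            search = GridSearchCV(est, grid, cv=5, scoring="balanced_accuracy")
            search.fit(X_tr, y_tr)
            y_pred = search.predict(X_te)
            rows.append({
                "round": i,
                "model": name,
                "balanced_accuracy": balanced_accuracy_score(y_te, y_pred),
                # TNR is the recall of the negative (reference) class.
                "tnr": recall_score(y_te, y_pred, pos_label=0),
            })
            # Probability of the positive class for the samples to be scored.
            probas.append(search.predict_proba(X_query)[:, 1])
    probas = np.vstack(probas)  # shape: (n_rounds * n_models, n_query)
    return probas.mean(axis=0), probas.std(axis=0), rows


# Synthetic stand-in data; the last 5 samples play the role of query samples.
X_all, y_all = make_classification(n_samples=125, n_features=10, random_state=0)
X, y, X_query = X_all[:120], y_all[:120], X_all[120:]
models = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
    "rf": (RandomForestClassifier(n_estimators=20, random_state=0),
           {"max_depth": [2, 4]}),
}
score_mean, score_std, rows = run_rounds(
    X, y, X_query, models, n_rounds=3, random_state=42
)
```

Deriving each round's split seed from a single `random_state` is one way to honor the reproducibility contract; the real engine may thread the seed differently.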
KPIs / Acceptance criteria
- On DOM_GSEC, the engine reproduces the paper's headline performance within a documented
tolerance on the matched dataset/annotation.
- The aggregated prediction score is reproducible for a fixed random_state (same seed → same
score ± std).
- Per-round and aggregate df_eval carry all six metrics as mean ± std.
- ≥30 unit tests for the new primitive (per the testing standard), a reproducibility test, and an
executed example notebook that passes nbmake.
- pyright clean on the new public surface; numpydoc docstrings with an example include.
- The new extra is documented in pyproject.toml, _EXTRA_MODULES, and the install docs.
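To make the "six metrics as mean ± std" criterion concrete, the sketch below computes the metrics from confusion counts per round and aggregates them with pandas. The column names, the `six_metrics` helper, and the toy labels are assumptions for illustration; the actual df_eval schema is not specified here.

```python
import numpy as np
import pandas as pd


def six_metrics(y_true, y_pred):
    """Hypothetical helper: the six binary metrics from confusion counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    recall = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "balanced_accuracy": (recall + tnr) / 2,
        "accuracy": (tp + tn) / len(y_true),
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "tnr": tnr,
    }


# Toy (y_true, y_pred) pairs standing in for two hold-out rounds.
rounds = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 1, 0, 0], [1, 1, 0, 1]),
]
df_round = pd.DataFrame(
    [{"round": i, **six_metrics(t, p)} for i, (t, p) in enumerate(rounds)]
)
# One row per metric with mean and std columns (pandas std uses ddof=1).
df_agg = df_round.drop(columns="round").agg(["mean", "std"]).T
```

Whether the reported std should use ddof=0 or ddof=1 is a detail to settle against the paper when the schema is documented.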
Scope / non-goals
- Not in the thin predict_samples default path (stays core-sklearn, no heavy deps); the
engine is strictly opt-in.
- No multiclass / regression targets: binary test-vs-reference only, matching predict_samples.
- No agent/MCP/tool contracts — those live downstream in ProtXplain.
Dependencies
- Builds on predict_samples (the thin comparison harness) and find_features feature sets.
- Requires approval for the new xgboost / catboost extra (touches pyproject.toml).
- The exact per-model hyperparameter grids live in Supplementary Data 10 of the paper (an external
spreadsheet, not in the manuscript text) — they must be transcribed when implementing the grids.
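One possible shape for the optional-extra gate is sketched below, using only the standard library. The extra name `"paper"`, the `EXTRA_MODULES` mapping, and the `require_extra` helper are hypothetical; the project's real _EXTRA_MODULES layout and the final extra name may differ.

```python
import importlib.util

# Hypothetical mapping from extra name to the modules it provides.
EXTRA_MODULES = {"paper": ("xgboost", "catboost")}


def missing_modules(extra):
    """Return the modules of an extra that cannot be found in this environment."""
    return [m for m in EXTRA_MODULES[extra]
            if importlib.util.find_spec(m) is None]


def require_extra(extra, package="aaanalysis"):
    """Raise a helpful ImportError if any module of the extra is missing."""
    missing = missing_modules(extra)
    if missing:
        raise ImportError(
            f"Missing optional dependencies {missing}; "
            f"install them with: pip install '{package}[{extra}]'"
        )
```

Calling `require_extra("paper")` only when the opt-in path is selected would keep the thin default path free of the heavy imports.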
Standards checklist
- pyproject.toml and _EXTRA_MODULES
- predict_samples
- nbmake notebook
- df_eval schema documented