Golden pipelines: second convenience API aaanalysis.pipe (aap) + sklearn transformer #241

Description

@breimanntools

Problem

AAanalysis exposes interpretable primitives (SequenceFeature, CPP, TreeModel,
dPULearn, ShapModel, AAclust, the metrics), but a user who just wants a result
must hand-wire the whole load → get_df_parts → scales → CPP.run → rank → plot chain
(and the feature_matrix → TreeModel.fit → eval chain for prediction) every time. That
is a high floor for first use, makes copy-paste errors easy, and gives coding agents no
single, stable, one-call entry point. There is also no sklearn-native way to put CPP
feature selection inside a Pipeline / cross_val_score without leaking the test fold.

Goal

Add a second, high-level convenience API, import aaanalysis.pipe as aap: a
stateless, thin facade of "golden pipelines" over the existing primitives (no
algorithm of its own, defaults byte-identical to the explicit path), plus a sklearn-compliant
SequenceFeatureTransformer, with no new required dependency (sklearn is already
core; torch stays the [embed] extra).

Requirements

aaanalysis.pipe (aap) — core 3 golden pipelines (identify → predict → explain)

  • aap.cpp_feature_map(df_seq, labels=None, subcategories=None, dpulearn=False, optimization="balanced", top_n=None, plot=True, random_state=None, n_jobs=None) -> (df_feat, ax) — wraps SequenceFeature.get_df_parts + scale selection + CPP.run + rank + CPPPlot.feature_map. subcategories filters df_scales/df_cat (via load_scales(name="scales_cat")); optimization is a ~3-level preset ("quick"/"balanced"/"thorough") collapsing the CPP knobs; dpulearn=True runs dPULearn first to derive reliable negatives → clean labels → CPP (the canonical PU→CPP path, the one sanctioned structural swap).
  • aap.predict(df_feat, df_seq|df_parts, labels, model_class=None, n_cv=5, random_state=None, n_jobs=None) -> (model, df_eval) — rebuilds X = feature_matrix(df_feat.feature, df_parts, df_scales) then TreeModel.fit + cross-validated eval.
  • aap.explain(df_feat, df_seq|df_parts, labels, model=None, ...) -> (df_feat_shap, ax) [pro]: ShapModel SHAP values + SHAP-coloured feature map. Gated behind the pro extra via the existing missing_feature_stub mechanism.
  • Option rule: parameter pass-through is allowed; structural component swaps live in separate, composable aap.* pipelines (e.g. a stretch goal aap.reliable_negatives(df_seq, ...) -> labels for standalone dPULearn). The only exception is the dpulearn flag on cpp_feature_map.
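
To illustrate how an optimization preset could collapse several CPP knobs while still honouring parameter pass-through, here is a minimal sketch. The helper name resolve_cpp_kwargs, the PRESETS table, and every knob value in it are hypothetical placeholders, not the actual CPP defaults or the planned aap implementation.

```python
from typing import Any, Dict

# Hypothetical preset table: knob names and values are placeholders only,
# chosen for illustration and not taken from the real CPP defaults.
PRESETS: Dict[str, Dict[str, Any]] = {
    "quick": {"n_filter": 50, "max_cor": 0.7},
    "balanced": {"n_filter": 100, "max_cor": 0.5},
    "thorough": {"n_filter": 200, "max_cor": 0.3},
}


def resolve_cpp_kwargs(optimization: str = "balanced", **overrides: Any) -> Dict[str, Any]:
    """Validate the preset name, then merge explicit overrides on top of it."""
    if optimization not in PRESETS:
        raise ValueError(
            f"'optimization' ({optimization!r}) should be one of {sorted(PRESETS)}"
        )
    kwargs = dict(PRESETS[optimization])  # copy, so the shared table is never mutated
    kwargs.update(overrides)              # explicit pass-through parameters win
    return kwargs


quick = resolve_cpp_kwargs("quick")
custom = resolve_cpp_kwargs("balanced", max_cor=0.4)
```

Because the function returns a fresh dict on every call and holds no module-level mutable state, a facade built this way stays stateless: the same arguments always resolve to the same keyword set.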

SequenceFeatureTransformer (in feature_engineering/, exposed in aa)

  • BaseEstimator + TransformerMixin; fit(df_parts|df_seq, y) runs CPP feature selection on the train fold only, transform(...) -> X applies the SAME features → leak-free inside sklearn.pipeline.Pipeline / cross_val_score.
  • Passes the relevant sklearn.utils.estimator_checks.
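
The leak-free contract can be sketched with a stand-in transformer. This is not the proposed SequenceFeatureTransformer: the class TopKMeanDiffSelector, its mean-difference ranking, and the synthetic data are assumptions made only to show that selection happens inside fit (on the training fold) and that transform reuses the stored selection.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


class TopKMeanDiffSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: keeps the k columns with the largest class-mean gap."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        # Ranking uses only the rows passed to fit, i.e. the training fold.
        diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
        self.selected_ = np.sort(np.argsort(diff)[::-1][: self.k])
        return self

    def transform(self, X):
        # Apply the SAME columns chosen during fit; nothing is re-estimated here.
        return np.asarray(X, dtype=float)[:, self.selected_]


# Synthetic data: only the first three columns carry class signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
y = np.array([0, 1] * 50)
X[y == 1, :3] += 1.0

pipe = Pipeline([
    ("select", TopKMeanDiffSelector(k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)  # one score per fold
```

cross_val_score clones the pipeline for each split and calls fit on the training rows only, so the held-out fold never influences which columns are selected.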

Principles

  • Stateless (no pyplot-style global state); plain numpy/pandas out (torch via torch.from_numpy); random_state/seed threaded; numpydoc + one example notebook per pipeline; defaults asserted byte-identical to the manual path by test.

KPIs / Acceptance criteria

  • aap.cpp_feature_map(...) / aap.predict(...) outputs are byte-identical to the equivalent explicit-primitive call at each optimization grade (regression-tested).
  • SequenceFeatureTransformer runs in cross_val_score (n_splits ≥ 5, one score/fold, no leakage) and passes the relevant sklearn.utils.estimator_checks.
  • The full load_dataset → aap identify→predict workflow yields a CV score in ≤10 lines of user code; one example notebook per pipeline runs under the nbmake gate.
  • Repeated random_state → byte-identical predictions.
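
One way to test the byte-identical criteria is to hash the raw bytes of two runs and compare digests. The sketch below uses a hypothetical predict_stub in place of aap.predict and a hypothetical digest helper; it only demonstrates the comparison technique, not the real regression test.

```python
import hashlib
import random
import struct


def predict_stub(n: int, random_state: int) -> list:
    """Hypothetical stand-in for a seeded prediction call."""
    rng = random.Random(random_state)  # local generator, no global state touched
    return [rng.random() for _ in range(n)]


def digest(values) -> str:
    """Hash the exact IEEE-754 bytes so any bit-level drift changes the digest."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(raw).hexdigest()


first = digest(predict_stub(10, random_state=42))
second = digest(predict_stub(10, random_state=42))
other = digest(predict_stub(10, random_state=7))

# Same seed must reproduce the same bytes; a different seed should not.
same_seed_identical = first == second
different_seed_differs = first != other
```

Comparing digests rather than using a floating-point tolerance is what makes the check "byte-identical": a difference in the last bit of a single value fails the test.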

Scope / non-goals

  • No new algorithm, no new required dependency; no torch in core (stays [embed]).
  • ProtXplain boundary: this is user-/sklearn-idiomatic convenience (ours). The
    machine-readable tool contract / MCP / verb-orchestration layer stays in ProtXplain
    (relates ADR-0035, refined by ADR-0038).

Dependencies

Standards checklist

  • CONFIRM-FIRST: new aaanalysis/pipe/ namespace + aaanalysis/__init__.py /
    __all__ re-export (new public surface) — call out explicitly at PR time.
  • Frontend/backend split honored; validation block; backend trusts frontend.
  • numpydoc (named Returns, per-method Examples include); reproducibility
    (random_state/seed); pro gating for aap.explain.
  • tests (unit + byte-identical parity + sklearn estimator checks); no print();
    bare ValueError/RuntimeError.
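
The pro gating for aap.explain could follow the usual optional-dependency pattern sketched below. The real missing_feature_stub signature is not shown in this issue, so make_missing_stub, gate, and the module names used here are hypothetical illustrations of the idea only.

```python
import importlib.util


def make_missing_stub(feature: str, extra: str):
    """Return a callable that raises a helpful ImportError when invoked."""
    def _stub(*args, **kwargs):
        raise ImportError(
            f"'{feature}' requires the optional '{extra}' extra "
            f"(e.g. pip install aaanalysis[{extra}])"
        )
    _stub.__name__ = feature
    return _stub


def gate(impl, feature: str, extra: str, module_name: str):
    """Expose impl if its optional dependency is importable, else a stub."""
    available = importlib.util.find_spec(module_name) is not None
    return impl if available else make_missing_stub(feature, extra)


def _explain_impl(*args, **kwargs):
    # Placeholder body standing in for the real pipeline.
    return "explained"


# 'json' is always present; the second module name is deliberately nonexistent.
explain_ok = gate(_explain_impl, "explain", "pro", "json")
explain_gated = gate(_explain_impl, "explain", "pro", "module_that_does_not_exist_xyz")
```

With this shape the public name always exists, so import aaanalysis.pipe never fails; the error surfaces only when a user actually calls the gated pipeline.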
