Problem
AAanalysis exposes interpretable primitives (SequenceFeature, CPP, TreeModel, dPULearn, ShapModel, AAclust, the metrics), but a user who just wants a result must hand-wire the whole load → get_df_parts → scales → CPP.run → rank → plot chain (and the feature_matrix → TreeModel.fit → eval chain for prediction) every time. That is a high floor for first use, makes copy-paste errors easy, and gives coding agents no single, stable, one-call entry point. There is also no sklearn-native way to put CPP feature selection inside a Pipeline / cross_val_score without leaking the test fold.
Goal
Add a second, high-level convenience API, import aaanalysis.pipe as aap: a stateless, thin facade of "golden pipelines" over the existing primitives (zero own algorithm, defaults byte-identical to the explicit path), plus a sklearn-compliant SequenceFeatureTransformer, with no new required dependency (sklearn is already core; torch stays the [embed] extra).
Requirements
aaanalysis.pipe (aap): core 3 golden pipelines (identify → predict → explain)
aap.cpp_feature_map(df_seq, labels=None, subcategories=None, dpulearn=False, optimization="balanced", top_n=None, plot=True, random_state=None, n_jobs=None) -> (df_feat, ax) - wraps SequenceFeature.get_df_parts + scale selection + CPP.run + rank + CPPPlot.feature_map. subcategories filters df_scales / df_cat (via load_scales(name="scales_cat")); optimization is a ~3-level preset ("quick" / "balanced" / "thorough") collapsing the CPP knobs; dpulearn=True runs dPULearn first to derive reliable negatives → clean labels → CPP (the canonical PU→CPP path, the one sanctioned structural swap).
aap.predict(df_feat, df_seq|df_parts, labels, model_class=None, n_cv=5, random_state=None, n_jobs=None) -> (model, df_eval) - rebuilds X = feature_matrix(df_feat.feature, df_parts, df_scales), then TreeModel.fit + cross-validated eval.
aap.explain(df_feat, df_seq|df_parts, labels, model=None, ...) -> (df_feat_shap, ax) - pro; ShapModel SHAP values + SHAP-coloured feature map. Gated behind the pro extra via the existing missing_feature_stub mechanism.
Option rule: parameter pass-through YES; structural component-swaps are separate composable aap.* pipelines (e.g. a stretch aap.reliable_negatives(df_seq, ...) -> labels standalone dPULearn), except the dpulearn flag on cpp_feature_map.
SequenceFeatureTransformer (in feature_engineering/, exposed in aa)
BaseEstimator + TransformerMixin; fit(df_parts|df_seq, y) runs CPP feature selection on the train fold only, transform(...) -> X applies the SAME features → leak-free inside sklearn.pipeline.Pipeline / cross_val_score.
Passes the relevant sklearn.utils.estimator_checks.
Principles
Stateless (no pyplot-style global state); plain numpy/pandas out (torch via torch.from_numpy); random_state/seed threaded; numpydoc + one example notebook per pipeline; defaults asserted byte-identical to the manual path by test.
KPIs / Acceptance criteria
aap.cpp_feature_map(...) / aap.predict(...) outputs are byte-identical to the equivalent explicit-primitive call at each optimization grade (regression-tested).
SequenceFeatureTransformer runs in cross_val_score (n_splits ≥ 5, one score/fold, no leakage) and passes the relevant sklearn.utils.estimator_checks.
load_dataset → aap full identify→predict CV score in ≤10 lines; one example notebook per pipeline runs under the nbmake gate.
random_state → byte-identical predictions.
Scope / non-goals
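No new algorithm, no new required dependency; no torch in core (stays [embed]).
ProtXplain boundary: this is user-/sklearn-idiomatic convenience (ours). The machine-readable tool contract / MCP / verb-orchestration layer stays in ProtXplain (relates ADR-0035, refined by ADR-0038).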
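To make the cross_val_score acceptance criterion concrete (n_splits ≥ 5, one score per fold, no leakage, same seed giving identical results), here is a minimal sketch that uses only NumPy and scikit-learn. TopKCorrelationSelector is a hypothetical stand-in, not the proposed SequenceFeatureTransformer and not CPP: it simply ranks columns by absolute correlation with y. The point it illustrates is that selection happens inside fit, so cross_val_score, which clones and refits the whole Pipeline per fold, only ever selects features from the training rows. The data is synthetic.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


class TopKCorrelationSelector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a fold-aware feature selector (NOT CPP)."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Score columns by |Pearson correlation| with y, using only the rows passed to fit.
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        denom[denom == 0] = np.inf  # constant columns (or constant y) score 0
        scores = np.abs(Xc.T @ yc) / denom
        # Remember the chosen column indices so transform reuses the SAME features.
        self.selected_ = np.sort(np.argsort(scores)[::-1][: self.k])
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float)[:, self.selected_]


def run_cv(X, y, seed=0):
    """Selection + model in one Pipeline, refit from scratch on every train fold."""
    pipe = Pipeline([
        ("select", TopKCorrelationSelector(k=5)),
        ("model", RandomForestClassifier(n_estimators=25, random_state=seed)),
    ])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(pipe, X, y, cv=cv)


# Synthetic data: only the first two columns carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=120) > 0).astype(int)

scores = run_cv(X, y, seed=0)
scores_again = run_cv(X, y, seed=0)
print(scores)
```

Selecting features on the full dataset before calling cross_val_score would let the held-out rows influence which columns are kept. Keeping the selector as a Pipeline step avoids that, and threading one seed through both the splitter and the model is what makes the two runs comparable.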
Dependencies
pipeline wrapper). Relates Protocols: task-oriented, pipeline-ordered usage guide (epic) #35 / epic: ecosystem integration — consume upstream + descriptors, expose downstream (sklearn / XAI / design) #210 / Define benchmark dataset for evaluation #25. Relates ADR-0035 / ADR-0038.
Standards checklist
aaanalysis/pipe/ namespace + aaanalysis/__init__.py / __all__ re-export (new public surface): call out explicitly at PR time.
Returns, per-method Examples include); reproducibility (random_state/seed); pro gating for aap.explain.
print(); bare ValueError/RuntimeError.
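The pro gating item for aap.explain can be pictured with a small standard-library sketch. It is an assumption about the general pattern only: optional_feature, its parameters and the error text are hypothetical, and the signature of the existing missing_feature_stub mechanism is not reproduced here. If the optional module is importable, the real function is returned; otherwise a stub with the same name raises a descriptive ImportError when called.

```python
import functools
import importlib.util


def optional_feature(required_module, extra, feature_name):
    """Hypothetical decorator: keep func if required_module is importable, else stub it."""
    # find_spec returns None for a missing top-level module without importing it.
    available = importlib.util.find_spec(required_module) is not None

    def decorator(func):
        if available:
            return func

        @functools.wraps(func)
        def stub(*args, **kwargs):
            # Fail at call time with a message that names the missing extra.
            raise ImportError(
                f"'{feature_name}' requires the optional '{extra}' extra "
                f"(module '{required_module}' is not installed)."
            )

        return stub

    return decorator


# 'json' ships with Python, so the real function is kept.
@optional_feature("json", extra="pro", feature_name="explain_available")
def explain_available():
    return "real implementation ran"


# A module name chosen to be absent, so the stub is returned instead.
@optional_feature("surely_missing_module_for_demo", extra="pro", feature_name="explain_missing")
def explain_missing():
    return "never reached"


print(explain_available())
try:
    explain_missing()
except ImportError as err:
    print(err)
```

Raising at call time rather than at import time keeps import aaanalysis.pipe working in a core-only install, and the message tells the user which extra is missing.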