Golden pipelines: second convenience API `aaanalysis.pipe` (aap) + sklearn transformer #241

## Problem

AAanalysis exposes interpretable primitives (`SequenceFeature`, `CPP`, `TreeModel`,
`dPULearn`, `ShapModel`, `AAclust`, the metrics), but a user who just wants a result
must hand-wire the whole `load → get_df_parts → scales → CPP.run → rank → plot` chain
(and the `feature_matrix → TreeModel.fit → eval` chain for prediction) every time. That
is a high floor for first use, makes copy-paste errors easy, and gives coding agents no
single, stable, one-call entry point. There is also no sklearn-native way to put CPP
feature selection inside a `Pipeline` / `cross_val_score` without leaking the test fold.

## Goal

Add a second, **high-level convenience API**, `import aaanalysis.pipe as aap`: a
**stateless, thin facade** of "golden pipelines" over the existing primitives (no
algorithms of its own; defaults byte-identical to the explicit path), plus a
sklearn-compliant `SequenceFeatureTransformer`, **with no new required dependency**
(sklearn is already core; torch stays the `[embed]` extra).

## Requirements

### `aaanalysis.pipe` (`aap`) — core 3 golden pipelines (identify → predict → explain)
- [ ] `aap.cpp_feature_map(df_seq, labels=None, subcategories=None, dpulearn=False, optimization="balanced", top_n=None, plot=True, random_state=None, n_jobs=None) -> (df_feat, ax)`: wraps `SequenceFeature.get_df_parts` + scale selection + `CPP.run` + rank + `CPPPlot.feature_map`. `subcategories` filters `df_scales`/`df_cat` (via `load_scales(name="scales_cat")`); `optimization` is a three-level preset (`"quick"`/`"balanced"`/`"thorough"`) that collapses the CPP knobs; `dpulearn=True` runs dPULearn first to derive reliable negatives → clean labels → CPP (the canonical PU→CPP path, the one sanctioned structural swap).
- [ ] `aap.predict(df_feat, df_seq|df_parts, labels, model_class=None, n_cv=5, random_state=None, n_jobs=None) -> (model, df_eval)`: rebuilds `X = feature_matrix(df_feat.feature, df_parts, df_scales)`, then runs `TreeModel.fit` + cross-validated eval.
- [ ] `aap.explain(df_feat, df_seq|df_parts, labels, model=None, ...) -> (df_feat_shap, ax)`: `ShapModel` SHAP values + SHAP-coloured feature map. Gated behind the `pro` extra via the existing `missing_feature_stub` mechanism.
- [ ] Option rule: parameter pass-through YES; structural component-swaps are separate composable `aap.*` pipelines (e.g. a stretch `aap.reliable_negatives(df_seq, ...) -> labels` standalone dPULearn), **except** the `dpulearn` flag on `cpp_feature_map`.

### `SequenceFeatureTransformer` (in `feature_engineering/`, exposed in `aa`)
- [ ] `BaseEstimator` + `TransformerMixin`; `fit(df_parts|df_seq, y)` runs CPP feature **selection on the train fold only**, `transform(...) -> X` applies the SAME features → **leak-free** inside `sklearn.pipeline.Pipeline` / `cross_val_score`.
- [ ] Passes the relevant `sklearn.utils.estimator_checks`.
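The leak-free contract can be sketched with a small stand-in transformer. `TopKMeanDiffSelector` is a hypothetical class that ranks columns by the absolute difference of class means; it is not the CPP algorithm, and it takes a plain numeric matrix rather than `df_parts`. It only illustrates that selection happens in `fit` on the rows the pipeline passes in, and that `transform` reuses the stored columns.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class TopKMeanDiffSelector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for CPP selection (not the CPP algorithm)."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        # Selection sees only the rows passed to fit, i.e. the train fold.
        diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
        self.selected_ = np.sort(np.argsort(diff)[::-1][: self.k])
        return self

    def transform(self, X):
        # Apply the SAME columns chosen during fit; nothing is re-selected.
        return np.asarray(X, dtype=float)[:, self.selected_]


# Synthetic data: only the first three columns carry class signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))
y = np.repeat([0, 1], 60)
X[y == 1, :3] += 3.0

pipe = Pipeline([
    ("select", TopKMeanDiffSelector(k=5)),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
# cross_val_score clones and refits the whole pipeline per fold, so the
# selector never sees the held-out rows.
scores = cross_val_score(pipe, X, y, cv=5)
```

Selecting features on the full dataset before calling `cross_val_score` would let the held-out rows influence which columns are kept; putting the selector inside the `Pipeline` avoids that.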

### Principles
- [ ] Stateless (no pyplot-style global state); plain numpy/pandas out (torch via `torch.from_numpy`); `random_state`/`seed` threaded; numpydoc + one example notebook per pipeline; defaults asserted byte-identical to the manual path by test.

## KPIs / Acceptance criteria
- [ ] `aap.cpp_feature_map(...)` / `aap.predict(...)` outputs are **byte-identical** to the equivalent explicit-primitive call at each `optimization` grade (regression-tested).
- [ ] `SequenceFeatureTransformer` runs in `cross_val_score` (n_splits ≥ 5, one score/fold, **no leakage**) and passes the relevant `sklearn.utils.estimator_checks`.
- [ ] `load_dataset → aap` full identify→predict CV score in **≤10 lines**; one example notebook per pipeline runs under the nbmake gate.
- [ ] Repeated `random_state` → byte-identical predictions.
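A parity regression test of the kind described above might look roughly like this. `explicit_path` and `facade` are toy stand-ins (assumptions, not AAanalysis functions); the point is the pattern of hashing the serialized output of both paths and of repeated seeded runs.

```python
import hashlib
import pickle
import random


def explicit_path(values, n_top, seed):
    """Toy stand-in for the hand-wired chain: seeded tie-break, then rank."""
    rng = random.Random(seed)
    scored = [(v + rng.random() * 1e-6, i) for i, v in enumerate(values)]
    return [i for _, i in sorted(scored, reverse=True)[:n_top]]


def facade(values, n_top=3, random_state=None):
    """Thin wrapper: forwards arguments and adds no logic of its own."""
    return explicit_path(values, n_top=n_top, seed=random_state)


def digest(obj):
    """Hash the pickled bytes so equality means byte-identical output."""
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()


values = [0.2, 0.9, 0.9, 0.1, 0.5]

# Facade vs. explicit path with the same seed.
parity_ok = digest(facade(values, random_state=0)) == digest(explicit_path(values, 3, 0))
# Same seed twice through the facade.
repeat_ok = digest(facade(values, random_state=0)) == digest(facade(values, random_state=0))
```

For DataFrame outputs such as `df_feat`, `pandas.testing.assert_frame_equal` with `check_exact=True` would be a natural alternative to hashing.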

## Scope / non-goals
- No new algorithm, no new required dependency; **no torch in core** (stays `[embed]`).
- **ProtXplain boundary:** this is user-/sklearn-idiomatic convenience (ours). The
  machine-readable tool contract / MCP / verb-orchestration layer stays in ProtXplain
  (relates ADR-0035, refined by ADR-0038).

## Dependencies
- Child of the **API-ergonomics** epic #126. **Supersedes the closed #24** (sklearn
  pipeline wrapper). Relates #35 / #210 / #25. Relates ADR-0035 / ADR-0038.

## Standards checklist
- [ ] **CONFIRM-FIRST**: new `aaanalysis/pipe/` namespace + `aaanalysis/__init__.py` /
      `__all__` re-export (new public surface) — call out explicitly at PR time.
- [ ] Frontend/backend split honored; validation block; backend trusts frontend.
- [ ] numpydoc (named `Returns`, per-method `Examples` include); reproducibility
      (`random_state`/`seed`); `pro` gating for `aap.explain`.
- [ ] tests (unit + byte-identical parity + sklearn estimator checks); no `print()`;
      bare `ValueError`/`RuntimeError`.
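The `pro` gating item can be pictured with a generic optional-dependency pattern. This is not the repository's `missing_feature_stub`, whose signature is not shown in this issue; `make_stub` and `gated` are hypothetical helpers, and the choice of `ImportError` and of the install hint are assumptions.

```python
import importlib.util


def make_stub(name, extra):
    """Return a callable that fails with an install hint when invoked."""
    def _stub(*args, **kwargs):
        raise ImportError(
            f"'{name}' requires the optional '{extra}' extra: "
            f"pip install aaanalysis[{extra}]"
        )
    _stub.__name__ = name
    return _stub


def gated(func, required_module, extra):
    """Expose func only if required_module is importable, else a stub."""
    if importlib.util.find_spec(required_module) is None:
        return make_stub(func.__name__, extra)
    return func


def explain(*args, **kwargs):
    """Placeholder body for the illustration."""
    return "explained"


# 'json' is in the standard library, so the real function is kept.
explain_available = gated(explain, "json", "pro")
# A module name that is not installed yields the stub instead.
explain_missing = gated(explain, "surely_not_installed_module_xyz", "pro")
```

Importing the namespace stays cheap and never fails; the error surfaces only when the gated pipeline is actually called.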

