epic: ecosystem integration — consume upstream + descriptors, expose downstream (sklearn / XAI / design)

## Problem
AAanalysis is the interpretable middle band between bioinformatics I/O and
downstream ML / XAI / design (see ecosystem diagram). Today "Ecosystem
integration" is our weakest dimension vs descriptor competitors
(iFeature / propy3 / PyBioMed), rated "mixed / needs improvement". Concretely:

- The "consumes X / scikit-learn-compatible" claim is only half-true. We
  *consume* sklearn estimators (`TreeModel` / `ShapModel` / `AAclust` accept
  `list_model_classes` / `model_class`), and `SequenceFeature.feature_matrix`
  already returns a plain `X` — but there is **no documented `Pipeline` recipe
  and no test proving it round-trips**.
- Upstream adapters that *already exist as code* — `read_fasta`,
  `EmbeddingPreprocessor.encode` (the sanctioned ESM/ProtT5 → `dict_num` step),
  `combine_dict_nums`, `StructurePreprocessor`, `AnnotationPreprocessor` — are
  **undocumented as end-to-end bridges**.
- Descriptor libraries are treated as pure competitors, not as **consumable
  feature sources**.
- Downstream XAI beyond SHAP (Captum / Quantus) is **not wired**.

This costs adoption: a user comparing AAanalysis to iFeature can't see how it
plugs into their existing sklearn / ESM / proteomics stack, and we can't make
the defensible claim concrete — *"more interpretable & task-aware; consumes the
ecosystem rather than competing on breadth."*

## Goal
One umbrella tracking the ecosystem-integration strategy: AAanalysis **consumes**
upstream representations and even competitor descriptors, runs them through its
interpretable core (Part×Split×Scale / AAontology / CPP / ShapModel), and
**exposes** results downstream (ML / XAI / optimization) — without reimplementing
breadth and without bloating core deps. This issue is the **memory anchor**;
concrete children are spun out later, each with its own quantified KPI.

## Roadmap (children to spin out later — NOT built in this issue)

### Upstream · "consumes X"
- [ ] Prove + document sklearn-compatibility: `FunctionTransformer(lambda dfp:
      sf.feature_matrix(df_feat, dfp))` inside a `Pipeline`, with a round-trip
      test. Leakage-aware framing: discover `df_feat` **once** (`CPP.run`), put
      only the deterministic `feature_matrix` transform inside CV. **No new
      public symbol.**
- [ ] Adapter recipe notebooks for the bridges that already have code:
      Biopython/scikit-bio → `df_seq` (`read_fasta`); ESM/ProtT5 → `dict_num`
      (`EmbeddingPreprocessor.encode`); pyteomics peptide table → `df_seq`;
      Optuna → `CPPGrid` objective. Docs-only; **lightweight/mock inputs** so
      they run under nbmake without pulling `torch` / `fair-esm`.
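
A minimal sketch of the sklearn round-trip item above, using mock data. `feature_matrix_stub` is a hypothetical stand-in for `sf.feature_matrix` (here it just picks columns), and the `features` list stands in for a `df_feat` discovered once beforehand; it assumes that discovery did not use the data later scored in CV. A named function with `kw_args` is used instead of a lambda so the transformer stays cloneable and picklable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def feature_matrix_stub(parts, features):
    """Hypothetical stand-in for `sf.feature_matrix`: a stateless column pick."""
    return np.asarray(parts)[:, features]


rng = np.random.default_rng(0)
parts = rng.normal(size=(40, 6))   # mock numeric stand-in for df_parts
labels = np.tile([0, 1], 20)       # mock binary labels
features = [0, 2, 5]               # mock stand-in for df_feat, fixed before CV

pipe = Pipeline([
    ("features", FunctionTransformer(feature_matrix_stub, kw_args={"features": features})),
    ("clf", LogisticRegression()),
])

# Round trip: the transformer step must reproduce the direct call exactly.
X_direct = feature_matrix_stub(parts, features)
X_step = pipe.named_steps["features"].fit_transform(parts)
roundtrip_ok = np.array_equal(X_direct, X_step)

# Only the deterministic transform and the classifier are refit per fold.
scores = cross_val_score(pipe, parts, labels, cv=5)
```

In the real child issue the stub would be replaced by the actual `feature_matrix` call, and `roundtrip_ok` would become the round-trip test assertion.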

### Consume descriptor competitors (turn "vs" into "upstream")
- [ ] **Path 1 (primary):** iFeature / iFeatureOmega / propy3 / PyBioMed
      precomputed descriptor matrix → `X` → `TreeModel` / `ShapModel` for
      interpretable modeling + benchmark baselines. (Makes the "consumes X" box
      real; these flat vectors are already aggregated per sequence, so they
      **cannot** flow into CPP's Part×Split×Scale engine, which needs
      per-residue scales.)
- [ ] **Path 2 (marginal, open question):** their AAindex-style per-AA scales →
      `df_scales` → CPP. Worth it given curated AAontology scales already ship?
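
For Path 1, the bridge is mostly an alignment step: a descriptor table exported by another tool has to be reordered so that row `i` of `X` matches sequence `i` and label `i`. The sketch below assumes a CSV keyed by an `entry` column; the column names and values are mock data, not real iFeature or propy3 output.

```python
import csv
import io

import numpy as np

# Mock export from a descriptor tool: one row per sequence, keyed by entry id.
descriptor_csv = io.StringIO(
    "entry,AAC_A,AAC_C,CTDC_1\n"
    "P2,0.10,0.02,0.31\n"
    "P1,0.07,0.05,0.28\n"
    "P3,0.12,0.01,0.40\n"
)
entries = ["P1", "P2", "P3"]   # mock order of the `entry` column in df_seq
labels = np.array([1, 0, 1])   # mock labels, same order as `entries`

reader = csv.DictReader(descriptor_csv)
feature_names = [name for name in reader.fieldnames if name != "entry"]
rows = {row["entry"]: [float(row[name]) for name in feature_names] for row in reader}

# Fail loudly rather than silently dropping sequences without descriptors.
missing = [entry for entry in entries if entry not in rows]
if missing:
    raise KeyError(f"No descriptors for entries: {missing}")

# Reorder rows so that X[i] belongs to entries[i] and labels[i].
X = np.array([rows[entry] for entry in entries])
```

The resulting `X` and `labels` are what would then be handed to `TreeModel` / `ShapModel`; `feature_names` keeps the descriptor ids available for interpretation.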

### Downstream · XAI (wrap, don't reimplement)
- [ ] Wrap Captum / Quantus to consume AAanalysis models / feature matrix.
      Coordinate with #55 (XAI evaluation) and #51 (Neural XAI) — link,
      don't duplicate.

### Tracked elsewhere / separate studies
- [ ] Descriptor benchmark study (iFeature/propy3/PyBioMed/AA-comp/k-mers/ESM +
      simple classifiers) — separate issue; it's a study, not an integration.
- [ ] "AAanalysis vs descriptor libraries" comparison page → docs epic #106.
- [ ] Validation-protocol suite (homology-aware/same-vs-diff-protein splits,
      shuffled-label controls, feature stability, per-protein AP, PU sanity) →
      overlaps #91.
- [ ] Conformal/MAPIE uncertainty + pymoo/Nevergrad/DEAP design recipes →
      #53 / #57 / #59 / #60.
- [ ] Proteomics side-branch (detectability / flyability / proteotypic peptide
      selection via pyteomics / pyOpenMS / AlphaPept) — optional, separate.
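
As a rough illustration of two items in the validation-protocol bullet (group-aware splits and a shuffled-label control), here is a sketch with scikit-learn on mock data. The `groups` array stands in for protein or homology-cluster ids; how those clusters are derived is out of scope here and left to the owning issue.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))            # mock feature matrix
groups = np.repeat(np.arange(12), 5)    # mock protein / homology-cluster ids
y = (X[:, 0] > 0).astype(int)           # mock labels carrying real signal

cv = GroupKFold(n_splits=4)

# No group may appear on both sides of a split.
groups_disjoint = all(
    set(groups[train]).isdisjoint(groups[test])
    for train, test in cv.split(X, y, groups)
)

clf = LogisticRegression()
scores_real = cross_val_score(clf, X, y, groups=groups, cv=cv)

# Control: permuting labels breaks the X-y link, so scores should drop toward chance.
y_shuffled = rng.permutation(y)
scores_shuffled = cross_val_score(clf, X, y_shuffled, groups=groups, cv=cv)
```

A large gap between `scores_real` and `scores_shuffled` is the expected healthy pattern; similar scores on shuffled labels would point to leakage or an uninformative setup.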

## KPIs / Acceptance criteria (for the epic)
- [ ] Every ecosystem bridge above is enumerated with an owner-issue number or a
      "to spin out" tag — no bridge forgotten.
- [ ] Each child, when created, carries ≥1 quantified KPI per the issue style
      guide.
- [ ] Epic closes only when all children are closed or explicitly dropped.

## Scope / non-goals
- **No** reimplementing descriptor breadth in core — we consume, not clone.
- **No** heavy deps (`torch` / `fair-esm` / `biopython`) added to core; adapters
  use existing `pro` extras or mock inputs. Any new dep is **CONFIRM-FIRST**
  (`pyproject.toml`).
- Children are **not** built here.
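
One common way to honor the "no heavy deps in core" rule is a lazy import guard inside each adapter, so the optional package is only needed when the adapter is actually called. The helper below is a hypothetical sketch, not an existing AAanalysis function; the default extra name `pro` follows the wording above.

```python
import importlib
import importlib.util


def require_optional(module_name, extra="pro"):
    """Import an optional dependency lazily, or fail with an install hint.

    Hypothetical helper for illustration; not an existing AAanalysis function.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this adapter. "
            f"Install it via the '{extra}' extra, or use mock inputs."
        )
    return importlib.import_module(module_name)


# A stdlib module resolves; a missing one raises with the hint.
json_module = require_optional("json")
try:
    require_optional("not_a_real_package_xyz")
    hint = None
except ImportError as err:
    hint = str(err)
```

Importing the core package never touches the optional module with this pattern, which is what lets the recipe notebooks run on mock inputs.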

## Dependencies
- relates #55, #51 (XAI), #91 (validation/eval), #106 (docs),
  #57 / #59 / #60 (design chain), #53 (uncertainty)
- relates ADR-0027 (design scope — AAMut/SeqMut "emerging")

## Standards checklist
- [ ] CONFIRM-FIRST: new dep → `pyproject.toml`; new public symbol →
      `__init__.py` / `__all__` (the sklearn child deliberately adds none) —
      enforced per child
- [ ] frontend/backend · numpydoc · tests · no-print — per child

