Problem
AAanalysis is the interpretable middle band between bioinformatics I/O and
downstream ML / XAI / design (see ecosystem diagram). Today "Ecosystem
integration" is our weakest dimension vs descriptor competitors
(iFeature / propy3 / PyBioMed) — "mixed / needs improvement". Concretely:
- The "consumes X / scikit-learn-compatible" claim is only half-true. We
consume sklearn estimators (TreeModel / ShapModel / AAclust accept
list_model_classes / model_class), and SequenceFeature.feature_matrix
already returns a plain X — but there is no documented Pipeline recipe
and no test proving it round-trips.
- Upstream adapters that already exist as code (read_fasta,
EmbeddingPreprocessor.encode as the sanctioned ESM/ProtT5 → dict_num step,
combine_dict_nums, StructurePreprocessor, AnnotationPreprocessor) are
undocumented as end-to-end bridges.
- Descriptor libraries are treated as pure competitors, not as consumable
feature sources.
- Downstream XAI beyond SHAP (Captum / Quantus) is not wired.
This costs adoption: a user comparing AAanalysis to iFeature can't see how it
plugs into their existing sklearn / ESM / proteomics stack, and we can't make
the defensible claim concrete — "more interpretable & task-aware; consumes the
ecosystem rather than competing on breadth."
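To make the first gap concrete, here is a minimal sketch of what a Pipeline recipe plus round-trip check could look like. It is not AAanalysis code: `feature_matrix` and `FEATURES` below are hypothetical stand-ins for `sf.feature_matrix` and a `df_feat` that was selected once, outside cross-validation, and the sequences and labels are toy data. Only the scikit-learn calls are real API.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical stand-in for a fixed df_feat: feature definitions chosen once, before CV.
FEATURES = [
    ("hydrophobic", set("AILMFVW")),
    ("charged", set("DEKR")),
    ("polar", set("STNQ")),
]


def feature_matrix(features, seqs):
    """Stand-in for sf.feature_matrix(df_feat, dfp): deterministic, nothing is fitted."""
    return np.array(
        [[sum(aa in residues for aa in seq) / len(seq) for _, residues in features]
         for seq in seqs]
    )


# Toy data: six hydrophobic-rich and six charge-rich sequences.
seqs = np.array([
    "AILMFVWAIL", "LLVVAAIIMM", "FWAILMVVAL", "AVLIMFWAVL", "IIILLLVVVA", "MFWAILVAIL",
    "DEKRDEKRDE", "KKRRDDEEKR", "EDKRKRDEDE", "RKDERKDERK", "DDDEEEKKKR", "KRDEKRDEKR",
])
y = np.array([1] * 6 + [0] * 6)

pipe = Pipeline([
    # validate=False lets raw sequences pass through to the featurizer untouched.
    ("features", FunctionTransformer(lambda s: feature_matrix(FEATURES, s), validate=False)),
    ("model", RandomForestClassifier(n_estimators=25, random_state=0)),
])

# Only the deterministic transform and the estimator run inside each CV fold.
scores = cross_val_score(pipe, seqs, y, cv=3)

# Round-trip check: the Pipeline step must reproduce the direct call exactly.
pipe.fit(seqs, y)
X_direct = feature_matrix(FEATURES, seqs)
X_pipe = pipe.named_steps["features"].transform(seqs)
```

A real recipe would swap the stand-in for `sf.feature_matrix(df_feat, dfp)`; the round-trip assertion (`np.array_equal(X_direct, X_pipe)`) is the part a test would keep.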
Goal
One umbrella tracking the ecosystem-integration strategy: AAanalysis consumes
upstream representations and even competitor descriptors, runs them through its
interpretable core (Part×Split×Scale / AAontology / CPP / ShapModel), and
exposes results downstream (ML / XAI / optimization) — without reimplementing
breadth and without bloating core deps. This issue is the memory anchor;
concrete children are spun out later, each with its own quantified KPI.
Scope / non-goals
- No reimplementing descriptor breadth in core — we consume, not clone.
- No heavy deps (torch / fair-esm / biopython) added to core; adapters use
existing pro extras or mock inputs. Any new dep is CONFIRM-FIRST
(pyproject.toml).
- Children are not built here.
Roadmap (children to spin out later — NOT built in this issue)
Upstream · "consumes X"
- FunctionTransformer(lambda dfp: sf.feature_matrix(df_feat, dfp)) inside a
Pipeline, with a round-trip test. Leakage-aware framing: discover df_feat
once (CPP.run), put only the deterministic feature_matrix transform inside
CV. No new public symbol.
- Biopython/scikit-bio → df_seq (read_fasta); ESM/ProtT5 → dict_num
(EmbeddingPreprocessor.encode); pyteomics peptide table → df_seq; Optuna →
CPPGrid objective. Docs-only; lightweight/mock inputs so they run under
nbmake without pulling torch/fair-esm.
Consume descriptor competitors (turn "vs" into "upstream")
- precomputed descriptor matrix → X → TreeModel/ShapModel for interpretable
modeling + benchmark baselines. (Makes the "consumes X" box real; these flat
vectors cannot flow into CPP's Part×Split×Scale engine.)
- df_scales → CPP. Worth it given curated AAontology scales already ship?
Downstream · XAI (wrap, don't reimplement)
- Coordinate with #55 (XAI evaluation framework, scientific validation) and
#51 (Neural XAI methods, deep learning): link, don't duplicate.
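For the descriptor-consumption child, the "precomputed descriptor matrix → X → interpretable model" path could be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: a plain scikit-learn RandomForestClassifier stands in for TreeModel/ShapModel, the matrix is synthetic rather than exported from a descriptor library, and the `desc_*` column names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical precomputed descriptor matrix: 200 sequences x 8 flat descriptors.
# Column names are placeholders, not real descriptor identifiers.
feature_names = [f"desc_{i}" for i in range(8)]
X = rng.normal(size=(200, 8))

# Synthetic labels driven by one column, so the expected ranking is known in advance.
y = (X[:, 3] > 0).astype(int)

# Stand-in for TreeModel/ShapModel: any sklearn tree ensemble exposing importances.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank descriptors by impurity-based importance, highest first.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_feature = ranking[0][0]

for name, importance in ranking[:3]:
    print(f"{name}: {importance:.3f}")
```

The point of the sketch is the data flow: the descriptor tool only has to hand over a plain X with column names, and the interpretable-modeling step stays on our side.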
Tracked elsewhere / separate studies
- [...] simple classifiers): separate issue; it's a study, not an integration.
- [...] shuffled-label controls, feature stability, per-protein AP, PU sanity)
→ overlaps Model evaluation & comparison: repeated-CV + bootstrap CIs +
paired ΔMCC #91.
- Uncertainty-aware XAI (robustness) #53 / ML-guided directed evolution #57 /
Multi-objective optimization #59 / Active learning #60.
- [...] selection via pyteomics / pyOpenMS / AlphaPept): optional, separate.
KPIs / Acceptance criteria (for the epic)
- [...] "to spin out" tag: no bridge forgotten.
- [...] guide.
Dependencies
- ML-guided directed evolution #57 / Multi-objective optimization #59 /
Active learning #60 (design chain), Uncertainty-aware XAI (robustness) #53
(uncertainty)
Standards checklist
- [...] pyproject.toml; new public symbol → __init__.py / __all__ (the sklearn
child deliberately adds none); enforced per child