epic: ecosystem integration — consume upstream + descriptors, expose downstream (sklearn / XAI / design) #210

Description

@breimanntools

Problem

AAanalysis is the interpretable middle band between bioinformatics I/O and
downstream ML / XAI / design (see ecosystem diagram). Today "Ecosystem
integration" is our weakest dimension vs descriptor competitors
(iFeature / propy3 / PyBioMed) — "mixed / needs improvement". Concretely:

  • The "consumes X / scikit-learn-compatible" claim is only half-true. We
    consume sklearn estimators (TreeModel / ShapModel / AAclust accept
    list_model_classes / model_class), and SequenceFeature.feature_matrix
    already returns a plain X, but there is no documented Pipeline recipe
    and no test proving it round-trips.
  • Upstream adapters that already exist as code are undocumented as
    end-to-end bridges: read_fasta, EmbeddingPreprocessor.encode (the
    sanctioned ESM/ProtT5 → dict_num step), combine_dict_nums,
    StructurePreprocessor, and AnnotationPreprocessor.
  • Descriptor libraries are treated as pure competitors, not as consumable
    feature sources.
  • Downstream XAI beyond SHAP (Captum / Quantus) is not wired.

This costs adoption: a user comparing AAanalysis to iFeature can't see how it
plugs into their existing sklearn / ESM / proteomics stack, and we can't make
the defensible claim concrete — "more interpretable & task-aware; consumes the
ecosystem rather than competing on breadth."

Goal

One umbrella tracking the ecosystem-integration strategy: AAanalysis consumes
upstream representations and even competitor descriptors, runs them through its
interpretable core (Part×Split×Scale / AAontology / CPP / ShapModel), and
exposes results downstream (ML / XAI / optimization) — without reimplementing
breadth and without bloating core deps. This issue is the memory anchor;
concrete children are spun out later, each with its own quantified KPI.

Roadmap (children to spin out later — NOT built in this issue)

Upstream · "consumes X"

  • Prove + document sklearn-compatibility:
    FunctionTransformer(lambda dfp: sf.feature_matrix(df_feat, dfp)) inside a
    Pipeline, with a round-trip test. Leakage-aware framing: discover df_feat
    once (CPP.run), put only the deterministic feature_matrix transform
    inside CV. No new public symbol.
  • Adapter recipe notebooks for the bridges that already have code:
    Biopython/scikit-bio → df_seq (read_fasta); ESM/ProtT5 → dict_num
    (EmbeddingPreprocessor.encode); pyteomics peptide table → df_seq;
    Optuna → CPPGrid objective. Docs-only; lightweight/mock inputs so
    they run under nbmake without pulling torch / fair-esm.
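
A minimal sketch of the Pipeline recipe from the first bullet, under stated assumptions: `toy_feature_matrix` is a hypothetical stand-in for `sf.feature_matrix(df_feat, dfp)`, and `TOY_SCALE` / `FEATURES` are illustrative placeholders for `df_scales` / `df_feat`, so the example runs without AAanalysis installed. The sequences are synthetic. Following the framing above, the feature set is fixed before cross-validation and only the deterministic transform sits inside the Pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy per-residue scale (illustrative values, stand-in for one df_scales column).
TOY_SCALE = {"A": 0.6, "L": 1.0, "I": 1.0, "V": 0.9,
             "K": 0.0, "D": 0.1, "E": 0.1, "S": 0.4}

# Hypothetical stand-in for df_feat, fixed once before CV:
# each feature is a (start, stop) fraction of the sequence.
FEATURES = [(0.0, 0.5), (0.5, 1.0)]


def toy_feature_matrix(seqs, features=FEATURES):
    """Stand-in for sf.feature_matrix(df_feat, dfp): mean scale value per segment."""
    rows = []
    for seq in seqs:
        row = []
        for start, stop in features:
            segment = seq[int(start * len(seq)):int(stop * len(seq))]
            row.append(np.mean([TOY_SCALE[aa] for aa in segment]))
        rows.append(row)
    return np.asarray(rows, dtype=float)


# Synthetic data: class 1 is hydrophobic-then-polar, class 0 is the reverse.
rng = np.random.default_rng(0)
hydrophobic, polar = list("ALIV"), list("KDES")


def make_seq(first_half, second_half, length=20):
    half = length // 2
    return "".join(rng.choice(first_half, half)) + "".join(rng.choice(second_half, half))


seqs = np.array([make_seq(hydrophobic, polar) for _ in range(20)]
                + [make_seq(polar, hydrophobic) for _ in range(20)])
y = np.array([1] * 20 + [0] * 20)

# Only the deterministic transform and the estimator live inside the Pipeline.
pipe = Pipeline([
    ("features", FunctionTransformer(toy_feature_matrix)),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
scores = cross_val_score(pipe, seqs, y, cv=5)

# Round-trip check: the Pipeline step returns the same X as the direct call.
X_direct = toy_feature_matrix(seqs)
X_pipe = pipe.fit(seqs, y).named_steps["features"].transform(seqs)
assert np.allclose(X_direct, X_pipe)
```

In the real recipe the stand-in would be replaced by the actual `feature_matrix` call; the final assertion is the shape a round-trip test could take.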

Consume descriptor competitors (turn "vs" into "upstream")

  • Path 1 (primary): iFeature / iFeatureOmega / propy3 / PyBioMed
    precomputed descriptor matrix X → TreeModel / ShapModel for
    interpretable modeling + benchmark baselines. (Makes the "consumes X" box
    real; these flat vectors cannot flow into CPP's Part×Split×Scale
    engine.)
  • Path 2 (marginal, open question): their AAindex-style per-AA scales →
    df_scales → CPP. Worth it given curated AAontology scales already ship?
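
Path 1 can be illustrated with a hedged sketch in which a plain scikit-learn random forest stands in for TreeModel / ShapModel (their actual signatures are not assumed here). The descriptor matrix is a random mock rather than real iFeature / propy3 / PyBioMed output, and the descriptor names are hypothetical. It only shows the shape of the flow: a flat precomputed X goes in, a ranked importance table comes out as a baseline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Mock precomputed descriptor matrix: rows are sequences, columns are descriptors.
# Names are hypothetical placeholders for a descriptor library's output columns.
descriptor_names = ["AAC_A", "AAC_L", "DPC_AL", "CTDC_1"]
X = rng.random((60, len(descriptor_names)))

# Synthetic labels that depend on a single descriptor column (index 1).
y = (X[:, 1] > 0.5).astype(int)

# Plain sklearn model as a stand-in for the interpretable modeling step.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank descriptors by impurity-based importance, highest first.
ranked = sorted(zip(descriptor_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Because the labels are built from one column, that descriptor should rank first, which makes the sketch usable as a sanity check for a benchmark-baseline notebook.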

Downstream · XAI (wrap, don't reimplement)

Tracked elsewhere / separate studies

KPIs / Acceptance criteria (for the epic)

  • Every ecosystem bridge above is enumerated with an owner-issue number or a
    "to spin out" tag — no bridge forgotten.
  • Each child, when created, carries ≥1 quantified KPI per the issue style
    guide.
  • Epic closes only when all children are closed or explicitly dropped.

Scope / non-goals

  • No reimplementing descriptor breadth in core — we consume, not clone.
  • No heavy deps (torch / fair-esm / biopython) added to core; adapters
    use existing pro extras or mock inputs. Any new dep is CONFIRM-FIRST
    (pyproject.toml).
  • Children are not built here.

Dependencies

Standards checklist

  • CONFIRM-FIRST: new dep → pyproject.toml; new public symbol →
    __init__.py / __all__ (the sklearn child deliberately adds none) —
    enforced per child
  • frontend/backend · numpydoc · tests · no-print — per child
