Add scverse / AnnData I/O adapters (from_anndata / to_anndata) via optional [omics] extra #273

Description

@breimanntools

Problem

The dominant data object in single-cell / spatial omics is AnnData (Scanpy /
scverse), yet AAanalysis has no bridge to it: a user holding a Scanpy marker-gene
or differential-expression result cannot turn the selected genes/proteins into a
df_seq, and there is no way to attach a CPP df_feat back onto an AnnData for
downstream omics plotting. The ecosystem epic (#210) enumerates proteomics
adapters but never the single-cell/scverse upstream, so today that entire
upstream is unreachable — a user comparing AAanalysis to their existing scverse
stack sees no integration point, only a manual, error-prone copy of sequences in
and scores out.

Goal

Add thin, optional bidirectional adapters, from_anndata (selected
genes/proteins → df_seq) and to_anndata (df_feat → adata.var / varm /
uns), gated behind a new [omics] extra. anndata is never a core/required
dependency, and no single-cell analysis is reimplemented.

Requirements

  • from_anndata(adata, var_key=..., sequence_key=..., group_key=None, ...)
    returns a df_seq with entry + sequence (and optional label/group
    columns) for a selected .var subset. Sequences come from a user-supplied
    .var column or a passed-in {id: sequence} mapping — no network
    lookup here.
  • to_anndata(adata, df_feat, key="aaanalysis", ...) writes a per-protein
    score to adata.var, the protein×feature matrix to adata.varm, and run
    metadata to adata.uns, non-destructively (namespaced keys, original
    slots untouched).
  • Place alongside read_fasta/to_fasta in aaanalysis/data_handling/
    (flag at review if a dedicated adapters/ module is preferred).
  • New [omics] extra in pyproject.toml (anndata; mudata deferred).
  • On a base install (no [omics]), calling either adapter raises an
    actionable ImportError naming pip install aaanalysis[omics], never a
    bare crash (mirror the optional-engine pattern of the parquet path in
    #33, "Add standardized export formats for outputs").
  • numpydoc docstrings (named Returns, per-method Examples include); one
    example notebook built on a tiny synthetic AnnData so it runs under
    the nbmake CI gate without pulling scanpy/torch.
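
To make the two adapter requirements concrete, here is a minimal sketch of the data movement, not a proposed implementation. Several things in it are assumptions that the issue does not fix: `var_key` is read as a boolean `.var` column marking the selection, `sequence_key` is either a `.var` column name or an `{id: sequence}` dict, and the write side takes already-computed `scores` and `feat_matrix` arguments (hypothetical names) because how they are derived from df_feat is left open above. The `_sketch` function names and the `<key>_score` column name are illustrative only. The functions touch nothing but `.var`, `.varm`, and `.uns`, so the usage below runs on a stand-in object rather than a real AnnData.

```python
from types import SimpleNamespace

import pandas as pd


def from_anndata_sketch(adata, var_key, sequence_key, group_key=None):
    """Build a df_seq (entry + sequence, optional group) from a .var subset."""
    var = adata.var
    # Assumption: var_key names a boolean .var column marking the selection.
    selected = var[var[var_key].astype(bool)]
    if isinstance(sequence_key, dict):
        # Sequences passed in as an {id: sequence} mapping.
        sequences = selected.index.to_series().map(sequence_key)
    else:
        # Sequences stored in a user-supplied .var column.
        sequences = selected[sequence_key]
    df_seq = pd.DataFrame(
        {"entry": selected.index.to_numpy(), "sequence": sequences.to_numpy()}
    )
    if group_key is not None:
        df_seq["group"] = selected[group_key].to_numpy()
    if df_seq["sequence"].isna().any():
        raise ValueError("Missing sequence for at least one selected entry.")
    return df_seq


def to_anndata_sketch(adata, scores, feat_matrix, meta, key="aaanalysis"):
    """Attach results under namespaced keys; refuse to overwrite existing slots."""
    score_col = f"{key}_score"  # hypothetical column naming
    if score_col in adata.var.columns or key in adata.varm or key in adata.uns:
        raise ValueError(f"'{key}' results already present; choose another key.")
    # Align to the .var order; entries without results become NaN.
    adata.var[score_col] = scores.reindex(adata.var.index)
    adata.varm[key] = feat_matrix.reindex(adata.var.index).to_numpy(dtype=float)
    adata.uns[key] = {"features": list(feat_matrix.columns), **meta}
    return adata


# Stand-in exposing the same three attributes the sketch uses on an AnnData.
var = pd.DataFrame(
    {
        "is_marker": [True, False, True],
        "sequence": ["MKTAYIAK", "MSLLNQ", "MAGWE"],
        "cluster": ["0", "1", "0"],
    },
    index=["P1", "P2", "P3"],
)
adata = SimpleNamespace(var=var, varm={}, uns={})

df_seq = from_anndata_sketch(adata, "is_marker", "sequence", group_key="cluster")
scores = pd.Series({"P1": 0.8, "P3": 0.2})
feat_matrix = pd.DataFrame(
    [[0.1, 0.9], [0.4, 0.6]], index=["P1", "P3"], columns=["feat_a", "feat_b"]
)
to_anndata_sketch(adata, scores, feat_matrix, meta={"n_selected": len(df_seq)})
```

Raising on an existing key is one way to read "non-destructively"; an explicit overwrite flag would be an equally reasonable choice for review.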

KPIs / Acceptance criteria

  • Round-trip on a small fixture: from_anndata(adata) yields a df_seq
    whose entry set equals the selected .var subset (asserted); after
    to_anndata(adata, df_feat), reading the attached columns back recovers
    df_feat's contracted columns losslessly.
  • Base install (no [omics]) raises the actionable ImportError
    (asserted) — anndata is absent from core dependencies (verified by a
    base-install import test).
  • ≥1 example notebook runs under the nbmake gate using a synthetic
    AnnData (no scanpy/torch resolved).
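
The base-install criterion could be met with a small lazy-import guard along these lines. This is a sketch under assumptions: `require_optional` is a hypothetical helper name, not an existing AAanalysis function, and the message wording is illustrative. The adapters would call it at the top of their bodies so that importing the package itself never needs anndata.

```python
import importlib


def require_optional(module_name, extra, package="aaanalysis"):
    """Import an optional dependency or raise an ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for this feature but is not installed. "
            f"Install it with: pip install {package}[{extra}]"
        ) from err


# Demonstration with a module name that does not exist, standing in for a
# base install without anndata.
try:
    require_optional("module_that_is_not_installed_xyz", extra="omics")
except ImportError as err:
    message = str(err)
```

A base-install test can then assert that the message contains the install command, which is what makes the error actionable rather than a bare crash.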

Scope / non-goals

  • Out: gene→protein sequence resolution (UniProt/network lookup) —
    sequences are user-supplied; automatic resolution is a separate concern.
  • Out: the expression→sequence-signature enrichment orchestration
    (ProteinSignatureEnrichment / "GSEA-for-protein-features"). That end-to-end
    workflow — resolve gene→protein, run CPP, narrate the signature — is a
    downstream application and lives in ProtXplain, not in AAanalysis core. The
    adapter only moves data across the boundary.
  • Out: MuData / SpatialData (deferred follow-ups once AnnData works).
  • Out: any Scanpy analysis (clustering / DE / UMAP) — consumed, not cloned.

Dependencies

Standards checklist

  • Frontend/backend split honored; validation block; backend trusts frontend
  • CONFIRM-FIRST surface touched: pyproject.toml (new [omics] extra)
    and __init__.py / __all__ (two new public symbols) — flag for
    approval
  • numpydoc docstring (named Returns, per-method Examples include)
  • tests (unit + round-trip; base-install ImportError test)
  • no print() (ut.print_out); bare ValueError/RuntimeError; no
    aaanalysis._utils.* imports outside utils.py
  • new public symbol → re-export in __init__.py (CONFIRM-FIRST); optional
    dep gated via [omics] extra with a clear install hint

Metadata

Assignees

No one assigned

Labels

prio:3 (Still important), topic:data (Data related improvements), type:feature (Implementation of feature)
