Skip to content

Optional GPU backend for CPP feature computation (CuPy/torch, CPU-equivalent, opt-in) #263

Description

@breimanntools

Problem

CPP feature computation is O(n_sequences × n_scales × n_positions). CPU multi-core
parallelization already exists (n_jobs on CPP.run / run_num / eval, backed by
joblib.Parallel over feature/scale chunks), but the per-position matrix ops
(scale application, per-position mean/std) are a natural fit for GPU tensors and could
unlock proteome-scale screens (tens of thousands of sequences) and iterative ML loops.

Carved out of #62, whose CPU-parallelization half is already shipped (joblib n_jobs).
This issue is the GPU-only remainder.

Goal

An opt-in GPU backend for CPP's core matrix operations that is numerically
equivalent
to the CPU/Cython path within a declared tolerance, with graceful CPU
fallback
and no new required dependency.

Requirements

  • Identify the CPP matrix ops that map cleanly to GPU tensors (scale application,
    per-position mean/std), keeping the CPU/Cython kernel the default.
  • Add an opt-in backend="gpu" path: CuPy primary, PyTorch fallback, automatic
    CPU fallback when neither is importable / no device is present.
  • Dependency policy: GPU libs are an optional install extra (e.g.
    aaanalysis[gpu]); the default install and CPU path gain no new required dep.
    Adding the extra + any pyproject.toml change is CONFIRM-FIRST and needs
    explicit maintainer approval.
  • Numerical-equivalence contract: GPU float reductions are not byte-identical to
    the CPU/Cython kernel, so this is an output-affecting change. It must declare a
    tolerance tier (likely T2 allclose(atol=1e-10, rtol=0) + identical discrete
    filter/rank decisions, or T3 with a documented band), commit a regression anchor
    pinning the decision artifact, and land an ADR recording the decision. CPU vs
    GPU equivalence tests run on CPU; the GPU leg is skipped when no device is present.

KPIs / Acceptance

  • On a 10,000-sequence reference dataset, GPU backend ≥ faster than single-core
    CPU (measured on a GPU runner).
  • CPU and GPU outputs satisfy the declared tolerance tier on the canonical cell
    (identical top-feature identity + filter decisions; values within the stated band).
  • Default install unchanged; importing without CuPy/torch or without a device falls
    back to CPU with no error and no new required dependency.

Blockers / non-goals

  • Hardware-gated: the ≥5× criterion and the GPU equivalence leg require a GPU CI
    runner / dev machine
    , which the current matrix does not have. This issue cannot be
    validated (and should not be merged) until GPU CI is available — call that out before
    starting implementation.
  • Re-scoping or replacing the existing CPU n_jobs parallelization is out of scope (done).
  • A user-facing benchmark_performance() helper is out of scope (the tests/benchmarks/
    suite + perf A/B gate already cover runtime measurement).

Dependencies

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions