Optional GPU backend for CPP feature computation (CuPy/torch, CPU-equivalent, opt-in) #263

## Problem
CPP feature computation is O(n_sequences × n_scales × n_positions). CPU multi-core
parallelization already exists (`n_jobs` on `CPP.run` / `run_num` / `eval`, backed by
`joblib.Parallel` over feature/scale chunks), but the per-position matrix ops
(scale application, per-position mean/std) are a natural fit for GPU tensors and could
unlock proteome-scale screens (tens of thousands of sequences) and iterative ML loops.
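As a rough illustration of why these ops suit GPU tensors, here is a minimal sketch written against an array module `xp`, so the same code could run on NumPy or on CuPy (which mirrors the NumPy calls used here). The function names, the `(n_sequences, n_scales, n_positions)` layout, and the integer residue encoding are assumptions for illustration only; they are not the actual CPP kernel or its data layout.

```python
import numpy as np


def apply_scales(encoded, scale_matrix, xp=np):
    """Map integer-encoded residues to scale values (hypothetical layout).

    encoded:      (n_sequences, n_positions) integer residue indices
    scale_matrix: (n_scales, n_amino_acids) scale values
    returns:      (n_sequences, n_scales, n_positions)
    """
    encoded = xp.asarray(encoded)
    scale_matrix = xp.asarray(scale_matrix, dtype=xp.float64)
    # Fancy indexing yields (n_scales, n_sequences, n_positions);
    # move the scale axis behind the sequence axis.
    return xp.moveaxis(scale_matrix[:, encoded], 0, 1)


def per_position_stats(values, xp=np):
    """Mean and std over sequences for every (scale, position) cell."""
    arr = xp.asarray(values, dtype=xp.float64)
    return arr.mean(axis=0), arr.std(axis=0)


# Tiny CPU example: 2 sequences, 3 positions, 2 scales, 3 residue types.
encoded = np.array([[0, 1, 2], [2, 1, 0]])
scale_matrix = np.array([[0.0, 0.5, 1.0], [1.0, 0.0, -1.0]])
values = apply_scales(encoded, scale_matrix)
mean, std = per_position_stats(values)
```

Passing `xp=cupy` instead of the NumPy default would be the GPU variant of the same sketch, under the assumption that the inputs fit in device memory.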

Carved out of #62, whose **CPU-parallelization half is already shipped** (joblib `n_jobs`).
This issue is the **GPU-only remainder**.

## Goal
An **opt-in** GPU backend for CPP's core matrix operations that is **numerically
equivalent** to the CPU/Cython path within a declared tolerance, with **graceful CPU
fallback** and **no new required dependency**.

## Requirements
- [ ] Identify the CPP matrix ops that map cleanly to GPU tensors (scale application,
      per-position mean/std), keeping the CPU/Cython kernel the default.
- [ ] Add an opt-in `backend="gpu"` path: CuPy primary, PyTorch fallback, and automatic
      CPU fallback when neither is importable or no device is present.
- [ ] **Dependency policy:** GPU libs are an **optional install extra** (e.g.
      `aaanalysis[gpu]`); the default install and CPU path gain **no new required dep**.
      Adding the extra + any `pyproject.toml` change is **CONFIRM-FIRST** and needs
      explicit maintainer approval.
- [ ] **Numerical-equivalence contract:** GPU float reductions are not byte-identical to
      the CPU/Cython kernel, so this is an output-affecting change. It must declare a
      tolerance **tier** (likely T2 `allclose(atol=1e-10, rtol=0)` + identical discrete
      filter/rank decisions, or T3 with a documented band), commit a regression anchor
      pinning the decision artifact, and land an **ADR** recording the decision. The
      CPU-vs-GPU equivalence tests must still run on CPU-only machines, where the GPU leg
      is skipped because no device is present.
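The CuPy-first, PyTorch-second, CPU-last resolution order described above could look roughly like the sketch below. `resolve_backend` is a hypothetical helper name, not an existing function in the package; the only library calls used are `importlib.util.find_spec`, `cupy.cuda.runtime.getDeviceCount`, and `torch.cuda.is_available`. Whether a silent fallback or a warning is preferable is left open here.

```python
import importlib.util


def resolve_backend(requested="cpu"):
    """Return 'cupy', 'torch', or 'cpu' for a requested backend (hypothetical helper).

    Anything other than an explicit "gpu" request keeps the default CPU path.
    """
    if requested != "gpu":
        return "cpu"

    # CuPy is tried first; importing it or querying devices can fail
    # when CUDA libraries or a driver are missing, so any error means "skip".
    if importlib.util.find_spec("cupy") is not None:
        try:
            import cupy

            if cupy.cuda.runtime.getDeviceCount() > 0:
                return "cupy"
        except Exception:
            pass

    # PyTorch is the second choice.
    if importlib.util.find_spec("torch") is not None:
        try:
            import torch

            if torch.cuda.is_available():
                return "torch"
        except Exception:
            pass

    # Neither library is usable or no device is present: fall back to CPU.
    return "cpu"
```

Because both GPU imports happen lazily inside the function, importing the package itself would not require CuPy or torch, which is consistent with the no-new-required-dependency policy.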

## KPIs / Acceptance
- [ ] On a 10,000-sequence reference dataset, GPU backend ≥ **5×** faster than single-core
      CPU (measured on a GPU runner).
- [ ] CPU and GPU outputs satisfy the declared tolerance tier on the canonical cell
      (identical top-feature identity + filter decisions; values within the stated band).
- [ ] Default install unchanged; importing without CuPy/torch or without a device falls
      back to CPU with no error and no new required dependency.
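The tolerance criterion could be checked along the lines of the sketch below, which assumes the T2 tier (`atol=1e-10`, `rtol=0`) and uses top-k rank identity as a stand-in for "identical discrete decisions". `check_equivalence` and `top_k` are hypothetical names; the real filter and rank logic in CPP may differ, and GPU results are assumed to have been copied back to host arrays before the comparison.

```python
import numpy as np


def check_equivalence(cpu_values, gpu_values, top_k=10, atol=1e-10):
    """Return True if values agree within atol and the top-k ranking is identical."""
    cpu = np.asarray(cpu_values, dtype=np.float64)
    gpu = np.asarray(gpu_values, dtype=np.float64)

    # Absolute tolerance only (rtol=0), matching the assumed T2 tier.
    values_ok = np.allclose(cpu, gpu, atol=atol, rtol=0.0)

    # Stable descending sort so ties are broken the same way on both sides.
    top_cpu = np.argsort(-cpu, kind="stable")[:top_k]
    top_gpu = np.argsort(-gpu, kind="stable")[:top_k]
    ranks_ok = np.array_equal(top_cpu, top_gpu)

    return bool(values_ok and ranks_ok)


# Values within tolerance can still flip a ranking, which the check must catch.
reference = np.array([0.9, 0.1, 0.5, 0.7])
same_within_band = check_equivalence(reference, reference + 1e-12)
rank_flip = check_equivalence([0.5, 0.5 + 5e-11], [0.5 + 5e-11, 0.5])
```

The second call illustrates why the value band alone is not sufficient: both arrays differ by less than `atol`, yet the order of the two features swaps, so the check fails.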

## Blockers / non-goals
- **Hardware-gated:** the ≥5× criterion and the GPU equivalence leg require a **GPU CI
  runner / dev machine**, which the current matrix does not have. This issue cannot be
  validated (and should not be merged) until GPU CI is available. Call that out before
  starting implementation.
- Re-scoping or replacing the existing CPU `n_jobs` parallelization is out of scope (done).
- A user-facing `benchmark_performance()` helper is out of scope (the `tests/benchmarks/`
  suite + perf A/B gate already cover runtime measurement).

## Dependencies
- Parent: #62 (CPU half shipped). Relates #57 / #60 (iterative ML loops that re-run CPP).
