Problem
CPP feature computation is O(n_sequences × n_scales × n_positions). CPU multi-core
parallelization already exists (n_jobs on CPP.run / run_num / eval, backed by
joblib.Parallel over feature/scale chunks), but the per-position matrix ops
(scale application, per-position mean/std) are a natural fit for GPU tensors and could
unlock proteome-scale screens (tens of thousands of sequences) and iterative ML loops.
Carved out of #62, whose CPU-parallelization half is already shipped (joblib n_jobs).
This issue is the GPU-only remainder.
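To make the shape of the work concrete, here is a minimal NumPy sketch of the per-position ops named above. It assumes sequences are integer-encoded and scales are stored as an (n_scales, n_amino_acids) table; the function names `apply_scales` and `position_stats` and the array layout are illustrative assumptions, not AAanalysis internals.

```python
import numpy as np


def apply_scales(seq_idx, scales):
    """Map integer-encoded residues to scale values (hypothetical helper).

    seq_idx: (n_sequences, n_positions) int array of amino-acid indices.
    scales:  (n_scales, n_amino_acids) float array.
    Returns: (n_sequences, n_scales, n_positions) float array.
    """
    # Fancy indexing yields (n_scales, n_sequences, n_positions);
    # move the sequence axis to the front.
    return np.moveaxis(scales[:, seq_idx], 0, 1)


def position_stats(values):
    """Per-position mean and std across sequences, for every scale."""
    return values.mean(axis=0), values.std(axis=0)


# Toy data: 1000 sequences, 30 positions, 50 scales, 20 amino acids.
rng = np.random.default_rng(0)
seq_idx = rng.integers(0, 20, size=(1000, 30))
scales = rng.random((50, 20))

values = apply_scales(seq_idx, scales)
mean, std = position_stats(values)
```

The sketch uses only indexing and axis reductions, which is the kind of workload that is expected to carry over to GPU array libraries; whether the real kernels map this directly is for the implementation to determine.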
Goal
An opt-in GPU backend for CPP's core matrix operations that is numerically
equivalent to the CPU/Cython path within a declared tolerance, with graceful CPU
fallback and no new required dependency.
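One way the graceful fallback could look is sketched below. The function `resolve_backend` and its return values are hypothetical, not an existing AAanalysis API; the sketch only shows that GPU libraries can be probed lazily so that nothing new is imported on the default CPU path.

```python
import importlib.util


def _cupy_usable():
    """True only if CuPy is importable and reports at least one device."""
    if importlib.util.find_spec("cupy") is None:
        return False
    try:
        import cupy
        return cupy.cuda.runtime.getDeviceCount() > 0
    except Exception:
        # Missing driver or broken install: treat as "no GPU".
        return False


def _torch_usable():
    """True only if PyTorch is importable and CUDA is available."""
    if importlib.util.find_spec("torch") is None:
        return False
    try:
        import torch
        return bool(torch.cuda.is_available())
    except Exception:
        return False


def resolve_backend(backend="cpu"):
    """Return 'numpy', 'cupy', or 'torch' (hypothetical names).

    'cpu' always resolves to 'numpy'. 'gpu' prefers CuPy, then PyTorch,
    and silently falls back to 'numpy' when neither is usable.
    """
    if backend == "cpu":
        return "numpy"
    if backend != "gpu":
        raise ValueError(f"unknown backend: {backend!r}")
    if _cupy_usable():
        return "cupy"
    if _torch_usable():
        return "torch"
    return "numpy"
```

Calling `resolve_backend("gpu")` on a machine without CuPy, PyTorch, or a device returns `"numpy"` rather than raising, which matches the fallback behaviour the goal asks for.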
Blockers / non-goals
- Hardware-gated: the ≥5× criterion and the GPU equivalence leg require a GPU CI
runner / dev machine, which the current matrix does not have. This issue cannot be
validated (and should not be merged) until GPU CI is available — call that out before
starting implementation.
- Re-scoping or replacing the existing CPU n_jobs parallelization is out of scope (done).
- A user-facing benchmark_performance() helper is out of scope (the tests/benchmarks/
suite + perf A/B gate already cover runtime measurement).
Requirements
- GPU implementation of the per-position matrix ops (scale application,
per-position mean/std), keeping the CPU/Cython kernel the default.
- Opt-in backend="gpu" path: CuPy primary, PyTorch fallback, automatic CPU fallback
when neither is importable or no device is present.
- GPU libraries ship as an optional extra (aaanalysis[gpu]); the default install and
CPU path gain no new required dependency. Adding the extra or any pyproject.toml
change is CONFIRM-FIRST and needs explicit maintainer approval.
- GPU results are not expected to be bit-identical to the CPU/Cython kernel, so this
is an output-affecting change. It must declare a tolerance tier (likely T2:
allclose(atol=1e-10, rtol=0) plus identical discrete filter/rank decisions, or T3 with
a documented band), commit a regression anchor pinning the decision artifact, and
land an ADR recording the decision. CPU vs GPU equivalence tests run on CPU; the GPU
leg is skipped when no device is present.
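The T2 tier mentioned above (values within atol=1e-10 with rtol=0, plus identical discrete decisions) could be checked along these lines. This is a sketch under assumptions: `check_t2_equivalence` is a hypothetical helper, the inputs are taken to be one score per feature, and "identical decisions" is reduced here to the same top-k feature indices in the same order.

```python
import numpy as np


def check_t2_equivalence(cpu_values, gpu_values, top_k=10, atol=1e-10):
    """Return (values_ok, ranks_ok) for two per-feature score vectors.

    values_ok: every GPU value lies within `atol` of the CPU value (rtol=0).
    ranks_ok:  the top_k feature indices, ordered by descending score,
               are identical for both backends.
    """
    cpu = np.asarray(cpu_values, dtype=float)
    gpu = np.asarray(gpu_values, dtype=float)

    values_ok = np.allclose(gpu, cpu, atol=atol, rtol=0)

    # Stable sort so that ties are broken the same way on both sides.
    cpu_top = np.argsort(-cpu, kind="stable")[:top_k]
    gpu_top = np.argsort(-gpu, kind="stable")[:top_k]
    ranks_ok = np.array_equal(cpu_top, gpu_top)

    return bool(values_ok), bool(ranks_ok)


cpu_scores = np.array([0.9, 0.5, 0.7, 0.1])
ok = check_t2_equivalence(cpu_scores, cpu_scores + 1e-12, top_k=3)
drifted = check_t2_equivalence(cpu_scores, cpu_scores + np.array([0.0, 0.5, 0.0, 0.0]), top_k=3)
```

Keeping the value check and the decision check separate makes it visible which half of the tier a regression breaks; the actual tier definition and regression anchor remain subject to the ADR.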
KPIs / Acceptance
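- ≥5× speedup over CPU (measured on a GPU runner).
- CPU/GPU equivalence within the declared tolerance tier
(identical top-feature identity + filter decisions; values within the stated band).
- With no GPU library or device available, backend="gpu" falls
back to CPU with no error and no new required dependency.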
Dependencies