Add SeqOpt: multi-objective ML-guided directed-evolution optimizer (NSGA-II reimplementation)

## Problem

`SeqMut` (model-aware, #57/#259) scores single mutations, explicit combined variants, and a greedy single-objective `evolve`. Real protein design needs to **balance several objectives at once** (e.g. raise binding while keeping stability and avoiding an off-target class) and return the **trade-off (Pareto) front**, not one greedy path. Today there is no population-based, multi-objective optimizer; users must hand-roll loops around `SeqMut`, and greedy stacking can't expose trade-offs. This is the substance of #59 (multi-objective) and sets up #60 (active learning / Bayesian optimization).

## Goal

Add a `SeqOpt` optimizer class that runs **multi-objective directed evolution** over sequence variants — a re-implementation of the relevant DEAP algorithms (NSGA-II non-dominated sort + crowding) using a model-bound `SeqMut` as the fitness engine, deterministic via `random_state`. **Parity-first:** prototype against DEAP, then reimplement to be **byte-exact to DEAP** (`mode="exact"`, the default) with an optional vectorized **`mode="fast"`** (equivalent within a stated tolerance), and ship a head-to-head **comparison** (correctness + speed) so we can decide whether to keep ours or depend on DEAP. The **shipped runtime stays DEAP-free** (core); DEAP is a **dev/test-only** dependency (prototype + parity oracle + benchmark).

## Approach — parity-first (prototype → reimplement → compare)

1. **Phase A — DEAP prototype / oracle (dev/test only).** Stand up the full NSGA-II pipeline with DEAP (`selNSGA2`, `cxUniform`, a substitution mutation, `eaMuPlusLambda`, `ParetoFront`) over the `SeqMut`-backed fitness. This is the reference behavior and the parity oracle; it lives behind the `[dev]` extra and never on the shipped runtime path.
2. **Phase B — reimplement, byte-exact by default.** Our pure-Python `SeqOpt` reproduces DEAP **byte-for-byte** in `mode="exact"`: the same RNG stream (Python `random.Random`, seeded and consumed in DEAP's exact call order — *not* numpy in this mode) and the same float-summation order in the non-dominated sort / crowding distance. An optional `mode="fast"` vectorizes with numpy for speed at the cost of byte-identity (numerically equivalent within a stated tolerance, with identical discrete Pareto-front membership).
3. **Phase C — comparison harness (the decision input).** A reproducible benchmark + parity report comparing `SeqOpt(mode="exact")`, `SeqOpt(mode="fast")`, and the DEAP oracle across a grid of `pop_size` × `n_gen` × `n_objectives`: **correctness** = byte-exact `df_pareto` equality (exact mode) / within-tolerance + identical front membership (fast mode); **performance** = wall-clock + peak memory. Output a table so the maintainer can decide **ship-ours vs depend-on-DEAP**.

> **Tension to be explicit about:** byte-exactness and vectorized speed pull against each other (numpy changes float-op order), so there are two modes — `exact` proves correctness against DEAP, `fast` chases speed. The Phase-C numbers settle which to default to.

## Design (DEAP -> SeqOpt mapping)

`SeqOpt(Tool)` -> `run` + `eval`; abbreviation `seqopt`; plot pair `SeqOptPlot`. House style: **one `run(...)` with operators as string params** + `eval(...)` for Pareto metrics + a plot pair; DEAP primitives are **internal backend**, not public methods. Adopt DEAP names for newly-introduced concepts (population, generation, non-dominated rank, crowding distance, Pareto front, hypervolume, NSGA-II); keep our names where they exist (variant, mutate, objective/score, df_seq/df_feat, random_state).

| DEAP concept | DEAP name | SeqOpt name | Where it lives |
|---|---|---|---|
| Individual | `Individual` | variant (`COL_VARIANT`) | output row |
| Population / size | `population` | population / `pop_size` | `run` param |
| Generations | `ngen` | generation / `n_gen` (`COL_GENERATION`) | `run` param + col |
| Multi-objective fitness | `FitnessMulti(weights)` | `objectives=[(name,"max"/"min",source)]` | `run` param |
| NSGA-II selection | `selNSGA2` | `algorithm="nsga2"` (default) | `run` param |
| NSGA-III / SPEA2 | `selNSGA3` / `selSPEA2` | `algorithm="nsga3"/"spea2"` (future) | `run` param |
| Dominance tournament | `selTournamentDCD` | mating selection | backend |
| Non-dominated sort | `sortNondominated` / `sortLogNondominated` | non-dominated rank (`COL_RANK`) | backend + col |
| Crowding distance | `assignCrowdingDist` | crowding distance (`COL_CROWDING`) | backend + col |
| Crossover | `cxUniform`/`cxOnePoint`/`cxTwoPoint` | `crossover="uniform"/…` + `cx_prob` | `run` param |
| Mutation | `mutUniformInt`/`mutFlipBit` | `mutation="substitution"/…` + `mut_prob` (reuses SeqMut moves) | `run` param |
| Variation step | `varAnd` / `varOr` | variation | backend |
| Survival loop | `eaMuPlusLambda`/`eaMuCommaLambda`/`eaSimple` | `survival="mu_plus_lambda"` (default) | `run` param |
| Pareto archive | `ParetoFront` | Pareto front = `df_pareto` | output |
| Hall of fame | `HallOfFame` | best variants (single-objective) | output |
| Constraints | `DeltaPenalty` / `ClosestValidPenalty` | `constraints=[callable]` (penalty / feasibility) | `run` param |
| Hypervolume | `tools._hypervolume` | hypervolume | `eval` metric |
| Convergence / diversity | `benchmarks.tools` | convergence / spread | `eval` metric |
| Statistics / Logbook | `Statistics` / `Logbook` | `df_eval` + trajectory DataFrame | output |
| ask-tell | `eaGenerateUpdate` | future CMA/BO hook | future |
| greedy (ours) | — | `algorithm="greedy"` (greedy baseline, implemented in `SeqOpt`) | `run` param |
| Bayesian optimization | not in DEAP | `algorithm="bayesian"` (future, #60, extra-gated) | future |

## Requirements

### Core optimizer
- [ ] `SeqOpt(Tool)` in `aaanalysis/protein_design/_seqopt.py` (+ backend `_backend/seqopt/`), reusing a model-bound `SeqMut` as the fitness engine; register abbreviation `seqopt`.
- [ ] `run(df_seq, df_feat, objectives, *, algorithm="nsga2", pop_size=50, n_gen=20, crossover="uniform", mutation="substitution", cx_prob=0.5, mut_prob=0.2, survival="mu_plus_lambda", n_mut_max=5, region=None, to_aa=None, constraints=None, init="random", jmd_n_len=10, jmd_c_len=10)` -> `df_pareto`.
- [ ] `objectives` spec: list of `(name, "max"/"min", source)`; `source` in {`"delta_pred"` (+target_class), `"delta_cpp"`, `"shift_score"`, `"n_mut"`, callable(df_variant)->array}. At least 2 objectives for a Pareto run.
- [ ] Re-implement (cite DEAP + Deb et al. 2002 in `references.rst`, never in code): fast non-dominated sort, crowding distance, DCD mating selection, uniform/one-point/two-point crossover over mutation-sets, substitution/shift mutation reusing `SeqMut` moves, `(mu+lambda)` survival.
- [ ] `df_pareto` columns: `entry`, `variant`, `n_mut`, `sequence_mut`, one column per objective, `rank` (front index), `crowding`.
- [ ] `algorithm="greedy"` single-objective greedy baseline implemented **in `SeqOpt`** (`SeqMut` has no `evolve` — `SeqMut` only *scores*; all search lives here). Core, **no `pro` gate** (DEAP is reimplemented, not a dependency).

### Evaluation + plotting
- [ ] `eval(df_pareto, *, ref_point=None)` -> `df_eval` with `hypervolume`, `n_front` (front size), `spread`/diversity; (convergence vs a reference front optional).
- [ ] `SeqOptPlot.pareto_front(df_pareto, x, y, ...)` (2-D/3-D objective scatter colored by rank) and `SeqOptPlot.hypervolume(...)` (per-generation trace).

### Constraints + reproducibility
- [ ] `constraints` = list of feasibility callables -> penalty (mirror `DeltaPenalty`/`ClosestValidPenalty` semantics) enforcing `n_mut_max`, allowed positions/alphabet, forbidden residues.
- [ ] Full `random_state`/`seed` threading (constructor + per-call), no reliance on process-global RNG state. `mode="fast"` uses `np.random.default_rng`; `mode="exact"` uses a local `random.Random(seed)` stream mirroring DEAP (still no globals) for byte-parity.

### Parity, modes & DEAP comparison
- [ ] **Phase A:** DEAP-backed reference NSGA-II pipeline wired to the `SeqMut` fitness, behind the `[dev]` extra; used only by parity tests + the benchmark.
- [ ] **Phase B:** `mode="exact"` (default) reproduces the DEAP reference **byte-for-byte** on fixed seeds (matched `random.Random` stream + matched float-op order in sort/crowding); `mode="fast"` vectorizes (numpy) with a documented tolerance and identical Pareto-front membership vs exact.
- [ ] **Phase C:** a benchmark harness producing the ours-vs-DEAP comparison table (correctness + wall-clock + peak memory across `pop_size`×`n_gen`×`n_objectives`), summarized in the PR for the ship-ours-vs-DEAP decision.

## KPIs / Acceptance criteria
- [ ] On a 2-objective task (maximize `delta_pred` for class A, minimize `n_mut`), `run` returns a non-dominated set where no returned variant dominates another (verified: pairwise non-domination holds for all `rank==0` rows).
- [ ] Non-dominated sort + crowding match a DEAP `selNSGA2` reference ordering on a fixed synthetic fitness set (golden test, identical front assignment).
- [ ] Hypervolume is non-decreasing across generations on a deterministic seed (monotonicity test); `eval` hypervolume matches a hand-computed 2-D reference within `1e-9`.
- [ ] Reproducible: two `run` calls with the same `random_state` produce byte-identical `df_pareto`.
- [ ] **Byte-exact parity:** on ≥3 fixed seeds/configs, `SeqOpt(mode="exact").run(...)` yields a `df_pareto` **byte-identical** to the DEAP reference (`pandas.testing.assert_frame_equal`, exact — including row order, `rank`, `crowding`).
- [ ] **`mode="fast"` equivalence:** identical Pareto-front *membership* vs exact, objective values within a stated tolerance (e.g. `atol=1e-9`); documented.
- [ ] **Comparison delivered:** the Phase-C table (ours `exact`/`fast` vs DEAP — correctness + wall-clock + peak memory) is produced and summarized, making ship-ours-vs-DEAP a data-backed decision.
- [ ] **No new runtime dependency:** the shipped path is pure-Python core (DEAP only in `[dev]`). Covered by ≥1 end-to-end test; one example notebook per public method runs with embedded outputs.

## Scope / non-goals
- **Out:** NSGA-III / SPEA2 (future `algorithm=` values), MO-CMA-ES (continuous), genetic programming.
- **Bayesian optimization** is a separate paradigm (surrogate + acquisition), not part of the EA engine; tracked toward #60 and, if added, gated behind an opt-in extra (CONFIRM-FIRST `pyproject.toml`).
- DEAP is a **dev/test-only** dependency (Phase-A prototype, parity oracle, benchmark) under the `[dev]` extra — the **shipped runtime stays DEAP-free** (core, pure-Python).

## Dependencies
- blocked-by #259 / #57 (model-aware `SeqMut` is the fitness engine).
- Implements the substance of #59 (multi-objective optimization); relates #60 (active learning / Bayesian optimization). Do not auto-close #59/#60 — maintainer to decide whether this supersedes #59.

## Standards checklist
- [ ] Frontend/backend split; Validate block; backend trusts frontend; new dedicated `_backend/seqopt/` (extend backend-import-hygiene `DEDICATED_OWNERS`).
- [ ] **CONFIRM-FIRST:** new public `SeqOpt`/`SeqOptPlot` -> `aaanalysis/__init__.py` `__all__`; **`deap` added to the `[dev]` extra** -> `pyproject.toml` (test/benchmark only); any future BO extra -> `pyproject.toml`.
- [ ] numpydoc (named `Returns`, per-method `Examples` include); `df_pareto`/`df_eval` added to `DICT_DF_SCHEMAS`.
- [ ] unit + golden/property tests; `random_state`/`seed` contract.
- [ ] no `print()` (use `ut.print_out`); bare `ValueError`/`RuntimeError`; constants in `_constants.py`; plot methods return `ut.FigAxResult`.
- [ ] new domain terms in `CONTEXT.md` (population, generation, Pareto front, non-dominated rank, crowding distance, hypervolume) + a new ADR for the SeqOpt optimization layer and the parity-first reimplementation (exact/fast modes; the ship-ours-vs-DEAP decision backed by the Phase-C comparison).



DEAP concept	DEAP name	SeqOpt name	Where it lives
Individual	`Individual`	variant (`COL_VARIANT`)	output row
Population / size	`population`	population / `pop_size`	`run` param
Generations	`ngen`	generation / `n_gen` (`COL_GENERATION`)	`run` param + col
Multi-objective fitness	`FitnessMulti(weights)`	`objectives=[(name,"max"/"min",source)]`	`run` param
NSGA-II selection	`selNSGA2`	`algorithm="nsga2"` (default)	`run` param
NSGA-III / SPEA2	`selNSGA3` / `selSPEA2`	`algorithm="nsga3"/"spea2"` (future)	`run` param
Dominance tournament	`selTournamentDCD`	mating selection	backend
Non-dominated sort	`sortNondominated` / `sortLogNondominated`	non-dominated rank (`COL_RANK`)	backend + col
Crowding distance	`assignCrowdingDist`	crowding distance (`COL_CROWDING`)	backend + col
Crossover	`cxUniform`/`cxOnePoint`/`cxTwoPoint`	`crossover="uniform"/…` + `cx_prob`	`run` param
Mutation	`mutUniformInt`/`mutFlipBit`	`mutation="substitution"/…` + `mut_prob` (reuses SeqMut moves)	`run` param
Variation step	`varAnd` / `varOr`	variation	backend
Survival loop	`eaMuPlusLambda`/`eaMuCommaLambda`/`eaSimple`	`survival="mu_plus_lambda"` (default)	`run` param
Pareto archive	`ParetoFront`	Pareto front = `df_pareto`	output
Hall of fame	`HallOfFame`	best variants (single-objective)	output
Constraints	`DeltaPenalty` / `ClosestValidPenalty`	`constraints=[callable]` (penalty / feasibility)	`run` param
Hypervolume	`tools._hypervolume`	hypervolume	`eval` metric
Convergence / diversity	`benchmarks.tools`	convergence / spread	`eval` metric
Statistics / Logbook	`Statistics` / `Logbook`	`df_eval` + trajectory DataFrame	output
ask-tell	`eaGenerateUpdate`	future CMA/BO hook	future
greedy (ours)	—	`algorithm="greedy"` (greedy baseline, implemented in `SeqOpt`)	`run` param
Bayesian optimization	not in DEAP	`algorithm="bayesian"` (future, #60, extra-gated)	future

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SeqOpt: multi-objective ML-guided directed-evolution optimizer (NSGA-II reimplementation) #261

Problem

Goal

Approach — parity-first (prototype → reimplement → compare)

Design (DEAP -> SeqOpt mapping)

Requirements

Core optimizer

Evaluation + plotting

Constraints + reproducibility

Parity, modes & DEAP comparison

KPIs / Acceptance criteria

Scope / non-goals

Dependencies

Standards checklist

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Add SeqOpt: multi-objective ML-guided directed-evolution optimizer (NSGA-II reimplementation) #261

Description

Problem

Goal

Approach — parity-first (prototype → reimplement → compare)

Design (DEAP -> SeqOpt mapping)

Requirements

Core optimizer

Evaluation + plotting

Constraints + reproducibility

Parity, modes & DEAP comparison

KPIs / Acceptance criteria

Scope / non-goals

Dependencies

Standards checklist

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions