Skip to content

Add SeqOpt: multi-objective ML-guided directed-evolution optimizer (NSGA-II reimplementation) #261

Description

@breimanntools

Problem

SeqMut (model-aware, #57/#259) scores single mutations, explicit combined variants, and a greedy single-objective evolve. Real protein design needs to balance several objectives at once (e.g. raise binding while keeping stability and avoiding an off-target class) and return the trade-off (Pareto) front, not one greedy path. Today there is no population-based, multi-objective optimizer; users must hand-roll loops around SeqMut, and greedy stacking can't expose trade-offs. This is the substance of #59 (multi-objective) and sets up #60 (active learning / Bayesian optimization).

Goal

Add a SeqOpt optimizer class that runs multi-objective directed evolution over sequence variants — a re-implementation of the relevant DEAP algorithms (NSGA-II non-dominated sort + crowding) using a model-bound SeqMut as the fitness engine, deterministic via random_state. Parity-first: prototype against DEAP, then reimplement to be byte-exact to DEAP (mode="exact", the default) with an optional vectorized mode="fast" (equivalent within a stated tolerance), and ship a head-to-head comparison (correctness + speed) so we can decide whether to keep ours or depend on DEAP. The shipped runtime stays DEAP-free (core); DEAP is a dev/test-only dependency (prototype + parity oracle + benchmark).

Approach — parity-first (prototype → reimplement → compare)

  1. Phase A — DEAP prototype / oracle (dev/test only). Stand up the full NSGA-II pipeline with DEAP (selNSGA2, cxUniform, a substitution mutation, eaMuPlusLambda, ParetoFront) over the SeqMut-backed fitness. This is the reference behavior and the parity oracle; it lives behind the [dev] extra and never on the shipped runtime path.
  2. Phase B — reimplement, byte-exact by default. Our pure-Python SeqOpt reproduces DEAP byte-for-byte in mode="exact": the same RNG stream (Python random.Random, seeded and consumed in DEAP's exact call order — not numpy in this mode) and the same float-summation order in the non-dominated sort / crowding distance. An optional mode="fast" vectorizes with numpy for speed at the cost of byte-identity (numerically equivalent within a stated tolerance, with identical discrete Pareto-front membership).
  3. Phase C — comparison harness (the decision input). A reproducible benchmark + parity report comparing SeqOpt(mode="exact"), SeqOpt(mode="fast"), and the DEAP oracle across a grid of pop_size × n_gen × n_objectives: correctness = byte-exact df_pareto equality (exact mode) / within-tolerance + identical front membership (fast mode); performance = wall-clock + peak memory. Output a table so the maintainer can decide ship-ours vs depend-on-DEAP.

Tension to be explicit about: byte-exactness and vectorized speed pull against each other (numpy changes float-op order), so there are two modes — exact proves correctness against DEAP, fast chases speed. The Phase-C numbers settle which to default to.

Design (DEAP -> SeqOpt mapping)

SeqOpt(Tool) -> run + eval; abbreviation seqopt; plot pair SeqOptPlot. House style: one run(...) with operators as string params + eval(...) for Pareto metrics + a plot pair; DEAP primitives are internal backend, not public methods. Adopt DEAP names for newly-introduced concepts (population, generation, non-dominated rank, crowding distance, Pareto front, hypervolume, NSGA-II); keep our names where they exist (variant, mutate, objective/score, df_seq/df_feat, random_state).

DEAP concept DEAP name SeqOpt name Where it lives
Individual Individual variant (COL_VARIANT) output row
Population / size population population / pop_size run param
Generations ngen generation / n_gen (COL_GENERATION) run param + col
Multi-objective fitness FitnessMulti(weights) objectives=[(name,"max"/"min",source)] run param
NSGA-II selection selNSGA2 algorithm="nsga2" (default) run param
NSGA-III / SPEA2 selNSGA3 / selSPEA2 algorithm="nsga3"/"spea2" (future) run param
Dominance tournament selTournamentDCD mating selection backend
Non-dominated sort sortNondominated / sortLogNondominated non-dominated rank (COL_RANK) backend + col
Crowding distance assignCrowdingDist crowding distance (COL_CROWDING) backend + col
Crossover cxUniform/cxOnePoint/cxTwoPoint crossover="uniform"/… + cx_prob run param
Mutation mutUniformInt/mutFlipBit mutation="substitution"/… + mut_prob (reuses SeqMut moves) run param
Variation step varAnd / varOr variation backend
Survival loop eaMuPlusLambda/eaMuCommaLambda/eaSimple survival="mu_plus_lambda" (default) run param
Pareto archive ParetoFront Pareto front = df_pareto output
Hall of fame HallOfFame best variants (single-objective) output
Constraints DeltaPenalty / ClosestValidPenalty constraints=[callable] (penalty / feasibility) run param
Hypervolume tools._hypervolume hypervolume eval metric
Convergence / diversity benchmarks.tools convergence / spread eval metric
Statistics / Logbook Statistics / Logbook df_eval + trajectory DataFrame output
ask-tell eaGenerateUpdate future CMA/BO hook future
greedy (ours) algorithm="greedy" (greedy baseline, implemented in SeqOpt) run param
Bayesian optimization not in DEAP algorithm="bayesian" (future, #60, extra-gated) future

Requirements

Core optimizer

  • SeqOpt(Tool) in aaanalysis/protein_design/_seqopt.py (+ backend _backend/seqopt/), reusing a model-bound SeqMut as the fitness engine; register abbreviation seqopt.
  • run(df_seq, df_feat, objectives, *, algorithm="nsga2", pop_size=50, n_gen=20, crossover="uniform", mutation="substitution", cx_prob=0.5, mut_prob=0.2, survival="mu_plus_lambda", n_mut_max=5, region=None, to_aa=None, constraints=None, init="random", jmd_n_len=10, jmd_c_len=10) -> df_pareto.
  • objectives spec: list of (name, "max"/"min", source); source in {"delta_pred" (+target_class), "delta_cpp", "shift_score", "n_mut", callable(df_variant)->array}. At least 2 objectives for a Pareto run.
  • Re-implement (cite DEAP + Deb et al. 2002 in references.rst, never in code): fast non-dominated sort, crowding distance, DCD mating selection, uniform/one-point/two-point crossover over mutation-sets, substitution/shift mutation reusing SeqMut moves, (mu+lambda) survival.
  • df_pareto columns: entry, variant, n_mut, sequence_mut, one column per objective, rank (front index), crowding.
  • algorithm="greedy" single-objective greedy baseline implemented in SeqOpt (SeqMut has no evolveSeqMut only scores; all search lives here). Core, no pro gate (DEAP is reimplemented, not a dependency).

Evaluation + plotting

  • eval(df_pareto, *, ref_point=None) -> df_eval with hypervolume, n_front (front size), spread/diversity; (convergence vs a reference front optional).
  • SeqOptPlot.pareto_front(df_pareto, x, y, ...) (2-D/3-D objective scatter colored by rank) and SeqOptPlot.hypervolume(...) (per-generation trace).

Constraints + reproducibility

  • constraints = list of feasibility callables -> penalty (mirror DeltaPenalty/ClosestValidPenalty semantics) enforcing n_mut_max, allowed positions/alphabet, forbidden residues.
  • Full random_state/seed threading (constructor + per-call), no reliance on process-global RNG state. mode="fast" uses np.random.default_rng; mode="exact" uses a local random.Random(seed) stream mirroring DEAP (still no globals) for byte-parity.

Parity, modes & DEAP comparison

  • Phase A: DEAP-backed reference NSGA-II pipeline wired to the SeqMut fitness, behind the [dev] extra; used only by parity tests + the benchmark.
  • Phase B: mode="exact" (default) reproduces the DEAP reference byte-for-byte on fixed seeds (matched random.Random stream + matched float-op order in sort/crowding); mode="fast" vectorizes (numpy) with a documented tolerance and identical Pareto-front membership vs exact.
  • Phase C: a benchmark harness producing the ours-vs-DEAP comparison table (correctness + wall-clock + peak memory across pop_size×n_gen×n_objectives), summarized in the PR for the ship-ours-vs-DEAP decision.

KPIs / Acceptance criteria

  • On a 2-objective task (maximize delta_pred for class A, minimize n_mut), run returns a non-dominated set where no returned variant dominates another (verified: pairwise non-domination holds for all rank==0 rows).
  • Non-dominated sort + crowding match a DEAP selNSGA2 reference ordering on a fixed synthetic fitness set (golden test, identical front assignment).
  • Hypervolume is non-decreasing across generations on a deterministic seed (monotonicity test); eval hypervolume matches a hand-computed 2-D reference within 1e-9.
  • Reproducible: two run calls with the same random_state produce byte-identical df_pareto.
  • Byte-exact parity: on ≥3 fixed seeds/configs, SeqOpt(mode="exact").run(...) yields a df_pareto byte-identical to the DEAP reference (pandas.testing.assert_frame_equal, exact — including row order, rank, crowding).
  • mode="fast" equivalence: identical Pareto-front membership vs exact, objective values within a stated tolerance (e.g. atol=1e-9); documented.
  • Comparison delivered: the Phase-C table (ours exact/fast vs DEAP — correctness + wall-clock + peak memory) is produced and summarized, making ship-ours-vs-DEAP a data-backed decision.
  • No new runtime dependency: the shipped path is pure-Python core (DEAP only in [dev]). Covered by ≥1 end-to-end test; one example notebook per public method runs with embedded outputs.

Scope / non-goals

  • Out: NSGA-III / SPEA2 (future algorithm= values), MO-CMA-ES (continuous), genetic programming.
  • Bayesian optimization is a separate paradigm (surrogate + acquisition), not part of the EA engine; tracked toward Active learning #60 and, if added, gated behind an opt-in extra (CONFIRM-FIRST pyproject.toml).
  • DEAP is a dev/test-only dependency (Phase-A prototype, parity oracle, benchmark) under the [dev] extra — the shipped runtime stays DEAP-free (core, pure-Python).

Dependencies

Standards checklist

  • Frontend/backend split; Validate block; backend trusts frontend; new dedicated _backend/seqopt/ (extend backend-import-hygiene DEDICATED_OWNERS).
  • CONFIRM-FIRST: new public SeqOpt/SeqOptPlot -> aaanalysis/__init__.py __all__; deap added to the [dev] extra -> pyproject.toml (test/benchmark only); any future BO extra -> pyproject.toml.
  • numpydoc (named Returns, per-method Examples include); df_pareto/df_eval added to DICT_DF_SCHEMAS.
  • unit + golden/property tests; random_state/seed contract.
  • no print() (use ut.print_out); bare ValueError/RuntimeError; constants in _constants.py; plot methods return ut.FigAxResult.
  • new domain terms in CONTEXT.md (population, generation, Pareto front, non-dominated rank, crowding distance, hypervolume) + a new ADR for the SeqOpt optimization layer and the parity-first reimplementation (exact/fast modes; the ship-ours-vs-DEAP decision backed by the Phase-C comparison).

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions