You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
SeqMut (model-aware, #57/#259) scores single mutations, explicit combined variants, and a greedy single-objective evolve. Real protein design needs to balance several objectives at once (e.g. raise binding while keeping stability and avoiding an off-target class) and return the trade-off (Pareto) front, not one greedy path. Today there is no population-based, multi-objective optimizer; users must hand-roll loops around SeqMut, and greedy stacking can't expose trade-offs. This is the substance of #59 (multi-objective) and sets up #60 (active learning / Bayesian optimization).
Goal
Add a SeqOpt optimizer class that runs multi-objective directed evolution over sequence variants — a re-implementation of the relevant DEAP algorithms (NSGA-II non-dominated sort + crowding) using a model-bound SeqMut as the fitness engine, deterministic via random_state. Parity-first: prototype against DEAP, then reimplement to be byte-exact to DEAP (mode="exact", the default) with an optional vectorized mode="fast" (equivalent within a stated tolerance), and ship a head-to-head comparison (correctness + speed) so we can decide whether to keep ours or depend on DEAP. The shipped runtime stays DEAP-free (core); DEAP is a dev/test-only dependency (prototype + parity oracle + benchmark).
Phase A — DEAP prototype / oracle (dev/test only). Stand up the full NSGA-II pipeline with DEAP (selNSGA2, cxUniform, a substitution mutation, eaMuPlusLambda, ParetoFront) over the SeqMut-backed fitness. This is the reference behavior and the parity oracle; it lives behind the [dev] extra and never on the shipped runtime path.
Phase B — reimplement, byte-exact by default. Our pure-Python SeqOpt reproduces DEAP byte-for-byte in mode="exact": the same RNG stream (Python random.Random, seeded and consumed in DEAP's exact call order — not numpy in this mode) and the same float-summation order in the non-dominated sort / crowding distance. An optional mode="fast" vectorizes with numpy for speed at the cost of byte-identity (numerically equivalent within a stated tolerance, with identical discrete Pareto-front membership).
Phase C — comparison harness (the decision input). A reproducible benchmark + parity report comparing SeqOpt(mode="exact"), SeqOpt(mode="fast"), and the DEAP oracle across a grid of pop_size × n_gen × n_objectives: correctness = byte-exact df_pareto equality (exact mode) / within-tolerance + identical front membership (fast mode); performance = wall-clock + peak memory. Output a table so the maintainer can decide ship-ours vs depend-on-DEAP.
Tension to be explicit about: byte-exactness and vectorized speed pull against each other (numpy changes float-op order), so there are two modes — exact proves correctness against DEAP, fast chases speed. The Phase-C numbers settle which to default to.
Design (DEAP -> SeqOpt mapping)
SeqOpt(Tool) -> run + eval; abbreviation seqopt; plot pair SeqOptPlot. House style: one run(...) with operators as string params + eval(...) for Pareto metrics + a plot pair; DEAP primitives are internal backend, not public methods. Adopt DEAP names for newly-introduced concepts (population, generation, non-dominated rank, crowding distance, Pareto front, hypervolume, NSGA-II); keep our names where they exist (variant, mutate, objective/score, df_seq/df_feat, random_state).
SeqOpt(Tool) in aaanalysis/protein_design/_seqopt.py (+ backend _backend/seqopt/), reusing a model-bound SeqMut as the fitness engine; register abbreviation seqopt.
objectives spec: list of (name, "max"/"min", source); source in {"delta_pred" (+target_class), "delta_cpp", "shift_score", "n_mut", callable(df_variant)->array}. At least 2 objectives for a Pareto run.
Re-implement (cite DEAP + Deb et al. 2002 in references.rst, never in code): fast non-dominated sort, crowding distance, DCD mating selection, uniform/one-point/two-point crossover over mutation-sets, substitution/shift mutation reusing SeqMut moves, (mu+lambda) survival.
df_pareto columns: entry, variant, n_mut, sequence_mut, one column per objective, rank (front index), crowding.
algorithm="greedy" single-objective greedy baseline implemented in SeqOpt (SeqMut has no evolve — SeqMut only scores; all search lives here). Core, no pro gate (DEAP is reimplemented, not a dependency).
Evaluation + plotting
eval(df_pareto, *, ref_point=None) -> df_eval with hypervolume, n_front (front size), spread/diversity; (convergence vs a reference front optional).
SeqOptPlot.pareto_front(df_pareto, x, y, ...) (2-D/3-D objective scatter colored by rank) and SeqOptPlot.hypervolume(...) (per-generation trace).
Constraints + reproducibility
constraints = list of feasibility callables -> penalty (mirror DeltaPenalty/ClosestValidPenalty semantics) enforcing n_mut_max, allowed positions/alphabet, forbidden residues.
Full random_state/seed threading (constructor + per-call), no reliance on process-global RNG state. mode="fast" uses np.random.default_rng; mode="exact" uses a local random.Random(seed) stream mirroring DEAP (still no globals) for byte-parity.
Parity, modes & DEAP comparison
Phase A: DEAP-backed reference NSGA-II pipeline wired to the SeqMut fitness, behind the [dev] extra; used only by parity tests + the benchmark.
Phase B:mode="exact" (default) reproduces the DEAP reference byte-for-byte on fixed seeds (matched random.Random stream + matched float-op order in sort/crowding); mode="fast" vectorizes (numpy) with a documented tolerance and identical Pareto-front membership vs exact.
Phase C: a benchmark harness producing the ours-vs-DEAP comparison table (correctness + wall-clock + peak memory across pop_size×n_gen×n_objectives), summarized in the PR for the ship-ours-vs-DEAP decision.
KPIs / Acceptance criteria
On a 2-objective task (maximize delta_pred for class A, minimize n_mut), run returns a non-dominated set where no returned variant dominates another (verified: pairwise non-domination holds for all rank==0 rows).
Non-dominated sort + crowding match a DEAP selNSGA2 reference ordering on a fixed synthetic fitness set (golden test, identical front assignment).
Hypervolume is non-decreasing across generations on a deterministic seed (monotonicity test); eval hypervolume matches a hand-computed 2-D reference within 1e-9.
Reproducible: two run calls with the same random_state produce byte-identical df_pareto.
Byte-exact parity: on ≥3 fixed seeds/configs, SeqOpt(mode="exact").run(...) yields a df_paretobyte-identical to the DEAP reference (pandas.testing.assert_frame_equal, exact — including row order, rank, crowding).
mode="fast" equivalence: identical Pareto-front membership vs exact, objective values within a stated tolerance (e.g. atol=1e-9); documented.
Comparison delivered: the Phase-C table (ours exact/fast vs DEAP — correctness + wall-clock + peak memory) is produced and summarized, making ship-ours-vs-DEAP a data-backed decision.
No new runtime dependency: the shipped path is pure-Python core (DEAP only in [dev]). Covered by ≥1 end-to-end test; one example notebook per public method runs with embedded outputs.
Bayesian optimization is a separate paradigm (surrogate + acquisition), not part of the EA engine; tracked toward Active learning #60 and, if added, gated behind an opt-in extra (CONFIRM-FIRST pyproject.toml).
DEAP is a dev/test-only dependency (Phase-A prototype, parity oracle, benchmark) under the [dev] extra — the shipped runtime stays DEAP-free (core, pure-Python).
CONFIRM-FIRST: new public SeqOpt/SeqOptPlot -> aaanalysis/__init__.py__all__; deap added to the [dev] extra -> pyproject.toml (test/benchmark only); any future BO extra -> pyproject.toml.
numpydoc (named Returns, per-method Examples include); df_pareto/df_eval added to DICT_DF_SCHEMAS.
unit + golden/property tests; random_state/seed contract.
no print() (use ut.print_out); bare ValueError/RuntimeError; constants in _constants.py; plot methods return ut.FigAxResult.
new domain terms in CONTEXT.md (population, generation, Pareto front, non-dominated rank, crowding distance, hypervolume) + a new ADR for the SeqOpt optimization layer and the parity-first reimplementation (exact/fast modes; the ship-ours-vs-DEAP decision backed by the Phase-C comparison).
Problem
SeqMut(model-aware, #57/#259) scores single mutations, explicit combined variants, and a greedy single-objectiveevolve. Real protein design needs to balance several objectives at once (e.g. raise binding while keeping stability and avoiding an off-target class) and return the trade-off (Pareto) front, not one greedy path. Today there is no population-based, multi-objective optimizer; users must hand-roll loops aroundSeqMut, and greedy stacking can't expose trade-offs. This is the substance of #59 (multi-objective) and sets up #60 (active learning / Bayesian optimization).Goal
Add a
SeqOptoptimizer class that runs multi-objective directed evolution over sequence variants — a re-implementation of the relevant DEAP algorithms (NSGA-II non-dominated sort + crowding) using a model-boundSeqMutas the fitness engine, deterministic viarandom_state. Parity-first: prototype against DEAP, then reimplement to be byte-exact to DEAP (mode="exact", the default) with an optional vectorizedmode="fast"(equivalent within a stated tolerance), and ship a head-to-head comparison (correctness + speed) so we can decide whether to keep ours or depend on DEAP. The shipped runtime stays DEAP-free (core); DEAP is a dev/test-only dependency (prototype + parity oracle + benchmark).Approach — parity-first (prototype → reimplement → compare)
selNSGA2,cxUniform, a substitution mutation,eaMuPlusLambda,ParetoFront) over theSeqMut-backed fitness. This is the reference behavior and the parity oracle; it lives behind the[dev]extra and never on the shipped runtime path.SeqOptreproduces DEAP byte-for-byte inmode="exact": the same RNG stream (Pythonrandom.Random, seeded and consumed in DEAP's exact call order — not numpy in this mode) and the same float-summation order in the non-dominated sort / crowding distance. An optionalmode="fast"vectorizes with numpy for speed at the cost of byte-identity (numerically equivalent within a stated tolerance, with identical discrete Pareto-front membership).SeqOpt(mode="exact"),SeqOpt(mode="fast"), and the DEAP oracle across a grid ofpop_size×n_gen×n_objectives: correctness = byte-exactdf_paretoequality (exact mode) / within-tolerance + identical front membership (fast mode); performance = wall-clock + peak memory. Output a table so the maintainer can decide ship-ours vs depend-on-DEAP.Design (DEAP -> SeqOpt mapping)
SeqOpt(Tool)->run+eval; abbreviationseqopt; plot pairSeqOptPlot. House style: onerun(...)with operators as string params +eval(...)for Pareto metrics + a plot pair; DEAP primitives are internal backend, not public methods. Adopt DEAP names for newly-introduced concepts (population, generation, non-dominated rank, crowding distance, Pareto front, hypervolume, NSGA-II); keep our names where they exist (variant, mutate, objective/score, df_seq/df_feat, random_state).IndividualCOL_VARIANT)populationpop_sizerunparamngenn_gen(COL_GENERATION)runparam + colFitnessMulti(weights)objectives=[(name,"max"/"min",source)]runparamselNSGA2algorithm="nsga2"(default)runparamselNSGA3/selSPEA2algorithm="nsga3"/"spea2"(future)runparamselTournamentDCDsortNondominated/sortLogNondominatedCOL_RANK)assignCrowdingDistCOL_CROWDING)cxUniform/cxOnePoint/cxTwoPointcrossover="uniform"/…+cx_probrunparammutUniformInt/mutFlipBitmutation="substitution"/…+mut_prob(reuses SeqMut moves)runparamvarAnd/varOreaMuPlusLambda/eaMuCommaLambda/eaSimplesurvival="mu_plus_lambda"(default)runparamParetoFrontdf_paretoHallOfFameDeltaPenalty/ClosestValidPenaltyconstraints=[callable](penalty / feasibility)runparamtools._hypervolumeevalmetricbenchmarks.toolsevalmetricStatistics/Logbookdf_eval+ trajectory DataFrameeaGenerateUpdatealgorithm="greedy"(greedy baseline, implemented inSeqOpt)runparamalgorithm="bayesian"(future, #60, extra-gated)Requirements
Core optimizer
SeqOpt(Tool)inaaanalysis/protein_design/_seqopt.py(+ backend_backend/seqopt/), reusing a model-boundSeqMutas the fitness engine; register abbreviationseqopt.run(df_seq, df_feat, objectives, *, algorithm="nsga2", pop_size=50, n_gen=20, crossover="uniform", mutation="substitution", cx_prob=0.5, mut_prob=0.2, survival="mu_plus_lambda", n_mut_max=5, region=None, to_aa=None, constraints=None, init="random", jmd_n_len=10, jmd_c_len=10)->df_pareto.objectivesspec: list of(name, "max"/"min", source);sourcein {"delta_pred"(+target_class),"delta_cpp","shift_score","n_mut", callable(df_variant)->array}. At least 2 objectives for a Pareto run.references.rst, never in code): fast non-dominated sort, crowding distance, DCD mating selection, uniform/one-point/two-point crossover over mutation-sets, substitution/shift mutation reusingSeqMutmoves,(mu+lambda)survival.df_paretocolumns:entry,variant,n_mut,sequence_mut, one column per objective,rank(front index),crowding.algorithm="greedy"single-objective greedy baseline implemented inSeqOpt(SeqMuthas noevolve—SeqMutonly scores; all search lives here). Core, noprogate (DEAP is reimplemented, not a dependency).Evaluation + plotting
eval(df_pareto, *, ref_point=None)->df_evalwithhypervolume,n_front(front size),spread/diversity; (convergence vs a reference front optional).SeqOptPlot.pareto_front(df_pareto, x, y, ...)(2-D/3-D objective scatter colored by rank) andSeqOptPlot.hypervolume(...)(per-generation trace).Constraints + reproducibility
constraints= list of feasibility callables -> penalty (mirrorDeltaPenalty/ClosestValidPenaltysemantics) enforcingn_mut_max, allowed positions/alphabet, forbidden residues.random_state/seedthreading (constructor + per-call), no reliance on process-global RNG state.mode="fast"usesnp.random.default_rng;mode="exact"uses a localrandom.Random(seed)stream mirroring DEAP (still no globals) for byte-parity.Parity, modes & DEAP comparison
SeqMutfitness, behind the[dev]extra; used only by parity tests + the benchmark.mode="exact"(default) reproduces the DEAP reference byte-for-byte on fixed seeds (matchedrandom.Randomstream + matched float-op order in sort/crowding);mode="fast"vectorizes (numpy) with a documented tolerance and identical Pareto-front membership vs exact.pop_size×n_gen×n_objectives), summarized in the PR for the ship-ours-vs-DEAP decision.KPIs / Acceptance criteria
delta_predfor class A, minimizen_mut),runreturns a non-dominated set where no returned variant dominates another (verified: pairwise non-domination holds for allrank==0rows).selNSGA2reference ordering on a fixed synthetic fitness set (golden test, identical front assignment).evalhypervolume matches a hand-computed 2-D reference within1e-9.runcalls with the samerandom_stateproduce byte-identicaldf_pareto.SeqOpt(mode="exact").run(...)yields adf_paretobyte-identical to the DEAP reference (pandas.testing.assert_frame_equal, exact — including row order,rank,crowding).mode="fast"equivalence: identical Pareto-front membership vs exact, objective values within a stated tolerance (e.g.atol=1e-9); documented.exact/fastvs DEAP — correctness + wall-clock + peak memory) is produced and summarized, making ship-ours-vs-DEAP a data-backed decision.[dev]). Covered by ≥1 end-to-end test; one example notebook per public method runs with embedded outputs.Scope / non-goals
algorithm=values), MO-CMA-ES (continuous), genetic programming.pyproject.toml).[dev]extra — the shipped runtime stays DEAP-free (core, pure-Python).Dependencies
SeqMutis the fitness engine).Standards checklist
_backend/seqopt/(extend backend-import-hygieneDEDICATED_OWNERS).SeqOpt/SeqOptPlot->aaanalysis/__init__.py__all__;deapadded to the[dev]extra ->pyproject.toml(test/benchmark only); any future BO extra ->pyproject.toml.Returns, per-methodExamplesinclude);df_pareto/df_evaladded toDICT_DF_SCHEMAS.random_state/seedcontract.print()(useut.print_out); bareValueError/RuntimeError; constants in_constants.py; plot methods returnut.FigAxResult.CONTEXT.md(population, generation, Pareto front, non-dominated rank, crowding distance, hypervolume) + a new ADR for the SeqOpt optimization layer and the parity-first reimplementation (exact/fast modes; the ship-ours-vs-DEAP decision backed by the Phase-C comparison).