feat(seqopt): pure-Python EA operators + DEAP parity + SeqOptPlot (protein engineering) #271

Merged
breimanntools merged 13 commits into master from feat/seqopt-deap-parity
Jun 25, 2026

@breimanntools

Summary

Completes the parity-first half of #261 (deferred from PR #267) and substantially extends SeqOpt (pro). Builds on ADR-0043; recorded in ADR-0045.

⚠️ #261 stays open (no closing keyword) — ongoing.

Pure-Python EA operator set (DEAP-free runtime)

Beyond the NSGA-II core: varAnd/varOr variation; (μ+λ)/(μ,λ)/eaSimple survival; constraints (DeltaPenalty / ClosestValidPenalty); uniform/one-/two-point crossover; substitution/shift mutation; single-objective Hall of Fame; a cumulative Pareto archive (rank-0 = best-ever, none lost to crowding); hypervolume / spread / convergence metrics. Objectives accept any callable(sequence) -> float (external scikit/torch model or web API), cached per variant.
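
As an illustration of the operator families listed above, here is a minimal pure-Python sketch of uniform crossover and substitution mutation on amino-acid strings. The names `cx_uniform`, `mut_substitution`, and `indpb` are hypothetical (the parameter name follows DEAP's convention) and are not taken from the SeqOpt code; the sketch assumes equal-length sequences over the 20 canonical residues.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def cx_uniform(seq_a, seq_b, indpb, rng):
    """Swap residues position-wise with probability indpb (hypothetical helper)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("parents must have equal length")
    child_a, child_b = list(seq_a), list(seq_b)
    for i in range(len(child_a)):
        if rng.random() < indpb:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return "".join(child_a), "".join(child_b)


def mut_substitution(seq, indpb, rng):
    """Replace each residue by a different amino acid with probability indpb."""
    out = list(seq)
    for i, aa in enumerate(out):
        if rng.random() < indpb:
            out[i] = rng.choice([x for x in AMINO_ACIDS if x != aa])
    return "".join(out)


rng = random.Random(0)  # seeded generator so runs are reproducible
child_a, child_b = cx_uniform("AAAA", "CCCC", indpb=0.5, rng=rng)
mutant = mut_substitution(child_a, indpb=0.25, rng=rng)
```

Passing an explicit `random.Random` instance keeps the operators free of global state, which is one simple way to make an evolutionary run reproducible.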

DEAP parity (dev/test-only oracle)

deap added to [dev] only. test_seqopt_deap_parity.py proves our sortNondominated/assignCrowdingDist/selNSGA2 match DEAP — identical rank (incl. ties), crowding values+ordering within atol, survivor profile. Phase-C comparison (.github/scripts/seqopt_deap_comparison.py): ours-fast is 3–7× faster than DEAP and dependency-free → ship ours. engine="exact"|"fast" give identical fronts; fast is memory-bounded (chunked, 2.6× leaner at n=3000).

Visualization (SeqOptPlot)

pareto_front (2-D/3-D), parallel_coordinates, convergence (best/mean/worst band), hypervolume, mutation_map (front substitution-enrichment heatmap), genealogy (mutational-lineage tree). cmap is a parameter throughout (package convention).

Docs / framing

The class docstring carries a DEAP-mapping table and clearly frames SeqOpt as protein engineering (ML-guided directed evolution, [Yang19]/[Wittmann21]) vs de novo design (RFdiffusion→ProteinMPNN→AlphaFold, [Yang26]). New: 8 per-method example notebooks (realistic GSEC "super-substrate" task) + tutorial7_protein_engineering.

Bugs fixed (found via realistic data)

mode="impact" kept the full df_seq_ref → NaN-tripped check_df_seq when the reference came from load_dataset; now position-cols only (+ regression test).

Verification

469-test broad gate green locally (SeqOpt suite, all meta-tests, docstrings, parity); branch merged with current master.

🤖 Generated with Claude Code

breimanntools and others added 13 commits June 25, 2026 05:42
Pure-Python (no runtime dep), closing the gaps from the NSGA-II-only first cut:
- variation varAnd/varOr; survival mu_plus_lambda/mu_comma_lambda/ea_simple
- constraints (feasibility callables) with DeltaPenalty / ClosestValidPenalty
- single-objective Hall of Fame (SeqOpt.hall_of_fame_) beside the Pareto archive
- convergence metric (generational distance to a ref_front) in eval
- engine='exact'|'fast' (numpy-vectorized non-dominated sort; numerically identical
  front, faster); crowding now uses DEAP's nobj*span normalization
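
The nobj*span normalization mentioned above can be sketched in NumPy as follows. This is an illustrative reimplementation, not the package's code: the function name `crowding_distance` is hypothetical, and the input is assumed to be a 2-D array with one row per individual and one column per objective.

```python
import numpy as np


def crowding_distance(F):
    """Crowding distance with each objective's gap divided by nobj * span."""
    F = np.asarray(F, dtype=float)
    if F.ndim != 2 or F.shape[0] == 0:
        raise ValueError("F must be a non-empty 2-D array (individuals x objectives)")
    n, nobj = F.shape
    dist = np.zeros(n)
    for j in range(nobj):
        order = np.argsort(F[:, j], kind="stable")
        # Boundary individuals always receive infinite distance.
        dist[order[0]] = dist[order[-1]] = np.inf
        span = F[order[-1], j] - F[order[0], j]
        if span == 0:
            continue  # all values equal: this objective contributes nothing
        # Interior individuals: gap between their two sorted neighbours.
        dist[order[1:-1]] += (F[order[2:], j] - F[order[:-2], j]) / (nobj * span)
    return dist


d = crowding_distance([[0, 4], [1, 2], [2, 1], [4, 0]])
```

In this example the two interior points each accumulate 0.25 from one objective and 0.375 from the other, while both boundary points are infinite.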

DEAP parity (dev/test-only oracle; runtime stays DEAP-free):
- deap added to [dev]; test_seqopt_deap_parity.py asserts our sort/crowding/selNSGA2
  reproduce DEAP's sortNondominated/assignCrowdingDist/selNSGA2 on synthetic fitness
  (identical rank incl. ties, crowding values+ordering within atol, selNSGA2 profile)
- Phase-C comparison (.github/scripts/seqopt_deap_comparison.py): ours-exact/fast vs
  DEAP, correctness + wall-clock + peak memory -> ship-ours (fast beats DEAP, e.g.
  ~14ms vs ~102ms at 500x3, dependency-free)

Docs: ADR-XXXX (number-last), CONTEXT.md EA-operator/engine/convergence terms,
release note. 85 SeqOpt tests + 447 in the broader gate green; docstrings/param
coverage clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pin what is actually invariant vs DEAP: non-dominated rank always identical (incl.
ties); crowding values+ordering and selNSGA2 survivor profile identical on
continuous fitness (within 1e-9); survivor rank-distribution identical under heavy
ties (the exact tied-individual kept is arbitrary in DEAP too). Drops the
over-strict exact-set/profile-under-duplicates claims that don't hold (boundary
points tie at inf crowding even for continuous objectives).
…ence history

Visualization (SeqOptPlot): new convergence (per-generation hypervolume + spread +
per-objective best, from the new SeqOpt.history_), 3-D pareto_front (optional z),
and parallel_coordinates for many-objective fronts. Per-generation history is now
tracked (spread + per-objective best, not only hypervolume) and exposed as
SeqOpt.history_.

Objectives: a callable source now receives the variant SEQUENCE (fn(sequence)->
float) and is cached per distinct variant, so any external predictor — a scikit/
torch model or a sequence-level tool / web API — can be optimized jointly with the
model-on-features objectives; pure-callable multi-objective runs need no CPP model.
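
A minimal sketch of the per-variant caching idea described above, assuming a plain dictionary keyed by the sequence string. `make_cached_objective` and `toy_leucine_fraction` are hypothetical names used only for illustration; the toy predictor stands in for an external model or web API.

```python
def make_cached_objective(fn):
    """Wrap fn(sequence) -> float so each distinct sequence is scored once."""
    cache = {}

    def objective(sequence):
        if sequence not in cache:
            cache[sequence] = float(fn(sequence))
        return cache[sequence]

    objective.cache = cache  # exposed for inspection in this sketch
    return objective


def toy_leucine_fraction(sequence):
    """Stand-in for an expensive predictor: fraction of leucine residues."""
    return sequence.count("L") / len(sequence)


objective = make_cached_objective(toy_leucine_fraction)
score_first = objective("LLAA")   # computed by the predictor
score_again = objective("LLAA")   # served from the cache
```

Caching by sequence matters most when the same variant reappears across generations, since a remote predictor would otherwise be queried repeatedly for an identical input.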

Two executed example notebooks (seqopt_convergence, seqopt_parallel_coordinates)
demonstrate the views + the external-predictor recipe. 53 SeqOpt frontend tests
(+19) and 459 in the broader gate green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewrite all six SeqOpt example notebooks around a real task — 'design a super
gamma-secretase substrate': load_features('DOM_GSEC') (150 CPP features) +
load_dataset('DOM_GSEC') + a simple RandomForest, take a non-substrate wild-type
and mutate its TMD to maximize predicted substrate probability with few mutations.
They demonstrate run (nsga2/greedy, impact/importance, varOr/ea_simple/operators,
constraints + Hall of Fame, external-predictor callable objective), eval
(hypervolume/spread/convergence), and all four SeqOptPlot views (pareto_front 2-D/
3-D, parallel_coordinates, convergence, hypervolume), with executed outputs.
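
The convergence metric is described in this PR as the generational distance to a ref_front. One common form of that metric is the mean Euclidean distance from each front point to its nearest reference point; the sketch below assumes that form, and the exact definition used by eval is not stated here. The function name is hypothetical.

```python
import numpy as np


def generational_distance(front, ref_front):
    """Mean distance from each front point to its nearest reference point."""
    front = np.atleast_2d(np.asarray(front, dtype=float))
    ref = np.atleast_2d(np.asarray(ref_front, dtype=float))
    # Pairwise differences, shape (n_front, n_ref, n_objectives).
    diff = front[:, None, :] - ref[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return float(nearest.mean())


gd = generational_distance([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0]])
```

A value of zero means every point of the front lies on the reference front; larger values mean the front is further from it.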

Fix (found via the realistic reference): SeqOpt mode='impact' refit kept the FULL
df_seq_ref, so a reference from load_dataset (carrying jmd_n/tmd/jmd_c/label) NaN-
tripped check_df_seq on the appended variant row. Now keep only the position-based
columns; add a regression test with an extra-column reference.

460-test broad gate + docstrings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hero plots

Critical-assessment improvements:
- engine='fast' non-dominated sort now computes the dominance matrix in row-chunks
  (adaptive block), bounding the transient to O(block*n*m) vs O(n^2*m) — ~2.6x
  leaner peak memory at n=3000; identical fronts (parity unchanged). Realistic
  pool sizes were never a problem; this makes pathologically large populations safe.
- run() keeps a cumulative non-dominated archive (DEAP ParetoFront analogue),
  merged into the final population so the returned rank=0 front is the best-ever
  set — no solution lost to per-generation crowding truncation.
- history_ now tracks per-objective best/mean/worst per generation.
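
The row-chunked dominance test and the cumulative archive can be sketched together as below. This is an illustration under stated assumptions, not the package's implementation: all objectives are minimized, the block size is fixed rather than adaptive, and `nondominated_mask` and `update_archive` are hypothetical names.

```python
import numpy as np


def nondominated_mask(F, block=256):
    """Boolean mask of non-dominated rows of F (minimization), in row blocks."""
    F = np.asarray(F, dtype=float)
    n = F.shape[0]
    dominated = np.zeros(n, dtype=bool)
    for start in range(0, n, block):
        chunk = F[start:start + block]  # shape (b, m)
        # chunk[i] dominates F[j]: no worse in all objectives, better in one.
        no_worse = (chunk[:, None, :] <= F[None, :, :]).all(axis=2)  # (b, n)
        better = (chunk[:, None, :] < F[None, :, :]).any(axis=2)     # (b, n)
        dominated |= (no_worse & better).any(axis=0)
    return ~dominated


def update_archive(archive_F, new_F, block=256):
    """Merge new points into the archive and keep only the non-dominated set."""
    merged = np.unique(np.vstack([archive_F, new_F]), axis=0)
    return merged[nondominated_mask(merged, block)]


archive = np.array([[1.0, 1.0]])
archive = update_archive(archive, np.array([[0.0, 3.0], [2.0, 2.0]]))
```

Only a block-by-n-by-m comparison array is alive at any time, which is the reason the transient memory scales with the block size instead of with n squared.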

Hero plots (the genre's standard views):
- SeqOptPlot.mutation_map — position x amino-acid substitution-enrichment heatmap
  across the front (the directed-evolution 'which mutations won' view).
- SeqOptPlot.convergence gains the classic GA best/mean/worst fitness band.
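
The data behind a position x amino-acid heatmap of this kind can be sketched as a substitution-frequency matrix. This is a simplified stand-in: how SeqOptPlot.mutation_map defines enrichment is not stated here, so the sketch reports the plain fraction of front members carrying each substitution. The function name is hypothetical, and sequences are assumed to be substitution-only variants over the 20 canonical residues.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def substitution_frequency(wild_type, front):
    """Fraction of front sequences carrying each substitution.

    Returns an array of shape (20, len(wild_type)): rows are amino acids,
    columns are positions.
    """
    aa_index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    counts = np.zeros((len(AMINO_ACIDS), len(wild_type)))
    for seq in front:
        for pos, (wt_aa, var_aa) in enumerate(zip(wild_type, seq)):
            if var_aa != wt_aa:
                counts[aa_index[var_aa], pos] += 1
    return counts / max(len(front), 1)


freq = substitution_frequency("AAA", ["CAA", "CAG", "AAA", "AAG"])
```

The resulting matrix could be passed to a heatmap call such as matplotlib's `imshow`, with the colormap supplied as a parameter.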

New executed notebook seqopt_mutation_map; tests for mutation_map + the band +
archive. 465-test broad gate + docstrings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- SeqOptPlot.genealogy: mutational-lineage tree (wild-type -> variants by accumulated
  mutations, linked by mutation-set containment, colored by the first objective) - the
  directed-evolution analogue of a genealogy tree, matplotlib-only (no networkx).
- SeqOpt class docstring now carries a rendered list-table mapping every run/eval
  method + parameter value to its DEAP function (selNSGA2/varAnd/varOr/eaMuPlusLambda/
  cxUniform/DeltaPenalty/...), with the aaanalysis-only rows called out.
- New executed seqopt_genealogy notebook + tests. 273-test gate + docstrings clean.
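
Linking variants by mutation-set containment can be sketched as follows. The rule used here (each variant's parent is the variant with the largest strictly contained mutation set, falling back to the wild-type) is an assumption for illustration, and `mutation_set` and `lineage_edges` are hypothetical names, not the SeqOptPlot.genealogy internals.

```python
def mutation_set(wild_type, variant):
    """Substitutions of variant relative to wild_type as (position, wt, new)."""
    return frozenset(
        (i + 1, w, v) for i, (w, v) in enumerate(zip(wild_type, variant)) if w != v
    )


def lineage_edges(wild_type, variants):
    """Return (parent, child) edges linking variants by mutation containment."""
    muts = {v: mutation_set(wild_type, v) for v in variants}
    edges = []
    for child, m_child in muts.items():
        if not m_child:
            continue  # identical to the wild-type: it is the root, not a child
        # Candidate parents carry a strict subset of the child's mutations.
        parents = [p for p, m_p in muts.items() if m_p < m_child]
        parent = max(parents, key=lambda p: len(muts[p]), default=wild_type)
        edges.append((parent, child))
    return edges


edges = lineage_edges("AAAA", ["CAAA", "CCAA", "AAAG"])
```

Here the double mutant "CCAA" attaches to the single mutant "CAAA" because its mutations contain those of "CAAA", while "CAAA" and "AAAG" attach directly to the wild-type.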
…onsistency)

pareto_front / parallel_coordinates / mutation_map / genealogy take a user-overridable
cmap= (defaults unchanged), matching the CPPPlot / AAMutPlot / SeqMutPlot convention of
colormap-as-parameter instead of a hardcoded name.
tutorial7_protein_design: an executed end-to-end case study — train a GSEC substrate
classifier, design a 'super substrate' from a non-substrate, and read the result with
every SeqOptPlot view (pareto_front 2-D/3-D, convergence, mutation_map, genealogy,
parallel_coordinates) plus SHAP-guided impact mode. Wired into the Tutorials toctree
under a new Protein Design section.
…) + refs

Draw the paradigm distinction clearly in the SeqOpt class docstring, the tutorial,
and CONTEXT.md: SeqOpt does protein *engineering* — machine-learning-guided directed
evolution of an existing sequence [Yang19] — explicitly NOT de novo protein design
(generating new proteins). Introduce de novo design as the contrasting paradigm via
the canonical structure-first pipeline RFdiffusion [Watson23] -> ProteinMPNN
[Dauparas22] -> AlphaFold [Jumper21], reviewed in [DeNovoReview26]. Add all five
references to references.rst; tutorial retitled 'Protein Engineering with SeqOpt'
with the distinction + hyperlinked refs; Tutorials toctree section renamed.

Docstring citations resolve (0 defects); 104-test gate green.
…ce reviews

Read the two provided reviews and fixed the citations: ML-guided directed evolution
is Wittmann, Johnston, Wu & Arnold (2021), Curr. Opin. Struct. Biol. (not the Yang19
I had guessed); the de novo design review is Yang et al. (2026), Nature 652:1139.
Sharpened the distinction in the SeqOpt docstring + tutorial + CONTEXT.md using the
reviews' own framing (de novo = build new proteins from the ground up; engineering =
iterative mutation/selection of an existing protein, ML learns the fitness model).
Citations resolve; 64-test gate green.
…ed evolution

Add [Yang19] (Nature Methods 2019, the foundational ML-guided directed-evolution-for-
protein-engineering review) alongside [Wittmann21] in the SeqOpt docstring, tutorial
and CONTEXT.md. Citations resolve.
…rity

# Conflicts:
#	docs/source/index/release_notes.rst
…erators decision

Number the previously number-less parity ADR (one past the current master max 0044 =
find-features protocol), set status Accepted, regenerate INDEX.
@codecov

codecov Bot commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 94.75000% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.13%. Comparing base (e7e272e) to head (2178624).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| aaanalysis/protein_design_pro/_seqopt_plot.py | 92.95% | 1 Missing and 9 partials ⚠️ |
| ...alysis/protein_design_pro/_backend/seqopt/nsga2.py | 95.52% | 1 Missing and 2 partials ⚠️ |
| ...ysis/protein_design_pro/_backend/seqopt/metrics.py | 81.81% | 1 Missing and 1 partial ⚠️ |
| ...ysis/protein_design_pro/_backend/seqopt/penalty.py | 92.85% | 1 Missing and 1 partial ⚠️ |
| ...analysis/protein_design_pro/_backend/seqopt/run.py | 97.33% | 1 Missing and 1 partial ⚠️ |
| aaanalysis/protein_design_pro/_seqopt.py | 97.22% | 0 Missing and 2 partials ⚠️ |

@@            Coverage Diff             @@
##           master     #271      +/-   ##
==========================================
+ Coverage   96.10%   96.13%   +0.02%     
==========================================
  Files         175      176       +1     
  Lines       16374    16733     +359     
  Branches     2796     2863      +67     
==========================================
+ Hits        15737    16087     +350     
+ Misses        369      363       -6     
- Partials      268      283      +15     

| Files with missing lines | Coverage Δ |
|---|---|
| aaanalysis/_constants.py | 100.00% <100.00%> (ø) |
| ...ysis/protein_design_pro/_backend/seqopt/metrics.py | 89.74% <81.81%> (+13.62%) ⬆️ |
| ...ysis/protein_design_pro/_backend/seqopt/penalty.py | 92.85% <92.85%> (ø) |
| ...analysis/protein_design_pro/_backend/seqopt/run.py | 94.40% <97.33%> (+1.81%) ⬆️ |
| aaanalysis/protein_design_pro/_seqopt.py | 88.92% <97.22%> (+2.66%) ⬆️ |
| ...alysis/protein_design_pro/_backend/seqopt/nsga2.py | 96.93% <95.52%> (-1.05%) ⬇️ |
| aaanalysis/protein_design_pro/_seqopt_plot.py | 92.04% <92.95%> (+1.56%) ⬆️ |

| Components | Coverage Δ |
|---|---|
| cpp_core | 94.95% <ø> (ø) |

@breimanntools breimanntools merged commit 8b88b35 into master Jun 25, 2026
17 checks passed
@breimanntools breimanntools deleted the feat/seqopt-deap-parity branch June 25, 2026 15:54