Skip to content

Fix critical bugs, improve performance, add Snakemake workflow + CI/CD#1

Open
vipinmenon1989 wants to merge 1 commit into
mainfrom
full-pipeline
Open

Fix critical bugs, improve performance, add Snakemake workflow + CI/CD#1
vipinmenon1989 wants to merge 1 commit into
mainfrom
full-pipeline

Conversation

@vipinmenon1989

Copy link
Copy Markdown
Contributor

Critical bug fixes:

  • compute_final_features() in DGD.py, Variants/DGDVar.py, and Base-Editors/DGDbaseeditor.py assigned the result of make_arrays.return_dimers_array() -- which returns a (dimers, anti_sequence) tuple -- directly to a variable used as if it were the dimer list. This made the stacking-energy lookup index into the raw tuple (stacking[][...]), an unhashable-type crash on every guide. The pipeline could not previously produce a final output. Fixed by unpacking the tuple at all three call sites.
  • stacking_model.py's stacking-energy table only defined 4 of the 16 required RNA dinucleotide contexts (AA, AU, AG, AC); any guide whose paired dimers used one of the other 12 would raise KeyError even after the fix above. Rather than invent the missing published thermodynamic values, lookups now fall back to 0.0 kcal/mol with a logged warning -- flagged clearly in the docstring and README for the repo owner to supply the complete table.
  • The energy-calculation loop assumed an even-length dimers list, but return_dimers_array() doesn't guarantee that; fixed to safely skip a trailing unpaired entry instead of indexing out of bounds.
  • Every script in Accessory/ was broken: each calls its corresponding DGD.py function with 2-4 explicit file-path arguments, but those functions took zero arguments (I/O was hardcoded to fixed filenames). All affected DGD.py functions now accept optional path parameters defaulting to the original hardcoded filenames, so the standalone accessory scripts work as documented and run_pipeline()'s internal calls are unchanged. Verified via Snakemake actually running steps 1-3 end to end through the fixed Accessory scripts.
  • README documented a 'Base-Editors (BE)/' folder that doesn't match the actual 'Base-Editors/' directory name -- corrected throughout.

Performance (each verified to produce byte-identical output to the original logic before being applied):

  • prepare_model_inputs() rebuilt the same list of sequence 2-mers from scratch 16 times per guide; now built once and reused.
  • _build_structure_df() (run up to ~160 times per pipeline run) converted a DataFrame slice to a Python dict and summed values in a pure-Python loop; replaced with a single vectorized pandas row-sum.
  • make_fasta_for_rnafold() / compute_target_features() switched from .iterrows() to the faster .itertuples().

Cleanup:

  • Removed a committed pycache/*.pyc file; added .gitignore.
  • Recreated requirements.txt and a Makefile from what the README documents. Did NOT invent connection_to_matrix.cpp (C++ source) or the models/ directory of trained weights -- both are flagged in the README 'Known Issues' section for the repo owner to supply.

Snakemake workflow (Snakefile, config/config.yaml, envs/environment.yaml, demo/input.fa):

  • Standard SpCas9 variant wired as one rule per pipeline step (1-11), using the now-fixed Accessory/*.py scripts, so the DAG is resumable and runs as far as feature engineering without needing trained models (config['run_scoring']). Broad-PAM and base-editor variants wired as single full-CLI rules, matching their documented usage.
  • Verified: DAG resolves correctly for all variant/config combinations (including rejecting bad config); steps 1-3 actually execute end to end against the bundled synthetic demo FASTA; the Step 6 rule fails with a clear, actionable message (not a cryptic error) when the C++ binary hasn't been built.

CI (.github/workflows/ci.yml):

  • syntax-check: compiles every .py file.
  • snakemake-dry-run: validates the DAG for every variant/config combination.
  • smoke-test: installs the lightweight Python deps (no TensorFlow, no C++ binary) and actually runs steps 1-3 against the demo FASTA.

Critical bug fixes:
- compute_final_features() in DGD.py, Variants/DGDVar.py, and
  Base-Editors/DGDbaseeditor.py assigned the result of
  make_arrays.return_dimers_array() -- which returns a (dimers,
  anti_sequence) tuple -- directly to a variable used as if it were the
  dimer list. This made the stacking-energy lookup index into the raw
  tuple (stacking[<list>][...]), an unhashable-type crash on every guide.
  The pipeline could not previously produce a final output. Fixed by
  unpacking the tuple at all three call sites.
- stacking_model.py's stacking-energy table only defined 4 of the 16
  required RNA dinucleotide contexts (AA, AU, AG, AC); any guide whose
  paired dimers used one of the other 12 would raise KeyError even after
  the fix above. Rather than invent the missing published thermodynamic
  values, lookups now fall back to 0.0 kcal/mol with a logged warning --
  flagged clearly in the docstring and README for the repo owner to
  supply the complete table.
- The energy-calculation loop assumed an even-length dimers list, but
  return_dimers_array() doesn't guarantee that; fixed to safely skip a
  trailing unpaired entry instead of indexing out of bounds.
- Every script in Accessory/ was broken: each calls its corresponding
  DGD.py function with 2-4 explicit file-path arguments, but those
  functions took zero arguments (I/O was hardcoded to fixed filenames).
  All affected DGD.py functions now accept optional path parameters
  defaulting to the original hardcoded filenames, so the standalone
  accessory scripts work as documented and run_pipeline()'s internal
  calls are unchanged. Verified via Snakemake actually running steps
  1-3 end to end through the fixed Accessory scripts.
- README documented a 'Base-Editors (BE)/' folder that doesn't match the
  actual 'Base-Editors/' directory name -- corrected throughout.

Performance (each verified to produce byte-identical output to the
original logic before being applied):
- prepare_model_inputs() rebuilt the same list of sequence 2-mers from
  scratch 16 times per guide; now built once and reused.
- _build_structure_df() (run up to ~160 times per pipeline run) converted
  a DataFrame slice to a Python dict and summed values in a pure-Python
  loop; replaced with a single vectorized pandas row-sum.
- make_fasta_for_rnafold() / compute_target_features() switched from
  .iterrows() to the faster .itertuples().

Cleanup:
- Removed a committed __pycache__/*.pyc file; added .gitignore.
- Recreated requirements.txt and a Makefile from what the README
  documents. Did NOT invent connection_to_matrix.cpp (C++ source) or the
  models/ directory of trained weights -- both are flagged in the README
  'Known Issues' section for the repo owner to supply.

Snakemake workflow (Snakefile, config/config.yaml, envs/environment.yaml,
demo/input.fa):
- Standard SpCas9 variant wired as one rule per pipeline step (1-11),
  using the now-fixed Accessory/*.py scripts, so the DAG is resumable and
  runs as far as feature engineering without needing trained models
  (config['run_scoring']). Broad-PAM and base-editor variants wired as
  single full-CLI rules, matching their documented usage.
- Verified: DAG resolves correctly for all variant/config combinations
  (including rejecting bad config); steps 1-3 actually execute end to end
  against the bundled synthetic demo FASTA; the Step 6 rule fails with a
  clear, actionable message (not a cryptic error) when the C++ binary
  hasn't been built.

CI (.github/workflows/ci.yml):
- syntax-check: compiles every .py file.
- snakemake-dry-run: validates the DAG for every variant/config
  combination.
- smoke-test: installs the lightweight Python deps (no TensorFlow, no
  C++ binary) and actually runs steps 1-3 against the demo FASTA.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant