Fix critical bugs, improve performance, add Snakemake workflow + CI/CD by vipinmenon1989 · Pull Request #1 · GuideDesigner/DGD-Cas9

vipinmenon1989 · 2026-07-04T01:11:06Z

Critical bug fixes:

compute_final_features() in DGD.py, Variants/DGDVar.py, and Base-Editors/DGDbaseeditor.py assigned the result of make_arrays.return_dimers_array() -- which returns a (dimers, anti_sequence) tuple -- directly to a variable used as if it were the dimer list. This made the stacking-energy lookup index into the raw tuple (stacking[][...]), an unhashable-type crash on every guide. The pipeline could not previously produce a final output. Fixed by unpacking the tuple at all three call sites.
stacking_model.py's stacking-energy table only defined 4 of the 16 required RNA dinucleotide contexts (AA, AU, AG, AC); any guide whose paired dimers used one of the other 12 would raise KeyError even after the fix above. Rather than invent the missing published thermodynamic values, lookups now fall back to 0.0 kcal/mol with a logged warning -- flagged clearly in the docstring and README for the repo owner to supply the complete table.
The energy-calculation loop assumed an even-length dimers list, but return_dimers_array() doesn't guarantee that; fixed to safely skip a trailing unpaired entry instead of indexing out of bounds.
Every script in Accessory/ was broken: each calls its corresponding DGD.py function with 2-4 explicit file-path arguments, but those functions took zero arguments (I/O was hardcoded to fixed filenames). All affected DGD.py functions now accept optional path parameters defaulting to the original hardcoded filenames, so the standalone accessory scripts work as documented and run_pipeline()'s internal calls are unchanged. Verified via Snakemake actually running steps 1-3 end to end through the fixed Accessory scripts.
README documented a 'Base-Editors (BE)/' folder that doesn't match the actual 'Base-Editors/' directory name -- corrected throughout.

Performance (each verified to produce byte-identical output to the original logic before being applied):

prepare_model_inputs() rebuilt the same list of sequence 2-mers from scratch 16 times per guide; now built once and reused.
_build_structure_df() (run up to ~160 times per pipeline run) converted a DataFrame slice to a Python dict and summed values in a pure-Python loop; replaced with a single vectorized pandas row-sum.
make_fasta_for_rnafold() / compute_target_features() switched from .iterrows() to the faster .itertuples().

Cleanup:

Removed a committed pycache/*.pyc file; added .gitignore.
Recreated requirements.txt and a Makefile from what the README documents. Did NOT invent connection_to_matrix.cpp (C++ source) or the models/ directory of trained weights -- both are flagged in the README 'Known Issues' section for the repo owner to supply.

Snakemake workflow (Snakefile, config/config.yaml, envs/environment.yaml, demo/input.fa):

Standard SpCas9 variant wired as one rule per pipeline step (1-11), using the now-fixed Accessory/*.py scripts, so the DAG is resumable and runs as far as feature engineering without needing trained models (config['run_scoring']). Broad-PAM and base-editor variants wired as single full-CLI rules, matching their documented usage.
Verified: DAG resolves correctly for all variant/config combinations (including rejecting bad config); steps 1-3 actually execute end to end against the bundled synthetic demo FASTA; the Step 6 rule fails with a clear, actionable message (not a cryptic error) when the C++ binary hasn't been built.

CI (.github/workflows/ci.yml):

syntax-check: compiles every .py file.
snakemake-dry-run: validates the DAG for every variant/config combination.
smoke-test: installs the lightweight Python deps (no TensorFlow, no C++ binary) and actually runs steps 1-3 against the demo FASTA.

Critical bug fixes: - compute_final_features() in DGD.py, Variants/DGDVar.py, and Base-Editors/DGDbaseeditor.py assigned the result of make_arrays.return_dimers_array() -- which returns a (dimers, anti_sequence) tuple -- directly to a variable used as if it were the dimer list. This made the stacking-energy lookup index into the raw tuple (stacking[<list>][...]), an unhashable-type crash on every guide. The pipeline could not previously produce a final output. Fixed by unpacking the tuple at all three call sites. - stacking_model.py's stacking-energy table only defined 4 of the 16 required RNA dinucleotide contexts (AA, AU, AG, AC); any guide whose paired dimers used one of the other 12 would raise KeyError even after the fix above. Rather than invent the missing published thermodynamic values, lookups now fall back to 0.0 kcal/mol with a logged warning -- flagged clearly in the docstring and README for the repo owner to supply the complete table. - The energy-calculation loop assumed an even-length dimers list, but return_dimers_array() doesn't guarantee that; fixed to safely skip a trailing unpaired entry instead of indexing out of bounds. - Every script in Accessory/ was broken: each calls its corresponding DGD.py function with 2-4 explicit file-path arguments, but those functions took zero arguments (I/O was hardcoded to fixed filenames). All affected DGD.py functions now accept optional path parameters defaulting to the original hardcoded filenames, so the standalone accessory scripts work as documented and run_pipeline()'s internal calls are unchanged. Verified via Snakemake actually running steps 1-3 end to end through the fixed Accessory scripts. - README documented a 'Base-Editors (BE)/' folder that doesn't match the actual 'Base-Editors/' directory name -- corrected throughout. Performance (each verified to produce byte-identical output to the original logic before being applied): - prepare_model_inputs() rebuilt the same list of sequence 2-mers from scratch 16 times per guide; now built once and reused. - _build_structure_df() (run up to ~160 times per pipeline run) converted a DataFrame slice to a Python dict and summed values in a pure-Python loop; replaced with a single vectorized pandas row-sum. - make_fasta_for_rnafold() / compute_target_features() switched from .iterrows() to the faster .itertuples(). Cleanup: - Removed a committed __pycache__/*.pyc file; added .gitignore. - Recreated requirements.txt and a Makefile from what the README documents. Did NOT invent connection_to_matrix.cpp (C++ source) or the models/ directory of trained weights -- both are flagged in the README 'Known Issues' section for the repo owner to supply. Snakemake workflow (Snakefile, config/config.yaml, envs/environment.yaml, demo/input.fa): - Standard SpCas9 variant wired as one rule per pipeline step (1-11), using the now-fixed Accessory/*.py scripts, so the DAG is resumable and runs as far as feature engineering without needing trained models (config['run_scoring']). Broad-PAM and base-editor variants wired as single full-CLI rules, matching their documented usage. - Verified: DAG resolves correctly for all variant/config combinations (including rejecting bad config); steps 1-3 actually execute end to end against the bundled synthetic demo FASTA; the Step 6 rule fails with a clear, actionable message (not a cryptic error) when the C++ binary hasn't been built. CI (.github/workflows/ci.yml): - syntax-check: compiles every .py file. - snakemake-dry-run: validates the DAG for every variant/config combination. - smoke-test: installs the lightweight Python deps (no TensorFlow, no C++ binary) and actually runs steps 1-3 against the demo FASTA.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix critical bugs, improve performance, add Snakemake workflow + CI/CD#1

Fix critical bugs, improve performance, add Snakemake workflow + CI/CD#1
vipinmenon1989 wants to merge 1 commit into
mainfrom
full-pipeline

vipinmenon1989 commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

vipinmenon1989 commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant