Fix critical bugs, improve performance, add Snakemake workflow + CI/CD#1
Open
vipinmenon1989 wants to merge 1 commit into
Open
Fix critical bugs, improve performance, add Snakemake workflow + CI/CD#1vipinmenon1989 wants to merge 1 commit into
vipinmenon1989 wants to merge 1 commit into
Conversation
Critical bug fixes: - compute_final_features() in DGD.py, Variants/DGDVar.py, and Base-Editors/DGDbaseeditor.py assigned the result of make_arrays.return_dimers_array() -- which returns a (dimers, anti_sequence) tuple -- directly to a variable used as if it were the dimer list. This made the stacking-energy lookup index into the raw tuple (stacking[<list>][...]), an unhashable-type crash on every guide. The pipeline could not previously produce a final output. Fixed by unpacking the tuple at all three call sites. - stacking_model.py's stacking-energy table only defined 4 of the 16 required RNA dinucleotide contexts (AA, AU, AG, AC); any guide whose paired dimers used one of the other 12 would raise KeyError even after the fix above. Rather than invent the missing published thermodynamic values, lookups now fall back to 0.0 kcal/mol with a logged warning -- flagged clearly in the docstring and README for the repo owner to supply the complete table. - The energy-calculation loop assumed an even-length dimers list, but return_dimers_array() doesn't guarantee that; fixed to safely skip a trailing unpaired entry instead of indexing out of bounds. - Every script in Accessory/ was broken: each calls its corresponding DGD.py function with 2-4 explicit file-path arguments, but those functions took zero arguments (I/O was hardcoded to fixed filenames). All affected DGD.py functions now accept optional path parameters defaulting to the original hardcoded filenames, so the standalone accessory scripts work as documented and run_pipeline()'s internal calls are unchanged. Verified via Snakemake actually running steps 1-3 end to end through the fixed Accessory scripts. - README documented a 'Base-Editors (BE)/' folder that doesn't match the actual 'Base-Editors/' directory name -- corrected throughout. Performance (each verified to produce byte-identical output to the original logic before being applied): - prepare_model_inputs() rebuilt the same list of sequence 2-mers from scratch 16 times per guide; now built once and reused. - _build_structure_df() (run up to ~160 times per pipeline run) converted a DataFrame slice to a Python dict and summed values in a pure-Python loop; replaced with a single vectorized pandas row-sum. - make_fasta_for_rnafold() / compute_target_features() switched from .iterrows() to the faster .itertuples(). Cleanup: - Removed a committed __pycache__/*.pyc file; added .gitignore. - Recreated requirements.txt and a Makefile from what the README documents. Did NOT invent connection_to_matrix.cpp (C++ source) or the models/ directory of trained weights -- both are flagged in the README 'Known Issues' section for the repo owner to supply. Snakemake workflow (Snakefile, config/config.yaml, envs/environment.yaml, demo/input.fa): - Standard SpCas9 variant wired as one rule per pipeline step (1-11), using the now-fixed Accessory/*.py scripts, so the DAG is resumable and runs as far as feature engineering without needing trained models (config['run_scoring']). Broad-PAM and base-editor variants wired as single full-CLI rules, matching their documented usage. - Verified: DAG resolves correctly for all variant/config combinations (including rejecting bad config); steps 1-3 actually execute end to end against the bundled synthetic demo FASTA; the Step 6 rule fails with a clear, actionable message (not a cryptic error) when the C++ binary hasn't been built. CI (.github/workflows/ci.yml): - syntax-check: compiles every .py file. - snakemake-dry-run: validates the DAG for every variant/config combination. - smoke-test: installs the lightweight Python deps (no TensorFlow, no C++ binary) and actually runs steps 1-3 against the demo FASTA.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Critical bug fixes:
Performance (each verified to produce byte-identical output to the original logic before being applied):
Cleanup:
Snakemake workflow (Snakefile, config/config.yaml, envs/environment.yaml, demo/input.fa):
CI (.github/workflows/ci.yml):