A lightweight, open-source CLI that:
- finds open reading frames (ORFs) in any FASTA-formatted DNA sequence
- translates each ORF to its protein sequence (configurable genetic codes)
- asks ChatGPT to generate plain-English biological insights about the ORFs
The pipeline is validated on:
- 500 synthetic 5 kb strands (with known "planted" genes)
- three reference genomes: λ-phage (48 kb), E. coli K-12 (4.6 Mb) and Influenza A (≈13 kb, 8 segments)
| Module | Highlights |
|---|---|
| orf.py | Linear scan over both strands (+, −) and all 3 frames; configurable start/stop sets; ≥ min_len filter |
| translate.py | Fast DNA → AA translation; built-in tables for Standard (1), Bacterial (11), Mitochondrial (2) |
| llm.py | explain_orfs() wraps the OpenAI SDK: batch-prompts the top N ORFs and returns a concise Markdown summary |
| simulate.py | Generates synthetic sequences with ground-truth ORFs and computes precision / recall / F₁ |
| cli.py | One-liner interface → TSV (ORFs), FASTA (proteins), and optional Markdown/HTML report with GPT text |
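
The scan and translation steps described for orf.py and translate.py could look roughly like the sketch below. It is illustrative only, not the project's actual code: the `Orf` record, the default start/stop sets, the `min_len` default, the zero-based coordinates, and the signatures of `find_orfs()` and `translate()` are all assumptions. The amino-acid strings follow the NCBI ordering (bases T, C, A, G) for tables 1, 11, and 2.

```python
from itertools import product
from typing import Iterator, NamedTuple

# Assumed defaults; the real orf.py may expose different start/stop sets.
START_CODONS = frozenset({"ATG"})
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Amino acids in NCBI order: codons enumerated over T, C, A, G at each position.
_STANDARD = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
             "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
_VERT_MITO = ("FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR"
              "IIMMTTTTNNKKSS**" "VVVVAAAADDEEGGGG")
_CODONS = ["".join(c) for c in product("TCAG", repeat=3)]
CODON_TABLES = {
    1: dict(zip(_CODONS, _STANDARD)),    # Standard
    11: dict(zip(_CODONS, _STANDARD)),   # Bacterial: same codon-to-residue map as table 1
    2: dict(zip(_CODONS, _VERT_MITO)),   # Vertebrate mitochondrial
}


class Orf(NamedTuple):
    """Hypothetical ORF record; coordinates refer to the scanned strand."""
    strand: str  # "+" or "-"
    frame: int   # 0, 1, or 2
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive (stop codon included)
    seq: str


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 30,
              starts=START_CODONS, stops=STOP_CODONS) -> Iterator[Orf]:
    """Yield start-to-stop ORFs of at least min_len nt on both strands."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None  # index of the first start codon since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None:
                    if codon in starts:
                        orf_start = i
                elif codon in stops:
                    end = i + 3
                    if end - orf_start >= min_len:
                        yield Orf(strand, frame, orf_start, end, s[orf_start:end])
                    orf_start = None


def translate(orf_seq: str, table: int = 1) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    codons = CODON_TABLES[table]
    protein = []
    for i in range(0, len(orf_seq) - 2, 3):
        aa = codons.get(orf_seq[i:i + 3], "X")  # "X" marks ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `list(find_orfs("CCATGAAATTTTAGGG", min_len=12))` yields a single ORF on the + strand in frame 2, and `translate()` turns its sequence into `"MKF"`. Each frame is visited once per strand, so the scan is linear in sequence length.
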
| Week | Goal |
|---|---|
| Week 1 | Set up GitHub repo, create pyproject.toml, CI config, dev environment |
| Week 2 | Write find_orfs() in orf.py; test basic ORF detection logic |
| Week 3 | Build translate_orf() in translate.py; support multiple genetic codes |
| Week 4 | Integrate cli.py with CLI options for FASTA input, ORF length filter, output paths |
| Week 5 | Create simulate.py to generate 500 test strands with planted ORFs; write scoring function |
| Week 6 | Implement llm.py for GPT summary generation; create --explain CLI flag |
| Week 7 | Validate pipeline on λ-phage genome; refine start/stop codon edge cases |
| Week 8 | Benchmark on E. coli K-12 and Influenza A; track runtime, precision, and coverage |
| Week 9 | Build documentation site with MkDocs; add screenshots and examples |
| Week 10 | Publish v0.1 to TestPyPI; prepare demo video, blog post, or academic poster |
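
The scoring function planned for simulate.py (Week 5) could be as small as the sketch below. It assumes ORFs are compared by exact match on a hypothetical `(strand, start, end)` key; the real scorer may instead allow partial overlaps or match on stop position only. The name `score_orfs()` is likewise an assumption.

```python
from typing import Hashable, Iterable, Tuple


def score_orfs(predicted: Iterable[Hashable],
               truth: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for exact-match ORF predictions.

    Assumption: each ORF is a hashable key such as (strand, start, end),
    and a prediction counts as correct only if the key matches exactly.
    """
    pred, true = set(predicted), set(truth)
    tp = len(pred & true)  # true positives: predicted ORFs that were planted
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One of two predictions is correct, and one of two planted ORFs is found.
planted = {("+", 2, 14), ("-", 30, 90)}
called = {("+", 2, 14), ("+", 100, 160)}
precision, recall, f1 = score_orfs(called, planted)
```

Empty inputs return 0.0 rather than raising, which keeps the benchmark loop simple when a synthetic strand produces no calls.
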