Phase 2: assembly version summary generation#11
Open
fchen13 wants to merge 1 commit into
Open
Conversation
Contributor
Reviewer's GuideImplements a new Phase 2 aggregation flow that reads assembly_current.tsv and assembly_historical.tsv, normalizes accessions, computes per-base-accession version history metrics, writes assembly_version_summary.tsv, emits a completion event for downstream flows, and refactors documentation into phase-specific README guides for backfill, daily updates, and assembly summary while removing the old walkthrough from flows/README.md. Flow diagram for Phase 2 assembly version summary generationflowchart LR
current[assembly_current.tsv]
hist[assembly_historical.tsv]
flow_node[generate_assembly_summary flow]
summary[assembly_version_summary.tsv]
event[generate.assembly_summary.completed]
current --> flow_node
hist --> flow_node
flow_node --> summary
flow_node --> event
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The Phase 2 README and pipeline description still refer to
aggregate_assembly_summary.pyand theaggregate.assembly_summary.completedevent, but the implementation isgenerate_assembly_summary.pyemittinggenerate.assembly_summary.completed; aligning filenames and event names between code and docs will avoid confusion and miswired triggers. - The helper
_get_accession()is currently unused and the accession-normalisation logic is duplicated inload_assemblies; consider either using_get_accession()there or removing it to keep the module minimal and consistent.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The Phase 2 README and pipeline description still refer to `aggregate_assembly_summary.py` and the `aggregate.assembly_summary.completed` event, but the implementation is `generate_assembly_summary.py` emitting `generate.assembly_summary.completed`; aligning filenames and event names between code and docs will avoid confusion and miswired triggers.
- The helper `_get_accession()` is currently unused and the accession-normalisation logic is duplicated in `load_assemblies`; consider either using `_get_accession()` there or removing it to keep the module minimal and consistent.
## Individual Comments
### Comment 1
<location path="flows/lib/generate_assembly_summary.py" line_range="32-39" />
<code_context>
+EBP_METRIC_COLUMNS = {"ebpStandardDate", "ebp_standard_date", "ebpMetricDate", "ebp_metric_date"}
+
+
+def _get_accession(row: dict) -> str:
+ """Return the accession value from a row, handling both column name conventions.
+
+ assembly_current.tsv uses 'accession'; assembly_historical.tsv written by
+ the Phase 0 backfill uses 'genbankAccession'. Both may appear in the
+ merged historical TSV depending on which phase wrote each row.
+ """
+ return row.get("accession") or row.get("genbankAccession") or ""
+
+
</code_context>
<issue_to_address>
**suggestion:** Consider removing or inlining `_get_accession` since it is unused and duplicates logic in `load_assemblies`.
This helper duplicates the accession normalisation logic already in `load_assemblies` and is never called. To reduce redundancy and avoid confusion about the preferred access pattern, either remove `_get_accession` or update `load_assemblies` to use it consistently.
Suggested implementation:
```python
EBP_METRIC_COLUMNS = {"ebpStandardDate", "ebp_standard_date", "ebpMetricDate", "ebp_metric_date"}
```
The helper `_get_accession` is currently unused and duplicates logic in `load_assemblies`, so removing it is sufficient to reduce redundancy and avoid confusion. No further changes are required unless you want to centralise accession normalisation in a shared helper, in which case `load_assemblies` would need to be updated to call that helper consistently.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
8ec9d98 to
12535b4
Compare
12535b4 to
c92f20c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds flows/lib/generate_assembly_summary.py, a post-parse aggregation step that combines assembly_current.tsv and assembly_historical.tsv into a per-base-accession version summary (assembly_version_summary.tsv). Emits generate.assembly_summary.completed for downstream Phase 3 taxon milestone flows to trigger from.
Also adds phase-specific run guides (tests/README_phase_0_backfill.md, tests/README_phase_1_daily_updates.md, tests/README_phase_2_assembly_summary.md) and removes the assembly version walkthrough from flows/README.md — that content now lives in the phase-specific READMEs.
Summary by Sourcery
Generate a per-base-accession assembly version summary from current and historical TSVs and document the multi-phase assembly version pipeline.
New Features:
Documentation: