[WIP]feat(ref): return NCBI assembly reports via gget ref (#179)#237
Open
Elarwei001 wants to merge 6 commits into
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

Coverage Diff

| | dev | #237 | +/- |
|---|---|---|---|
| Coverage | 56.70% | 57.31% | +0.61% |
| Files | 29 | 29 | |
| Lines | 9392 | 9517 | +125 |
| Hits | 5326 | 5455 | +129 |
| Misses | 4066 | 4062 | -4 |
Add an `assembly_report` mode to `gget ref` (Python API and CLI). When enabled, the positional argument is interpreted as an NCBI assembly accession (e.g. GCF_000001405.40) and the NCBI assembly report is fetched and parsed, mapping sequence/chromosome names across the Ensembl/short, GenBank, RefSeq and UCSC naming conventions.

- New `gget.assembly_report()` helper and `ref(..., assembly_report=True)`.
- CLI: `gget ref <accession> --assembly_report [--csv]`.
- Returns a pandas DataFrame by default, list of dicts / JSON with json=True.
- Tests + fixtures (live NCBI, small SARS-CoV-2 assembly) and docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
for more information, see https://pre-commit.ci
Add a network-free TestAssemblyReportOffline class that mocks requests to cover assembly_report (accession validation, parent-dir and report HTTP errors, no-folder/missing-header errors, parsing, verbose, and json/save/CSV branches) and the ref(assembly_report=True) delegation. All PR-added lines are now covered except the json_package.dump rename deep in the Ensembl network save path (which cannot be covered without network access).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a871709 to 26ab068
Elarwei001 added a commit to Elarwei001/gget_issues that referenced this pull request on Jul 5, 2026
Bilingual (ZH/EN) analysis of scverse/gget#237 (adds assembly_report to gget ref: interpret the positional arg as an NCBI assembly accession, return the cross-naming map chr1<->1<->CM000663.2<->NC_000001.11). Covers platform fit, the --csv-inverted-vs-gget-norm UX wart, and the de-facto live fixture test. Updated README index + index.html card. Analysis by subagent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on-less accession (scverse#179)

Addresses review findings on the assembly_report feature:

- --csv now follows gget's convention: it is store_false (default JSON on the CLI, --csv converts to CSV), so a flag named --csv gives CSV like every other module, instead of the previous inverted behaviour (--csv -> JSON).
- Directory match uses `startswith(accession + "_")` so e.g. 'GCF_000001405.4' no longer prefix-matches the folder for 'GCF_000001405.40'.
- The accession version suffix is now optional; when omitted (e.g. 'GCF_000001405') the latest available version is resolved. Saved files and logs carry the resolved version.
- Move the network-touching fixture out of test_ref.json into an explicit TestAssemblyReportLive (skip-on-network / transient; anchored to the frozen SARS-CoV-2 report but asserting only stable identity columns). Add offline tests for the version-prefix guard and the version-less latest resolution.
- Fix a pre-existing mypy error on df.to_dict(orient="records") via cast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
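The folder-matching rules in this commit can be illustrated with a minimal sketch. It is not the PR's code: `pick_assembly_folder` is a hypothetical helper, and the folder list stands in for the names read from the NCBI parent directory listing (the `.4_hypothetical` entry is invented purely to show the collision case).

```python
import re
from typing import List, Optional


def pick_assembly_folder(accession: str, folders: List[str]) -> Optional[str]:
    """Pick the folder for `accession` from a directory listing (hypothetical helper).

    Versioned input (e.g. 'GCF_000001405.40') must match '<accession>_' exactly.
    Version-less input (e.g. 'GCF_000001405') resolves to the highest version.
    """
    if "." in accession:
        # The trailing underscore stops '.4' from prefix-matching '.40'.
        prefix = accession + "_"
        for name in folders:
            if name.startswith(prefix):
                return name
        return None

    # Version-less: compare versions numerically so that 40 ranks above 4 and 39.
    pattern = re.compile(re.escape(accession) + r"\.(\d+)_")
    best_name: Optional[str] = None
    best_version = -1
    for name in folders:
        match = pattern.match(name)
        if match and int(match.group(1)) > best_version:
            best_version = int(match.group(1))
            best_name = name
    return best_name


# Illustrative listing; the '.4_hypothetical' folder is made up for the example.
folders = [
    "GCF_000001405.40_GRCh38.p14",
    "GCF_000001405.4_hypothetical",
    "GCF_000001405.39_GRCh38.p13",
]
print(pick_assembly_folder("GCF_000001405.4", folders))
print(pick_assembly_folder("GCF_000001405", folders))
```

Comparing versions as integers matters here: a plain string sort would rank `.4` above `.39`.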
…cverse#179)

assembly_report previously required an NCBI assembly accession. Add an explicit `taxon=True` option (CLI `--taxon`/`-tx`) that interprets the input as an organism/taxon name (e.g. "homo sapiens") and resolves it to that taxon's NCBI reference assembly accession before fetching the report, using the datasets CLI already bundled with gget.

- Resolution is opt-in: accession callers are unaffected and make no extra call. gget_virus (for the datasets binary) is imported lazily, only in taxon mode.
- Default resolves to the reference assembly; to fetch a specific non-reference assembly, pass its accession directly.
- Tests: offline (taxon path wiring, resolver parsing, not-found -> ValueError, datasets failure -> RuntimeError) and a live test resolving "homo sapiens".
- Docs: --taxon flag + example (en); changelog note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rse#179)

Complements --taxon (which resolves to the reference assembly): --list_assemblies (Python: list_assemblies=True) interprets the input as an organism/taxon name and returns a table of all NCBI assemblies for that taxon (accession, assembly_name, refseq_category, assembly_level, organism), with reference/representative first, so a user can discover and then fetch a specific non-reference assembly by accession. Uses the bundled datasets CLI (imported lazily).

- Tests: offline (parse/order, empty -> ValueError, delegation) and a live test listing the human assemblies.
- Docs: --list_assemblies flag + example; changelog note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Hi @lauraluebbert, `gget ref --assembly_report` (resolves #179) fetches an NCBI assembly report to translate sequence/chromosome names across Ensembl/GenBank/RefSeq/UCSC. I would appreciate a review when you have a moment. Thanks!
Resolves #179

What this adds

`gget ref --assembly_report` (Python: `assembly_report=True`) fetches the NCBI assembly report for an assembly and returns it as a `pandas` DataFrame (or list of dicts with `json=True`). The report maps each sequence/chromosome name across the Ensembl/short, GenBank, RefSeq, and UCSC conventions (e.g. `1` ↔ `CM000663.2` ↔ `NC_000001.11` ↔ `chr1`), which is handy for translating chromosome names between databases. A standalone `gget.assembly_report()` is also exposed.

- The accession version is optional: pass `GCF_000001405` (no version) and the latest version is resolved automatically.
- `--taxon` (Python: `taxon=True`) interprets the input as an organism name (e.g. `"homo sapiens"`) and resolves it to that taxon's NCBI reference assembly via the bundled `datasets` CLI, so you don't need to know the accession.
- `--list_assemblies` (Python: `list_assemblies=True`) lists all NCBI assemblies for an organism name (accession, name, category, level; reference/representative first), so you can pick a specific non-reference assembly to fetch.
- On the CLI, output is JSON by default; pass `--csv` for CSV.

Name resolution (`--taxon` / `--list_assemblies`) is opt-in and imported lazily, so accession callers are unaffected and make no extra call.

How it works

Builds the NCBI FTP path by sharding the accession digits (`GCA/981/441/385/…`), lists the parent directory to find the exact assembly folder (matched precisely so `.4` can't collide with `.40`), then parses the tab-delimited `*_assembly_report.txt`. Organism-name modes shell out to the bundled `datasets summary genome taxon` CLI.

Testing

- Offline (`tests/test_ref.py`): report parsing, json/save paths, the five error branches, version-prefix guard, version-less latest resolution, taxon resolution and assembly listing (parse / order / not-found / datasets-failure), and the `ref()` delegation.
- Live (`TestAssemblyReportLive`, skip-on-network/transient): SARS-CoV-2 report anchored on stable identity columns, `--taxon "homo sapiens"` resolving to the human reference, and `--list_assemblies` listing the human assemblies.
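For reviewers who want to see the path sharding and report parsing described under "How it works" in isolation, here is a minimal, network-free sketch. It is an illustration under stated assumptions, not the code in this PR: `sharded_parent_dir` and `parse_assembly_report` are hypothetical names, the root URL is the public NCBI genomes tree, and the sample text is a single row modeled on the GRCh38 report using the standard assembly-report column names.

```python
import re
from typing import Dict, List

# Assumption: the public NCBI genomes tree is used as the root.
NCBI_GENOMES_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def sharded_parent_dir(accession: str) -> str:
    """Build the parent directory URL by splitting the 9 digits into 3 chunks."""
    match = re.fullmatch(r"(GC[AF])_(\d{9})(?:\.\d+)?", accession)
    if match is None:
        raise ValueError(f"Not an NCBI assembly accession: {accession!r}")
    prefix, digits = match.groups()
    chunks = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([NCBI_GENOMES_ROOT, prefix, *chunks])


def parse_assembly_report(text: str) -> List[Dict[str, str]]:
    """Parse a tab-delimited assembly report into one dict per sequence."""
    header = None
    rows = []
    for line in text.splitlines():
        if line.startswith("#"):
            # The column header is the comment line starting with '# Sequence-Name'.
            if line.startswith("# Sequence-Name"):
                header = line[2:].split("\t")
            continue
        if not line.strip():
            continue
        if header is None:
            raise ValueError("Assembly report has no '# Sequence-Name' header line")
        rows.append(dict(zip(header, line.split("\t"))))
    if header is None:
        raise ValueError("Assembly report has no '# Sequence-Name' header line")
    return rows


# Illustrative sample: one data row for human chromosome 1.
SAMPLE_REPORT = (
    "# Assembly name:  GRCh38.p14\n"
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\t"
    "NC_000001.11\tPrimary Assembly\t248956422\tchr1\n"
)

print(sharded_parent_dir("GCF_000001405.40"))
for row in parse_assembly_report(SAMPLE_REPORT):
    print(row["Sequence-Name"], row["GenBank-Accn"], row["RefSeq-Accn"], row["UCSC-style-name"])
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` to get the tabular form described above.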