feat(ref): support fetching GENCODE GTFs and FASTAs (#73) #238
Open
Elarwei001 wants to merge 6 commits into
Codecov Report
❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #238      +/-   ##
==========================================
+ Coverage   56.70%   57.41%   +0.71%
==========================================
  Files          29       29
  Lines        9392     9509     +117
==========================================
+ Hits         5326     5460     +134
+ Misses       4066     4049      -17
Add a `source` argument to `gget ref` (Python API and CLI) to select the reference database: "ensembl" (default, unchanged) or "gencode" (human and mouse only).

- With source="gencode", `release` is the GENCODE release number (mouse "M" prefix added automatically) and defaults to the latest release.
- `which` supports gtf, dna (primary assembly genome), cdna (transcripts), ncrna (lncRNA transcripts) and pep (translations); cds is not provided by GENCODE and raises a clear error.
- Returns the same dict/list structure as the Ensembl path (with a `gencode_release` field).

Tests + fixtures (live GENCODE FTP) and docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
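The argument rules in this commit message can be sketched as a small validation step. This is an illustration only, not the PR's code: `resolve_gencode_request`, the two constants, and the error wording are hypothetical. Only the rules themselves (human and mouse only, five `which` values, no `cds`, automatic mouse `M` prefix) come from the commit message. Resolving the latest release needs a network lookup, so the sketch assumes an explicit release.

```python
# Hypothetical constants; the PR may organise these differently.
GENCODE_SPECIES = {"human", "mouse"}
GENCODE_WHICH = {"gtf", "dna", "cdna", "ncrna", "pep"}


def resolve_gencode_request(species, which, release):
    """Validate the arguments and return (species, which, release_label)."""
    if species not in GENCODE_SPECIES:
        raise ValueError(f"GENCODE covers human and mouse only, got {species!r}.")
    if which == "cds":
        # GENCODE ships no CDS FASTA; the hint text here is made up.
        raise ValueError("GENCODE does not provide 'cds'; try 'cdna' or 'pep'.")
    if which not in GENCODE_WHICH:
        raise ValueError(f"Unsupported 'which' for GENCODE: {which!r}.")

    # Accept 46, "46", "M35" or "m35" and keep only the numeric part.
    number = str(release).upper()
    if number.startswith("M"):
        number = number[1:]
    if not number.isdigit():
        raise ValueError(f"Invalid GENCODE release: {release!r}.")

    # Mouse releases carry an "M" prefix (e.g. M35); human releases do not.
    label = f"M{number}" if species == "mouse" else number
    return species, which, label
```

Calling `resolve_gencode_request("mouse", "dna", 35)` yields the label `"M35"` without the caller typing the prefix, which is the behaviour the commit message describes for `release`.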
for more information, see https://pre-commit.ci
Add a network-free TestGencodeRefOffline class that mocks requests and find_FTP_link to cover:

- _find_latest_gencode_release (human/mouse release resolution and error paths)
- _gencode_ref (species and 'which' validation, link found/not-found, ftp/json save, release resolution)
- the ref() source validation and GENCODE delegation

All PR-added lines are now covered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
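A network-free test of latest-release resolution might look like the sketch below. It is not the PR's `TestGencodeRefOffline`: the commit patches `requests` and `find_FTP_link`, whereas this version injects a fake `get` callable so it needs no third-party imports. `find_latest_release`, the `release_<N>/` directory naming, and the example URL are assumptions made for illustration.

```python
import re
from unittest.mock import Mock


def find_latest_release(get, listing_url, mouse=False):
    """Return the highest release number found in a directory listing.

    `get` is any callable shaped like `requests.get`: it takes a URL and
    returns an object with `status_code` and `text` attributes.
    """
    response = get(listing_url)
    if response.status_code != 200:
        raise RuntimeError(f"Listing request failed with status {response.status_code}.")

    # Assumed naming: release_46/ for human, release_M35/ for mouse.
    pattern = r'href="release_M(\d+)/"' if mouse else r'href="release_(\d+)/"'
    numbers = [int(n) for n in re.findall(pattern, response.text)]
    if not numbers:
        raise RuntimeError("No release directories found in listing.")

    latest = max(numbers)
    return f"M{latest}" if mouse else str(latest)


# A fake listing page stands in for the real server response.
FAKE_HTML = '<a href="release_45/">release_45/</a><a href="release_46/">release_46/</a>'
fake_get = Mock(return_value=Mock(status_code=200, text=FAKE_HTML))

latest = find_latest_release(fake_get, "https://example.org/listing/")
fake_get.assert_called_once_with("https://example.org/listing/")
```

Because the fake response is fully controlled, the error paths (non-200 status, empty listing) can be exercised the same way by changing `status_code` or `text`.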
da4c7a1 to 7b932f0
Elarwei001 added a commit to Elarwei001/gget_issues that referenced this pull request Jul 3, 2026
Bilingual (ZH/EN) analysis of scverse/gget#238 (adds source='gencode' to gget ref for human/mouse GTF+FASTA). Covers platform fit, return-value shape vs Ensembl, and the live-data-test hardening gaps (two fixture tests are de-facto live without skip-on-network-error; the latest-release HTML-scraping path has no live test). Updated README index + index.html card.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address live-data-test robustness for the GENCODE path:

1. Move the two network-touching fixtures (test_ref_gencode / test_ref_gencode_ftp) out of test_ref.json into an explicit TestGencodeRefLive class. They still assert exact values (anchored to the frozen releases human v46 / mouse M35, whose files never change), but now skip on a network error or a transient non-200 (rate-limit / HTML page) instead of reddening CI. The two validation-only fixtures (bad_species / no_cds) stay in test_ref.json because they raise before any request.
2. Add live tests for the HTML-scraping "latest release" path (_find_latest_gencode_release) with loose assertions (human -> integer string >= 46; mouse -> 'M'+integer >= 35), skipping on transient errors. This is the only regression net for upstream listing changes.
3. Warn (logger.warning) when a GENCODE file name no longer matches, instead of silently returning an empty entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
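The skip-instead-of-fail behaviour and the loose bounds from points 1 and 2 can be sketched with the standard library alone. `fetch_or_skip` and `is_plausible_latest` are hypothetical helpers, not names from the PR; the thresholds 46 and 35 are the ones quoted in the commit message.

```python
import unittest


def fetch_or_skip(fetch):
    """Call `fetch()` and skip the running test on a transient failure.

    OSError covers ConnectionError and TimeoutError, and also
    requests.exceptions.RequestException, which derives from IOError.
    """
    try:
        response = fetch()
    except OSError as err:
        raise unittest.SkipTest(f"network unavailable: {err}")
    if response.status_code != 200:
        # Rate limits and error pages are treated as transient, not as bugs.
        raise unittest.SkipTest(f"transient HTTP status {response.status_code}")
    return response


def is_plausible_latest(release, mouse=False):
    """Loose check: human -> integer string >= 46; mouse -> 'M' + integer >= 35."""
    if mouse:
        digits = release[1:]
        return release.startswith("M") and digits.isdigit() and int(digits) >= 35
    return release.isdigit() and int(release) >= 46
```

Inside a `unittest.TestCase` or a pytest test, raising `unittest.SkipTest` marks the test as skipped rather than failed. The loose check keeps passing when upstream publishes a newer release, while still catching a listing format change that yields garbage.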
- es/ref.md: mirror the GENCODE additions from en/ref.md (source arg, species restriction, which/cds note, release note, and the GENCODE fetch example).
- es/updates.md: add the Spanish 0.30.9 GENCODE changelog entry.
- en/updates.md: drop the "(fixes issue 73)" parenthetical from the GENCODE entry (readers of the changelog don't need the issue link).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…verse#73)

The curated `which` set (gtf/dna/cdna/ncrna/pep) mirrors Ensembl and covers the common needs, but GENCODE ships many more files (GFF3, basic/comprehensive annotation, pc_transcripts, metadata, ...). Rather than growing `which` into a source-specific, non-portable vocabulary, add a `list_files` escape hatch.

- `ref(..., source="gencode", list_files=True)` returns every file in the GENCODE release directory keyed by filename, each with its FTP link, release date/time, and size. Honors `ftp=True` (URL list) and `save=True`.
- CLI: `--list_files`/`-lf`. Requires `source="gencode"`; using it with Ensembl raises a clear ValueError (Ensembl splits files across per-type subdirectories, so there is no single directory to list).
- New `_list_gencode_files` parses the release directory HTML, skipping the parent-directory link and subdirectories.
- Tests: offline (parse/filter, bad status, dict/ftp shapes, ensembl-guard) and a live test asserting the listing includes pc_transcripts (a file with no `which` key), skipping on transient errors.
- Docs (en/es ref.md + updates.md) document the flag and add an example.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
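The parse-and-filter step described for `_list_gencode_files` can be illustrated with the standard library's `html.parser`. This is a sketch under assumptions, not the PR's implementation: `list_release_files`, the sample HTML, and the base URL are made up, and only the file URL is collected here, whereas the commit also records date/time and size.

```python
from html.parser import HTMLParser


class _ListingParser(HTMLParser):
    """Collect the href target of every <a> tag in a directory-listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def list_release_files(html, base_url):
    """Map each file name in the listing to its full URL.

    Skips column-sort links ("?C=N;O=D"), the parent-directory link
    (an absolute path), external links, and subdirectories (trailing "/").
    """
    parser = _ListingParser()
    parser.feed(html)
    files = {}
    for href in parser.hrefs:
        if href.startswith(("?", "/", "..")) or href.endswith("/") or "://" in href:
            continue
        files[href] = base_url.rstrip("/") + "/" + href
    return files


# Assumed shape of an Apache-style listing; the real page may differ.
SAMPLE = """
<a href="?C=N;O=D">Name</a>
<a href="/pub/databases/gencode/Gencode_human/">Parent Directory</a>
<a href="GRCh37_mapping/">GRCh37_mapping/</a>
<a href="gencode.v46.annotation.gtf.gz">gencode.v46.annotation.gtf.gz</a>
<a href="gencode.v46.pc_transcripts.fa.gz">gencode.v46.pc_transcripts.fa.gz</a>
"""

files = list_release_files(SAMPLE, "https://example.org/release_46/")
```

Keying the result by filename matches the shape the commit describes, and it surfaces files such as `pc_transcripts` that have no `which` key.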
Resolves #73
What this adds
A `source` argument to `gget ref` selecting the reference database: `"ensembl"` (default, unchanged) or `"gencode"`. With `source="gencode"`, `gget ref` fetches GENCODE reference GTFs and FASTAs (human and mouse only).

- `release` is the GENCODE release number (e.g. `46`; the mouse `M` prefix is added automatically), defaulting to the latest.
- `which` supports `gtf`, `dna` (primary-assembly genome), `cdna` (transcripts), `ncrna` (lncRNA transcripts), and `pep` (protein translations); GENCODE ships no `cds`, so that raises a hinted error.

CLI: `--source`/`-src`. Existing Ensembl behaviour is untouched.

Escape hatch for the full GENCODE directory

`which` intentionally mirrors the Ensembl vocabulary, so it exposes only the 5 common file types. GENCODE ships more (GFF3, basic/comprehensive annotation, `pc_transcripts`, full genome, metadata, ...). Rather than growing `which` into a source-specific, non-portable vocabulary, `list_files=True` (CLI `--list_files`/`-lf`, `source="gencode"` only) returns every file in the release directory keyed by filename, each with its FTP link, date, and size. Using it with Ensembl raises a clear error (Ensembl splits files across per-type subdirectories).

Testing

- … `which` validation, missing-file warning, `list_files` parse/filter, save/ftp paths, Ensembl `list_files` guard).
- … (`TestGencodeRefLive`) that hit the real GENCODE FTP, anchored to frozen releases (human v46 / mouse M35) for exact assertions, plus the latest-release scraping path with loose bounds and a `list_files` check. All skip on a network error or transient non-200 rather than failing CI.
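A usage sketch of the Python API as this description states it. The `source` and `list_files` arguments come from this PR, so the calls below assume the PR branch is installed and the GENCODE server is reachable. `fetch_gencode_examples` is a hypothetical wrapper, and the import is deferred so that defining it does not require `gget`.

```python
# Keyword arguments taken from the PR description.
GTF_KWARGS = {"which": "gtf", "source": "gencode", "release": 46}
LISTING_KWARGS = {"source": "gencode", "list_files": True}


def fetch_gencode_examples():
    """Fetch the human v46 GTF entry and the full mouse file listing."""
    import gget  # assumes a gget build that includes this PR

    # Frozen human release 46; passing `release` pins the result.
    gtf = gget.ref("human", **GTF_KWARGS)

    # No `release` given, so the latest mouse release is resolved;
    # the "M" prefix is added by gget, per the description above.
    listing = gget.ref("mouse", **LISTING_KWARGS)
    return gtf, listing
```

Per the description, passing `list_files=True` without `source="gencode"` is expected to raise an error rather than return an Ensembl listing.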