
feat(ref): support fetching GENCODE GTFs and FASTAs (#73) #238

Open
Elarwei001 wants to merge 6 commits into scverse:dev from Elarwei001:feature/ref-seq-gencode-73

Conversation

Elarwei001 (Contributor) commented Jun 24, 2026

Resolves #73

What this adds

A source argument to gget ref selecting the reference database: "ensembl" (default, unchanged) or "gencode". With source="gencode", gget ref fetches GENCODE reference GTFs and FASTAs (human and mouse only).

  • release is the GENCODE release number (e.g. 46; the mouse M prefix is added automatically), defaulting to the latest.
  • which supports gtf, dna (primary-assembly genome), cdna (transcripts), ncrna (lncRNA transcripts), and pep (protein translations); GENCODE ships no cds file, so requesting cds raises an error that includes a hint.
  • The return shape mirrors the Ensembl branch (same outer keys), so downstream code is unaffected.
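
The release and which rules above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `GENCODE_WHICH`, `normalize_gencode_release`, and `validate_which` are hypothetical names, and the error wording (including the hint) is invented for the example.

```python
# Hypothetical sketch: names and messages are illustrative, not taken from gget.
GENCODE_WHICH = {"gtf", "dna", "cdna", "ncrna", "pep"}
GENCODE_SPECIES = {"human", "mouse"}


def normalize_gencode_release(species, release):
    """Return the release label, adding the mouse 'M' prefix when it is missing."""
    if species not in GENCODE_SPECIES:
        raise ValueError(f"GENCODE covers human and mouse only, got {species!r}")
    text = str(release).strip()
    if species == "mouse" and text[:1] in ("M", "m"):
        text = text[1:]  # accept both 35 and "M35" for mouse
    if not text.isdigit():
        raise ValueError(f"Invalid GENCODE release: {release!r}")
    return f"M{text}" if species == "mouse" else text


def validate_which(which):
    """Check requested file types; 'cds' gets its own hinted error."""
    requested = [which] if isinstance(which, str) else list(which)
    if "cds" in requested:
        raise ValueError(
            "GENCODE does not provide 'cds' files; consider 'cdna' or 'pep'."
        )
    unknown = [w for w in requested if w not in GENCODE_WHICH]
    if unknown:
        raise ValueError(f"Unsupported 'which' values for GENCODE: {unknown}")
    return requested
```

Accepting both `35` and `"M35"` for mouse is an assumption of this sketch; the description only states that the prefix is added automatically.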

CLI: --source/-src. Existing Ensembl behaviour is untouched.

gget ref -w gtf,dna -r 46 --source gencode human

Escape hatch for the full GENCODE directory

The which argument intentionally mirrors the Ensembl vocabulary, so it exposes only the 5 common file types. GENCODE ships more (GFF3, basic/comprehensive annotation, pc_transcripts, full genome, metadata, ...). Rather than growing which into a source-specific, non-portable vocabulary, list_files=True (CLI --list_files/-lf, source="gencode" only) returns every file in the release directory keyed by filename, each with its FTP link, date, and size. Using it with Ensembl raises a clear error, because Ensembl splits files across per-type subdirectories and has no single directory to list.

gget ref --list_files -r 46 --source gencode human
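
To make the "keyed by filename" shape concrete, here is a minimal parsing sketch. It assumes an Apache-style directory index in which each row is a link followed by a timestamp and a size; the function name `parse_listing`, the dict keys (`ftp`, `date`, `size`), and the `example.org` base URL are all hypothetical and may differ from what the PR's `_list_gencode_files` actually does.

```python
import re

# Assumed row layout: <a href="NAME">NAME</a>  YYYY-MM-DD HH:MM  SIZE
ROW = re.compile(
    r'<a href="(?P<href>[^"?]+)">[^<]*</a>\s+'
    r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2})\s+(?P<size>\S+)"
)


def parse_listing(html, base_url):
    """Map each file name in a directory index to its link, date, and size."""
    files = {}
    for match in ROW.finditer(html):
        href = match.group("href")
        # Skip subdirectories and any parent-directory style link.
        if href.endswith("/") or href.startswith(("/", "..")):
            continue
        files[href] = {
            "ftp": base_url.rstrip("/") + "/" + href,
            "date": match.group("date"),
            "size": match.group("size"),
        }
    return files


sample = (
    '<a href="/parent/">Parent Directory</a>   -\n'
    '<a href="GRCh37_mapping/">GRCh37_mapping/</a>  2024-05-13 10:00    -\n'
    '<a href="gencode.v46.pc_transcripts.fa.gz">gencode.v46.pc_transcripts.fa.gz</a>'
    "  2024-05-13 10:01   14M\n"
)
listing = parse_listing(sample, "https://example.org/release_46/")
```

The sample rows are made up for illustration; only the file-name pattern follows GENCODE's naming.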

Testing

  • Network-free unit tests for the GENCODE helpers (release parsing, which validation, missing-file warning, list_files parse/filter, save/ftp paths, Ensembl list_files guard).
  • Live tests (TestGencodeRefLive) that hit the real GENCODE FTP, anchored to frozen releases (human v46 / mouse M35) for exact assertions, plus the latest-release scraping path with loose bounds and a list_files check. All skip on a network error or transient non-200 rather than failing CI.
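
The skip-instead-of-fail behaviour described for the live tests can be sketched like this. It is an assumption-laden illustration, not the PR's test code: `live_or_skip` is a hypothetical helper, and the fetch callable returning a `(status, body)` pair is a convention chosen for the example.

```python
import unittest


def live_or_skip(fetch):
    """Call fetch() -> (status, body); skip the current test on transient failures."""
    try:
        status, body = fetch()
    except OSError as exc:
        # OSError covers ConnectionError, timeouts, and urllib's URLError;
        # requests' RequestException also derives from it.
        raise unittest.SkipTest(f"network error: {exc}")
    if status != 200:
        raise unittest.SkipTest(f"transient HTTP status {status}")
    return body


class ExampleLiveTest(unittest.TestCase):
    def test_frozen_release(self):
        # The lambda stands in for a real request to a frozen release.
        body = live_or_skip(lambda: (200, "gencode.v46.annotation.gtf.gz"))
        self.assertIn("v46", body)
```

Raising `unittest.SkipTest` reports the test as skipped under both unittest and pytest, so an unreachable server does not turn CI red.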

codecov-commenter commented Jun 24, 2026

Codecov Report

❌ Patch coverage is 94.87179% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.41%. Comparing base (e43a804) to head (5d5aebd).
⚠️ Report is 2 commits behind head on dev.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| gget/gget_ref.py | 94.82% | 6 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #238      +/-   ##
==========================================
+ Coverage   56.70%   57.41%   +0.71%     
==========================================
  Files          29       29              
  Lines        9392     9509     +117     
==========================================
+ Hits         5326     5460     +134     
+ Misses       4066     4049      -17     


Elarwei001 marked this pull request as draft June 25, 2026 03:44
lauraluebbert deleted the branch scverse:dev June 28, 2026 20:31
lauraluebbert reopened this Jun 28, 2026
Elarwei001 and others added 3 commits June 28, 2026 19:51
Add a `source` argument to `gget ref` (Python API and CLI) to select the
reference database: "ensembl" (default, unchanged) or "gencode" (human and
mouse only).

- With source="gencode", `release` is the GENCODE release number (mouse "M"
  prefix added automatically) and defaults to the latest release.
- `which` supports gtf, dna (primary assembly genome), cdna (transcripts),
  ncrna (lncRNA transcripts) and pep (translations); cds is not provided by
  GENCODE and raises a clear error.
- Returns the same dict/list structure as the Ensembl path (with a
  `gencode_release` field). Tests + fixtures (live GENCODE FTP) and docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a network-free TestGencodeRefOffline class that mocks requests and
find_FTP_link to cover _find_latest_gencode_release (human/mouse release
resolution and error paths) and _gencode_ref (species and 'which'
validation, link found/not-found, ftp/json save, release resolution),
plus the ref() source validation and GENCODE delegation. All PR-added
lines are now covered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lauraluebbert force-pushed the feature/ref-seq-gencode-73 branch from da4c7a1 to 7b932f0 Compare June 28, 2026 23:55
Elarwei001 added a commit to Elarwei001/gget_issues that referenced this pull request Jul 3, 2026
Bilingual (ZH/EN) analysis of scverse/gget#238 (adds source='gencode' to
gget ref for human/mouse GTF+FASTA). Covers platform fit, return-value
shape vs Ensembl, and the live-data-test hardening gaps (two fixture
tests are de-facto live without skip-on-network-error; the latest-release
HTML-scraping path has no live test). Updated README index + index.html card.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Elarwei001 and others added 3 commits July 3, 2026 22:14
Address live-data-test robustness for the GENCODE path:

1. Move the two network-touching fixtures (test_ref_gencode /
   test_ref_gencode_ftp) out of test_ref.json into an explicit
   TestGencodeRefLive class. They still assert exact values (anchored to
   the frozen releases human v46 / mouse M35, whose files never change),
   but now skip on a network error or a transient non-200 (rate-limit /
   HTML page) instead of reddening CI. The two validation-only fixtures
   (bad_species / no_cds) stay in test_ref.json — they raise before any
   request.
2. Add live tests for the HTML-scraping "latest release" path
   (_find_latest_gencode_release) with loose assertions (human -> integer
   string >= 46; mouse -> 'M'+integer >= 35), skipping on transient
   errors. This is the only regression net for upstream listing changes.
3. Warn (logger.warning) when a GENCODE file name no longer matches,
   instead of silently returning an empty entry.
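
A rough sketch of the "latest release" resolution and the loose assertions from item 2 follows. It assumes the species index lists one `release_<N>/` directory per human release and one `release_M<N>/` per mouse release; `latest_release` is a hypothetical name and the sample HTML is invented, so the real `_find_latest_gencode_release` may work differently.

```python
import re


def latest_release(listing_html, species):
    """Pick the highest release label from a species index page."""
    prefix = "M" if species == "mouse" else ""
    numbers = [
        int(n) for n in re.findall(rf"release_{prefix}(\d+)/", listing_html)
    ]
    if not numbers:
        raise ValueError("No GENCODE release directories found in listing")
    return f"{prefix}{max(numbers)}"


human_html = '<a href="release_45/">release_45/</a> <a href="release_46/">release_46/</a>'
mouse_html = '<a href="release_M34/">release_M34/</a> <a href="release_M35/">release_M35/</a>'

human = latest_release(human_html, "human")
mouse = latest_release(mouse_html, "mouse")

# Loose bounds: these stay true as newer releases appear upstream.
assert human.isdigit() and int(human) >= 46
assert mouse.startswith("M") and int(mouse[1:]) >= 35
```

Comparing releases as integers rather than strings matters here: a string comparison would rank release 9 above release 46.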

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- es/ref.md: mirror the GENCODE additions from en/ref.md (source arg,
  species restriction, which/cds note, release note, and the GENCODE
  fetch example).
- es/updates.md: add the Spanish 0.30.9 GENCODE changelog entry.
- en/updates.md: drop the "(fixes issue 73)" parenthetical from the
  GENCODE entry (readers of the changelog don't need the issue link).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…verse#73)

The curated `which` set (gtf/dna/cdna/ncrna/pep) mirrors Ensembl and
covers the common needs, but GENCODE ships many more files (GFF3,
basic/comprehensive annotation, pc_transcripts, metadata, ...). Rather
than growing `which` into a source-specific, non-portable vocabulary,
add a `list_files` escape hatch.

- `ref(..., source="gencode", list_files=True)` returns every file in the
  GENCODE release directory keyed by filename, each with its FTP link,
  release date/time, and size. Honors `ftp=True` (URL list) and `save=True`.
- CLI: `--list_files`/`-lf`. Requires `source="gencode"`; using it with
  Ensembl raises a clear ValueError (Ensembl splits files across per-type
  subdirectories, so there is no single directory to list).
- New `_list_gencode_files` parses the release directory HTML, skipping the
  parent-directory link and subdirectories.
- Tests: offline (parse/filter, bad status, dict/ftp shapes, ensembl-guard)
  and a live test asserting the listing includes pc_transcripts (a file with
  no `which` key), skipping on transient errors.
- Docs (en/es ref.md + updates.md) document the flag and add an example.
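
The Ensembl guard and the `ftp=True` shape described above might look roughly like this. Both function names and the `ftp` key of each entry are hypothetical, and the error text is invented; the sketch only illustrates the control flow the commit message describes.

```python
def check_list_files(source="ensembl", list_files=False):
    """Validate the source and reject list_files for anything but GENCODE."""
    if source not in {"ensembl", "gencode"}:
        raise ValueError(f"source must be 'ensembl' or 'gencode', got {source!r}")
    if list_files and source != "gencode":
        raise ValueError(
            "list_files is only supported with source='gencode': Ensembl "
            "splits files across per-type subdirectories."
        )
    return source


def shape_listing(files, ftp=False):
    """Return the full dict, or just the list of links when ftp=True."""
    if ftp:
        return [entry["ftp"] for entry in files.values()]
    return files
```

Validating before any request is made keeps this path testable offline, in line with the validation-only fixtures mentioned earlier in the thread.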

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Elarwei001 marked this pull request as ready for review July 3, 2026 14:58

3 participants