
Add attribute_aliases + cast_version_columns to read_gtf (closes #63, #64) #67

Merged: iskandr merged 3 commits into master from feature/gencode-aliases-and-versions on May 13, 2026


@iskandr iskandr commented May 13, 2026

Summary

  • attribute_aliases (#63, "read_gtf attribute_aliases option for GENCODE biotype column names"): new kwarg on read_gtf that renames alias attribute columns onto canonical ones after expansion. Exports GENCODE_BIOTYPE_ALIASES = {"gene_type": "gene_biotype", "transcript_type": "transcript_biotype"} so downstream tools can normalize a GENCODE GTF onto Ensembl column names in one line. If both alias and canonical are present, the canonical wins and a warning is logged. Applied before infer_biotype_column so the inference path sees the renamed columns.
  • cast_version_columns (#64, "Optional integer-typed version columns (gene_version, transcript_version, protein_version, exon_version)"), default True: the four well-known Ensembl *_version attributes (gene_version, transcript_version, protein_version, exon_version) come back as pandas nullable Int64 instead of strings. Rows missing a version produce pd.NA. Pass cast_version_columns=False to keep the prior string behavior. Exports the INTEGER_VERSION_COLUMNS constant for downstream use.
  • Bumps version to 2.7.0 (per AGENTS.md).
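
As a rough illustration of the alias rule described in the first bullet, here is a minimal sketch that works on a plain dict of columns. The helper name `rename_alias_columns` is hypothetical and this is not the PR's implementation; only the alias mapping itself is taken from the description above.

```python
import logging

logger = logging.getLogger(__name__)

# Mapping quoted in the PR description (alias -> canonical).
GENCODE_BIOTYPE_ALIASES = {
    "gene_type": "gene_biotype",
    "transcript_type": "transcript_biotype",
}


def rename_alias_columns(columns, aliases):
    """Hypothetical helper: rename alias columns onto canonical names.

    `columns` maps column name -> list of values. If the canonical column
    already exists, it is kept and the alias column is dropped with a warning.
    """
    result = dict(columns)
    for alias, canonical in aliases.items():
        if alias not in result:
            continue  # a missing alias is a no-op
        values = result.pop(alias)
        if canonical in result:
            logger.warning("dropping %r: %r already present", alias, canonical)
            continue
        result[canonical] = values
    return result


# Illustrative values only, not taken from a real GTF.
gencode_like = {
    "gene_id": ["G1"],
    "gene_type": ["protein_coding"],
    "transcript_type": ["protein_coding"],
}
renamed = rename_alias_columns(gencode_like, GENCODE_BIOTYPE_ALIASES)
print(sorted(renamed))
```

Going by the description, the real call would presumably look like `read_gtf(path, attribute_aliases=GENCODE_BIOTYPE_ALIASES)`; the sketch only mirrors the stated rules (missing alias is a no-op, canonical wins on conflict).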

Motivated by openvax/pyensembl#335, where a user wiring a GENCODE GTF into pyensembl got NoncodingTranscript(...) for every variant effect: GENCODE stores biotypes in gene_type/transcript_type rather than the gene_biotype/transcript_biotype columns pyensembl reads, so pyensembl's biotype check returned False. With attribute_aliases=GENCODE_BIOTYPE_ALIASES the rename happens at parse time and downstream tools don't need to special-case GENCODE.

Companion follow-up in pyensembl: openvax/pyensembl#351 (preserve FASTA-header versions in SequenceData).

Test plan

  • ./lint.sh clean
  • ./test.sh — 35 passed (15 new), coverage 90% overall, read_gtf.py at 87%
  • New test file tests/test_gencode_gtf.py covers:
    • alias rename basic case, polars + pandas result types
    • alias-vs-canonical conflict (canonical wins, warning logged)
    • missing alias is a no-op
    • empty attribute_aliases={} is a no-op
    • cast_version_columns default casts to Int64, False keeps strings
    • missing version rows become pd.NA
    • GTFs without any *_version attributes don't trip the cast
    • aliases + version casting compose
    • aliases are visible to infer_biotype_column
    • constants are stable

Closes #63: `attribute_aliases` lets callers normalize GENCODE-format
GTFs onto Ensembl column names (`gene_type` -> `gene_biotype`,
`transcript_type` -> `transcript_biotype`) at parse time. New
`GENCODE_BIOTYPE_ALIASES` constant exports the canonical pair.

Closes #64: `cast_version_columns=True` (default) casts the four
well-known Ensembl `*_version` attributes (gene/transcript/protein/exon)
from strings to pandas nullable Int64 when present, so downstream
consumers like pyensembl don't have to `int(...)` themselves. Pass
`cast_version_columns=False` to opt out and keep them as strings.
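
A minimal pandas sketch of what such a cast could look like, assuming a lenient `pd.to_numeric(..., errors="coerce")` step followed by a cast to nullable `Int64`. The helper `cast_versions` and the local `VERSION_COLUMNS` tuple are hypothetical stand-ins; the library's actual code path is not shown in this PR.

```python
import pandas as pd

# Local stand-in for the four column names listed in the PR description.
VERSION_COLUMNS = (
    "gene_version",
    "transcript_version",
    "protein_version",
    "exon_version",
)


def cast_versions(df, columns=VERSION_COLUMNS):
    """Hypothetical helper: cast any present version columns to nullable Int64."""
    out = df.copy()
    for name in columns:
        if name in out.columns:
            # errors="coerce" turns missing or unparseable values into NaN,
            # which becomes pd.NA after the cast to the nullable Int64 dtype.
            out[name] = pd.to_numeric(out[name], errors="coerce").astype("Int64")
    return out


df = pd.DataFrame({
    "gene_id": ["G1", "G2", "G3", "G4"],
    "gene_version": pd.Series(["0", "12", None, "v3"], dtype=object),
})
casted = cast_versions(df)
print(casted["gene_version"].dtype)
print(casted["gene_version"].tolist())
```

In this sketch a version of "0" stays the integer 0, while a missing value and a malformed string both end up as `pd.NA`. A frame without any version columns passes through unchanged.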

Adds tests/data/gencode.head.gtf (synthetic GENCODE-style fixture
carrying both alias columns and separate *_version attributes) plus
test_gencode_gtf.py covering: alias rename, alias-vs-canonical conflict,
missing alias no-op, opt-out, missing-attribute no-op, polars/pandas
result types, and interaction with infer_biotype_column.

Bump version to 2.7.0.

Python 3.9 / 3.11 CI failed on test_cast_version_columns_false_keeps_strings
because newer pandas returns its own StringDtype (rather than the numpy
object dtype) for string-typed columns. Relax the assertion to
pd.api.types.is_string_dtype, which returns True for both.
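
The relaxed assertion can be illustrated with a small standalone check. The two series below are constructed by hand to stand in for the two dtypes mentioned above; the helper name `holds_strings` is hypothetical and this is not the test from the PR.

```python
import pandas as pd


def holds_strings(series):
    """Return True for string-valued object columns and for StringDtype columns."""
    return bool(pd.api.types.is_string_dtype(series))


# Same values, two different backing dtypes.
object_backed = pd.Series(["1", "2"], dtype=object)
string_backed = pd.Series(["1", "2"], dtype="string")
integer_backed = pd.Series([1, 2], dtype="Int64")

for series in (object_backed, string_backed):
    assert holds_strings(series)
assert not holds_strings(integer_backed)
```

Comparing against one specific dtype would tie the test to whichever string representation a given pandas version picks, which is the failure mode the commit describes.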

coveralls commented May 13, 2026

Coverage Report for CI Build 25808853095

Coverage increased (+7.3%) to 93.78%

Details

  • Coverage increased (+7.3%) from the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • 8 coverage regressions across 1 file.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

8 previously-covered lines in 1 file lost coverage.

| File | Lines Losing Coverage | Coverage |
| --- | --- | --- |
| read_gtf.py | 8 | 93.55% |

Coverage Stats

Relevant Lines: 209
Covered Lines: 196
Line Coverage: 93.78%
Coverage Strength: 0.94 hits per line


Fixes from PR review:

1. _apply_attribute_aliases: two aliases targeting the same canonical
   (e.g. both gene_type and a hypothetical gene_kind -> gene_biotype)
   silently produced a duplicate-column rename. Track renames in a
   running set so the second collision is flagged the same way as a
   pre-existing canonical (drop + warn). First in iteration order wins.

2. usecols + attribute_aliases: usecols was enforced at parse time
   (inside parse_gtf_and_expand_attributes) before aliases could run,
   so requesting `usecols=["gene_biotype"]` against a GENCODE GTF that
   only had gene_type returned an empty frame. Expand the parse-time
   filter to pull alias source columns through when their canonical is
   in usecols. End-of-function usecols filter still narrows down to
   the canonical name only.
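
The usecols fix in item 2 can be sketched as a small pre-processing step on the requested column list. The helper `expand_usecols_for_aliases` is a hypothetical name; the real change lives around parse_gtf_and_expand_attributes and may be structured differently.

```python
def expand_usecols_for_aliases(usecols, aliases):
    """Hypothetical helper: let alias source columns through the parse-time filter.

    For every canonical name requested in `usecols`, also request the alias
    column that could be renamed onto it.
    """
    if usecols is None:
        return None  # no filtering requested
    expanded = list(usecols)
    for alias, canonical in aliases.items():
        if canonical in usecols and alias not in expanded:
            expanded.append(alias)
    return expanded


aliases = {"gene_type": "gene_biotype", "transcript_type": "transcript_biotype"}
requested = ["gene_id", "gene_biotype"]

# Columns to keep while parsing: the request plus the matching alias source.
parse_time = expand_usecols_for_aliases(requested, aliases)
print(parse_time)  # ['gene_id', 'gene_biotype', 'gene_type']

# After the alias rename, the final filter narrows back to the original request.
columns_after_rename = ["gene_id", "gene_biotype"]
final = [name for name in columns_after_rename if name in requested]
print(final)  # ['gene_id', 'gene_biotype']
```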

New test coverage (18 new tests, 54 total):

  * explicit attribute_aliases=None matches implicit default
  * result_type="dict" with both kwargs
  * multiple aliases -> same canonical: first wins, second warned+dropped
  * alias-chain semantics pinned (no chasing — each rule judged against
    the pre-rename column set)
  * aliasing preserves all unrelated columns
  * usecols composes with attribute_aliases
  * gene_version=0 round-trips as Int64(0), not pd.NA
  * malformed version string ("v3") coerces to pd.NA (lenient by design)
  * version cast idempotent across repeated reads (global polars cache safety)

Real GENCODE fixture (tests/data/gencode.real.head.gtf) reflecting the
on-disk shape — versioned IDs embedded in gene_id/transcript_id rather
than separate *_version attrs, GENCODE-only fields (level, hgnc_id,
havana_gene/transcript, tag, transcript_support_level), and chr-prefixed
seqnames — plus 7 tests against it:

  * alias rename works on the real shape
  * no synthesized *_version columns when attrs absent
  * versioned ID strings preserved verbatim (ENSG/ENST/ENSP/ENSE)
  * GENCODE-only attribute fields all come through
  * chr1/chrM seqname convention preserved
  * gzipped input round-trips
  * infer_biotype_column stays silent on GENCODE (source is HAVANA, not
    a biotype value) — regression guard

Ensembl regression checks:
  * old release-75 fixture still uses 'source' for biotype inference
  * passing GENCODE_BIOTYPE_ALIASES against a pure Ensembl GTF is a no-op
@iskandr iskandr merged commit 1c66fa0 into master May 13, 2026
6 checks passed
@iskandr iskandr deleted the feature/gencode-aliases-and-versions branch May 13, 2026 15:26