Add attribute_aliases + cast_version_columns to read_gtf (closes #63, #64) by iskandr · Pull Request #67 · openvax/gtfparse

iskandr · 2026-05-13T15:05:30Z

Summary

- read_gtf attribute_aliases option for GENCODE biotype column names (#63): attribute_aliases is a new kwarg on read_gtf that renames alias attribute columns onto canonical ones after expansion. Exports GENCODE_BIOTYPE_ALIASES = {"gene_type": "gene_biotype", "transcript_type": "transcript_biotype"} so downstream tools can normalize a GENCODE GTF onto Ensembl column names in one line. If both alias and canonical are present, the canonical wins and a warning is logged. Aliases are applied before infer_biotype_column, so the inference path sees the renamed columns.
- Optional integer-typed version columns (gene_version, transcript_version, protein_version, exon_version) (#64): with cast_version_columns (default True), these four well-known Ensembl *_version attributes come back as pandas nullable Int64 instead of strings. Missing rows produce pd.NA. Pass cast_version_columns=False to keep the prior string behavior. Exports an INTEGER_VERSION_COLUMNS constant for downstream use.
- Bumps version to 2.7.0 (per AGENTS.md).
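
The rename rule in the first item can be illustrated without gtfparse. The following is a minimal pure-Python sketch, not the library's implementation: `apply_aliases_sketch` and the dict-of-lists column layout are hypothetical, and only the alias mapping itself is taken from this PR. It also applies the first-wins rule for two aliases sharing a canonical name, which the review fixes later in this thread describe.

```python
import logging

logger = logging.getLogger("alias_sketch")

# Same mapping this PR describes for GENCODE_BIOTYPE_ALIASES.
GENCODE_BIOTYPE_ALIASES = {
    "gene_type": "gene_biotype",
    "transcript_type": "transcript_biotype",
}


def apply_aliases_sketch(columns, aliases):
    """Rename alias columns onto canonical names (hypothetical helper).

    `columns` maps column name -> list of values. On a conflict the
    canonical column is kept, the alias column is dropped, and a
    warning is logged.
    """
    if not aliases:
        return dict(columns)  # None or {} is a no-op
    present = set(columns)  # every rule is judged against this snapshot
    claimed = set()  # canonical names already taken by an earlier alias
    renames, dropped = {}, set()
    for alias, canonical in aliases.items():
        if alias not in present or alias == canonical:
            continue  # a missing alias is a no-op
        if canonical in present or canonical in claimed:
            logger.warning("dropping %r: %r already exists", alias, canonical)
            dropped.add(alias)
            continue
        renames[alias] = canonical
        claimed.add(canonical)
    # Rebuild the mapping in the original column order.
    return {
        renames.get(name, name): values
        for name, values in columns.items()
        if name not in dropped
    }


gencode = {"gene_id": ["ENSG1"], "gene_type": ["protein_coding"]}
renamed = apply_aliases_sketch(gencode, GENCODE_BIOTYPE_ALIASES)
print(list(renamed))
```

Judging every rule against a snapshot of the original column names keeps the result independent of what earlier rules renamed, which is one plausible reading of the "no chasing" semantics pinned by the tests.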

Motivated by openvax/pyensembl#335, where a user wiring a GENCODE GTF into pyensembl got NoncodingTranscript(...) for every variant effect because GENCODE's gene_type/transcript_type columns landed in unrelated names and pyensembl's biotype check returned False. With attribute_aliases=GENCODE_BIOTYPE_ALIASES the rename happens at parse time and downstream tools don't need to special-case GENCODE.

Companion follow-up in pyensembl: openvax/pyensembl#351 (preserve FASTA-header versions in SequenceData).

Test plan

./lint.sh clean
./test.sh — 35 passed (15 new), coverage 90% overall, read_gtf.py at 87%
New test file tests/test_gencode_gtf.py covers:
- alias rename basic case, polars + pandas result types
- alias-vs-canonical conflict (canonical wins, warning logged)
- missing alias is a no-op
- empty attribute_aliases={} is a no-op
- cast_version_columns default casts to Int64, False keeps strings
- missing version rows become pd.NA
- GTFs without any *_version attributes don't trip the cast
- aliases + version casting compose
- aliases are visible to infer_biotype_column
- constants are stable

Closes #63: `attribute_aliases` lets callers normalize GENCODE-format GTFs onto Ensembl column names (`gene_type` -> `gene_biotype`, `transcript_type` -> `transcript_biotype`) at parse time. New `GENCODE_BIOTYPE_ALIASES` constant exports the canonical pair.

Closes #64: `cast_version_columns=True` (default) casts the four well-known Ensembl `*_version` attributes (gene/transcript/protein/exon) from strings to pandas nullable Int64 when present, so downstream consumers like pyensembl don't have to `int(...)` themselves. Pass `cast_version_columns=False` to opt out and keep them as strings.

Adds tests/data/gencode.head.gtf (synthetic GENCODE-style fixture carrying both alias columns and separate *_version attributes) plus test_gencode_gtf.py covering: alias rename, alias-vs-canonical conflict, missing alias no-op, opt-out, missing-attribute no-op, polars/pandas result types, and interaction with infer_biotype_column.

Bump version to 2.7.0.
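
The version cast can be pictured with a small pure-Python sketch. It is an illustration under stated assumptions, not gtfparse's code: `parse_version_sketch` and `cast_version_columns_sketch` are hypothetical names, columns are modeled as a dict of lists, and `None` stands in for `pd.NA`. The lenient handling of malformed values mirrors the behavior pinned by the later review-fix tests in this thread.

```python
# The four version columns named in this PR.
INTEGER_VERSION_COLUMNS = (
    "gene_version",
    "transcript_version",
    "protein_version",
    "exon_version",
)


def parse_version_sketch(value):
    """Return an int, or None (standing in for pd.NA) when the value
    is missing or cannot be parsed as an integer."""
    if value is None:
        return None
    try:
        return int(str(value).strip())
    except ValueError:
        return None  # lenient: malformed strings become missing


def cast_version_columns_sketch(columns, cast=True):
    """Cast only the known version columns that are actually present."""
    if not cast:
        return dict(columns)  # opt-out keeps the original strings
    return {
        name: [parse_version_sketch(v) for v in values]
        if name in INTEGER_VERSION_COLUMNS
        else values
        for name, values in columns.items()
    }


rows = {"gene_id": ["ENSG1", "ENSG2", "ENSG3"], "gene_version": ["0", None, "v3"]}
casted = cast_version_columns_sketch(rows)
print(casted["gene_version"])
```

A frame with no `*_version` columns passes through unchanged, since only names found in `INTEGER_VERSION_COLUMNS` are touched.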

Python 3.9 / 3.11 CI failed on test_cast_version_columns_false_keeps_strings because newer pandas returns its own StringDtype (rather than the numpy object dtype) for string-typed columns. Relax the assertion to pd.api.types.is_string_dtype which returns True for both.

coveralls · 2026-05-13T15:15:18Z

Coverage Report for CI Build 25808853095

Coverage increased (+7.3%) to 93.78%

Details

Coverage increased (+7.3%) from the base build.
Patch coverage: No coverable lines changed in this PR.
8 coverage regressions across 1 file.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

8 previously-covered lines in 1 file lost coverage.

File	Lines Losing Coverage	Coverage
read_gtf.py	8	93.55%

Coverage Stats


Relevant Lines:	209
Covered Lines:	196
Line Coverage:	93.78%
Coverage Strength:	0.94 hits per line

💛 - Coveralls

Fixes from PR review:

1. _apply_attribute_aliases: two aliases targeting the same canonical (e.g. both gene_type and a hypothetical gene_kind -> gene_biotype) silently produced a duplicate-column rename. Track renames in a running set so the second collision is flagged the same way as a pre-existing canonical (drop + warn). First in iteration order wins.
2. usecols + attribute_aliases: usecols was enforced at parse time (inside parse_gtf_and_expand_attributes) before aliases could run, so requesting `usecols=["gene_biotype"]` against a GENCODE GTF that only had gene_type returned an empty frame. Expand the parse-time filter to pull alias source columns through when their canonical is in usecols. End-of-function usecols filter still narrows down to the canonical name only.

New test coverage (18 new tests, 54 total):

* explicit attribute_aliases=None matches implicit default
* result_type="dict" with both kwargs
* multiple aliases -> same canonical: first wins, second warned+dropped
* alias-chain semantics pinned (no chasing; each rule judged against the pre-rename column set)
* aliasing preserves all unrelated columns
* usecols composes with attribute_aliases
* gene_version=0 round-trips as Int64(0), not pd.NA
* malformed version string ("v3") coerces to pd.NA (lenient by design)
* version cast idempotent across repeated reads (global polars cache safety)

Real GENCODE fixture (tests/data/gencode.real.head.gtf) reflecting the on-disk shape: versioned IDs embedded in gene_id/transcript_id rather than separate *_version attrs, GENCODE-only fields (level, hgnc_id, havana_gene/transcript, tag, transcript_support_level), and chr-prefixed seqnames. Plus 7 tests against it:

* alias rename works on the real shape
* no synthesized *_version columns when attrs absent
* versioned ID strings preserved verbatim (ENSG/ENST/ENSP/ENSE)
* GENCODE-only attribute fields all come through
* chr1/chrM seqname convention preserved
* gzipped input round-trips
* infer_biotype_column stays silent on GENCODE (source is HAVANA, not a biotype value); regression guard

Ensembl regression checks:

* old release-75 fixture still uses 'source' for biotype inference
* passing GENCODE_BIOTYPE_ALIASES against a pure Ensembl GTF is a no-op
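
The usecols fix (item 2 of the review fixes) can be sketched as two small steps: widen the parse-time filter so alias source columns survive, then narrow back to what the caller asked for. This is a hedged pure-Python illustration; `parse_time_columns_sketch` and `final_columns_sketch` are hypothetical names, and the real logic lives inside read_gtf and parse_gtf_and_expand_attributes.

```python
GENCODE_BIOTYPE_ALIASES = {
    "gene_type": "gene_biotype",
    "transcript_type": "transcript_biotype",
}


def parse_time_columns_sketch(usecols, aliases):
    """Columns to keep while parsing so aliases can still be applied."""
    if usecols is None:
        return None  # no filtering requested
    requested = set(usecols)  # snapshot: added aliases never trigger more rules
    keep = list(usecols)
    for alias, canonical in (aliases or {}).items():
        if canonical in requested and alias not in requested:
            keep.append(alias)  # pull the alias source column through
    return keep


def final_columns_sketch(available, usecols):
    """End-of-function filter: only the names the caller requested."""
    if usecols is None:
        return list(available)
    return [name for name in usecols if name in available]


wanted = ["gene_id", "gene_biotype"]
parse_cols = parse_time_columns_sketch(wanted, GENCODE_BIOTYPE_ALIASES)
print(parse_cols)

# After parsing and renaming, gene_type has become gene_biotype.
after_rename = ["gene_id", "gene_biotype"]
print(final_columns_sketch(after_rename, wanted))
```

Without the widening step, a GENCODE file that only carries gene_type would lose that column before the rename could run, which matches the empty-frame symptom described above.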

iskandr mentioned this pull request May 13, 2026

Preserve FASTA-header versions in SequenceData (closes #351) openvax/pyensembl#355

Merged

3 tasks

iskandr merged commit 1c66fa0 into master May 13, 2026
6 checks passed

iskandr deleted the feature/gencode-aliases-and-versions branch May 13, 2026 15:26

This was referenced May 13, 2026

Fix #335 (part 1): wire GENCODE_BIOTYPE_ALIASES into read_gtf call openvax/pyensembl#356

Merged

ID version handling and GENCODE compatibility openvax/pyensembl#335

Closed

iskandr merged 3 commits into master from feature/gencode-aliases-and-versions
