Add attribute_aliases + cast_version_columns to read_gtf (closes #63, #64) #67
Merged
Closes #63: `attribute_aliases` lets callers normalize GENCODE-format GTFs onto Ensembl column names (`gene_type` -> `gene_biotype`, `transcript_type` -> `transcript_biotype`) at parse time. New `GENCODE_BIOTYPE_ALIASES` constant exports the canonical pair.

Closes #64: `cast_version_columns=True` (default) casts the four well-known Ensembl `*_version` attributes (gene/transcript/protein/exon) from strings to pandas nullable Int64 when present, so downstream consumers like pyensembl don't have to `int(...)` themselves. Pass `cast_version_columns=False` to opt out and keep them as strings.

Adds tests/data/gencode.head.gtf (synthetic GENCODE-style fixture carrying both alias columns and separate `*_version` attributes) plus test_gencode_gtf.py covering: alias rename, alias-vs-canonical conflict, missing alias no-op, opt-out, missing-attribute no-op, polars/pandas result types, and interaction with infer_biotype_column.

Bump version to 2.7.0.
Python 3.9 / 3.11 CI failed on test_cast_version_columns_false_keeps_strings because newer pandas returns its own StringDtype (rather than the numpy object dtype) for string-typed columns. Relax the assertion to pd.api.types.is_string_dtype which returns True for both.
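To illustrate the relaxed assertion, here is a minimal sketch, assuming only that pandas is installed. The two series are made-up stand-ins for a `*_version` column left as strings. They are not taken from the test suite.

```python
import pandas as pd

# The same string data under the two representations the CI runs disagreed on:
# numpy object dtype versus pandas' own string extension dtype.
as_object = pd.Series(["1", "2"], dtype=object)
as_string = pd.Series(["1", "2"], dtype="string")

# is_string_dtype accepts both, so an assertion written against it does not
# depend on which representation a given pandas version picks.
object_ok = pd.api.types.is_string_dtype(as_object)
string_ok = pd.api.types.is_string_dtype(as_string)
```

An assertion that compares against one exact dtype would pass on only one of the two series.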
Coverage Report for CI Build 25808853095
Coverage increased (+7.3%) to 93.78%
Uncovered Changes: No uncovered changes found.
Coverage Regressions: 8 previously-covered lines in 1 file lost coverage.
(Coveralls)
Fixes from PR review:
1. _apply_attribute_aliases: two aliases targeting the same canonical
(e.g. both gene_type and a hypothetical gene_kind -> gene_biotype)
silently produced a duplicate-column rename. Track renames in a
running set so the second collision is flagged the same way as a
pre-existing canonical (drop + warn). First in iteration order wins.
2. usecols + attribute_aliases: usecols was enforced at parse time
(inside parse_gtf_and_expand_attributes) before aliases could run,
so requesting `usecols=["gene_biotype"]` against a GENCODE GTF that
only had gene_type returned an empty frame. Expand the parse-time
filter to pull alias source columns through when their canonical is
in usecols. End-of-function usecols filter still narrows down to
the canonical name only.
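
The two fixes above can be sketched in plain Python over column names. This is a simplified model, not the code in `_apply_attribute_aliases`. The helper names `plan_alias_renames` and `expand_usecols` are hypothetical, and the real function works on a data frame rather than a list of names.

```python
import warnings

# Mirrors the constant named in the PR description.
GENCODE_BIOTYPE_ALIASES = {
    "gene_type": "gene_biotype",
    "transcript_type": "transcript_biotype",
}


def plan_alias_renames(columns, aliases):
    """Hypothetical helper: decide which alias columns get renamed or dropped.

    Returns (renames, dropped): `renames` maps alias -> canonical, and
    `dropped` lists aliases that lost a conflict.
    """
    original = set(columns)  # every rule is judged against the pre-rename set
    claimed = set()          # canonicals already produced by an earlier alias
    renames, dropped = {}, []
    for alias, canonical in aliases.items():
        if alias not in original:
            continue  # missing alias: no-op
        if canonical in original or canonical in claimed:
            # Same handling for a pre-existing canonical and a second alias
            # aiming at an already-claimed canonical: drop and warn.
            warnings.warn(f"dropping {alias!r}: {canonical!r} already present")
            dropped.append(alias)
        else:
            renames[alias] = canonical
            claimed.add(canonical)
    return renames, dropped


def expand_usecols(usecols, aliases):
    """Hypothetical helper: widen the parse-time filter so alias source
    columns survive when their canonical name was requested."""
    wanted = list(usecols)
    for alias, canonical in aliases.items():
        if canonical in usecols and alias not in wanted:
            wanted.append(alias)
    return wanted
```

Because dicts preserve insertion order, "first in iteration order wins" falls out of the loop. A final filter back down to the caller's `usecols` would then remove any alias column that was pulled through only for the rename.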
New test coverage (18 new tests, 54 total):
* explicit attribute_aliases=None matches implicit default
* result_type="dict" with both kwargs
* multiple aliases -> same canonical: first wins, second warned+dropped
* alias-chain semantics pinned (no chasing — each rule judged against
the pre-rename column set)
* aliasing preserves all unrelated columns
* usecols composes with attribute_aliases
* gene_version=0 round-trips as Int64(0), not pd.NA
* malformed version string ("v3") coerces to pd.NA (lenient by design)
* version cast idempotent across repeated reads (global polars cache safety)
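
The version-cast behaviors pinned by these tests can be reproduced with a short pandas sketch. It assumes a lenient numeric coercion followed by a cast to nullable `Int64`. The PR does not show how the library implements the cast, so the function body here is an illustration only, and the example frame is made up.

```python
import pandas as pd

# Column names listed in the PR description.
INTEGER_VERSION_COLUMNS = (
    "gene_version",
    "transcript_version",
    "protein_version",
    "exon_version",
)


def cast_version_columns(df):
    """Return a copy of df with any present *_version columns as nullable Int64."""
    out = df.copy()
    for name in INTEGER_VERSION_COLUMNS:
        if name in out.columns:
            # errors="coerce" turns unparseable strings such as "v3" into NaN,
            # which becomes pd.NA after the Int64 cast.
            out[name] = pd.to_numeric(out[name], errors="coerce").astype("Int64")
    return out


# Made-up rows: a zero version, a normal version, a malformed one, a missing one.
frame = pd.DataFrame(
    {
        "gene_id": ["g1", "g2", "g3", "g4"],
        "gene_version": pd.Series(["0", "5", "v3", None], dtype=object),
    }
)
casted = cast_version_columns(frame)
```

In this sketch `"0"` stays a real zero rather than being treated as missing, and columns absent from the input are never created.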
Real GENCODE fixture (tests/data/gencode.real.head.gtf) reflecting the
on-disk shape — versioned IDs embedded in gene_id/transcript_id rather
than separate *_version attrs, GENCODE-only fields (level, hgnc_id,
havana_gene/transcript, tag, transcript_support_level), and chr-prefixed
seqnames — plus 7 tests against it:
* alias rename works on the real shape
* no synthesized *_version columns when attrs absent
* versioned ID strings preserved verbatim (ENSG/ENST/ENSP/ENSE)
* GENCODE-only attribute fields all come through
* chr1/chrM seqname convention preserved
* gzipped input round-trips
* infer_biotype_column stays silent on GENCODE (source is HAVANA, not
a biotype value) — regression guard
Ensembl regression checks:
* old release-75 fixture still uses 'source' for biotype inference
* passing GENCODE_BIOTYPE_ALIASES against a pure Ensembl GTF is a no-op
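
The GENCODE and Ensembl regression guards can be pictured with a simplified model of the biotype inference. It assumes the heuristic copies `source` into a missing biotype column only when `source` contains the value `protein_coding`. That rule is an assumption made for illustration and may differ from what `infer_biotype_column` does internally. The function name and the row values are hypothetical.

```python
def infer_biotype_from_source(columns):
    """columns: dict mapping column name -> list of values, one per row.

    Simplified model (an assumption, not the library's code): only treat
    'source' as a biotype when it contains a known biotype value.
    """
    out = dict(columns)
    if "protein_coding" in set(out.get("source", [])):
        for name in ("gene_biotype", "transcript_biotype"):
            if name not in out:
                out[name] = list(out["source"])
    return out


# Old Ensembl style: the second GTF column carries the biotype.
old_ensembl = {"gene_id": ["g1", "g2"], "source": ["protein_coding", "lincRNA"]}
# GENCODE style after aliasing: 'source' names the annotation pipeline.
gencode = {
    "gene_id": ["g1", "g2"],
    "source": ["HAVANA", "ENSEMBL"],
    "gene_biotype": ["protein_coding", "lncRNA"],
}

old_result = infer_biotype_from_source(old_ensembl)
gencode_result = infer_biotype_from_source(gencode)
```

Under this model the old-style input gains biotype columns from `source`, while the GENCODE input passes through unchanged because `HAVANA` is not a biotype value.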
This was referenced May 13, 2026
Summary
* `attribute_aliases`: new kwarg on `read_gtf` that renames alias attribute columns onto canonical ones after expansion. Exports `GENCODE_BIOTYPE_ALIASES = {"gene_type": "gene_biotype", "transcript_type": "transcript_biotype"}` so downstream tools can normalize a GENCODE GTF onto Ensembl column names in one line. If both alias and canonical are present, the canonical wins and a warning is logged. Applied before `infer_biotype_column` so the inference path sees the renamed columns.
* `cast_version_columns` (default `True`): the four well-known Ensembl `*_version` attributes (`gene_version`, `transcript_version`, `protein_version`, `exon_version`) come back as pandas nullable `Int64` instead of strings. Missing rows produce `pd.NA`. Pass `cast_version_columns=False` to keep the prior string behavior. Exports `INTEGER_VERSION_COLUMNS` constant for downstream use.

Motivated by openvax/pyensembl#335, where a user wiring a GENCODE GTF into pyensembl got `NoncodingTranscript(...)` for every variant effect because GENCODE's `gene_type`/`transcript_type` columns landed in unrelated names and pyensembl's biotype check returned False. With `attribute_aliases=GENCODE_BIOTYPE_ALIASES` the rename happens at parse time and downstream tools don't need to special-case GENCODE.

Companion follow-up in pyensembl: openvax/pyensembl#351 (preserve FASTA-header versions in `SequenceData`).

Test plan
* `./lint.sh` clean
* `./test.sh`: 35 passed (15 new), coverage 90% overall, `read_gtf.py` at 87%
* `tests/test_gencode_gtf.py` covers:
  * `attribute_aliases={}` is a no-op
  * `cast_version_columns` default casts to `Int64`, `False` keeps strings
  * missing rows produce `pd.NA`
  * missing `*_version` attributes don't trip the cast
  * interaction with `infer_biotype_column`