feat(mitocarta): new module to fetch the MitoCarta3.0 database (#118) #242
Elarwei001 wants to merge 13 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #242      +/-   ##
==========================================
+ Coverage   56.70%   57.13%   +0.42%
==========================================
  Files          29       30       +1
  Lines        9392     9472      +80
==========================================
+ Hits         5326     5412      +86
+ Misses       4066     4060       -6
…se#118)
Add `gget mitocarta` (Python API and CLI) to fetch the Broad Institute's MitoCarta3.0 inventory of mammalian mitochondrial proteins and pathways.
- species: 'human' (default) or 'mouse'.
- which: 'mitocarta' (default, the mitochondrial gene inventory), 'all_genes' (all genes with Maestro localization scores), or 'pathways' (the MitoPathways hierarchy and their genes).
- Returns a pandas DataFrame (or list of dicts / JSON with json=True).
- Adds xlrd (>=2) as a dependency to read the MitoCarta .xls workbook.
- New module docs, introduction grid entry, citation, tests + fixtures.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#118)
Add the missing nav entry linking docs/src/en/mitocarta.md so the new module appears in the site navigation. (The Spanish SUMMARY entries and es/ pages are auto-generated and are not edited manually.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
0b6c774 to 3435f14
Normalize the raw Excel into an analysis-ready ("tidy") table instead of a raw dump:
- pathways ('C') sheet: drop the stray unlabeled leading column (header reads as an
integer), leaving MitoPathway / MitoPathways Hierarchy / Genes.
- split delimited string columns into Python lists: Synonyms and
MitoCarta3.0_MitoPathways ('|'-separated) and the pathways sheet's Genes (','-separated);
missing values become an empty list.
Add network-free unit tests (TestMitocartaClean) for _clean_df.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
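The normalization described in this commit can be sketched roughly as follows. This is a minimal illustration and not the PR's actual `_clean_df`: the function name `clean_df`, the `LIST_COLUMNS` map, and the sample rows are assumptions made up for the example; only the column names and delimiters come from the commit message.

```python
import pandas as pd

# Hypothetical column -> delimiter map, taken from the commit message above.
LIST_COLUMNS = {
    "Synonyms": "|",
    "MitoCarta3.0_MitoPathways": "|",
    "Genes": ",",
}


def clean_df(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative stand-in for the PR's _clean_df (not the real code)."""
    # Drop columns whose header is not a string, e.g. the stray leading
    # column whose header parses as an integer.
    keep = [col for col in df.columns if isinstance(col, str)]
    out = df[keep].copy()

    def split(value, sep):
        # Missing values (None/NaN) become an empty list.
        if not isinstance(value, str):
            return []
        return [part.strip() for part in value.split(sep) if part.strip()]

    for col, sep in LIST_COLUMNS.items():
        if col in out.columns:
            out[col] = out[col].map(lambda value, sep=sep: split(value, sep))
    return out


# Made-up sample rows that mimic the shape of the pathways sheet.
raw = pd.DataFrame(
    {
        0: [1, 2],
        "MitoPathway": ["OXPHOS", "Empty pathway"],
        "Genes": ["NDUFA1, NDUFA2", None],
    }
)
tidy = clean_df(raw)
```

Because the sample frame is built in memory, a check like this needs no network access, which is the property the `TestMitocartaClean` tests are described as having.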
Wrap the requests.get in a small retry loop (3 attempts, linear backoff) that retries on connection errors/timeouts and 5xx responses, but fails fast on 4xx (e.g. 404). Add a network-free test that mocks a persistent ConnectionError and asserts the download is retried before raising. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
excel_file.sheet_names is typed int | str; narrow to str before .startswith so the baseline-filtered mypy pre-commit hook passes.
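A small sketch of the narrowing this commit describes, assuming a hypothetical helper `sheets_with_prefix` and made-up sheet names (the real workbook's sheet names are not given here):

```python
from typing import List, Sequence, Union


def sheets_with_prefix(
    sheet_names: Sequence[Union[int, str]], prefix: str
) -> List[str]:
    """Return only the string sheet names that start with `prefix`."""
    # The isinstance check narrows int | str to str, so .startswith
    # is valid for a static type checker as well as at runtime.
    return [
        name
        for name in sheet_names
        if isinstance(name, str) and name.startswith(prefix)
    ]


# Made-up sheet names, for illustration only.
matches = sheets_with_prefix([0, "A Human MitoCarta3.0", "C MitoPathways"], "C")
```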
- move xlrd out of the core dependencies into optional-dependencies.mitocarta (matches the gget[cellxgene] pattern; xlrd is only needed to read the .xls)
- install the mitocarta extra in the hatch-test matrix on every tested Python version so the gget mitocarta tests can still read the workbook
- point the missing-xlrd error message and the docs at `pip install gget[mitocarta]`
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
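One common way to surface a missing optional dependency is a guarded import that names the pip extra in its error message. The sketch below assumes a hypothetical helper `require_optional`; the wording of the message is illustrative and is not the PR's actual error text.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or raise an error naming the pip extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Illustrative message; the real message in the PR may differ.
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install gget[{extra}]"
        ) from exc


# A stdlib module always imports; a missing module produces the hint.
json_module = require_optional("json", "mitocarta")
try:
    require_optional("surely_not_installed_module_xyz", "mitocarta")
except ImportError as err:
    hint = str(err)
```

Deferring the import to the point of use keeps the core package importable when the extra is not installed.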
Add docs/src/es/mitocarta.md and register it in SUMMARY.md's Español section; mirror the mitocarta additions into the es introduction grid, updates, and cite pages so the Spanish docs stay in sync with English. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Hi @lauraluebbert — this PR adds a gget mitocarta module (resolves #118). Quick notes for review:
Could you review it when it is convenient for you? Thanks.
mitocarta() returns DataFrame | list[dict]; without overloads the CLI handler's 'mitocarta_results: pd.DataFrame = mitocarta(...)' fails mypy. Mirror the bgee pattern: json=True -> list[dict], json=False (default) -> DataFrame.
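The overload pattern mentioned here can be sketched as below. `fetch_table` is a hypothetical stand-in for `mitocarta()` that returns a tiny hard-coded table, and the gene symbols are placeholders; the point is only how `typing.overload` with `Literal` lets a type checker select the return type from the `json` argument.

```python
from typing import Any, Dict, List, Literal, Union, overload

import pandas as pd


@overload
def fetch_table(json: Literal[False] = ...) -> pd.DataFrame: ...


@overload
def fetch_table(json: Literal[True]) -> List[Dict[str, Any]]: ...


def fetch_table(json: bool = False) -> Union[pd.DataFrame, List[Dict[str, Any]]]:
    """Hypothetical stand-in for mitocarta(); not the real function."""
    df = pd.DataFrame({"Symbol": ["GENE1", "GENE2"]})  # placeholder rows
    if json:
        return df.to_dict(orient="records")
    return df


# With the overloads, the default call resolves to DataFrame and
# json=True resolves to a list of dicts.
table: pd.DataFrame = fetch_table()
records: List[Dict[str, Any]] = fetch_table(json=True)
```

Without the overloads, both calls would have the union return type, which is the mypy failure the commit message describes for the CLI handler's annotated assignment.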
94be589 to 8948562
CI keeps failing with "RuntimeError: http://rest.ensembl.org/ returned error status code 5xx", but the tests pass when run locally.
…embl.org outage across Ensembl-backed modules)

Resolves #118
Summary
Adds a new `gget mitocarta` module that fetches the MitoCarta3.0 inventory of mammalian mitochondrial proteins and pathways (Broad Institute) and returns it as an analysis-ready ("tidy") pandas DataFrame or JSON, in both the Python API and on the command line.

MitoCarta3.0 is a manually curated inventory of ~1,100 human genes with strong evidence of mitochondrial localization, annotated with sub-mitochondrial localization, MitoPathways, and integrated (Maestro) localization scores. It is published only as static Excel files (no API), so this module downloads the `.xls`, selects the requested table, and normalizes it.

What it does
- `species`: `human` (default) or `mouse` (`homo_sapiens` / `mus_musculus` also accepted).
- `which`: which table to return:
  - `mitocarta` (default): the curated inventory of mitochondrial genes;
  - `all_genes`: all ~19k genes with Maestro localization scores (same columns as `mitocarta`);
  - `pathways`: the MitoPathways hierarchy and the genes in each pathway.
- Returns a pandas DataFrame (or `list[dict]` with `json=True`); `save=True` writes a CSV/JSON to the working directory.

Tidy normalization
The raw Excel is cleaned into an analysis-ready table rather than returned as-is:
- the pathways (`C`) sheet's stray unlabeled leading column (its header parses as an integer) is dropped, leaving `MitoPathway` / `MitoPathways Hierarchy` / `Genes`;
- delimited string columns are split into Python lists: `Synonyms` and `MitoCarta3.0_MitoPathways` (`|`-separated), and the pathways sheet's `Genes` (`,`-separated); missing values become `[]`.

Column names and row counts are unchanged. (This changes the return shape of those columns from `str` to `list`; flagging for review.)

Download robustness
The download retries on transient network errors (3 attempts, linear backoff) — connection errors/timeouts and 5xx responses — and fails fast on 4xx (e.g. 404).
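The retry policy described above might look roughly like this. It is a sketch under stated assumptions, not the PR's code: the helper name `get_with_retry`, its parameters, and the default timeout are hypothetical, and only the policy (3 attempts, linear backoff, retry on connection errors/timeouts and 5xx, fail fast on 4xx) comes from the description.

```python
import time

import requests


def get_with_retry(
    url: str, attempts: int = 3, backoff: float = 1.0, timeout: float = 30.0
) -> requests.Response:
    """Hypothetical helper: GET `url`, retrying transient failures only."""
    last_error: Exception = RuntimeError("no attempts made")
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=timeout)
        except (requests.ConnectionError, requests.Timeout) as exc:
            last_error = exc  # transient network problem: retry
        else:
            if response.status_code < 500:
                # 4xx raises here and propagates immediately (fail fast);
                # anything else below 500 is returned to the caller.
                response.raise_for_status()
                return response
            last_error = requests.HTTPError(
                f"{response.status_code} server error for {url}",
                response=response,
            )
        if attempt < attempts:
            time.sleep(backoff * attempt)  # linear backoff: 1x, 2x, ...
    raise last_error
```

Calling `requests.get` and `time.sleep` through their modules keeps the helper easy to mock, so a test can simulate a persistent `ConnectionError` and count the attempts without touching the network.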
Testing
- Network-free unit tests for cleaning (`TestMitocartaClean`) and retry (`TestMitocartaRetry`, mocked).
- … `.xls` (`test_mitocarta_human`, `test_mitocarta_pathways`). The `len == 1136` assertion is intentional: the MitoCarta3.0 tables have not changed since 2023, so the count is stable.
- `ruff` clean.

Dependencies
Adds runtime dependency `xlrd>=2` (to read the legacy `.xls`). Can be made an optional extra (`gget[mitocarta]`) if maintainers prefer.

Scope
Wraps only MitoCarta's `.xls` data tables (curated inventory + Maestro scores + pathways). It does not wrap the other MitoCarta3.0 distributions (`.gmx`, UCSC `.bed`, RefSeq `.fasta`, GFP images); those serve narrower, tool-specific uses and could be added later on request.