feat(mitocarta): new module to fetch the MitoCarta3.0 database (#118) #242

Open
Elarwei001 wants to merge 13 commits into scverse:dev from Elarwei001:feature/mitocarta-118

Elarwei001 commented Jun 24, 2026

Contributor

Resolves #118

Summary

Adds a new gget mitocarta module that fetches the MitoCarta3.0 inventory of mammalian mitochondrial proteins and pathways (Broad Institute) and returns it as an analysis-ready ("tidy") pandas DataFrame or JSON — in both the Python API and on the command line.

MitoCarta3.0 is a manually curated inventory of ~1,100 human genes with strong evidence of mitochondrial localization, annotated with sub-mitochondrial localization, MitoPathways, and integrated (Maestro) localization scores. It is published only as static Excel files (no API), so this module downloads the .xls, selects the requested table, and normalizes it.

What it does

  • species: human (default) or mouse (homo_sapiens / mus_musculus also accepted).
  • which: which table to return —
    • mitocarta (default) — the curated inventory of mitochondrial genes;
    • all_genes — all ~19k genes with Maestro localization scores (same columns as mitocarta);
    • pathways — the MitoPathways hierarchy and the genes in each pathway.
  • Returns a tidy DataFrame (or list[dict] with json=True); save=True writes a CSV/JSON to the working directory.
Command line:

    gget mitocarta --species human --which mitocarta
    gget mitocarta --which pathways

Python:

    import gget
    gget.mitocarta("human", which="mitocarta")
    gget.mitocarta("human", which="pathways")

Tidy normalization

The raw Excel is cleaned into an analysis-ready table rather than returned as-is:

  • the pathways (C) sheet's stray unlabeled leading column (its header parses as an integer) is dropped, leaving MitoPathway / MitoPathways Hierarchy / Genes;
  • delimited string columns are split into Python lists — Synonyms and MitoCarta3.0_MitoPathways (|-separated), and the pathways sheet's Genes (,-separated); missing values become [].

Column names and row counts are unchanged. (This changes the return shape of those columns from str to list — flagging for review.)
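
For reviewers weighing the str-to-list change, here is a minimal sketch of the two cleaning steps described above. It is not the module's actual `_clean_df`; the helper names `split_delimited` and `tidy` and the sample rows are hypothetical, and only the column names come from this PR.

```python
import numbers

import pandas as pd


def split_delimited(value, sep):
    """Split a delimited string into a list; missing or blank values become []."""
    if not isinstance(value, str) or not value.strip():
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]


def tidy(df, list_columns):
    """Drop integer-named columns, then split the given columns into lists.

    list_columns maps a column name to its separator (hypothetical interface).
    """
    # The stray leading column has no label, so its header parses as an integer.
    keep = [c for c in df.columns if not isinstance(c, numbers.Integral)]
    out = df[keep].copy()
    for column, sep in list_columns.items():
        if column in out.columns:
            out[column] = out[column].map(lambda v, s=sep: split_delimited(v, s))
    return out


# Made-up rows shaped like the pathways sheet (values are illustrative only).
raw = pd.DataFrame(
    {
        0: [1, 2],
        "MitoPathway": ["Pathway A", "Pathway B"],
        "MitoPathways Hierarchy": ["Pathway A", "Pathway A > Pathway B"],
        "Genes": ["GENE1, GENE2", None],
    }
)
tidy_df = tidy(raw, {"Genes": ","})
```

Row counts and the remaining column names are untouched; only the cell type of the listed columns changes.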

Download robustness

The download retries on transient network errors (3 attempts, linear backoff) — connection errors/timeouts and 5xx responses — and fails fast on 4xx (e.g. 404).
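
The retry policy above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `download_with_retry`, its parameters, and the fake getters are hypothetical, and the HTTP call is injected as `get` (for example `requests.get`) so the sketch runs without a network. It relies on the fact that the `requests` exceptions and the built-in `ConnectionError`/`TimeoutError` are all subclasses of `OSError`.

```python
import time


def download_with_retry(url, get, attempts=3, backoff=1.0, sleep=time.sleep):
    """Fetch url with get(url); retry transient failures, fail fast on 4xx."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            response = get(url)
        except OSError as exc:
            # Connection errors and timeouts are treated as transient.
            last_error = exc
        else:
            status = response.status_code
            if status < 400:
                return response.content
            if status < 500:
                # Client errors such as 404 will not fix themselves: no retry.
                raise RuntimeError(f"{url} returned client error {status}")
            last_error = RuntimeError(f"{url} returned server error {status}")
        if attempt < attempts:
            sleep(backoff * attempt)  # linear backoff: 1x, 2x, ...
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error


class FakeResponse:
    """Stand-in for an HTTP response object (hypothetical)."""

    def __init__(self, status_code, content=b""):
        self.status_code = status_code
        self.content = content


# Demo 1: a persistent connection error is retried three times.
outage_calls = []
outage_sleeps = []


def always_down(url):
    outage_calls.append(url)
    raise ConnectionError("simulated outage")


try:
    download_with_retry("https://example.invalid/data.xls", always_down,
                        sleep=outage_sleeps.append)
    outage_error = None
except RuntimeError as exc:
    outage_error = exc

# Demo 2: a 404 fails on the first attempt.
missing_calls = []


def not_found(url):
    missing_calls.append(url)
    return FakeResponse(404)


try:
    download_with_retry("https://example.invalid/data.xls", not_found,
                        sleep=lambda seconds: None)
    missing_error = None
except RuntimeError as exc:
    missing_error = exc
```

Injecting `sleep` keeps a test of the retry path instantaneous, which is one way to make such a test deterministic.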

Testing

  • Network-free unit tests: tidy cleaning (TestMitocartaClean) and retry (TestMitocartaRetry, mocked).
  • Live tests hit the real Broad .xls (test_mitocarta_human, test_mitocarta_pathways). The len == 1136 assertion is intentional — the MitoCarta3.0 tables have not changed since 2023, so the count is stable.
  • ruff clean.

Dependencies

Adds runtime dependency xlrd>=2 (to read the legacy .xls). Can be made an optional extra (gget[mitocarta]) if maintainers prefer.

Scope

Wraps only MitoCarta's .xls data tables (curated inventory + Maestro scores + pathways). It does not wrap the other MitoCarta3.0 distributions (.gmx, UCSC .bed, RefSeq .fasta, GFP images) — those serve narrower, tool-specific uses and could be added later on request.

codecov-commenter commented Jun 24, 2026

Codecov Report

❌ Patch coverage is 78.75000% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.13%. Comparing base (e43a804) to head (a1bd9da).
⚠️ Report is 2 commits behind head on dev.

Files with missing lines Patch % Lines
gget/gget_mitocarta.py 78.20% 17 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #242      +/-   ##
==========================================
+ Coverage   56.70%   57.13%   +0.42%     
==========================================
  Files          29       30       +1     
  Lines        9392     9472      +80     
==========================================
+ Hits         5326     5412      +86     
+ Misses       4066     4060       -6     

Elarwei001 marked this pull request as draft June 25, 2026 03:44
lauraluebbert deleted the branch scverse:dev June 28, 2026 20:31
lauraluebbert reopened this Jun 28, 2026
Elarwei001 and others added 2 commits June 28, 2026 19:51
…se#118)

Add `gget mitocarta` (Python API and CLI) to fetch the Broad Institute's
MitoCarta3.0 inventory of mammalian mitochondrial proteins and pathways.

- species: 'human' (default) or 'mouse'.
- which: 'mitocarta' (default, the mitochondrial gene inventory), 'all_genes'
  (all genes with Maestro localization scores), or 'pathways' (the MitoPathways
  hierarchy and their genes).
- Returns a pandas DataFrame (or list of dicts / JSON with json=True).
- Adds xlrd (>=2) as a dependency to read the MitoCarta .xls workbook.
- New module docs, introduction grid entry, citation, tests + fixtures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#118)

Add the missing nav entry linking docs/src/en/mitocarta.md so the new
module appears in the site navigation. (The Spanish SUMMARY entries and
es/ pages are auto-generated and are not edited manually.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lauraluebbert force-pushed the feature/mitocarta-118 branch from 0b6c774 to 3435f14 Compare June 28, 2026 23:55
Elarwei001 and others added 6 commits July 1, 2026 21:54
Normalize the raw Excel into an analysis-ready ("tidy") table instead of a raw dump:
- pathways ('C') sheet: drop the stray unlabeled leading column (header reads as an
  integer), leaving MitoPathway / MitoPathways Hierarchy / Genes.
- split delimited string columns into Python lists: Synonyms and
  MitoCarta3.0_MitoPathways ('|'-separated) and the pathways sheet's Genes (','-separated);
  missing values become an empty list.

Add network-free unit tests (TestMitocartaClean) for _clean_df.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wrap the requests.get in a small retry loop (3 attempts, linear backoff) that
retries on connection errors/timeouts and 5xx responses, but fails fast on 4xx
(e.g. 404). Add a network-free test that mocks a persistent ConnectionError and
asserts the download is retried before raising.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
excel_file.sheet_names is typed int | str; narrow to str before .startswith
so the baseline-filtered mypy pre-commit hook passes.
- move xlrd out of the core dependencies into optional-dependencies.mitocarta
  (matches the gget[cellxgene] pattern; xlrd is only needed to read the .xls)
- install the mitocarta extra in the hatch-test matrix on every tested Python
  version so the gget mitocarta tests can still read the workbook
- point the missing-xlrd error message and the docs at `pip install gget[mitocarta]`

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add docs/src/es/mitocarta.md and register it in SUMMARY.md's Español section;
mirror the mitocarta additions into the es introduction grid, updates, and cite
pages so the Spanish docs stay in sync with English.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Elarwei001

Contributor Author

Hi @lauraluebbert — this PR adds a gget mitocarta module (resolves #118). Quick notes for review:

  1. What it does — fetches the MitoCarta3.0 data from the Broad .xls and returns a tidy DataFrame or JSON (Python + CLI). Supported: human and mouse, across three tables — mitocarta (the curated inventory), all_genes (Maestro scores), and pathways (the MitoPathways hierarchy + member genes). Delimited fields (Synonyms, MitoCarta3.0_MitoPathways, pathway Genes) are split into lists, which become nested arrays in JSON.

  2. Not (yet) covered — only MitoCarta's .xls tables are wrapped; the other distributions (.gmx, UCSC .bed, RefSeq .fasta, GFP images) are left out for now and can be added later on request. Tables are returned tidy, not reshaped into matrices (a one-line pivot for anyone who wants that).

  3. How it's tested — network-free unit tests cover the table normalization and the download-retry logic (deterministic, no network); live tests download the real .xls and verify the expected tables and row counts. The download retries on transient network errors so the live tests stay stable.
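
The one-line pivot mentioned in item 2 could look like the sketch below. It assumes the pathways table has the `MitoPathway` and `Genes` columns named in this PR, with `Genes` already split into lists; the rows are made up, and the choice of `explode` plus `pd.crosstab` is just one way to do it.

```python
import pandas as pd

# Made-up stand-in for the pathways table returned with which="pathways".
pathways = pd.DataFrame(
    {
        "MitoPathway": ["Pathway A", "Pathway B"],
        "Genes": [["GENE1", "GENE2"], ["GENE2"]],
    }
)

# One row per (pathway, gene) pair, then a gene x pathway membership matrix.
long = pathways.explode("Genes")
matrix = pd.crosstab(long["Genes"], long["MitoPathway"])
```

Each cell counts how often a gene appears in a pathway, so with unique memberships the matrix is a 0/1 indicator table.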

Could you review it when convenient? Thanks!

Elarwei001 marked this pull request as ready for review July 1, 2026 14:32
mitocarta() returns DataFrame | list[dict]; without overloads the CLI handler's
'mitocarta_results: pd.DataFrame = mitocarta(...)' fails mypy. Mirror the bgee
pattern: json=True -> list[dict], json=False (default) -> DataFrame.
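
A minimal sketch of the overload pattern this commit describes, assuming a simplified signature. `fetch_table` and its stand-in data are hypothetical and are not the real `mitocarta()` function; the point is only how `typing.overload` with `Literal` lets a type checker pick the return type from the `json` argument.

```python
from __future__ import annotations

from typing import Any, Literal, overload

import pandas as pd


@overload
def fetch_table(which: str = ..., json: Literal[False] = ...) -> pd.DataFrame: ...
@overload
def fetch_table(which: str = ..., *, json: Literal[True]) -> list[dict[str, Any]]: ...
def fetch_table(which: str = "mitocarta", json: bool = False):
    """Return a DataFrame by default, or a list of records when json=True."""
    # Stand-in data; a real implementation would load the requested table.
    df = pd.DataFrame({"Symbol": ["GENE1", "GENE2"], "table": [which, which]})
    return df.to_dict(orient="records") if json else df


# A checker such as mypy infers DataFrame here, so this annotation is accepted.
frame: pd.DataFrame = fetch_table("mitocarta")
records = fetch_table("mitocarta", json=True)
```

Without the overloads, the first assignment would be flagged because the declared return type would be the union of both shapes.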
Elarwei001 force-pushed the feature/mitocarta-118 branch from 94be589 to 8948562 Compare July 1, 2026 14:43
Elarwei001 commented Jul 2, 2026

Contributor Author

CI keeps failing with "RuntimeError: http://rest.ensembl.org/ returned error status code 5xx", but the tests pass when run locally.

…embl.org outage across Ensembl-backed modules)

3 participants