
feat(enrichr): retrieve gene sets incl. MSigDB collections (#139) #241

Open
Elarwei001 wants to merge 6 commits into scverse:dev from Elarwei001:feature/enrichr-msigdb-139

Conversation

@Elarwei001 commented Jun 24, 2026

Contributor

Resolves #139

What this adds

gget enrichr can now fetch the gene sets behind an Enrichr library — so users can pull MSigDB collections (Hallmark, Oncogenic, Computational, …) directly, not just run enrichment.

  • enrichr_library(name) — member genes of any library (e.g. MSigDB_Hallmark_2020) as a long-format gene_set/gene data frame. gene_set= returns a single set; descriptions=True adds each set's description. CLI: --get_library/-gl, --gene_set, --descriptions/-desc.
  • enrichr_libraries(species="human", filter=None) — list the available libraries (name, term count, gene coverage) to discover what can be fetched. CLI: --list_libraries/-ll.

Existing enrichment behaviour is unchanged.
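As a rough illustration of the long-format output described above, here is a minimal sketch that turns GMT-like library text into `gene_set`/`gene` rows. It assumes each line is tab-separated as term, description, then member genes; the function name `parse_library_text` and the sample set names are hypothetical and are not taken from this PR's code.

```python
def parse_library_text(text, gene_set=None, descriptions=False):
    """Parse GMT-like text into long-format rows (one row per set/gene pair).

    Assumed layout per line: term <TAB> description <TAB> gene1 <TAB> gene2 ...
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        term = fields[0].strip()
        description = fields[1].strip() if len(fields) > 1 else ""
        # Drop empty fields, e.g. those produced by a trailing tab.
        genes = [g.strip() for g in fields[2:] if g.strip()]
        if gene_set is not None and term != gene_set:
            continue  # keep only the requested set
        for gene in genes:
            row = {"gene_set": term, "gene": gene}
            if descriptions:
                row["description"] = description
            rows.append(row)
    if gene_set is not None and not rows:
        raise ValueError(f"gene set {gene_set!r} not found")
    return rows


# Made-up sample text: two sets, a blank line, and a trailing tab.
SAMPLE = "SET_A\tfirst set\tTP53\tMYC\t\n\nSET_B\t\tEGFR\n"
```

Calling `parse_library_text(SAMPLE, gene_set="SET_B")` keeps only the rows of that one set, which mirrors the `gene_set=` option described above.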

Testing

Extended tests/test_enrichr.py with offline tests (mocked library parsing, descriptions, and listing) plus live tests that skip on transient upstream errors. Docs updated (en + es).
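The "skip on transient upstream errors" pattern mentioned above could look roughly like the sketch below. The helper `fetch_library_live` is a hypothetical stand-in that always simulates an outage, and the choice of exception types is an assumption; the PR's actual tests may differ.

```python
import io
import unittest


def fetch_library_live(name):
    """Hypothetical stand-in for a live call; it always simulates an outage."""
    raise ConnectionError("simulated network outage")


class TestLibraryLive(unittest.TestCase):
    def test_fetch(self):
        try:
            rows = fetch_library_live("MSigDB_Hallmark_2020")
        except (ConnectionError, TimeoutError) as exc:
            # A transient upstream problem is reported as a skip, not a failure.
            self.skipTest(f"transient upstream error: {exc}")
        self.assertTrue(rows)


def run_suite():
    """Run the test case quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestLibraryLive)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
```

Errors that indicate a real problem (for example a bad library name) are deliberately not caught here, so they would still fail the test.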

@codecov-commenter

codecov-commenter commented Jun 24, 2026


Codecov Report

❌ Patch coverage is 96.15385% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.26%. Comparing base (e43a804) to head (72aa2ed).
⚠️ Report is 2 commits behind head on dev.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| gget/gget_enrichr.py | 95.89% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #241      +/-   ##
==========================================
+ Coverage   56.70%   57.26%   +0.55%     
==========================================
  Files          29       29              
  Lines        9392     9469      +77     
==========================================
+ Hits         5326     5422      +96     
+ Misses       4066     4047      -19     


@Elarwei001 Elarwei001 marked this pull request as draft June 25, 2026 03:44
@lauraluebbert lauraluebbert deleted the branch scverse:dev June 28, 2026 20:31
@lauraluebbert lauraluebbert reopened this Jun 28, 2026
Elarwei001 and others added 2 commits June 28, 2026 19:51
Add `gget.enrichr_library()` (CLI: `gget enrichr --get_library`) to fetch the
gene sets (members) of any Enrichr gene-set library — the recommended way to
retrieve MSigDB gene sets (e.g. MSigDB_Hallmark_2020) without MSigDB login.

- Returns a long-format DataFrame (gene_set, gene), or a {gene_set: [genes]}
  dict with json=True. `gene_set=` returns a single set; `species` selects the
  non-human Enrichr variants.
- CLI: new --get_library/-gl and --gene_set/-gs; genes/--database made optional
  in library mode (still enforced for enrichment). Backward compatible.
- Detects Enrichr's HTML-404 (HTTP 200) response for unknown libraries.
- Tests + fixtures (live Enrichr) and docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a network-free TestEnrichrLibraryOffline class that mocks requests
to cover enrichr_library: invalid species, verbose logging, blank-line
parsing, bad/empty library errors, gene_set filter + not-found, and the
json/json+save/CSV-save branches. All PR-added lines now covered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
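
The first commit message above notes that an unknown library comes back as an HTML page with HTTP 200. A minimal sketch of how such a response might be detected is shown below; the function names and the exact heuristics (checking the start of the body for an HTML doctype or tag) are assumptions, not the PR's implementation.

```python
def looks_like_html(body):
    """Heuristic: does the response body start like an HTML document?"""
    head = body.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def check_library_body(name, status_code, body):
    """Validate a library response and return its body unchanged if it looks like data."""
    if status_code != 200:
        raise RuntimeError(f"request for {name!r} failed with HTTP {status_code}")
    if looks_like_html(body):
        # HTTP 200 with an HTML page is treated as "library not found".
        raise ValueError(f"library {name!r} not found")
    if not body.strip():
        raise ValueError(f"library {name!r} returned no data")
    return body
```

Checking the body rather than the status code is the point of the sketch: a 200 response alone does not prove that gene-set data was returned.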
@lauraluebbert lauraluebbert force-pushed the feature/enrichr-msigdb-139 branch from 69cc3ea to 32105d7 Compare June 28, 2026 23:55
Elarwei001 and others added 4 commits July 2, 2026 21:36
…ests

Implements the follow-up improvements flagged in review:
- enrichr_libraries(): list available Enrichr gene-set libraries (via the
  datasetStatistics endpoint), with an optional substring filter (e.g. "MSigDB");
  CLI `gget enrichr --list_libraries [FILTER]`.
- enrichr_library(descriptions=True): also return each gene set's description
  (adds a 'description' column / nested dict); CLI `--descriptions`.
- Harden the de-facto-live library tests: skip (not fail) on network errors or a
  transient non-data Enrichr response; add network-free unit tests for
  enrichr_libraries and the descriptions option.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
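
The optional substring filter described in this commit could be sketched as follows. The payload shape (a `statistics` list with `libraryName`, `numTerms`, and `geneCoverage` keys) is an assumption about the datasetStatistics response, the case-insensitive match is a design choice of this sketch, and the numbers in the sample are placeholders rather than real statistics.

```python
def filter_libraries(payload, name_filter=None):
    """Return library summaries, optionally keeping only names containing name_filter."""
    rows = []
    for entry in payload.get("statistics", []):
        name = entry.get("libraryName", "")
        # Case-insensitive substring match (an assumption of this sketch).
        if name_filter and name_filter.lower() not in name.lower():
            continue
        rows.append(
            {
                "library": name,
                "num_terms": entry.get("numTerms"),
                "gene_coverage": entry.get("geneCoverage"),
            }
        )
    return sorted(rows, key=lambda row: row["library"])


# Placeholder payload; the counts are invented for illustration only.
SAMPLE_PAYLOAD = {
    "statistics": [
        {"libraryName": "MSigDB_Hallmark_2020", "numTerms": 1, "geneCoverage": 2},
        {"libraryName": "Example_Library", "numTerms": 3, "geneCoverage": 4},
    ]
}
```

With this payload, `filter_libraries(SAMPLE_PAYLOAD, "MSigDB")` keeps only the first entry, while passing no filter returns both entries sorted by name.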
Also add the get_library/gene_set library docs + example to the Spanish page
(the original library feature had only been documented in English).

Add MSigDB references (Subramanian 2005, Liberzon 2011/2015, and
Castanza 2023 for mouse MSigDB) to the References section of the en/es
enrichr docs, following gget's per-source citation convention now that
enrichr can retrieve MSigDB collections.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
enrichr_libraries rejected species="mouse" because it looked the value up
directly in DATASETSTATISTICS_ENRICHR_URLS, which has no "mouse" key.
enrichr_library and enrichr already treat mouse as the human/Enrichr
variant, so align enrichr_libraries: accept mouse and query the human
endpoint. Add an offline test asserting the human URL is used.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
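
The fix described in this commit amounts to resolving an alias before the dictionary lookup. A minimal sketch is below; the dictionary contents and URLs are placeholders (not the real endpoints or the real contents of DATASETSTATISTICS_ENRICHR_URLS), and `SPECIES_ALIASES` and `resolve_statistics_url` are hypothetical names.

```python
# Placeholder mapping; the real URLs and species keys are not shown in this thread.
DATASETSTATISTICS_URLS = {
    "human": "https://example.org/human/datasetStatistics",
    "fly": "https://example.org/fly/datasetStatistics",
}

# "mouse" has no endpoint of its own and is served by the human variant.
SPECIES_ALIASES = {"mouse": "human"}


def resolve_statistics_url(species):
    """Map a species name to its statistics URL, resolving aliases first."""
    key = SPECIES_ALIASES.get(species, species)
    try:
        return DATASETSTATISTICS_URLS[key]
    except KeyError:
        valid = sorted(set(DATASETSTATISTICS_URLS) | set(SPECIES_ALIASES))
        raise ValueError(f"species must be one of {valid}, got {species!r}") from None
```

Resolving the alias in one place keeps the accepted species consistent across functions, which is the alignment this commit is after.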
@Elarwei001 Elarwei001 marked this pull request as ready for review July 2, 2026 15:12
@Elarwei001

Contributor Author

Hi @lauraluebbert — quick summary of #241 (resolves #139):

  1. Fetch a library's gene sets. enrichr_library(name) returns the member genes of any Enrichr library (e.g. MSigDB_Hallmark_2020) as a long-format gene_set/gene data frame; gene_set= narrows to a single set and descriptions=True adds each set's description. CLI: --get_library/-gl, --gene_set, --descriptions/-desc.

  2. Discover libraries. enrichr_libraries(filter="MSigDB") lists the available libraries (name, term count, gene coverage) so users can find the exact name to pass. CLI: --list_libraries/-ll. species="mouse" maps to the human/Enrichr variant, matching enrichr and enrichr_library.

  3. Enrichment CLI unchanged. --get_library/--list_libraries don't take genes/database, so I moved those required-checks from argparse into the handler (the new modes return before reaching them). Running enrichment without genes or -db still raises the same error.

  4. Docs + citations. en/es docs updated, and MSigDB references (Subramanian 2005; Liberzon 2011/2015; Castanza 2023 for mouse) added to the References section, per the per-source citation convention.
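
Point 3 above (moving the required-checks from argparse into the handler) can be sketched roughly as follows. The parser below is a simplified stand-in, not gget's actual CLI definition: the program name, the handler's return values, and the error wording are assumptions made for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="enrichr-sketch")
    # Neither genes nor the database is marked required at the argparse level.
    parser.add_argument("genes", nargs="*")
    parser.add_argument("-db", "--database", default=None)
    parser.add_argument("-gl", "--get_library", default=None)
    # An optional FILTER value may follow the flag; const marks "flag given without value".
    parser.add_argument("-ll", "--list_libraries", nargs="?", const="", default=None)
    return parser


def handle(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # The new modes return before the enrichment-only checks are reached.
    if args.list_libraries is not None:
        return ("list", args.list_libraries or None)
    if args.get_library is not None:
        return ("library", args.get_library)
    # Enrichment mode: enforce what argparse used to enforce.
    if not args.genes:
        parser.error("the following arguments are required: genes")
    if args.database is None:
        parser.error("the following arguments are required: -db/--database")
    return ("enrichment", args.genes, args.database)
```

`parser.error` prints the usage message and exits with status 2, so a missing argument in enrichment mode still ends the run the way an argparse-level requirement would.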

btw, tests are network-free where possible; the live ones call skipTest (rather than fail) on transient Enrichr/Ensembl errors, while still asserting real errors such as bad library names.

Happy to change anything — would appreciate a review when you have a moment. Thanks!



3 participants