feat(pineapple): add gget pineapple module (#161) by Elarwei001 · Pull Request #245 · scverse/gget

Elarwei001 · 2026-06-25T00:23:12Z

Summary

Adds a new gget pineapple module to list and download curated bio-imaging data from Pineapple.

To be explicit: Pineapple is a Rust command-line tool for image-based cell profiling and does not expose a REST/JSON API. What it does provide, via its pineapple download command, is a curated, standardized catalog of bio-imaging datasets and pre-trained model weights hosted on Google Drive (each resource maps to a hardcoded Google Drive file ID in the pineapple-data crate). This module mirrors that catalog and downloads the resources directly from the same Google Drive host, so users get the data without installing the Rust binary.

What it does

List the catalog by category: segmentation (12 datasets), benchmark (13 datasets), or weights (5 pre-trained model weights). Returns names, authors, sizes (GB), licenses, filenames, and Google Drive IDs.
Download a resource by name directly from Google Drive (handles the large-file virus-scan-warning confirmation flow).
Available in both the Python API (gget.pineapple(...)) and the command line (gget pineapple ...).
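
For reviewers unfamiliar with the virus-scan-warning flow mentioned above, here is a rough sketch of the idea using only the standard library. It is not the module's actual code: the helper names (`parse_confirm_form`, `build_confirm_url`) are hypothetical, and the sample HTML only approximates the shape of Google Drive's warning page (a form whose hidden inputs carry the confirmation parameters), which Google may change at any time.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class _FormParser(HTMLParser):
    """Collect the action URL and hidden inputs of the first <form> on a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form:
            # Only hidden inputs carry the confirmation parameters.
            if attrs.get("type") == "hidden" and attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def parse_confirm_form(html):
    """Return (action, hidden_fields), or None if the page has no form."""
    parser = _FormParser()
    parser.feed(html)
    if parser.action is None:
        return None
    return parser.action, parser.fields


def build_confirm_url(html):
    """Build the follow-up download URL from the warning page, if any."""
    parsed = parse_confirm_form(html)
    if parsed is None:
        return None
    action, fields = parsed
    return f"{action}?{urlencode(fields)}"


# Assumed, simplified stand-in for the warning page (placeholder file ID).
SAMPLE_WARNING_PAGE = """
<html><body>
<form id="download-form" action="https://drive.usercontent.google.com/download" method="get">
  <input type="hidden" name="id" value="FILE_ID_PLACEHOLDER">
  <input type="hidden" name="export" value="download">
  <input type="hidden" name="confirm" value="t">
  <input type="submit" value="Download anyway">
</form>
</body></html>
"""

print(build_confirm_url(SAMPLE_WARNING_PAGE))
```

A page without a form (for example, a direct binary response or a quota page) yields `None`, which a caller could use to decide whether a second request is needed.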

# List the segmentation datasets
gget pineapple --category segmentation

# Download one by name
gget pineapple vicar_2021 --download --out_dir ./pineapple_data

# List pre-trained model weights
gget pineapple --category weights

import gget
gget.pineapple(category="segmentation")
gget.pineapple("vicar_2021", download=True, out_dir="./pineapple_data")

The catalog includes each dataset's original author and license so users can check usage terms (some datasets are non-commercial only).
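
As an illustration of why the license column is useful, the sketch below filters a small in-memory catalog by category and by a crude non-commercial check. Everything in it is an assumption for illustration: the `Resource` fields mirror the columns listed above, but the entries are placeholders (not real catalog rows), and `list_resources` and `is_non_commercial` are hypothetical helpers, not part of gget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    category: str  # "segmentation", "benchmark", or "weights"
    author: str
    size_gb: float
    license: str
    google_drive_id: str


# Placeholder rows only; the real catalog lives in the module under review.
CATALOG = [
    Resource("example_seg_a", "segmentation", "Author A", 1.5, "CC BY 4.0", "DRIVE_ID_A"),
    Resource("example_seg_b", "segmentation", "Author B", 0.3, "CC BY-NC 4.0", "DRIVE_ID_B"),
    Resource("example_weights", "weights", "Author C", 0.1, "MIT", "DRIVE_ID_C"),
]


def is_non_commercial(license_text):
    """Crude heuristic: look for an NC / non-commercial token in the license string."""
    tokens = license_text.upper().replace("-", " ").split()
    return (
        "NC" in tokens
        or "NONCOMMERCIAL" in tokens
        or ("NON" in tokens and "COMMERCIAL" in tokens)
    )


def list_resources(category, commercial_ok=False):
    """Return catalog rows for one category, optionally dropping non-commercial ones."""
    rows = [r for r in CATALOG if r.category == category.lower()]
    if commercial_ok:
        rows = [r for r in rows if not is_non_commercial(r.license)]
    return rows


for row in list_resources("segmentation", commercial_ok=True):
    print(row.name, row.license)
```

The heuristic is no substitute for reading the actual license text; it only shows how the catalog's license column could drive a first-pass filter.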

Testing

10 network-free unit tests (tests/test_pineapple.py, mocked downloads) covering catalog contents/columns, filename conventions, Google Drive virus-scan-warning form parsing, and download routing. No real downloads happen in tests.
ruff: clean on all changed files.
Live-verified: a real Google Drive file ID from the catalog returns HTTP 200 (application/octet-stream, Content-Disposition: attachment; filename="arvidsson-2022.tar.gz"), matching the module's filename convention.
Docs added (docs/src/en/pineapple.md + SUMMARY entry) and an updates.md bullet under Version ≥ 0.30.8.
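
To make the "mocked downloads" point concrete, here is a minimal sketch of how a network-free routing test can look. It assumes a hypothetical `download_resource` helper that receives its fetch function as an argument; the real module's function names and structure may differ, and the catalog entry is a placeholder.

```python
import os
import tempfile
from unittest import mock


def download_resource(name, out_dir, catalog, fetch):
    """Look up `name` in `catalog` and write the fetched bytes into `out_dir`.

    `fetch` is any callable that takes a Google Drive file ID and returns bytes,
    so tests can pass a mock instead of touching the network.
    """
    if name not in catalog:
        raise ValueError(f"Unknown resource: {name!r}")
    entry = catalog[name]
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, entry["filename"])
    with open(path, "wb") as fh:
        fh.write(fetch(entry["google_drive_id"]))
    return path


def test_download_routes_to_catalog_id():
    # Placeholder catalog entry; not a real Drive ID.
    catalog = {"example": {"filename": "example.tar.gz", "google_drive_id": "DRIVE_ID_X"}}
    fake_fetch = mock.Mock(return_value=b"payload")
    with tempfile.TemporaryDirectory() as tmp:
        path = download_resource("example", tmp, catalog, fake_fetch)
        # The mock proves the right ID was requested, without any network I/O.
        fake_fetch.assert_called_once_with("DRIVE_ID_X")
        assert os.path.basename(path) == "example.tar.gz"
        with open(path, "rb") as fh:
            assert fh.read() == b"payload"


test_download_routes_to_catalog_id()
```

Passing the fetcher in as a parameter is one of several ways to keep tests offline; patching the HTTP client with `mock.patch` would serve the same purpose.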

⚠️ Note for maintainers

Because Pineapple hardcodes its dataset Google Drive file IDs in its own source (the upstream author notes this isn't ideal and may move them out of the library later), this module's catalog is a point-in-time mirror of those IDs. If upstream Pineapple adds/changes datasets, gget's catalog would need a corresponding refresh. I went with mirroring because there is no programmatic API to query — flagging this wrapper approach for your input in case you'd prefer a different strategy (e.g. fetching the registry from a maintained upstream manifest if/when one exists).

Resolves #161

Scope — download only

This module intentionally wraps only Pineapple's data-distribution layer (its pineapple download subcommand): it lets gget users list the curated bio-imaging catalog (segmentation/benchmark datasets and pre-trained model weights) and download individual resources directly from Google Drive, without installing the Pineapple Rust binary.

It deliberately does not reimplement Pineapple's image-processing functionality (process, profile, neural, measure) — extracting morphological descriptors, running self-supervised embeddings, etc. Those are the core of the Pineapple tool itself and are out of scope for gget, whose role here is data access, not image analysis. Users who want to process images should use Pineapple directly. This matches the original request (#161): a module to download bio-imaging data.

codecov-commenter · 2026-06-25T00:37:55Z

Codecov Report

❌ Patch coverage is 78.94737% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.92%. Comparing base (e43a804) to head (ccc6383).
⚠️ Report is 2 commits behind head on dev.

Files with missing lines	Patch %	Lines
gget/gget_pineapple.py	78.49%	20 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #245      +/-   ##
==========================================
+ Coverage   56.70%   56.92%   +0.21%     
==========================================
  Files          29       30       +1     
  Lines        9392     9487      +95     
==========================================
+ Hits         5326     5400      +74     
- Misses       4066     4087      +21
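
The percentages in this report can be reproduced with simple arithmetic. The sketch below assumes the patch spans the 95 added lines shown in the diff, of which 20 are reported as missing; the project figures use the hits and line totals from the table. Results are compared with a tolerance because the report does not state its rounding rule.

```python
def coverage_pct(hits, lines):
    """Coverage as a percentage of executable lines that were hit."""
    return 100.0 * hits / lines


patch = coverage_pct(95 - 20, 95)   # 20 of 95 patch lines missing (assumed patch size)
base = coverage_pct(5326, 9392)     # base (e43a804)
head = coverage_pct(5400, 9487)     # head (ccc6383)

print(f"patch: {patch:.5f}%")
print(f"base:  {base:.4f}%")
print(f"head:  {head:.4f}%")
print(f"delta: {head - base:+.4f}%")
```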


…data (scverse#161) New module `gget pineapple` lists and downloads the curated bio-imaging datasets (segmentation/benchmark) and pre-trained model weights distributed by Pineapple (https://github.com/tomouellette/pineapple, suggested by Prof. Anne Carpenter). The catalog (names, authors, sizes, licenses, Google Drive IDs) is mirrored from pineapple's pineapple-data crate; resources are downloaded directly from their Google Drive host (with the large-file virus-scan-warning bypass), so no Rust binary is required. Exposed via the Python API and the command line. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

for more information, see https://pre-commit.ci

…resolve Sweep all 30 catalog resources (segmentation/benchmark/weights) against Google Drive to guarantee the curated data stays downloadable. The new TestPineappleLiveAccess checks headers only (no body download): each file ID must still resolve to a binary attachment whose filename matches the catalog and is not a tiny placeholder; transient Google Drive throttling (HTML quota page) is treated as unavailable, not a failure. Extract _resolve_gdrive_response from _download_from_gdrive so the live tests exercise the same production resolution path, including the large-file virus-scan-warning confirmation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
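
The header-only check described in this commit could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in TestPineappleLiveAccess: `classify_response` and `attachment_filename` are hypothetical names, the 1024-byte threshold for a "tiny placeholder" is invented, and the headers are assumed to be those of the final response after any virus-scan-warning confirmation has been resolved.

```python
from email.message import Message


def attachment_filename(content_disposition):
    """Extract the filename parameter from a Content-Disposition header value."""
    if not content_disposition:
        return None
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    return msg.get_filename()


def classify_response(headers, expected_filename, min_bytes=1024):
    """Return "ok", "unavailable" (skip), or "fail" for one resolved response.

    `headers` is a plain dict with canonical header names as keys.
    """
    content_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    if content_type == "text/html":
        # Assumed to be a transient quota/throttling page: skip, do not fail.
        return "unavailable"
    if attachment_filename(headers.get("Content-Disposition")) != expected_filename:
        return "fail"
    size = headers.get("Content-Length")
    if size is not None and int(size) < min_bytes:
        return "fail"  # suspiciously small: likely a placeholder, not the dataset
    return "ok"


# Illustrative headers; the Content-Length value is made up.
good = {
    "Content-Type": "application/octet-stream",
    "Content-Disposition": 'attachment; filename="arvidsson-2022.tar.gz"',
    "Content-Length": "123456789",
}
print(classify_response(good, "arvidsson-2022.tar.gz"))
```

In a pytest suite, the "unavailable" outcome would typically map to `pytest.skip(...)` and "fail" to an assertion error.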

for more information, see https://pre-commit.ci

… note Run the live Google Drive accessibility check over a 4-resource representative sample (one per category, both filename conventions, both the direct and virus-scan-warning resolution paths) instead of all 30, to keep CI fast (~17s vs ~96s) and resilient to Google Drive rate-limiting. Drop two circular catalog-value assertions in test_list_segmentation (google_drive_id, license echoed straight back from the dict under test); keep the derived-filename assertion, which exercises real logic. Document that gget pineapple wraps only Pineapple's data download, not its image-processing features. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
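
One way to pick such a representative sample is a greedy cover over the attributes that matter (category, filename convention, resolution path). The sketch below is only an illustration of that idea: the commit does not say how its sample was chosen, and the resource rows and attribute values here are placeholders.

```python
def representative_sample(resources, attributes):
    """Greedily pick resources until every observed attribute value is covered."""
    uncovered = {(a, r[a]) for r in resources for a in attributes}
    chosen = []
    while uncovered:
        # Take the resource that covers the most still-uncovered values
        # (ties resolve to the earliest resource in the list).
        best = max(
            resources,
            key=lambda r: len({(a, r[a]) for a in attributes} & uncovered),
        )
        chosen.append(best["name"])
        uncovered -= {(a, best[a]) for a in attributes}
    return chosen


# Placeholder rows; "filename_style" and "path" values are invented labels.
resources = [
    {"name": "seg_1", "category": "segmentation", "filename_style": "a", "path": "direct"},
    {"name": "seg_2", "category": "segmentation", "filename_style": "a", "path": "confirm"},
    {"name": "bench_1", "category": "benchmark", "filename_style": "a", "path": "confirm"},
    {"name": "weights_1", "category": "weights", "filename_style": "b", "path": "direct"},
]

sample = representative_sample(resources, ["category", "filename_style", "path"])
print(sample)
```

Greedy covering does not guarantee the smallest possible sample, but it is simple and deterministic, which is usually what a CI fixture needs.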

Annotate the public pineapple() overloads with category: Literal["segmentation", "benchmark", "weights"] so editors and type checkers flag typos at call sites. The implementation and internal helpers keep `str` (the value is lowercased before use and validated at runtime), consistent with gget's string-based, CLI-friendly API. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
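
The pattern this commit describes (Literal-typed overloads in front of a `str`-typed implementation that validates at runtime) can be sketched as follows. The function name `pineapple_sketch`, its parameters, and its return values are hypothetical stand-ins; the real signature of `gget.pineapple` is not reproduced here.

```python
from typing import Literal, Optional, overload

Category = Literal["segmentation", "benchmark", "weights"]
_VALID_CATEGORIES = ("segmentation", "benchmark", "weights")


@overload
def pineapple_sketch(name: None = None, *, category: Category = ...) -> list[str]: ...
@overload
def pineapple_sketch(name: str, *, category: Category = ...) -> dict[str, str]: ...


def pineapple_sketch(name: Optional[str] = None, *, category: str = "segmentation"):
    """Implementation keeps `str` so CLI input can be normalized and validated at runtime."""
    category = category.lower()
    if category not in _VALID_CATEGORIES:
        raise ValueError(
            f"category must be one of {_VALID_CATEGORIES}, got {category!r}"
        )
    if name is None:
        return [f"<{category} catalog>"]  # stand-in for a catalog listing
    return {"name": name, "category": category}  # stand-in for a download result


print(pineapple_sketch(category="weights"))
```

With this layout a type checker flags `category="weigths"` at the call site, while untyped callers such as the CLI still get a clear `ValueError` at runtime.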

BeautifulSoup's find()/find_all() return Tag | NavigableString, and Tag.get() returns str | list[str]; narrow with isinstance(Tag) and handle the multi-valued-attr case so the function type-checks cleanly. This was the only new error tripping the baseline-filtered mypy pre-commit hook. Behavior is unchanged (the matched element is always a Tag at runtime). Also drop the "suggested by ..." attribution from the docs intro. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Elarwei001 · 2026-06-30T15:26:00Z

Hi @lauraluebbert — I think this PR is ready for review. Quick summary:

Scope: download only. It wraps only Pineapple's data download (pineapple download): listing the curated catalog and downloading individual datasets/weights directly from Google Drive, with no Rust binary required. It deliberately does not reimplement Pineapple's image-processing features (process, profile, neural, measure); those stay with the Pineapple tool itself. This matches the original request in #161 ("Pineapple module to download bio-imaging data").
Tests. Two complementary layers: network-free unit tests covering the catalog/parsing/routing logic, plus a live data test (TestPineappleLiveAccess) that does a header-only check (no multi-GB body downloads) to confirm that each sampled resource still resolves to a binary download from Google Drive with the catalog's expected filename, so CI fails loudly if a sampled ID is repointed or removed upstream. To keep CI fast and resilient, the live test covers a representative sample (one per category, both filename conventions, both resolution paths) rather than all 30 IDs; if Google Drive temporarily rate-limits a request, that resource is skipped rather than failing the build.
Docs. Added English (docs/src/en/pineapple.md) and Spanish (docs/src/es/pineapple.md) pages, both registered in SUMMARY.md, plus an updates.md entry.
Please note: the Spanish page is an AI-generated translation; it would be good to have a Spanish speaker review it before merging.

Could you take a look when you have a chance? Thanks

Elarwei001 marked this pull request as draft June 25, 2026 03:44

lauraluebbert deleted the branch scverse:dev June 28, 2026 20:31

lauraluebbert closed this Jun 28, 2026

lauraluebbert reopened this Jun 28, 2026

lauraluebbert force-pushed the feature/pineapple-161 branch from af30558 to f615173 Compare June 28, 2026 23:23

Elarwei001 and others added 3 commits June 30, 2026 21:09

[pre-commit.ci] auto fixes from pre-commit.com hooks

5d9db77

for more information, see https://pre-commit.ci

Elarwei001 force-pushed the feature/pineapple-161 branch from f615173 to a36a9a0 Compare June 30, 2026 14:03

pre-commit-ci Bot and others added 6 commits June 30, 2026 14:04

[pre-commit.ci] auto fixes from pre-commit.com hooks

29d085c

for more information, see https://pre-commit.ci

docs(pineapple): cite upstream Pineapple before gget in References

8363577

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(pineapple): credit upstream author Tom Ouellette in updates entry

d62a20b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Elarwei001 marked this pull request as ready for review June 30, 2026 15:18

docs(pineapple): add Spanish (es) translation and register in SUMMARY

ccc6383

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Elarwei001 wants to merge 10 commits into scverse:dev from Elarwei001:feature/pineapple-161

3 participants
