feat(pineapple): add gget pineapple module (#161) #245

Open
Elarwei001 wants to merge 10 commits into scverse:dev from Elarwei001:feature/pineapple-161

@Elarwei001 Elarwei001 commented Jun 25, 2026

Summary

Adds a new gget pineapple module to list and download curated bio-imaging data from Pineapple.

A point worth being explicit about: Pineapple is a Rust command-line tool for image-based cell profiling — it does not expose a REST/JSON API. What it does provide (via its pineapple download command) is a curated, standardized catalog of bio-imaging datasets and pre-trained model weights, hosted on Google Drive (each resource maps to a hardcoded Google Drive file ID in the pineapple-data crate). This module mirrors that catalog and downloads the resources directly from the same Google Drive host, so users get the data without installing the Rust binary.

What it does

  • List the catalog by category: segmentation (12 datasets), benchmark (13 datasets), or weights (5 pre-trained model weights). Returns names, authors, sizes (GB), licenses, filenames, and Google Drive IDs.
  • Download a resource by name directly from Google Drive (handles the large-file virus-scan-warning confirmation flow).
  • Available in both the Python API (gget.pineapple(...)) and the command line (gget pineapple ...).

```shell
# List the segmentation datasets
gget pineapple --category segmentation

# Download one by name
gget pineapple vicar_2021 --download --out_dir ./pineapple_data

# List pre-trained model weights
gget pineapple --category weights
```

```python
import gget

# List the segmentation datasets
gget.pineapple(category="segmentation")

# Download one by name
gget.pineapple("vicar_2021", download=True, out_dir="./pineapple_data")
```

The catalog includes each dataset's original author and license so users can check usage terms (some datasets are non-commercial only).
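To illustrate how a mirrored catalog carrying author and license fields could be listed by category and screened for usage terms, here is a minimal sketch. The `CATALOG` entries, the field names, and the `list_resources` and `is_non_commercial` helpers are hypothetical placeholders, not the module's actual data or API; only the three category names come from this PR.

```python
from typing import Dict, List

# Hypothetical placeholder entries: the real names, IDs, sizes, and licenses differ.
CATALOG: Dict[str, List[dict]] = {
    "segmentation": [
        {"name": "example_seg_a", "author": "Author A", "size_gb": 1.2,
         "license": "CC BY 4.0", "google_drive_id": "PLACEHOLDER_ID_1"},
        {"name": "example_seg_b", "author": "Author B", "size_gb": 0.4,
         "license": "CC BY-NC 4.0", "google_drive_id": "PLACEHOLDER_ID_2"},
    ],
    "benchmark": [],
    "weights": [],
}


def list_resources(category: str) -> List[dict]:
    """Return copies of the catalog entries for one category (case-insensitive)."""
    key = category.lower()
    if key not in CATALOG:
        raise ValueError(f"Unknown category {category!r}; expected one of {sorted(CATALOG)}")
    return [dict(entry, category=key) for entry in CATALOG[key]]


def is_non_commercial(entry: dict) -> bool:
    """Crude check that flags Creative Commons licenses carrying the NC clause."""
    return "-NC" in entry["license"].upper()


flagged = [e["name"] for e in list_resources("Segmentation") if is_non_commercial(e)]
print(flagged)  # ['example_seg_b']
```

A string match on the license name is only a first-pass filter; the license text itself remains the authority on what is permitted.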

Testing

  • 10 network-free unit tests (tests/test_pineapple.py, mocked downloads) covering catalog contents/columns, filename conventions, Google Drive virus-scan-warning form parsing, and download routing. No real downloads happen in tests.
  • ruff: clean on all changed files.
  • Live-verified: a real Google Drive file ID from the catalog returns HTTP 200 (application/octet-stream, Content-Disposition: attachment; filename="arvidsson-2022.tar.gz"), matching the module's filename convention.
  • Docs added (docs/src/en/pineapple.md + SUMMARY entry) and an updates.md bullet under Version ≥ 0.30.8.
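
The form-parsing step exercised by the unit tests can be sketched without any network access. This is an illustration under stated assumptions, not the module's code: one of the commit messages mentions BeautifulSoup, whereas this sketch uses the standard-library `html.parser` to stay dependency-free, and `SAMPLE` is a hand-written approximation of a warning page pointing at a placeholder host. `WarningFormParser` and `confirmed_url` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Dict, Optional
from urllib.parse import urlencode


class WarningFormParser(HTMLParser):
    """Collect the action URL and hidden inputs of the first <form> on the page."""

    def __init__(self) -> None:
        super().__init__()
        self.action: Optional[str] = None
        self.fields: Dict[str, str] = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attr.get("action")
        elif tag == "input" and self._in_form and attr.get("type") == "hidden":
            name = attr.get("name")
            if name:
                self.fields[name] = attr.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def confirmed_url(html: str) -> Optional[str]:
    """Build the follow-up download URL, or return None if no usable form exists."""
    parser = WarningFormParser()
    parser.feed(html)
    if not parser.action or not parser.fields:
        return None
    return f"{parser.action}?{urlencode(parser.fields)}"


# Hand-written stand-in for a confirmation page; host and ID are placeholders.
SAMPLE = """
<form id="download-form" action="https://example.invalid/download" method="get">
  <input type="hidden" name="id" value="PLACEHOLDER_ID">
  <input type="hidden" name="export" value="download">
  <input type="hidden" name="confirm" value="t">
  <input type="submit" value="Download anyway">
</form>
"""
print(confirmed_url(SAMPLE))
```

Returning `None` when no form is found lets the caller distinguish a direct download from a page that needs confirmation, which is the kind of routing a mocked test can assert on.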

⚠️ Note for maintainers

Because Pineapple hardcodes its dataset Google Drive file IDs in its own source (the upstream author notes this isn't ideal and may move them out of the library later), this module's catalog is a point-in-time mirror of those IDs. If upstream Pineapple adds/changes datasets, gget's catalog would need a corresponding refresh. I went with mirroring because there is no programmatic API to query — flagging this wrapper approach for your input in case you'd prefer a different strategy (e.g. fetching the registry from a maintained upstream manifest if/when one exists).
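
If an upstream manifest ever becomes available, a refresh check could be as small as a dictionary comparison. The sketch below assumes both sides can be reduced to a name-to-file-ID mapping; `catalog_drift` and the sample mappings are hypothetical, and no such manifest exists today according to the note above.

```python
from typing import Dict, List


def catalog_drift(mirrored: Dict[str, str], upstream: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare name -> file-ID mappings and report what changed upstream."""
    shared = set(mirrored) & set(upstream)
    return {
        "added": sorted(set(upstream) - set(mirrored)),
        "removed": sorted(set(mirrored) - set(upstream)),
        "repointed": sorted(n for n in shared if mirrored[n] != upstream[n]),
    }


# Placeholder mappings for illustration only.
mirrored = {"dataset_a": "ID_1", "dataset_b": "ID_2"}
upstream = {"dataset_a": "ID_1", "dataset_b": "ID_2_NEW", "dataset_c": "ID_3"}
print(catalog_drift(mirrored, upstream))
# {'added': ['dataset_c'], 'removed': [], 'repointed': ['dataset_b']}
```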

Resolves #161

Scope — download only

This module intentionally wraps only Pineapple's data-distribution layer (its pineapple download subcommand): it lets gget users list the curated bio-imaging catalog (segmentation/benchmark datasets and pre-trained model weights) and download individual resources directly from Google Drive, without installing the Pineapple Rust binary.

It deliberately does not reimplement Pineapple's image-processing functionality (process, profile, neural, measure), such as extracting morphological descriptors or running self-supervised embeddings. Those are the core of the Pineapple tool itself and are out of scope for gget, whose role here is data access, not image analysis. Users who want to process images should use Pineapple directly. This matches the original request (#161): a module to download bio-imaging data.

codecov-commenter commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 78.94737% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.92%. Comparing base (e43a804) to head (ccc6383).
⚠️ Report is 2 commits behind head on dev.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| gget/gget_pineapple.py | 78.49% | 20 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #245      +/-   ##
==========================================
+ Coverage   56.70%   56.92%   +0.21%     
==========================================
  Files          29       30       +1     
  Lines        9392     9487      +95     
==========================================
+ Hits         5326     5400      +74     
- Misses       4066     4087      +21     

@Elarwei001 Elarwei001 marked this pull request as draft June 25, 2026 03:44
@lauraluebbert lauraluebbert deleted the branch scverse:dev June 28, 2026 20:31
@lauraluebbert lauraluebbert reopened this Jun 28, 2026
@lauraluebbert lauraluebbert force-pushed the feature/pineapple-161 branch from af30558 to f615173 Compare June 28, 2026 23:23
Elarwei001 and others added 3 commits June 30, 2026 21:09
…data (scverse#161)

New module `gget pineapple` lists and downloads the curated bio-imaging
datasets (segmentation/benchmark) and pre-trained model weights
distributed by Pineapple (https://github.com/tomouellette/pineapple,
suggested by Prof. Anne Carpenter). The catalog (names, authors, sizes,
licenses, Google Drive IDs) is mirrored from pineapple's pineapple-data
crate; resources are downloaded directly from their Google Drive host
(with the large-file virus-scan-warning bypass), so no Rust binary is
required. Exposed via the Python API and the command line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…resolve

Sweep all 30 catalog resources (segmentation/benchmark/weights) against
Google Drive to guarantee the curated data stays downloadable. The new
TestPineappleLiveAccess checks headers only (no body download): each file
ID must still resolve to a binary attachment whose filename matches the
catalog and is not a tiny placeholder; transient Google Drive throttling
(HTML quota page) is treated as unavailable, not a failure.

Extract _resolve_gdrive_response from _download_from_gdrive so the live
tests exercise the same production resolution path, including the
large-file virus-scan-warning confirmation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Elarwei001 Elarwei001 force-pushed the feature/pineapple-161 branch from f615173 to a36a9a0 Compare June 30, 2026 14:03
pre-commit-ci Bot and others added 6 commits June 30, 2026 14:04
… note

Run the live Google Drive accessibility check over a 4-resource
representative sample (one per category, both filename conventions, both
the direct and virus-scan-warning resolution paths) instead of all 30, to
keep CI fast (~17s vs ~96s) and resilient to Google Drive rate-limiting.

Drop two circular catalog-value assertions in test_list_segmentation
(google_drive_id, license echoed straight back from the dict under test);
keep the derived-filename assertion, which exercises real logic.

Document that gget pineapple wraps only Pineapple's data download, not its
image-processing features.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Annotate the public pineapple() overloads with
category: Literal["segmentation", "benchmark", "weights"] so editors and
type checkers flag typos at call sites. The implementation and internal
helpers keep `str` (the value is lowercased before use and validated at
runtime), consistent with gget's string-based, CLI-friendly API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
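
The typing pattern described in the commit above can be sketched as follows. The function name `pineapple_sketch`, its parameters, and its return values are hypothetical stand-ins rather than the real `gget.pineapple` signature; the point is only that the overloads advertise a `Literal` category to type checkers while the implementation accepts `str`, lowercases it, and validates at runtime.

```python
from typing import List, Literal, Optional, overload

Category = Literal["segmentation", "benchmark", "weights"]
_VALID = ("segmentation", "benchmark", "weights")


@overload
def pineapple_sketch(name: None = None, *, category: Category = ...,
                     download: Literal[False] = False) -> List[str]: ...
@overload
def pineapple_sketch(name: str, *, category: Category = ...,
                     download: bool = ...) -> str: ...


def pineapple_sketch(name: Optional[str] = None, *, category: str = "segmentation",
                     download: bool = False):
    """Implementation keeps `str` so CLI-supplied values such as "Weights" still work."""
    key = category.lower()
    if key not in _VALID:
        raise ValueError(f"category must be one of {_VALID}, got {category!r}")
    if name is None:
        return [f"{key}:placeholder"]  # stands in for a catalog listing
    action = "download" if download else "describe"
    return f"{action}:{key}:{name}"


print(pineapple_sketch(category="weights"))
print(pineapple_sketch("vicar_2021", download=True))
```

A type checker would flag `category="segmentaton"` at the call site, while the runtime check still catches values that arrive as plain strings from the command line.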
BeautifulSoup's find()/find_all() return Tag | NavigableString, and
Tag.get() returns str | list[str]; narrow with isinstance(Tag) and handle
the multi-valued-attr case so the function type-checks cleanly. This was
the only new error tripping the baseline-filtered mypy pre-commit hook.
Behavior is unchanged (the matched element is always a Tag at runtime).

Also drop the "suggested by ..." attribution from the docs intro.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Elarwei001 Elarwei001 marked this pull request as ready for review June 30, 2026 15:18
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Elarwei001

Hi @lauraluebbert — I think this PR is ready for review. Quick summary:

  1. Scope: download only. It wraps only Pineapple's data download (pineapple download): listing the curated catalog and downloading individual datasets/weights directly from Google Drive, with no Rust binary required. It deliberately does not reimplement Pineapple's image-processing features (process, profile, neural, measure); those stay with the Pineapple tool itself. This matches the original request in #161, "Pineapple module to download bio-imaging data".

  2. Tests. Two complementary layers: network-free unit tests covering the catalog/parsing/routing logic, plus a live data test (TestPineappleLiveAccess) that does a header-only check (no multi-GB body downloads) confirming each resource still resolves to a binary download from Google Drive with the catalog's expected filename — so CI fails loudly if an ID is repointed/removed upstream. To keep CI fast and resilient, the live test covers a representative sample (one per category + both filename conventions + both resolution paths), not all 30 IDs; if Google Drive temporarily rate-limits the request, that resource is skipped rather than failing the build.

  3. Docs. Added English (docs/src/en/pineapple.md) and Spanish (docs/src/es/pineapple.md) pages, both registered in SUMMARY.md, plus an updates.md entry.
    Please note: the Spanish page is an AI-generated translation, so it would be good to have a Spanish speaker review it before merging.
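
For reference, the header-only decision described in item 2 can be sketched as a pure function over a status code and response headers. This is an approximation under assumptions, not the code in TestPineappleLiveAccess: `classify_response`, the `MIN_BYTES` cutoff, and the treatment of 404 as a hard failure are hypothetical, and the function is meant to be applied to the final response after any confirmation step. The sample headers reuse the values reported in the PR description.

```python
import re
from typing import Mapping

MIN_BYTES = 1024  # hypothetical cutoff for a "tiny placeholder" response


def classify_response(status: int, headers: Mapping[str, str], expected_filename: str) -> str:
    """Classify a resolved response as "ok", "skip", or "fail" from headers alone."""
    if status == 404:
        return "fail"  # assumed: a missing file should break the build
    content_type = headers.get("Content-Type", "").lower()
    if content_type.startswith("text/html"):
        return "skip"  # assumed: an HTML page here is a transient quota notice
    disposition = headers.get("Content-Disposition", "")
    match = re.search(r'filename="([^"]+)"', disposition)
    if not disposition.lower().startswith("attachment") or match is None:
        return "fail"
    if match.group(1) != expected_filename:
        return "fail"
    length = headers.get("Content-Length")
    if length is not None and length.isdigit() and int(length) < MIN_BYTES:
        return "fail"
    return "ok"


# A plain dict is case-sensitive; real HTTP clients usually expose case-insensitive headers.
good = {
    "Content-Type": "application/octet-stream",
    "Content-Disposition": 'attachment; filename="arvidsson-2022.tar.gz"',
}
print(classify_response(200, good, "arvidsson-2022.tar.gz"))  # ok
print(classify_response(200, {"Content-Type": "text/html; charset=utf-8"}, "arvidsson-2022.tar.gz"))  # skip
print(classify_response(200, good, "other-name.tar.gz"))  # fail
```

In a pytest suite, "skip" would map to `pytest.skip(...)` and "fail" to an assertion error, which keeps throttling from turning the build red while still surfacing repointed or removed IDs.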

Could you take a look when you have a chance? Thanks!
