feat(pineapple): add gget pineapple module (#161) #245
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## dev #245 +/- ##
==========================================
+ Coverage 56.70% 56.92% +0.21%
==========================================
Files 29 30 +1
Lines 9392 9487 +95
==========================================
+ Hits 5326 5400 +74
- Misses 4066 4087 +21
af30558 to f615173
…data (scverse#161)

New module `gget pineapple` lists and downloads the curated bio-imaging datasets (segmentation/benchmark) and pre-trained model weights distributed by Pineapple (https://github.com/tomouellette/pineapple, suggested by Prof. Anne Carpenter). The catalog (names, authors, sizes, licenses, Google Drive IDs) is mirrored from pineapple's pineapple-data crate; resources are downloaded directly from their Google Drive host (with the large-file virus-scan-warning bypass), so no Rust binary is required. Exposed via the Python API and the command line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
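The virus-scan-warning bypass mentioned above can be sketched roughly as follows. This is a minimal illustration, not the module's code: it uses the standard-library `html.parser` rather than BeautifulSoup to stay dependency-free, and the sample page, the form's action URL, and the hidden field names are assumptions about what a warning page looks like. The idea is to read the confirmation form's hidden inputs and rebuild the URL the form would submit to.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class _DownloadFormParser(HTMLParser):
    """Collect the action URL and hidden inputs of the first <form> seen."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.params = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attr.get("action")
        elif tag == "input" and self._in_form and attr.get("type") == "hidden":
            name = attr.get("name")
            if name:
                self.params[name] = attr.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def confirmed_download_url(html):
    """Return the URL the warning form would submit to, or None if no form."""
    parser = _DownloadFormParser()
    parser.feed(html)
    if not parser.action:
        return None
    return f"{parser.action}?{urlencode(parser.params)}"


# Hypothetical sample shaped like a warning page; real markup may differ.
SAMPLE_WARNING_PAGE = """
<html><body>
<form id="download-form" action="https://example.invalid/download" method="get">
  <input type="hidden" name="id" value="FILE_ID">
  <input type="hidden" name="export" value="download">
  <input type="hidden" name="confirm" value="t">
  <input type="submit" value="Download anyway">
</form>
</body></html>
"""

url = confirmed_download_url(SAMPLE_WARNING_PAGE)
```

A page without a form (for example a quota notice) yields `None`, which a caller could treat as "not downloadable right now".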
for more information, see https://pre-commit.ci
…resolve

Sweep all 30 catalog resources (segmentation/benchmark/weights) against Google Drive to guarantee the curated data stays downloadable. The new TestPineappleLiveAccess checks headers only (no body download): each file ID must still resolve to a binary attachment whose filename matches the catalog and is not a tiny placeholder; transient Google Drive throttling (HTML quota page) is treated as unavailable, not a failure.

Extract _resolve_gdrive_response from _download_from_gdrive so the live tests exercise the same production resolution path, including the large-file virus-scan-warning confirmation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
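A headers-only check of this kind might look like the sketch below. The function names, the returned labels, and the `min_bytes` threshold are hypothetical, and using `Content-Length` to detect a "tiny placeholder" is an assumption; the commit does not say how the real test measures size. Headers are passed as a plain dict here, whereas an HTTP client would usually supply a case-insensitive mapping.

```python
from email.message import Message


def attachment_filename(headers):
    """Extract the filename from a Content-Disposition header, if present."""
    value = headers.get("Content-Disposition")
    if not value:
        return None
    msg = Message()
    msg["Content-Disposition"] = value
    return msg.get_filename()


def classify_response(headers, expected_filename, min_bytes=1024):
    """Classify response headers as ok / unavailable / mismatch / placeholder."""
    content_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    if content_type == "text/html":
        # An HTML page instead of a binary (e.g. a quota page): unavailable, not failed.
        return "unavailable"
    if attachment_filename(headers) != expected_filename:
        return "mismatch"
    if int(headers.get("Content-Length") or 0) < min_bytes:
        return "placeholder"
    return "ok"


good = {
    "Content-Type": "application/octet-stream",
    "Content-Disposition": 'attachment; filename="arvidsson-2022.tar.gz"',
    "Content-Length": "5000000",  # illustrative size, not the real file size
}
throttled = {"Content-Type": "text/html; charset=utf-8"}

good_status = classify_response(good, "arvidsson-2022.tar.gz")
throttled_status = classify_response(throttled, "arvidsson-2022.tar.gz")
```

Keeping "unavailable" distinct from a failure lets a test skip, rather than fail, when Google Drive is rate-limiting.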
f615173 to a36a9a0
for more information, see https://pre-commit.ci
… note

Run the live Google Drive accessibility check over a 4-resource representative sample (one per category, both filename conventions, both the direct and virus-scan-warning resolution paths) instead of all 30, to keep CI fast (~17s vs ~96s) and resilient to Google Drive rate-limiting.

Drop two circular catalog-value assertions in test_list_segmentation (google_drive_id, license echoed straight back from the dict under test); keep the derived-filename assertion, which exercises real logic.

Document that gget pineapple wraps only Pineapple's data download, not its image-processing features.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Annotate the public pineapple() overloads with category: Literal["segmentation", "benchmark", "weights"] so editors and type checkers flag typos at call sites. The implementation and internal helpers keep `str` (the value is lowercased before use and validated at runtime), consistent with gget's string-based, CLI-friendly API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
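The pattern described here, `Literal`-typed overloads in front of a `str`-typed implementation, can be illustrated with the sketch below. The function name `pineapple_sketch`, its `name` parameter, and the placeholder return values are hypothetical; the real overload signatures are not shown in this PR text.

```python
from typing import Literal, Optional, get_args, overload

Category = Literal["segmentation", "benchmark", "weights"]


@overload
def pineapple_sketch(category: Category) -> list: ...
@overload
def pineapple_sketch(category: Category, name: str) -> str: ...


def pineapple_sketch(category: str, name: Optional[str] = None):
    """Implementation keeps `str`: lowercase first, then validate at runtime."""
    category = category.lower()
    if category not in get_args(Category):
        raise ValueError(f"Unknown category: {category!r}")
    if name is None:
        return [f"<all {category} resources>"]  # placeholder for a listing
    return f"<download {name} from {category}>"  # placeholder for a download


listing = pineapple_sketch("weights")
```

A type checker would flag `pineapple_sketch("weigths")` at the call site, while the runtime check still protects untyped callers such as the command line.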
BeautifulSoup's find()/find_all() return Tag | NavigableString, and Tag.get() returns str | list[str]; narrow with isinstance(Tag) and handle the multi-valued-attr case so the function type-checks cleanly. This was the only new error tripping the baseline-filtered mypy pre-commit hook. Behavior is unchanged (the matched element is always a Tag at runtime).

Also drop the "suggested by ..." attribution from the docs intro.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Hi @lauraluebbert — I think this PR is ready for review. Quick summary:
Could you take a look when you have a chance? Thanks
Summary
Adds a new `gget pineapple` module to list and download curated bio-imaging data from Pineapple.

A point worth being explicit about: Pineapple is a Rust command-line tool for image-based cell profiling; it does not expose a REST/JSON API. What it does provide (via its `pineapple download` command) is a curated, standardized catalog of bio-imaging datasets and pre-trained model weights, hosted on Google Drive (each resource maps to a hardcoded Google Drive file ID in the `pineapple-data` crate). This module mirrors that catalog and downloads the resources directly from the same Google Drive host, so users get the data without installing the Rust binary.

What it does
- Lists resources by category: `segmentation` (12 datasets), `benchmark` (13 datasets), or `weights` (5 pre-trained model weights). Returns names, authors, sizes (GB), licenses, filenames, and Google Drive IDs.
- Exposed via the Python API (`gget.pineapple(...)`) and the command line (`gget pineapple ...`).

The catalog includes each dataset's original author and license so users can check usage terms (some datasets are non-commercial only).
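To make the shape of a listing concrete, here is a minimal sketch under stated assumptions. The catalog entry is entirely made up (name, author, size, license, and ID are placeholders), `list_category` is a hypothetical helper, and the `<name>.tar.gz` rule is inferred from the single filename quoted in this PR; weights may follow a different convention.

```python
# Hypothetical single-entry catalog; every value below is a placeholder.
CATALOG = {
    "segmentation": {
        "example-2024": {
            "author": "Example et al.",
            "size_gb": 1.5,
            "license": "CC BY-NC 4.0",
            "google_drive_id": "PLACEHOLDER_FILE_ID",
        },
    },
    "benchmark": {},
    "weights": {},
}


def list_category(category):
    """Return one row per resource, adding a filename derived from the name."""
    key = category.lower()
    if key not in CATALOG:
        raise ValueError(f"Unknown category: {category!r}")
    rows = []
    for name, meta in sorted(CATALOG[key].items()):
        # Assumed convention for datasets: "<name>.tar.gz".
        rows.append({"name": name, "filename": f"{name}.tar.gz", **meta})
    return rows


rows = list_category("Segmentation")
```

Because the license travels with each row, a caller can filter out non-commercial datasets before downloading anything.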
Testing
- Tests (`tests/test_pineapple.py`, mocked downloads) covering catalog contents/columns, filename conventions, Google Drive virus-scan-warning form parsing, and download routing. No real downloads happen in tests.
- […] (`application/octet-stream`, `Content-Disposition: attachment; filename="arvidsson-2022.tar.gz"`), matching the module's filename convention.
- Documentation (`docs/src/en/pineapple.md` + SUMMARY entry) and an `updates.md` bullet under Version ≥ 0.30.8.

Because Pineapple hardcodes its dataset Google Drive file IDs in its own source (the upstream author notes this isn't ideal and may move them out of the library later), this module's catalog is a point-in-time mirror of those IDs. If upstream Pineapple adds or changes datasets, gget's catalog would need a corresponding refresh. I went with mirroring because there is no programmatic API to query. I'm flagging this wrapper approach for your input in case you'd prefer a different strategy (e.g. fetching the registry from a maintained upstream manifest if/when one exists).
Resolves #161
Scope: download only
This module intentionally wraps only Pineapple's data-distribution layer (its `pineapple download` subcommand): it lets gget users list the curated bio-imaging catalog (segmentation/benchmark datasets and pre-trained model weights) and download individual resources directly from Google Drive, without installing the Pineapple Rust binary.

It deliberately does not reimplement Pineapple's image-processing functionality (`process`, `profile`, `neural`, `measure`), such as extracting morphological descriptors or running self-supervised embeddings. Those are the core of the Pineapple tool itself and are out of scope for gget, whose role here is data access, not image analysis. Users who want to process images should use Pineapple directly. This matches the original request (#161): a module to download bio-imaging data.