WIP: DataRepo abstraction for dataset identity / physical location separation by ReadyPlayerEmma · Pull Request #275 · earthspecies/esp-data

ReadyPlayerEmma · 2026-04-24T16:05:47Z

Status: Draft — looking for design feedback before formalizing.

Targets gagan/fsspec-read-public-r2-bucket (#267) so the diff stays clean. Our resolver generates HTTPS URLs that depend on #267's PureHTTPSPath / fsspec HTTP integration. When #267 merges, GitHub should auto-retarget this to main.

What this adds

An optional new shape for DatasetInfo that separates what a dataset is (identity: name, version, schema, relative paths of files) from where it's physically hosted (base URL of one or more repos it's mirrored to). Same pattern as Python packaging (PyPI + mirrors), HuggingFace Hub, Docker/OCI, Maven/Go/npm — a dataset has an identity + a list of repositories it's available in; a resolver picks the highest-priority accessible repo at read time.

Concretely:

New module esp_data/io/datarepo.py — DataRepo dataclass, in-memory registry (register_repo / unregister_repo / get_repo / list_repos), join_url helper, resolve(repos_ids, folder, relpath) function, strict-subset path validator, pluggable access_checker hook.
DatasetInfo gains three optional fields: repos: list[str], folder: str, splits: dict. split_paths is now also optional. A @model_validator(mode=\"after\") enforces exactly one of the two shapes (legacy split_paths, or modern repos+folder+splits) and auto-derives split_paths from the modern fields when used.
esp_data/io/__init__.py re-exports the new public API.
41 new tests in tests/io/test_datarepo.py.

Why I think this is worth the weight

Today DatasetInfo.split_paths hardcodes one physical URL per split. OSS users without GCP credentials can't access ESP's datasets even when they're publicly mirrored. Adding a new mirror requires touching 37 Dataset subclasses. #267 lets a user manually swap in an HTTPS URL, but doesn't address the coupling between dataset identity and physical location.

With a Repo abstraction, the same catalog serves ESP-internal (via gs://), OSS users (via the R2 public mirror's https://), and future mirrors (AWS Open Data, Zenodo) with a single config change per dataset — no code edit per subclass. The resolver picks based on priority + accessibility.

Backward compat is real, not theoretical: the 37 existing Dataset subclasses read info.split_paths[split] and info.split_paths.keys() — both continue to work unchanged whether the info was constructed legacy or modern. I verified this end-to-end by instantiating BarkleyCanyon (zero code changes to the subclass) backed by a modern-shape DatasetInfo pointing at a local-kind DataRepo, and loading 9,770 real rows successfully.

Existing tests (test_dataset.py, test_concat.py, test_chained.py) — 33 pass, 2 skipped, 0 failed, same as on the base branch.

Open questions I'd love your take on

Is "dataset identity vs. physical location" a separation you've considered before? The unmerged tar-exporter branch and the schema-* branches are adjacent in shape, and I don't want to reinvent something you've already thought through. The schema-descriptions work in particular composes nicely with this — schemas describe what's in a dataset, DataRepos describe where to get it.
Does the auto-derive split_paths from modern fields feel right? The @model_validator + object.__setattr__ pattern means zero subclass changes, but it means at-init resolution — if the repo registry changes after a DatasetInfo is built, stale. A future version could use @computed_field for lazy resolution; I kept it simple for the prototype.
Is path_encoding worth implementing fully now, or defer until xeno-canto needs it? The field is declared on DataRepo but not yet used in join_url. For strict-only deployments (all our data today except xeno-canto) it's a no-op. My lean: defer.
Would you want this stacked as-is on your branch, or should I wait for Add https endpoint support #267 to land and rebase off main? I can see arguments either way. Stacking keeps the diff focused; rebasing gives you a cleaner independent review.

Design notes (things you might want to know)

Strict subset [A-Za-z0-9._~\\-/] chosen because most existing ESP paths already fit it. I empirically surveyed 1,873 path values across 28 dataset modules — fn, filename, filepath, audio_fp, local_path, originals_path are all within this subset today. Only xeno-canto has spaces / non-ASCII in filenames, and that dataset has pre-existing mixed-encoding inconsistencies that "legacy_raw"/"legacy_percent" modes are placeholders for.
Built-in default repo: esp-internal-gcs with base_url=\"gs://esp-ml-datasets/\", priority 10. Registered at module import. Users override with register_repo(...).
Relative-path validation runs at resolve() time, not at DataRepo construction time. This is a deliberate choice: the constraint depends on the target repo's path_encoding, and the same dataset can be served from repos with different encodings.

What this explicitly doesn't touch

The 37 existing Dataset subclasses. They keep using split_paths via the legacy branch.
PureHTTPSPath from Add https endpoint support #267. Coexists; not used by this code directly.
DatasetConfig / ChainedDatasetConfig — I didn't see any parallel changes needed but haven't exhaustively read those paths. Worth flagging if you spot something I missed.

Future follow-ups (not in this PR)

legacy_raw / legacy_percent path encoding conversion in join_url. Needed before xeno-canto can use the system.
Lazy @computed_field resolution if repo-registry changes need to affect in-flight Datasets.
Credential/auth system beyond the access_checker boolean.
Porting existing Dataset subclasses to modern shape — additive migration, can happen per-dataset when someone has bandwidth.
Registering the public R2 repo (waiting on the mirror URL).

I haven't pitched any of this to Milad or anyone else yet — wanted to get your read first since this is downstream of your #267 work and adjacent to your schema branches.

Introduces an optional new shape for DatasetInfo that separates what a dataset is from where it's physically hosted. Same pattern as Python packaging (PyPI + mirrors), HuggingFace Hub, Docker/OCI, Maven/Go/npm: dataset has identity + list of repositories; resolver picks the highest-priority accessible repo at read time. Changes: - New esp_data/io/datarepo.py: DataRepo dataclass, in-memory registry, join_url helper, resolve function, strict-subset path validator, access_checker hook. - esp_data/dataset.py: DatasetInfo gains optional repos / folder / splits fields. split_paths is now optional. model_validator enforces legacy-XOR-modern shape and auto-derives split_paths from modern fields. - esp_data/io/__init__.py: re-exports the new public API. - tests/io/test_datarepo.py: 41 new tests covering DataRepo basics, join_url dispatch, registry, resolve, access-checker gate, strict validator (13 cases), and DatasetInfo backward compat. Zero changes to the 37 existing Dataset subclasses. Existing tests (test_dataset, test_concat, test_chained) pass unchanged.

ReadyPlayerEmma · 2026-04-25T15:54:07Z

I have some additional thoughts on this after I've been studying more of the prior work in this space. I should have something more interesting and complete to share for discussion soon.

ReadyPlayerEmma · 2026-05-08T15:53:27Z

Closing this and merging the ideas into future work.

ReadyPlayerEmma mentioned this pull request Apr 24, 2026

Add https endpoint support #267

Open

ReadyPlayerEmma closed this May 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: DataRepo abstraction for dataset identity / physical location separation#275

WIP: DataRepo abstraction for dataset identity / physical location separation#275
ReadyPlayerEmma wants to merge 1 commit into
gagan/fsspec-read-public-r2-bucketfrom
emma/datarepos-prototype

ReadyPlayerEmma commented Apr 24, 2026

Uh oh!

ReadyPlayerEmma commented Apr 25, 2026

Uh oh!

ReadyPlayerEmma commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ReadyPlayerEmma commented Apr 24, 2026

What this adds

Why I think this is worth the weight

Open questions I'd love your take on

Design notes (things you might want to know)

What this explicitly doesn't touch

Future follow-ups (not in this PR)

Uh oh!

ReadyPlayerEmma commented Apr 25, 2026

Uh oh!

ReadyPlayerEmma commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant