Skip to content

WIP: DataRepo abstraction for dataset identity / physical location separation#275

Closed
ReadyPlayerEmma wants to merge 1 commit into
gagan/fsspec-read-public-r2-bucketfrom
emma/datarepos-prototype
Closed

WIP: DataRepo abstraction for dataset identity / physical location separation#275
ReadyPlayerEmma wants to merge 1 commit into
gagan/fsspec-read-public-r2-bucketfrom
emma/datarepos-prototype

Conversation

@ReadyPlayerEmma
Copy link
Copy Markdown
Contributor

Status: Draft — looking for design feedback before formalizing.

Targets gagan/fsspec-read-public-r2-bucket (#267) so the diff stays clean. Our resolver generates HTTPS URLs that depend on #267's PureHTTPSPath / fsspec HTTP integration. When #267 merges, GitHub should auto-retarget this to main.

What this adds

An optional new shape for DatasetInfo that separates what a dataset is (identity: name, version, schema, relative paths of files) from where it's physically hosted (base URL of one or more repos it's mirrored to). Same pattern as Python packaging (PyPI + mirrors), HuggingFace Hub, Docker/OCI, Maven/Go/npm — a dataset has an identity + a list of repositories it's available in; a resolver picks the highest-priority accessible repo at read time.

Concretely:

  • New module esp_data/io/datarepo.pyDataRepo dataclass, in-memory registry (register_repo / unregister_repo / get_repo / list_repos), join_url helper, resolve(repos_ids, folder, relpath) function, strict-subset path validator, pluggable access_checker hook.
  • DatasetInfo gains three optional fields: repos: list[str], folder: str, splits: dict. split_paths is now also optional. A @model_validator(mode=\"after\") enforces exactly one of the two shapes (legacy split_paths, or modern repos+folder+splits) and auto-derives split_paths from the modern fields when used.
  • esp_data/io/__init__.py re-exports the new public API.
  • 41 new tests in tests/io/test_datarepo.py.

Why I think this is worth the weight

Today DatasetInfo.split_paths hardcodes one physical URL per split. OSS users without GCP credentials can't access ESP's datasets even when they're publicly mirrored. Adding a new mirror requires touching 37 Dataset subclasses. #267 lets a user manually swap in an HTTPS URL, but doesn't address the coupling between dataset identity and physical location.

With a Repo abstraction, the same catalog serves ESP-internal (via gs://), OSS users (via the R2 public mirror's https://), and future mirrors (AWS Open Data, Zenodo) with a single config change per dataset — no code edit per subclass. The resolver picks based on priority + accessibility.

Backward compat is real, not theoretical: the 37 existing Dataset subclasses read info.split_paths[split] and info.split_paths.keys() — both continue to work unchanged whether the info was constructed legacy or modern. I verified this end-to-end by instantiating BarkleyCanyon (zero code changes to the subclass) backed by a modern-shape DatasetInfo pointing at a local-kind DataRepo, and loading 9,770 real rows successfully.

Existing tests (test_dataset.py, test_concat.py, test_chained.py) — 33 pass, 2 skipped, 0 failed, same as on the base branch.

Open questions I'd love your take on

  1. Is "dataset identity vs. physical location" a separation you've considered before? The unmerged tar-exporter branch and the schema-* branches are adjacent in shape, and I don't want to reinvent something you've already thought through. The schema-descriptions work in particular composes nicely with this — schemas describe what's in a dataset, DataRepos describe where to get it.

  2. Does the auto-derive split_paths from modern fields feel right? The @model_validator + object.__setattr__ pattern means zero subclass changes, but it means at-init resolution — if the repo registry changes after a DatasetInfo is built, stale. A future version could use @computed_field for lazy resolution; I kept it simple for the prototype.

  3. Is path_encoding worth implementing fully now, or defer until xeno-canto needs it? The field is declared on DataRepo but not yet used in join_url. For strict-only deployments (all our data today except xeno-canto) it's a no-op. My lean: defer.

  4. Would you want this stacked as-is on your branch, or should I wait for Add https endpoint support #267 to land and rebase off main? I can see arguments either way. Stacking keeps the diff focused; rebasing gives you a cleaner independent review.

Design notes (things you might want to know)

  • Strict subset [A-Za-z0-9._~\\-/] chosen because most existing ESP paths already fit it. I empirically surveyed 1,873 path values across 28 dataset modules — fn, filename, filepath, audio_fp, local_path, originals_path are all within this subset today. Only xeno-canto has spaces / non-ASCII in filenames, and that dataset has pre-existing mixed-encoding inconsistencies that "legacy_raw"/"legacy_percent" modes are placeholders for.
  • Built-in default repo: esp-internal-gcs with base_url=\"gs://esp-ml-datasets/\", priority 10. Registered at module import. Users override with register_repo(...).
  • Relative-path validation runs at resolve() time, not at DataRepo construction time. This is a deliberate choice: the constraint depends on the target repo's path_encoding, and the same dataset can be served from repos with different encodings.

What this explicitly doesn't touch

  • The 37 existing Dataset subclasses. They keep using split_paths via the legacy branch.
  • PureHTTPSPath from Add https endpoint support #267. Coexists; not used by this code directly.
  • DatasetConfig / ChainedDatasetConfig — I didn't see any parallel changes needed but haven't exhaustively read those paths. Worth flagging if you spot something I missed.

Future follow-ups (not in this PR)

  • legacy_raw / legacy_percent path encoding conversion in join_url. Needed before xeno-canto can use the system.
  • Lazy @computed_field resolution if repo-registry changes need to affect in-flight Datasets.
  • Credential/auth system beyond the access_checker boolean.
  • Porting existing Dataset subclasses to modern shape — additive migration, can happen per-dataset when someone has bandwidth.
  • Registering the public R2 repo (waiting on the mirror URL).

I haven't pitched any of this to Milad or anyone else yet — wanted to get your read first since this is downstream of your #267 work and adjacent to your schema branches.

Introduces an optional new shape for DatasetInfo that separates what a
dataset is from where it's physically hosted. Same pattern as Python
packaging (PyPI + mirrors), HuggingFace Hub, Docker/OCI, Maven/Go/npm:
dataset has identity + list of repositories; resolver picks the
highest-priority accessible repo at read time.

Changes:
- New esp_data/io/datarepo.py: DataRepo dataclass, in-memory registry,
  join_url helper, resolve function, strict-subset path validator,
  access_checker hook.
- esp_data/dataset.py: DatasetInfo gains optional repos / folder /
  splits fields. split_paths is now optional. model_validator enforces
  legacy-XOR-modern shape and auto-derives split_paths from modern
  fields.
- esp_data/io/__init__.py: re-exports the new public API.
- tests/io/test_datarepo.py: 41 new tests covering DataRepo basics,
  join_url dispatch, registry, resolve, access-checker gate, strict
  validator (13 cases), and DatasetInfo backward compat.

Zero changes to the 37 existing Dataset subclasses. Existing tests
(test_dataset, test_concat, test_chained) pass unchanged.
@ReadyPlayerEmma
Copy link
Copy Markdown
Contributor Author

I have some additional thoughts on this after I've been studying more of the prior work in this space. I should have something more interesting and complete to share for discussion soon.

@ReadyPlayerEmma
Copy link
Copy Markdown
Contributor Author

Closing this and merging the ideas into future work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant