WIP: DataRepo abstraction for dataset identity / physical location separation#275
Closed
ReadyPlayerEmma wants to merge 1 commit into
Closed
WIP: DataRepo abstraction for dataset identity / physical location separation#275ReadyPlayerEmma wants to merge 1 commit into
ReadyPlayerEmma wants to merge 1 commit into
Conversation
Introduces an optional new shape for DatasetInfo that separates what a dataset is from where it's physically hosted. Same pattern as Python packaging (PyPI + mirrors), HuggingFace Hub, Docker/OCI, Maven/Go/npm: dataset has identity + list of repositories; resolver picks the highest-priority accessible repo at read time. Changes: - New esp_data/io/datarepo.py: DataRepo dataclass, in-memory registry, join_url helper, resolve function, strict-subset path validator, access_checker hook. - esp_data/dataset.py: DatasetInfo gains optional repos / folder / splits fields. split_paths is now optional. model_validator enforces legacy-XOR-modern shape and auto-derives split_paths from modern fields. - esp_data/io/__init__.py: re-exports the new public API. - tests/io/test_datarepo.py: 41 new tests covering DataRepo basics, join_url dispatch, registry, resolve, access-checker gate, strict validator (13 cases), and DatasetInfo backward compat. Zero changes to the 37 existing Dataset subclasses. Existing tests (test_dataset, test_concat, test_chained) pass unchanged.
Contributor
Author
|
I have some additional thoughts on this after I've been studying more of the prior work in this space. I should have something more interesting and complete to share for discussion soon. |
Contributor
Author
|
Closing this and merging the ideas into future work. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Status: Draft — looking for design feedback before formalizing.
Targets
gagan/fsspec-read-public-r2-bucket(#267) so the diff stays clean. Our resolver generates HTTPS URLs that depend on #267'sPureHTTPSPath/ fsspec HTTP integration. When #267 merges, GitHub should auto-retarget this tomain.What this adds
An optional new shape for
DatasetInfothat separates what a dataset is (identity: name, version, schema, relative paths of files) from where it's physically hosted (base URL of one or more repos it's mirrored to). Same pattern as Python packaging (PyPI + mirrors), HuggingFace Hub, Docker/OCI, Maven/Go/npm — a dataset has an identity + a list of repositories it's available in; a resolver picks the highest-priority accessible repo at read time.Concretely:
esp_data/io/datarepo.py—DataRepodataclass, in-memory registry (register_repo/unregister_repo/get_repo/list_repos),join_urlhelper,resolve(repos_ids, folder, relpath)function, strict-subset path validator, pluggableaccess_checkerhook.DatasetInfogains three optional fields:repos: list[str],folder: str,splits: dict.split_pathsis now also optional. A@model_validator(mode=\"after\")enforces exactly one of the two shapes (legacysplit_paths, or modernrepos+folder+splits) and auto-derivessplit_pathsfrom the modern fields when used.esp_data/io/__init__.pyre-exports the new public API.tests/io/test_datarepo.py.Why I think this is worth the weight
Today
DatasetInfo.split_pathshardcodes one physical URL per split. OSS users without GCP credentials can't access ESP's datasets even when they're publicly mirrored. Adding a new mirror requires touching 37 Dataset subclasses. #267 lets a user manually swap in an HTTPS URL, but doesn't address the coupling between dataset identity and physical location.With a Repo abstraction, the same catalog serves ESP-internal (via
gs://), OSS users (via the R2 public mirror'shttps://), and future mirrors (AWS Open Data, Zenodo) with a single config change per dataset — no code edit per subclass. The resolver picks based on priority + accessibility.Backward compat is real, not theoretical: the 37 existing Dataset subclasses read
info.split_paths[split]andinfo.split_paths.keys()— both continue to work unchanged whether the info was constructed legacy or modern. I verified this end-to-end by instantiatingBarkleyCanyon(zero code changes to the subclass) backed by a modern-shapeDatasetInfopointing at a local-kindDataRepo, and loading 9,770 real rows successfully.Existing tests (
test_dataset.py,test_concat.py,test_chained.py) — 33 pass, 2 skipped, 0 failed, same as on the base branch.Open questions I'd love your take on
Is "dataset identity vs. physical location" a separation you've considered before? The unmerged
tar-exporterbranch and theschema-*branches are adjacent in shape, and I don't want to reinvent something you've already thought through. Theschema-descriptionswork in particular composes nicely with this — schemas describe what's in a dataset, DataRepos describe where to get it.Does the auto-derive
split_pathsfrom modern fields feel right? The@model_validator+object.__setattr__pattern means zero subclass changes, but it means at-init resolution — if the repo registry changes after a DatasetInfo is built, stale. A future version could use@computed_fieldfor lazy resolution; I kept it simple for the prototype.Is
path_encodingworth implementing fully now, or defer until xeno-canto needs it? The field is declared onDataRepobut not yet used injoin_url. For strict-only deployments (all our data today except xeno-canto) it's a no-op. My lean: defer.Would you want this stacked as-is on your branch, or should I wait for Add https endpoint support #267 to land and rebase off main? I can see arguments either way. Stacking keeps the diff focused; rebasing gives you a cleaner independent review.
Design notes (things you might want to know)
[A-Za-z0-9._~\\-/]chosen because most existing ESP paths already fit it. I empirically surveyed 1,873 path values across 28 dataset modules —fn,filename,filepath,audio_fp,local_path,originals_pathare all within this subset today. Only xeno-canto has spaces / non-ASCII in filenames, and that dataset has pre-existing mixed-encoding inconsistencies that "legacy_raw"/"legacy_percent" modes are placeholders for.esp-internal-gcswithbase_url=\"gs://esp-ml-datasets/\", priority 10. Registered at module import. Users override withregister_repo(...).resolve()time, not atDataRepoconstruction time. This is a deliberate choice: the constraint depends on the target repo'spath_encoding, and the same dataset can be served from repos with different encodings.What this explicitly doesn't touch
split_pathsvia the legacy branch.PureHTTPSPathfrom Add https endpoint support #267. Coexists; not used by this code directly.DatasetConfig/ChainedDatasetConfig— I didn't see any parallel changes needed but haven't exhaustively read those paths. Worth flagging if you spot something I missed.Future follow-ups (not in this PR)
legacy_raw/legacy_percentpath encoding conversion injoin_url. Needed before xeno-canto can use the system.@computed_fieldresolution if repo-registry changes need to affect in-flight Datasets.access_checkerboolean.I haven't pitched any of this to Milad or anyone else yet — wanted to get your read first since this is downstream of your #267 work and adjacent to your schema branches.