Webdataset backend#212
Conversation
mil-ad
left a comment
There was a problem hiding this comment.
few comments while looking at it more carefully
| else: | ||
| return all(sample.get(col) is not None for col in subset) | ||
|
|
||
| self._dropna_fn = dropna_fn |
There was a problem hiding this comment.
I haven't looked at other backends but doesn't this pattern of storing in self._dropna_fn() means that we won't be able to do any chaining?
I'd expect to be able to do:
foo.backend.dropna_fn(...).dropna_fn()
and right now we only apply the last one.
There was a problem hiding this comment.
same thing for filter_isin()
There was a problem hiding this comment.
yeah the implementation is a bit incomplete right now (these methods have to applied to the WebDataset obj) because I wanted you to have a quick look. But you're right we should instead return a new WebDatasetBackend(). This is the pattern in the other backends
There was a problem hiding this comment.
can we raise an issue for this?
There was a problem hiding this comment.
i've changed the implementation in the recent commits, that allows for chaining . wdyt ?
4929968 to
6b01b23
Compare
4d249bd to
ef58864
Compare
| # Imported here to avoid circular import: generic_dataset imports Dataset | ||
| from esp_data.generic_dataset import GenericDataset |
There was a problem hiding this comment.
I'm not used to keep circular import, I feel like it isn't clean and safe generally, but here I'm still thinking how I could avoid that, wdyt ?
There was a problem hiding this comment.
good question, this shouldn't really be done.. one solution would be to not add @register_dataset decorator to GenericDataset (and also not subclass it to Dataset). that breaks the pattern..lets keep it like this for now
| backend=dataset_config.backend, | ||
| streaming=dataset_config.streaming, | ||
| ) | ||
| if dataset_config.transformations: |
There was a problem hiding this comment.
transformations shouldn't be applied here.. we have to think about whether transformations can be applied at all, but if they can they have to be done within from_path using dataset_config.transformations
There was a problem hiding this comment.
Ok for now, I've removed it
| if resolved.is_dir(): | ||
| data = WebDatasetBackend.from_path(path, **kwargs) | ||
| else: | ||
| data = get_backend(backend).from_path(path, streaming=streaming, **kwargs) |
There was a problem hiding this comment.
you can just use get_backend instead of explicitly importing and instantiating WebDatasetBackend (no if - else needed),. We shouldnt check for is_dir() .. path is expected to be a dir anyway
There was a problem hiding this comment.
actually this whole bit from line 525 to 539 should be moved to init of GenericDataset. then init only takes (path, **kwargs), not (data, info)
|
|
||
| info = None | ||
| config_dir = resolved if resolved.is_dir() else resolved.parent | ||
| for config_name in ("config.json", "config.yaml"): |
There was a problem hiding this comment.
lets assume its yaml
| If provided, overrides the class-level generic `info`. | ||
| """ | ||
| self._is_webdataset = isinstance(data, WebDatasetBackend) | ||
| streaming = self._is_webdataset |
There was a problem hiding this comment.
hmm, i think streaming should be an init param, not dependent on a BackendType. We should not mention any strict types in GenericDataset except literals like backend="polars"
| pass | ||
|
|
||
| def __len__(self) -> int: | ||
| if self._is_webdataset: |
There was a problem hiding this comment.
same issue here
| return len(self._data) | ||
|
|
||
| def __getitem__(self, idx: int) -> Dict[str, Any]: | ||
| if self._is_webdataset: |
| ) | ||
|
|
||
| @classmethod | ||
| def from_config(cls, dataset_config: DatasetConfig) -> tuple["GenericDataset", dict[str, Any]]: |
There was a problem hiding this comment.
hmm this is getting quite mixed up.. GenericDataset.from_config is calling Dataset.from_path within it.. we need a rethink
There was a problem hiding this comment.
re my comment on dataset_from_config: if we introduce GenericDatasetConfig, then we dont need to use Dataset.from_path here. Rather just create cls(path)
GaganNarula
left a comment
There was a problem hiding this comment.
some comments
| the type of dataset configuration: | ||
| - 'dataset' for DatasetConfig | ||
| - 'concat' for ConcatConfig | ||
| - 'chain' for ChainedDatasetConfig |
There was a problem hiding this comment.
maybe we can directly use a 'generic' key and do GenericDataset.from_config in dataset_from_config
| def __str__(self) -> str: | ||
| return ( | ||
| f"GenericDataset(source={self.info.split_paths}, " | ||
| f"backend={'webdataset' if self._is_webdataset else 'tabular'})" |
There was a problem hiding this comment.
again _is_webdataset check.. we should maybe just print the self.info information in str.. reason is that if we overwrite self.info when we load from path, then print(ds) should reflect the actual dataset.. so it could be anuraset or something specific
ef58864 to
e15a3a8
Compare
|
After playing around with esp_data and webdataset backend, I'm currently confused. The save_to export I've made is working but is useless, you can't do much from a GenericDataset after having it loaded from tar files. maybe I've misunderstood the purpose of the export and load methods ? GenericDataset yields raw dicts with no per-sample processing. Round-trip AnuraSetStrong → save_to → from_path → GenericDataset loses all the domain logic (audio decoding, label mapping, typed fields). |
| """ | ||
| self.path = anypath(path) | ||
| fs = filesystem_from_path(self.path) | ||
| info_path = self.path / "info.yaml" |
There was a problem hiding this comment.
hmm, the info object itself doesnt contain the params used to construct the dataset object.. for e.g. we need:
backend_typesplitstreamingsample_rate- .. other dataset specific params like
versionin audioset
So i think we should expect one file called config.yaml, which contains these params along with an info key which is the dict representing DatasetInfo
| f"No info.yaml found at {info_path}. Cannot infer backend or streaming mode." | ||
| ) | ||
|
|
||
| super().__init__(backend=self.info.backend, streaming=self.info.streaming) |
There was a problem hiding this comment.
same issue, DatasetInfo doesnt contain these so self.info wont have them
| yield from self._data | ||
|
|
||
| def __str__(self) -> str: | ||
| return f"GenericDataset(name={self.info.name}, version={self.info.version})" |
There was a problem hiding this comment.
we can add split and sample_rate etc to it
| def save_to(self, path: str, format: str = "webdataset", **kwargs: Any) -> int: | ||
| """Save the dataset to a file. | ||
|
|
||
| Writes sharded tar files and an ``info.yaml`` containing `DatasetInfo` |
There was a problem hiding this comment.
should not be specific to tar because that's just one format. As mentioned, we probably need config.yaml not info.yaml
| info_dict["sample_rate"] = sample_rate | ||
|
|
||
| # Record backend so from_path can reload without the user specifying it. | ||
| info_dict["backend"] = backend |
There was a problem hiding this comment.
oh! i see you're overloading the info object here..
| If `format` is not supported | ||
| """ | ||
| self._ensure_not_streaming("save_to") | ||
| if format == "webdataset": |
There was a problem hiding this comment.
if our exporters are functions like write_to_webdataset (which i prefer) , then we can make an abstraction like export_to(path, format) which is a public function that does this selection .. and then we dont have to import a webdataset utility in a pandas backend! which is kind of a confusing pattern
There was a problem hiding this comment.
I took a look to your tar_exporter PR, maybe we could merge it into the webdataset_backend PR, then I would add the export abstraction and I could use the code you wrote for tar files export
There was a problem hiding this comment.
wdyt ?
Or I can add the export abstraction directly here
There was a problem hiding this comment.
and we can put this export_to into an export.py and make it public api from there.. the stuff like webdataset utils or format specific stuff shouldnt be public
There was a problem hiding this comment.
oh i just oh i just saw your message :D .. yes i think that works to merge this and the other branch, but maybe we should stack the export on this one, exposing a dummy export function for now
There was a problem hiding this comment.
#291 I open this stacked PR for the export abstraction
…amingBackend Protocol
0e6d73b to
818848a
Compare
StreamingBackendprotocol which differs fromDataBackend(should it be StreamingDataBackend?)