Skip to content

Webdataset backend#212

Open
GaganNarula wants to merge 30 commits into
mainfrom
gagan/webdataset-backend
Open

Webdataset backend#212
GaganNarula wants to merge 30 commits into
mainfrom
gagan/webdataset-backend

Conversation

@GaganNarula
Copy link
Copy Markdown
Collaborator

  • Adds tarfile dataset read via webdataset library as a backend
  • Adds StreamingBackend protocol which differs from DataBackend (should it be StreamingDataBackend?)

@GaganNarula GaganNarula marked this pull request as ready for review January 14, 2026 11:02
Copy link
Copy Markdown
Collaborator

@mil-ad mil-ad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few comments while looking at it more carefully

Comment thread esp_data/backends/webdataset_backend.py Outdated
else:
return all(sample.get(col) is not None for col in subset)

self._dropna_fn = dropna_fn
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at other backends but doesn't this pattern of storing in self._dropna_fn() means that we won't be able to do any chaining?

I'd expect to be able to do:

foo.backend.dropna_fn(...).dropna_fn()

and right now we only apply the last one.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing for filter_isin()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the implementation is a bit incomplete right now (these methods have to applied to the WebDataset obj) because I wanted you to have a quick look. But you're right we should instead return a new WebDatasetBackend(). This is the pattern in the other backends

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we raise an issue for this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've changed the implementation in the recent commits, that allows for chaining . wdyt ?

Comment thread esp_data/backends/webdataset_backend.py Outdated
Comment thread esp_data/backends/__init__.py Outdated
Comment thread esp_data/backends/webdataset_backend.py Outdated
@Paulchen-git Paulchen-git force-pushed the gagan/webdataset-backend branch from 4929968 to 6b01b23 Compare May 11, 2026 07:51
@Paulchen-git Paulchen-git requested a review from a team as a code owner May 11, 2026 07:51
@Paulchen-git Paulchen-git force-pushed the gagan/webdataset-backend branch 3 times, most recently from 4d249bd to ef58864 Compare May 11, 2026 13:51
Comment thread esp_data/dataset.py
Comment on lines +520 to +521
# Imported here to avoid circular import: generic_dataset imports Dataset
from esp_data.generic_dataset import GenericDataset
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not used to keep circular import, I feel like it isn't clean and safe generally, but here I'm still thinking how I could avoid that, wdyt ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question, this shouldn't really be done.. one solution would be to not add @register_dataset decorator to GenericDataset (and also not subclass it to Dataset). that breaks the pattern..lets keep it like this for now

Comment thread esp_data/dataset.py Outdated
backend=dataset_config.backend,
streaming=dataset_config.streaming,
)
if dataset_config.transformations:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transformations shouldn't be applied here.. we have to think about whether transformations can be applied at all, but if they can they have to be done within from_path using dataset_config.transformations

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok for now, I've removed it

Comment thread esp_data/dataset.py Outdated
if resolved.is_dir():
data = WebDatasetBackend.from_path(path, **kwargs)
else:
data = get_backend(backend).from_path(path, streaming=streaming, **kwargs)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just use get_backend instead of explicitly importing and instantiating WebDatasetBackend (no if - else needed),. We shouldnt check for is_dir() .. path is expected to be a dir anyway

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this whole bit from line 525 to 539 should be moved to init of GenericDataset. then init only takes (path, **kwargs), not (data, info)

Comment thread esp_data/dataset.py Outdated

info = None
config_dir = resolved if resolved.is_dir() else resolved.parent
for config_name in ("config.json", "config.yaml"):
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets assume its yaml

Comment thread esp_data/dataset.py Outdated
Comment thread esp_data/generic_dataset.py Outdated
If provided, overrides the class-level generic `info`.
"""
self._is_webdataset = isinstance(data, WebDatasetBackend)
streaming = self._is_webdataset
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i think streaming should be an init param, not dependent on a BackendType. We should not mention any strict types in GenericDataset except literals like backend="polars"

Comment thread esp_data/generic_dataset.py Outdated
pass

def __len__(self) -> int:
if self._is_webdataset:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue here

Comment thread esp_data/generic_dataset.py Outdated
return len(self._data)

def __getitem__(self, idx: int) -> Dict[str, Any]:
if self._is_webdataset:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here

Comment thread esp_data/generic_dataset.py Outdated
)

@classmethod
def from_config(cls, dataset_config: DatasetConfig) -> tuple["GenericDataset", dict[str, Any]]:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this is getting quite mixed up.. GenericDataset.from_config is calling Dataset.from_path within it.. we need a rethink

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re my comment on dataset_from_config: if we introduce GenericDatasetConfig, then we dont need to use Dataset.from_path here. Rather just create cls(path)

Copy link
Copy Markdown
Collaborator Author

@GaganNarula GaganNarula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments

Comment thread esp_data/dataset.py Outdated
Comment thread esp_data/dataset.py
the type of dataset configuration:
- 'dataset' for DatasetConfig
- 'concat' for ConcatConfig
- 'chain' for ChainedDatasetConfig
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can directly use a 'generic' key and do GenericDataset.from_config in dataset_from_config

Comment thread esp_data/generic_dataset.py Outdated
def __str__(self) -> str:
return (
f"GenericDataset(source={self.info.split_paths}, "
f"backend={'webdataset' if self._is_webdataset else 'tabular'})"
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again _is_webdataset check.. we should maybe just print the self.info information in str.. reason is that if we overwrite self.info when we load from path, then print(ds) should reflect the actual dataset.. so it could be anuraset or something specific

@Paulchen-git Paulchen-git force-pushed the gagan/webdataset-backend branch from ef58864 to e15a3a8 Compare May 12, 2026 12:46
@Paulchen-git
Copy link
Copy Markdown
Contributor

After playing around with esp_data and webdataset backend, I'm currently confused. The save_to export I've made is working but is useless, you can't do much from a GenericDataset after having it loaded from tar files. maybe I've misunderstood the purpose of the export and load methods ?

GenericDataset yields raw dicts with no per-sample processing. Round-trip AnuraSetStrong → save_to → from_path → GenericDataset loses all the domain logic (audio decoding, label mapping, typed fields).

Comment thread esp_data/generic_dataset.py Outdated
"""
self.path = anypath(path)
fs = filesystem_from_path(self.path)
info_path = self.path / "info.yaml"
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, the info object itself doesnt contain the params used to construct the dataset object.. for e.g. we need:

  1. backend_type
  2. split
  3. streaming
  4. sample_rate
  5. .. other dataset specific params like version in audioset

So i think we should expect one file called config.yaml, which contains these params along with an info key which is the dict representing DatasetInfo

Comment thread esp_data/generic_dataset.py Outdated
f"No info.yaml found at {info_path}. Cannot infer backend or streaming mode."
)

super().__init__(backend=self.info.backend, streaming=self.info.streaming)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue, DatasetInfo doesnt contain these so self.info wont have them

yield from self._data

def __str__(self) -> str:
return f"GenericDataset(name={self.info.name}, version={self.info.version})"
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can add split and sample_rate etc to it

Comment thread esp_data/dataset.py Outdated
def save_to(self, path: str, format: str = "webdataset", **kwargs: Any) -> int:
"""Save the dataset to a file.

Writes sharded tar files and an ``info.yaml`` containing `DatasetInfo`
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not be specific to tar because that's just one format. As mentioned, we probably need config.yaml not info.yaml

Comment thread esp_data/dataset.py Outdated
info_dict["sample_rate"] = sample_rate

# Record backend so from_path can reload without the user specifying it.
info_dict["backend"] = backend
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh! i see you're overloading the info object here..

Comment thread esp_data/backends/pandas_backend.py Outdated
If `format` is not supported
"""
self._ensure_not_streaming("save_to")
if format == "webdataset":
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if our exporters are functions like write_to_webdataset (which i prefer) , then we can make an abstraction like export_to(path, format) which is a public function that does this selection .. and then we dont have to import a webdataset utility in a pandas backend! which is kind of a confusing pattern

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look to your tar_exporter PR, maybe we could merge it into the webdataset_backend PR, then I would add the export abstraction and I could use the code you wrote for tar files export

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wdyt ?

Or I can add the export abstraction directly here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we can put this export_to into an export.py and make it public api from there.. the stuff like webdataset utils or format specific stuff shouldnt be public

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i just oh i just saw your message :D .. yes i think that works to merge this and the other branch, but maybe we should stack the export on this one, exposing a dummy export function for now

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#291 I open this stacked PR for the export abstraction

@Paulchen-git Paulchen-git force-pushed the gagan/webdataset-backend branch from 0e6d73b to 818848a Compare May 22, 2026 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants