Add spanish carrion crows vox #290
Individual focal vocalization loader for the Spanish carrion crow biologger dataset. Clips audio from longer recordings, supports noisy and MixIT-denoised modes, optional padding, and filtering by overlap_window_id context window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| def available_splits(self) -> list[str]: | ||
| return [ | ||
| "all", | ||
| 1, |
These splits (1, 2, ...) don't have any paths defined?
|
|
||
| @property | ||
| def columns(self) -> list[str]: | ||
| return list(self._data.columns) |
This can just be `return self._data.columns`; it is already the correct type.
| Backend to use ("pandas" or "polars"). Defaults to "polars". | ||
| streaming : bool | ||
| Whether to use streaming mode. Defaults to False. | ||
| denoised : bool |
For datasets with extra parameters like "denoised" and "fallback_to_noisy", you need a separate config. See AudiosetConfig in audioset.py, for instance.
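A minimal sketch of what such a config could look like, assuming a plain dataclass is acceptable. The class name `CarrionCrowsVoxConfig` and its defaults are hypothetical, and the real `AudiosetConfig` may be structured differently, so it should be treated as the reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarrionCrowsVoxConfig:
    """Hypothetical config grouping the dataset-specific options."""

    denoised: bool = False          # load MixIT-denoised clips instead of noisy ones
    fallback_to_noisy: bool = True  # use the noisy clip when denoising did not succeed
    padding_sec: float = 0.0        # padding added around each focal vocalization

    def __post_init__(self) -> None:
        # Reject values that would make the clip boundaries meaningless.
        if self.padding_sec < 0:
            raise ValueError("padding_sec must be non-negative")


config = CarrionCrowsVoxConfig(denoised=True, padding_sec=0.5)
```

Keeping these options in one object means the loader signature stays the same as for other datasets, and invalid combinations can be rejected in one place.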
| owner="maddie", | ||
| split_paths={ | ||
| # TODO: update path once finalized in GCS | ||
| "all": "gs://esp-ml-datasets/spanish-carrion-crows-vox/conversational_preprocessed.csv", |
Is this file already in esp-ml-datasets? If not, it should be in esp-data-ingestion, otherwise the tests won't run.
| na_values=[""], | ||
| ) | ||
| if self.split != "all": | ||
| self._data = self._data.filter_isin("overlap_window_id", [float(self.split)]) |
Oh, I see what's happening. But this is a transform of a split, so I think it's better to let the user do this in their own code rather than add it as a fake "split" that gets generated in _load.
It also keeps our API a bit more stable.
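A rough sketch of the user-side filtering suggested here, assuming the "all" split yields one dict per row (as the `_process` signature suggests). The `clip_id` field, the toy rows, and the helper name `select_overlap_window` are made up for illustration; only `overlap_window_id` comes from the PR.

```python
# Toy stand-in for rows loaded from the "all" split.
rows = [
    {"clip_id": "a", "overlap_window_id": 1.0},
    {"clip_id": "b", "overlap_window_id": 2.0},
    {"clip_id": "c", "overlap_window_id": 1.0},
    {"clip_id": "d", "overlap_window_id": None},  # row without a context window
]


def select_overlap_window(data, window_id):
    """Return only the rows that belong to the given overlap window."""
    target = float(window_id)
    return [row for row in data if row["overlap_window_id"] == target]


window_1 = select_overlap_window(rows, 1)
```

With this approach `available_splits` only needs to report "all", and the window ids never have to be listed in `split_paths`.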
| return len(self._data) | ||
|
|
||
| def _process(self, row: dict[str, Any]) -> dict[str, Any]: | ||
| start_sec = max(0.0, float(row["derived.centered_focal.long.start_sec"]) - self.padding_sec) |
Are these files also very long (e.g. more than 30 min)? Our current read_audio doesn't do a "streaming cut" of a long audio file in a bucket, which means we'll hit OOM issues if the files are long (related to PR #257).
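To illustrate what a "streaming cut" means, here is a sketch that reads only the requested segment of a local WAV file using the standard-library `wave` module. This is not how read_audio works today; the function name `read_wav_segment` is hypothetical, and a file in a bucket would additionally need a seekable file handle.

```python
import wave


def read_wav_segment(path, start_sec, end_sec):
    """Read only the frames between start_sec and end_sec from a WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        total_frames = wav.getnframes()
        # Clamp both boundaries to the valid frame range of the file.
        start_frame = min(total_frames, max(0, int(start_sec * sample_rate)))
        end_frame = min(total_frames, max(start_frame, int(end_sec * sample_rate)))
        wav.setpos(start_frame)  # seek to the segment instead of reading from the start
        frames = wav.readframes(end_frame - start_frame)
    return frames, sample_rate
```

Memory use then scales with the clip length rather than with the length of the full recording.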
| if self.denoised: | ||
| denoised_success = row["derived.denoised_focal.success"] | ||
| if isinstance(denoised_success, str): | ||
| denoised_success = denoised_success == "True" |
This seems like a check for a dtype that should really be a bool in the underlying data (all rows should be bool, not a mixed type).
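One way to get there is to normalize the column once when the CSV is prepared, so `_process` can rely on a real bool. A small sketch, assuming the only string forms present are "True" and "False" (case-insensitive); the helper name `parse_bool` is hypothetical.

```python
def parse_bool(value):
    """Convert a CSV cell to a strict bool, rejecting anything unexpected."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered == "true":
            return True
        if lowered == "false":
            return False
    # Fail loudly so mixed or missing values are caught during ingestion.
    raise ValueError(f"Unexpected boolean value: {value!r}")
```

Applying this to `derived.denoised_focal.success` at ingestion time would make the `isinstance(..., str)` branch in `_process` unnecessary.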
A different version of the Spanish Carrion Crow dataset that only includes focal vocalizations (i.e., vocalizations made by an adult bird wearing a biologger), in time periods where we have more than one crow tagged.
We have pre-computed audio clips for each focal vocalization, making it simpler than dataset #255.
Note that the full dataset has more focal vocalizations than this one, but we are subsetting to sections where there is synchronization between >1 bird in the same year/territory.