Add spanish carrion crows vox #290

Open
mcusi wants to merge 4 commits into main from add-spanish-carrion-crows-vox

Conversation

Contributor

@mcusi mcusi commented May 14, 2026

A different version of the Spanish Carrion Crow datasets that only includes focal vocalizations (i.e., vocalizations made by an adult bird wearing a biologger), in time periods where we have more than one crow tagged.

We have pre-computed audio clips for each focal vocalization, which makes this dataset simpler to load than dataset #255.

Note that the full dataset has more focal vocalizations than this one, but we are subsetting to sections where there is synchronization between >1 bird in the same year/territory.

mcusi and others added 4 commits March 16, 2026 23:58
Individual focal vocalization loader for the Spanish carrion crow
biologger dataset. Clips audio from longer recordings, supports
noisy and MixIT-denoised modes, optional padding, and filtering
by overlap_window_id context window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mcusi mcusi requested a review from a team as a code owner May 14, 2026 13:24
def available_splits(self) -> list[str]:
    return [
        "all",
        1,
Collaborator

These splits (1, 2, ...) don't have any paths defined?
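
One way to catch this kind of mismatch early is a consistency check between the advertised splits and the configured paths. The following is a minimal sketch, assuming a `split_paths` mapping like the one in this PR; `find_splits_without_paths` is a hypothetical helper, not something in the codebase, and the path is a placeholder.

```python
def find_splits_without_paths(available_splits, split_paths):
    """Return the advertised splits that have no entry in split_paths."""
    # Compare as strings so that 1 and "1" count as the same split name.
    known = {str(name) for name in split_paths}
    return [split for split in available_splits if str(split) not in known]


# Illustrative values only: the path below is a placeholder, not the real bucket path.
missing = find_splits_without_paths(
    ["all", 1, 2],
    {"all": "path/to/all.csv"},
)
print(missing)
```

A check like this could run in a unit test, so that any split name returned by `available_splits` without a backing path fails fast.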


@property
def columns(self) -> list[str]:
    return list(self._data.columns)
Collaborator

This can just be return self._data.columns; it is already the correct type.

    Backend to use ("pandas" or "polars"). Defaults to "polars".
streaming : bool
    Whether to use streaming mode. Defaults to False.
denoised : bool
Collaborator

For datasets with extra parameters like "denoised" and "fallback_to_noisy", you need a separate config; see AudiosetConfig in audioset.py, for instance.
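
The shape of AudiosetConfig is not shown in this thread, so the following is only a rough sketch of what a separate config for the loader-specific options might look like. The class name `CarrionCrowVoxConfig`, the default values, and the validation are all assumptions; only the option names `denoised`, `fallback_to_noisy`, and `padding_sec` come from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarrionCrowVoxConfig:
    """Hypothetical config grouping the loader-specific options."""

    denoised: bool = False  # assumed default
    fallback_to_noisy: bool = False  # assumed default
    padding_sec: float = 0.0  # assumed default

    def __post_init__(self):
        # Negative padding would shrink the clip rather than extend it.
        if self.padding_sec < 0:
            raise ValueError("padding_sec must be non-negative")


config = CarrionCrowVoxConfig(denoised=True, fallback_to_noisy=True, padding_sec=0.5)
print(config)
```

Grouping the extra parameters this way keeps the dataset constructor signature the same across datasets, which appears to be the point of the request.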

owner="maddie",
split_paths={
    # TODO: update path once finalized in GCS
    "all": "gs://esp-ml-datasets/spanish-carrion-crows-vox/conversational_preprocessed.csv",
Collaborator

Is this file already in esp-ml-datasets? If not, shouldn't it be in esp-data-ingestion, since otherwise the tests won't run?

Collaborator

@GaganNarula GaganNarula left a comment

Changes needed.

na_values=[""],
)
if self.split != "all":
    self._data = self._data.filter_isin("overlap_window_id", [float(self.split)])
Collaborator

Oh, I see what's happening. But this is a transform of a split, so I think it's better to let the user do this in their own code rather than add it as a fake "split" that gets generated on _load.

It also keeps our API a bit more stable.
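
To illustrate the user-side alternative, here is a minimal sketch of filtering by context window after loading the "all" split. It works on plain dictionaries rather than the project's dataset object; `filter_by_overlap_window` and the sample rows are hypothetical, and only the `overlap_window_id` column name comes from the PR.

```python
def filter_by_overlap_window(rows, window_ids):
    """Keep only rows whose overlap_window_id is one of window_ids."""
    # The PR casts the split name to float, so compare as floats here too.
    wanted = {float(w) for w in window_ids}
    kept = []
    for row in rows:
        value = row.get("overlap_window_id")
        if value is None or value == "":
            continue  # rows without a window id never match
        if float(value) in wanted:
            kept.append(row)
    return kept


# Made-up rows for illustration; real rows carry many more columns.
rows = [
    {"clip": "a.wav", "overlap_window_id": 1.0},
    {"clip": "b.wav", "overlap_window_id": 2.0},
    {"clip": "c.wav", "overlap_window_id": 1.0},
]
window_one = filter_by_overlap_window(rows, [1])
print([row["clip"] for row in window_one])
```

With the dataset object itself, the equivalent would presumably be the same `filter_isin("overlap_window_id", ...)` call the PR currently makes inside _load, just issued by the caller.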

    return len(self._data)

def _process(self, row: dict[str, Any]) -> dict[str, Any]:
    start_sec = max(0.0, float(row["derived.centered_focal.long.start_sec"]) - self.padding_sec)
Collaborator

Are these files also very long (e.g. more than 30 min)? Our current read_audio doesn't do a "streaming cut" of a long audio file in a bucket, which means we'll hit OOM issues if the files are long (related to PR #257).
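
For reference, the seek-then-read pattern behind a "streaming cut" can be sketched with the standard-library `wave` module. This is not the project's read_audio and it does not handle bucket storage; it only shows reading a padded window from a local WAV file without loading the whole recording. `read_wav_segment` is a hypothetical name, and the padding clamp mirrors the `max(0.0, ...)` in _process.

```python
import os
import tempfile
import wave


def read_wav_segment(path, start_sec, end_sec, padding_sec=0.0):
    """Read only the padded [start_sec, end_sec] window from a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        total = wav.getnframes()
        # Clamp the padded window to the bounds of the file.
        start = max(0.0, start_sec - padding_sec)
        first = min(total, round(start * rate))
        last = min(total, round((end_sec + padding_sec) * rate))
        # Seek to the first frame, then read just the frames that are needed.
        wav.setpos(first)
        frames = wav.readframes(max(0, last - first))
    return frames, rate


# Demo: write one second of 16-bit mono silence at 8 kHz, then read a slice of it.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(b"\x00\x00" * 8000)

segment, rate = read_wav_segment(demo_path, 0.25, 0.5)
print(len(segment) // 2, "frames at", rate, "Hz")
```

For objects in a bucket, the same idea would need ranged reads on the remote file, which is outside the scope of this sketch.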

if self.denoised:
    denoised_success = row["derived.denoised_focal.success"]
    if isinstance(denoised_success, str):
        denoised_success = denoised_success == "True"
Collaborator

This seems like a check for a dtype that should really be a bool in the underlying data (all rows should be bool, not a mixed type).
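
If fixing the underlying CSV is not immediately possible, one option is to normalize the column once at load time so that _process only ever sees real booleans. A minimal sketch follows; `to_bool` is a hypothetical helper and the accepted spellings are an assumption. Unlike the check in the PR, it raises on unrecognized values instead of silently treating them as False.

```python
_TRUE_STRINGS = {"true", "1"}
_FALSE_STRINGS = {"false", "0"}


def to_bool(value):
    """Coerce a bool or a common string spelling of one into a real bool."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in _TRUE_STRINGS:
            return True
        if text in _FALSE_STRINGS:
            return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean")


# Made-up rows showing the mixed types the review comment describes.
rows = [
    {"derived.denoised_focal.success": "True"},
    {"derived.denoised_focal.success": False},
]
cleaned = [to_bool(row["derived.denoised_focal.success"]) for row in rows]
print(cleaned)
```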

2 participants