Add spanish carrion crows vox by mcusi · Pull Request #290 · earthspecies/esp-data

mcusi · 2026-05-14T13:24:01Z

A different version of the Spanish Carrion Crow datasets that only includes focal vocalizations (i.e., vocalizations made by an adult bird wearing a biologger), in time periods where we have more than one crow tagged.

We have pre-computed audio clips for each focal vocalization, which makes this dataset simpler to use than dataset #255.

Note that the full dataset has more focal vocalizations than this one, but we are subsetting to sections where there is synchronization between >1 bird in the same year/territory.

Individual focal vocalization loader for the Spanish carrion crow biologger dataset. Clips audio from longer recordings, supports noisy and MixIT-denoised modes, optional padding, and filtering by overlap_window_id context window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GaganNarula · 2026-05-16T10:02:04Z

+    def available_splits(self) -> list[str]:
+        return [
+            "all",
+            1,


These splits (1, 2, ...) don't have any paths defined?

GaganNarula · 2026-05-16T10:02:29Z

+
+    @property
+    def columns(self) -> list[str]:
+        return list(self._data.columns)


This can just be return self._data.columns; it is already the correct type.

GaganNarula · 2026-05-16T10:04:20Z

+            Backend to use ("pandas" or "polars"). Defaults to "polars".
+        streaming : bool
+            Whether to use streaming mode. Defaults to False.
+        denoised : bool


For datasets with extra parameters like "denoised" and "fallback_to_noisy", you need a separate config. See AudiosetConfig in audioset.py, for instance.
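To make the suggestion concrete, here is a minimal sketch of a dataset-specific config that groups the extra parameters. The class name `CrowVoxConfig`, its defaults, and its validation rules are hypothetical; this does not mirror the actual `AudiosetConfig` in the repository, whose fields and base class are not shown in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrowVoxConfig:
    """Hypothetical container for the loader's extra parameters."""

    denoised: bool = False           # assumed default: load the noisy audio
    fallback_to_noisy: bool = False  # assumed default: no silent fallback
    padding_sec: float = 0.0         # padding added around each vocalization

    def __post_init__(self) -> None:
        # Reject combinations that cannot mean anything sensible.
        if self.padding_sec < 0:
            raise ValueError("padding_sec must be non-negative")
        if self.fallback_to_noisy and not self.denoised:
            raise ValueError("fallback_to_noisy only applies when denoised=True")


config = CrowVoxConfig(denoised=True, fallback_to_noisy=True, padding_sec=0.5)
```

Keeping these options in one object means the dataset constructor signature stays the same when a new option is added.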

GaganNarula · 2026-05-16T10:04:52Z

+        owner="maddie",
+        split_paths={
+            # TODO: update path once finalized in GCS
+            "all": "gs://esp-ml-datasets/spanish-carrion-crows-vox/conversational_preprocessed.csv",


Is this file already in esp-ml-datasets? If not, it should be in esp-data-ingestion, otherwise the tests won't run.

GaganNarula

changes needed

GaganNarula · 2026-05-16T10:06:34Z

+            na_values=[""],
+        )
+        if self.split != "all":
+            self._data = self._data.filter_isin("overlap_window_id", [float(self.split)])


Oh, I see what's happening. But this is a transform of a split, so I think it's better to let the user do this in their own code rather than add it as a fake "split" that gets generated in _load.

It also keeps our API a bit more stable.
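A minimal sketch of the user-side alternative, assuming the "all" split has been loaded into a pandas DataFrame. The toy table and the helper name `select_window` are illustrative only; the real table has many more columns, and only `overlap_window_id` is taken from the diff above.

```python
import pandas as pd

# Toy stand-in for the loaded "all" split.
df = pd.DataFrame(
    {
        "clip_id": ["a", "b", "c", "d"],
        "overlap_window_id": [1.0, 1.0, 2.0, None],
    }
)


def select_window(table: pd.DataFrame, window_id: int) -> pd.DataFrame:
    """Keep only the rows that belong to one overlap window."""
    # Rows with a missing overlap_window_id compare as False and are dropped.
    mask = table["overlap_window_id"] == float(window_id)
    return table[mask].reset_index(drop=True)


window_1 = select_window(df, 1)
```

With this approach the dataset exposes only the "all" split, and the window selection stays an ordinary filter in user code.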

GaganNarula · 2026-05-16T10:09:02Z

+        return len(self._data)
+
+    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
+        start_sec = max(0.0, float(row["derived.centered_focal.long.start_sec"]) - self.padding_sec)


Are these files also very long (e.g. more than 30 min)? Our current read_audio doesn't do a "streaming cut" of a long audio file in a bucket, which means we'll hit OOM issues if the files are long (related to PR #257).
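For illustration, here is a sketch of what a "streaming cut" means for a local, uncompressed WAV file, using only the standard-library `wave` module: seek to the start frame and read just the frames that are needed, so memory use scales with the clip length rather than the file length. This is not the repository's `read_audio`, and the helper name `read_wav_segment` is hypothetical. Reading from a bucket would additionally need a seekable file object, which is not shown here.

```python
import os
import tempfile
import wave


def read_wav_segment(path: str, start_sec: float, end_sec: float) -> tuple[bytes, int]:
    """Return raw PCM bytes for [start_sec, end_sec) and the sample rate."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        total_frames = wav.getnframes()
        # Clamp the requested window to the frames that actually exist.
        start_frame = min(max(0, int(start_sec * sample_rate)), total_frames)
        end_frame = min(max(start_frame, int(end_sec * sample_rate)), total_frames)
        wav.setpos(start_frame)  # seek without reading the preceding audio
        frames = wav.readframes(end_frame - start_frame)
    return frames, sample_rate


# Demo on a 2-second silent mono file (16-bit, 8 kHz) written to a temp dir.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(8000)
        out.writeframes(b"\x00\x00" * 16000)
    segment, rate = read_wav_segment(demo_path, 0.5, 1.0)
```

The same clamping covers the padded start computed in `_process`, since a negative or out-of-range window is simply truncated to the file bounds.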

GaganNarula · 2026-05-16T10:10:04Z

+        if self.denoised:
+            denoised_success = row["derived.denoised_focal.success"]
+            if isinstance(denoised_success, str):
+                denoised_success = denoised_success == "True"


This looks like a check for a dtype that should really be a bool in the underlying data (all rows should be bool, not a mixed type).
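One way to remove the need for that runtime check is to normalize the column once, when the table is prepared. The sketch below assumes a pandas table; the helper `to_bool` and the accepted string spellings are hypothetical, and treating missing values as `False` is an assumption rather than something stated in the PR.

```python
import pandas as pd

COLUMN = "derived.denoised_focal.success"
TRUE_STRINGS = {"true", "1"}
FALSE_STRINGS = {"false", "0", ""}


def to_bool(value: object) -> bool:
    """Map mixed bool/string/missing values onto a plain bool."""
    if isinstance(value, bool):
        return value
    if pd.isna(value):
        return False  # assumption: missing means denoising did not succeed
    text = str(value).strip().lower()
    if text in TRUE_STRINGS:
        return True
    if text in FALSE_STRINGS:
        return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean")


# Toy column with the kind of mixed types the isinstance check guards against.
raw = pd.DataFrame({COLUMN: ["True", "False", True, None]})
clean = raw.assign(**{COLUMN: raw[COLUMN].map(to_bool).astype(bool)})
```

After this step the column has a single bool dtype, so the loader can use the value directly.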

mcusi and others added 4 commits March 16, 2026 23:58

Add Spanish carrion crows dataset and passing tests

2a7af2f

Remove file without selection table

1810ddc

Address PR review comments

fb4376d

mcusi requested a review from a team as a code owner May 14, 2026 13:24

GaganNarula reviewed May 16, 2026

GaganNarula requested changes May 16, 2026

mcusi wants to merge 4 commits into main from add-spanish-carrion-crows-vox

2 participants