Data discovery 177 #189

Open
inesnolas wants to merge 3 commits into main from data-discovery-177

@inesnolas inesnolas commented Dec 12, 2025

This change is linked to this issue
The goal is to generate an up-to-date database of the existing datasets in esp-data, so that contents and attributes can be searched quickly without opening each dataset and exploring it in depth.

At this stage, I've defined new attributes in DatasetInfo and added a script that generates the database. Running /scripts/data_discovery.py writes the table to ./dataset_attributes_table_test.csv.

Questions:

  1. I'm not sure where and when the script should run. We may want to regenerate the table automatically whenever the datasets change.
  2. Where should this table live, and how should it be used?
  3. The taxonomic_info field is intended to be semi-automatically filled by looking up a database. I've already included the download of that database, but the lookup logic is still flagged as a #TODO. Perhaps I need to test it alongside adding a new dataset?
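To make the idea of a searchable attributes table concrete, here is a minimal sketch using only the standard library. It is not the code from this PR: `DatasetInfoSketch`, its field names, `write_attributes_table`, and `search_rows` are all hypothetical stand-ins, and the real `DatasetInfo` may look quite different.

```python
import csv
import io
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetInfoSketch:
    """Hypothetical stand-in for DatasetInfo; the field names are illustrative."""

    name: str
    label_content: List[str] = field(default_factory=list)
    taxonomic_info: str = ""


def write_attributes_table(infos, out):
    """Write one CSV row per dataset; list-valued attributes are joined with ';'."""
    columns = ["name", "label_content", "taxonomic_info"]
    writer = csv.DictWriter(out, fieldnames=columns)
    writer.writeheader()
    for info in infos:
        row = asdict(info)
        row["label_content"] = ";".join(row["label_content"])
        writer.writerow(row)


def search_rows(csv_text, column, term):
    """Return the rows whose `column` contains `term` (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if term.lower() in row[column].lower()]
```

A flat CSV keeps the table readable in any spreadsheet tool, at the cost of flattening list-valued attributes into a single delimited string.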

@inesnolas inesnolas added the enhancement New feature or request label Dec 12, 2025
@inesnolas inesnolas linked an issue Dec 12, 2025 that may be closed by this pull request
Comment thread esp_data/dataset.py
# TODO if taxonomic database cannot be loaded, set to None and skip validation :/?
TAXON_DB = None
print("Warning: Could not initialize ete3.NCBITaxa. All taxonomic validation will be skipped.")

Collaborator

Hmm, this is very library-specific code, and this module is supposed to be more abstract. It needs a refactor; let's discuss.
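One possible direction for such a refactor, sketched under assumptions: keep the module dependent only on a small interface and pass the concrete taxonomy lookup in from outside. `TaxonomyBackend`, `InMemoryBackend`, and `validate_taxon` are hypothetical names, not part of esp-data.

```python
from typing import Optional, Protocol


class TaxonomyBackend(Protocol):
    """Hypothetical interface; any lookup service (NCBI, GBIF, ...) could implement it."""

    def has_name(self, name: str, rank: str) -> bool: ...


class InMemoryBackend:
    """Toy backend for illustration, backed by a set of (rank, name) pairs."""

    def __init__(self, known):
        self._known = {(rank, name.lower()) for rank, name in known}

    def has_name(self, name: str, rank: str) -> bool:
        return (rank, name.lower()) in self._known


def validate_taxon(
    name: str, rank: str, backend: Optional[TaxonomyBackend]
) -> Optional[bool]:
    """Return True/False from the backend, or None when validation is skipped."""
    if backend is None:
        return None  # no taxonomy database available: skip instead of failing
    return backend.has_name(name, rank)
```

With this shape, the abstract module never imports a taxonomy library directly, and the "database could not be loaded" case becomes an explicit `None` backend rather than a module-level global.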

Comment thread esp_data/dataset.py

# --- VALIDATOR ---
# TODO
# validate that the name provided exists in the NCBI taxonomy database for the given rank
Collaborator

People in ESP use GBIF more than NCBI. We even have a taxonomy app that is built on the GBIF API.

Comment thread esp_data/dataset.py
label_content: List[
Literal[
"species",
"individual identification",
Collaborator

@GaganNarula GaganNarula Dec 15, 2025

For string Literals that are categories, it's preferable not to have spaces (replace them with underscores) and not to have two possible names (something like position/distance is not a good idea). We would also need somewhere to describe what each term means, e.g. "health state" or "group belonging". Are these terms canonical?

Collaborator

Add a description of each type to the description field in Field.
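As a rough illustration of the naming and description suggestions in this thread, the sketch below uses underscore-only category names and keeps one description per term. It uses a plain dictionary rather than `Field` so that it stays dependency-free. The term list is a subset taken from the diff excerpts, and every description except the one for `recording_conditions` (taken from the inline code comment in this PR) is a placeholder, not an agreed definition.

```python
from typing import Literal, get_args

# Category names without spaces or slashes, one name per concept.
LabelContent = Literal[
    "species",
    "individual_identification",
    "health_state",
    "emotional_state",
    "recording_conditions",
]

# Placeholder descriptions; only recording_conditions comes from the PR's code comment.
LABEL_CONTENT_DESCRIPTIONS = {
    "species": "description to be agreed",
    "individual_identification": "description to be agreed",
    "health_state": "description to be agreed",
    "emotional_state": "description to be agreed",
    "recording_conditions": "weather, habitat, time of day or year, season",
}


def check_label_terms() -> list:
    """Return a list of problems; an empty list means the vocabulary is consistent."""
    problems = []
    for term in get_args(LabelContent):
        if " " in term or "/" in term:
            problems.append(f"{term!r}: use underscores and a single name")
        if term not in LABEL_CONTENT_DESCRIPTIONS:
            problems.append(f"{term!r}: missing description")
    return problems
```

A check like `check_label_terms()` could run in a unit test so that a new category cannot be added without a description.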

Comment thread scripts/data_discovery.py
generate_dataset_table_with_attributes()

# TODO
# make it a command line call and create a url to access the generated file and search on it.
Collaborator

We can add this script to the CI.
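For the "make it a command line call" TODO, a minimal `argparse` wrapper might look like the sketch below, which would also make the script easy to invoke from a CI job. The `--output` flag and the `main` function are assumptions for illustration. The call to `generate_dataset_table_with_attributes()` is left as a comment because its signature is not shown in this thread.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line parser; the --output flag is a hypothetical option."""
    parser = argparse.ArgumentParser(
        description="Generate the dataset attributes table."
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("dataset_attributes_table_test.csv"),
        help="Where to write the generated CSV table.",
    )
    return parser


def main(argv=None) -> Path:
    """Parse arguments and return the chosen output path."""
    args = build_parser().parse_args(argv)
    # The real script would call generate_dataset_table_with_attributes() here;
    # how the output path is passed to it is not shown in the PR.
    return args.output


if __name__ == "__main__":
    main()
```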

Collaborator

@GaganNarula GaganNarula left a comment

Some comments, let's discuss further :) I feel like this taxonomic info thing is bloating DatasetInfo a bit too much.

Comment thread esp_data/dataset.py
"health state",
"emotional state",
"distractor classes / background animals",
"recording conditions", # weather, habitat, time of day or year, season
Collaborator

The comment is very descriptive! So let's move the descriptions of these categories to the description field.

benjaminsshoffman pushed a commit that referenced this pull request Jan 22, 2026
- Port GBIFConverter from the taxonomy repo to esp-data as part of the new
data discovery module (calling it `discover` for now; the idea is to
have a CLI tool first, like `uv run discover .. some queries ..`).
Having taxonomy in this module will help build a more comprehensive
DatasetInfo (c.f. #189) and do taxonomy validation.
- Add an `AddTaxonomy` transform that can help users add GBIF taxonomy
to their datasets.
- Tests
Development

Successfully merging this pull request may close these issues.

Data discovery with esp-data

2 participants