Data discovery 177#189
| # TODO if taxonomic database cannot be loaded, set to None and skip validation :/? | ||
| TAXON_DB = None | ||
| print("Warning: Could not initialize ete3.NCBITaxa. All taxonomic validation will be skipped.") | ||
Hmm, this is very library-specific code, and this module is supposed to be more abstract. It needs a refactor, let's discuss.
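One possible shape for that refactor, as a rough sketch only: keep the module abstract by depending on a small interface and treating a missing backend as "validation skipped". The names `TaxonomyBackend`, `validate_taxon` and `InMemoryBackend` are hypothetical and do not exist in esp-data; a real backend (ete3, GBIF, ...) would live outside this module.

```python
from typing import Optional, Protocol, Set, Tuple


class TaxonomyBackend(Protocol):
    """Hypothetical interface that any taxonomy source could implement."""

    def has_name(self, name: str, rank: str) -> bool: ...


def validate_taxon(
    name: str, rank: str, backend: Optional[TaxonomyBackend]
) -> Optional[bool]:
    """Return the backend's verdict, or None when validation is skipped."""
    if backend is None:
        # No backend could be initialised: skip validation instead of failing.
        return None
    return backend.has_name(name, rank)


class InMemoryBackend:
    """Toy backend, used here only to illustrate the interface."""

    def __init__(self, known: Set[Tuple[str, str]]) -> None:
        self._known = known

    def has_name(self, name: str, rank: str) -> bool:
        # Compare case-insensitively against the known (name, rank) pairs.
        return (name.lower(), rank.lower()) in self._known
```

With this split, the "could not initialize" case becomes `backend=None` at the call site rather than a module-level global.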
| # --- VALIDATOR --- | ||
| # TODO | ||
| # validate that the name provided exists in the NCBI taxonomy database for the given rank |
People in ESP use GBIF more than NCBI. We even have a taxonomy app that is built on the GBIF API.
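To make the GBIF suggestion concrete, here is a minimal sketch of name-and-rank validation against the public GBIF species match endpoint. It only builds the query URL and interprets a response dict, so the HTTP call itself is left out. The helper names are hypothetical, and the response keys `matchType` and `rank` reflect my reading of the GBIF API and should be checked against its documentation. This is not the existing taxonomy app's code.

```python
from typing import Any, Mapping
from urllib.parse import urlencode

# Public GBIF species match endpoint (API v1).
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(name: str, rank: str) -> str:
    """Build a species-match query URL for a name and rank."""
    query = urlencode({"name": name, "rank": rank.upper(), "strict": "true"})
    return f"{GBIF_MATCH_URL}?{query}"


def is_valid_match(response: Mapping[str, Any], rank: str) -> bool:
    """Accept only exact matches whose rank equals the requested rank."""
    return (
        response.get("matchType") == "EXACT"
        and str(response.get("rank", "")).upper() == rank.upper()
    )
```

Keeping the response check separate from the request makes the validator testable without network access.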
| label_content: List[ | ||
| Literal[ | ||
| "species", | ||
| "individual identification", |
For string Literals that are categories, it's preferable not to have spaces (replace them with underscores) and not to have two possible names for one category (something like position/distance is not a good idea). We would also need a place to describe what each term means, e.g. "health state" or "group belonging". Are these terms canonical?
Add a description of each type to the `description` field in `Field`.
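A small sketch of what these two suggestions could look like together, using only the standard library: underscore-only Literal values plus a description table that can be joined into a single string for a `description` argument. The names `LabelContent`, `LABEL_CONTENT_DESCRIPTIONS`, `undocumented_terms` and `field_description` are hypothetical. Only the `recording_conditions` text comes from the diff; the other definitions are deliberately left as TODOs because they still need to be agreed on.

```python
from typing import Dict, Literal, Tuple, get_args

# Category values without spaces and with a single name per concept.
LabelContent = Literal["species", "individual_identification", "recording_conditions"]

LABEL_CONTENT_DESCRIPTIONS: Dict[str, str] = {
    "species": "TODO: agree on a definition",
    "individual_identification": "TODO: agree on a definition",
    "recording_conditions": "weather, habitat, time of day or year, season",
}


def undocumented_terms() -> Tuple[str, ...]:
    """Return Literal values that have no entry in the description table."""
    return tuple(
        term for term in get_args(LabelContent) if term not in LABEL_CONTENT_DESCRIPTIONS
    )


def field_description() -> str:
    """Join the per-term descriptions into one string for a schema description."""
    return "; ".join(
        f"{term}: {LABEL_CONTENT_DESCRIPTIONS[term]}" for term in get_args(LabelContent)
    )
```

`undocumented_terms()` could back a unit test so that a new category cannot be added without a description.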
| generate_dataset_table_with_attributes() | ||
| # TODO | ||
| # make it a command line call and create a url to access the generated file and search on it. |
We can add this script to the CI.
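For the command-line part of the TODO, a minimal `argparse` sketch might look like the following. `collect_rows` is a hypothetical stand-in for the real table-generation logic, and the `--output` flag and the column names are assumptions rather than the script's actual interface.

```python
import argparse
import csv
from pathlib import Path
from typing import Dict, List, Optional, Sequence


def collect_rows() -> List[Dict[str, str]]:
    """Hypothetical stand-in for the real table-generation logic."""
    return [{"dataset": "example_dataset", "label_content": "species"}]


def main(argv: Optional[Sequence[str]] = None) -> Path:
    """Parse the arguments, write the table as CSV and return its path."""
    parser = argparse.ArgumentParser(description="Generate the dataset attributes table.")
    parser.add_argument("--output", type=Path, default=Path("dataset_attributes_table.csv"))
    args = parser.parse_args(argv)

    rows = collect_rows()
    with args.output.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return args.output
```

A CI job could then call `main()` through a script entry point and publish the resulting file as an artifact.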
GaganNarula left a comment
Some comments, let's discuss further :) I feel like this taxonomic info is bloating DatasetInfo a bit too much.
| "health state", | ||
| "emotional state", | ||
| "distractor classes / background animals", | ||
| "recording conditions", # weather, habitat, time of day or year, season |
The comment is very descriptive! So let's move the descriptions of these categories to the description field.
- Port `GBIFConverter` from the taxonomy repo to esp-data as part of the new data discovery module (calling it `discover` for now; the idea is to have a CLI tool first, like `uv run discover .. some queries ..`). Having taxonomy in this module will help build a more comprehensive DatasetInfo (cf. #189) and do taxonomy validation.
- Added an `AddTaxonomy` transform that can help users add GBIF taxonomy to their datasets.
- Tests
This change is linked to this issue
The goal is to generate an up-to-date database of the existing datasets in esp-data, where searching contents and attributes is quick and easy, without having to open each dataset and explore it in depth.
At this stage, I've defined new attributes in DatasetInfo, and there is a script for generating the database.
When we run /scripts/data_discovery.py, the table is generated in ./dataset_attributes_table_test.csv.
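To illustrate the kind of search the generated table is meant to support, here is a rough standard-library sketch that filters CSV rows by a case-insensitive substring match on one column. The function names and the column names in the usage line are hypothetical; the real table's columns depend on the attributes defined in DatasetInfo.

```python
import csv
import io
from typing import Dict, Iterable, List


def search_rows(
    rows: Iterable[Dict[str, str]], column: str, term: str
) -> List[Dict[str, str]]:
    """Return rows whose `column` contains `term`, case-insensitively."""
    needle = term.lower()
    # Treat a missing or empty cell as an empty string so it never matches.
    return [row for row in rows if needle in (row.get(column) or "").lower()]


def search_csv(text: str, column: str, term: str) -> List[Dict[str, str]]:
    """Parse CSV text and filter it with search_rows."""
    return search_rows(csv.DictReader(io.StringIO(text)), column, term)
```

For example, `search_csv(open("dataset_attributes_table_test.csv").read(), "label_content", "species")` would list the datasets whose label content mentions species, assuming such a column exists.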
Questions: