The Tabiya Livelihoods Classifier provides an easy-to-use implementation of the entity-linking paradigm to support job description heuristics. Using state-of-the-art transformer neural networks, this tool can extract five entity types: Occupation, Skill, Qualification, Experience, and Domain. For Occupations and Skills, the matching ESCO entries are also retrieved. The procedure consists of two discrete steps: entity extraction and similarity vector search.
See ARCHITECTURE.md for a full overview of the system design, services, and infrastructure.
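The second step, similarity vector search, can be illustrated with a toy example. This is a minimal sketch and not the project's implementation: the three-dimensional vectors, the labels, and the helper names `cosine_similarity` and `top_k` are made up for illustration, and cosine similarity is assumed as the similarity measure. In practice the vectors would come from a transformer model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k (label, score) pairs most similar to the query vector."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" of reference labels; the values are invented for illustration.
toy_index = {
    "software developer": [0.9, 0.1, 0.0],
    "web developer": [0.6, 0.6, 0.1],
    "nurse": [0.0, 0.2, 0.9],
}

# Toy embedding of an entity extracted from a job description.
query_vector = [0.85, 0.2, 0.05]
matches = top_k(query_vector, toy_index, k=2)
print(matches)
```

The extracted entity is linked to the reference entries whose vectors point in the most similar direction; here the two developer labels rank above the unrelated one.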
Environments follow the pattern <env>.classifier.tabiya.tech. The dev environment is used as an example below.
| Resource | URL |
|---|---|
| API | https://dev.classifier.tabiya.tech |
| App (dashboard) | https://app.dev.classifier.tabiya.tech |
| Docs | https://docs.dev.classifier.tabiya.tech |
| API health check | https://dev.classifier.tabiya.tech/v1/health |
| Classify Swagger UI | https://dev.classifier.tabiya.tech/docs |
| NER Swagger UI | https://dev.classifier.tabiya.tech/docs/ner |
| NEL Swagger UI | https://dev.classifier.tabiya.tech/docs/nel |
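The URLs in the table can be derived from the environment name. The sketch below is illustrative only: `environment_urls` and `check_health` are hypothetical helper names, the `app.` and `docs.` prefixes for environments other than dev are extrapolated from the dev rows above, and the format of the health check response is not specified here, so only the HTTP status code is returned.

```python
import urllib.request


def environment_urls(env):
    """Build the resource URLs for one environment, following the table above."""
    base = f"https://{env}.classifier.tabiya.tech"
    return {
        "api": base,
        "app": f"https://app.{env}.classifier.tabiya.tech",
        "docs": f"https://docs.{env}.classifier.tabiya.tech",
        "health": f"{base}/v1/health",
        "classify_swagger": f"{base}/docs",
        "ner_swagger": f"{base}/docs/ner",
        "nel_swagger": f"{base}/docs/nel",
    }


def check_health(env, timeout=10):
    """Return the HTTP status code of the health endpoint (performs a network request)."""
    url = environment_urls(env)["health"]
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status
```

Calling `check_health("dev")` requires network access; `urllib` raises `urllib.error.HTTPError` for error status codes, so callers may want to catch it.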
- Installation
- Use the model: Instructions on how to use the inference pipeline.
- Job Analysis Application: A web application for analyzing job descriptions and extracting and linking relevant entities.
- Training: Details on how to train the model.
- Model's Architecture
- Datasets
- License
- Bibliography
- A recent version of git (e.g. ^2.37)
- Poetry
  Note: to install Poetry, consult the Poetry documentation.
  Note: install Poetry system-wide (not in a virtualenv).
This repository uses Git LFS for handling large files. Before you can use this repository, you need to install and set up Git LFS on your local machine. See https://git-lfs.com/ for installation instructions.
After Git LFS is set up, follow these steps to clone the repository:
git clone https://github.com/tabiya-tech/tabiya-livelihoods-classifier.git
If you already cloned the repository without Git LFS, run:
git lfs pull
In the root directory of the backend project (so, the same directory as this README file), run the following commands:
# create a virtual environment
python3 -m venv venv
# activate the virtual environment
source venv/bin/activate
# Use the version of the dependencies specified in the lock file
poetry lock --no-update
# Install missing and remove unreferenced packages
poetry install --sync
Note: Install the dependencies for the training using:
# Use the version of the dependencies specified in the lock file
poetry lock --no-update
# Install missing and remove unreferenced packages
poetry install --sync --with train
Note: Before running any tasks, activate the virtual environment so that the installed dependencies are available:
# activate the virtual environment
source venv/bin/activate
To deactivate the virtual environment, run:
# deactivate the virtual environment
deactivate
Run Python and download the NLTK punkt package, which the sentence tokenizer requires. You only need to download punkt once.
python <<EOF
import nltk
nltk.download('punkt')
EOF
The repo uses the following environment variable:
HF_TOKEN: To use the project, you need access to the HuggingFace 🤗 entity extraction model. Contact the administrators via [tabiya@benisis.de]. Once access is granted, create a read access token to use the model. You can find or create your read access token in your HuggingFace account settings. The backend supports the use of a .env file to set the environment variable. Create a .env file in the root directory of the backend project and set the environment variables as follows:
# .env file
HF_TOKEN=<YOUR_HF_TOKEN>
ATTENTION: The .env file should be kept secure and not shared with others as it contains sensitive information. It should not be committed to the repository.
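To check that the token is picked up before running anything heavier, a small script along these lines may help. It is a sketch under stated assumptions: `load_env_file` and `require_hf_token` are hypothetical helpers written with the standard library only, and they are not the mechanism the backend itself uses to read the .env file.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Hypothetical helper: read KEY=VALUE lines from a file into os.environ.

    Values already present in the environment are not overridden.
    Returns the key/value pairs found in the file.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def require_hf_token():
    """Return HF_TOKEN from the environment or fail with a clear message."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to .env or export it in your shell.")
    return token
```

A typical use would be to call `load_env_file()` from the backend root directory and then `require_hf_token()`. Avoid printing the token itself, since it is a secret.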
- Use the model: Instructions on how to use the inference tool.
- Use the API: Instructions on how to use the API.
- Training: Details on how to train the model.
The code and model weights are licensed under the MIT License. See the LICENSE file for details.
The datasets are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). See the DATA_LICENSE file for details.
- Location: inference/files/occupations_augmented.csv
- Source: ESCO dataset - v1.1.1
- Description: ESCO (European Skills, Competences, Qualifications and Occupations) is the European multilingual classification of Skills, Competences, and Occupations. This dataset includes information relevant to the occupations.
- License: Creative Commons Attribution 4.0 International; see DATA_LICENSE for details.
- Modifications: The columns retained are alt_label, preferred_label, esco_code, and uuid. Each alternative label has been separated into individual rows.
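The row-splitting modification described above can be sketched as follows. This is an illustration rather than the script used to build the file: the helper name `explode_alt_labels`, the newline separator between alternative labels, and the sample values (including the placeholder code and UUID) are assumptions.

```python
import csv
import io


def explode_alt_labels(rows, separator="\n"):
    """Yield one output row per alternative label found in each input row."""
    for row in rows:
        for label in row["alt_label"].split(separator):
            label = label.strip()
            if label:
                # Copy the row, replacing the combined labels with a single label.
                yield {**row, "alt_label": label}


# Made-up input row: one occupation with two alternative labels in a single cell.
source_rows = [
    {
        "alt_label": "application developer\nsoftware engineer",
        "preferred_label": "software developer",
        "esco_code": "0000.0",
        "uuid": "example-uuid-1",
    }
]

exploded = list(explode_alt_labels(source_rows))

# Write the result with the four retained columns.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["alt_label", "preferred_label", "esco_code", "uuid"])
writer.writeheader()
writer.writerows(exploded)
print(buffer.getvalue())
```

Each alternative label ends up on its own row while the preferred label, code, and UUID are repeated, which lets every surface form be matched independently.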
- Location: inference/files/skills.csv
- Source: ESCO dataset - v1.1.1
- Description: ESCO (European Skills, Competences, Qualifications and Occupations) is the European multilingual classification of Skills, Competences and Occupations. This dataset includes information relevant to the skills.
- License: Creative Commons Attribution 4.0 International; see DATA_LICENSE for details.
- Modifications: The columns retained are preferred_label and uuid.
- Location: inference/files/qualifications.csv
- Source: Official European Union EQF comparison website
- Description: This dataset contains EQF (European Qualifications Framework) relevant information extracted from the official EQF comparison website. It includes data strings, country information, and EQF levels. Non-English text was ignored.
- License: Please refer to the original source for license information.
- Modifications: Non-English text was removed, and the remaining information was formatted into a structured database.
- Location: inference/files/eval/redacted_hahu_test_with_id.csv
- Source: hahu_test
- Description: This dataset consists of 542 entries chosen at random from the 11 general classes of the classification system used by the Ethiopian hahu jobs platform. 50 entries were selected from each class to create the final dataset.
- License: Creative Commons Attribution 4.0 International; see DATA_LICENSE for details.
- Modifications: No modifications were made to the selected entries.
- Location:
- Source: Provided by Decorte et al.
- Description: The dataset includes the HOUSE and TECH extensions of the SkillSpan Dataset. In the original work by Decorte et al., the test and development entities of the SkillSpan Dataset were annotated into the ESCO model.
- License: MIT; please refer to the original source.
- Modifications: The datasets were used as provided without further modifications.
- Location: inference/files/eval/qualification_mapping.csv
- Source: Extended from the Green Benchmark Qualifications
- Description: This dataset maps the Green Benchmark Qualifications to the appropriate EQF levels. Two annotators tagged the qualifications, resulting in a Cohen's Kappa agreement of 0.45, indicating moderate agreement.
- License: Creative Commons Attribution 4.0 International; see DATA_LICENSE for details.
- Modifications: Extended the dataset to include EQF level mappings, and the annotations were verified by two annotators.
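For readers unfamiliar with the agreement statistic quoted above, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two made-up label sequences; the EQF levels shown are invented for illustration and are not taken from the dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up EQF levels assigned by two annotators to the same eight qualifications.
annotator_a = [5, 5, 6, 6, 7, 7, 6, 5]
annotator_b = [5, 6, 6, 6, 7, 5, 6, 5]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance; values between 0.41 and 0.60 are commonly read as moderate agreement.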
To use these datasets, ensure you comply with the original dataset's license and terms of use. Any modifications made should be documented and attributed appropriately in your project.
For datasets requiring access tokens, such as those from HuggingFace 🤗, please contact the maintainers to obtain a read access token.
A list of interesting and relevant reading material:
- GPT NER: GPT-NER: Named Entity Recognition via Large Language Models (Shuhe Wang)
- Skill Extraction with LLMs: Rethinking Skill Extraction in the Job Market Domain using Large Language Models (Mike Zhang)
- NER annotation with LLM: LLMs Accelerate Annotation for Medical Information Extraction
- Skills Entity Linking: Zhang, Mike, Rob van der Goot, and Barbara Plank. "Entity Linking in the Job Market Domain." arXiv preprint arXiv:2401.17979 (2024).
- Skills-ML: An open-source Python library for developing and analyzing skills and competencies from unstructured text. http://dataatwork.org/skills-ml/
- SkillSpan: Hard and Soft Skill Extraction from English Job Postings (Mike Zhang). https://arxiv.org/abs/2204.12811
- work2vec: Using the full text of data from 200 million online job postings, we train and evaluate a natural language processing (NLP) model to learn the language of jobs. We analyze how jobs have changed in the past decade, and show how different words in the posting denote different occupations. We use this approach to create novel indexes of jobs, such as work-from-home ability. In ongoing work, we quantify the return to various skills.
  https://digitaleconomy.stanford.edu/research/job2vec/ https://digitaleconomy.stanford.edu/people/sarah-h-bana/
- Data Science and ESCO: Insights into how ESCO is leveraging data-science techniques. https://esco.ec.europa.eu/en/about-esco/data-science-and-esco
- Machine Learning Assisted Mapping of Multilingual Occupational Data to ESCO: A report that discusses the multilingual mapping approach that the ESCO team established to support the maintenance of ESCO. https://esco.ec.europa.eu/en/about-esco/publications/publication/machine-learning-assisted-mapping-multilingual-occupational
- ESCO Publications: Artificial intelligence & machine learning. https://esco.ec.europa.eu/en/about-esco/publications?f%5B0%5D=theme%3A109860&page=0
