Tabiya Livelihoods Classifier

The Tabiya Livelihoods Classifier provides an easy-to-use implementation of the entity-linking paradigm to support job description heuristics. Using state-of-the-art transformer neural networks, the tool extracts five entity types: Occupation, Skill, Qualification, Experience, and Domain. For Occupation and Skill entities, it also retrieves the matching ESCO entries. The procedure consists of two discrete steps: entity extraction, followed by a vector similarity search.
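
To make the second step more concrete, here is a minimal sketch of a vector similarity search. It is not the classifier's implementation: the bag-of-words `embed` function stands in for the transformer embeddings, and the function names and the label list are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy embedding: bag-of-words counts. This is an assumption for the sketch;
    # the real tool relies on transformer models instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_entity(span, labels, top_k=1):
    # Rank candidate labels by similarity to the extracted span.
    query = embed(span)
    ranked = sorted(labels, key=lambda label: cosine(query, embed(label)), reverse=True)
    return ranked[:top_k]


# Hypothetical occupation labels; the extracted span is assumed to come from step one.
labels = ["software developer", "data analyst", "primary school teacher"]
best = link_entity("senior software developer", labels)
```

Here `best` holds the label closest to the extracted span. Swapping `embed` for a real sentence encoder would keep the ranking logic unchanged.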

Architecture

See ARCHITECTURE.md for a full overview of the system design, services, and infrastructure.

Deployed Environments

Environments follow the pattern <env>.classifier.tabiya.tech. The dev environment is used as an example below.

Resource              URL
API                   https://dev.classifier.tabiya.tech
App (dashboard)       https://app.dev.classifier.tabiya.tech
Docs                  https://docs.dev.classifier.tabiya.tech
API health check      https://dev.classifier.tabiya.tech/v1/health
Classify Swagger UI   https://dev.classifier.tabiya.tech/docs
NER Swagger UI        https://dev.classifier.tabiya.tech/docs/ner
NEL Swagger UI        https://dev.classifier.tabiya.tech/docs/nel
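
The URL pattern can be expressed as a small helper. This is a sketch that assumes every environment mirrors the layout of the dev environment listed above; the function names are hypothetical, and the shape of the health check response is not documented here, so only the HTTP status code is returned.

```python
from urllib.request import urlopen

BASE_DOMAIN = "classifier.tabiya.tech"


def environment_urls(env):
    # Build the resource URLs for an environment name such as "dev".
    # Assumption: other environments use the same subdomains and paths as dev.
    api = f"https://{env}.{BASE_DOMAIN}"
    return {
        "api": api,
        "app": f"https://app.{env}.{BASE_DOMAIN}",
        "docs": f"https://docs.{env}.{BASE_DOMAIN}",
        "health": f"{api}/v1/health",
        "classify_swagger": f"{api}/docs",
        "ner_swagger": f"{api}/docs/ner",
        "nel_swagger": f"{api}/docs/nel",
    }


def check_health(env, timeout=10):
    # Send a GET request to the health endpoint and return the HTTP status code.
    # Not called in this sketch because it needs network access.
    with urlopen(environment_urls(env)["health"], timeout=timeout) as response:
        return response.status


urls = environment_urls("dev")
```

Calling `check_health("dev")` from a machine with network access would then report whether the API responds.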

Installation

Prerequisites

Using Git LFS

This repository uses Git LFS for handling large files. Before you can use this repository, you need to install and set up Git LFS on your local machine. See https://git-lfs.com/ for installation instructions.

After Git LFS is set up, clone the repository:

git clone https://github.com/tabiya-tech/tabiya-livelihoods-classifier.git

If you already cloned the repository without Git LFS, run:

git lfs pull

Install the dependencies

Set up virtualenv

In the root directory of the backend project (the directory that contains this README file), run the following commands:

# create a virtual environment
python3 -m venv venv

# activate the virtual environment
source venv/bin/activate
# Use the version of the dependencies specified in the lock file
poetry lock --no-update
# Install missing and remove unreferenced packages
poetry install --sync

Note: To also install the dependencies needed for training, run:

# Use the version of the dependencies specified in the lock file
poetry lock --no-update
# Install missing and remove unreferenced packages
poetry install --sync --with train

Note: Before running any tasks, activate the virtual environment so that the installed dependencies are available:

# activate the virtual environment
source venv/bin/activate

To deactivate the virtual environment, run:

# deactivate the virtual environment
deactivate

Run Python and download the NLTK punkt data, which the sentence tokenizer requires. You only need to download punkt once.

python <<EOF
import nltk
nltk.download('punkt')
EOF

Environment Variable & Configuration

The repo uses the following environment variable:

  • HF_TOKEN: A HuggingFace 🤗 read access token. To use the project, you need access to the HuggingFace entity extraction model; contact the administrators via [tabiya@benisis.de] to request it. Once you have access, find or create a read access token in your HuggingFace account settings.

The backend supports a .env file for setting the environment variable. Create a .env file in the root directory of the backend project and set the variable as follows:

# .env file
HF_TOKEN=<YOUR_HF_TOKEN>

ATTENTION: The .env file should be kept secure and not shared with others as it contains sensitive information. It should not be committed to the repository.
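
For scripts that run outside the backend, the following sketch shows one way a .env file can be read using only the standard library. It is an illustration, not the backend's own loader; the helper names `load_env_file` and `get_hf_token` are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    # Copy KEY=VALUE pairs into os.environ without overriding existing values.
    env_path = Path(path)
    if not env_path.is_file():
        return
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_hf_token():
    # Return the token or fail early with a clear message.
    load_env_file()
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to .env or export it in your shell")
    return token
```

Failing early keeps a missing token from surfacing later as a less obvious download error.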

Model Architecture

[Figure: Model architecture]

License

The code and model weights are licensed under the MIT License. See the LICENSE file for details.

The datasets are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). See the DATA_LICENSE file for details.

Datasets

Occupations

  • Location: inference/files/occupations_augmented.csv
  • Source: ESCO dataset - v1.1.1
  • Description: ESCO (European Skills, Competences, Qualifications and Occupations) is the European multilingual classification of Skills, Competences, and Occupations. This dataset includes information relevant to the occupations.
  • License: Creative Commons Attribution 4.0 International (CC BY 4.0); see DATA_LICENSE for details.
  • Modifications: The columns retained are alt_label, preferred_label, esco_code, and uuid. Each alternative label has been split into its own row.
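
The row-splitting modification can be sketched as follows. The column names come from the description above, but the newline separator inside alt_label, the helper name, and the sample values are assumptions made for illustration; this is not the script that produced the file.

```python
import csv
import io


def explode_alt_labels(rows, separator="\n"):
    # Yield one output row per alternative label, copying the other columns.
    for row in rows:
        for alt in row["alt_label"].split(separator):
            alt = alt.strip()
            if alt:
                yield {**row, "alt_label": alt}


# Made-up sample row; the esco_code and uuid values are placeholders.
raw = (
    "preferred_label,esco_code,uuid,alt_label\n"
    'software developer,0000.0,example-uuid,"programmer\nsoftware engineer"\n'
)
exploded = list(explode_alt_labels(csv.DictReader(io.StringIO(raw))))
```

Each entry in `exploded` keeps the original preferred_label, esco_code, and uuid, so every alternative label can be matched back to its occupation.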

Skills

  • Location: inference/files/skills.csv
  • Source: ESCO dataset - v1.1.1
  • Description: ESCO (European Skills, Competences, Qualifications and Occupations) is the European multilingual classification of Skills, Competences and Occupations. This dataset includes information relevant to the skills.
  • License: Creative Commons Attribution 4.0 International (CC BY 4.0); see DATA_LICENSE for details.
  • Modifications: The columns retained are preferred_label and uuid.

Qualifications

  • Location: inference/files/qualifications.csv
  • Source: Official European Union EQF comparison website
  • Description: This dataset contains EQF (European Qualifications Framework) relevant information extracted from the official EQF comparison website. It includes data strings, country information, and EQF levels. Non-English text was ignored.
  • License: Please refer to the original source for license information.
  • Modifications: Non-English text was removed, and the remaining information was formatted into a structured database.

Hahu Test

  • Location: inference/files/eval/redacted_hahu_test_with_id.csv
  • Source: hahu_test
  • Description: This dataset consists of 542 entries chosen at random from the 11 general classes of the classification system used by the Ethiopian hahu jobs platform. 50 entries were selected from each class to create the final dataset.
  • License: Creative Commons Attribution 4.0 International (CC BY 4.0); see DATA_LICENSE for details.
  • Modifications: No modifications were made to the selected entries.

House and Tech

Qualification Mapping

  • Location: inference/files/eval/qualification_mapping.csv
  • Source: Extended from the Green Benchmark Qualifications
  • Description: This dataset maps the Green Benchmark Qualifications to the appropriate EQF levels. Two annotators tagged the qualifications, resulting in a Cohen's Kappa agreement of 0.45, indicating moderate agreement.
  • License: Creative Commons Attribution 4.0 International (CC BY 4.0); see DATA_LICENSE for details.
  • Modifications: Extended the dataset to include EQF level mappings, and the annotations were verified by two annotators.
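
For readers unfamiliar with the agreement statistic mentioned above, here is a minimal sketch of how Cohen's Kappa is computed for two annotators. The EQF levels in the example are invented and are unrelated to the actual annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    # Cohen's Kappa: (observed agreement - chance agreement) / (1 - chance agreement).
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented EQF levels assigned by two annotators to six qualifications.
annotator_a = [4, 4, 5, 6, 6, 7]
annotator_b = [4, 5, 5, 6, 7, 7]
kappa = cohens_kappa(annotator_a, annotator_b)  # 4/7, roughly 0.57
```

In this toy example the annotators agree on four of six items (observed agreement 2/3) while chance agreement is 2/9, which gives a kappa of 4/7.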

Access and Usage

To use these datasets, ensure you comply with the original dataset's license and terms of use. Any modifications made should be documented and attributed appropriately in your project.

For datasets requiring access tokens, such as those from HuggingFace 🤗, please contact the maintainers to obtain a read access token.
