Data collection and preprocessing stack for African job-market intelligence.
This repository includes:
- A scraper notebook that pulls job postings from Afriwork and Hahu Jobs.
- A production preprocessing pipeline that transforms raw CSV job rows into structured ML-ready features.
```
Data-ML/
  afriwork_hahu_scraper.ipynb
  requirements.txt
  Job_pipeline/
    README.md
    run_preprocessing_pipeline.py
    data/
      raw/
      processed/
      aggregated/
    preprocessing/
      clean_text.py
      job_id.py
      date_features.py
      title_normalization.py
      description_embedding.py
      location_extraction.py
      remote_detection.py
      job_type_extraction.py
      education_extraction.py
      skills_extraction.py
      semantic_utils.py
      gemini_key_selector.py
      unified_preprocessor.py
    taxonomy/
      roles.json
      skills.json
    tests/
      test_step1_clean_text.py
      ...
      test_step10_skills_extraction.py
      test_pipeline_target_features.py
```
Use `afriwork_hahu_scraper.ipynb` to pull fresh source data.
| Source | Endpoint | Auth |
|---|---|---|
| Afriwork - all jobs | https://api.afriworket.com/v1/graphql | None (anonymous) |
| Afriwork - software/tech jobs | https://api.afriworket.com/v1/graphql | Optional Bearer token |
| Hahu Jobs - tech sector | https://graph.aggregator.hahu.jobs/v1/graphql | None |
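Both sources expose GraphQL endpoints, so a raw pull is one POST request per query. Below is a minimal, hypothetical sketch of such a request against the Hahu endpoint using `requests`; the query document and field names are assumptions, not the notebook's actual queries.

```python
# Minimal GraphQL pull sketch. The query and field names are assumptions;
# see the notebook for the real query documents.
import requests

HAHU_URL = "https://graph.aggregator.hahu.jobs/v1/graphql"

QUERY = """
query TechJobs {
  jobs(limit: 50) {
    id
    title
    description
  }
}
"""

def fetch_jobs() -> dict:
    # No auth header: the Hahu endpoint is listed as anonymous above.
    resp = requests.post(HAHU_URL, json={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    return resp.json()
```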
- Open Google Colab and load `afriwork_hahu_scraper.ipynb`.
- Run all notebook cells in sequence.
- Save CSV outputs to Drive or local Colab storage.
Expected raw outputs:
- `afriwork_all_jobs_YYYYMMDD_HHMMSS.csv`
- `afriwork_tech_jobs_YYYYMMDD_HHMMSS.csv`
- `hahu_tech_jobs_YYYYMMDD_HHMMSS.csv`
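The `YYYYMMDD_HHMMSS` suffix is a per-run timestamp. The helper below is an illustrative sketch of producing such a name, not the notebook's actual code:

```python
# Illustrative only: builds a timestamped CSV filename matching the
# pattern above. The notebook's naming code may differ.
from datetime import datetime

def timestamped_name(prefix: str) -> str:
    # e.g. "afriwork_all_jobs_20240131_142255.csv"
    return f"{prefix}_{datetime.now():%Y%m%d_%H%M%S}.csv"
```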
The preprocessing system is implemented under `Job_pipeline/` and runs Step 1 to Step 10 end-to-end.
- Deterministic text cleaning, ID generation, and date feature engineering.
- Semantic embeddings (`sentence-transformers`) for title/skill matching and description vectors.
- Rule/regex extraction for location, remote mode, job type, and education.
- Optional Gemini fallbacks when confidence is low.
- Time-seeded random Gemini key selection through environment configuration (see the sketch after this list).
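As a concrete illustration of the key selection step, here is a minimal sketch assuming keys arrive as a comma-separated environment variable; the variable name `GEMINI_API_KEYS` and the exact seeding scheme are assumptions, not necessarily what `gemini_key_selector.py` does.

```python
# Hypothetical sketch of time-seeded key selection; the env var name and
# seeding scheme are assumptions, not gemini_key_selector.py's real code.
import os
import random
import time

def select_gemini_key() -> str:
    """Pick one Gemini API key from a comma-separated env var, seeding the
    RNG with the current time so repeated batch runs spread load across keys."""
    keys = [k.strip() for k in os.environ.get("GEMINI_API_KEYS", "").split(",") if k.strip()]
    if not keys:
        raise RuntimeError("GEMINI_API_KEYS is not set")
    rng = random.Random(time.time_ns())
    return rng.choice(keys)
```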
Each processed row includes:
`year_month`, `timestamp`, `month`, `holiday_flag`, `job_id`, `job_title`, `normalized_title`, `DescriptionVec`, `city`, `region`, `country`, `is_remote`, `job_type`, `education_level`, `skills`
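For orientation, a hypothetical processed row might look like the following; all values are illustrative and `DescriptionVec` is truncated for readability.

```python
# Illustrative example of one processed row; values are made up and the
# embedding vector is truncated.
row = {
    "year_month": "2024-01",
    "timestamp": "2024-01-15T09:30:00Z",
    "month": 1,
    "holiday_flag": 0,
    "job_id": "a3f9c2e1",
    "job_title": "Senior Backend Engineer",
    "normalized_title": "backend engineer",
    "DescriptionVec": [0.021, -0.134, 0.087],  # truncated
    "city": "Addis Ababa",
    "region": "Addis Ababa",
    "country": "Ethiopia",
    "is_remote": False,
    "job_type": "full_time",
    "education_level": "bachelor",
    "skills": ["python", "postgresql", "docker"],
}
```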
From repository root:
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

Main dependencies: `requests`, `pydantic`, `google-genai`, `sentence-transformers`, `spacy`, `holidays`, `rapidfuzz`.
From repository root:
```bash
python Job_pipeline/run_preprocessing_pipeline.py
```

Optional flags (a parsing sketch follows the behavior notes):

```bash
python Job_pipeline/run_preprocessing_pipeline.py --max-rows 100
python Job_pipeline/run_preprocessing_pipeline.py --enable-gemini-fallback
python Job_pipeline/run_preprocessing_pipeline.py --raw-dir Job_pipeline/data/raw --processed-dir Job_pipeline/data/processed
```

Behavior notes:
- Input files are read from `Job_pipeline/data/raw/`.
- Output files are written to `Job_pipeline/data/processed/` with the same filenames.
- Gemini fallback is disabled by default for stable, high-throughput batch runs.
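A minimal sketch of how the documented flags could be wired with `argparse`; the actual `run_preprocessing_pipeline.py` may differ, and the defaults shown are assumptions inferred from the behavior notes above.

```python
# Hypothetical flag wiring for run_preprocessing_pipeline.py; defaults
# mirror the behavior notes above but are assumptions, not the real code.
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run Steps 1-10 of the job preprocessing pipeline.")
    parser.add_argument("--max-rows", type=int, default=None,
                        help="Process at most this many rows (all rows if omitted).")
    parser.add_argument("--enable-gemini-fallback", action="store_true",
                        help="Allow Gemini calls when confidence is low (off by default).")
    parser.add_argument("--raw-dir", default="Job_pipeline/data/raw",
                        help="Directory containing raw input CSVs.")
    parser.add_argument("--processed-dir", default="Job_pipeline/data/processed",
                        help="Directory for processed CSVs (same filenames as inputs).")
    return parser.parse_args()
```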
```bash
python -m unittest discover -s Job_pipeline/tests -v
```

For full pipeline module-level documentation, see `Job_pipeline/README.md`.
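As an illustration of the test layout, a hypothetical case for Step 1 might look like the following; the real assertions live in `Job_pipeline/tests/test_step1_clean_text.py`, and the `clean_text()` import path and signature are assumptions.

```python
# Hypothetical test shape; the clean_text() import path and signature
# are assumptions based on the repository layout above.
import unittest

from preprocessing.clean_text import clean_text  # assumed module path

class TestCleanTextStep1(unittest.TestCase):
    def test_output_has_no_surrounding_whitespace(self):
        # Whatever cleaning rules apply, leading/trailing whitespace
        # should not survive a deterministic cleaning step.
        cleaned = clean_text("  Software Engineer \n")
        self.assertEqual(cleaned, cleaned.strip())

if __name__ == "__main__":
    unittest.main()
```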