An MVP pipeline that takes a forename and surname and returns a probability distribution over the countries where that name is likely to originate.
Example — given "John Smith", the system might return:
| Country | Probability |
|---|---|
| United Kingdom | 0.85 |
| United States | 0.10 |
| Australia | 0.05 |
Input: forename + surname
↓
Normalise (lowercase, strip accents, collapse hyphens)
↓
Transliterate if non-Latin script (Cyrillic, Arabic, …)
↓
Lookup in forename + surname tables → candidate countries
↓
LLM ranks candidates by probability
↓
Output: ranked country list with probabilities
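The normalisation step can be sketched with the standard library alone. This is a minimal illustration, not the code in main.py: the function name `normalise_name` is hypothetical, and it assumes that "collapse hyphens" means treating a hyphen as a space.

```python
import re
import unicodedata


def normalise_name(name: str) -> str:
    """Lowercase, strip accents, and collapse hyphens in a name."""
    # Decompose accented characters (e.g. "é" -> "e" + combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Assumption: "collapse hyphens" means treating a hyphen as a space.
    spaced = without_accents.replace("-", " ")
    # Squeeze whitespace runs, trim the ends, and lowercase.
    return re.sub(r"\s+", " ", spaced).strip().lower()


print(normalise_name("José"))         # jose
print(normalise_name("Jean-Pierre"))  # jean pierre
```

Letters that have no Unicode decomposition, such as ø or ł, pass through this sketch unchanged and would need an explicit mapping if they should be folded to ASCII.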
The lookup tables are built by intersecting two public datasets:
| Dataset | What it provides |
|---|---|
| sigpwned/popular-names-by-country | Popular forenames and surnames per country |
| names-dataset (PyPI) | Large first/last name corpus with country distributions |
Only names confirmed by both sources are kept, reducing noise.
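As a rough sketch of the intersection, suppose each source has already been loaded into a dict mapping a normalised name to a set of countries. That layout, the function name `intersect_tables`, and the choice to intersect at the (name, country) level are all assumptions for illustration; the toy data below is made up and not taken from either dataset.

```python
def intersect_tables(source_a, source_b):
    """Keep only (name, country) pairs that appear in both sources."""
    merged = {}
    # Only names present in both dicts are considered at all.
    for name in source_a.keys() & source_b.keys():
        shared = source_a[name] & source_b[name]
        # Drop names whose country sets do not overlap.
        if shared:
            merged[name] = shared
    return merged


# Toy inputs for illustration only.
table_a = {"silva": {"Portugal", "Brazil"}, "lee": {"Korea"}}
table_b = {"silva": {"Brazil", "Mozambique", "Portugal"}, "sofia": {"Italy"}}

print(intersect_tables(table_a, table_b))
```

Here only "silva" survives, with the two countries both toy sources agree on.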
Many names are common across multiple countries due to linguistic, historical, or colonial overlap:
| Name | Possible Countries |
|---|---|
| Silva | Portugal, Brazil, Mozambique |
| Lee | China, Korea, United States |
| Sofia | Italy, Spain, Bulgaria |
The LLM is given the full candidate list and asked to rank by likelihood, allowing multi-country output.
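One way to picture the candidate step is shown below. It is a sketch under stated assumptions: the candidate list is taken to be the union of the forename and surname lookups, and the LLM is assumed to return a non-negative score per country that is then rescaled into probabilities. Both function names are hypothetical and the LLM call itself is left out.

```python
def candidate_countries(forename, surname, forename_table, surname_table):
    """Union of the countries listed for the forename and for the surname."""
    countries = forename_table.get(forename, set()) | surname_table.get(surname, set())
    return sorted(countries)


def to_distribution(scores):
    """Rescale non-negative scores so they sum to 1, highest first."""
    total = sum(scores.values())
    if total <= 0:
        return {}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {country: score / total for country, score in ranked}


forenames = {"sofia": {"Italy", "Spain", "Bulgaria"}}
surnames = {"silva": {"Portugal", "Brazil", "Mozambique"}}
print(candidate_countries("sofia", "silva", forenames, surnames))

# Hypothetical scores standing in for an LLM response.
print(to_distribution({"Portugal": 2.0, "Brazil": 6.0}))
```

A union keeps recall high for mixed-origin names; an intersection would be stricter but would return nothing whenever the forename and surname share no country.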
Names from non-Latin scripts are romanised before lookup:
- Cyrillic: Михаил → Mikhail
- Arabic: محمد → Muhammad
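The idea can be illustrated with a script check plus a character table. This is only a sketch: the mapping `CYRILLIC_TO_LATIN` is a hypothetical, deliberately partial table, and the project may well use a dedicated transliteration library instead.

```python
import unicodedata

# Hypothetical, deliberately partial Cyrillic-to-Latin table for illustration.
CYRILLIC_TO_LATIN = {
    "а": "a", "в": "v", "е": "e", "и": "i", "й": "y", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "р": "r", "с": "s",
    "т": "t", "х": "kh",
}


def is_non_latin(text: str) -> bool:
    """True if any letter in the text is outside the Latin script."""
    return any(
        ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
        for ch in text
    )


def romanise_cyrillic(text: str) -> str:
    """Map each Cyrillic letter through the table, keeping capitalisation."""
    out = []
    for ch in text:
        latin = CYRILLIC_TO_LATIN.get(ch.lower(), ch)
        out.append(latin.capitalize() if ch.isupper() else latin)
    return "".join(out)


print(is_non_latin("Михаил"))       # True
print(romanise_cyrillic("Михаил"))  # Mikhail
```

A per-character table like this is not enough for Arabic, where short vowels are usually not written, so a spelling such as Muhammad cannot be recovered letter by letter.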
uv sync
uv run python main.py

Edit main.py lines 258–260 to change the input name.
uv run python evaluate_method.py --test-size 0.2

Requires HF_TOKEN set in a .env file (HuggingFace inference token):
HF_TOKEN=hf_...
Optional flags:
# Limit samples (faster for testing)
uv run python evaluate_method.py --max-samples 50
# Control number of distractor countries added per name (default 5)
uv run python evaluate_method.py --n-distractors 5

The evaluation tests the LLM's ability to identify the correct country of origin for a name, given a mixed candidate list.
Setup (80/20 train/test split on unique names):
- Split the lookup tables into train (80%) and test (20%) by name — names are deduplicated before splitting so no name appears in both sets.
- For each test name, take its true countries and add N random distractor countries sampled from the full country pool.
- Pass the shuffled candidate list (true + distractors) to the LLM.
- Mark as correct if the LLM's top-ranked country is in the true country list.
This avoids the trivial 100%-accuracy result that occurs when the candidate list contains only the correct answers.
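The setup above can be sketched as three small helpers. This is an illustration rather than the contents of evaluate_method.py: the function names are hypothetical, and it assumes distractors are drawn only from countries that are not true answers for the name.

```python
import random


def split_names(names, test_size=0.2, seed=0):
    """Deduplicate names, shuffle, and split into (train, test)."""
    unique = sorted(set(names))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_size))
    return unique[n_test:], unique[:n_test]


def build_candidates(true_countries, country_pool, n_distractors=5, rng=None):
    """Mix the true countries with random distractors and shuffle."""
    rng = rng or random.Random(0)
    # Assumption: distractors never overlap with the true countries.
    pool = sorted(set(country_pool) - set(true_countries))
    distractors = rng.sample(pool, min(n_distractors, len(pool)))
    candidates = sorted(true_countries) + distractors
    rng.shuffle(candidates)
    return candidates, distractors


def top1_accuracy(predictions, truth):
    """Share of names whose top-ranked country is in the true set."""
    hits = sum(1 for name, top in predictions.items() if top in truth[name])
    return hits / len(predictions) if predictions else 0.0


# Toy example: one correct and one incorrect top-1 prediction.
truth = {"silva": {"Portugal", "Brazil"}, "lee": {"China", "Korea"}}
print(top1_accuracy({"silva": "Brazil", "lee": "Italy"}, truth))  # 0.5
```

Fixing the random seed keeps the split and the distractor draw reproducible between runs, which makes accuracy figures comparable when the model changes.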
Output files:
| File | Contents |
|---|---|
| `evaluation_results.csv` | Per-name predictions, true countries, distractors used, correctness |
| `evaluation_summary.txt` | Aggregate accuracy and coverage statistics |
- Data coverage: only names present in both source datasets are included. Rare or regional names may be missing.
- No single ground truth: the dataset records every country where a name is common rather than one "primary" country, so accuracy is measured as top-1 prediction ∈ true country set.
- Transliteration variants: `Muhammad`, `Mohamed`, and `Mohamad` are treated as separate entries. Variant normalisation is not exhaustive.
- LLM dependency: results depend on the model used (currently `CohereLabs/command-a-reasoning-08-2025` via HuggingFace router). Changing the model may affect accuracy.
- Scrape official national statistics sites for more comprehensive name coverage.
- Add Hobson/surname-nationality as a third data source.
- Weight country probabilities by name frequency, not just presence.
- Evaluate with an independent held-out dataset for more reliable accuracy estimates.