Name Origin — Country Likelihood from Forenames and Surnames

An MVP pipeline that takes a forename and surname and returns a probability distribution over the countries where that name is likely to originate.

Example — given "John Smith", the system might return:

Country	Probability
United Kingdom	0.85
United States	0.10
Australia	0.05

How it works

Input: forename + surname
       ↓
Normalise (lowercase, strip accents, collapse hyphens)
       ↓
Transliterate if non-Latin script (Cyrillic, Arabic, …)
       ↓
Lookup in forename + surname tables → candidate countries
       ↓
LLM ranks candidates by probability
       ↓
Output: ranked country list with probabilities
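
The normalisation step can be sketched with the standard library alone. This is a minimal illustration, not the code in this repository: the function name `normalise_name` is hypothetical, and "collapse hyphens" is assumed here to mean that runs of hyphens and whitespace become a single space.

```python
import re
import unicodedata


def normalise_name(name: str) -> str:
    """Lowercase, strip accents, and collapse hyphens/whitespace (assumed behaviour)."""
    # Decompose accented characters, e.g. "é" becomes "e" plus a combining accent
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base letters
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: any run of hyphens or whitespace is replaced by one space
    collapsed = re.sub(r"[-\s]+", " ", stripped)
    return collapsed.strip().lower()
```

With this definition, `normalise_name("José")` gives `"jose"` and `normalise_name("Anne-Marie")` gives `"anne marie"`.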

The lookup tables are built by intersecting two public datasets:

Dataset	What it provides
sigpwned/popular-names-by-country	Popular forenames and surnames per country
names-dataset (PyPI)	Large first/last name corpus with country distributions

Only names confirmed by both sources are kept, reducing noise.
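
As a rough sketch of the intersection idea, the snippet below keeps only entries that appear in both sources. It assumes each source has already been reduced to a set of (name, country) pairs and that the intersection is taken at the pair level. The actual build script may intersect differently, and `intersect_sources` and the toy data are hypothetical.

```python
from collections import defaultdict


def intersect_sources(source_a, source_b):
    """Keep only (name, country) pairs present in both sources."""
    confirmed = set(source_a) & set(source_b)
    table = defaultdict(set)
    for name, country in confirmed:
        table[name].add(country)
    return dict(table)


# Toy data for illustration only, not taken from either dataset
popular_names = {("silva", "Portugal"), ("silva", "Brazil"), ("lee", "Korea")}
names_dataset = {("silva", "Portugal"), ("silva", "Brazil"), ("lee", "China")}

lookup = intersect_sources(popular_names, names_dataset)
```

In this toy case "silva" survives with both countries, while "lee" is dropped because the two sources disagree on its country.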

Handling ambiguity

Many names are common across multiple countries due to linguistic, historical, or colonial overlap:

Name	Possible Countries
Silva	Portugal, Brazil, Mozambique
Lee	China, Korea, United States
Sofia	Italy, Spain, Bulgaria

The LLM is given the full candidate list and asked to rank by likelihood, allowing multi-country output.
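
One way the ranking step could be wired up is sketched below. The prompt wording, both function names, and the idea of rescaling the model's scores so they sum to 1 are assumptions made for illustration. The snippet does not call any model and is not the prompt used in main.py.

```python
def build_ranking_prompt(forename, surname, candidates):
    """Format a prompt asking a model to score each candidate country (hypothetical wording)."""
    listing = "\n".join(f"- {country}" for country in candidates)
    return (
        f"Name: {forename} {surname}\n"
        f"Candidate countries:\n{listing}\n"
        "Assign each country a probability that the name originates there. "
        "The probabilities should sum to 1."
    )


def normalise_scores(scores):
    """Rescale non-negative scores so they sum to 1, sorted from high to low."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(country, value / total) for country, value in ranked]
```

Rescaling after the fact is a defensive choice: model output is not guaranteed to sum to exactly 1, so the final list is normalised before it is reported.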

Transliteration

Names from non-Latin scripts are romanised before lookup:

Cyrillic: Михаил → Mikhail
Arabic: محمد → Muhammad
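
To illustrate the Cyrillic case, a character-level mapping is often enough for a first pass. The table below is a partial, hand-written Russian mapping chosen for this example, and `romanise_cyrillic` is a hypothetical name. The project may well rely on a dedicated transliteration library instead. Arabic generally needs more than a character table, because short vowels are usually not written.

```python
# Partial Russian Cyrillic to Latin table (illustrative, not exhaustive)
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "ts", "ч": "ch",
    "ш": "sh", "щ": "shch", "ы": "y", "э": "e", "ю": "yu", "я": "ya",
    "ь": "", "ъ": "",
}


def romanise_cyrillic(text):
    """Replace each mapped Cyrillic letter with its Latin equivalent."""
    out = []
    for ch in text:
        latin = CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)  # leave unmapped characters unchanged
        elif ch.isupper():
            out.append(latin.capitalize())  # preserve initial capitals
        else:
            out.append(latin)
    return "".join(out)
```

Under this table, `romanise_cyrillic("Михаил")` returns `"Mikhail"`, matching the example above.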

Usage

Install dependencies

uv sync

Run inference

uv run python main.py

Edit main.py lines 258–260 to change the input name.

Run evaluation

uv run python evaluate_method.py --test-size 0.2

Requires HF_TOKEN set in a .env file (HuggingFace inference token):

HF_TOKEN=hf_...
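
For readers unfamiliar with .env files, the following standard-library sketch shows roughly how such a token might be picked up at start-up. It is an assumption-laden illustration: the repository may use a package such as python-dotenv instead, and `load_env_file` and `require_hf_token` are hypothetical helpers.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE lines from a .env file into os.environ if not already set."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))


def require_hf_token():
    """Return HF_TOKEN or fail early with a readable message."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to .env or the environment")
    return token
```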

Optional flags:

# Limit samples (faster for testing)
uv run python evaluate_method.py --max-samples 50

# Control number of distractor countries added per name (default 5)
uv run python evaluate_method.py --n-distractors 5

Evaluation methodology

The evaluation tests the LLM's ability to identify the correct country of origin for a name, given a mixed candidate list.

Setup (80/20 train/test split on unique names):

1. Split the lookup tables into train (80%) and test (20%) by name. Names are deduplicated before splitting, so no name appears in both sets.
2. For each test name, take its true countries and add N random distractor countries sampled from the full country pool.
3. Pass the shuffled candidate list (true + distractors) to the LLM.
4. Mark as correct if the LLM's top-ranked country is in the true country list.
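
The setup above can be sketched in a few functions. This is a simplified illustration rather than the contents of evaluate_method.py. All function names are hypothetical, the fixed seeds are there only for reproducibility, and distractors are assumed to be drawn from countries that are not already true answers.

```python
import random


def split_names(names, test_size=0.2, seed=0):
    """Deduplicate names, shuffle, and split into (train, test)."""
    unique = sorted(set(names))  # deduplicate before splitting
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_size))
    return unique[n_test:], unique[:n_test]


def build_candidates(true_countries, country_pool, n_distractors=5, rng=None):
    """Mix the true countries with random distractors and shuffle the result."""
    rng = rng or random.Random(0)
    # Assumption: distractors exclude the true countries
    available = sorted(set(country_pool) - set(true_countries))
    distractors = rng.sample(available, min(n_distractors, len(available)))
    candidates = list(true_countries) + distractors
    rng.shuffle(candidates)
    return candidates, distractors


def top1_accuracy(predictions, truth):
    """Fraction of names whose top-ranked country is in the true country set."""
    hits = sum(1 for name, top in predictions.items() if top in truth[name])
    return hits / len(predictions)
```

Here `predictions` maps each test name to the model's top-ranked country and `truth` maps each name to its set of true countries, so a name with several true countries is counted as correct if any one of them is ranked first.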

This avoids the trivial 100%-accuracy result that occurs when the candidate list contains only the correct answers.

Output files:

File	Contents
`evaluation_results.csv`	Per-name predictions, true countries, distractors used, correctness
`evaluation_summary.txt`	Aggregate accuracy and coverage statistics

Limitations

Data coverage: only names present in both source datasets are included. Rare or regional names may be missing.
No single ground truth: the dataset records every country where a name is common, not a single "primary" country, so a prediction counts as correct when the top-1 country is in the true country set.
Transliteration variants: Muhammad, Mohamed, Mohamad are treated as separate entries. Variant normalisation is not exhaustive.
LLM dependency: results depend on the model used (currently CohereLabs/command-a-reasoning-08-2025 via HuggingFace router). Changing the model may affect accuracy.

Future enhancements

Scrape official national statistics sites for more comprehensive name coverage.
Add Hobson/surname-nationality as a third data source.
Weight country probabilities by name frequency, not just presence.
Evaluate with an independent held-out dataset for more reliable accuracy estimates.
