This project provides a deterministic, reversible infrastructure layer for Amharic text processing.
It consists of:
- CAR (Canonical Amharic Representation) --- strict 1→1 encoding of Ethiopic script
- AN (Amharic Normalizer v0) --- ambiguity-aware normalization pipeline
- UI Resolver (v1) --- deterministic UI lexicon contract layer
- FastAPI service wrapper --- deployable API interface
Repository layout:
- am_normalizer/ --- Core normalization and canonical encoding engine
- api/ --- FastAPI wrapper exposing HTTP endpoints
- tables/ --- Authoritative mapping tables
- resources/ --- UI lexicon and alias data
- tests/ --- Conformance and regression tests
- spec/ --- Formal specifications and design rationale
CAR encodes each Ethiopic character as:
<base><variant><order>
Where:
- base = consonant family (e.g. m, sh, t')
- variant = optional family digit (for historically distinct sets)
- order = Ethiopic order (1--7; 8 only allowlisted)
Examples:
Ethiopic → CAR
- ምን → m6n6
- የት → y1t6
- ሲመጣ → s3m1t'4
CAR is fully reversible and contains no ambiguity.
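The token shape described above can be illustrated with a small parser. This is a minimal sketch, not the project's implementation: the regular expression, the `Syllable` type, and the `parse_car` function are hypothetical names, and the sketch assumes bases are lowercase Latin letters with an optional trailing apostrophe. It does not consult the mapping tables or the order-8 allowlist.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed token shape: lowercase base letters, an optional apostrophe
# (as in t'), an optional variant digit, then a single order digit.
_SYLLABLE = re.compile(r"([a-z]+'?)(\d?)(\d)")


class Syllable(NamedTuple):
    base: str
    variant: Optional[int]
    order: int


def parse_car(car: str) -> List[Syllable]:
    """Split a CAR string into (base, variant, order) triples."""
    syllables = []
    pos = 0
    while pos < len(car):
        match = _SYLLABLE.match(car, pos)
        if match is None:
            raise ValueError(f"invalid CAR at offset {pos}: {car[pos:]!r}")
        base, variant, order = match.groups()
        # Orders run 1-7; 8 is accepted here, but the allowlist is not checked.
        if not 1 <= int(order) <= 8:
            raise ValueError(f"order out of range in {match.group(0)!r}")
        syllables.append(
            Syllable(base, int(variant) if variant else None, int(order))
        )
        pos = match.end()
    return syllables


if __name__ == "__main__":
    for word in ("m6n6", "y1t6", "s3m1t'4"):
        print(word, parse_car(word))
```

Because every syllable ends in exactly one order digit, the string splits without lookahead beyond the optional variant digit, which is what makes a strict 1→1 decoding possible.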
The normalizer converts:
- Ethiopic Unicode
- Latin transliteration (auto or strict)
- Mixed input
into canonical CAR.
Features:
- Ambiguity-aware decoding
- Confidence scoring
- Explicit alternatives (never silent guessing)
- Optional Latin-Std reversible output
The UI resolver maps user input to pinned UI lexicon entries.
Resolution order:
- Direct canonical key match (e.g., ui.auth.login)
- Alias match
- Normalization → Amharic match
- Normalization → CAR match
It guarantees deterministic mapping for UI contracts.
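The resolution order above can be sketched as a fixed chain of lookups. This is an illustrative sketch only: `LEXICON`, `ALIASES`, `resolve_ui`, and `fake_normalize` are hypothetical names, the alias entries and the CAR string are made-up sample data, and the real lexicon lives under resources/. The normalizer is passed in as a callable so the sketch does not assume the package's actual API.

```python
from typing import Callable, Dict, Tuple

# Hypothetical sample data; the CAR value is illustrative.
LEXICON: Dict[str, Dict[str, str]] = {
    "ui.auth.login": {"am": "ይግቡ", "car": "y6g6b2", "category": "auth"},
}
ALIASES: Dict[str, str] = {"login": "ui.auth.login", "sign in": "ui.auth.login"}

# A normalizer takes raw text and returns (Amharic text, CAR string).
Normalizer = Callable[[str], Tuple[str, str]]


def _hit(key: str) -> dict:
    entry = LEXICON[key]
    return {"resolved": True, "key": key, "am": entry["am"],
            "category": entry["category"]}


def resolve_ui(text: str, normalize: Normalizer) -> dict:
    query = text.strip()
    # 1. Direct canonical key match.
    if query in LEXICON:
        return _hit(query)
    # 2. Alias match (case-insensitive in this sketch).
    alias_key = ALIASES.get(query.lower())
    if alias_key is not None:
        return _hit(alias_key)
    # 3 and 4. Normalize once, then match on Amharic text, then on CAR.
    text_am, car = normalize(query)
    for key, entry in LEXICON.items():
        if entry["am"] == text_am:
            return _hit(key)
    for key, entry in LEXICON.items():
        if car and entry["car"] == car:
            return _hit(key)
    return {"resolved": False}


def fake_normalize(text: str) -> Tuple[str, str]:
    """Stand-in normalizer for the demo; not the real AN pipeline."""
    return ("ይግቡ", "y6g6b2") if text == "yigbu" else (text, "")


if __name__ == "__main__":
    print(resolve_ui("ui.auth.login", fake_normalize))
    print(resolve_ui("Login", fake_normalize))
    print(resolve_ui("yigbu", fake_normalize))
```

Each step either returns a pinned entry or falls through to the next, so the same input always takes the same path through the chain.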
From repository root:
pip install -e .
pip install -r api/requirements.txt
uvicorn api.app:app --reload
The API will be available at:
http://127.0.0.1:8000
Interactive docs:
http://127.0.0.1:8000/docs
GET /ui-lexicon returns the pinned UI lexicon.
Test:
curl http://127.0.0.1:8000/ui-lexicon
POST /normalize normalizes text into Amharic + CAR.
Example:
curl -X POST http://127.0.0.1:8000/normalize -H "Content-Type: application/json" -d '{"text":"selam","options":{"latin_mode":"auto"}}'
Response includes:
- text_am
- car
- confidence
- alternatives (if requested)
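For callers that prefer Python to curl, the same request can be built with the standard library. This is a minimal sketch: `BASE_URL`, `build_normalize_request`, and `normalize` are hypothetical helper names, the payload mirrors the curl example above, and the response is returned as parsed JSON without assuming anything about its exact shape.

```python
import json
import urllib.request

# Assumes the server is running locally as described above.
BASE_URL = "http://127.0.0.1:8000"


def build_normalize_request(text: str, latin_mode: str = "auto") -> urllib.request.Request:
    """Build the POST /normalize request without sending it."""
    payload = {"text": text, "options": {"latin_mode": latin_mode}}
    return urllib.request.Request(
        f"{BASE_URL}/normalize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def normalize(text: str, latin_mode: str = "auto") -> dict:
    """Send the request and return the decoded JSON body."""
    request = build_normalize_request(text, latin_mode)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires the API to be running; prints whatever the service returns.
    print(normalize("selam"))
```

Separating request construction from sending keeps the payload easy to inspect or unit-test without a running server.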
POST /resolve-ui resolves UI keys or aliases to pinned lexicon entries.
Example:
curl -X POST http://127.0.0.1:8000/resolve-ui -H "Content-Type: application/json" -d '{"text":"ui.auth.login"}'
Expected response:
{
  "resolved": true,
  "key": "ui.auth.login",
  "am": "ይግቡ",
  "category": "auth"
}

CORS_ORIGINS supports:
- Comma-separated: CORS_ORIGINS=http://localhost:5173,http://127.0.0.1:5173
- JSON array: CORS_ORIGINS=["http://localhost:5173","http://127.0.0.1:5173"]
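Accepting both CORS_ORIGINS formats comes down to checking whether the value looks like a JSON array before falling back to a comma split. The sketch below shows one way to do that; `parse_cors_origins` is a hypothetical helper, not necessarily how api/ parses the variable.

```python
import json
import os
from typing import List


def parse_cors_origins(raw: str) -> List[str]:
    """Parse CORS_ORIGINS given as a JSON array or a comma-separated list."""
    value = raw.strip()
    if not value:
        return []
    if value.startswith("["):
        # JSON array form: ["http://a","http://b"]
        parsed = json.loads(value)
        if not isinstance(parsed, list) or not all(isinstance(o, str) for o in parsed):
            raise ValueError("CORS_ORIGINS JSON must be an array of strings")
        return [origin.strip() for origin in parsed if origin.strip()]
    # Comma-separated form: http://a,http://b
    return [part.strip() for part in value.split(",") if part.strip()]


if __name__ == "__main__":
    print(parse_cors_origins(os.environ.get("CORS_ORIGINS", "")))
```

The resulting list is the kind of value that can be handed to a CORS middleware's allowed-origins setting.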
Run full test suite:
python -m pytest
All tests must pass before tagging releases.
MIT License (2025 Simachew Mengiste)
Project status:
- CAR v0 --- complete and bijective
- AN v0 --- stable
- UI Resolver v1 --- deterministic and pinned
- API wrapper --- deployable
This project forms a foundational infrastructure layer for Amharic NLP, localization, and AI-assisted writing systems.