Bi-directional LSTM tagger with subword embeddings for EmpiriST/PostWITA datasets. The project now targets Python 3.12+ and modern TensorFlow/Keras, and uses uv for dependency management.
- Install dependencies: `uv sync`
- Configure embedding paths in `tagger.ini` or via env vars: `TAGGER_W2V_SMALL=/path/to/small.vec` and `TAGGER_W2V_BIG=/path/to/big.vec`
- Train: `uv run tagger train --task postwita --config tagger.ini --epochs 5`
- Predict: `uv run tagger predict --model-path artifacts/tagger.keras --output-ext .pred`
- `src/tagger/`: package with config handling, data utilities, and the BiLSTM model.
- `tagger.ini`: optional config; env vars prefixed with `TAGGER_` override its values.
- `pyproject.toml`: runtime/dev dependencies; `uv.lock` is produced by `uv sync`.
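
The way `TAGGER_`-prefixed env vars override values from `tagger.ini` can be sketched roughly as follows. This is an illustration, not the package's actual loader: the function name `load_settings`, the `[embeddings]` section, and the key names `w2v_small` and `w2v_big` are assumptions chosen to mirror the env var names above.

```python
import configparser
import os

ENV_PREFIX = "TAGGER_"


def load_settings(ini_text: str, environ=None) -> dict[str, str]:
    """Merge INI values with TAGGER_-prefixed environment overrides."""
    environ = os.environ if environ is None else environ

    parser = configparser.ConfigParser()
    parser.read_string(ini_text)

    # Flatten every section into one dict (hypothetical layout; the real
    # tagger.ini may be organised differently). configparser lowercases keys.
    settings: dict[str, str] = {}
    for section in parser.sections():
        settings.update(parser[section])

    # Environment wins: TAGGER_W2V_BIG replaces the INI key w2v_big.
    for name, value in environ.items():
        if name.startswith(ENV_PREFIX):
            settings[name[len(ENV_PREFIX):].lower()] = value
    return settings


# Hypothetical file contents, used only for this example.
example_ini = """
[embeddings]
w2v_small = /ini/small.vec
w2v_big = /ini/big.vec
"""

settings = load_settings(example_ini, {"TAGGER_W2V_BIG": "/env/big.vec"})
print(settings)
```

Here `w2v_small` keeps its INI value while `w2v_big` takes the value from the environment, which is the override behaviour described for `tagger.ini`.
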
- Default data root is `<repo>/data`; override with `--data-root` or `TAGGER_DATA_ROOT`.
- Saved models use Keras’ `.keras` format containing architecture + weights.
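
As a rough sketch of how the data root might be resolved: the helper `resolve_data_root` is hypothetical, and the precedence shown (command-line flag first, then `TAGGER_DATA_ROOT`, then `<repo>/data`) is an assumption, since only the two override mechanisms and the default are stated above.

```python
import os
from pathlib import Path


def resolve_data_root(cli_value, repo_root, environ=None) -> Path:
    """Pick the data root: --data-root, then TAGGER_DATA_ROOT, then <repo>/data."""
    environ = os.environ if environ is None else environ

    # 1. An explicit --data-root value (assumed to take priority).
    if cli_value:
        return Path(cli_value)

    # 2. The TAGGER_DATA_ROOT environment variable.
    env_value = environ.get("TAGGER_DATA_ROOT")
    if env_value:
        return Path(env_value)

    # 3. The default location inside the repository.
    return Path(repo_root) / "data"


default_root = resolve_data_root(None, "/repo", {})
env_root = resolve_data_root(None, "/repo", {"TAGGER_DATA_ROOT": "/mnt/corpora"})
cli_root = resolve_data_root("/tmp/data", "/repo", {"TAGGER_DATA_ROOT": "/mnt/corpora"})
print(default_root, env_root, cli_root)
```
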