Match labor union name text to Office of Labor-Management Standards filing numbers.
Given an input like "SEIU Local 1199", the parser returns:
```python
{'is_union': True,
 'union_score': 0.992,
 'union_name': 'SERVICE EMPLOYEES',
 'f_num': 31847,
 'match_score': 0.956}
```
```shell
pip install labor-union-parser
```

```python
from labor_union_parser import Extractor

extractor = Extractor()
result = extractor.extract("SEIU Local 1199")
print(result)
# {'f_num': 31847,
#  'is_union': True,
#  'match_score': 0.982312023639679,
#  'union_name': 'SERVICE EMPLOYEES',
#  'union_score': 0.8276892304420471}
```

For batch processing, use `extract_batch`, which processes texts in parallel for better throughput:
```python
from labor_union_parser import Extractor

extractor = Extractor()
results = extractor.extract_batch([
    "SEIU Local 1199",
    "Teamsters Local 705",
    "UAW Local 600",
])
# {'f_num': 31847,
#  'is_union': True,
#  'match_score': 0.9823121428489685,
#  'union_name': 'SERVICE EMPLOYEES',
#  'union_score': 0.8276892900466919}
# {'f_num': 43508,
#  'is_union': True,
#  'match_score': 0.9988226294517517,
#  'union_name': 'TEAMSTERS',
#  'union_score': 0.7318565249443054}
# {'f_num': 13030,
#  'is_union': True,
#  'match_score': 0.9968639612197876,
#  'union_name': 'AUTO WORKERS AFL-CIO',
#  'union_score': 0.7855185270309448}
```

The `batch_size` parameter controls how many texts are processed at once (default: 256). Larger batches are faster but use more memory:
```python
# Process 512 texts at a time
results = extractor.extract_batch(texts, batch_size=512)
```

For very large datasets, combine `extract_batch` with `itertools.batched` to process in chunks and avoid loading everything into memory:
```python
import itertools

from labor_union_parser import Extractor

extractor = Extractor()

# Stream through a large file, processing 1000 lines at a time
# (itertools.batched requires Python 3.12+)
with open("union_names.txt") as f:
    for chunk in itertools.batched(f, 1000):
        texts = [line.strip() for line in chunk]
        for result in extractor.extract_batch(texts):
            print(result["f_num"], result["union_name"])
```

```shell
# Process a CSV file
labor-union-parser unions.csv -c union_name -o results.csv

# Process from stdin
echo "SEIU Local 1199" | labor-union-parser --no-header
```
```
text,pred_is_union,pred_union_score,pred_union_name,pred_f_num,pred_match_score
SEIU Local 1199,True,0.8277,SERVICE EMPLOYEES,31847,0.9823
```

| Field | Description |
|---|---|
| `is_union` | Whether the text is detected as a union name |
| `union_score` | Calibrated probability of being a union (0-1, Platt-scaled) |
| `union_name` | Predicted parent union name from the shared classification head |
| `f_num` | OLMS filing number of the best-matching gazetteer record |
| `match_score` | Softmax probability of the best gazetteer match (0-1) |
Training data and scripts are in `training/`. The pipeline is orchestrated by the root `Makefile`:

```shell
pip install -e ".[train]"  # Install training dependencies
make data      # Download opdr.db, generate gazetteer and training data
make train     # Train ArcFace classifier and union detector
make evaluate  # Run evaluation
make all       # Full pipeline (data + train)
```

- `training/data/labeled_data.csv` — labeled union name examples
- `training/data/nonunion_examples.csv` — non-union text examples
- `training/data/acronym_to_fullname.csv` — union acronym mappings
The model uses a two-stage pipeline:
```
Input: "SEIU Local 1199"
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│ Tokenizer                                         │
│ tokens: ["seiu", "local", "1199"]                 │
│ is_num: [False, False, True]                      │
│ + FastText char n-gram hashes + Bloom number IDs  │
└───────────────────────────────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│ Stage 1: Union Detection (Contrastive)            │
│                                                   │
│ FastText + Bloom + RoPE Transformer (2 layers)    │
│ → Mean pool → Projection → L2 normalize           │
│ → Cosine similarity to learned union prototype    │
│ → Platt scaling: sigmoid(a·sim + b)               │
│                                                   │
│ union_score = 0.99 → is_union = True              │
└───────────────────────────────────────────────────┘
                         │
                         ▼ (always runs)
┌───────────────────────────────────────────────────┐
│ Stage 2: Factored ArcFace Classifier              │
│                                                   │
│ FastText + Bloom + RoPE Transformer (3 layers)    │
│ → Mean pool → L2 normalize                        │
│                                                   │
│ Score against ~35K factored prototypes:           │
│   prototype = W_union + W_desig + bloom(num)      │
│               + W_prefix + W_suffix + W_fnum      │
│ (~17K trained + ~18K zero-shot from gazetteer)    │
└───────────────────────────────────────────────────┘
                         │
                         ▼
Output: {is_union: True, union_name: "SERVICE EMPLOYEES",
         f_num: 31847, match_score: 0.96, ...}
```
Stage 1 uses contrastive learning to distinguish union names from non-union text. It shares the FastText+RoPE encoder architecture of Stage 2 (2 layers instead of 3) and is trained with an ArcFace angular margin against a learned union prototype.
- Encoder: FastText + Bloom + RoPE Transformer (2 layers, 128-dim)
- Pooling: Masked mean pool → linear projection → L2 normalize (64-dim)
- Training: ArcFace contrastive loss with 20K F7 employer names as hard negatives
- Calibration: Platt scaling (sigmoid) for calibrated probability output
- Inference: Cosine similarity to learned prototype → Platt-scaled probability
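The final calibration step can be sketched in a few lines. The sigmoid form `sigmoid(a·sim + b)` is from the text; the coefficients `a` and `b` below are illustrative placeholders, not the fitted values the model ships with:

```python
import math

def platt_scale(sim: float, a: float, b: float) -> float:
    """Map a raw cosine similarity to a calibrated probability
    via a fitted sigmoid: p = 1 / (1 + exp(-(a * sim + b)))."""
    return 1.0 / (1.0 + math.exp(-(a * sim + b)))

# Hypothetical fitted coefficients for illustration only
a, b = 8.0, -2.0
union_score = platt_scale(0.85, a, b)  # high similarity -> high probability
is_union = union_score >= 0.5
```

Because the sigmoid is monotonic, Platt scaling never changes the ranking of inputs by similarity; it only maps raw scores onto a calibrated 0-1 scale.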
A single forward pass through the encoder produces a query embedding. This is scored against factored prototypes — one per gazetteer record — via cosine similarity. No pairwise comparisons needed.
Encoder:
- FastText embedding: Vocabulary lookup + hashed character 3-6 gram average. Typo-robust: similar spellings share n-gram hashes.
- Bloom number embedding: Numbers hashed to 3 indices in a 4096-entry table, summed. Treats numbers as opaque identifiers.
- RoPE Transformer: 3 layers, 4 heads, 128-dim. Position-aware attention helps distinguish "district 10 local 66" from "district 66 local 10".
- Pooling: Masked mean pool → L2 normalize → 128-dim query embedding.
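The Bloom number embedding can be sketched as follows. The 4096-entry table and 3 hash indices come from the description above; the specific hash function and seeding scheme are placeholder assumptions, not the library's actual implementation:

```python
import hashlib

TABLE_SIZE = 4096  # entries in the shared number-embedding table
NUM_HASHES = 3     # each number selects 3 rows
DIM = 128          # embedding width

def bloom_indices(number_token: str) -> list[int]:
    """Hash a number token to 3 table indices. Seeded SHA-1 is a
    stand-in for whatever hash the library actually uses."""
    return [
        int(hashlib.sha1(f"{seed}:{number_token}".encode()).hexdigest(), 16)
        % TABLE_SIZE
        for seed in range(NUM_HASHES)
    ]

def bloom_embedding(number_token: str, table: list[list[float]]) -> list[float]:
    """Sum the 3 selected rows to form the number's embedding."""
    rows = [table[i] for i in bloom_indices(number_token)]
    return [sum(vals) for vals in zip(*rows)]

table = [[0.0] * DIM for _ in range(TABLE_SIZE)]
emb = bloom_embedding("1199", table)
```

Because the table is shared and the hashing is deterministic, unseen local numbers still get a stable embedding, while collisions are softened by summing three rows instead of one.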
Factored Prototypes:
Each f_num's prototype is the sum of learned field embeddings:
```
prototype = W_union[u] + W_desig_name[d] + bloom(desig_num)
          + W_prefix[p] + W_suffix[s] + W_fnum[f]
```

This additive structure means the model learns a separate representation for each field. At inference, scoring is a single matrix multiply against ~35K pre-computed prototype vectors (~17K trained classes + ~18K zero-shot from the gazetteer with W_fnum = 0).
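The factored scoring can be sketched with NumPy. The additive prototype construction, L2 normalization, and single matrix multiply are from the text; the dimensions, field vocabularies, and record list below are toy placeholders, and the prefix/suffix terms are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_union, n_desig, n_fnum = 128, 10, 4, 6

# Learned field embedding tables (randomly initialized for the sketch)
W_union = rng.normal(size=(n_union, dim))
W_desig = rng.normal(size=(n_desig, dim))
W_fnum = rng.normal(size=(n_fnum, dim))

def bloom(num_vec: np.ndarray) -> np.ndarray:
    return num_vec  # stand-in for the Bloom number embedding

# One prototype per gazetteer record, built as a sum of field embeddings;
# a zero-shot record would simply contribute W_fnum = 0.
records = [(u, d, f) for f, (u, d) in
           enumerate([(0, 1), (2, 3), (5, 0), (7, 2), (1, 1), (9, 3)])]
prototypes = np.stack([
    W_union[u] + W_desig[d] + bloom(rng.normal(size=dim)) + W_fnum[f]
    for u, d, f in records
])
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

# Scoring a query is one matrix multiply over all prototype vectors
query = rng.normal(size=dim)
query /= np.linalg.norm(query)
scores = prototypes @ query        # cosine similarities
best_record = int(scores.argmax())
```

Since both sides are unit-normalized, the matrix product directly yields cosine similarities, so retrieval over the full gazetteer stays a single dense operation.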
Zero-shot prototypes: For gazetteer f_nums without training data, prototypes are built from field embeddings alone. During training, these are included as frozen distractors in the ArcFace softmax, teaching the model to distinguish trained classes from similar zero-shot prototypes. W_fnum is L2-regularized to keep trained prototypes close to their zero-shot versions.
Union Head:
An auxiliary classification head shares the W_union embedding weights with the prototypes. During training, a disagree penalty keeps the f_num predictions consistent with the union head's prediction. At inference, the union head provides the union_name output.
CRF Tag Head (training only):
A per-token CRF labels numbers as desig_num, prefix, or suffix using constrained marginalization — we know the field values from the gazetteer but not which tokens they correspond to, so the loss marginalizes over all valid alignments (à la CTC). This teaches the encoder to represent number roles without requiring ground truth token labels.
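The constrained marginalization can be sketched in a simplified form that drops the CRF transition matrix and treats per-token label scores as independent: the gazetteer says which number roles appear in the record, but not which token carries which role, so the loss sums probability over every consistent assignment. All scores below are made up:

```python
import math
from itertools import permutations

# Toy per-token log-probabilities for the number tokens in a text.
# Labels are the number roles; values are illustrative only.
log_probs = {
    "10": {"desig_num": math.log(0.3), "prefix": math.log(0.6)},
    "66": {"desig_num": math.log(0.7), "prefix": math.log(0.2)},
}

def marginal_nll(tokens: list[str], required_labels: list[str]) -> float:
    """Negative log of the total probability over all one-to-one
    assignments of the required labels to the number tokens."""
    total = 0.0
    for perm in permutations(required_labels):
        total += math.exp(sum(log_probs[t][lab] for t, lab in zip(tokens, perm)))
    return -math.log(total)

# The record has a prefix number and a desig_num, but we don't know
# which token is which, so marginalize over both alignments:
# P = 0.6 * 0.7 + 0.3 * 0.2 = 0.48
loss = marginal_nll(["10", "66"], ["prefix", "desig_num"])
```

Minimizing this loss pushes probability mass toward whichever alignment the encoder finds easiest to justify, which is how the model learns number roles without token-level labels.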
End-to-end on held-out test data (8,691 examples scored against the full 44K-record gazetteer):
| Metric | Score |
|---|---|
| Accuracy | 98.0% |
| f_num accuracy (union examples) | 98.6% (7519/7627) |
| f_num accuracy (in-vocab only) | 98.6% |
| union_name accuracy | 98.0% (9180/9364) |
| Wrong match (union, wrong f_num) | 108 |
| False negatives (union missed) | 24 |
| False positives (non-union matched) | 43 |