labordata/labor-union-parser

Labor Union Parser

Match labor union name text to Office of Labor-Management Standards filing numbers.

Given an input like "SEIU Local 1199", the parser returns:

  • is_union: True
  • union_score: 0.992
  • union_name: SERVICE EMPLOYEES
  • f_num: 31847
  • match_score: 0.956

Installation

pip install labor-union-parser

Usage

Python API

from labor_union_parser import Extractor

extractor = Extractor()
result = extractor.extract("SEIU Local 1199")
print(result)
# {'f_num': 31847,
#  'is_union': True,
#  'match_score': 0.982312023639679,
#  'union_name': 'SERVICE EMPLOYEES',
#  'union_score': 0.8276892304420471}

For batch processing, use extract_batch, which processes texts in parallel for better throughput:

from labor_union_parser import Extractor

extractor = Extractor()
results = extractor.extract_batch([
    "SEIU Local 1199",
    "Teamsters Local 705",
    "UAW Local 600",
])
# {'f_num': 31847,
#  'is_union': True,
#  'match_score': 0.9823121428489685,
#  'union_name': 'SERVICE EMPLOYEES',
#  'union_score': 0.8276892900466919}
# {'f_num': 43508,
#  'is_union': True,
#  'match_score': 0.9988226294517517,
#  'union_name': 'TEAMSTERS',
#  'union_score': 0.7318565249443054}
# {'f_num': 13030,
#  'is_union': True,
#  'match_score': 0.9968639612197876,
#  'union_name': 'AUTO WORKERS AFL-CIO',
#  'union_score': 0.7855185270309448}

The batch_size parameter controls how many texts are processed at once (default: 256). Larger batches are faster but use more memory:

# Process 512 texts at a time
results = extractor.extract_batch(texts, batch_size=512)

For very large datasets, combine extract_batch with itertools.batched (Python 3.12+) to process in chunks and avoid loading everything into memory:

import itertools
from labor_union_parser import Extractor

extractor = Extractor()

# Stream through a large file, processing 1000 at a time
with open("union_names.txt") as f:
    for chunk in itertools.batched(f, 1000):
        texts = [line.strip() for line in chunk]
        for result in extractor.extract_batch(texts):
            print(result["f_num"], result["union_name"])

Command Line

# Process CSV file
labor-union-parser unions.csv -c union_name -o results.csv

# Process from stdin
echo "SEIU Local 1199" | labor-union-parser --no-header
text,pred_is_union,pred_union_score,pred_union_name,pred_f_num,pred_match_score
SEIU Local 1199,True,0.8277,SERVICE EMPLOYEES,31847,0.9823

Output Fields

Field         Description
is_union      Whether the text is detected as a union name
union_score   Calibrated probability of being a union (0-1, Platt-scaled)
union_name    Predicted parent union name from the shared classification head
f_num         OLMS filing number of the best-matching gazetteer record
match_score   Softmax probability of best gazetteer match (0-1)
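
A hedged example of consuming these fields downstream; the records and the 0.9 threshold are illustrative, not anything built into the package:

```python
# Illustrative results as the docs describe them -- stage 2 always runs,
# so even non-union text gets an f_num and match_score.
results = [
    {"is_union": True, "union_score": 0.83, "union_name": "SERVICE EMPLOYEES",
     "f_num": 31847, "match_score": 0.98},
    {"is_union": False, "union_score": 0.04, "union_name": "TEAMSTERS",
     "f_num": 43508, "match_score": 0.31},
]

# Keep only records detected as unions with a confident gazetteer match.
confident = [r for r in results if r["is_union"] and r["match_score"] >= 0.9]
f_nums = [r["f_num"] for r in confident]
```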

Training

Training data and scripts are in training/. The pipeline is orchestrated by the root Makefile:

pip install -e ".[train]"   # Install training dependencies

make data                   # Download opdr.db, generate gazetteer and training data
make train                  # Train ArcFace classifier and union detector
make evaluate               # Run evaluation
make all                    # Full pipeline (data + train)

Checked-in Data

  • training/data/labeled_data.csv — labeled union name examples
  • training/data/nonunion_examples.csv — non-union text examples
  • training/data/acronym_to_fullname.csv — union acronym mappings

Model Architecture

The model uses a two-stage pipeline:

Input: "SEIU Local 1199"
              │
              ▼
┌───────────────────────────────────────────────────┐
│  Tokenizer                                        │
│  tokens: ["seiu", "local", "1199"]                │
│  is_num: [False, False, True]                     │
│  + FastText char n-gram hashes + Bloom number IDs │
└───────────────────────────────────────────────────┘
              │
              ▼
┌───────────────────────────────────────────────────┐
│  Stage 1: Union Detection (Contrastive)           │
│                                                   │
│  FastText + Bloom + RoPE Transformer (2 layers)   │
│  → Mean pool → Projection → L2 normalize          │
│  → Cosine similarity to learned union prototype   │
│  → Platt scaling: sigmoid(a·sim + b)              │
│                                                   │
│  union_score = 0.99 → is_union = True             │
└───────────────────────────────────────────────────┘
              │
              ▼ (always runs)
┌───────────────────────────────────────────────────┐
│  Stage 2: Factored ArcFace Classifier             │
│                                                   │
│  FastText + Bloom + RoPE Transformer (3 layers)   │
│  → Mean pool → L2 normalize                       │
│                                                   │
│  Score against ~35K factored prototypes:          │
│  prototype = W_union + W_desig + bloom(num)       │
│            + W_prefix + W_suffix + W_fnum         │
│  (~17K trained + ~18K zero-shot from gazetteer)   │
│                                                   │
│  Match: SERVICE EMPLOYEES LU 1199 → f_num=31847   │
└───────────────────────────────────────────────────┘
              │
              ▼
Output: {is_union: True, union_name: "SERVICE EMPLOYEES",
         f_num: 31847, match_score: 0.96, ...}

Stage 1: Union Detection

Stage 1 uses contrastive learning to distinguish union names from non-union text. It shares the FastText+RoPE encoder architecture with Stage 2 (2 layers instead of 3) and is trained with an ArcFace angular margin against a learned union prototype.

  • Encoder: FastText + Bloom + RoPE Transformer (2 layers, 128-dim)
  • Pooling: Masked mean pool → linear projection → L2 normalize (64-dim)
  • Training: ArcFace contrastive loss with 20K F7 employer names as hard negatives
  • Calibration: Platt scaling (sigmoid) for calibrated probability output
  • Inference: Cosine similarity to learned prototype → Platt-scaled probability
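
The inference step above can be sketched in a few lines; the embedding, the prototype, and the Platt parameters a and b here are invented placeholders, not the fitted values the package ships:

```python
import math

def platt_union_score(embedding, prototype, a, b):
    """Cosine similarity to the union prototype, passed through Platt scaling."""
    dot = sum(e * p for e, p in zip(embedding, prototype))
    norm_e = math.sqrt(sum(e * e for e in embedding))
    norm_p = math.sqrt(sum(p * p for p in prototype))
    sim = dot / (norm_e * norm_p)                    # cosine similarity in [-1, 1]
    return 1.0 / (1.0 + math.exp(-(a * sim + b)))    # sigmoid(a*sim + b)

# A vector pointing the same way as the prototype has sim = 1.0,
# so with a=8, b=-4 the score is sigmoid(4), about 0.982.
score = platt_union_score([1.0, 0.0], [2.0, 0.0], a=8.0, b=-4.0)
is_union = score > 0.5
```

Platt scaling turns the raw similarity into a calibrated probability, which is what union_score reports.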

Stage 2: Factored ArcFace Classifier

A single forward pass through the encoder produces a query embedding. This is scored against factored prototypes — one per gazetteer record — via cosine similarity. No pairwise comparisons needed.

Encoder:

  • FastText embedding: Vocabulary lookup + hashed character 3-6 gram average. Typo-robust: similar spellings share n-gram hashes.
  • Bloom number embedding: Numbers hashed to 3 indices in a 4096-entry table, summed. Treats numbers as opaque identifiers.
  • RoPE Transformer: 3 layers, 4 heads, 128-dim. Position-aware attention helps distinguish "district 10 local 66" from "district 66 local 10".
  • Pooling: Masked mean pool → L2 normalize → 128-dim query embedding.
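
A minimal sketch of the token-level embeddings described above, assuming illustrative table sizes, random vectors in place of trained ones, and CRC32 standing in for whatever hash functions the package actually uses:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
NGRAM_TABLE = rng.normal(size=(2**16, 128))  # hashed char n-gram vectors
NUM_TABLE = rng.normal(size=(4096, 128))     # bloom table for numbers

def char_ngrams(token, lo=3, hi=6):
    """FastText-style 3-6 character n-grams of '<token>' with boundary markers."""
    s = f"<{token}>"
    return [s[i:i + n] for n in range(lo, hi + 1) for i in range(len(s) - n + 1)]

def embed_token(token):
    if token.isdigit():
        # Bloom embedding: hash the number to 3 indices in the 4096-entry
        # table and sum the rows, treating it as an opaque identifier.
        idxs = [zlib.crc32(f"{k}:{token}".encode()) % 4096 for k in range(3)]
        return NUM_TABLE[idxs].sum(axis=0)
    # Hashed n-gram average: similar spellings share n-grams, so typos
    # land near the correctly spelled token.
    idxs = [zlib.crc32(g.encode()) % 2**16 for g in char_ngrams(token)]
    return NGRAM_TABLE[idxs].mean(axis=0)

vectors = [embed_token(t) for t in ["seiu", "local", "1199"]]
```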

Factored Prototypes:

Each f_num's prototype is the sum of learned field embeddings:

prototype = W_union[u] + W_desig_name[d] + bloom(desig_num)
          + W_prefix[p] + W_suffix[s] + W_fnum[f]

This additive structure means the model learns separate representations for each field. At inference, scoring is a single matrix multiply against ~35K pre-computed prototype vectors (~17K trained classes + ~18K zero-shot from gazetteer with W_fnum = 0).

Zero-shot prototypes: For gazetteer f_nums without training data, prototypes are built from field embeddings alone. During training, these are included as frozen distractors in the ArcFace softmax, teaching the model to distinguish trained classes from similar zero-shot prototypes. W_fnum is L2-regularized to keep trained prototypes close to their zero-shot versions.
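
A toy version of the prototype construction and single-matmul scoring, with made-up table sizes and random embeddings standing in for the trained field tables:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 128
# Illustrative field embedding tables; real sizes come from the gazetteer.
W_union = rng.normal(size=(50, D))
W_desig = rng.normal(size=(10, D))
W_fnum = rng.normal(size=(200, D))

# One record per f_num; rows 100+ play the role of zero-shot gazetteer
# records, whose prototypes omit the per-class W_fnum term.
records = [(f % 50, f % 10, f) for f in range(200)]
prototypes = np.stack([
    W_union[u] + W_desig[d] + (W_fnum[f] if f < 100 else 0)
    for u, d, f in records
])
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

# A query embedding close to record 42, as the encoder might produce.
query = prototypes[42] + 0.01 * rng.normal(size=D)
query /= np.linalg.norm(query)

scores = prototypes @ query   # one matrix multiply scores every record
best = int(scores.argmax())
```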

Union Head:

An auxiliary classification head shares the W_union embedding weights with the prototypes. During training, a disagreement penalty encourages the f_num predictions to stay consistent with the union head's prediction. At inference, the union head provides the union_name output.

CRF Tag Head (training only):

A per-token CRF labels numbers as desig_num, prefix, or suffix using constrained marginalization — we know the field values from the gazetteer but not which tokens they correspond to, so the loss marginalizes over all valid alignments (à la CTC). This teaches the encoder to represent number roles without requiring ground truth token labels.
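
A toy illustration of the constrained marginalization, enumerating the valid alignments explicitly rather than running a CRF forward algorithm; the record, tokens, and logits are all invented:

```python
import math
from itertools import permutations

# Suppose the gazetteer record has desig_num=66 and suffix=66 (say,
# "LOCAL 66-66") and the text contains two number tokens "66" -- we know
# the field values but not which token carries which role.
tokens = ["66", "66"]
fields = {"desig_num": "66", "suffix": "66"}

# Per-token role logits (in the real model these come from the encoder).
logits = [{"desig_num": 1.2, "prefix": -0.3, "suffix": 0.4},
          {"desig_num": 0.1, "prefix": -0.8, "suffix": 1.5}]

def role_prob(i, role):
    z = sum(math.exp(v) for v in logits[i].values())
    return math.exp(logits[i][role]) / z

# Marginalize over every assignment of the known fields to tokens whose
# text matches the field value -- here, both orderings are valid.
likelihood = 0.0
for roles in permutations(fields):
    if all(fields[r] == t for r, t in zip(roles, tokens)):
        likelihood += math.prod(role_prob(i, r) for i, r in enumerate(roles))

loss = -math.log(likelihood)   # low when some valid alignment is probable
```

Minimizing this loss pushes probability onto at least one consistent alignment without ever labeling individual tokens by hand.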

Performance

End-to-end on held-out test data (8,691 examples scored against the full 44K-record gazetteer):

Metric                                Score
Accuracy                              98.0%
f_num accuracy (union examples)       98.6% (7519/7627)
f_num accuracy (in-vocab only)        98.6%
union_name accuracy                   98.0% (9180/9364)
Wrong match (union, wrong f_num)      108
False negatives (union missed)        24
False positives (non-union matched)   43
