Pronounced "welders"
Fellegi-Sunter probabilistic record linkage in Rust, powered by Polars.
A Rust-native implementation inspired by the Splink Python package and the fastLink R package.
- Blocking rules — reduce the comparison space with equi-join blocking on one or more columns
- Exact and fuzzy comparisons — Jaro-Winkler, Levenshtein, and Jaro similarity predicates alongside exact matching
- EM training — unsupervised Expectation-Maximization to learn m/u probabilities
- Fellegi-Sunter scoring — Bayes-factor match weights and match probabilities for every candidate pair
- Connected-components clustering — union-find grouping of linked records
- Model serialization — save and load trained model parameters as JSON
- Waterfall explanations — step-by-step breakdowns showing why each pair received its score
```rust
use polars::prelude::*;
use weldrs::comparison::ComparisonBuilder;
use weldrs::prelude::*;

fn main() -> Result<()> {
    // 1. Prepare your data as a Polars DataFrame
    let df = df!(
        "unique_id" => [1i64, 2, 3, 4],
        "first_name" => ["John", "Jane", "Jon", "Jane"],
        "last_name" => ["Smith", "Doe", "Smith", "Doe"],
    )?;

    // 2. Define comparisons and build settings
    let settings = Settings::builder(LinkType::DedupeOnly)
        .comparison(
            ComparisonBuilder::new("first_name")
                .null_level()
                .exact_match_level()
                .jaro_winkler_level(0.88)
                .else_level()
                .build(),
        )
        .comparison(
            ComparisonBuilder::new("last_name")
                .null_level()
                .exact_match_level()
                .else_level()
                .build(),
        )
        .blocking_rule(BlockingRule::on(&["last_name"]))
        .build()?;

    // 3. Train the model
    let mut linker = Linker::new(settings)?;
    let lf = df.lazy();
    linker.estimate_probability_two_random_records_match(
        &lf,
        &[BlockingRule::on(&["first_name", "last_name"])],
        1.0,
    )?;
    linker.estimate_u_using_random_sampling(&lf, 200)?;
    linker.estimate_parameters_using_em(&lf, &BlockingRule::on(&["last_name"]))?;

    // 4. Predict — score all candidate pairs
    let predictions = linker.predict(&lf, None)?.collect()?;

    // 5. Cluster — group linked records
    let clusters = linker.cluster_pairwise_predictions(&predictions, 0.5)?;
    println!("{clusters}");

    Ok(())
}
```

weldrs implements the Fellegi-Sunter model of probabilistic record linkage:
- **Comparisons** — define how record pairs are evaluated. Each comparison targets one input column (e.g. `first_name`) and contains multiple levels ordered from most to least specific (e.g. exact match, Jaro-Winkler >= 0.88, else).
- **m-probability** — the probability that a comparison level agrees, given the records are a true match.
- **u-probability** — the probability that a comparison level agrees, given the records are not a match.
- **Bayes factor** — the ratio m/u. A Bayes factor > 1 provides evidence towards a match; < 1 provides evidence against.
- **Match weight** — log2 of the Bayes factor. The final match weight for a pair is the sum of the prior (from lambda) plus each comparison's individual match weight.
- **Lambda** — the prior probability that two randomly chosen records are a match. Estimated from deterministic rules or set manually.
- **Blocking rules** — equi-join predicates that restrict which record pairs are compared, making linkage tractable on large datasets.
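To make the arithmetic concrete, here is a plain-Rust sketch (standard Fellegi-Sunter formulas, not the weldrs API; the numbers in the comments are illustrative) that turns m/u probabilities and a lambda prior into match weights and a match probability:

```rust
/// Prior match weight: log2 odds implied by lambda.
fn prior_weight(lambda: f64) -> f64 {
    (lambda / (1.0 - lambda)).log2()
}

/// Match weight contributed by one comparison level: log2(m / u).
fn level_weight(m: f64, u: f64) -> f64 {
    (m / u).log2()
}

/// Total match weight (prior weight + per-comparison weights) -> match probability.
fn weight_to_probability(total_weight: f64) -> f64 {
    let bayes_factor = 2f64.powf(total_weight); // product of all Bayes factors
    bayes_factor / (1.0 + bayes_factor)
}
```

For example, with lambda = 0.001 the prior weight is strongly negative (about -10), so a pair needs several agreeing comparisons with large m/u ratios before its total weight, and hence its match probability, becomes high.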
Use `ComparisonBuilder` to define how columns are compared. Levels are evaluated top-to-bottom; the first matching level wins.
```rust
use weldrs::comparison::ComparisonBuilder;

let name_comparison = ComparisonBuilder::new("first_name")
    .null_level()              // both values null
    .exact_match_level()       // exact string equality
    .jaro_winkler_level(0.88)  // fuzzy: Jaro-Winkler >= 0.88
    .else_level()              // everything else
    .build();

let city_comparison = ComparisonBuilder::new("city")
    .null_level()
    .exact_match_level()
    .levenshtein_level(2)      // fuzzy: edit distance <= 2
    .else_level()
    .build();
```

Available fuzzy predicates:

- `jaro_winkler_level(threshold)` — Jaro-Winkler similarity >= threshold
- `jaro_level(threshold)` — Jaro similarity >= threshold
- `levenshtein_level(threshold)` — Levenshtein edit distance <= threshold
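These predicates follow the standard string-metric definitions. As a reference point, here is a minimal sketch of Levenshtein edit distance (the quantity that `levenshtein_level(threshold)` bounds), written in plain Rust with no dependency on weldrs:

```rust
/// Classic dynamic-programming Levenshtein edit distance.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between a[..i] and b[..j] for the previous row
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let val = (prev[j] + cost)    // substitution (or match)
                .min(prev[j + 1] + 1)     // deletion
                .min(curr[j] + 1);        // insertion
            curr.push(val);
        }
        prev = curr;
    }
    prev[b.len()]
}
```

`levenshtein("Jon", "John")` is 1, so a `levenshtein_level(2)` level would fire for that pair.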
```rust
use weldrs::prelude::*;

let settings = Settings::builder(LinkType::DedupeOnly)
    .comparison(name_comparison)
    .comparison(city_comparison)
    .blocking_rule(BlockingRule::on(&["last_name"]))
    .blocking_rule(BlockingRule::on(&["city"]))
    .unique_id_column("record_id")                // default: "unique_id"
    .probability_two_random_records_match(0.001)  // default: 0.0001
    .build()?;
```

`LinkType` options:

- `DedupeOnly` — find duplicates within a single dataset
- `LinkOnly` — link records between two datasets
- `LinkAndDedupe` — link and deduplicate combined
Blocking rules define equi-join conditions to reduce the comparison space. Without blocking, every pair of records would be compared (quadratic).
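The reduction is easy to quantify: n records yield n(n - 1)/2 pairs without blocking, versus only within-block pairs with it. A plain-Rust sketch (not the weldrs API) that counts candidate pairs after grouping on a blocking key:

```rust
use std::collections::HashMap;

/// All-pairs count without blocking: n * (n - 1) / 2.
fn pairs_without_blocking(n: usize) -> usize {
    n * (n - 1) / 2
}

/// Candidate pairs when comparing only records that share a blocking key.
fn pairs_with_blocking(keys: &[&str]) -> usize {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &k in keys {
        *counts.entry(k).or_insert(0) += 1;
    }
    // Sum the all-pairs count within each block
    counts.values().map(|&n| n * (n - 1) / 2).sum()
}
```

For five records with last names `["Smith", "Smith", "Doe", "Doe", "Doe"]`, blocking on last name cuts 10 candidate pairs down to 4.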
```rust
use weldrs::prelude::*;

// Block on a single column
let rule = BlockingRule::on(&["last_name"]);

// Block on multiple columns (AND condition)
let strict_rule = BlockingRule::on(&["first_name", "last_name"]);

// Add an optional description
let rule = BlockingRule::on(&["city"]).with_description("Same city");
```

Training estimates the model parameters in three steps:
```rust
// 1. Estimate lambda (prior match probability) from deterministic rules
linker.estimate_probability_two_random_records_match(
    &lf,
    &[BlockingRule::on(&["first_name", "last_name"])],
    1.0, // assumed recall
)?;

// 2. Estimate u-probabilities from random record pairs
linker.estimate_u_using_random_sampling(&lf, 1_000)?;

// 3. EM training passes — each pass fixes comparisons that overlap
//    with the blocking rule and trains the rest
linker.estimate_parameters_using_em(&lf, &BlockingRule::on(&["last_name"]))?;
linker.estimate_parameters_using_em(&lf, &BlockingRule::on(&["city"]))?;
```
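weldrs's EM internals are not shown here; as a toy illustration of the general algorithm, each pass alternates an E-step (posterior match probability per pair under current parameters) and an M-step (re-estimating lambda, m, and u from those posteriors). One iteration for a single agree/disagree comparison might look like:

```rust
/// One EM iteration for a single binary (agree/disagree) comparison.
/// `agreements[i]` is true when candidate pair i agrees on the comparison.
/// Returns updated (lambda, m, u).
fn em_step(agreements: &[bool], lambda: f64, m: f64, u: f64) -> (f64, f64, f64) {
    let mut post_sum = 0.0;   // sum of P(match | pair) over all pairs
    let mut post_agree = 0.0; // sum of P(match | pair) over agreeing pairs
    let mut agree_count = 0.0;
    for &agrees in agreements {
        // E-step: posterior match probability via Bayes' rule
        let (pm, pu) = if agrees { (m, u) } else { (1.0 - m, 1.0 - u) };
        let p = lambda * pm / (lambda * pm + (1.0 - lambda) * pu);
        post_sum += p;
        if agrees {
            post_agree += p;
            agree_count += 1.0;
        }
    }
    let n = agreements.len() as f64;
    // M-step: re-estimate parameters from the expected assignments
    let new_lambda = post_sum / n;
    let new_m = post_agree / post_sum;
    let new_u = (agree_count - post_agree) / (n - post_sum);
    (new_lambda, new_m, new_u)
}
```

Iterating this update until the parameters stabilize is the essence of unsupervised m/u estimation.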
```rust
// Score all candidate pairs (no threshold)
let predictions = linker.predict(&lf, None)?.collect()?;

// Score with a minimum match-weight threshold
let predictions = linker.predict(&lf, Some(0.0))?.collect()?;

// Use the direct scorer (or PredictMode::Auto) for smaller candidate sets
let predictions = linker
    .predict_with_mode(&lf, None, PredictMode::Direct)?
    .collect()?;
```

The resulting DataFrame includes `match_weight`, `match_probability`, and individual `bf_*` columns for each comparison.
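The clustering step described next groups linked records via connected components. A minimal dependency-free union-find sketch (plain Rust, not the weldrs API) conveys the idea:

```rust
/// Find the root of x, compressing paths along the way.
fn find(parent: &mut [usize], x: usize) -> usize {
    let p = parent[x];
    if p == x {
        return x;
    }
    let root = find(parent, p);
    parent[x] = root; // path compression
    root
}

/// Label records 0..n with a cluster id, given predicted links as edges.
fn connected_components(n: usize, edges: &[(usize, usize)]) -> Vec<usize> {
    let mut parent: Vec<usize> = (0..n).collect();
    for &(a, b) in edges {
        let (ra, rb) = (find(&mut parent, a), find(&mut parent, b));
        if ra != rb {
            parent[ra] = rb; // union the two components
        }
    }
    // Every record's cluster id is its root
    (0..n).map(|i| find(&mut parent, i)).collect()
}
```

Records connected by any chain of above-threshold links end up with the same cluster id, even if some pairs in the chain were never directly compared.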
```rust
// Group linked records into clusters using a match-probability threshold
let clusters = linker.cluster_pairwise_predictions(&predictions, 0.5)?;
// Returns a DataFrame with columns: [unique_id, cluster_id]
```

Waterfall charts show exactly how each comparison contributed to a pair's score:
```rust
// Explain a single pair (by row index in the predictions DataFrame)
let chart = linker.explain_pair(&predictions, 0)?;
for step in &chart.steps {
    println!("{}: {} (weight: {:.2})", step.column_name, step.label, step.log2_bayes_factor);
}

// Explain multiple pairs at once
let charts = linker.explain_pairs(&predictions, &[0, 1, 2])?;
```

```rust
let summary = linker.model_summary();
println!("Lambda: {:.6}", summary.probability_two_random_records_match);
for comp in &summary.comparisons {
    println!("Comparison: {}", comp.output_column_name);
    for level in &comp.levels {
        println!("  {} — BF: {:?}, weight: {:?}",
            level.label, level.bayes_factor, level.log2_bayes_factor);
    }
}
```

```rust
// Save trained model to JSON
let json = linker.save_settings_json()?;
std::fs::write("model.json", &json)?;

// Load a previously trained model
let json = std::fs::read_to_string("model.json")?;
let restored_linker = Linker::load_settings_json(&json)?;

// Use the restored linker for prediction — no retraining needed
let predictions = restored_linker.predict(&lf, None)?.collect()?;
```

Run any example with `cargo run --example <name>`:
| Example | Description |
|---|---|
| `cargo run --example basic_dedup` | Full pipeline tutorial with a 10-row dataset |
| `cargo run --example fuzzy_matching` | Compare exact-only vs. fuzzy comparisons on 1K rows |
| `cargo run --example explain_predictions` | Waterfall explanation of top/bottom scoring pairs |
| `cargo run --example model_parameters` | Inspect trained m/u/BF parameters |
| `cargo run --example save_and_load` | Serialize and restore a trained model |
| `cargo run --example scaling --release` | Performance benchmark (default 100K rows) |
MIT