Skip to content

gcol33/joinspy

joinspy

CRAN status CRAN downloads Monthly downloads R-CMD-check Codecov test coverage License: MIT

Find out why your keys don't match.

You ran a left join and lost 40% of your rows. dplyr says "many-to-many relationship." joinspy says 12 keys have trailing spaces, 8 differ only by case, and 3 contain invisible Unicode characters. Then it fixes them.

Quick Start

library(joinspy)

join_spy(orders, customers, by = "customer_id")

repaired <- join_repair(orders, customers, by = "customer_id")

suggest_repairs(join_spy(orders, customers, by = "customer_id"))

The Problem

Most join failures come down to string-level problems in keys:

  • "Alice" vs "Alice " (trailing space, invisible)
  • "NYC" vs "nyc" (case)
  • Zero-width spaces, BOMs, non-breaking spaces that look like regular spaces but aren't
  • "Johansson" vs "Johannson" (one character off)
  • Empty strings matching each other but not NA

R won't warn you about any of these. join_spy() catches them before the join runs.

What joinspy Does

Diagnose

join_spy() examines keys before the join:

orders <- data.frame(
  id = c("A", "B ", "c", "D"),
  amount = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

customers <- data.frame(
  id = c("A", "B", "C", "E"),
  name = c("Alice", "Bob", "Carol", "Eve"),
  stringsAsFactors = FALSE
)

join_spy(orders, customers, by = "id")
#> -- Join Diagnostic Report --
#> Match rate (left): 25%
#>
#> Issues Detected:
#>   ! "B " has trailing whitespace (would match "B")
#>   ! "c" vs "C" — case mismatch
#>   x "D" has no match in right table

Repair

join_repair() fixes the issues, or previews what it would change with dry_run = TRUE. suggest_repairs() prints the R code instead of running it.

join_repair(orders, customers, by = "id", dry_run = TRUE)

repaired <- join_repair(orders, customers, by = "id",
                        standardize_case = "upper")

suggest_repairs(join_spy(orders, customers, by = "id"))
#> x$id <- trimws(x$id)
#> x$id <- toupper(x$id)
#> y$id <- toupper(y$id)

Predict

join_spy() also estimates result size for each join type:

report <- join_spy(orders, customers, by = "id")
report$expected_rows
#> inner_join: 1
#> left_join:  4
#> right_join: 4
#> full_join:  7

Explain

join_explain() works after the join, on the result:

result <- merge(orders, customers, by = "id", all.x = TRUE)
join_explain(result, orders, customers, by = "id", type = "left")
#> Result has same row count as left table
#> ! 3 left key(s) have no match in right table

Enforce

join_strict() performs the join and enforces the cardinality you declare. If the data has a 1:n relationship and you said 1:1, it errors before producing a result. The output row count is determined by the constraint, not discovered after the fact.

# Declare 1:1 — errors if any key is duplicated on either side
join_strict(orders, customers, by = "id", expect = "1:1")

# Declare 1:n — left keys must be unique, right may repeat
join_strict(products, line_items, by = "product_id", expect = "1:n")

The four levels are "1:1", "1:n", "n:1", and "n:m". detect_cardinality() reports the actual relationship if you need to check first.

Also Includes

Join wrappers (left_join_spy(), inner_join_spy(), etc.) run diagnostics before joining and attach the report as an attribute. check_cartesian() flags many-to-many keys that would multiply your row count. analyze_join_chain() handles multi-step A-B-C sequences. All joins work with tibbles, data.tables, and plain data frames.

Installation

# Install from CRAN
install.packages("joinspy")

# Or install development version from GitHub
# install.packages("pak")
pak::pak("gcol33/joinspy")

Documentation

Related Work

Package Focus
dplyr 1.1+ Cardinality checks via relationship argument
powerjoin 12-level configurable checks, key preprocessing
joyn Match-status reporting variable per row
tidylog Logs row count changes after joins

joinspy focuses on string-level key diagnostics: whitespace, case, encoding, typos, and type mismatches. It identifies which specific keys failed, why, and can fix them automatically.

Support

"Software is like sex: it's better when it's free." — Linus Torvalds

I'm a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks.

Buy Me A Coffee

License

MIT (see the LICENSE.md file)

Citation

@software{joinspy,
  author = {Colling, Gilles},
  title = {joinspy: Diagnostic Tools for Data Frame Joins},
  year = {2025},
  url = {https://github.com/gcol33/joinspy}
}

About

R package for diagnosing data frame joins: duplicates, mismatches, and row count prediction before joining

Topics

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Code of conduct

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors