Skip to content

perf(write): eliminate dedup full-table scan + OCC conflict storm#78

Merged
tonyalaribe merged 1 commit into
masterfrom
perf/dedup-scan-and-conflicts
Jun 23, 2026
Merged

perf(write): eliminate dedup full-table scan + OCC conflict storm#78
tonyalaribe merged 1 commit into
masterfrom
perf/dedup-scan-and-conflicts

Conversation

@tonyalaribe

Copy link
Copy Markdown
Contributor

Follow-up to #77 (Tier C). Measured in prod after #77 deployed (e8bbb64): the drain didn't improve — the buffer climbed to 100% pressure and inserts backpressured. Investigation showed Tier C optimized the post-commit hook, but the dedup replace_where sweep's cost is upstream of it, in two compounding places. This PR fixes both.

Lever 2 — OCC conflict storm (database.rs)

The dedup retry loop took delta_commit_lock but, unlike the flush staged-commit path, never refreshed the snapshot under the lock — so it committed a base_version behind latest and delta-rs's OCC checker rebased on every attempt (conflicts_checked=2, table updated during transaction in prod), re-running the full file-matching scan ~7× per commit. Fix: refresh_table_snapshot under the lock before cloning, exactly the pattern the flush path already uses → base_version == latest → commit lands first try.

Lever 1 — full-table planning scan (delta-rs fork → rev 1de46a01)

The dedup replace_where predicate compares Date32 date / Timestamp timestamp columns to Utf8 string literals (the bare-string, commit-serializable form). parse_predicate_expression doesn't coerce; the row filter gets coerced by physical planning, but the file-skipping predicate was converted to a kernel predicate uncoerced — so the kernel couldn't evaluate Date32 <op> Utf8View, the scan fell open, and it read all ~26k files (predicate_filtered=0) on every dedup commit.

Fork fix (find_files.rs coerce_skipping_terms): coerce + const-fold the skipping conjuncts' literals to the column types (TypeCoercionRewriter + ExprSimplifier) before converting to the kernel predicate, so it prunes to the target project_id/date partition. Terms that can't be coerced pass through unchanged.

Combined effect

Each dedup commit drops from ~7 full-table (26k-file) scans to one partition-scoped scan, collapsing the delta_commit_lock hold that was starving flush drain.

Tests

  • Fork: find_files_partition_prunes_with_mixed_string_literal_predicate — errored (Date32 <= Utf8View) before the fix, now asserts the file-matching scan reads only the target partition's data files. 15 find_files + 89 write tests green.
  • TF: existing refresh_incremental_matches_full_across_removes passes against the new fork rev.

Prod validation after deploy

conflicts_checked → 0, predicate_filtered > 0, planning-scan add_files_seen drops 26k → one partition, buffer drains.

Known follow-up

The partition-only path (scan_memory_table, used by DELETE/UPDATE with a Date32-partition string predicate) isn't coerced yet — latent, not the dedup path.

🤖 Generated with Claude Code

Two compounding causes of the dedup sweep throttling flush drain (each dedup
replace_where held delta_commit_lock for a ~26k-file scan, ~7x per commit):

Lever 2 — conflict storm: the dedup retry loop took the commit lock but did NOT
refresh the snapshot under it, so it committed a base_version behind latest and
the OCC checker rebased every attempt (conflicts_checked>0), re-running the full
file-matching scan ~7x. Now refresh under the lock before cloning — the exact
pattern the flush staged-commit path already uses — so base_version == latest
and the commit lands first try.

Lever 1 — full-table scan: bump the delta-rs fork to 1de46a0, which coerces the
file-skipping predicate's literals to the column types. The dedup predicate
compares Date32/Timestamp columns to Utf8 string literals; uncoerced, the kernel
couldn't evaluate skipping and fell open, reading all ~26k files
(predicate_filtered=0) per commit. Coerced, it prunes to the target
project_id/date partition.

Together: each dedup commit goes from ~7 full-table scans to one partition-scoped
scan, collapsing the commit-lock hold that was starving flush drain.
@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown

Code Review

This PR fixes two distinct compounding causes of the dedup buffer backpressure issue: (1) a fork change to delta-rs that coerces Date32/Timestamp column types before converting file-skipping predicates, preventing the kernel from falling back to a full 26k-file scan; and (2) a one-line fix in database.rs that refreshes the table snapshot under delta_commit_lock before cloning, so the dedup replace_where always commits at base_version == latest and skips the OCC rebase storm. The fork rev bump and the database.rs fix each address a different layer of the same hotpath.


Finding 1 — commit_guard held across swap_and_refresh_cache in the success path

src/database.rs line 3759

The staged-commit flush path (the reference pattern) explicitly drops commit_guard before any post-commit work:

// flush staged-commit, line 2960–2962
new_table.state = Some(finalized.snapshot());
drop(commit_guard);      // ← lock released before record_committed_write
return Ok(self.record_committed_write(...).await);

The dedup success path never explicitly drops the guard; commit_guard is held for the entire swap_and_refresh_cache call and all subsequent logging before the implicit drop at break:

// dedup success, line 3754–3767
Ok(new_table) => {
    self.swap_and_refresh_cache(table_ref, new_table, &pre_uris).await;  // ← under lock
    crate::metrics::record_compaction_dedup_dropped(dropped);            // ← under lock
    info!("dedup compaction: ...");                                       // ← under lock
    committed = true;
    break;                   // commit_guard dropped here
}

swap_and_refresh_cache acquires a table_ref.write() lock and warms/evicts cache entries — the comment at line 3755–3758 documents a ~1.5 s first-read latency on OVH and the 26k-file table has been measured at 2–8 s for related operations. Every concurrent flush append that contends on delta_commit_lock is serialised behind this window on every successful dedup commit.

Suggested fix: insert drop(commit_guard); before line 3759, matching the staged-commit reference pattern.


Finding 2 — Dedup OCC classifier misses "already exists" and "Metadata changed"

src/database.rs line 3779

The is_conflict predicate in the dedup retry path:

let is_conflict = msg.contains("concurrent transaction")
    || msg.contains("Commit failed")
    || msg.contains("Transaction failed");

The canonical is_conflict_err closure in the flush path (line 2863) also matches:

msg.contains("already exists") || ... || msg.contains("Metadata changed")

The comments at lines 2855–2864 explain why those two substrings are necessary: "already exists" covers the VersionAlreadyExists delta-rs message ("Delta transaction failed, version N already exists.") and "Metadata changed" covers schema-evolution conflicts — both are retryable OCC errors that a single retry with a refreshed snapshot resolves.

Failure scenario: a VersionAlreadyExists error in the dedup path (which can occur when two process replicas compete) will fall through to return Err(...) on the first attempt instead of retrying. This leaves duplicates in place and forces the next scheduler tick to re-sweep the partition.

The fix (and a shared is_conflict_err free function the PR description already mentions as a follow-up) is straightforward: add the two missing substrings to the dedup classifier.


Finding 3 — No targeted test for the OCC fix in database.rs

The PR's test coverage note mentions refresh_incremental_matches_full_across_removes passing — that test validates snapshot refresh correctness, not the specific fix. There is no test that:

  1. starts a concurrent commit on the same partition while a dedup replace_where is in flight,
  2. verifies the conflict is detected,
  3. and asserts the retry (with the refreshed snapshot) succeeds without exhausting MAX_RETRIES.

Given the regression risk of the lock-hold interactions described above, a narrowly scoped integration test covering the retry path would give this fix durable coverage. The existing dedup_today_partitions test scaffolding is already in place and would support this.


Cargo.lock dependency changes — no action needed

The fork rev bump transitively downgrades prost-build/prost-types from itertools 0.13.0 to 0.12.1 and winapi-util's windows-sys from 0.61.2 to 0.48.0. Neither is a concern: prost-build is build-time codegen only, windows-sys is not compiled on Linux targets, and no RustSec advisories exist for any version involved. Cargo carries multiple windows-sys versions concurrently without conflict.


The fork-side fix (coerce_skipping_terms) and the base_version refresh are both the right fix at the right depth. Finding 1 is the most actionable item before merging — it re-introduces a lock-contention window that the PR set out to eliminate.

@tonyalaribe tonyalaribe merged commit b35b679 into master Jun 23, 2026
10 checks passed
tonyalaribe added a commit that referenced this pull request Jun 23, 2026
- Release delta_commit_lock BEFORE the dedup success path's post-commit work
  (swap_and_refresh_cache: get_file_uris over 26k paths + persist_snapshot +
  table write-lock), matching the flush staged-commit pattern, so concurrent
  flush appends no longer serialize behind it.
- Make swap_and_refresh_cache's handle swap version-guarded (only swap when
  newer), so it's safe to call without the commit lock — a concurrent committer
  may already have advanced table_ref past our version and a bare swap would
  regress it. Same guard as refresh_table_snapshot; also hardens the optimize
  callers.
- Extract a shared is_occ_conflict_err() classifier and use it in the flush,
  dedup, and light-optimize retry loops. The dedup/optimize loops previously
  omitted "already exists" (VersionAlreadyExists) and "Metadata changed", so a
  multi-replica version race would fail-fast instead of retrying and leave the
  duplicate for the next sweep. Unit test pins retryable-vs-permanent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant