perf(write): eliminate dedup full-table scan + OCC conflict storm#78
Conversation
Two compounding causes of the dedup sweep throttling flush drain (each dedup replace_where held delta_commit_lock for a ~26k-file scan, ~7x per commit): Lever 2 — conflict storm: the dedup retry loop took the commit lock but did NOT refresh the snapshot under it, so it committed a base_version behind latest and the OCC checker rebased every attempt (conflicts_checked>0), re-running the full file-matching scan ~7x. Now refresh under the lock before cloning — the exact pattern the flush staged-commit path already uses — so base_version == latest and the commit lands first try. Lever 1 — full-table scan: bump the delta-rs fork to 1de46a0, which coerces the file-skipping predicate's literals to the column types. The dedup predicate compares Date32/Timestamp columns to Utf8 string literals; uncoerced, the kernel couldn't evaluate skipping and fell open, reading all ~26k files (predicate_filtered=0) per commit. Coerced, it prunes to the target project_id/date partition. Together: each dedup commit goes from ~7 full-table scans to one partition-scoped scan, collapsing the commit-lock hold that was starving flush drain.
Code ReviewThis PR fixes two distinct compounding causes of the dedup buffer backpressure issue: (1) a fork change to Finding 1 —
|
- Release delta_commit_lock BEFORE the dedup success path's post-commit work (swap_and_refresh_cache: get_file_uris over 26k paths + persist_snapshot + table write-lock), matching the flush staged-commit pattern, so concurrent flush appends no longer serialize behind it. - Make swap_and_refresh_cache's handle swap version-guarded (only swap when newer), so it's safe to call without the commit lock — a concurrent committer may already have advanced table_ref past our version and a bare swap would regress it. Same guard as refresh_table_snapshot; also hardens the optimize callers. - Extract a shared is_occ_conflict_err() classifier and use it in the flush, dedup, and light-optimize retry loops. The dedup/optimize loops previously omitted "already exists" (VersionAlreadyExists) and "Metadata changed", so a multi-replica version race would fail-fast instead of retrying and leave the duplicate for the next sweep. Unit test pins retryable-vs-permanent.
Follow-up to #77 (Tier C). Measured in prod after #77 deployed (
e8bbb64): the drain didn't improve — the buffer climbed to 100% pressure and inserts backpressured. Investigation showed Tier C optimized the post-commit hook, but the dedupreplace_wheresweep's cost is upstream of it, in two compounding places. This PR fixes both.Lever 2 — OCC conflict storm (
database.rs)The dedup retry loop took
delta_commit_lockbut, unlike the flush staged-commit path, never refreshed the snapshot under the lock — so it committed abase_versionbehind latest and delta-rs's OCC checker rebased on every attempt (conflicts_checked=2,table updated during transactionin prod), re-running the full file-matching scan ~7× per commit. Fix:refresh_table_snapshotunder the lock before cloning, exactly the pattern the flush path already uses →base_version == latest→ commit lands first try.Lever 1 — full-table planning scan (delta-rs fork → rev
1de46a01)The dedup
replace_wherepredicate comparesDate32 date/Timestamp timestampcolumns to Utf8 string literals (the bare-string, commit-serializable form).parse_predicate_expressiondoesn't coerce; the row filter gets coerced by physical planning, but the file-skipping predicate was converted to a kernel predicate uncoerced — so the kernel couldn't evaluateDate32 <op> Utf8View, the scan fell open, and it read all ~26k files (predicate_filtered=0) on every dedup commit.Fork fix (
find_files.rscoerce_skipping_terms): coerce + const-fold the skipping conjuncts' literals to the column types (TypeCoercionRewriter+ExprSimplifier) before converting to the kernel predicate, so it prunes to the targetproject_id/datepartition. Terms that can't be coerced pass through unchanged.Combined effect
Each dedup commit drops from ~7 full-table (26k-file) scans to one partition-scoped scan, collapsing the
delta_commit_lockhold that was starving flush drain.Tests
find_files_partition_prunes_with_mixed_string_literal_predicate— errored (Date32 <= Utf8View) before the fix, now asserts the file-matching scan reads only the target partition's data files. 15 find_files + 89 write tests green.refresh_incremental_matches_full_across_removespasses against the new fork rev.Prod validation after deploy
conflicts_checked → 0,predicate_filtered > 0, planning-scanadd_files_seendrops 26k → one partition, buffer drains.Known follow-up
The partition-only path (
scan_memory_table, used by DELETE/UPDATE with aDate32-partition string predicate) isn't coerced yet — latent, not the dedup path.🤖 Generated with Claude Code