perf(interaction): fix write straggler + cache spill; integer interactionId#147

Open
d0choa wants to merge 8 commits into refactor/interaction-optimisation from perf/interaction-bottlenecks

@d0choa (Contributor) commented Jun 30, 2026

Summary

Builds on #145. Addresses the compute bottlenecks and storage bloat found by benchmarking the interaction step on the real pts Dataproc topology (1 master + 2× n1-standard-8), and aligns the interactionId key with the aggregation. No change to row-level results — interaction 14,617,906 rows, evidence 27,407,354 rows, full FK integrity, all runs.

Changes

  1. Write straggler fixed — the output writes used coalesce(2/6), which doesn't rebalance, so STRING's 13.2 M rows landed in one partition → a single ~900 s write task. Switched to repartition(...).
  2. Cache spill fixed: string-interactions.txt.gz is gzip-compressed (non-splittable → 1 partition), so its transform+explode ran in a single task that spilled ~12 GB. The data is now repartitioned at read, before the explode.
  3. interactionId is now an 8-byte integer (xxhash64) instead of a 64-char SHA-256 — content-derived (no join), and it cut total output storage ~3× (see ladder). Zero collisions on all 14.6 M unique interactions (distinct == count).
  4. Key alignment — one shared _INTERACTION_KEY_COLS (now incl. speciesA/speciesB) drives both the aggregation groupBy and the id hash, so they can't drift (closes a latent duplicate-id risk).
  5. Exception-safe unpersist (try/finally).
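
The key alignment in item 4 can be illustrated without Spark. This is a minimal sketch, not the PR's code: the contents of `_INTERACTION_KEY_COLS` other than speciesA/speciesB, the helper names, and the sample rows are hypothetical, and `hashlib.blake2b` truncated to 8 bytes stands in for Spark's xxhash64, which is not in the Python standard library. The point is only that one column list feeds both the grouping key and the id hash.

```python
import hashlib
import struct
from collections import defaultdict

# Hypothetical key columns; the PR only confirms that speciesA/speciesB are included.
_INTERACTION_KEY_COLS = ("targetA", "targetB", "sourceDatabase", "speciesA", "speciesB")


def _key(row: dict) -> tuple:
    """Project a row onto the shared key columns (used by grouping AND hashing)."""
    return tuple(row[col] for col in _INTERACTION_KEY_COLS)


def interaction_id(row: dict) -> int:
    """Content-derived signed 64-bit id; blake2b(8 bytes) is a stand-in for xxhash64."""
    payload = "\x1f".join(str(v) for v in _key(row)).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return struct.unpack(">q", digest)[0]  # signed 64-bit, like a Spark LongType


def aggregate(rows: list) -> dict:
    """Group evidence rows by the same key the id is hashed from."""
    groups = defaultdict(list)
    for row in rows:
        groups[_key(row)].append(row)
    return {interaction_id(members[0]): members for members in groups.values()}


# Made-up rows: the first two share a key, the third differs only by speciesB.
base = {"targetA": "A", "targetB": "B", "sourceDatabase": "string",
        "speciesA": 9606, "speciesB": 9606}
rows = [dict(base, score=0.9), dict(base, score=0.4), dict(base, speciesB=10090, score=0.7)]
result = aggregate(rows)
print(len(result), "interactions from", len(rows), "evidence rows")
```

Because `_key` is the only place the column list is read, adding or removing a key column changes the grouping and the id together, which is the drift the shared constant is meant to prevent.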

Performance (vs main, dev topology)

| Metric | main | this branch | Δ |
|---|---|---|---|
| Wall-clock | 1,657 s | 971 s | −41% |
| Summed stage time | 2,285 s | 1,010 s | −56% |
| Spill | 0 | 0 | no regression |

Storage (interaction + evidence)

| | total | vs original release |
|---|---|---|
| original 26.06 | 353 MB | 1.0× |
| #145 (hex id) | 2,223 MB | 6.3× |
| this branch (xxhash64 id) | 648 MB | 1.84× |

The residual 1.84× is the inherent cost of carrying an id in both tables; removing it entirely would mean not normalising the evidence schema (out of scope here).
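
The ladder ratios can be re-derived from the quoted sizes, and a rough estimate shows why the id width dominates. This is a back-of-the-envelope sketch: it assumes one id per row in each table and counts raw, uncompressed bytes, so it ignores Parquet encoding and compression and should be read as an order-of-magnitude check only.

```python
# Figures quoted in this PR description.
ORIGINAL_MB = 353
HEX_ID_MB = 2_223
XXHASH_MB = 648
INTERACTION_ROWS = 14_617_906
EVIDENCE_ROWS = 27_407_354


def ratio(size_mb: float, baseline_mb: float = ORIGINAL_MB) -> float:
    """Size relative to the original release."""
    return size_mb / baseline_mb


def raw_id_mb(bytes_per_id: int) -> float:
    """Uncompressed size of one id per row across both tables, in decimal MB."""
    return (INTERACTION_ROWS + EVIDENCE_ROWS) * bytes_per_id / 1e6


print(f"hex id vs original:      {ratio(HEX_ID_MB):.2f}x")
print(f"xxhash64 id vs original: {ratio(XXHASH_MB):.2f}x")
print(f"hex id vs xxhash64 id:   {HEX_ID_MB / XXHASH_MB:.2f}x")
print(f"raw 64-char ids: {raw_id_mb(64):.0f} MB, raw 8-byte ids: {raw_id_mb(8):.0f} MB")
print(f"observed id overhead on this branch: {XXHASH_MB - ORIGINAL_MB} MB")
```

Under those assumptions the raw 8-byte ids come to roughly 336 MB, the same order as the 295 MB observed overhead (648 MB minus 353 MB).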

Tried and reverted (kept in history for context)

  • Evidence repartitionByRange for compression — won −36% evidence size but its global sort added ~+500 s. Compute-for-storage; reverted.
  • Bumping partition_count.interactions_evidence 6→16 — no speed gain and reintroduced spill, because AQE coalescePartitions merges the extra partitions back toward advisoryPartitionSizeInBytes. Reverted; the real lever is advisoryPartitionSizeInBytes, not the repartition count.
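
For anyone following up on the second bullet, the AQE settings involved could be collected as below. This is a sketch only: the helper names are hypothetical and the 32 MiB value is a placeholder, not a setting tested in this PR. The three configuration keys are standard Spark SQL options.

```python
def aqe_overrides(advisory_size_bytes: int) -> dict:
    """Spark SQL settings that control AQE post-shuffle partition coalescing."""
    return {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        # Target size AQE aims for when merging small shuffle partitions.
        "spark.sql.adaptive.advisoryPartitionSizeInBytes": str(advisory_size_bytes),
    }


def to_submit_args(conf: dict) -> list:
    """Render the settings as spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder value (32 MiB); the right size would have to be benchmarked.
print(" ".join(to_submit_args(aqe_overrides(32 * 1024 * 1024))))
```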

Test plan

  • pytest test/test_interaction.py — 13/13 (added: interactionId is LongType; two rows differing only by species get different ids)
  • Step runs end-to-end on 26.06 release data (dev topology)
  • Row counts + FK integrity identical to reference; interactionId unique, 0 collisions
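
As a sanity check on the zero-collision result, the birthday approximation gives the chance of any collision among this many ids. The sketch assumes an ideal, uniformly distributed 64-bit hash, which xxhash64 only approximates, so it is an estimate rather than a guarantee. The empirical distinct == count check remains the real test.

```python
import math

N_INTERACTIONS = 14_617_906  # unique interactions reported in this PR
HASH_BITS = 64               # width of the integer id


def collision_probability(n: int, bits: int) -> float:
    """Birthday approximation: P(>=1 collision) ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n * (n - 1) / 2 ** (bits + 1)
    return -math.expm1(exponent)  # expm1 keeps precision for tiny probabilities


p = collision_probability(N_INTERACTIONS, HASH_BITS)
print(f"P(any collision) for {N_INTERACTIONS:,} ids at {HASH_BITS} bits: {p:.2e}")
```

At 64 bits the estimate is on the order of a few in a million, whereas the same formula at 32 bits gives a near-certain collision, which is why a narrower integer id would not be safe at this row count.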

🤖 Generated with Claude Code
