perf(interaction): fix write straggler + cache spill; integer interactionId by d0choa · Pull Request #147 · opentargets/pts

d0choa · 2026-06-30T15:35:04Z

Summary

Builds on #145. Addresses the compute bottlenecks and storage bloat found by benchmarking the interaction step on the real pts Dataproc topology (1 master + 2× n1-standard-8), and aligns the interactionId key with the aggregation key. Row-level results are unchanged in every run: interaction has 14,617,906 rows, evidence has 27,407,354 rows, and FK integrity is complete.

Changes

Write straggler fixed: the output writes used coalesce(2/6), which merges existing partitions without a shuffle and so doesn't rebalance skewed data. STRING's 13.2 M rows landed in one partition, producing a single ~900 s write task. Switched to repartition(...).
Cache spill fixed: string-interactions.txt.gz is gzip-compressed (non-splittable, so it is read as 1 partition), which meant its transform+explode ran in one task that spilled ~12 GB. It is now repartitioned at read, before the explode.
interactionId is now an 8-byte integer (xxhash64) instead of a 64-char SHA-256 hex string. It is content-derived (no join), and it cut total output storage ~3× (see the Storage section). Zero collisions across all 14.6 M unique interactions (distinct == count).
Key alignment: one shared _INTERACTION_KEY_COLS (now including speciesA/speciesB) drives both the aggregation groupBy and the id hash, so they can't drift (closes a latent duplicate-id risk).
Exception-safe unpersist (try/finally).
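
The key-alignment idea can be sketched in plain Python, without Spark. This is an illustration under stated assumptions, not the PR's code: the column names other than speciesA/speciesB are hypothetical, and an 8-byte BLAKE2b digest stands in for xxhash64, which the standard library does not provide. The point is that one tuple of key columns feeds both the grouping and the id, so the two cannot disagree.

```python
import hashlib
from collections import defaultdict

# Hypothetical key columns: the PR confirms only that speciesA/speciesB are in the key.
INTERACTION_KEY_COLS = ("targetA", "targetB", "speciesA", "speciesB")


def key_of(row):
    """Extract the grouping key; the single source of truth for both uses below."""
    return tuple(row[col] for col in INTERACTION_KEY_COLS)


def interaction_id(key):
    """Derive a signed 64-bit id from the key (BLAKE2b stand-in for xxhash64)."""
    # repr() keeps None distinct from the string "None"; \x1f separates fields.
    payload = "\x1f".join(repr(value) for value in key).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


def aggregate(rows):
    """Group evidence rows by key and label each group with its id."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_of(row)].append(row)
    # Ids are computed from the same key that defined the groups.
    return {interaction_id(key): members for key, members in groups.items()}


rows = [
    {"targetA": "ENSG1", "targetB": "ENSG2", "speciesA": 9606, "speciesB": 9606, "score": 0.9},
    {"targetA": "ENSG1", "targetB": "ENSG2", "speciesA": 9606, "speciesB": 9606, "score": 0.4},
    {"targetA": "ENSG1", "targetB": "ENSG2", "speciesA": 9606, "speciesB": 10090, "score": 0.7},
]
interactions = aggregate(rows)
```

Here the first two rows collapse into one interaction, while the third, which differs only in speciesB, gets its own id. If the hash used fewer columns than the grouping, those two groups would share an id, which is the duplicate-id risk the shared constant removes.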

Performance (vs `main`, dev topology)

Metric	main	this branch	Δ
Wall-clock	1,657 s	971 s	−41%
Summed stage time	2,285 s	1,010 s	−56%
Spill	0	0	no regression

Storage (interaction + evidence)

Version	Total	vs original release
original 26.06	353 MB	1.0×
#145 (hex id)	2,223 MB	6.3×
this branch (xxhash64 id)	648 MB	1.84×

The residual 1.84× is the inherent cost of carrying an id in both tables; removing it entirely would mean not normalising the evidence schema (out of scope here).
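
The ratios in the storage table can be rechecked with a few lines of Python. The totals are the ones reported above; the raw id-byte estimate at the end is an added back-of-envelope figure (uncompressed, before any Parquet encoding), so it indicates the direction of the saving rather than the on-disk size.

```python
# Totals in MB, as reported in the storage table.
STORAGE_MB = {
    "original 26.06": 353,
    "#145 (hex id)": 2223,
    "this branch (xxhash64 id)": 648,
}

baseline = STORAGE_MB["original 26.06"]
ratios = {name: round(mb / baseline, 2) for name, mb in STORAGE_MB.items()}

# Reduction from #145 to this branch (the "~3x" quoted in Changes).
reduction = STORAGE_MB["#145 (hex id)"] / STORAGE_MB["this branch (xxhash64 id)"]

# Rough, uncompressed size of the id column across both tables.
id_values = 14_617_906 + 27_407_354   # interaction rows + evidence rows
hex_id_bytes = id_values * 64         # 64-char hex string per id
int_id_bytes = id_values * 8          # 8-byte integer per id

print(ratios)
print(f"#145 -> this branch: {reduction:.2f}x smaller")
print(f"raw id bytes: {hex_id_bytes:,} (hex) vs {int_id_bytes:,} (int)")
```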

Tried and reverted (kept in history for context)

Evidence repartitionByRange for compression: cut evidence size by 36%, but its global sort added ~500 s. This trades compute for storage, so it was reverted.
Bumping partition_count.interactions_evidence 6→16 — no speed gain and reintroduced spill, because AQE coalescePartitions merges the extra partitions back toward advisoryPartitionSizeInBytes. Reverted; the real lever is advisoryPartitionSizeInBytes, not the repartition count.
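
Why the partition count is not the lever can be seen in a toy model of AQE partition coalescing. This is a simplified sketch, not Spark's implementation: it greedily merges adjacent shuffle partitions until adding the next one would exceed the advisory size, and all sizes are hypothetical. In Spark the corresponding settings are spark.sql.adaptive.coalescePartitions.enabled and spark.sql.adaptive.advisoryPartitionSizeInBytes.

```python
def coalesce_adjacent(sizes_mb, advisory_mb):
    """Greedily merge adjacent partitions up to the advisory size (toy model)."""
    merged = []
    current = 0
    for size in sizes_mb:
        # Close the current group if adding this partition would overshoot.
        if current > 0 and current + size > advisory_mb:
            merged.append(current)
            current = 0
        current += size
    if current > 0:
        merged.append(current)
    return merged


# Hypothetical 96 MB of shuffle output, split 6 ways or 16 ways.
six_way = [16] * 6
sixteen_way = [6] * 16

# With a 64 MB advisory size, both layouts collapse to two partitions.
print(coalesce_adjacent(six_way, 64))      # [64, 32]
print(coalesce_adjacent(sixteen_way, 64))  # [60, 36]

# Lowering the advisory size is what actually keeps more partitions.
print(len(coalesce_adjacent(sixteen_way, 16)))  # 8
```

In this model, raising the requested count from 6 to 16 changes nothing after coalescing, whereas lowering the advisory size does, which matches the behaviour described above.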

Test plan

pytest test/test_interaction.py: 13/13 passing (added tests: interactionId is LongType; two rows differing only by species get different ids)
Step runs end-to-end on 26.06 release data (dev topology)
Row counts + FK integrity identical to reference; interactionId unique, 0 collisions

🤖 Generated with Claude Code

d0choa added 8 commits June 30, 2026 11:04

38af0e1 perf(interaction): binary interactionId + single-source key aligned with aggregation
ab38f92 perf(interaction): repartition writes to remove single-task straggler
78e71e0 perf(interaction): repartition cached frames before cache; exception-safe unpersist
3926bc6 docs(interaction): update module docstring for 9-col binary interactionId
3fbb837 perf(interaction): parallelise STRING gzip at read; range-partition evidence by interactionId
607308f perf(interaction): revert evidence range-partition (compute-for-storage tradeoff); keep STRING read repartition
0919505 perf(interaction): integer xxhash64 interactionId + bump evidence partitions to 16
a396f42 perf(interaction): revert evidence partition bump (AQE coalesces it; no speed gain, adds spill)

d0choa wants to merge 8 commits into refactor/interaction-optimisation from perf/interaction-bottlenecks
