refactor(interaction): normalise schema + performance optimisations by DSuveges · Pull Request #145 · opentargets/pts

DSuveges · 2026-06-29T15:57:18Z

Summary

1. Schema normalisation — `interactionId` synthetic key

The seven columns that uniquely identify an interaction pair (sourceDatabase, targetA, intA, intABiologicalRole, targetB, intB, intBBiologicalRole) were previously duplicated across both output datasets. The evidence table now carries only a synthetic interactionId (SHA-256 hash of those columns) as a foreign key into the interactions table.

interactions — unchanged dimension columns + new interactionId primary key
interactions_evidence — interactionId replaces the 10 dropped identity columns (targetA/B, intA/B, intASource/BSource, intABiologicalRole/intBBiologicalRole, speciesA/B)

Implementation:

_INTERACTION_KEY_COLS / _DIMENSION_COLS constants define the key and the dropped set
_interaction_id_expr() returns a sha2(concat_ws('\x01', coalesce(col, '')), 256) Column expression; null-safe, collision-resistant
sourceDatabase retained as a flat column in _generate_interactions output (previously dropped) so downstream steps can hash it without struct access
_select_fields renamed → _select_interaction_fields; new _select_evidence_fields() computes interactionId and drops _DIMENSION_COLS
_generate_interactions_agg appends interactionId post-aggregation
unionByName(allowMissingColumns=True) replaces the manual column-alignment loop

2. Performance optimisations

Three independent bottlenecks removed from the Spark execution plan:

Python UDF eliminated — _GET_CODE (crossed JVM/Python boundary per row, disabled join optimisations) replaced by _get_code() using native f.split / f.trim
Broadcast hash joins — f.broadcast(mapping_info) hint added inside _generate_interactions; both gene-mapping left joins now avoid a full shuffle on the large interaction table
Intermediate caching — intact_interactions_df / string_interactions_df cached after _generate_interactions so the expensive join+explode pipeline is not re-executed for each of the three write actions; intact_fields / string_fields cached so _select_interaction_fields output is shared between the aggregation and evidence paths; all caches unpersisted after writes

Validation — dataset comparison

Run against the 26.06 release data locally (spark.driver.memory: 16g).

Dataset	Reference rows	New rows	Delta
`interactions`	14,617,906	14,617,906	0
`interactions_evidence`	27,407,354	27,407,354	0

Schema changes (as expected):

interactions — 1 column added:

+ interactionId (String, SHA-256 hash)

interactions_evidence — 1 column added, 10 removed:

+ interactionId (String, FK into interactions)
- targetA, - targetB, - intA, - intB, - intABiologicalRole, - intBBiologicalRole, - intASource, - intBSource, - speciesA, - speciesB

Data integrity checks (all passed):

interactionId null count: 0 in both datasets
IDs in evidence not present in interactions: 0 (full referential integrity)
Unique IDs in interactions equals row count: 14,617,906 (every row is unique)
Core interaction columns (sourceDatabase, targetA/B, intA/B, biological roles, count, scoring) compared sorted vs reference: identical
16 shared evidence columns compared sorted vs reference: identical
interactionId hash verified against independent Python SHA-256 implementation on 5 sampled rows: all correct
Source database distribution unchanged: string 13,211,431 · intact 1,310,645 · reactome 56,103 · signor 39,727

Test plan

Unit tests pass (pytest test/test_interaction.py — 11/11)
Step runs end-to-end on 26.06 release data
Row counts identical to reference
All shared columns byte-identical to reference
interactionId referential integrity verified between both output datasets

🤖 Generated with Claude Code

Add a SHA-256 interactionId column (hash of the seven uniqueness key columns) to both output datasets. The interactions_evidence dataset now carries only interactionId as a foreign key into the interactions dataset, removing the duplicate sourceDatabase/targetA/intA/intABiologicalRole/ targetB/intB/intBBiologicalRole columns that were previously repeated in both outputs. Changes: - Add _INTERACTION_KEY_COLS and _DIMENSION_COLS constants - Add _interaction_id_expr() returning the SHA-256 hash Column expression - Retain sourceDatabase as a flat column in _generate_interactions output (previously dropped) so downstream steps can hash it without struct access - Rename _select_fields → _select_interaction_fields; add sourceDatabase - Add _select_evidence_fields() which computes interactionId and drops _DIMENSION_COLS from the evidence projection - _generate_interactions_agg now appends interactionId post-aggregation - interaction() uses _select_evidence_fields for evidence output and unionByName(allowMissingColumns=True) in place of the manual column loop Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Three independent bottlenecks removed from the Spark execution plan: 1. Python UDF (_GET_CODE) replaced with a native Spark SQL function (_get_code using f.split/f.trim). The UDF crossed the JVM/Python boundary on every row, disabled predicate pushdown, and prevented join strategy optimisation. 2. Broadcast hint added for the mapping lookup table in _generate_interactions. Both gene-mapping joins now use broadcast hash join instead of sort-merge join, eliminating the full shuffle on the large interaction table for each join. 3. Intermediate DataFrames cached at two levels: - intact_interactions_df / string_interactions_df: the output of _generate_interactions (expensive join + explode) is shared across three write actions (aggregated, evidence, unmatched) instead of being recomputed per action. - intact_fields / string_fields: the result of _select_interaction_fields is shared between _generate_interactions_agg and _select_evidence_fields instead of re-scanning the join output twice. All caches are unpersisted after the writes complete. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

DSuveges and others added 2 commits June 29, 2026 16:08

DSuveges requested a review from d0choa June 29, 2026 15:57

d0choa mentioned this pull request Jun 30, 2026

perf(interaction): fix write straggler + cache spill; integer interactionId #147

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(interaction): normalise schema + performance optimisations#145

refactor(interaction): normalise schema + performance optimisations#145
DSuveges wants to merge 2 commits into
mainfrom
refactor/interaction-optimisation

DSuveges commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

DSuveges commented Jun 29, 2026

Summary

1. Schema normalisation — interactionId synthetic key

2. Performance optimisations

Validation — dataset comparison

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. Schema normalisation — `interactionId` synthetic key