Skip to content

refactor(interaction): normalise schema + performance optimisations#145

Open
DSuveges wants to merge 2 commits into
mainfrom
refactor/interaction-optimisation
Open

refactor(interaction): normalise schema + performance optimisations#145
DSuveges wants to merge 2 commits into
mainfrom
refactor/interaction-optimisation

Conversation

@DSuveges

Copy link
Copy Markdown
Contributor

Summary

1. Schema normalisation — interactionId synthetic key

The seven columns that uniquely identify an interaction pair (sourceDatabase, targetA, intA, intABiologicalRole, targetB, intB, intBBiologicalRole) were previously duplicated across both output datasets. The evidence table now carries only a synthetic interactionId (SHA-256 hash of those columns) as a foreign key into the interactions table.

interactions — unchanged dimension columns + new interactionId primary key
interactions_evidenceinteractionId replaces the 10 dropped identity columns (targetA/B, intA/B, intASource/BSource, intABiologicalRole/intBBiologicalRole, speciesA/B)

Implementation:

  • _INTERACTION_KEY_COLS / _DIMENSION_COLS constants define the key and the dropped set
  • _interaction_id_expr() returns a sha2(concat_ws('\x01', coalesce(col, '')), 256) Column expression; null-safe, collision-resistant
  • sourceDatabase retained as a flat column in _generate_interactions output (previously dropped) so downstream steps can hash it without struct access
  • _select_fields renamed → _select_interaction_fields; new _select_evidence_fields() computes interactionId and drops _DIMENSION_COLS
  • _generate_interactions_agg appends interactionId post-aggregation
  • unionByName(allowMissingColumns=True) replaces the manual column-alignment loop

2. Performance optimisations

Three independent bottlenecks removed from the Spark execution plan:

  • Python UDF eliminated_GET_CODE (crossed JVM/Python boundary per row, disabled join optimisations) replaced by _get_code() using native f.split / f.trim
  • Broadcast hash joinsf.broadcast(mapping_info) hint added inside _generate_interactions; both gene-mapping left joins now avoid a full shuffle on the large interaction table
  • Intermediate cachingintact_interactions_df / string_interactions_df cached after _generate_interactions so the expensive join+explode pipeline is not re-executed for each of the three write actions; intact_fields / string_fields cached so _select_interaction_fields output is shared between the aggregation and evidence paths; all caches unpersisted after writes

Validation — dataset comparison

Run against the 26.06 release data locally (spark.driver.memory: 16g).

Dataset Reference rows New rows Delta
interactions 14,617,906 14,617,906 0
interactions_evidence 27,407,354 27,407,354 0

Schema changes (as expected):

interactions — 1 column added:

  • + interactionId (String, SHA-256 hash)

interactions_evidence — 1 column added, 10 removed:

  • + interactionId (String, FK into interactions)
  • - targetA, - targetB, - intA, - intB, - intABiologicalRole, - intBBiologicalRole, - intASource, - intBSource, - speciesA, - speciesB

Data integrity checks (all passed):

  • interactionId null count: 0 in both datasets
  • IDs in evidence not present in interactions: 0 (full referential integrity)
  • Unique IDs in interactions equals row count: 14,617,906 (every row is unique)
  • Core interaction columns (sourceDatabase, targetA/B, intA/B, biological roles, count, scoring) compared sorted vs reference: identical
  • 16 shared evidence columns compared sorted vs reference: identical
  • interactionId hash verified against independent Python SHA-256 implementation on 5 sampled rows: all correct
  • Source database distribution unchanged: string 13,211,431 · intact 1,310,645 · reactome 56,103 · signor 39,727

Test plan

  • Unit tests pass (pytest test/test_interaction.py — 11/11)
  • Step runs end-to-end on 26.06 release data
  • Row counts identical to reference
  • All shared columns byte-identical to reference
  • interactionId referential integrity verified between both output datasets

🤖 Generated with Claude Code

DSuveges and others added 2 commits June 29, 2026 16:08
Add a SHA-256 interactionId column (hash of the seven uniqueness key
columns) to both output datasets.  The interactions_evidence dataset now
carries only interactionId as a foreign key into the interactions dataset,
removing the duplicate sourceDatabase/targetA/intA/intABiologicalRole/
targetB/intB/intBBiologicalRole columns that were previously repeated in
both outputs.

Changes:
- Add _INTERACTION_KEY_COLS and _DIMENSION_COLS constants
- Add _interaction_id_expr() returning the SHA-256 hash Column expression
- Retain sourceDatabase as a flat column in _generate_interactions output
  (previously dropped) so downstream steps can hash it without struct access
- Rename _select_fields → _select_interaction_fields; add sourceDatabase
- Add _select_evidence_fields() which computes interactionId and drops
  _DIMENSION_COLS from the evidence projection
- _generate_interactions_agg now appends interactionId post-aggregation
- interaction() uses _select_evidence_fields for evidence output and
  unionByName(allowMissingColumns=True) in place of the manual column loop

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three independent bottlenecks removed from the Spark execution plan:

1. Python UDF (_GET_CODE) replaced with a native Spark SQL function
   (_get_code using f.split/f.trim).  The UDF crossed the JVM/Python
   boundary on every row, disabled predicate pushdown, and prevented join
   strategy optimisation.

2. Broadcast hint added for the mapping lookup table in
   _generate_interactions.  Both gene-mapping joins now use broadcast hash
   join instead of sort-merge join, eliminating the full shuffle on the
   large interaction table for each join.

3. Intermediate DataFrames cached at two levels:
   - intact_interactions_df / string_interactions_df: the output of
     _generate_interactions (expensive join + explode) is shared across
     three write actions (aggregated, evidence, unmatched) instead of
     being recomputed per action.
   - intact_fields / string_fields: the result of _select_interaction_fields
     is shared between _generate_interactions_agg and _select_evidence_fields
     instead of re-scanning the join output twice.
   All caches are unpersisted after the writes complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant