refactor(interaction): normalise schema + performance optimisations#145
Open
DSuveges wants to merge 2 commits into
Add a SHA-256 interactionId column (hash of the seven uniqueness key columns) to both output datasets. The interactions_evidence dataset now carries only interactionId as a foreign key into the interactions dataset, removing the duplicate sourceDatabase/targetA/intA/intABiologicalRole/targetB/intB/intBBiologicalRole columns that were previously repeated in both outputs.

Changes:
- Add _INTERACTION_KEY_COLS and _DIMENSION_COLS constants
- Add _interaction_id_expr() returning the SHA-256 hash Column expression
- Retain sourceDatabase as a flat column in _generate_interactions output (previously dropped) so downstream steps can hash it without struct access
- Rename _select_fields → _select_interaction_fields; add sourceDatabase
- Add _select_evidence_fields() which computes interactionId and drops _DIMENSION_COLS from the evidence projection
- _generate_interactions_agg now appends interactionId post-aggregation
- interaction() uses _select_evidence_fields for evidence output and unionByName(allowMissingColumns=True) in place of the manual column loop

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three independent bottlenecks removed from the Spark execution plan:
1. Python UDF (_GET_CODE) replaced with a native Spark SQL function
(_get_code using f.split/f.trim). The UDF crossed the JVM/Python
boundary on every row, disabled predicate pushdown, and prevented join
strategy optimisation.
2. Broadcast hint added for the mapping lookup table in
_generate_interactions. Both gene-mapping joins now use broadcast hash
join instead of sort-merge join, eliminating the full shuffle on the
large interaction table for each join.
3. Intermediate DataFrames cached at two levels:
- intact_interactions_df / string_interactions_df: the output of
_generate_interactions (expensive join + explode) is shared across
three write actions (aggregated, evidence, unmatched) instead of
being recomputed per action.
- intact_fields / string_fields: the result of _select_interaction_fields
is shared between _generate_interactions_agg and _select_evidence_fields
instead of re-scanning the join output twice.
All caches are unpersisted after the writes complete.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary

1. Schema normalisation: `interactionId` synthetic key

The seven columns that uniquely identify an interaction pair (`sourceDatabase`, `targetA`, `intA`, `intABiologicalRole`, `targetB`, `intB`, `intBBiologicalRole`) were previously duplicated across both output datasets. The evidence table now carries only a synthetic `interactionId` (SHA-256 hash of those columns) as a foreign key into the interactions table.

- `interactions`: unchanged dimension columns + new `interactionId` primary key
- `interactions_evidence`: `interactionId` replaces the 10 dropped identity columns (`targetA/B`, `intA/B`, `intASource/BSource`, `intABiologicalRole/intBBiologicalRole`, `speciesA/B`)

Implementation:

- `_INTERACTION_KEY_COLS` / `_DIMENSION_COLS` constants define the key and the dropped set
- `_interaction_id_expr()` returns a `sha2(concat_ws('\x01', coalesce(col, '')), 256)` Column expression; null-safe, collision-resistant
- `sourceDatabase` retained as a flat column in `_generate_interactions` output (previously dropped) so downstream steps can hash it without struct access
- `_select_fields` renamed → `_select_interaction_fields`; new `_select_evidence_fields()` computes `interactionId` and drops `_DIMENSION_COLS`
- `_generate_interactions_agg` appends `interactionId` post-aggregation
- `unionByName(allowMissingColumns=True)` replaces the manual column-alignment loop

2. Performance optimisations
Three independent bottlenecks removed from the Spark execution plan:

- `_GET_CODE` (crossed JVM/Python boundary per row, disabled join optimisations) replaced by `_get_code()` using native `f.split`/`f.trim`
- `f.broadcast(mapping_info)` hint added inside `_generate_interactions`; both gene-mapping left joins now avoid a full shuffle on the large interaction table
- `intact_interactions_df` / `string_interactions_df` cached after `_generate_interactions` so the expensive join+explode pipeline is not re-executed for each of the three write actions; `intact_fields` / `string_fields` cached so `_select_interaction_fields` output is shared between the aggregation and evidence paths; all caches unpersisted after writes

Validation: dataset comparison
Run against the 26.06 release data locally (`spark.driver.memory: 16g`).

[Comparison table for `interactions` and `interactions_evidence`: cell values not preserved]

Schema changes (as expected):

- `interactions`: 1 column added
  - `+ interactionId` (String, SHA-256 hash)
- `interactions_evidence`: 1 column added, 10 removed
  - `+ interactionId` (String, FK into interactions)
  - `- targetA`, `- targetB`, `- intA`, `- intB`, `- intABiologicalRole`, `- intBBiologicalRole`, `- intASource`, `- intBSource`, `- speciesA`, `- speciesB`

Data integrity checks (all passed):

- `interactionId` null count: 0 in both datasets
- Columns (`sourceDatabase`, `targetA/B`, `intA/B`, biological roles, `count`, `scoring`) compared sorted vs reference: identical
- `interactionId` hash verified against independent Python SHA-256 implementation on 5 sampled rows: all correct
- Counts per `sourceDatabase`: `string` 13,211,431 · `intact` 1,310,645 · `reactome` 56,103 · `signor` 39,727

Test plan

- `pytest test/test_interaction.py`: 11/11
- `interactionId` referential integrity verified between both output datasets

🤖 Generated with Claude Code
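For reviewers who want to repeat the hash cross-check, the following is a minimal standard-library sketch of what an independent Python implementation of `sha2(concat_ws('\x01', coalesce(col, '')), 256)` could look like. It assumes the key columns are hashed in the order listed in the description and that strings are encoded as UTF-8. The function name `interaction_id` and the sample row are hypothetical, not taken from the PR.

```python
import hashlib
from typing import Mapping, Optional

# Key columns in the order given in the PR description (assumed to match
# the order used by _INTERACTION_KEY_COLS).
INTERACTION_KEY_COLS = (
    "sourceDatabase",
    "targetA",
    "intA",
    "intABiologicalRole",
    "targetB",
    "intB",
    "intBBiologicalRole",
)


def interaction_id(row: Mapping[str, Optional[str]]) -> str:
    """Recompute the SHA-256 interaction key for one row outside Spark."""
    # Mirror coalesce(col, ''): a missing or null value becomes an empty string,
    # so every column still contributes its separator position.
    parts = [row.get(col) or "" for col in INTERACTION_KEY_COLS]
    # Mirror concat_ws('\x01', ...): join with the control character U+0001.
    joined = "\x01".join(parts)
    # Spark's sha2 returns a lowercase hex string, as does hexdigest().
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Illustrative row only; the identifiers are made up for the example.
sample = {
    "sourceDatabase": "intact",
    "targetA": "ENSG_A",
    "intA": "P_A",
    "intABiologicalRole": None,
    "targetB": "ENSG_B",
    "intB": "P_B",
    "intBBiologicalRole": "unspecified role",
}
print(interaction_id(sample))
```

Comparing this value with the `interactionId` column for the same row gives a check that does not depend on Spark's own hashing. Because nulls are coalesced to empty strings, a null and an empty string in the same column produce the same key, which is worth keeping in mind if the two need to be distinguished.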