fix(session): provide spark-nlp package in-app on Dataproc (OnToma guard) #146
Draft
d0choa wants to merge 1 commit into
On Dataproc the Spark-NLP jars are loaded from a pre-staged fat jar via the cluster-level spark.jars property (orchestration clusters.yaml), avoiding per-job Ivy resolution against Maven Central. OnToma guards on spark.jars.packages containing 'spark-nlp', so set it in the application's SparkConf (post-SparkSubmit, no Ivy) to satisfy that check for all OnToma-using steps without relying on the cluster-level package. Companion to opentargets/issues#4453.
d0choa added a commit to opentargets/orchestration that referenced this pull request Jun 30, 2026
Replaces the fixed-bucket staging with a PIS-managed flow so the Spark-NLP version is pinned in unified_pipeline.yaml and the jar travels with each release:

- pis.yaml: new spark_nlp_jars step copies the JSL fat jar to input/jars/spark-nlp-assembly-<v>.jar in the release bucket.
- unified_pipeline.yaml: add spark_nlp_version + the pis_spark_nlp_jars step.
- unified_pipeline.py: expose spark_nlp_version to the pis context and release_uri + spark_nlp_version to the clusters context; gate the pts cluster's creation on pis_spark_nlp_jars so the jar exists before the cluster is brought up (pts_openfda / pts_association don't use Spark-NLP).
- clusters.yaml: pts spark.jars now points at the release-staged jar.

Still avoids per-job Ivy resolution against Maven Central (opentargets/issues#4453); pts session sets spark.jars.packages in-app to satisfy OnToma's guard.

Pair with opentargets/pts#146 and opentargets/pis (spark_nlp_jars step).
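The staged jar location described in the commit message can be sketched as a small path builder. This is only an illustration under stated assumptions: the function name `staged_spark_nlp_jar` and the example bucket URI are hypothetical, and the actual templating in clusters.yaml is not shown in this thread. Only the `input/jars/spark-nlp-assembly-<v>.jar` layout comes from the commit message.

```python
def staged_spark_nlp_jar(release_uri: str, spark_nlp_version: str) -> str:
    """Build the release-bucket path of the staged Spark-NLP fat jar.

    Hypothetical helper: only the input/jars/spark-nlp-assembly-<v>.jar
    layout is taken from the commit message above.
    """
    # Drop a trailing slash so the joined URI never contains "//input".
    base = release_uri.rstrip("/")
    return f"{base}/input/jars/spark-nlp-assembly-{spark_nlp_version}.jar"


# Example values are placeholders, not a real bucket or a confirmed version pin.
example_jar = staged_spark_nlp_jar("gs://example-release-bucket/26.06/", "6.1.3")
print(example_jar)
```

Deriving the path from `release_uri` and `spark_nlp_version` is what lets the jar travel with each release instead of living in a fixed bucket.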
What
On Dataproc, set `spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.3` in the application's `SparkConf` (the Dataproc branch of `Session._create_config`).

Why
This is the pts half of the fix for opentargets/issues#4453 (PTS Dataproc jobs
re-run Spark-NLP Ivy resolution on every step).
The companion orchestration change makes the `pts` cluster load Spark-NLP from a
pre-resolved fat jar via `spark:spark.jars` (staged into the release bucket by a
new PIS step) instead of `spark:spark.jars.packages` (which triggers Ivy
resolution against Maven Central on every `spark-submit`). `spark.jars` does
not populate `spark.jars.packages`, but `OnToma.__init__` guards on
`spark.jars.packages` containing `'spark-nlp'` (`ontoma/ontoma.py`) before it
will run, and ~9 PTS steps construct `OnToma` (every `add_efo_mapping` caller +
`ontoma_lut_generation`).

Setting it here, in the app's `SparkConf`, happens after SparkSubmit has
already started the driver JVM, so it satisfies the guard without triggering
Ivy package resolution. This is the same mechanism the literature-pipeline
launcher already relies on.
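The interaction between the in-app setting and the guard can be modelled in a few lines. This is a minimal sketch using plain dictionaries in place of a real `SparkConf`. `build_app_conf` and `guard_passes` are hypothetical stand-ins, not the actual `Session._create_config` or `OnToma.__init__` code, and the jar URI is a placeholder.

```python
SPARK_NLP_PACKAGE = "com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.3"


def build_app_conf(on_dataproc: bool) -> dict[str, str]:
    """Hypothetical stand-in for the Dataproc branch of Session._create_config."""
    conf: dict[str, str] = {}
    if on_dataproc:
        # Set in-app, after the driver has started, so it only labels the
        # session and does not go through spark-submit's package resolution.
        conf["spark.jars.packages"] = SPARK_NLP_PACKAGE
    return conf


def guard_passes(conf: dict[str, str]) -> bool:
    """Simplified model of the guard: 'spark-nlp' must appear in spark.jars.packages."""
    return "spark-nlp" in conf.get("spark.jars.packages", "")


# Cluster-level property only (placeholder URI): the jar is on the classpath,
# but spark.jars.packages is unset, so the guard would refuse to run.
cluster_conf = {
    "spark.jars": "gs://example-release-bucket/input/jars/spark-nlp-assembly-6.1.3.jar"
}

# Cluster-level property plus the in-app setting: the guard is satisfied.
merged_conf = {**cluster_conf, **build_app_conf(on_dataproc=True)}

print(guard_passes(cluster_conf), guard_passes(merged_conf))
```

The point of the sketch is that the guard inspects only `spark.jars.packages`, so a jar supplied through `spark.jars` alone is invisible to it.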
How it fits together (3 PRs)
- opentargets/pis: `spark_nlp_jars` PIS step, stage the fat jar into the release bucket
- opentargets/orchestration: point the `pts` cluster's `spark.jars` at the staged jar
- opentargets/pts (this PR): set `spark.jars.packages` in-app so OnToma's guard passes

Merge order: this PR is safe to merge first or independently. While the cluster
still sets `spark.jars.packages`, this change just sets the same value redundantly. The
orchestration switch should only ship against a `pts_version` that includes this
commit, so the OnToma steps never lose their guard value.
Test
- `uv run ruff check` passes.
- For opentargets/issues#4453 (PTS Dataproc jobs re-run Spark-NLP Ivy resolution on every step): driver logs show no Ivy `:: resolution report ::` block, OnToma steps still map diseases, non-OnToma steps start clean.
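The driver-log check could be automated with a small scan for the Ivy marker. This is a sketch only: `ivy_resolution_lines` is a hypothetical helper, and the sample log text below is an invented placeholder rather than real Dataproc output. Only the `:: resolution report ::` marker comes from the description above.

```python
IVY_MARKER = ":: resolution report ::"


def ivy_resolution_lines(log_text: str) -> list[int]:
    """Return the 1-based line numbers of driver-log lines containing the Ivy marker."""
    return [
        number
        for number, line in enumerate(log_text.splitlines(), start=1)
        if IVY_MARKER in line
    ]


# Placeholder logs for illustration; real driver logs will look different.
clean_log = "step started\nstep finished\n"
noisy_log = "step started\n\t:: resolution report :: (placeholder)\nstep finished\n"

print(ivy_resolution_lines(clean_log))
print(ivy_resolution_lines(noisy_log))
```

An empty result for every PTS step's driver log would correspond to the "no Ivy resolution" expectation. A non-empty result points at the line where resolution was reported.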
Refs opentargets/issues#4453