fix(session): provide spark-nlp package in-app on Dataproc (OnToma guard)#146

Draft
d0choa wants to merge 1 commit into main from fix/spark-nlp-prestaged-jars

@d0choa d0choa commented Jun 30, 2026

What

On Dataproc, set spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.3
in the application's SparkConf (the Dataproc branch of
Session._create_config).

Why

This is the pts half of the fix for opentargets/issues#4453 (PTS Dataproc jobs
re-run Spark-NLP Ivy resolution on every step).

The companion orchestration change makes the pts cluster load Spark-NLP from a
pre-resolved fat jar via spark:spark.jars (staged into the release bucket by a
new PIS step) instead of spark:spark.jars.packages (which triggers Ivy
resolution against Maven Central on every spark-submit). spark.jars does
not populate spark.jars.packages, but OnToma.__init__ guards on
spark.jars.packages containing 'spark-nlp' (ontoma/ontoma.py) before it
will run — and ~9 PTS steps construct OnToma (every add_efo_mapping caller +
ontoma_lut_generation).

Setting the property here, in the app's SparkConf, happens after SparkSubmit has
already started the driver JVM, so it satisfies the guard without triggering
Ivy package resolution. This is the same mechanism the literature-pipeline
launcher already relies on.
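
As a rough illustration of the reasoning above, the sketch below uses plain dicts in place of a real SparkConf so that it runs without Spark. The function names, the example bucket path, and the exact form of the guard are assumptions for illustration. They are not the actual code in Session._create_config or ontoma/ontoma.py.

```python
# Hypothetical sketch: plain dicts stand in for SparkConf so this runs without Spark.
SPARK_NLP_PACKAGE = "com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.3"


def with_spark_nlp_package(props: dict, on_dataproc: bool) -> dict:
    """Return a copy of props with spark.jars.packages set when on Dataproc."""
    updated = dict(props)
    if on_dataproc:
        # Overwrites any existing value; while the cluster still sets the
        # same package, this is simply redundant.
        updated["spark.jars.packages"] = SPARK_NLP_PACKAGE
    return updated


def ontoma_guard_passes(props: dict) -> bool:
    """Approximation of the described guard: the property must mention spark-nlp."""
    return "spark-nlp" in props.get("spark.jars.packages", "")


# Cluster-level properties after the orchestration switch: only spark.jars is set.
# The bucket name is made up for this example.
cluster_props = {
    "spark.jars": "gs://example-bucket/input/jars/spark-nlp-assembly-6.1.3.jar"
}
app_props = with_spark_nlp_package(cluster_props, on_dataproc=True)

print(ontoma_guard_passes(cluster_props))  # guard on cluster properties alone
print(ontoma_guard_passes(app_props))      # guard after the in-app setting
```

With only spark.jars set, the approximated guard fails. Once the application adds spark.jars.packages, it passes, and the cluster-level spark.jars entry is left untouched.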

How it fits together (3 PRs)

| Repo | PR | Role |
| --- | --- | --- |
| opentargets/pis | #204 | spark_nlp_jars PIS step: stage the fat jar into the release bucket |
| opentargets/orchestration | #217 | run the PIS step + point the pts cluster's spark.jars at the staged jar |
| this (pts) | #146 | set spark.jars.packages in-app so OnToma's guard passes |

Merge order: this PR is safe to merge first/independently — while the cluster
still sets spark.jars.packages, it just sets the same value redundantly. The
orchestration switch should only ship against a pts_version that includes this
commit, so the OnToma steps never lose their guard value.

Test

Refs opentargets/issues#4453

On Dataproc the Spark-NLP jars are loaded from a pre-staged fat jar via the
cluster-level spark.jars property (orchestration clusters.yaml), avoiding
per-job Ivy resolution against Maven Central. OnToma guards on
spark.jars.packages containing 'spark-nlp', so set it in the application's
SparkConf (post-SparkSubmit, no Ivy) to satisfy that check for all
OnToma-using steps without relying on the cluster-level package.

Companion to opentargets/issues#4453.
d0choa added a commit to opentargets/orchestration that referenced this pull request Jun 30, 2026
Replaces the fixed-bucket staging with a PIS-managed flow so the Spark-NLP
version is pinned in unified_pipeline.yaml and the jar travels with each
release:

- pis.yaml: new spark_nlp_jars step copies the JSL fat jar to
  input/jars/spark-nlp-assembly-<v>.jar in the release bucket.
- unified_pipeline.yaml: add spark_nlp_version + the pis_spark_nlp_jars step.
- unified_pipeline.py: expose spark_nlp_version to the pis context and
  release_uri + spark_nlp_version to the clusters context; gate the pts
  cluster's creation on pis_spark_nlp_jars so the jar exists before the
  cluster is brought up (pts_openfda / pts_association don't use Spark-NLP).
- clusters.yaml: pts spark.jars now points at the release-staged jar.

Still avoids per-job Ivy resolution against Maven Central (opentargets/issues#4453);
pts session sets spark.jars.packages in-app to satisfy OnToma's guard.

Pair with opentargets/pts#146 and opentargets/pis (spark_nlp_jars step).
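
For orientation, the staged jar location described in the commit message could be derived roughly as follows. This is a minimal sketch, assuming the path is simply release_uri joined with input/jars/spark-nlp-assembly-<v>.jar. The helper name and the example release URI are hypothetical, and the real templating in clusters.yaml may differ.

```python
def staged_spark_nlp_jar(release_uri: str, spark_nlp_version: str) -> str:
    """Build the assumed URI of the staged fat jar inside the release bucket."""
    base = release_uri.rstrip("/")  # tolerate a trailing slash on the release URI
    return f"{base}/input/jars/spark-nlp-assembly-{spark_nlp_version}.jar"


# Hypothetical release URI; the version matches the package pinned in this PR.
jar_uri = staged_spark_nlp_jar("gs://example-release-bucket/26.06/", "6.1.3")
print(jar_uri)
```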
@d0choa d0choa marked this pull request as draft June 30, 2026 15:35