fix(session): provide spark-nlp package in-app on Dataproc (OnToma guard) by d0choa · Pull Request #146 · opentargets/pts

d0choa · 2026-06-30T14:02:57Z

What

On Dataproc, set spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.3
in the application's SparkConf (the Dataproc branch of
Session._create_config).
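
A minimal sketch of what the Dataproc branch could look like, assuming PySpark's `SparkConf`. The helper names `dataproc_settings` and `build_dataproc_conf` are hypothetical and not taken from the pts codebase; the real `Session._create_config` may be structured differently.

```python
# Hypothetical sketch; not the actual pts implementation.
SPARK_NLP_PACKAGE = "com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.3"


def dataproc_settings() -> list[tuple[str, str]]:
    """Return the extra settings applied on the Dataproc branch."""
    return [("spark.jars.packages", SPARK_NLP_PACKAGE)]


def build_dataproc_conf():
    """Build a SparkConf carrying the settings above (requires pyspark)."""
    # Imported lazily so the sketch can be loaded without Spark installed.
    from pyspark import SparkConf

    return SparkConf().setAll(dataproc_settings())
```

Keeping the settings in a plain list makes the Dataproc-specific value easy to inspect or test without a Spark installation.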

Why

This is the pts half of the fix for opentargets/issues#4453 (PTS Dataproc jobs
re-run Spark-NLP Ivy resolution on every step).

The companion orchestration change makes the pts cluster load Spark-NLP from a
pre-resolved fat jar via spark:spark.jars (staged into the release bucket by a
new PIS step) instead of spark:spark.jars.packages (which triggers Ivy
resolution against Maven Central on every spark-submit). spark.jars does
not populate spark.jars.packages, but OnToma.__init__ guards on
spark.jars.packages containing 'spark-nlp' (ontoma/ontoma.py) before it
will run — and ~9 PTS steps construct OnToma (every add_efo_mapping caller +
ontoma_lut_generation).
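
To illustrate why the fat jar alone is not enough, here is a rough sketch of a guard of this shape. It is an assumption about how the check behaves, not a copy of `ontoma/ontoma.py`; the function name `spark_nlp_declared` and the `gs://example-release` bucket are made up for illustration.

```python
# Hypothetical reconstruction of the guard's shape; the real check lives in ontoma/ontoma.py.
def spark_nlp_declared(conf: dict[str, str]) -> bool:
    """True when spark.jars.packages mentions spark-nlp."""
    return "spark-nlp" in conf.get("spark.jars.packages", "")


# Cluster-level fat jar only: the classes are available, but the guard value is missing.
jar_only = {
    "spark.jars": "gs://example-release/input/jars/spark-nlp-assembly-6.1.3.jar",
}

# Same cluster plus the in-app setting from this PR.
with_in_app = {
    **jar_only,
    "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.3",
}

print(spark_nlp_declared(jar_only))     # guard would fail
print(spark_nlp_declared(with_in_app))  # guard would pass
```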

Setting it here, in the app's SparkConf, happens after SparkSubmit has
already started the driver JVM, so it satisfies the guard without triggering
Ivy package resolution. Same mechanism the literature-pipeline launcher already
relies on.

How it fits together (3 PRs)

| Repo | PR | Role |
| --- | --- | --- |
| opentargets/pis | #204 | `spark_nlp_jars` PIS step: stage the fat jar into the release bucket |
| opentargets/orchestration | #217 | run the PIS step + point the `pts` cluster's `spark.jars` at the staged jar |
| this (pts) | #146 | set `spark.jars.packages` in-app so OnToma's guard passes |

Merge order: this PR is safe to merge first/independently — while the cluster
still sets spark.jars.packages, it just sets the same value redundantly. The
orchestration switch should only ship against a pts_version that includes this
commit, so the OnToma steps never lose their guard value.

Test

uv run ruff check passes.
Behavioural validation on the next Dataproc run (acceptance criteria in
opentargets/issues#4453): driver logs show no Ivy :: resolution report ::
block, OnToma steps still map diseases, non-OnToma steps start clean.
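
The log check could be scripted roughly as follows. This is only a sketch: the marker string comes from the acceptance criterion above, while the function name `ivy_resolution_ran` and the sample log text are placeholders, not real driver output.

```python
IVY_MARKER = ":: resolution report ::"


def ivy_resolution_ran(driver_log: str) -> bool:
    """Return True if the driver log contains an Ivy resolution report block."""
    return IVY_MARKER in driver_log


# Placeholder log snippets, made up purely to exercise the check.
clean_log = "placeholder driver output\nstep finished\n"
noisy_log = "placeholder driver output\n:: resolution report ::\nstep finished\n"

print(ivy_resolution_ran(clean_log))  # expected outcome after the fix
print(ivy_resolution_ran(noisy_log))  # what the issue describes today
```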

Refs opentargets/issues#4453

On Dataproc the Spark-NLP jars are loaded from a pre-staged fat jar via the cluster-level spark.jars property (orchestration clusters.yaml), avoiding per-job Ivy resolution against Maven Central. OnToma guards on spark.jars.packages containing 'spark-nlp', so set it in the application's SparkConf (post-SparkSubmit, no Ivy) to satisfy that check for all OnToma-using steps without relying on the cluster-level package. Companion to opentargets/issues#4453.

Replaces the fixed-bucket staging with a PIS-managed flow so the Spark-NLP version is pinned in unified_pipeline.yaml and the jar travels with each release:

- pis.yaml: new spark_nlp_jars step copies the JSL fat jar to input/jars/spark-nlp-assembly-<v>.jar in the release bucket.
- unified_pipeline.yaml: add spark_nlp_version + the pis_spark_nlp_jars step.
- unified_pipeline.py: expose spark_nlp_version to the pis context and release_uri + spark_nlp_version to the clusters context; gate the pts cluster's creation on pis_spark_nlp_jars so the jar exists before the cluster is brought up (pts_openfda / pts_association don't use Spark-NLP).
- clusters.yaml: pts spark.jars now points at the release-staged jar.

Still avoids per-job Ivy resolution against Maven Central (opentargets/issues#4453); pts session sets spark.jars.packages in-app to satisfy OnToma's guard. Pair with opentargets/pts#146 and opentargets/pis (spark_nlp_jars step).
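
As a rough illustration of how the staged jar location follows from release_uri and spark_nlp_version, the path could be assembled like this. The helper name `staged_jar_uri`, the example bucket, and the assumption that release_uri is a `gs://` prefix are all hypothetical; the actual templating in clusters.yaml may differ.

```python
def staged_jar_uri(release_uri: str, spark_nlp_version: str) -> str:
    """Build the URI of the fat jar staged under input/jars/ in the release bucket."""
    # Strip a trailing slash so the result never contains "//" in the path part.
    base = release_uri.rstrip("/")
    return f"{base}/input/jars/spark-nlp-assembly-{spark_nlp_version}.jar"


# Example values are illustrative only.
print(staged_jar_uri("gs://example-bucket/26.06/", "6.1.3"))
```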

This was referenced Jun 30, 2026

fix(pts): load spark-nlp from a release-staged fat jar (PIS), not Ivy opentargets/orchestration#217

Draft

PTS Dataproc jobs re-run Spark-NLP Ivy resolution on every step opentargets/issues#4453

Open

d0choa mentioned this pull request Jun 30, 2026

feat(spark_nlp_jars): stage spark-nlp fat jar into the release bucket opentargets/pis#204

Draft

d0choa marked this pull request as draft June 30, 2026 15:35

d0choa wants to merge 1 commit into main from fix/spark-nlp-prestaged-jars
