feat(engine): build lance-spark bundle from main (lance-core 8.x) + pyspark 4.1.2 #94

Open
jiaoew1991 wants to merge 1 commit into develop from enwei/engine-lancespark-main
@jiaoew1991

Collaborator

What

The engine base now builds the lance-spark bundle from upstream main in-image (pinned commit 656c882, lance-core 8.0.0-beta.9) instead of wget-ing the released lance-spark-bundle-4.1_2.13/0.5.1 from Maven, pins pyspark==4.1.2, and drops the Spark-Connect client fat jar.

Why

Native lance-spark writes (df.write.format("lance")) under RayDP need lance #6946, the JNI dispatcher classloader fix (resolve AsyncScanner at JNI_OnLoad and hand the native dispatcher a GlobalRef). Without it, the Rust dispatcher thread calls find_class on the system classloader, but Ray loads job jars in a child job classloader, so the thread panics with "AsyncScanner class not found" and kills the executor on the distributed write.

The released 0.5.1 bundle still embeds pre-fix lance-core 7.0.0. lance-spark main pins lance-core 8.0.0-beta.9 (post-#6946) but no fixed bundle is on Maven yet → build it in-image from a pinned main commit. Swap back to a plain wget once a fixed bundle is published.

Notes

  • Both Maven skips are required on a clean CI .m2:
    • -Dspotless.skip=true — google-java-format breaks under JDK17
    • -Dmaven.javadoc.skip=true — javadoc-plugin 2.9.1 attach-javadocs fails building lance-spark-base (passes locally only because a dev .m2 has cached javadoc artifacts)
  • connect-repl removed: its shaded-Arrow ArrowUtils shadows the real one on the RayDP executor classpath → mapInPandas NoSuchMethodError: ArrowUtils.toArrowSchema.
  • pylance stays 7.0.0 (no 8.x on PyPI; SDK-7 reads format-8 fine).
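
As a rough illustration of the two Maven skips above, the sketch below assembles a build command for the bundle module. Only the module name and the two `-D...skip` properties come from this PR; `-B`, `-pl`, `-am`, and `-DskipTests` are standard Maven options assumed here, and the Dockerfile may invoke Maven differently.

```python
import shlex

# Module name taken from the PR text.
BUNDLE_MODULE = "lance-spark-bundle-4.1_2.13"


def maven_bundle_command(module: str = BUNDLE_MODULE) -> list[str]:
    """Build an argv list for packaging one module plus its dependencies."""
    return [
        "mvn", "-B", "package",
        "-pl", module,                 # assumption: build only the bundle module
        "-am",                         # assumption: also build modules it depends on
        "-DskipTests",                 # assumption: tests are not run in-image
        "-Dspotless.skip=true",        # from the PR: google-java-format breaks under JDK17
        "-Dmaven.javadoc.skip=true",   # from the PR: attach-javadocs fails on a clean .m2
    ]


def as_shell(cmd: list[str]) -> str:
    """Render the argv list as a single shell-safe line, e.g. for a RUN step."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    print(as_shell(maven_bundle_command()))
```

Keeping the flags in one list makes it easy to drop a skip once the upstream build passes without it.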

Validation

End-to-end on RayDP / dev-sydney (RayJob enwei-lance8-smoke, SUCCEEDED):

  • RayDP Spark session up on the new image (pyspark 4.1.2)
  • native distributed df.write.format("lance") downstream of a mapInPandas Arrow UDF — no AsyncScanner panic, no NoSuchMethodError
  • pylance-7.0.0 readback of the 8.x-written table + assert rows == 200_000 — passed
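
A smoke job like this one could start by confirming that the image really carries the pinned versions. The sketch below is a hypothetical pre-flight check, not part of the actual RayJob: it compares installed distributions against the pins named in this PR using `importlib.metadata`. The `check_pins` helper and its `lookup` parameter are illustrative names.

```python
from importlib import metadata

# Pins named in the PR description.
PINS = {"pyspark": "4.1.2", "pylance": "7.0.0"}


def check_pins(pins=PINS, lookup=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package.

    `lookup` is injectable so the check can be exercised without the packages
    installed; by default it queries the current environment.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    mismatches = check_pins()
    for name, (expected, found) in mismatches.items():
        print(f"{name}: expected {expected}, found {found}")
    raise SystemExit(1 if mismatches else 0)
```

An empty result means every pin matches; anything else lists what drifted before the Spark session is even started.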

Single-file change; the publish workflow + raydp-python-pkg + connect-repl-era changes are already on develop (#91/#92/#93).

🤖 Generated with Claude Code

…yspark 4.1.2

Native lance-spark writes (df.write.format("lance")) under RayDP need lance
#6946 — the JNI dispatcher classloader fix (resolve AsyncScanner at JNI_OnLoad
+ pass a GlobalRef to the native dispatcher). Without it the Rust dispatcher
thread does find_class on the *system* classloader, but Ray loads job jars in a
child job classloader, so it panics "AsyncScanner class not found" and kills
the executor on the distributed write.

The released lance-spark-bundle 0.5.1 on Maven still embeds pre-fix lance-core
7.0.0. lance-spark main now pins lance-core 8.0.0-beta.9 (post-#6946) but no
fixed bundle is published to Maven yet, so build it in-image from a pinned main
commit and drop the 0.5.1 wget. Both maven skips are required on a clean CI
.m2: -Dspotless.skip=true (google-java-format breaks on JDK17) and
-Dmaven.javadoc.skip=true (javadoc-plugin 2.9.1 attach-javadocs fails on
lance-spark-base).

Also:
- pin pyspark to ==4.1.2 (was >=4.1,<5) for a reproducible base.
- drop pyspark/jars/connect-repl: the Spark-Connect client fat jar shades Arrow
  under org.sparkproject.* and carries a second ArrowUtils whose toArrowSchema
  returns the shaded Schema, shadowing the real one on the RayDP executor
  classpath and breaking mapInPandas with NoSuchMethodError.

Validated end-to-end on RayDP/dev-sydney: native df.write.format("lance")
downstream of a mapInPandas Arrow UDF, plus a pylance-7.0.0 readback of the
8.x-written table, all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 11, 2026


PR Summary

Medium Risk
Image build now depends on cloning and compiling upstream lance-spark (longer, flakier CI) and changes executor classpath/JNI behavior for distributed Lance and Arrow UDFs, though scope is limited to the engine base Dockerfile.

Overview
Updates the nurion engine base image so RayDP can run native Lance writes and Arrow mapInPandas UDFs reliably.

PySpark is pinned to 4.1.2 (was a >=4.1,<5 range). During jar setup, bundled arrow-*.jar files are removed (Arrow comes from the Lance bundle), the Maven lance-spark-bundle 0.5.1 download is dropped, and connect-repl is removed so a shaded ArrowUtils cannot shadow the real Spark class and break pandas UDFs.

A new build step clones lance-spark at commit 656c882, runs Maven on lance-spark-bundle-4.1_2.13 (with spotless/javadoc skips for clean CI), and installs the resulting fat jar into pyspark/jars. That bundle ships lance-core 8.x with the JNI classloader fix needed for df.write.format("lance") under Ray’s child classloader; the old 0.5.1 artifact embedded pre-fix 7.0.0 and caused AsyncScanner class not found on executors.

Reviewed by Cursor Bugbot for commit 77a2ac5.

@github-actions


Coverage Report

Overall: 58%

Diff Coverage (changed files only)

Diff: origin/develop...HEAD, staged and unstaged changes

No lines with coverage information in this diff.
