test: Spark core tests #19082
Draft
yihua wants to merge 4 commits into
Introduce a curated 'core' test set that can run on every Spark version, while the full unit/functional suites run only on the latest Spark majors.
- core-tests Maven profile: surefire groups=core plus scalatest tagsToInclude=SparkSQLCoreFlow (injected only here; an empty tagsToInclude would emit -n "" and run zero scala tests).
- Exclude the core set from the normal tasks: add 'core' to excludedGroups in unit-tests/functional-tests/-b/-c, and drive the scalatest tagsToExclude via the hoodie.scalatest.tagsToExclude property (default SparkSQLCoreFlow) so core-flow blocks stay out of normal scala runs.
- CoreFlow scalatest Tag object (name matches the @SparkSQLCoreFlow annotation) for per-test taggedAs(CoreFlow).
- Tag the core set: @Tag("core") on representative TestCOWDataSource / TestMORDataSource methods and parquet/orc/InternalRow writers; taggedAs(CoreFlow) on basic INSERT/UPDATE/DELETE/MERGE/CREATE blocks.
- Trim the dead-wired TestSparkSqlCoreFlow anchor to a representative table-type/metadata/keygen/index spread so it fits the per-version budget.
Run the full unit/functional/scala suites only on the latest 3.x (spark3.5/Java11) and latest 4.x (spark4.2/Java17). Add test-spark-core-tests (spark3.3/3.4/3.5) and test-spark-java17-core-tests (spark3.5/4.0/4.1/4.2) that build once and run -Pcore-tests plus the quickstart on every version. Leaves Flink, bundle-validation, docker, and build-only jobs unchanged.
Collapse test-spark-core-tests and test-spark-java17-core-tests into a single matrix-driven job: each matrix entry carries javaVersion and mvnProfiles (-Pjava17 for the Java 17 rows, empty for Java 11), so the JDK setup and mvn invocations are shared. Standardize the module list to include hudi-common on both tracks (harmless on Java 11). Net: 28 -> 27 jobs, no behavior change to which tests run per version.
Set an explicit name template on test-spark-core-tests so the check name shows just scalaProfile, sparkProfile, and javaVersion, hiding the sparkModules and mvnProfiles matrix fields.
Describe the issue this Pull Request addresses
The `Java CI` workflow runs the full Spark datasource test suite once per Spark version and Java/Scala track, producing a large number of checks and high compute cost. Measured per-job timings from a recent master run show ~2,119 runner-minutes per push, ~90% of it in the per-version Spark datasource jobs, most of which re-run Spark-version-independent tests on every version.

This PR runs a curated core subset on every Spark version and the full suites only on the latest 3.x (spark3.5) and latest 4.x (spark4.2).
Summary and Changelog
- `core-tests` Maven profile: surefire `groups=core` plus scalatest `tagsToInclude=org.apache.hudi.functional.SparkSQLCoreFlow`, injected only in this profile (an empty `tagsToInclude` would emit `-n ""` and run zero scala tests).
- `core` added to `excludedGroups` of `unit-tests`/`functional-tests`/`-b`/`-c`, and the scalatest exclude is driven by the `hoodie.scalatest.tagsToExclude` property (default `SparkSQLCoreFlow`).
- `CoreFlow` scalatest `Tag` object (name matches the existing `@SparkSQLCoreFlow` `@TagAnnotation`) for per-test `taggedAs(CoreFlow)`.
- `@Tag("core")` on representative `TestCOWDataSource`/`TestMORDataSource` methods (read/write round-trip, vectorized reader, snapshot/read-optimized/incremental reads, schema evolution) and the parquet/orc/InternalRow writers; `taggedAs(CoreFlow)` on basic INSERT/UPDATE/DELETE/MERGE INTO/CREATE TABLE blocks.
- The `TestSparkSqlCoreFlow` anchor (tagged `@SparkSQLCoreFlow`, previously excluded everywhere and included nowhere) is now run by the core job, trimmed to a representative table-type/metadata/keygen/index spread so it fits the per-version budget.
- `bot.yml`: full-suite datasource jobs reduced to a single version each; one matrix-driven job `test-spark-core-tests` (Java 11: spark3.3/3.4/3.5; Java 17: spark3.5/4.0/4.1/4.2, via per-entry `javaVersion`/`mvnProfiles`) builds once and runs `-Pcore-tests` plus the quickstart on every version. Flink (already tiered), bundle-validation, docker, and build-only jobs are unchanged.

Estimated impact: ~2,119 -> ~1,142 runner-minutes per push (a ~46% reduction), and Spark datasource checks drop from 30 to ~14.
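To make the tiering rule concrete, here is a minimal sketch of include/exclude tag selection. It is a simplified model, not Surefire's or ScalaTest's implementation: the `TagFilter` class and its `coreTier`/`fullTier` helpers are hypothetical names, and the model ignores any other groups the existing profiles may already filter on. It also treats an empty include set as "no include filter", so it does not reproduce the empty `tagsToInclude` pitfall noted above.

```java
import java.util.Set;

// Hypothetical, simplified model of tag-based test selection.
// Illustrates the rule only; real filtering is done by the test runners.
final class TagFilter {
    private final Set<String> include; // empty means "no include filter" in this model
    private final Set<String> exclude;

    TagFilter(Set<String> include, Set<String> exclude) {
        this.include = include;
        this.exclude = exclude;
    }

    // A test is selected if it carries no excluded tag and, when an
    // include filter is set, carries at least one included tag.
    boolean selects(Set<String> testTags) {
        for (String tag : testTags) {
            if (exclude.contains(tag)) {
                return false;
            }
        }
        if (include.isEmpty()) {
            return true;
        }
        for (String tag : testTags) {
            if (include.contains(tag)) {
                return true;
            }
        }
        return false;
    }

    // Assumed shape of the core tier: include only tests tagged "core".
    static TagFilter coreTier() {
        return new TagFilter(Set.of("core"), Set.of());
    }

    // Assumed shape of the normal tasks: run everything except "core".
    static TagFilter fullTier() {
        return new TagFilter(Set.of(), Set.of("core"));
    }

    public static void main(String[] args) {
        Set<String> coreTagged = Set.of("core");
        Set<String> untagged = Set.of();
        System.out.println("core tier runs core-tagged test: " + coreTier().selects(coreTagged));
        System.out.println("core tier runs untagged test:    " + coreTier().selects(untagged));
        System.out.println("full tier runs core-tagged test: " + fullTier().selects(coreTagged));
        System.out.println("full tier runs untagged test:    " + fullTier().selects(untagged));
    }
}
```

Under this model each test runs in exactly one tier per Spark version, which is the property the `excludedGroups` and `tagsToExclude` changes appear intended to guarantee.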
Impact
CI/test-execution only; no production code or public API change. Non-latest Spark versions trade the full per-version suite for a curated core subset; full coverage for all versions can still be restored via a scheduled full-matrix run.
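As a quick sanity check, the ~46% figure quoted in the changelog can be reproduced from the two runner-minute estimates. The sketch below uses only numbers stated in this description; the `CiEstimate` class and `reductionPercent` helper are hypothetical names introduced for illustration.

```java
import java.util.Locale;

// Hypothetical helper for checking the estimates quoted in this PR description.
final class CiEstimate {
    // Percentage reduction going from `before` to `after`.
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        double minutesBefore = 2119; // approximate runner-minutes per push today
        double minutesAfter = 1142;  // approximate runner-minutes per push with this PR
        int checksBefore = 30;       // Spark datasource checks today
        int checksAfter = 14;        // approximate Spark datasource checks with this PR

        System.out.println(String.format(Locale.ROOT,
            "runner-minutes saved: %.0f (%.1f%%)",
            minutesBefore - minutesAfter,
            reductionPercent(minutesBefore, minutesAfter)));
        System.out.println(String.format(Locale.ROOT,
            "datasource checks removed: %d (%.1f%%)",
            checksBefore - checksAfter,
            reductionPercent(checksBefore, checksAfter)));
    }
}
```

Both inputs are themselves approximations, so the result should be read as a rough estimate rather than a measured saving.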
Risk Level
low
CI/test-tiering change only, no source changes. Opened as draft to validate the new matrix in CI before review.
Documentation Update
none
Contributor's checklist