test: Spark core tests by yihua · Pull Request #19082 · apache/hudi

yihua · 2026-06-27T04:59:48Z

Describe the issue this Pull Request addresses

The Java CI workflow runs the full Spark datasource test suite once per Spark version and Java/Scala track, producing a large number of checks and high compute cost. Measured per-job timings from a recent master run show ~2,119 runner-minutes per push, ~90% of it in the per-version Spark datasource jobs, most of which re-run Spark-version-independent tests on every version.

This PR runs a curated core subset on every Spark version and the full suites only on the latest 3.x (spark3.5) and latest 4.x (spark4.2).

Summary and Changelog

New core-tests Maven profile: surefire groups=core plus scalatest tagsToInclude=org.apache.hudi.functional.SparkSQLCoreFlow, injected only in this profile (an empty tagsToInclude would emit -n "" and run zero scala tests).
The core set is excluded from the normal tasks so nothing runs twice on a version: core added to excludedGroups of unit-tests/functional-tests/-b/-c, and the scalatest exclude is driven by the hoodie.scalatest.tagsToExclude property (default SparkSQLCoreFlow).
CoreFlow scalatest Tag object (name matches the existing @SparkSQLCoreFlow @TagAnnotation) for per-test taggedAs(CoreFlow).
Tagged core set: @Tag("core") on representative TestCOWDataSource/TestMORDataSource methods (read/write round-trip, vectorized reader, snapshot/read-optimized/incremental reads, schema evolution) and the parquet/orc/InternalRow writers; taggedAs(CoreFlow) on basic INSERT/UPDATE/DELETE/MERGE INTO/CREATE TABLE blocks.
The previously dead-wired TestSparkSqlCoreFlow anchor (tagged @SparkSQLCoreFlow, excluded everywhere, included nowhere) is now run by the core job, trimmed to a representative table-type/metadata/keygen/index spread so it fits the per-version budget.
bot.yml: full-suite datasource jobs reduced to a single version each; one matrix-driven job test-spark-core-tests (Java 11: spark3.3/3.4/3.5; Java 17: spark3.5/4.0/4.1/4.2, via per-entry javaVersion/mvnProfiles) builds once and runs -Pcore-tests plus the quickstart on every version. Flink (already tiered), bundle-validation, docker, and build-only jobs are unchanged.
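
The "nothing runs twice on a version" property can be illustrated with a small model of tag-based selection. This is only a sketch: the test names and their tags are hypothetical, and the `select` helper is not part of surefire or scalatest. Only the group name `core` comes from this PR.

```python
# Hypothetical test inventory: names and tag assignments are illustrative only.
TESTS = {
    "testCopyOnWriteRoundTrip": {"core"},
    "testSchemaEvolution": {"core"},
    "testClusteringEdgeCase": {"functional"},
    "testArchivalRetention": set(),
}


def select(tests, include=None, exclude=None):
    """Return sorted names of tests that carry an included tag
    (when an include set is given) and carry no excluded tag."""
    include = set(include or ())
    exclude = set(exclude or ())
    chosen = []
    for name, tags in tests.items():
        if include and not (tags & include):
            continue  # an include filter is active and this test does not match it
        if tags & exclude:
            continue  # the test carries an excluded tag
        chosen.append(name)
    return sorted(chosen)


# The core job includes only the core group; the normal tasks exclude it.
core_run = select(TESTS, include={"core"})
normal_run = select(TESTS, exclude={"core"})
```

Under this model the two runs are disjoint and together cover the whole inventory, which is the behavior the `groups` / `excludedGroups` pairing is meant to achieve.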

Estimated impact: ~2,119 -> ~1,142 runner-minutes per push (a ~46% reduction), and Spark datasource checks drop from 30 to ~14.
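
As a quick check of the arithmetic, the figures quoted above can be worked through directly. The inputs are the PR's own estimates; the variable names are just for illustration.

```python
# Estimates quoted in the PR description.
before_minutes = 2119   # runner-minutes per push before the change
after_minutes = 1142    # estimated runner-minutes per push after the change
checks_before = 30      # Spark datasource checks before
checks_after = 14       # approximate Spark datasource checks after

saved_minutes = before_minutes - after_minutes
reduction_pct = 100 * saved_minutes / before_minutes
checks_removed = checks_before - checks_after

print(f"saved {saved_minutes} runner-minutes ({reduction_pct:.1f}% reduction)")
print(f"{checks_removed} fewer Spark datasource checks")
```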

Impact

CI/test-execution only; no production code or public API change. Non-latest Spark versions trade the full per-version suite for a curated core subset; full coverage for all versions can still be restored via a scheduled full-matrix run.

Risk Level

low

CI/test-tiering change only, no source changes. Opened as draft to validate the new matrix in CI before review.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

Introduce a curated 'core' test set that can run on every Spark version, while the full unit/functional suites run only on the latest Spark majors.

- core-tests Maven profile: surefire groups=core plus scalatest tagsToInclude=SparkSQLCoreFlow (injected only here; an empty tagsToInclude would emit -n "" and run zero scala tests).
- Exclude the core set from the normal tasks: add 'core' to excludedGroups in unit-tests/functional-tests/-b/-c, and drive the scalatest tagsToExclude via the hoodie.scalatest.tagsToExclude property (default SparkSQLCoreFlow) so core-flow blocks stay out of normal scala runs.
- CoreFlow scalatest Tag object (name matches the @SparkSQLCoreFlow annotation) for per-test taggedAs(CoreFlow).
- Tag the core set: @Tag("core") on representative TestCOWDataSource / TestMORDataSource methods and parquet/orc/InternalRow writers; taggedAs(CoreFlow) on basic INSERT/UPDATE/DELETE/MERGE/CREATE blocks.
- Trim the dead-wired TestSparkSqlCoreFlow anchor to a representative table-type/metadata/keygen/index spread so it fits the per-version budget.
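
The empty-tagsToInclude pitfall mentioned above can be sketched as follows. This is a simplified model of how tag settings might be turned into ScalaTest Runner arguments (`-n` to include tags, `-l` to exclude tags), not the plugin's actual code; the function name `scalatest_tag_args` is hypothetical.

```python
CORE_FLOW_TAG = "org.apache.hudi.functional.SparkSQLCoreFlow"


def scalatest_tag_args(tags_to_include=None, tags_to_exclude=None):
    """Model of tag settings becoming runner arguments.

    A setting that is present but empty still produces its flag,
    which is the failure mode the commit message describes.
    """
    args = []
    if tags_to_include is not None:
        args += ["-n", tags_to_include]   # include only tests carrying this tag
    if tags_to_exclude is not None:
        args += ["-l", tags_to_exclude]   # exclude tests carrying this tag
    return args


# Normal runs: no include setting at all, core-flow blocks excluded.
normal_args = scalatest_tag_args(tags_to_exclude=CORE_FLOW_TAG)

# Core profile: the include setting exists only here.
core_args = scalatest_tag_args(tags_to_include=CORE_FLOW_TAG)

# The pitfall: an empty include setting still yields -n "",
# an include filter that matches no tag.
broken_args = scalatest_tag_args(tags_to_include="")
```

In this model the only way to avoid the `-n ""` case is to leave the include setting undefined outside the core profile, which matches the "injected only here" wording.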

Run the full unit/functional/scala suites only on the latest 3.x (spark3.5/Java11) and latest 4.x (spark4.2/Java17). Add test-spark-core-tests (spark3.3/3.4/3.5) and test-spark-java17-core-tests (spark3.5/4.0/4.1/4.2) that build once and run -Pcore-tests plus the quickstart on every version. Leaves Flink, bundle-validation, docker, and build-only jobs unchanged.

Collapse test-spark-core-tests and test-spark-java17-core-tests into a single matrix-driven job: each matrix entry carries javaVersion and mvnProfiles (-Pjava17 for the Java 17 rows, empty for Java 11), so the JDK setup and mvn invocations are shared. Standardize the module list to include hudi-common on both tracks (harmless on Java 11). Net: 28 -> 27 jobs, no behavior change to which tests run per version.
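
A rough model of the collapsed matrix, assuming each entry carries only the fields named in this commit message plus the Spark profile. The version lists and the `-Pjava17` / `-Pcore-tests` flags come from the PR; the helper names and the dictionary layout are illustrative, and the real bot.yml entries carry more fields (for example the module list).

```python
JAVA11_SPARK = ["spark3.3", "spark3.4", "spark3.5"]
JAVA17_SPARK = ["spark3.5", "spark4.0", "spark4.1", "spark4.2"]


def build_matrix():
    """One row per (Spark version, Java track) combination."""
    rows = []
    for spark in JAVA11_SPARK:
        # Java 11 rows carry an empty mvnProfiles value.
        rows.append({"sparkProfile": spark, "javaVersion": "11", "mvnProfiles": ""})
    for spark in JAVA17_SPARK:
        rows.append({"sparkProfile": spark, "javaVersion": "17", "mvnProfiles": "-Pjava17"})
    return rows


def profile_flags(row):
    """Maven profile flags shared by every row, plus the per-row extras."""
    flags = ["-Pcore-tests"]
    if row["mvnProfiles"]:
        flags.append(row["mvnProfiles"])
    return " ".join(flags)


matrix = build_matrix()
for row in matrix:
    print(row["sparkProfile"], "java", row["javaVersion"], "->", profile_flags(row))
```

Because the Java track is just data on each row, the JDK setup and Maven steps can be written once and parameterized, which is the point of merging the two jobs.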

Set an explicit name template on test-spark-core-tests so the check name shows just scalaProfile, sparkProfile, and javaVersion, hiding the sparkModules and mvnProfiles matrix fields.

hudi-bot · 2026-06-27T19:43:32Z

CI report:

3c9220a Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

yihua added 4 commits June 26, 2026 18:17

Show only scala/spark/java in the core-tests job name

3c9220a

github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Jun 27, 2026

yihua wants to merge 4 commits into apache:master from yihua:spark-core-tests
