test: Spark core tests #19082

Draft
yihua wants to merge 4 commits into apache:master from yihua:spark-core-tests

yihua (Contributor) commented Jun 27, 2026

Describe the issue this Pull Request addresses

The Java CI workflow runs the full Spark datasource test suite once per Spark version and Java/Scala track, producing a large number of checks and high compute cost. Measured per-job timings from a recent master run show ~2,119 runner-minutes per push, ~90% of it in the per-version Spark datasource jobs, most of which re-run Spark-version-independent tests on every version.

This PR runs a curated core subset on every Spark version and the full suites only on the latest 3.x (spark3.5) and latest 4.x (spark4.2).

Summary and Changelog

  • New core-tests Maven profile: surefire groups=core plus scalatest tagsToInclude=org.apache.hudi.functional.SparkSQLCoreFlow, injected only in this profile (an empty tagsToInclude would emit -n "" and run zero scala tests).
  • The core set is excluded from the normal tasks so nothing runs twice on a version: core added to excludedGroups of unit-tests/functional-tests/-b/-c, and the scalatest exclude is driven by the hoodie.scalatest.tagsToExclude property (default SparkSQLCoreFlow).
  • CoreFlow scalatest Tag object (name matches the existing @SparkSQLCoreFlow @TagAnnotation) for per-test taggedAs(CoreFlow).
  • Tagged core set: @Tag("core") on representative TestCOWDataSource/TestMORDataSource methods (read/write round-trip, vectorized reader, snapshot/read-optimized/incremental reads, schema evolution) and the parquet/orc/InternalRow writers; taggedAs(CoreFlow) on basic INSERT/UPDATE/DELETE/MERGE INTO/CREATE TABLE blocks.
  • The previously dead-wired TestSparkSqlCoreFlow anchor (tagged @SparkSQLCoreFlow, excluded everywhere, included nowhere) is now run by the core job, trimmed to a representative table-type/metadata/keygen/index spread so it fits the per-version budget.
  • bot.yml: full-suite datasource jobs reduced to a single version each; one matrix-driven job test-spark-core-tests (Java 11: spark3.3/3.4/3.5; Java 17: spark3.5/4.0/4.1/4.2, via per-entry javaVersion/mvnProfiles) builds once and runs -Pcore-tests plus the quickstart on every version. Flink (already tiered), bundle-validation, docker, and build-only jobs are unchanged.
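
The include/exclude wiring in the bullets above amounts to a partition of the test set. The sketch below is a toy model in Python, not surefire or scalatest themselves: the test names and the `select` helper are hypothetical, and only the `core` tag name comes from this PR. It illustrates that selecting with groups=core in one job and excludedGroups=core in the others covers every test exactly once, and that an include filter naming only an empty tag selects nothing.

```python
# Toy model of tag-based test selection; test names are illustrative only.
TESTS = {
    "cow_round_trip": {"core"},
    "mor_snapshot_read": {"core"},
    "cow_clustering_edge_case": set(),
    "mor_compaction_edge_case": {"functional"},
}

def select(tests, include=None, exclude=None):
    """Return names of tests that carry an included tag (if given) and no excluded tag."""
    chosen = []
    for name, tags in tests.items():
        if include is not None and not (tags & include):
            continue
        if exclude is not None and (tags & exclude):
            continue
        chosen.append(name)
    return sorted(chosen)

core_job = select(TESTS, include={"core"})     # analogous to groups=core
normal_jobs = select(TESTS, exclude={"core"})  # analogous to excludedGroups=core

# Together the two selections cover every test exactly once.
assert set(core_job).isdisjoint(normal_jobs)
assert set(core_job) | set(normal_jobs) == set(TESTS)

# An include filter naming only the empty tag matches no test at all,
# mirroring the `-n ""` pitfall noted in the first bullet.
empty_include = select(TESTS, include={""})
```

The same reasoning is why the include tag is injected only inside the core-tests profile: a shared property with an empty default would behave like `empty_include` in every other job.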

Estimated impact: ~2,119 -> ~1,142 runner-minutes per push (a ~46% reduction), and Spark datasource checks drop from 30 to ~14.
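
As a quick sanity check on the estimate, the arithmetic can be reproduced directly from the figures quoted above. This is only a sketch using the approximate numbers in this description; the `reduction_pct` helper is a hypothetical name.

```python
# Approximate per-push figures quoted in this PR description.
minutes_before = 2119
minutes_after = 1142
checks_before = 30
checks_after = 14

def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before

minutes_saved = minutes_before - minutes_after
print(f"runner-minutes saved per push: {minutes_saved}")
print(f"runner-minute reduction: {reduction_pct(minutes_before, minutes_after):.1f}%")
print(f"datasource check reduction: {reduction_pct(checks_before, checks_after):.1f}%")
```

With these inputs the saving is 977 runner-minutes per push, about 46.1%, and the check count falls by about 53.3%.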

Impact

CI/test-execution only; no production code or public API change. Non-latest Spark versions trade the full per-version suite for a curated core subset; full coverage for all versions can still be restored via a scheduled full-matrix run.

Risk Level

low

CI/test-tiering change only, no source changes. Opened as draft to validate the new matrix in CI before review.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

yihua added 4 commits June 26, 2026 18:17
Introduce a curated 'core' test set that can run on every Spark version,
while the full unit/functional suites run only on the latest Spark majors.

- core-tests Maven profile: surefire groups=core plus scalatest
  tagsToInclude=SparkSQLCoreFlow (injected only here; an empty
  tagsToInclude would emit -n "" and run zero scala tests).
- Exclude the core set from the normal tasks: add 'core' to excludedGroups
  in unit-tests/functional-tests/-b/-c, and drive the scalatest
  tagsToExclude via the hoodie.scalatest.tagsToExclude property (default
  SparkSQLCoreFlow) so core-flow blocks stay out of normal scala runs.
- CoreFlow scalatest Tag object (name matches the @SparkSQLCoreFlow
  annotation) for per-test taggedAs(CoreFlow).
- Tag the core set: @Tag("core") on representative TestCOWDataSource /
  TestMORDataSource methods and parquet/orc/InternalRow writers;
  taggedAs(CoreFlow) on basic INSERT/UPDATE/DELETE/MERGE/CREATE blocks.
- Trim the dead-wired TestSparkSqlCoreFlow anchor to a representative
  table-type/metadata/keygen/index spread so it fits the per-version budget.

Run the full unit/functional/scala suites only on the latest 3.x
(spark3.5/Java11) and latest 4.x (spark4.2/Java17). Add test-spark-core-tests
(spark3.3/3.4/3.5) and test-spark-java17-core-tests (spark3.5/4.0/4.1/4.2)
that build once and run -Pcore-tests plus the quickstart on every version.
Leaves Flink, bundle-validation, docker, and build-only jobs unchanged.

Collapse test-spark-core-tests and test-spark-java17-core-tests into a
single matrix-driven job: each matrix entry carries javaVersion and
mvnProfiles (-Pjava17 for the Java 17 rows, empty for Java 11), so the
JDK setup and mvn invocations are shared. Standardize the module list to
include hudi-common on both tracks (harmless on Java 11). Net: 28 -> 27
jobs, no behavior change to which tests run per version.

Set an explicit name template on test-spark-core-tests so the check name
shows just scalaProfile, sparkProfile, and javaVersion, hiding the
sparkModules and mvnProfiles matrix fields.
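
The collapsed matrix described in the last two commits can be pictured as a flat list of per-version entries. The sketch below models it in Python rather than reproducing bot.yml: the field names javaVersion, mvnProfiles, and sparkProfile and the version lists come from this PR, while the `check_name` format is an assumption, and the scalaProfile and sparkModules fields are omitted because their values are not given here.

```python
# Spark versions per Java track, as listed in the PR description.
TRACKS = {
    "11": ["spark3.3", "spark3.4", "spark3.5"],
    "17": ["spark3.5", "spark4.0", "spark4.1", "spark4.2"],
}

def build_matrix(tracks):
    """Expand the per-track version lists into one flat list of matrix entries."""
    entries = []
    for java_version, spark_profiles in tracks.items():
        for spark_profile in spark_profiles:
            entries.append({
                "sparkProfile": spark_profile,
                "javaVersion": java_version,
                # Java 17 rows carry -Pjava17; Java 11 rows carry no extra profile.
                "mvnProfiles": "-Pjava17" if java_version == "17" else "",
            })
    return entries

def check_name(entry):
    """Hypothetical display name that hides mvnProfiles, as the name template does."""
    return f"{entry['sparkProfile']}, java{entry['javaVersion']}"

matrix = build_matrix(TRACKS)
for entry in matrix:
    # Every entry runs the core-tests profile plus its track-specific profiles.
    print(check_name(entry), "|", ("-Pcore-tests " + entry["mvnProfiles"]).strip())
```

Carrying the JDK and profile differences as data on each entry is what lets a single job definition serve both tracks; spark3.5 appears twice because it is exercised on both Java 11 and Java 17.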
github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) Jun 27, 2026
@hudi-bot (Collaborator)

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build
