feat: sort input for lsm write #19079

Open
danny0405 wants to merge 2 commits into apache:master from danny0405:sort-input

danny0405 commented Jun 26, 2026

Contributor

Describe the issue this Pull Request addresses

This closes #19078

LSM storage layout write paths need incoming records to participate in sorted file-group merging instead of being handled through the regular merge-handle flow. This matters for preserving LSM ordering semantics across Spark, Flink, and Java write paths, including compaction and CDC-enabled updates.

The PR also keeps native CDC log writing scoped to LSM layout write paths. Native CDC files are written using the table base file format, while CDC reader-side native format support is intentionally left out for a follow-up PR.
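As a rough illustration of what "written using the table base file format" means compared with a hard-coded Parquet extension, here is a minimal sketch. Everything in it is hypothetical: `CdcFileNamingSketch`, `BaseFileFormat`, and the `fileId_instantTime-cdc.ext` naming pattern are assumptions made for this example and are not taken from the PR or from Hudi's actual classes.

```java
import java.util.Locale;

// Hypothetical sketch; none of these names are Hudi classes.
final class CdcFileNamingSketch {

  /** Stand-in for the table's configured base file format. */
  enum BaseFileFormat {
    PARQUET, ORC;

    /** File extension derived from the enum name, e.g. ".orc". */
    String extension() {
      return "." + name().toLowerCase(Locale.ROOT);
    }
  }

  /** Hard-coded variant: the extension ignores the table configuration. */
  static String hardCodedName(String fileId, String instantTime) {
    return fileId + "_" + instantTime + "-cdc.parquet";
  }

  /** Format-aware variant: the extension follows the table base file format. */
  static String formatAwareName(String fileId, String instantTime, BaseFileFormat format) {
    return fileId + "_" + instantTime + "-cdc" + format.extension();
  }

  public static void main(String[] args) {
    // The naming pattern is an assumption made for illustration only.
    System.out.println(hardCodedName("fg-1", "20260626"));
    System.out.println(formatAwareName("fg-1", "20260626", BaseFileFormat.ORC));
  }
}
```

With the format-aware variant, a table configured for a non-Parquet base file format would get CDC files with a matching extension, which is the behavior the description attributes to the native CDC writers.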

Summary and Changelog

This PR adds LSM-specific merge handles and sorted input support, wires LSM-aware handle selection into Spark/Flink/Java write paths, and adds native CDC log writer support for LSM CDC writes.

  • Added LsmFileGroupReaderBasedMergeHandle, FlinkLsmFileGroupReaderBasedMergeHandle, and FlinkLsmFileGroupReaderBasedIncrementalMergeHandle for LSM write/merge paths.
  • Updated HoodieMergeHandleFactory and FlinkWriteHandleFactory to select LSM file-group-reader based merge handles for LSM storage layout, and to avoid falling back from LsmFileGroupReaderBasedMergeHandle in compaction paths that supply a reader context or file-group reader.
  • Extended LsmFileGroupRecordIterator and InputSplit so incoming write records can be treated as an additional sorted run during LSM file-group reads.
  • Added HoodieCDCLogWriter abstraction and HoodieCDCLogWriterFactory to centralize CDC writer creation.
  • Added HoodieAvroNativeCDCLogger for Avro CDC records and updated native CDC/log writers to use the table base file format rather than hard-coded Parquet.
  • Updated Spark/Flink merge handles with changelog to use the shared CDC writer abstraction.
  • Updated TestHoodieMergeHandleFactory and TestHoodieRecordUtils for the new handle-selection and record utility behavior.
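The idea of treating incoming write records as an additional sorted run can be sketched as a k-way merge over key-sorted runs. The code below is a simplified illustration, not the PR's implementation: `SortedRunMergeSketch`, `Rec`, and `Cursor` are hypothetical names, and the "newest run wins on duplicate keys" rule is an assumption that stands in for whatever merge semantics the real handles apply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; none of these names are Hudi classes.
final class SortedRunMergeSketch {

  /** A keyed record; "value" stands in for the payload. */
  static final class Rec {
    final String key;
    final String value;
    Rec(String key, String value) { this.key = key; this.value = value; }
    @Override public String toString() { return key + "=" + value; }
  }

  /** Read position within one sorted run; a higher runIndex means a newer run. */
  static final class Cursor {
    final List<Rec> run;
    final int runIndex;
    int pos = 0;
    Cursor(List<Rec> run, int runIndex) { this.run = run; this.runIndex = runIndex; }
    Rec head() { return run.get(pos); }
  }

  /**
   * Merges runs that are each sorted by key, ordered oldest to newest.
   * Assumption: for equal keys, the record from the newest run is kept.
   */
  static List<Rec> merge(List<List<Rec>> runsOldestToNewest) {
    // Order by key first; for equal keys, visit the newest run first.
    Comparator<Cursor> byKeyThenNewest = (a, b) -> {
      int cmp = a.head().key.compareTo(b.head().key);
      return cmp != 0 ? cmp : Integer.compare(b.runIndex, a.runIndex);
    };
    PriorityQueue<Cursor> heap = new PriorityQueue<>(byKeyThenNewest);
    for (int i = 0; i < runsOldestToNewest.size(); i++) {
      List<Rec> run = runsOldestToNewest.get(i);
      if (!run.isEmpty()) {
        heap.add(new Cursor(run, i));
      }
    }
    List<Rec> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      Cursor c = heap.poll();
      Rec r = c.head();
      // The first record seen for a key comes from the newest run; skip older duplicates.
      if (out.isEmpty() || !out.get(out.size() - 1).key.equals(r.key)) {
        out.add(r);
      }
      c.pos++;
      if (c.pos < c.run.size()) {
        heap.add(c); // re-insert with the cursor's new head
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Rec> baseRun = Arrays.asList(new Rec("a", "base"), new Rec("c", "base"));
    List<Rec> laterRun = Arrays.asList(new Rec("b", "log"), new Rec("c", "log"));
    // Incoming write records, already sorted by key, join as the newest run.
    List<Rec> incoming = Arrays.asList(new Rec("a", "new"), new Rec("d", "new"));
    System.out.println(merge(Arrays.asList(baseRun, laterRun, incoming)));
  }
}
```

The sketch relies on the incoming records already being sorted by key, which is why sorting the write input matters: an unsorted run would break the heap-based merge and the output ordering.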

Validation evidence:

  • mvn -pl hudi-client/hudi-client-common -DskipITs -DskipTests=false -Dtest=TestHoodieMergeHandleFactory test
  • mvn -pl hudi-client/hudi-flink-client,hudi-client/hudi-client-common -DskipTests -DskipITs compile
  • mvn -pl hudi-common -DskipTests -DskipITs compile

Impact

This affects LSM storage layout write paths across common client, Flink client, Java client, and Flink datasource code. Non-LSM tables should continue to use the existing merge and CDC log paths because native CDC writing is gated on HoodieTableConfig.isLSMTreeStorageLayout().
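The gating described above can be pictured as a single layout check that decides both the merge handle and the CDC writer. The following is a minimal sketch under that reading; `LsmGatingSketch`, its enums, and the boolean parameters are hypothetical and only stand in for the real factories and for `HoodieTableConfig.isLSMTreeStorageLayout()`.

```java
// Hypothetical sketch; none of these names are Hudi classes.
final class LsmGatingSketch {

  enum MergeHandleKind { EXISTING_MERGE_HANDLE, LSM_FILE_GROUP_READER_MERGE_HANDLE }

  enum CdcWriterKind { NONE, EXISTING_CDC_LOG_WRITER, NATIVE_CDC_LOG_WRITER }

  /** Non-LSM tables keep the existing merge handle; LSM tables get the LSM-specific one. */
  static MergeHandleKind selectMergeHandle(boolean lsmLayout) {
    return lsmLayout
        ? MergeHandleKind.LSM_FILE_GROUP_READER_MERGE_HANDLE
        : MergeHandleKind.EXISTING_MERGE_HANDLE;
  }

  /** Native CDC writing is only considered when the table uses the LSM layout. */
  static CdcWriterKind selectCdcWriter(boolean lsmLayout, boolean cdcEnabled) {
    if (!cdcEnabled) {
      return CdcWriterKind.NONE;
    }
    return lsmLayout
        ? CdcWriterKind.NATIVE_CDC_LOG_WRITER
        : CdcWriterKind.EXISTING_CDC_LOG_WRITER;
  }

  public static void main(String[] args) {
    // lsmLayout stands in for the result of the table-config layout check.
    System.out.println(selectMergeHandle(false));
    System.out.println(selectCdcWriter(true, true));
  }
}
```

Keeping the check in one place is what lets non-LSM tables stay on their current code paths without any new configuration.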

There is no new public configuration. Native CDC file naming/writing follows the configured table base file format. CDC reader native-format support is not included in this PR and is expected to be addressed separately.

Risk Level

medium

The changes touch core write-path handle selection, compaction merge handle selection, LSM file-group reading, and CDC log writing. The risk is mitigated by keeping native CDC writes gated to LSM storage layout and by targeted validation of merge handle selection plus module compilation. Follow-up validation should include broader Spark/Flink CDC and LSM integration coverage.

Documentation Update

none

No new user-facing configuration is introduced. The changes refine internal LSM write-path behavior and native CDC writer selection.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

github-actions bot added the size:XL (PR with lines of changes > 1000) label Jun 26, 2026
@hudi-bot

Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Labels

size:XL PR with lines of changes > 1000

Development

Successfully merging this pull request may close these issues.

sort write inputs for lsm layout

2 participants