feat(reader): Adapt the HoodieFileGroupReader to read the native form…#19072

Open
cshuo wants to merge 1 commit into apache:master from cshuo:fg_reader_native_log

@cshuo cshuo commented Jun 26, 2026

Collaborator

…at log files

Describe the issue this Pull Request addresses

Hudi is introducing RFC-103 native-format log files (e.g. .log.parquet, .deletes.parquet), where the log file is itself a native columnar file carrying records plus footer metadata, rather than legacy inline log blocks. For compatibility and migration, the existing HoodieFileGroupReader read path must be able to consume both legacy inline logs and the new native logs within the same file slice — automatically and with no user-facing config or API change.

This PR adapts the log-reading stack so native data and delete log files are detected and read transparently through the existing block-processing and merge machinery.

Fixes #19057.

Summary and Changelog

Native log reading path

  • Added HoodieNativeLogFileReader (implements HoodieLogFormat.Reader) that opens a native log file and exposes its records as a single synthetic block, reusing the standard reader plumbing.
  • Added HoodieNativeDataBlock (extends HoodieDataBlock) and HoodieNativeDeleteBlock (extends HoodieDeleteBlock) to surface native records/deletes through the existing data/delete-block processing.
  • Introduced a new synthetic block type NATIVE_DATA_BLOCK("native", v9) in HoodieLogBlock; wired it into the case handling in BaseHoodieLogRecordReader (both forward and last-block paths) alongside the other data block types.
  • HoodieLogFormatReader now takes HoodieReaderContext and HoodieTableMetaClient, holds a HoodieLogFormat.Reader (instead of a concrete HoodieLogFileReader), and dispatches to the native reader for native log files; the legacy constructor is preserved as an overload.
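
The dispatch described in the last bullet can be pictured roughly as follows. This is a minimal sketch, not the PR's code: the class, enum, and method names are hypothetical, and the suffix checks stand in for whatever FSUtils.matchNativeLogFile actually matches. Only the .log.parquet and .deletes.parquet suffixes come from the description above.

```java
// Hypothetical sketch: names here are illustrative and are not Hudi classes.
final class LogReaderDispatchSketch {
  enum ReaderKind { LEGACY_INLINE, NATIVE_DATA, NATIVE_DELETE }

  // Assumed suffixes, taken from the examples in the PR description.
  private static final String NATIVE_DATA_SUFFIX = ".log.parquet";
  private static final String NATIVE_DELETE_SUFFIX = ".deletes.parquet";

  // Picks a reader kind from the file name alone; anything that is not
  // recognized as native falls through to the legacy inline-block reader.
  static ReaderKind chooseReader(String fileName) {
    if (fileName.endsWith(NATIVE_DELETE_SUFFIX)) {
      return ReaderKind.NATIVE_DELETE;
    }
    if (fileName.endsWith(NATIVE_DATA_SUFFIX)) {
      return ReaderKind.NATIVE_DATA;
    }
    return ReaderKind.LEGACY_INLINE;
  }
}
```

Because the choice depends only on the file name, a file slice that mixes legacy and native logs can be read file by file without any user-facing switch, which is the behaviour the description aims for.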

Footer metadata

  • Added NativeLogFooterMetadata to centralize encoding/decoding of log-block header metadata to/from the native file footer (key hudi.log.format.metadata, format version 2), including readSchemaFromNativeLogFile and format-version validation.
  • Refactored HoodieNativeLogFormatWriter to delegate footer construction to NativeLogFooterMetadata.toFooterMetadata, removing the duplicated inline metadata-building logic.
  • TableSchemaResolver.readSchemaFromLogFile now detects native log files via FSUtils.matchNativeLogFile and reads the schema from the footer instead of a log-block header.
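
As a rough illustration of the footer round trip, the sketch below encodes header entries under a single footer key and validates the format version on read. The key name and version number are the ones quoted above; the "version;key=value" wire format, the class name, and the method names are assumptions made for this example and are not the encoding NativeLogFooterMetadata uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a footer metadata codec; the wire format is invented.
final class FooterMetadataSketch {
  static final String FOOTER_KEY = "hudi.log.format.metadata"; // quoted in the PR
  static final int FORMAT_VERSION = 2;                         // quoted in the PR

  // Encodes headers as "2;k1=v1;k2=v2". Assumes keys and values contain
  // no ';' and keys contain no '='.
  static Map<String, String> toFooter(Map<String, String> headers) {
    StringBuilder encoded = new StringBuilder().append(FORMAT_VERSION);
    for (Map.Entry<String, String> e : headers.entrySet()) {
      encoded.append(';').append(e.getKey()).append('=').append(e.getValue());
    }
    Map<String, String> footer = new LinkedHashMap<>();
    footer.put(FOOTER_KEY, encoded.toString());
    return footer;
  }

  // Decodes the footer entry and fails fast on a missing key or an
  // unexpected format version.
  static Map<String, String> fromFooter(Map<String, String> footer) {
    String encoded = footer.get(FOOTER_KEY);
    if (encoded == null) {
      throw new IllegalStateException("Missing footer key " + FOOTER_KEY);
    }
    String[] parts = encoded.split(";");
    int version = Integer.parseInt(parts[0]);
    if (version != FORMAT_VERSION) {
      throw new IllegalStateException("Unsupported format version " + version);
    }
    Map<String, String> headers = new LinkedHashMap<>();
    for (int i = 1; i < parts.length; i++) {
      int eq = parts[i].indexOf('=');
      headers.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
    }
    return headers;
  }
}
```

Keeping both directions in one class is the point of the refactor described above: the writer and the reader cannot drift apart on key name or version.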

Ordering-value (DeleteRecord) correctness

  • Added RecordContext.getOrderingValueAsJava / getValueAsJava, returning the engine-agnostic Java type expected by DeleteRecord (the merge path later converts back to the engine type via convertOrderingValueToEngineType).
  • Overrode getValueAsJava in FlinkRecordContext and BaseSparkInternalRecordContext; the Avro/Hive defaults inherit the base behavior.
  • HoodieNativeDeleteBlock.readRecordsToDelete uses the Java-typed ordering value when constructing DeleteRecords to prevent a double conversion (e.g. ClassCastException for Flink, wrong values for Spark).
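
The double-conversion hazard can be reproduced with toy types. In the sketch below, EngineString is a stand-in for an engine-specific string type, and toJava / toEngine are hypothetical stand-ins for getValueAsJava and convertOrderingValueToEngineType; none of this is Hudi code, and real engines convert many more types.

```java
// Hypothetical sketch: toy types that mimic an engine-specific representation.
final class OrderingValueSketch {
  // Stand-in for an engine string type, wrapping a plain Java String.
  static final class EngineString {
    final String value;

    EngineString(String value) {
      this.value = value;
    }
  }

  // Stand-in for getValueAsJava: unwraps engine values to plain Java values.
  static Object toJava(Object engineValue) {
    return engineValue instanceof EngineString
        ? ((EngineString) engineValue).value
        : engineValue;
  }

  // Stand-in for the later merge-path conversion. It expects a Java value,
  // so handing it an EngineString fails with a ClassCastException.
  static Object toEngine(Object javaValue) {
    if (javaValue instanceof Integer || javaValue instanceof Long) {
      return javaValue;
    }
    return new EngineString((String) javaValue);
  }
}
```

Storing toJava(orderingValue) in the delete record keeps the later toEngine call safe; storing the engine value directly would trigger the cast failure in this toy model, which is analogous to the Flink symptom mentioned above.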

Impact

  • Functional impact: The FG reader can now read native-format data and delete log files alongside legacy inline logs within the same file slice. No config or API changes; the adaptation is internal and automatic. Native CDC log reading is intentionally not yet supported and throws a clear exception.
  • Maintainability: Reuses the existing HoodieDataBlock/delete-block processing and merge paths via a synthetic block type, keeping the compaction/merge machinery untouched. Adds a small, well-scoped getValueAsJava extension point on RecordContext.

Risk Level

Low. The change is additive and gated on native-file detection, so legacy read paths are unaffected and fail-fast guards cover unsupported contexts. Mitigation: unit tests cover footer-header parsing/precedence/validation and per-engine getValueAsJava conversions, and an end-to-end test validates reading a mixed legacy + native parquet data/delete file slice through the HoodieFileGroupReader.

Documentation Update

None. This is an internal read-path adaptation with no new user-facing config, API, or storage-format change introduced by this PR (the native log format itself is covered by RFC-103).

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@cshuo cshuo marked this pull request as draft June 26, 2026 06:36
@cshuo cshuo force-pushed the fg_reader_native_log branch from ecadd6c to 2a800b4 Compare June 26, 2026 07:27
@cshuo cshuo marked this pull request as ready for review June 26, 2026 07:28
@github-actions github-actions Bot added the size:XL PR with lines of changes > 1000 label Jun 26, 2026
@cshuo cshuo force-pushed the fg_reader_native_log branch from 2a800b4 to 6946a65 Compare June 26, 2026 07:52
@hudi-bot

Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@hudi-agent hudi-agent left a comment

Contributor

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adapts the HoodieFileGroupReader stack to transparently read RFC-103 native-format data and delete log files alongside legacy inline logs, adding a synthetic NATIVE_DATA_BLOCK type and a shared NativeLogFooterMetadata contract. The native-delete ordering-value handling is well-reasoned and is backed by solid tests (the double-conversion guard in particular). One edge case around native CDC log files is worth double-checking in the inline comment. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small readability nits — magic regex group numbers and a double accessor call — otherwise the code is well-structured and the new classes follow established Hudi conventions cleanly.

  String fileName = pathInfo.getPath().getName();
- return FSUtils.isLogFile(pathInfo.getPath()) && fileName.contains(logFileExtension);
+ return FSUtils.isLogFile(pathInfo.getPath())
+     && (FSUtils.matchNativeLogFile(fileName).isPresent() || fileName.contains(logFileExtension));

Contributor

🤖 This now surfaces native .cdc.parquet files in a file slice's log files (the new test asserts this). But the read path excludes CDC logs in InputSplit via getFileName().endsWith(CDC_LOGFILE_SUFFIX) (".cdc"), which won't match .cdc.parquet. Since HoodieNativeLogFileReader treats any non-delete native file as a data block, could native CDC logs leak into snapshot/merge reads here? @yihua

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.
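
The suffix mismatch raised in this comment is easy to check in isolation. The sketch below assumes the ".cdc" suffix value quoted in the comment; the file names, the class, and the broadened check are hypothetical and only illustrate one possible guard, not what the PR does or should do.

```java
// Hypothetical sketch: file names and the alternative check are illustrative.
final class CdcSuffixSketch {
  static final String CDC_LOGFILE_SUFFIX = ".cdc"; // value quoted in the review comment

  // The existing style of check: matches only names that end in ".cdc".
  static boolean isCdcBySuffix(String fileName) {
    return fileName.endsWith(CDC_LOGFILE_SUFFIX);
  }

  // One possible broadened check that also recognizes a native CDC file,
  // assuming native CDC files end in ".cdc.parquet".
  static boolean isCdcLegacyOrNative(String fileName) {
    return fileName.endsWith(CDC_LOGFILE_SUFFIX)
        || fileName.endsWith(CDC_LOGFILE_SUFFIX + ".parquet");
  }
}
```

With these assumptions, a name such as f1_001.cdc.parquet passes isCdcLegacyOrNative but not isCdcBySuffix, which is the gap the comment asks about.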

*/
public static HoodieSchema readSchemaFromNativeLogFile(
HoodieStorage storage, StoragePath path, Matcher nativeLogMatcher) {
String nativeLogType = nativeLogMatcher.group(8);

Contributor

🤖 nit: group(8) and group(9) below are opaque without cross-referencing the regex in FSUtils. Could you extract named constants (e.g. NATIVE_LOG_TYPE_GROUP = 8, NATIVE_LOG_EXTENSION_GROUP = 9) here or in FSUtils, so a reader can understand what each group captures at a glance?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.
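
A sketch of the named-constant suggestion follows. The regex here is a simplified, hypothetical three-group pattern, so the group indices are 2 and 3 rather than the 8 and 9 used against the real FSUtils pattern; only the idea of naming the indices carries over.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the pattern is simplified and is not the FSUtils regex.
final class NativeLogGroupsSketch {
  // group 1: base name, group 2: native log type, group 3: file extension
  static final Pattern NATIVE_LOG_PATTERN =
      Pattern.compile("^(.+)\\.(log|deletes|cdc)\\.([a-z]+)$");
  static final int NATIVE_LOG_TYPE_GROUP = 2;
  static final int NATIVE_LOG_EXTENSION_GROUP = 3;

  static String nativeLogType(String fileName) {
    return match(fileName).group(NATIVE_LOG_TYPE_GROUP);
  }

  static String nativeLogExtension(String fileName) {
    return match(fileName).group(NATIVE_LOG_EXTENSION_GROUP);
  }

  private static Matcher match(String fileName) {
    Matcher matcher = NATIVE_LOG_PATTERN.matcher(fileName);
    if (!matcher.matches()) {
      throw new IllegalArgumentException("Not a native log file: " + fileName);
    }
    return matcher;
  }
}
```

Declaring the constants next to the pattern keeps the index and the group it refers to in one place, so a change to the regex is less likely to silently shift what group(n) returns.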

*/
public static long getFileSize(HoodieStorage storage, HoodieLogFile logFile) {
try {
return logFile.getFileSize() >= 0 ? logFile.getFileSize() : storage.getPathInfo(logFile.getPath()).getLength();

Contributor

🤖 nit: logFile.getFileSize() is called twice in the same expression — could you extract it to a local variable (long fileSize = logFile.getFileSize()) so a reader doesn't have to wonder whether the double call is intentional?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

}

@Override
protected <T> ClosableIterator<T> readRecordsFromBlockPayload(HoodieReaderContext<T> readerContext) throws IOException {

Contributor

Key-filtered scans will fail for native data logs here. FileGroupRecordBuffer.getRecordsIterator calls dataBlock.getEngineRecordIterator(readerContext, keys, fullKey) whenever HoodieMergedLogRecordReader.createKeySpec(...) builds an IN/STARTS_WITH filter. The base HoodieDataBlock implementation routes that call to lookupEngineRecords, which this block does not override, so .log.parquet files throw "Point lookups are not supported by this Data block type (NATIVE_DATA_BLOCK)" instead of returning the matching rows. Since native logs are meant to be transparent to the file-group reader, please add a scan-and-filter fallback (or native predicate pushdown) for lookupEngineRecords here, and add a test that covers an IN/prefix-filtered read.
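
A scan-and-filter fallback of the kind requested here could look roughly like the sketch below. It is generic Java rather than Hudi code: the class name, the keyOf extractor, and the use of a List as the result are assumptions, and the fullKey flag is assumed to mean exact match when true and prefix match when false.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a scan-and-filter lookup; not the Hudi implementation.
final class ScanAndFilterSketch {
  // Scans every record and keeps those whose key is in `keys` (fullKey = true)
  // or starts with one of `keys` (fullKey = false).
  static <T> List<T> lookup(
      Iterable<T> allRecords, Function<T, String> keyOf, List<String> keys, boolean fullKey) {
    Set<String> exactKeys = new HashSet<>(keys);
    List<T> matches = new ArrayList<>();
    for (T record : allRecords) {
      String recordKey = keyOf.apply(record);
      boolean matched = fullKey
          ? exactKeys.contains(recordKey)
          : keys.stream().anyMatch(recordKey::startsWith);
      if (matched) {
        matches.add(record);
      }
    }
    return matches;
  }
}
```

This reads the whole block for every lookup, so it trades speed for correctness; predicate pushdown into the native file reader, the other option mentioned in the comment, would avoid the full scan.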

Development

Successfully merging this pull request may close these issues.

Adapter the existing HoodieFileGroupReader to read the new native format log files

4 participants