Skip to content

feat(writer): add COMMIT_TIME_ONLY meta-fields mode for incremental queries on minimal-meta-fields tables#19047

Open
nsivabalan wants to merge 1 commit into
apache:masterfrom
nsivabalan:selective_meta_fields_approach2
Open

feat(writer): add COMMIT_TIME_ONLY meta-fields mode for incremental queries on minimal-meta-fields tables#19047
nsivabalan wants to merge 1 commit into
apache:masterfrom
nsivabalan:selective_meta_fields_approach2

Conversation

@nsivabalan

Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

Today the only way to opt out of populating Hudi's five meta columns is the all-or-nothing hoodie.populate.meta.fields=false. That saves storage but disables incremental queries (which require _hoodie_commit_time).

A community user surfaced this trade-off (#18383, also discussed at #17959). The concrete ask was: "give me the storage saving without giving up incremental queries." A separate exploratory PR (#18384) attempted a fully orthogonal exclude-list with per-field branching across the writer/reader paths; that surface ended up being ~2300 lines across 87 files. This PR proposes a simpler, scoped alternative: three named modes instead of the full 2^5 matrix.

Closes #18383.

Summary and Changelog

Adds an additive opt-in flag, hoodie.meta.fields.commit.time.enabled, that — when set together with hoodie.populate.meta.fields=false — additionally populates _hoodie_commit_time so incremental queries remain functional. The remaining four meta columns stay null on disk, preserving the storage saving.

The three resulting modes:

populate.meta.fields meta.fields.commit.time.enabled Effective mode
true (default) ignored ALL — today's default
false false (default) NONE — today's populate.meta.fields=false
false true COMMIT_TIME_ONLY — new
true true rejected at writer init (ambiguous)

Why a separate boolean instead of a single enum

  • Bit-identical backward compatibility. Every existing table on disk resolves to ALL or NONE without any new property being read. No reader-side migration. No precedence rules.
  • Pre-1.3.0 readers behave correctly. They don't know the new property exists. They open a COMMIT_TIME_ONLY table, see populate.meta.fields=false, and behave as a NONE reader — they cannot do incremental queries on the table, but they don't produce silent wrong results either.
  • Encodes "additive" structurally. The new flag only modifies a NONE table — it's literally a NONE table plus one populated column. Most code paths that branch on populate.meta.fields keep working unchanged; only paths that specifically need commit_time consult the new accessor.

Plug points

Config + accessors (hudi-common / hudi-client-common):

  • New HoodieTableConfig.META_FIELDS_COMMIT_TIME_ENABLED property.
  • New accessors: isCommitTimeOnlyMetaFieldsMode(), isCommitTimePopulated(), isRecordKeyPopulated() — three named predicates.
  • HoodieWriteConfig pass-throughs + Builder.withMetaFieldsCommitTimeEnabled().
  • HoodieWriteConfig.validate() rejects the populate=true + commit.time=true combination at build time.
  • HoodieTableMetaClient.TableBuilder.setMetaFieldsCommitTimeEnabled() persists the flag on hoodie.properties at table init.
  • HoodieSparkSqlWriter wires both fresh-table and bootstrap creation paths.

Writer engines:

  • HoodieAvroParquetWriter, HoodieSparkParquetWriter, HoodieRowCreateHandle each gain a commitTimeOnly constructor overload. When commitTimeOnly && !populateMetaFields, they populate _hoodie_commit_time and the derived seq id; the other four columns stay null. Bloom-filter / record-key index registration is intentionally skipped (the record-key column is not populated).

Read path (incremental query rejection):

  • IncrementalRelationV1/V2, MergeOnReadIncrementalRelationV1/V2 now check isCommitTimePopulated() rather than populateMetaFields() — COMMIT_TIME_ONLY tables are accepted, NONE tables remain rejected with a clearer message.

Scope

  • ✅ Spark Avro / Spark Row writer paths.
  • ✅ Spark Parquet bulk-insert.
  • ✅ Incremental query rejection logic across V1 / V2 / CoW / MoR.
  • ❌ Flink RowData writer — out of scope for this patch; behaves as NONE under COMMIT_TIME_ONLY (no commit_time populated). Tracked as a follow-up.
  • ❌ ORC / HFile writers — ORC continues to populate all meta fields unconditionally (legacy behavior); HFile is used only by MDT which is always ALL.

Impact

  • Storage layout: no change for tables that don't opt in. New optional mode for tables that do. Default behavior unchanged.
  • API: no public API breakage. New table property, new accessors, new builder method — all additive.
  • Configuration:
    • hoodie.meta.fields.commit.time.enabled (default false). Only meaningful when hoodie.populate.meta.fields=false. Persisted on hoodie.properties at table init.
  • Performance: writer hot path adds one boolean check per row when in the new mode; the bool is final and cached in the writer constructor.
  • Forward-compat: pre-1.3.0 readers ignore the new flag and treat the table as NONE — no silent wrong results.

Risk Level

low

Additive change with a narrow scope. The default path is untouched. The validation guard rejects the ambiguous combination loudly at writer init. Existing TestHoodieTableConfig regression coverage (93 tests) passes unchanged.

Documentation Update

  • New config hoodie.meta.fields.commit.time.enabled documented via @ConfigProperty annotation on HoodieTableConfig.
  • No public-facing docs update needed in this patch; if the website page on meta fields exists, a separate docs PR will add the three-mode table.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

…ueries on minimal-meta-fields tables

Today the only way to opt out of populating Hudi's five meta columns
(_hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key,
_hoodie_partition_path, _hoodie_file_name) is the all-or-nothing
hoodie.populate.meta.fields=false. That saves storage but disables
incremental queries (which require _hoodie_commit_time).

This patch adds a small, additive opt-in -- hoodie.meta.fields.commit.time.enabled --
that, when set together with hoodie.populate.meta.fields=false, additionally populates
_hoodie_commit_time on every row so incremental queries remain functional. The other
four meta columns stay null on disk, preserving the storage saving.

The three resulting modes:
  - ALL (default): hoodie.populate.meta.fields=true. All five meta fields populated.
  - NONE: hoodie.populate.meta.fields=false. No meta fields populated; incremental
    queries rejected (existing behavior).
  - COMMIT_TIME_ONLY: hoodie.populate.meta.fields=false +
    hoodie.meta.fields.commit.time.enabled=true. Only _hoodie_commit_time populated.

The combination populate.meta.fields=true + commit.time.enabled=true is
ambiguous and is rejected up-front at writer init.

Default behavior is unchanged. The new property defaults to false; tables that
opt out of populate.meta.fields without setting the new flag continue to behave
exactly as before (NONE mode).

Why a separate boolean instead of a single enum:
- Bit-identical backward compatibility. Every existing table on disk resolves to
  ALL or NONE without any new property being read. No reader-side migration.
- Pre-1.3.0 readers ignore the new property safely: they see
  populate.meta.fields=false and behave as a NONE reader, which is consistent
  (they cannot do incremental queries; the _hoodie_commit_time column they see
  is just unobserved).
- Encodes "additive" structurally. The new flag only modifies a NONE table -- it
  is literally a NONE table plus one populated column. Most existing code paths
  that branch on populate.meta.fields keep working unchanged; only paths that
  specifically need commit_time need to consult the new accessor.

Changes:
- New table property HoodieTableConfig.META_FIELDS_COMMIT_TIME_ENABLED.
- New accessors HoodieTableConfig.isCommitTimeOnlyMetaFieldsMode(),
  isCommitTimePopulated(), isRecordKeyPopulated() -- three named predicates so
  callers do not need bool[5]-style per-field plumbing.
- HoodieWriteConfig pass-through accessors + Builder.withMetaFieldsCommitTimeEnabled().
- HoodieWriteConfig.validate() rejects the populate=true + commit.time=true combination.
- HoodieTableMetaClient.TableBuilder.setMetaFieldsCommitTimeEnabled() persists the flag.
- HoodieSparkSqlWriter wires both fresh-table and bootstrap creation paths.
- Writer-engine touchpoints (HoodieAvroParquetWriter, HoodieSparkParquetWriter,
  HoodieRowCreateHandle): when commitTimeOnly is true, populate _hoodie_commit_time
  and the derived seq id; leave the other four columns null. Bloom-filter / record-
  key index registration is intentionally skipped because the record-key column is
  not populated.
- Incremental query rejection sites (IncrementalRelationV1/V2,
  MergeOnReadIncrementalRelationV1/V2) now check isCommitTimePopulated() rather
  than populateMetaFields(), so COMMIT_TIME_ONLY tables are accepted.

Tests:
- TestHoodieMetaFieldsMode (7 tests): all three modes resolve correctly from
  HoodieTableConfig; the ambiguous combo is reported as ALL mode for accessor
  defensiveness; HoodieConfig view yields identical semantics.
- TestHoodieWriteConfigMetaFieldsMode (6 tests): writer-builder surface; the
  invalid combination is rejected at build() with a message naming both
  properties.
- TestMetaFieldsCommitTimeOnly (4 tests, end-to-end through Spark datasource):
  ALL / NONE / COMMIT_TIME_ONLY all persist the right properties and the
  read-back table config reports the right mode; the invalid combination is
  rejected at writer init.
- TestHoodieTableConfig: definedTableConfigs() count assertion updated to 45.

Closes apache#18383

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Jun 22, 2026
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 56.25000% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.68%. Comparing base (8933224) to head (3f158f5).
⚠️ Report is 51 commits behind head on master.

Files with missing lines Patch % Lines
...ache/hudi/io/storage/HoodieSparkParquetWriter.java 0.00% 7 Missing and 1 partial ⚠️
...che/hudi/io/storage/row/HoodieRowCreateHandle.java 64.70% 4 Missing and 2 partials ⚠️
...udi/io/storage/hadoop/HoodieAvroParquetWriter.java 25.00% 4 Missing and 2 partials ⚠️
.../scala/org/apache/hudi/IncrementalRelationV1.scala 0.00% 3 Missing ⚠️
.../scala/org/apache/hudi/IncrementalRelationV2.scala 0.00% 3 Missing ⚠️
...apache/hudi/MergeOnReadIncrementalRelationV1.scala 0.00% 2 Missing and 1 partial ⚠️
...pache/hudi/common/table/HoodieTableMetaClient.java 60.00% 2 Missing ⚠️
...apache/hudi/MergeOnReadIncrementalRelationV2.scala 33.33% 1 Missing and 1 partial ⚠️
...java/org/apache/hudi/config/HoodieWriteConfig.java 90.90% 0 Missing and 1 partial ⚠️
.../hudi/io/storage/HoodieSparkFileWriterFactory.java 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #19047      +/-   ##
============================================
- Coverage     68.26%   67.68%   -0.59%     
- Complexity    29513    29941     +428     
============================================
  Files          2542     2567      +25     
  Lines        142627   146012    +3385     
  Branches      17788    18499     +711     
============================================
+ Hits          97369    98826    +1457     
- Misses        37253    38878    +1625     
- Partials       8005     8308     +303     
Flag Coverage Δ
common-and-other-modules 44.83% <26.25%> (+0.04%) ⬆️
hadoop-mr-java-client 44.60% <47.05%> (-0.15%) ⬇️
spark-client-hadoop-common 48.31% <35.48%> (+0.27%) ⬆️
spark-java-tests 48.10% <52.50%> (-0.67%) ⬇️
spark-scala-tests 44.65% <30.00%> (-0.19%) ⬇️
utilities 37.08% <25.00%> (-0.17%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...rg/apache/hudi/common/table/HoodieTableConfig.java 94.34% <100.00%> (-0.15%) ⬇️
...io/storage/hadoop/HoodieAvroFileWriterFactory.java 92.06% <100.00%> (-3.94%) ⬇️
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala 78.74% <100.00%> (+0.20%) ⬆️
...java/org/apache/hudi/config/HoodieWriteConfig.java 90.05% <90.90%> (+0.07%) ⬆️
.../hudi/io/storage/HoodieSparkFileWriterFactory.java 88.70% <66.66%> (-1.30%) ⬇️
...pache/hudi/common/table/HoodieTableMetaClient.java 92.66% <60.00%> (+0.33%) ⬆️
...apache/hudi/MergeOnReadIncrementalRelationV2.scala 60.95% <33.33%> (-0.59%) ⬇️
.../scala/org/apache/hudi/IncrementalRelationV1.scala 0.00% <0.00%> (ø)
.../scala/org/apache/hudi/IncrementalRelationV2.scala 0.00% <0.00%> (ø)
...apache/hudi/MergeOnReadIncrementalRelationV1.scala 50.89% <0.00%> (-0.46%) ⬇️
... and 3 more

... and 189 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hudi-bot

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds an additive hoodie.meta.fields.commit.time.enabled flag so that tables written with populate.meta.fields=false can still populate _hoodie_commit_time and keep incremental queries functional, wiring it through the write/table configs, the Spark and Avro parquet writers, the row bulk-insert handle, and the incremental relations. The CoW write/read paths look consistent; the main thing worth double-checking is how this interacts with MoR tables, since the incremental guards are relaxed for MoR too but the log-write path isn't updated for this mode (details in the inline comments). Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A few small readability nits below — an unnecessary string concat in the doc, a bypassed accessor in validate(), and inconsistent assertion style in the test.


if (!this.tableConfig.populateMetaFields()) {
throw new HoodieException("Incremental queries are not supported when meta fields are disabled")
if (!this.tableConfig.isCommitTimePopulated()) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This also enables incremental queries on MoR tables in COMMIT_TIME_ONLY mode, but the MoR log-write path (HoodieAppendHandle.populateMetadataFields) still only writes _hoodie_commit_time when populateMetaFields() is true — so log-file records get a null commit-time and would be silently dropped by the IsNotNull/range filters in incrementalSpanRecordFilters (and the merge also needs _hoodie_record_key, which is null in this mode). Should the mode be restricted to CoW, or should the append handle populate commit-time too? Same applies to V2. @yihua does this match your read of the MoR incremental path?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

.withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
+ "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");

public static final ConfigProperty<Boolean> META_FIELDS_COMMIT_TIME_ENABLED = ConfigProperty

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Is enabling this flag on a pre-existing populate.meta.fields=false table meant to be guarded? HoodieWriterUtils.validateTableConfig only flags a conflict when the existing value is non-null, so a null→true transition on an older NONE table isn't blocked. After that, incremental queries over a range that includes pre-enablement commits would silently drop those rows since their _hoodie_commit_time is null on disk.

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

public static final ConfigProperty<Boolean> META_FIELDS_COMMIT_TIME_ENABLED = ConfigProperty
.key("hoodie.meta.fields.commit.time.enabled")
.defaultValue(false)
.withDocumentation("Only meaningful when " + "hoodie.populate.meta.fields=false. When set to true, the writer "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the "Only meaningful when " + splits a single thought across two adjacent string literals for no reason — could you merge them into one string so the concatenation isn't surprising to whoever reads this next?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

// COMMIT_TIME_ONLY mode is an additive opt-in on top of populate.meta.fields=false. Setting
// both flags to true is ambiguous (the COMMIT_TIME_ONLY flag has no effect when all meta
// fields are already populated) so reject it explicitly rather than silently ignore.
boolean populateMetaFields = writeConfig.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: writeConfig.populateMetaFields() already encodes this read — could you use it here instead of reaching through getBooleanOrDefault directly?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

.withMetaFieldsCommitTimeEnabled(false)
.build();
assertEquals(false, cfg.populateMetaFields());
assertEquals(false, cfg.isCommitTimeOnlyMetaFieldsMode());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the three assertEquals(false, ...) calls here are inconsistent with assertFalse(...) used everywhere else in this class — could you swap them for assertFalse to match?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support selective meta field population

4 participants