feat(partition-ttl): Introduce KeepByEventTimeStrategy to expire partitions by event time #19081

Open
wangxianghu wants to merge 1 commit into apache:master from wangxianghu:event_time_ttl

@wangxianghu wangxianghu commented Jun 27, 2026

Contributor

Describe the issue this Pull Request addresses

Today Hudi ships two partition TTL strategies, both keyed off processing time:

  • KeepByCreationTimeStrategy uses the partition creation commit time recorded in .hoodie_partition_metadata. Backfilled historical partitions look "newly created" and never expire.
  • KeepByTimeStrategy uses the last commit time that touched the partition. Any late-arriving write extends the partition's lifetime, silently disabling TTL.

In practice, what most users actually want is event-time semantics: the date encoded in the partition path (e.g. dt=2026-04-24) is the business event date, and anything older than N days should be dropped, regardless of when or how writes land.

This PR introduces a new strategy KeepByEventTimeStrategy that derives the partition's reference time from the partition path itself, decoupling TTL from write behavior.
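
To make the difference concrete, here is a minimal sketch of an event-time expiry check. It is not the code in this PR: the class name, the method, and the "strictly older than N days" boundary are all assumptions chosen for illustration.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical illustration only; not the PR's KeepByEventTimeStrategy. */
final class EventTimeExpirySketch {

  /**
   * Returns true when the event date lies more than retainDays days before today.
   * The strict ">" boundary is an assumption, not taken from the PR.
   */
  static boolean isExpired(LocalDate eventDate, LocalDate today, long retainDays) {
    return ChronoUnit.DAYS.between(eventDate, today) > retainDays;
  }

  public static void main(String[] args) {
    LocalDate today = LocalDate.of(2026, 6, 27);
    // dt=2026-04-24 is 64 days old by event time, even if it was backfilled today.
    System.out.println(isExpired(LocalDate.of(2026, 4, 24), today, 30));
    // dt=2026-06-20 is 7 days old, so it is retained.
    System.out.println(isExpired(LocalDate.of(2026, 6, 20), today, 30));
  }
}
```

Because the decision depends only on the date in the path and the current date, neither a backfill nor a late write changes the outcome.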

Summary and Changelog

For users: a new KEEP_BY_EVENT_TIME partition TTL strategy that expires partitions based on the date encoded in the partition path. TTL is no longer affected by backfill, late-arriving data, or commit timing.

Supported partition path shapes (v1 scope):

Day, format=yyyy-MM-dd
time only: 2026-06-27, dt=2026-06-27 — startIndex 0
prefix + time: region=us/2026-06-27, region=us/dt=2026-06-27 — startIndex 1
time + suffix: 2026-06-27/source=app, dt=2026-06-27/source=app — startIndex 0
prefix + time + suffix: region=us/dt=2026-06-27/source=app — startIndex 1

Day, format=yyyyMMdd
time only: 20260627, dt=20260627 — startIndex 0
prefix + time: region=us/20260627, region=us/dt=20260627 — startIndex 1
time + suffix: 20260627/source=app, dt=20260627/source=app — startIndex 0
prefix + time + suffix: region=us/dt=20260627/source=app — startIndex 1

Hour, format=yyyy-MM-dd/HH
time only: 2026-06-27/12, dt=2026-04-05/hh=12 — startIndex 0
prefix + time: region=us/2026-06-27/12, region=us/dt=2026-04-05/hh=12 — startIndex 1
time + suffix: 2026-06-27/12/source=app, dt=2026-04-05/hh=12/source=app — startIndex 0
prefix + time + suffix: region=us/dt=2026-04-05/hh=12/source=app — startIndex 1

Hour, format=yyyyMMdd/HH
time only: 20260627/12, dt=20260405/hh=12 — startIndex 0
prefix + time: region=us/20260627/12, region=us/dt=20260405/hh=12 — startIndex 1
time + suffix: 20260627/12/source=app, dt=20260405/hh=12/source=app — startIndex 0
prefix + time + suffix: region=us/dt=20260405/hh=12/source=app — startIndex 1

Hive-style key names are not constrained: dt=, day=, event_date=, hh=, hour= all work; only the value after = is parsed.

Whether partition paths are Hive-style is read from the table's hoodie.datasource.write.hive_style_partitioning flag (the authoritative table-level setting), not guessed per segment.
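
The path shapes above can be sketched as a small parser. This is an illustrative reading of the description, not the PR's implementation: the class and method names are hypothetical, and passing the format, startIndex, and Hive-style flag as plain arguments is an assumption, since the three new config keys are not named here.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical sketch of extracting event time from a partition path. */
final class PartitionPathTimeSketch {

  /**
   * @param partitionPath e.g. "region=us/dt=2026-06-27/source=app"
   * @param format        one of the formats listed above, e.g. "yyyy-MM-dd/HH"
   * @param startIndex    index of the first path segment that carries the time
   * @param hiveStyle     table-level flag; when true, only the value after '=' is parsed
   */
  static LocalDateTime parseEventTime(
      String partitionPath, String format, int startIndex, boolean hiveStyle) {
    String[] segments = partitionPath.split("/");
    // A format such as yyyy-MM-dd/HH spans two path segments.
    int span = format.split("/").length;
    if (startIndex < 0 || startIndex + span > segments.length) {
      throw new IllegalArgumentException("Path too short for format: " + partitionPath);
    }
    StringBuilder value = new StringBuilder();
    for (int i = startIndex; i < startIndex + span; i++) {
      String segment = segments[i];
      if (hiveStyle) {
        int eq = segment.indexOf('=');
        if (eq < 0) {
          throw new IllegalArgumentException("Expected key=value segment: " + segment);
        }
        // The key name (dt, day, hh, hour, ...) is ignored.
        segment = segment.substring(eq + 1);
      }
      if (value.length() > 0) {
        value.append('/');
      }
      value.append(segment);
    }
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern(format);
    if (format.contains("HH")) {
      return LocalDateTime.parse(value.toString(), formatter);
    }
    return LocalDate.parse(value.toString(), formatter).atStartOfDay();
  }

  public static void main(String[] args) {
    System.out.println(parseEventTime("region=us/dt=2026-06-27/source=app", "yyyy-MM-dd", 1, true));
    System.out.println(parseEventTime("20260627/12", "yyyyMMdd/HH", 0, false));
  }
}
```

Any suffix segments after the time are simply never read, which is why the "time + suffix" shapes share startIndex 0 with the "time only" shapes.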

Impact

  • Public API: none broken. New enum value, three new configs (all with defaults), one new class — strictly additive.
  • User-facing: a new opt-in TTL strategy. Existing strategies and configs are untouched. Users adopt it by setting hoodie.partition.ttl.management.strategy.type=KEEP_BY_EVENT_TIME.
  • Performance: negligible. Partition path parsing consists of in-memory string operations, runs only inside the existing TTL replace commit, and adds no extra metadata reads.
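
As a rough sketch of the opt-in step, the strategy type could be set on a properties object passed to the writer. Only the key and value quoted above are used; the three new configs are not named in this description, so they are deliberately left out, and the class name here is hypothetical.

```java
import java.util.Properties;

/** Hypothetical helper showing the single opt-in setting named in this PR. */
final class EventTimeTtlConfigSketch {

  static Properties ttlProps() {
    Properties props = new Properties();
    // Key and value as quoted in the PR description.
    props.setProperty("hoodie.partition.ttl.management.strategy.type", "KEEP_BY_EVENT_TIME");
    return props;
  }

  public static void main(String[] args) {
    ttlProps().forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```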

Risk Level

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@hudi-agent hudi-agent left a comment

Contributor


⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds a KEEP_BY_EVENT_TIME partition TTL strategy that derives the partition's reference time from the date encoded in the partition path, decoupling TTL from write/commit timing. The change is cleanly additive (new enum value, three defaulted configs, one new class) and comes with good unit coverage of the parsing logic. A couple of edge cases are worth double-checking in the inline comments, primarily the timezone handling and the fail-fast behavior across the partition set. Please take a look at the inline comments; after that, this should be ready for a Hudi committer or PMC member to take it from here.
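
To illustrate why timezone handling matters for this kind of check, here is a small sketch in which the same partition date is retained or expired depending on the zone used to compute "today". The class, method, and boundary rule are assumptions for illustration and do not reflect how the PR resolves the question.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;

/** Hypothetical sketch of a zone-dependent TTL cutoff. */
final class TimezoneCutoffSketch {

  /** Expired when the event date is strictly before (today in zone) minus retainDays. */
  static boolean isExpired(LocalDate eventDate, Instant now, int retainDays, ZoneId zone) {
    LocalDate today = now.atZone(zone).toLocalDate();
    return eventDate.isBefore(today.minusDays(retainDays));
  }

  public static void main(String[] args) {
    // 20:00 UTC on June 27 is already 04:00 on June 28 in Asia/Shanghai.
    Instant now = Instant.parse("2026-06-27T20:00:00Z");
    LocalDate partitionDate = LocalDate.of(2026, 5, 28);
    System.out.println(isExpired(partitionDate, now, 30, ZoneOffset.UTC));
    System.out.println(isExpired(partitionDate, now, 30, ZoneId.of("Asia/Shanghai")));
  }
}
```

With a 30-day retention, the cutoff is May 28 in UTC but May 29 in Asia/Shanghai, so the dt=2026-05-28 partition is kept under one zone and dropped under the other.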

@wangxianghu wangxianghu force-pushed the event_time_ttl branch 2 times, most recently from c8df5d8 to d64fbc2 Compare June 27, 2026 04:20
@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Jun 27, 2026
@wangxianghu

Contributor Author

Hi @stream2000, do you have time to take a look?

@hudi-bot

Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

4 participants