feat(partition-ttl): Introduce KeepByEventTimeStrategy to expire partitions by event time #19081
wangxianghu wants to merge 1 commit into
hudi-agent left a comment
Thanks for working on this! This PR adds a KEEP_BY_EVENT_TIME partition TTL strategy that derives the partition's reference time from the date encoded in the partition path, decoupling TTL from write/commit timing. The change is cleanly additive (new enum value, three defaulted configs, one new class) and comes with good unit coverage of the parsing logic. A couple of edge cases are worth double-checking in the inline comments, primarily the timezone handling and the fail-fast behavior across the partition set. Please take a look at the inline comments; after that, this should be ready for a Hudi committer or PMC member to take it from here.
Hi @stream2000, do you have time to take a look?
Describe the issue this Pull Request addresses
Today Hudi ships two partition TTL strategies, both keyed off processing time:
In practice, what most users actually want is event-time semantics: the date encoded in the partition path (e.g. dt=2026-04-24) is the business event date, and anything older than N days should be dropped, regardless of when or how writes land.
This PR introduces a new strategy KeepByEventTimeStrategy that derives the partition's reference time from the partition path itself, decoupling TTL from write behavior.
Summary and Changelog
For users: a new KEEP_BY_EVENT_TIME partition TTL strategy that expires partitions based on the date encoded in the partition path. TTL is no longer affected by backfill, late-arriving data, or commit timing.
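As a rough illustration of the expiry decision described above, here is a minimal sketch. It is not the code in this PR: the class name `EventTimeTtlSketch`, the method `isExpired`, and its parameters are hypothetical, and the boundary rule (a partition exactly N days old is kept) is an assumption. The explicit `ZoneId` parameter is there because the zone used to interpret the partition date shifts the cutoff, which is the timezone edge case raised in review.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical sketch, not the implementation in this PR.
final class EventTimeTtlSketch {
    private EventTimeTtlSketch() {}

    /**
     * Returns true when the partition's event time is strictly older than
     * retentionDays relative to now. The strict comparison is an assumption.
     *
     * @param eventTime     event time parsed from the partition path
     * @param retentionDays number of days to keep (the "N days" in the description)
     * @param now           reference instant, passed in so the check is deterministic
     * @param zone          zone used to interpret the zone-less partition date
     */
    static boolean isExpired(LocalDateTime eventTime, int retentionDays, Instant now, ZoneId zone) {
        // Interpret the zone-less event time in the given zone.
        Instant eventInstant = eventTime.atZone(zone).toInstant();
        // Anything before this cutoff is considered expired.
        Instant cutoff = now.minus(Duration.ofDays(retentionDays));
        return eventInstant.isBefore(cutoff);
    }
}
```

With `now` fixed at 2026-06-27T00:00:00Z and a 7-day retention, a partition dated 2026-06-20 is kept when interpreted in UTC but expired when interpreted in Asia/Shanghai, because midnight in Shanghai falls eight hours earlier on the UTC timeline.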
Supported partition path shapes (v1 scope):
Day, format=yyyy-MM-dd
time only: 2026-06-27, dt=2026-06-27 — startIndex 0
prefix + time: region=us/2026-06-27, region=us/dt=2026-06-27 — startIndex 1
time + suffix: 2026-06-27/source=app, dt=2026-06-27/source=app — startIndex 0
prefix + time + suffix: region=us/dt=2026-06-27/source=app — startIndex 1
Day, format=yyyyMMdd
time only: 20260627, dt=20260627 — startIndex 0
prefix + time: region=us/20260627, region=us/dt=20260627 — startIndex 1
time + suffix: 20260627/source=app, dt=20260627/source=app — startIndex 0
prefix + time + suffix: region=us/dt=20260627/source=app — startIndex 1
Hour, format=yyyy-MM-dd/HH
time only: 2026-06-27/12, dt=2026-04-05/hh=12 — startIndex 0
prefix + time: region=us/2026-06-27/12, region=us/dt=2026-04-05/hh=12 — startIndex 1
time + suffix: 2026-06-27/12/source=app, dt=2026-04-05/hh=12/source=app — startIndex 0
prefix + time + suffix: region=us/dt=2026-04-05/hh=12/source=app — startIndex 1
Hour, format=yyyyMMdd/HH
time only: 20260627/12, dt=20260405/hh=12 — startIndex 0
prefix + time: region=us/20260627/12, region=us/dt=20260405/hh=12 — startIndex 1
time + suffix: 20260627/12/source=app, dt=20260405/hh=12/source=app — startIndex 0
prefix + time + suffix: region=us/dt=20260405/hh=12/source=app — startIndex 1
Hive-style key names are not constrained: dt=, day=, event_date=, hh=, hour= all work; only the value after = is parsed.
Hive-style is honored from the table's hoodie.datasource.write.hive_style_partitioning flag (authoritative table-level setting, not guessed per-segment).
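The path shapes above can be read as "skip startIndex segments, take one segment for a day format or two for an hour format, strip the key= part when hive-style is on, then parse". A minimal sketch of that reading follows. It is an illustration only, not the parsing code in this PR: the class `PartitionEventTimeSketch`, the method `parseEventTime`, and its parameter names are hypothetical, and the real config keys are not shown here.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch, not the implementation in this PR.
final class PartitionEventTimeSketch {
    private PartitionEventTimeSketch() {}

    /**
     * Extracts the event time from a partition path.
     *
     * @param partitionPath e.g. "region=us/dt=2026-06-27/source=app"
     * @param format        e.g. "yyyy-MM-dd", "yyyyMMdd", "yyyy-MM-dd/HH", "yyyyMMdd/HH"
     * @param startIndex    index of the first path segment that holds the time
     * @param hiveStyle     table-level flag: segments look like key=value
     */
    static LocalDateTime parseEventTime(String partitionPath, String format,
                                        int startIndex, boolean hiveStyle) {
        String[] segments = partitionPath.split("/");
        // A day format spans one segment, an hour format spans two.
        int timeSegments = format.split("/").length;
        if (startIndex < 0 || startIndex + timeSegments > segments.length) {
            throw new IllegalArgumentException(
                "Partition path '" + partitionPath + "' is too short for format " + format);
        }
        StringBuilder value = new StringBuilder();
        for (int i = 0; i < timeSegments; i++) {
            String segment = segments[startIndex + i];
            if (hiveStyle) {
                // Only the value after '=' is parsed; the key name is ignored.
                int eq = segment.indexOf('=');
                if (eq < 0) {
                    throw new IllegalArgumentException(
                        "Expected key=value segment but got '" + segment + "'");
                }
                segment = segment.substring(eq + 1);
            }
            if (i > 0) {
                value.append('/');
            }
            value.append(segment);
        }
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(format);
        // Malformed values surface as DateTimeParseException from parse().
        if (format.contains("H")) {
            return LocalDateTime.parse(value.toString(), formatter);
        }
        return LocalDate.parse(value.toString(), formatter).atStartOfDay();
    }
}
```

For example, `parseEventTime("region=us/dt=2026-06-27/source=app", "yyyy-MM-dd", 1, true)` yields midnight on 2026-06-27, and any trailing segments such as source=app are ignored because only the segments covered by the format are read.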
Impact
Risk Level
low
Documentation Update
none
Contributor's checklist