fix(spark): reject BLOB columns as record key / ordering / partition path fields by ZZZxDong · Pull Request #19056 · apache/hudi

ZZZxDong · 2026-06-24T07:24:42Z

Describe the issue this Pull Request addresses

A BLOB column (a struct under the hood) silently passed all existing key-field
validation. CREATE TABLE ... TBLPROPERTIES (primaryKey = '<blob_col>') and the
equivalent DataSource writes succeeded, producing a JSON-stringified struct as the
_hoodie_record_key (e.g. {"type":"INLINE","data":"hello-0","reference":null}).

BLOB holds large binary payloads: INLINE bytes or EXTERNAL references. Using it
as a key balloons the record key, shuffle, and metadata index (INLINE), or ties
record identity to a storage path rather than content (EXTERNAL). It is therefore
not a valid record key, ordering/preCombine, or partition path field.

Summary and Changelog

Reject BLOB-typed columns used as record key, ordering (preCombine), or partition
path fields, failing fast with a clear message on both write paths:

DDL: HoodieOptionConfig.validateTable (record key + ordering) and
HoodieCatalogTable (partition path).
DataSource: HoodieSparkSqlWriter writeInternal and bootstrap.

Adds HoodieSchemaUtils.isBlobField / findBlobFields helpers (top-level,
case-insensitive, comma-separated multi-field aware). INLINE and EXTERNAL blobs
are treated identically. Note: PARTITIONED BY (<blob>) is already rejected
earlier by Spark (struct partition columns are disallowed); this change covers
the hoodie.datasource.write.partitionpath.field route.
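
The helper behavior described above (top-level lookup, case-insensitive matching, comma-separated specs) might look roughly like the sketch below. It is not the PR's implementation: the metadata key `type_descriptor` and the object name are hypothetical, and the sketch assumes spark-sql is on the classpath. Only the "BLOB" descriptor value comes from the review discussion.

```scala
import org.apache.spark.sql.types.{StructField, StructType}

object BlobFieldSketch {
  // Hypothetical metadata key; the key Hudi's schema converter actually uses is not shown here.
  val TypeKey = "type_descriptor"

  // A field counts as BLOB when its metadata carries the "BLOB" type descriptor.
  def isBlobField(field: StructField): Boolean =
    field.metadata.contains(TypeKey) && field.metadata.getString(TypeKey) == "BLOB"

  // spec is a comma-separated field list such as "id, payload".
  // Only top-level fields are inspected, and names are matched case-insensitively.
  def findBlobFields(schema: StructType, spec: String): Seq[String] = {
    val requested = spec.split(",").map(_.trim).filter(_.nonEmpty).toSeq
    requested.flatMap { name =>
      schema.fields.find(f => f.name.equalsIgnoreCase(name) && isBlobField(f)).map(_.name)
    }
  }
}
```

With a schema whose `payload` struct column carries that metadata, `BlobFieldSketch.findBlobFields(schema, "ID, Payload")` would return only `payload`, since `id` matches by name but is not a BLOB.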

Impact

User-facing: a CREATE TABLE / write that previously succeeded with a BLOB key
field now fails fast with a clear error. This is the intended fix; such tables
were already semantically broken.

Risk Level

Low: validation-only, no storage format or read-path change. Covered by new
tests in TestBlobDataType, TestHoodieSparkSqlWriter, TestHoodieOptionConfig.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

hudi-agent

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds fail-fast validation rejecting BLOB-typed columns used as record key, ordering/preCombine, or partition path fields across the Spark DDL (HoodieOptionConfig.validateTable, HoodieCatalogTable) and DataSource (HoodieSparkSqlWriter write/bootstrap) paths, with shared isBlobField/findBlobFields helpers. I traced the validation logic, confirmed that the BLOB type-descriptor comparison matches what the real schema converter attaches ("BLOB"), verified that the config keys line up with the write path, and checked null/empty and col:type partition handling. No issues were flagged from this automated pass. A Hudi committer or PMC member can take it from here for a final review. A few small readability improvements are possible: a redundant null guard that could mislead readers, an opaque inner function name, and a repeated error-message string that would benefit from being a shared constant.

cc @yihua

hudi-agent · 2026-06-24T09:16:48Z

+   */
+  def isBlobField(field: StructField): Boolean = {
+    val md = field.metadata
+    md != null &&


🤖 nit: StructField.metadata is typed as Metadata in Spark (never null — it defaults to Metadata.empty), so the md != null && guard is a dead check. It may confuse future readers into thinking null metadata is possible here; could you drop it and just start with md.contains(...)?

_{⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.}
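
To illustrate the point about the default, here is a small sketch assuming spark-sql is on the classpath. It only shows that a StructField built without a metadata argument gets Metadata.empty; it does not cover a caller that passes null explicitly. The key `type_descriptor` and the names in the object are hypothetical.

```scala
import org.apache.spark.sql.types.{Metadata, StringType, StructField}

object MetadataDefaultSketch {
  // No metadata argument is supplied, so the case-class default (Metadata.empty) applies.
  val plain: StructField = StructField("id", StringType)

  // Check written without the null guard; "type_descriptor" is a hypothetical key.
  def hasDescriptor(field: StructField): Boolean =
    field.metadata.contains("type_descriptor")
}
```

Here `MetadataDefaultSketch.plain.metadata == Metadata.empty` holds, and `hasDescriptor(plain)` is false without any null check.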

hudi-agent · 2026-06-24T09:16:48Z

+  private def validateBlobFieldUsage(schema: StructType,
+                                     optParams: Map[String, String],
+                                     resolver: Resolver): Unit = {
+    def check(spec: Option[String], usage: String): Unit = spec.foreach { s =>


🤖 nit: the inner function check is a bit opaque at the call sites — something like rejectIfBlobField would make lines 840–843 self-documenting without having to look up the implementation.

_{⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.}
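
A rough sketch of how the call sites could read after the suggested rename. The signature is simplified and differs from the PR's (which takes a schema and a resolver): the BLOB lookup is passed in as a function, IllegalArgumentException stands in for HoodieException, and the record key and preCombine config keys are assumptions; only the partition path key appears in the PR description.

```scala
object RejectIfBlobFieldSketch {
  // findBlobFields is a stand-in for the PR's helper: it maps a comma-separated
  // field spec to the BLOB-typed field names it contains.
  def validateBlobFieldUsage(optParams: Map[String, String],
                             findBlobFields: String => Seq[String]): Unit = {
    def rejectIfBlobField(spec: Option[String], usage: String): Unit = spec.foreach { s =>
      val blobs = findBlobFields(s)
      if (blobs.nonEmpty) {
        // Stand-in exception type; the PR throws HoodieException.
        throw new IllegalArgumentException(
          s"BLOB type column(s) ${blobs.mkString("[", ", ", "]")} cannot be used as $usage.")
      }
    }

    // With the descriptive name, each call states what it enforces.
    rejectIfBlobField(optParams.get("hoodie.datasource.write.recordkey.field"), "record key")
    rejectIfBlobField(optParams.get("hoodie.datasource.write.precombine.field"), "ordering/preCombine field")
    rejectIfBlobField(optParams.get("hoodie.datasource.write.partitionpath.field"), "partition path field")
  }
}
```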

hudi-agent · 2026-06-24T09:16:49Z

+      if (blobs.nonEmpty) {
+        throw new HoodieException(
+          s"BLOB type column(s) ${blobs.mkString("[", ", ", "]")} cannot be used as $usage. " +
+            "BLOB columns hold large binary payloads (INLINE or EXTERNAL) and are not supported as " +


🤖 nit: the sentence "BLOB columns hold large binary payloads (INLINE or EXTERNAL) and are not supported as record key, ordering/preCombine, or partition path fields." is copy-pasted into four places across three files (HoodieSparkSqlWriter, HoodieCatalogTable, and twice in HoodieOptionConfig). Extracting it as a constant in HoodieSchemaUtils would mean a single edit keeps all messages in sync.

_{⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.}
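
One possible shape for the shared constant, as a sketch only. The object and member names are hypothetical, and the PR does not say where such a constant would live beyond the suggestion of HoodieSchemaUtils; the message text is taken from the diff and review comment above.

```scala
object BlobValidationMessages {
  // Single source of truth for the explanation that currently appears in four places.
  val BlobNotSupportedReason: String =
    "BLOB columns hold large binary payloads (INLINE or EXTERNAL) and are not supported as " +
      "record key, ordering/preCombine, or partition path fields."

  // Builds the full error message for a given set of offending columns and usage.
  def blobUsageError(blobs: Seq[String], usage: String): String =
    s"BLOB type column(s) ${blobs.mkString("[", ", ", "]")} cannot be used as $usage. " +
      BlobNotSupportedReason
}
```

Each validation site would then call `BlobValidationMessages.blobUsageError(blobs, usage)` instead of repeating the sentence, so a wording change touches one place.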

hudi-bot · 2026-06-24T11:07:31Z

CI report:

14997e8 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

github-actions bot added the size:M label (PR with lines of changes in (100, 300]) on Jun 24, 2026

ZZZxDong force-pushed the fix-HUDI-18819-reject-blob-key-fields branch from c62225d to 4d184a5 on June 24, 2026 07:35

Commit 14997e8: fix(spark): reject BLOB columns as record key / ordering / partition path fields

ZZZxDong force-pushed the fix-HUDI-18819-reject-blob-key-fields branch from 4d184a5 to 14997e8 on June 24, 2026 08:53

hudi-agent reviewed Jun 24, 2026

ZZZxDong wants to merge 1 commit into apache:master from ZZZxDong:fix-HUDI-18819-reject-blob-key-fields
