fix(spark): reject BLOB columns as record key / ordering / partition path fields #19056

Open
ZZZxDong wants to merge 1 commit into apache:master from ZZZxDong:fix-HUDI-18819-reject-blob-key-fields

ZZZxDong (Contributor) commented Jun 24, 2026

Describe the issue this Pull Request addresses

Closes #18819

A BLOB column (a struct under the hood) silently passed all existing key-field
validation. CREATE TABLE ... TBLPROPERTIES (primaryKey = '<blob_col>') and the
equivalent DataSource writes succeeded, producing a JSON-stringified struct as the
_hoodie_record_key (e.g. {"type":"INLINE","data":"hello-0","reference":null}).

BLOB holds large binary payloads — INLINE bytes or EXTERNAL references. Using it
as a key balloons the record key / shuffle / metadata index (INLINE), or ties
record identity to a storage path rather than content (EXTERNAL). It is not a
valid record key, ordering/preCombine, or partition path field.

Summary and Changelog

Reject BLOB-typed columns used as record key, ordering (preCombine), or partition
path fields, failing fast with a clear error message on both write paths:

  • DDL: HoodieOptionConfig.validateTable (record key + ordering) and
    HoodieCatalogTable (partition path).
  • DataSource: HoodieSparkSqlWriter writeInternal and bootstrap.

Adds HoodieSchemaUtils.isBlobField / findBlobFields helpers (top-level,
case-insensitive, comma-separated multi-field aware). INLINE and EXTERNAL blobs
are treated identically. Note: PARTITIONED BY (<blob>) is already rejected
earlier by Spark (struct partition columns are disallowed); this change covers
the hoodie.datasource.write.partitionpath.field route.
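
The matching rules described above (top-level only, case-insensitive, comma-separated) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: Field is a hypothetical stand-in for Spark's StructField, the isBlob flag stands in for whatever metadata check the real isBlobField performs, and the object name BlobFieldSketch is invented for this example.

```scala
// Simplified stand-in for a top-level schema column. `isBlob` replaces the
// real metadata-based BLOB check (an assumption, not the PR's implementation).
final case class Field(name: String, isBlob: Boolean)

object BlobFieldSketch {
  /**
   * Returns the names from a comma-separated field spec that resolve to
   * BLOB columns. Only top-level columns are considered, names are matched
   * case-insensitively, and a null or empty spec yields no matches.
   */
  def findBlobFields(schema: Seq[Field], spec: String): Seq[String] =
    Option(spec).toSeq
      .flatMap(_.split(",").toSeq)   // multi-field specs such as "a,b"
      .map(_.trim)
      .filter(_.nonEmpty)
      .filter(name => schema.exists(f => f.isBlob && f.name.equalsIgnoreCase(name)))
}
```

With a schema of Seq(Field("id", isBlob = false), Field("payload", isBlob = true)), the call BlobFieldSketch.findBlobFields(schema, "id, PAYLOAD") returns Seq("PAYLOAD"): the name is reported as the user wrote it, which keeps the error message easy to map back to the offending option.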

Impact

User-facing: a CREATE TABLE / write that previously succeeded with a BLOB key
field now fails fast with a clear error. This is the intended fix; such tables
were already semantically broken.

Risk Level

low — validation-only, no storage format or read-path change. Covered by new
tests in TestBlobDataType, TestHoodieSparkSqlWriter, TestHoodieOptionConfig.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

github-actions bot added the size:M (PR with lines of changes in (100, 300]) label Jun 24, 2026
ZZZxDong force-pushed the fix-HUDI-18819-reject-blob-key-fields branch from c62225d to 4d184a5 June 24, 2026 07:35
ZZZxDong force-pushed the fix-HUDI-18819-reject-blob-key-fields branch from 4d184a5 to 14997e8 June 24, 2026 08:53

hudi-agent (Contributor) left a comment

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds fail-fast validation that rejects BLOB-typed columns used as record key, ordering/preCombine, or partition path fields across the Spark DDL (HoodieOptionConfig.validateTable, HoodieCatalogTable) and DataSource (HoodieSparkSqlWriter write/bootstrap) paths, with shared isBlobField/findBlobFields helpers.

I traced the validation logic, confirmed that the BLOB type-descriptor comparison matches what the real schema converter attaches ("BLOB"), verified that the config keys line up with the write path, and checked null/empty and col:type partition handling. No issues were flagged in this automated pass beyond a few small readability nits: a redundant null guard that could mislead readers, an opaque inner function name, and a repeated error-message string that would benefit from being a shared constant.

A Hudi committer or PMC member can take it from here for a final review.

cc @yihua

   */
  def isBlobField(field: StructField): Boolean = {
    val md = field.metadata
    md != null &&

🤖 nit: StructField.metadata is typed as Metadata in Spark (never null — it defaults to Metadata.empty), so the md != null && guard is a dead check. It may confuse future readers into thinking null metadata is possible here; could you drop it and just start with md.contains(...)?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.
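
The claim about the null guard can be checked against Spark's public API: a StructField built without explicit metadata carries Metadata.empty. The sketch below assumes Spark SQL is on the classpath. The metadata key "hudi_type" and the object name BlobCheckSketch are hypothetical placeholders, since the key the PR actually reads is not visible in this thread; only the "BLOB" descriptor value comes from the review summary. StringType is used purely for brevity and is not how a BLOB column is typed.

```scala
import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StringType, StructField}

object BlobCheckSketch {
  // Hypothetical metadata key; the real key used by Hudi is not shown here.
  val TypeKey: String = "hudi_type"

  // No null guard needed: StructField.metadata defaults to Metadata.empty.
  def isBlobField(field: StructField): Boolean = {
    val md = field.metadata
    md.contains(TypeKey) && md.getString(TypeKey) == "BLOB"
  }

  // A column with default metadata and one tagged with the BLOB descriptor.
  val plain: StructField = StructField("id", StringType)
  val tagged: StructField = StructField(
    "payload", StringType, nullable = true,
    new MetadataBuilder().putString(TypeKey, "BLOB").build())
}
```

Here BlobCheckSketch.plain.metadata equals Metadata.empty, so contains returns false without any null check, while BlobCheckSketch.tagged is recognized as a BLOB field.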

private def validateBlobFieldUsage(schema: StructType,
                                   optParams: Map[String, String],
                                   resolver: Resolver): Unit = {
  def check(spec: Option[String], usage: String): Unit = spec.foreach { s =>

🤖 nit: the inner function check is a bit opaque at the call sites — something like rejectIfBlobField would make lines 840–843 self-documenting without having to look up the implementation.

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

if (blobs.nonEmpty) {
  throw new HoodieException(
    s"BLOB type column(s) ${blobs.mkString("[", ", ", "]")} cannot be used as $usage. " +
      "BLOB columns hold large binary payloads (INLINE or EXTERNAL) and are not supported as " +

🤖 nit: the sentence "BLOB columns hold large binary payloads (INLINE or EXTERNAL) and are not supported as record key, ordering/preCombine, or partition path fields." is copy-pasted into four places across three files (HoodieSparkSqlWriter, HoodieCatalogTable, and twice in HoodieOptionConfig). Extracting it as a constant in HoodieSchemaUtils would mean a single edit keeps all messages in sync.

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.
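
One possible shape for the shared constant suggested above is sketched here. The names BlobValidationMessages, BlobNotSupportedReason, and blobUsageError are hypothetical; the review proposes placing the constant in HoodieSchemaUtils, and a standalone object is used only to keep the example self-contained. The message text is taken from the diff hunk and the review comment.

```scala
object BlobValidationMessages {
  // Shared explanation, defined once so every call site stays in sync.
  val BlobNotSupportedReason: String =
    "BLOB columns hold large binary payloads (INLINE or EXTERNAL) and are not supported as " +
      "record key, ordering/preCombine, or partition path fields."

  /** Builds the full error text for the offending BLOB columns and their usage. */
  def blobUsageError(blobs: Seq[String], usage: String): String =
    s"BLOB type column(s) ${blobs.mkString("[", ", ", "]")} cannot be used as $usage. " +
      BlobNotSupportedReason
}
```

Each validation site would then throw with BlobValidationMessages.blobUsageError(blobs, "record key") (or the relevant usage string), so a wording change touches one line rather than four.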

@hudi-bot

Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Development

Successfully merging this pull request may close these issues.

BLOB column should be rejected as primaryKey / recordKey

3 participants