Skip to content

feat: Variant Support#2188

Open
c-thiel wants to merge 22 commits into
apache:mainfrom
c-thiel:feat/variant-support
Open

feat: Variant Support#2188
c-thiel wants to merge 22 commits into
apache:mainfrom
c-thiel:feat/variant-support

Conversation

@c-thiel
Copy link
Copy Markdown
Collaborator

@c-thiel c-thiel commented Feb 28, 2026

Which issue does this PR close?

Variant Support.
Arrow value support is currently missing as I am unsure how we want to extend Literal

What changes are included in this PR?

Core: Variant Type

  • crates/iceberg/src/spec/datatypes.rs — new Variant type
  • crates/iceberg/src/spec/values/literal.rsVariant literal value
  • crates/iceberg/src/spec/schema/ — visitor, index, pruning, mod, id reassigner all handle Variant
  • crates/iceberg/src/spec/table_metadata.rs — metadata support

Avro

  • crates/iceberg/src/avro/schema.rs — read/write Variant in Avro

Arrow

  • crates/iceberg/src/arrow/schema.rs — map Variant to Arrow type
  • crates/iceberg/src/arrow/reader.rs — read Variant from Arrow
  • crates/iceberg/src/arrow/value.rs — Arrow value conversion
  • Minor fixes in caching_delete_file_loader.rs and nan_val_cnt_visitor.rs

Parquet

  • crates/iceberg/src/writer/file_writer/parquet_writer.rs — write Variant columns

Tests & Dev

  • crates/integration_tests/tests/read_variant.rs — new integration test for reading Variant data
  • dev/spark/provision.py — Spark provisioning to generate Variant test data

Are these changes tested?

Sure! Even integration tested :)

let table_creation = TableCreation::builder()
.name(name.clone())
.schema(iceberg_schema)
.format_version(format_version)
Copy link
Copy Markdown
Collaborator Author

@c-thiel c-thiel Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this change existing tests where rightfully failing as we used to create a V2 table with a NS Timestamp column:
https://github.com/apache/iceberg-rust/actions/runs/22522306915/job/65248930667

This new logic determines the min format version required and uses that - but at least V2. Thus we switch now to V3 for ns timestamps.

@brgr-s brgr-s mentioned this pull request Mar 4, 2026
@c-thiel
Copy link
Copy Markdown
Collaborator Author

c-thiel commented Mar 18, 2026

@CTTY @liurenjie1024 @Xuanwo this would be ready for review!

Copy link
Copy Markdown
Collaborator

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feature! Just took a look.

Also the test seems to be failing

Comment thread crates/iceberg/src/avro/schema.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/avro/schema.rs
Comment thread crates/catalog/glue/src/schema.rs Outdated
@c-thiel
Copy link
Copy Markdown
Collaborator Author

c-thiel commented Apr 10, 2026

@CTTY ready for another round!

Copy link
Copy Markdown
Collaborator

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM! Left some minor comments

Comment thread crates/iceberg/src/spec/schema/mod.rs Outdated
Comment thread crates/iceberg/src/spec/datatypes.rs Outdated
Comment thread crates/iceberg/src/spec/table_metadata.rs Outdated
@c-thiel
Copy link
Copy Markdown
Collaborator Author

c-thiel commented May 13, 2026

@CTTY ready for another round!

Copy link
Copy Markdown
Collaborator

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for working on this!

cc @blackmwk to also take a pass

Comment thread crates/iceberg/src/avro/schema.rs Outdated
Comment thread crates/iceberg/src/spec/schema/mod.rs Outdated

/// Check that all types in this schema are supported by the given format version.
///
/// Mirrors Java's `Schema.checkCompatibility()`. Returns an error listing every
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accuracy fix on the doc comment: this says it "Mirrors Java's Schema.checkCompatibility()", but Java's version also rejects a non-null initial-default when formatVersion < 3, which isn't checked here (iceberg-go #932 guards variant defaults too). Could we narrow it to "Mirrors the type-version portion of Java's Schema.checkCompatibility()" so it doesn't read as full parity? Adding the actual default check is fine as a follow-up.

Copy link
Copy Markdown
Collaborator Author

@c-thiel c-thiel May 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! The additional check was easy to add. I added it in a separate Commit so its easy to revert if it turns out to require more discussions:

abaa832

Comment thread crates/catalog/hms/src/schema.rs Outdated
Ok(hive_type)
}

fn variant(&mut self, _v: &iceberg::spec::VariantType) -> Result<Self::T> {
Copy link
Copy Markdown
Collaborator

@mbutrovich mbutrovich May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this was already discussed back in March and the call was to use "variant" (@CTTY's suggestion, with the Glue rendering as unknown: variant), citing apache/iceberg#15220 as open. iceberg-java has since landed on "unknown" for VARIANT in HiveSchemaUtil.java:170 (PR apache/iceberg#15964, April 28). Worth revisiting? Or is the decision still that the rendered shape here is fine? Not pushing either way, just flagging that Java's direction shifted after the original discussion. CC @nssalian

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would lean towards matching with Java. "unknown" is what the metastore-facing type string should be here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch to "unknown" in 3c229b1


/// Returns the minimum format version required for the type.
#[inline(always)]
pub fn min_format_version(&self) -> FormatVersion {
Copy link
Copy Markdown
Collaborator

@mbutrovich mbutrovich May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently doesn't recurse: Type::Struct(...).min_format_version() returns V1 even if the struct contains a Variant or TimestampNs. It happens to give the right answer at the schema level only because Schema::min_format_version iterates id_to_field.values(), which already includes every nested field as its own entry.

A SchemaVisitor would fit the codebase's pattern (the PR already extends visitors in seven places) and make the per-Type method correct for any input. Roughly:

struct MinFormatVersionVisitor;
impl SchemaVisitor for MinFormatVersionVisitor {
    type T = FormatVersion;
    fn schema(&mut self, _: &Schema, v: FormatVersion) -> Result<FormatVersion> { Ok(v) }
    fn field(&mut self, _: &NestedFieldRef, v: FormatVersion) -> Result<FormatVersion> { Ok(v) }
    fn r#struct(&mut self, _: &StructType, results: Vec<FormatVersion>) -> Result<FormatVersion> {
        Ok(results.into_iter().max().unwrap_or(FormatVersion::V1))
    }
    fn list(&mut self, _: &ListType, v: FormatVersion) -> Result<FormatVersion> { Ok(v) }
    fn map(&mut self, _: &MapType, k: FormatVersion, v: FormatVersion) -> Result<FormatVersion> {
        Ok(k.max(v))
    }
    fn primitive(&mut self, p: &PrimitiveType) -> Result<FormatVersion> {
        Ok(match p {
            PrimitiveType::TimestampNs | PrimitiveType::TimestamptzNs => FormatVersion::V3,
            _ => FormatVersion::V1,
        })
    }
    fn variant(&mut self, _: &VariantType) -> Result<FormatVersion> { Ok(FormatVersion::V3) }
}

Then Type::min_format_version is visit_type(self, &mut MinFormatVersionVisitor) and Schema::min_format_version reuses the same visitor over the whole tree. Net effect: same public surface, recursion is correct for any Type, and the per-variant table lives in one place rather than split between the flat match here and the id_to_field iteration in Schema::min_format_version.

Copy link
Copy Markdown
Collaborator Author

@c-thiel c-thiel May 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — I fixed the recursion. I went with a direct recursive match in Type::min_format_version (struct/list/map arms recurse) rather than a SchemaVisitor: it achieves both of your goals — correct for any Type and the per-type version table living in one place — without the extra visitor struct and Result plumbing (this fold is infallible). Schema::min_format_version now just maps over the top-level fields and relies on that recursion, so the old id_to_field split is gone too. Added a unit test for the nested case.

Let me know what you think!

fn assign_fresh_ids_to_type(field_type: &Type, next_id: &mut i32) -> Type {
match field_type {
Type::Primitive(_) => field_type.clone(),
Type::Primitive(_) | Type::Variant(_) => field_type.clone(),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style mismatch with id_reassigner.rs:105: that file does Type::Variant(v) => Ok(Type::Variant(v)), this one folds Variant into the primitive arm with field_type.clone(). Both work since VariantType is Copy. Picking one would be slightly easier on future readers.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — split it into an explicit Type::Variant(v) => Type::Variant(*v) arm so it matches id_reassigner.rs.

Comment thread crates/iceberg/src/spec/schema/index.rs Outdated
Ok(())
}

fn variant(&mut self, _v: &crate::spec::VariantType) -> Result<Self::T> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applies to crates/iceberg/src/spec/schema/index.rs:57, 153, 305 and crates/iceberg/src/spec/schema/prune_columns.rs:242 and crates/catalog/glue/src/schema.rs:186 and crates/catalog/hms/src/schema.rs:143

Minor: crate::spec::VariantType inline rather than use'd. The codebase has mixed precedent here (some files use inline FQNs too) so feel free to ignore, but use would match the surrounding files' imports more closely.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useing it now :)

}

Ok(())
fn variant(&mut self, _v: &VariantType) -> Result<Self::T> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really a comment on this line, but: iceberg-java's TypeToMessageType#variant writes the Parquet group with LogicalTypeAnnotation.variantType(VARIANT_SPEC_VERSION). The Rust write path here doesn't add that annotation, so files written by iceberg-rust would carry a plain Struct(Binary, Binary) without the variant logical type marker. The integration tests are read-only against Spark-written data, so it isn't caught. Worth a tracking issue, or already on the roadmap?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR is unshredded variants only, with two follow-ups:

  1. Write annotation (your comment): we emit a plain Struct(Binary,Binary) without variantType(...) since variant_experimental is off. Doesn't break Java read-back — it resolves variant by field-id, not the annotation — but I'll track adding it.
  2. Shredded reads: a typed_value sub-field was being silently dropped (corrupt data). Added a guard that returns FeatureUnsupported + a test; full shredding reconstruction is a follow-up.

Guard added here:
b702f5a

Copy link
Copy Markdown
Collaborator Author

@c-thiel c-thiel May 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding 1)
iceberg-rust writes via AsyncArrowWriter, which derives the Parquet schema from the Arrow schema. In parquet 58.1.0, that path only emits the VARIANT annotation when the field carries the parquet_variant_compute::VariantType extension type and variant_experimental is enabled (otherwise logical_type_for_struct is a stub returning None). I couldn't find a public per-field hook to inject the annotation onto a plain Struct(Binary,Binary).

So the real cost is: enable variant_experimental + attach the extension type to the field. Two risksthat I se:

  • Turning on the feature may change how the reader decodes a VARIANT-annotated group (native VariantArray instead of Struct{metadata,value}) — could break the current read path that expects the struct.
  • New experimental dep surface.

Not sure how we should proceed. I think this maybe should be a separate issue?

I created this for now:
#2546

Copy link
Copy Markdown
Collaborator

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the feedback @c-thiel! This LGTM! Looking forward to putting some queries through this.

Copy link
Copy Markdown

@nssalian nssalian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Thanks for the work @c-thiel!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants