Use this issue to track InQL RFC 025, proposed at docs/rfcs/025_typed_sketch_logical_values.md.
Area
- Specification (RFCs)
- Package & tests
- Documentation
Summary
RFC 025 defines typed sketch logical values for InQL. Sketch helpers must not model sketch state as ordinary strings or binary blobs; they must produce and consume logical sketch values that record sketch family, input value domain, parameterization, merge compatibility, and serialization format identity.
Motivation
RFC 023 covers portable approximate aggregates that directly return scalar estimates. Stored and mergeable sketch state is a separate problem: authors may want to materialize a sketch, merge sketches across partitions or files, estimate from a stored sketch later, or serialize sketch state for transport. Those operations require compatibility rules that bytes or str alone cannot represent.
Without a typed sketch value model, InQL cannot reject invalid operations such as merging a HyperLogLog sketch with a KLL sketch, merging sketches over incompatible value domains, or deserializing a payload with an incompatible format version. That would push semantic validation into backend-specific runtime failures and weaken the Substrait boundary.
Proposal sketch
Define sketch logical values as first-class typed values. A sketch value should carry at least:
- sketch family identity, such as HyperLogLog, KLL, theta, count-min, or bitmap;
- input value domain, such as string identifiers, integer identifiers, numeric values, or categorical values;
- family parameters that affect merge compatibility, such as precision, accuracy, nominal entries, width/depth, seed, or ordering policy;
- format identity and version when the value can be serialized;
- nullability and ordinary column-position metadata needed by existing InQL expression and relation surfaces.
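As a rough illustration of the list above, a sketch type descriptor could look like the following. This is a minimal sketch, not the RFC's type spelling (which is still an open Draft question): the class name `SketchType`, its field names, and the format name `example-hll` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


# Hypothetical descriptor; none of these names come from RFC 025.
@dataclass(frozen=True)
class SketchType:
    family: str                          # e.g. "hyperloglog", "kll", "theta"
    domain: str                          # input value domain, e.g. "string"
    params: Tuple[Tuple[str, int], ...]  # merge-relevant parameters as (name, value) pairs
    format_id: Optional[str] = None      # set only when the value can be serialized
    format_version: Optional[int] = None
    nullable: bool = True


hll_a = SketchType("hyperloglog", "string", (("precision", 14),), "example-hll", 1)
hll_b = SketchType("hyperloglog", "string", (("precision", 14),), "example-hll", 1)
kll = SketchType("kll", "float64", (("k", 200),))

# Frozen dataclasses compare field by field, so two descriptors are equal
# only when family, domain, parameters, and format identity all agree.
print(hll_a == hll_b)  # True
print(hll_a == kll)    # False
```

Making the descriptor immutable and hashable means it can be used as part of a column's logical type without being confused with the bytes of the sketch itself.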
Sketch construction helpers that summarize rows should be aggregate measures. Merge helpers should validate sketch family compatibility before lowering. Estimate helpers should declare their scalar result type. Serialization/deserialization should be explicit and should require enough metadata to identify family, domain, parameters, and format version.
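The merge rule above could be checked at typecheck time along these lines. This is an illustrative sketch under stated assumptions: `SketchType` here is a reduced, hypothetical descriptor, `check_merge` and `SketchTypeError` are invented names, and requiring exact parameter equality is a simplification, since which parameters affect merge compatibility is still an open RFC question.

```python
from typing import NamedTuple, Tuple


class SketchTypeError(TypeError):
    """Hypothetical diagnostic raised before any plan is lowered."""


class SketchType(NamedTuple):
    family: str
    domain: str
    params: Tuple[Tuple[str, int], ...]


def check_merge(left: SketchType, right: SketchType) -> SketchType:
    """Return the merged sketch type, or raise if the operands are incompatible."""
    if left.family != right.family:
        raise SketchTypeError(
            f"cannot merge {left.family} sketch with {right.family} sketch"
        )
    if left.domain != right.domain:
        raise SketchTypeError(
            f"cannot merge sketches over {left.domain} and {right.domain} values"
        )
    # Assumption: every parameter must match exactly for a merge to be valid.
    if left.params != right.params:
        raise SketchTypeError("cannot merge sketches with different parameters")
    return left


hll = SketchType("hyperloglog", "string", (("precision", 14),))
kll = SketchType("kll", "float64", (("k", 200),))

merged = check_merge(hll, hll)  # accepted: identical sketch types
try:
    check_merge(hll, kll)
except SketchTypeError as err:
    print(err)  # cannot merge hyperloglog sketch with kll sketch
```

Because the check only inspects type descriptors, it can run without touching sketch payloads or a backend, which is what keeps the error out of backend-specific runtime failures.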
The Substrait boundary must preserve sketch logical type identity through extension type metadata or reject unsupported operations before execution. Backend adapters may implement, emulate, or reject sketch operations, but they must not redefine ordinary binary or string expressions as sketch states.
Alternatives considered
- Treat sketches as bytes: rejected because it prevents typechecking merge compatibility and moves semantic errors into backend runtime failures.
- Expose only scalar approximate aggregates: rejected as a complete long-term answer because stored and mergeable sketches are a legitimate analytics need.
- Copy one backend's sketch catalog directly: rejected because InQL needs backend-neutral semantics and capability reporting.
- Make sketch values ordinary structs: rejected unless the struct carries a distinct logical type; ordinary structs do not by themselves encode family-specific compatibility rules.
Impact / compatibility
This RFC is additive. RFC 023 approximate scalar-result aggregates remain valid. Existing string or binary columns should not be retroactively treated as sketch values. If a backend or existing dataset stores sketch bytes, authors should use explicit deserialization with the required sketch type metadata.
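To make the explicit-deserialization path concrete, the following sketch shows one possible shape for it. Everything here is hypothetical: the helper name `sketch_from_bytes`, the `DeclaredSketchType` and `SketchValue` classes, the `SUPPORTED_FORMATS` registry, and the format name `example-hll` are assumptions for illustration, not part of InQL or any backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeclaredSketchType:
    family: str
    domain: str
    precision: int
    format_id: str
    format_version: int


@dataclass(frozen=True)
class SketchValue:
    sketch_type: DeclaredSketchType
    payload: bytes


# Hypothetical registry of (format_id, format_version) pairs that can be read.
SUPPORTED_FORMATS = {("example-hll", 1)}


def sketch_from_bytes(payload: bytes, declared: DeclaredSketchType) -> SketchValue:
    """Wrap stored bytes as a typed sketch value; the caller must declare the type."""
    key = (declared.format_id, declared.format_version)
    if key not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported sketch format {declared.format_id} v{declared.format_version}"
        )
    return SketchValue(declared, payload)


declared = DeclaredSketchType("hyperloglog", "string", 14, "example-hll", 1)
value = sketch_from_bytes(b"\x00\x01", declared)
print(value.sketch_type.family)  # hyperloglog
```

The point of the design is that a binary column never becomes a sketch implicitly: the bytes stay opaque until an author supplies the family, domain, parameters, and format version alongside them.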
This affects the InQL specification, public sketch helper metadata, typechecking/diagnostics for sketch-valued expressions, Substrait lowering or rejection, backend adapter capability handling, and documentation that distinguishes typed approximate state from backend-specific blobs.
Implementation notes (optional)
Resolve the RFC 025 Draft questions before implementation: public type spelling, first sketch family, merge-compatible parameters, serialization portability, and whether sketch values are legal in table schemas before InQL has a broader logical type model beyond primitive row columns.
This is intentionally split from RFC 023 so PRs implementing portable approximate aggregates do not smuggle in untyped sketch payload semantics.
Checklist