From 7aefd78675638430cdbc2c9c61ad65fca67ea5aa Mon Sep 17 00:00:00 2001
From: Danny Meijer
Date: Tue, 26 May 2026 02:03:37 +0200
Subject: [PATCH 01/13] feature - implement RFC 022 hashing functions (#39)

---
 docs/language/reference/functions/format.md | 31 ++++++++
 docs/language/reference/functions/index.md  |  4 +-
 docs/release_notes/v0_1.md                  |  1 +
 .../022_semi_structured_format_functions.md | 39 ++++++++--
 docs/rfcs/README.md                         |  2 +-
 src/functions/hashing/md5.incn              | 51 +++++++++++++
 src/functions/hashing/sha2.incn             | 76 +++++++++++++++++++
 src/functions/hashing/sha224.incn           | 51 +++++++++++++
 src/functions/hashing/sha256.incn           | 51 +++++++++++++
 src/functions/hashing/sha384.incn           | 51 +++++++++++++
 src/functions/hashing/sha512.incn           | 51 +++++++++++++
 src/functions/mod.incn                      |  6 ++
 src/lib.incn                                |  6 ++
 src/substrait/function_extensions.incn      |  5 ++
 tests/test_function_registry.incn           | 35 ++++++++-
 tests/test_hashing_functions.incn           | 55 ++++++++++++++
 tests/test_session_projection.incn          | 39 ++++++++++
 tests/test_substrait_plan.incn              | 12 +++
 18 files changed, 556 insertions(+), 10 deletions(-)
 create mode 100644 docs/language/reference/functions/format.md
 create mode 100644 src/functions/hashing/md5.incn
 create mode 100644 src/functions/hashing/sha2.incn
 create mode 100644 src/functions/hashing/sha224.incn
 create mode 100644 src/functions/hashing/sha256.incn
 create mode 100644 src/functions/hashing/sha384.incn
 create mode 100644 src/functions/hashing/sha512.incn
 create mode 100644 tests/test_hashing_functions.incn

diff --git a/docs/language/reference/functions/format.md b/docs/language/reference/functions/format.md
new file mode 100644
index 0000000..568e349
--- /dev/null
+++ b/docs/language/reference/functions/format.md
@@ -0,0 +1,31 @@
+# Format Functions (Reference)
+
+Format helpers operate on scalar payloads that are already present in a relation. They do not read files, infer source
+schemas from external locations, or change relation cardinality.
+
+The current implemented slice is deterministic string hashing:
+
+| Function | Meaning |
+| --- | --- |
+| `md5(expr)` | Return the lowercase hexadecimal MD5 digest for one string expression. |
+| `sha224(expr)` | Return the lowercase hexadecimal SHA-224 digest for one string expression. |
+| `sha256(expr)` | Return the lowercase hexadecimal SHA-256 digest for one string expression. |
+| `sha384(expr)` | Return the lowercase hexadecimal SHA-384 digest for one string expression. |
+| `sha512(expr)` | Return the lowercase hexadecimal SHA-512 digest for one string expression. |
+| `sha2(expr, bit_length)` | Compatibility helper that rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths. |
+
+```incan
+from pub::inql.functions import col, md5, sha2
+
+projected = (
+    events
+    .with_column("user_hash", sha2(col("user_id"), 256))
+    .with_column("payload_md5", md5(col("payload")))
+)
+```
+
+Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal strings. `sha2(...)` accepts `224`, `256`,
+`384`, and `512`; unsupported digest lengths are rejected by the helper rather than being passed through to a backend.
+
+JSON, CSV, URL, and dynamic-value predicate helpers remain future format-function slices until their schema arguments,
+option records, path validation rules, and dynamic value model are specified.
diff --git a/docs/language/reference/functions/index.md b/docs/language/reference/functions/index.md
index 77861a2..e34af0d 100644
--- a/docs/language/reference/functions/index.md
+++ b/docs/language/reference/functions/index.md
@@ -10,10 +10,11 @@ Today the concrete shipped surfaces are documented here:
 - [Generator and table-valued functions](generators.md)
 - [Nested data functions](nested.md)
 - [Window functions](windows.md)
+- [Format functions](format.md)
 
 The canonical scalar literal helper is `lit(...)`. Typed literal helpers construct the same scalar-expression representation.
 
-The current registry-backed helper surface covers references, literals, casts, operators, predicates, conditionals, math, ordering, aggregates, generators, nested data, and windows. Each runtime entry exposes a stable function reference such as `inql.functions.col`, namespace, canonical name, typed lifecycle metadata (`since`, versioned changes, and optional deprecation), function policy category, function class, null behavior, alias policy, aggregate modifier policy, and Substrait mapping metadata. Checked public helpers provide the signature and, by default, the canonical name; metadata may override the canonical name only for source spelling constraints such as the reserved-word `mod` case.
+The current registry-backed helper surface covers references, literals, casts, operators, predicates, conditionals, math, ordering, aggregates, generators, nested data, windows, and format helpers. Each runtime entry exposes a stable function reference such as `inql.functions.col`, namespace, canonical name, typed lifecycle metadata (`since`, versioned changes, and optional deprecation), function policy category, function class, null behavior, alias policy, aggregate modifier policy, and Substrait mapping metadata. Checked public helpers provide the signature and, by default, the canonical name; metadata may override the canonical name only for source spelling constraints such as the reserved-word `mod` case.
 
 The registry is the source for non-derivable machine facts. Public helper declarations are the source for argument names, argument types, and return types. Docstrings remain human-facing explanation, examples, and parameter intent. The `registry-metadata` check validates the checked API metadata projections produced from public facade aliases, registry decorators, and decorated callable signatures.
 Runtime registry entries are lazy and process-local: they support helper execution and lowering for loaded helpers, while the complete public catalog comes from checked metadata. This matters for generated docs, diagnostics, Prism lowering, and backend capability checks as the catalog grows.
@@ -37,6 +38,7 @@ The registered helper surface currently includes:
 | `array(...)`, `cardinality(...)`, `array_contains(...)`, `arrays_overlap(...)`, `array_position(...)`, `array_range(...)`, `element_at(...)`, `array_sort(...)`, `array_distinct(...)`, `array_except(...)`, `array_intersect(...)`, `array_union(...)`, `array_join(...)`, `array_slice(...)`, `array_reverse(...)`, `array_flatten(...)`, `map_from_arrays(...)`, `map_extract(...)`, `map_contains_key(...)`, `map_keys(...)`, `map_values(...)`, `map_entries(...)`, `named_struct(...)` | scalar | registered nested scalar helpers backed by Substrait extension mappings; `array_range(...)` registers canonical `range` for positional generator lowering and `map_contains_key(...)` lowers as a documented predicate rewrite |
 | `explode(...)`, `explode_outer(...)`, `posexplode(...)`, `posexplode_outer(...)`, `inline(...)`, `inline_outer(...)`, `flatten(...)`, `stack(...)` | generator | relation-extension mappings consumed by `generate(...)`; positional forms use zero-based positions |
 | `window()`, `unbounded_preceding()`, `preceding(...)`, `current_row()`, `following(...)`, `unbounded_following()`, `row_number()`, `rank()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile(...)`, `lag(...)`, `lead(...)`, `first_value(...)`, `last_value(...)`, `nth_value(...)` | window | `window()` and bound helpers build structural window-spec metadata; placed ranking, distribution, offset, value, and aggregate-over-window helpers lower through `ConsistentPartitionWindowRel` and execute through the DataFusion session adapter |
+| `md5(...)`, `sha224(...)`, `sha256(...)`, `sha384(...)`, `sha512(...)`, `sha2(...)` | scalar | registered format/hash helpers; concrete helpers lower through Substrait extension mappings, while `sha2(...)` rewrites to a supported concrete SHA-2 helper |
 | `asc(...)`, `desc(...)`, `asc_nulls_first(...)`, `asc_nulls_last(...)`, `desc_nulls_first(...)`, `desc_nulls_last(...)` | ordering | structural sort-field helpers consumed by `order_by(...)` and lowered to Substrait `SortRel.sorts` |
 | `sum(...)`, `count(...)`, `count_expr(...)`, `count_distinct(...)`, `count_if(...)`, `avg(...)`, `min(...)`, `max(...)` | aggregate | registered Substrait extension functions for core aggregates plus compatibility rewrites for `count_expr(...)`, `count_distinct(...)`, and `count_if(...)`; core aggregates allow `DISTINCT` and aggregate-local `FILTER` where the aggregate shape is valid |
 
diff --git a/docs/release_notes/v0_1.md b/docs/release_notes/v0_1.md
index 00824cb..355e9e4 100644
--- a/docs/release_notes/v0_1.md
+++ b/docs/release_notes/v0_1.md
@@ -17,6 +17,7 @@ Entries will be filled in as work lands (link RFCs and PRs when applicable).
 - **Nested data functions:** RFC 020 adds registry-backed scalar helpers for array construction/access, cardinality, containment, overlap, sorting, set-like operations, joining, slicing, reversing, scalar array flattening, map construction/access, map key/value/entry extraction, map key containment, and named struct construction. These helpers lower through Substrait extension metadata without introducing generator semantics, with representative DataFusion-backed Session coverage for composable array projection paths.
 - **Generator functions:** RFC 021 adds registry-backed generator applications for `explode(...)`, `explode_outer(...)`, `posexplode(...)`, `posexplode_outer(...)`, `inline(...)`, `inline_outer(...)`, portable `flatten(...)`, and `stack(...)`. Generators remain relation-shaping operations applied with `generate(...)`; they preserve input columns, require explicit output aliases, lower through the current Substrait extension-relation gap encoding, and execute through the DataFusion Session adapter with concrete output-column materialization.
 - **Window functions:** RFC 019 adds `window()` specs, explicit row/range frame bounds, ranking and distribution helpers (`row_number`, `rank`, `dense_rank`, `percent_rank`, `cume_dist`, `ntile`), offset and value helpers (`lag`, `lead`, `first_value`, `last_value`, `nth_value`), and aggregate-over-window placement through `with_window_column(...)`. Portable window helpers require explicit ordering where appropriate, lower through Substrait `ConsistentPartitionWindowRel`, and execute through the DataFusion session adapter.
+- **Format functions:** RFC 022 adds the first deterministic hashing slice with `md5(...)`, `sha224(...)`, `sha256(...)`, `sha384(...)`, `sha512(...)`, and `sha2(...)`. Hash helpers operate on UTF-8 string bytes, return lowercase hexadecimal strings, lower through registry-owned Substrait metadata, and execute through the DataFusion-backed Session path.
 - **Function registry:** RFC 014 adds declaration-site registry decorators for the current public helper surface, including stable function references, checked signature projection, lifecycle metadata, behavior categories, alias policy, Substrait mapping categories, and checked API metadata drift validation.
 - **Function extension policy:** InQL RFC 024 policy metadata now distinguishes portable core functions, namespaced extension-only functions, opt-in compatibility aliases, engine-specific functions, and rejected compatibility requests without adding an extension plugin system or backend-owned semantics.
 - **Projection:** builder-based `with_column`, `add`, `mul`, and literal expression helpers now lower derived columns through Prism, Substrait, and Session execution.
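
Reviewer note: the hashing contract stated in the release-notes entry above (UTF-8 string bytes in, lowercase hexadecimal out, `sha2` selecting a concrete SHA-2 digest from a supported bit length and rejecting anything else) can be cross-checked outside InQL. The following is a minimal sketch in Python using the standard `hashlib` module as an independent reference model. It is not the Incan implementation in this patch, and the names `md5_hex` and `sha2_hex` are hypothetical, chosen only for illustration.

```python
import hashlib

# Supported SHA-2 digest sizes, mirroring the bit lengths the patch documents
# for sha2(expr, bit_length): 224, 256, 384, and 512.
_SHA2_BY_BITS = {
    224: hashlib.sha224,
    256: hashlib.sha256,
    384: hashlib.sha384,
    512: hashlib.sha512,
}


def md5_hex(text: str) -> str:
    """Hash the UTF-8 bytes of `text` and return a lowercase hex MD5 digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sha2_hex(text: str, bit_length: int) -> str:
    """Dispatch to a concrete SHA-2 digest; reject unsupported bit lengths."""
    digest = _SHA2_BY_BITS.get(bit_length)
    if digest is None:
        # Same wording as the diagnostic in src/functions/hashing/sha2.incn.
        raise ValueError("sha2 bit_length must be one of 224, 256, 384, or 512")
    return digest(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(md5_hex("payload"))
    print(sha2_hex("payload", 256))
```

A model like this may be handy for producing the concrete digest values that the session tests compare against, assuming the backend really does hash the UTF-8 encoding of each string value as the docs describe.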
diff --git a/docs/rfcs/022_semi_structured_format_functions.md b/docs/rfcs/022_semi_structured_format_functions.md
index 8df68b7..a07650d 100644
--- a/docs/rfcs/022_semi_structured_format_functions.md
+++ b/docs/rfcs/022_semi_structured_format_functions.md
@@ -1,6 +1,6 @@
 # InQL RFC 022: Semi-structured and format functions
 
-- **Status:** Draft
+- **Status:** In Progress
 - **Created:** 2026-04-27
 - **Author(s):** Danny Meijer (@dannymeijer)
 - **Related:**
@@ -12,7 +12,7 @@
   - InQL RFC 020 (nested data functions)
 - **Issue:** [InQL #39](https://github.com/dannys-code-corner/InQL/issues/39)
 - **RFC PR:** —
-- **Written against:** Incan v0.2
+- **Written against:** Incan v0.3-era InQL
 - **Shipped in:** —
 
 ## Summary
@@ -115,12 +115,41 @@ This RFC is additive. It should not change existing CSV ingestion behavior.
 - **Execution / interchange** — Prism and Substrait lowering must preserve parser options, hash encodings, and structured return values or diagnose unsupported functions.
 - **Documentation** — docs should distinguish scalar format functions from session read/write APIs.
 
-## Unresolved questions
+## Design Decisions
+
+### Resolved
+
+- The first implementation slice is deterministic hashing. JSON, CSV, URL, dynamic-value predicates, and structured parser helpers remain future slices because their schema arguments, option records, path validation, and dynamic value model are not settled here.
+- Hash helpers in this slice operate on UTF-8 string bytes and return lowercase hexadecimal strings.
+- Portable concrete hash helpers are `md5`, `sha224`, `sha256`, `sha384`, and `sha512`, each with an honest Substrait extension mapping and DataFusion-backed execution coverage.
+- `sha2(expr, bit_length)` is a compatibility helper, not a separate backend mapping. It rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths and rejects unsupported values.
+- `sha1`, `crc32`, and `xxhash64` are not implemented in the first slice because no honest Substrait/DataFusion mapping was validated for this branch.
+
+### Remaining
 
 - Should `from_json` accept model types directly as schema arguments, or only explicit schema values?
 - Should invalid JSON path expressions be compile-time errors when literal and runtime errors otherwise?
 - What option-record shape should CSV and JSON scalar parsers use?
-- Should hash functions return binary values or lowercase hexadecimal strings by default?
+- Should future binary-oriented hash helpers return binary values, lowercase hexadecimal strings, or an explicit typed encoding wrapper?
 - Which variant-style type predicates are portable enough for InQL core, and which should stay in a Snowflake-compatibility extension?
 
-
+## Implementation Plan
+
+1. Add registry-backed hashing helpers under a logical function family.
+2. Add stable Substrait extension anchors for concrete hash helpers.
+3. Keep `sha2(...)` as a compatibility rewrite over concrete helpers rather than a second mapping.
+4. Add focused helper, registry, Substrait lowering, and DataFusion session tests with concrete digest values.
+5. Add user-facing format-function docs and release notes.
+6. Leave parser, URL, and dynamic-value helpers for later RFC 022 slices once their remaining design questions are resolved.
+
+## Progress Checklist
+
+- [x] RFC 022 moved to In Progress with a first implementation slice and recorded design decisions.
+- [x] `md5`, `sha224`, `sha256`, `sha384`, `sha512`, and `sha2` helpers added under the function catalog.
+- [x] Concrete hash helpers registered with Substrait extension metadata.
+- [x] `sha2(...)` implemented as a literal-bit-length rewrite with invalid-input diagnostics.
+- [x] Focused helper, registry, Substrait lowering, and DataFusion-backed session tests added.
+- [x] User-facing format-function docs and release notes added.
+- [ ] JSON and CSV scalar parser helpers specified and implemented.
+- [ ] URL helper semantics specified and implemented.
+- [ ] Dynamic-value predicate semantics specified and implemented.
diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md
index 523de58..619fe45 100644
--- a/docs/rfcs/README.md
+++ b/docs/rfcs/README.md
@@ -28,7 +28,7 @@ InQL uses its **own** RFC series (starting at 000), independent of the [Incan la
 | [019][rfc-019] | Implemented | Window functions | |
 | [020][rfc-020] | Implemented | Nested data functions | |
 | [021][rfc-021] | Implemented | Generator and table-valued functions | |
-| [022][rfc-022] | Draft | Semi-structured and format functions | |
+| [022][rfc-022] | In Progress | Semi-structured and format functions | |
 | [023][rfc-023] | Draft | Approximate and sketch functions | |
 | [024][rfc-024] | Implemented | Function extension policy | |
 
diff --git a/src/functions/hashing/md5.incn b/src/functions/hashing/md5.incn
new file mode 100644
index 0000000..6d4b1cc
--- /dev/null
+++ b/src/functions/hashing/md5.incn
@@ -0,0 +1,51 @@
+"""
+MD5 hash helper.
+
+`md5` hashes a string expression and returns its lowercase hexadecimal digest.
+"""
+
+from function_registry import (
+    FunctionClass,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    deterministic_spec,
+    extension_mapping,
+    v0_1,
+)
+from functions.registry import function_registry, registered_application
+from projection_builders import ColumnExpr
+from substrait.function_extensions import MD5_FUNCTION_ANCHOR
+
+
+@function_registry.add("md5", deterministic_spec(
+    FunctionClass.Scalar,
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionNullBehavior.DependsOnInputs,
+    extension_mapping("md5", MD5_FUNCTION_ANCHOR),
+))
+pub def md5(expr: ColumnExpr) -> ColumnExpr:
+    """
+    Build an MD5 hexadecimal digest expression.
+
+    Examples:
+        user_digest = md5(col("user_id"))
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+    """
+    return registered_application("md5", [expr])
+
+
+module tests:
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_md5_builds_registered_application() -> None:
+        expr = md5(col("payload"))
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "md5"
+        assert column_expr_argument_count(expr) == 1
diff --git a/src/functions/hashing/sha2.incn b/src/functions/hashing/sha2.incn
new file mode 100644
index 0000000..154d424
--- /dev/null
+++ b/src/functions/hashing/sha2.incn
@@ -0,0 +1,76 @@
+"""
+SHA-2 compatibility helper.
+
+`sha2(expr, bits)` rewrites to the matching concrete SHA-2 helper for supported digest lengths.
+"""
+
+from rust::incan_stdlib::errors import raise_value_error
+from function_registry import (
+    FunctionClass,
+    FunctionDeterminism,
+    FunctionErrorBehavior,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    compatibility_alias_spec,
+    core_function_namespace,
+    rewrite_mapping,
+    v0_1,
+)
+from functions.hashing.sha224 import sha224
+from functions.hashing.sha256 import sha256
+from functions.hashing.sha384 import sha384
+from functions.hashing.sha512 import sha512
+from functions.registry import function_registry
+from projection_builders import ColumnExpr
+
+
+@function_registry.add("sha2", compatibility_alias_spec(
+    core_function_namespace(),
+    FunctionClass.Scalar,
+    ["sha224", "sha256", "sha384", "sha512"],
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionDeterminism.Deterministic,
+    FunctionNullBehavior.DependsOnInputs,
+    FunctionErrorBehavior.InvalidInputDiagnostic,
+    rewrite_mapping("sha2(expr, bits) -> sha224/sha256/sha384/sha512(expr) for supported literal bit lengths"),
+))
+pub def sha2(expr: ColumnExpr, bit_length: int) -> ColumnExpr:
+    """
+    Build a SHA-2 hexadecimal digest expression for a supported digest length.
+
+    Examples:
+        user_digest = sha2(col("user_id"), 256)
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+        bit_length: Supported digest size: 224, 256, 384, or 512.
+    """
+    if bit_length == 224:
+        return sha224(expr)
+    if bit_length == 256:
+        return sha256(expr)
+    if bit_length == 384:
+        return sha384(expr)
+    if bit_length == 512:
+        return sha512(expr)
+    return raise_value_error("sha2 bit_length must be one of 224, 256, 384, or 512")
+
+
+module tests:
+    from std.testing import assert_raises
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_sha2_rewrites_to_supported_sha2_helper() -> None:
+        expr = sha2(col("payload"), 256)
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "sha256"
+        assert column_expr_argument_count(expr) == 1
+    def _call_sha2_with_unsupported_length() -> None:
+        sha2(col("payload"), 1)
+    def test_sha2_rejects_unsupported_bit_length() -> None:
+        assert_raises[ValueError](_call_sha2_with_unsupported_length)
diff --git a/src/functions/hashing/sha224.incn b/src/functions/hashing/sha224.incn
new file mode 100644
index 0000000..4b209d1
--- /dev/null
+++ b/src/functions/hashing/sha224.incn
@@ -0,0 +1,51 @@
+"""
+SHA-224 hash helper.
+
+`sha224` hashes a string expression and returns its lowercase hexadecimal digest.
+"""
+
+from function_registry import (
+    FunctionClass,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    deterministic_spec,
+    extension_mapping,
+    v0_1,
+)
+from functions.registry import function_registry, registered_application
+from projection_builders import ColumnExpr
+from substrait.function_extensions import SHA224_FUNCTION_ANCHOR
+
+
+@function_registry.add("sha224", deterministic_spec(
+    FunctionClass.Scalar,
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionNullBehavior.DependsOnInputs,
+    extension_mapping("sha224", SHA224_FUNCTION_ANCHOR),
+))
+pub def sha224(expr: ColumnExpr) -> ColumnExpr:
+    """
+    Build a SHA-224 hexadecimal digest expression.
+
+    Examples:
+        payload_digest = sha224(col("payload"))
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+    """
+    return registered_application("sha224", [expr])
+
+
+module tests:
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_sha224_builds_registered_application() -> None:
+        expr = sha224(col("payload"))
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "sha224"
+        assert column_expr_argument_count(expr) == 1
diff --git a/src/functions/hashing/sha256.incn b/src/functions/hashing/sha256.incn
new file mode 100644
index 0000000..32d0963
--- /dev/null
+++ b/src/functions/hashing/sha256.incn
@@ -0,0 +1,51 @@
+"""
+SHA-256 hash helper.
+
+`sha256` hashes a string expression and returns its lowercase hexadecimal digest.
+"""
+
+from function_registry import (
+    FunctionClass,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    deterministic_spec,
+    extension_mapping,
+    v0_1,
+)
+from functions.registry import function_registry, registered_application
+from projection_builders import ColumnExpr
+from substrait.function_extensions import SHA256_FUNCTION_ANCHOR
+
+
+@function_registry.add("sha256", deterministic_spec(
+    FunctionClass.Scalar,
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionNullBehavior.DependsOnInputs,
+    extension_mapping("sha256", SHA256_FUNCTION_ANCHOR),
+))
+pub def sha256(expr: ColumnExpr) -> ColumnExpr:
+    """
+    Build a SHA-256 hexadecimal digest expression.
+
+    Examples:
+        payload_digest = sha256(col("payload"))
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+    """
+    return registered_application("sha256", [expr])
+
+
+module tests:
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_sha256_builds_registered_application() -> None:
+        expr = sha256(col("payload"))
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "sha256"
+        assert column_expr_argument_count(expr) == 1
diff --git a/src/functions/hashing/sha384.incn b/src/functions/hashing/sha384.incn
new file mode 100644
index 0000000..c7afab1
--- /dev/null
+++ b/src/functions/hashing/sha384.incn
@@ -0,0 +1,51 @@
+"""
+SHA-384 hash helper.
+
+`sha384` hashes a string expression and returns its lowercase hexadecimal digest.
+"""
+
+from function_registry import (
+    FunctionClass,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    deterministic_spec,
+    extension_mapping,
+    v0_1,
+)
+from functions.registry import function_registry, registered_application
+from projection_builders import ColumnExpr
+from substrait.function_extensions import SHA384_FUNCTION_ANCHOR
+
+
+@function_registry.add("sha384", deterministic_spec(
+    FunctionClass.Scalar,
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionNullBehavior.DependsOnInputs,
+    extension_mapping("sha384", SHA384_FUNCTION_ANCHOR),
+))
+pub def sha384(expr: ColumnExpr) -> ColumnExpr:
+    """
+    Build a SHA-384 hexadecimal digest expression.
+
+    Examples:
+        payload_digest = sha384(col("payload"))
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+    """
+    return registered_application("sha384", [expr])
+
+
+module tests:
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_sha384_builds_registered_application() -> None:
+        expr = sha384(col("payload"))
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "sha384"
+        assert column_expr_argument_count(expr) == 1
diff --git a/src/functions/hashing/sha512.incn b/src/functions/hashing/sha512.incn
new file mode 100644
index 0000000..193fe54
--- /dev/null
+++ b/src/functions/hashing/sha512.incn
@@ -0,0 +1,51 @@
+"""
+SHA-512 hash helper.
+
+`sha512` hashes a string expression and returns its lowercase hexadecimal digest.
+"""
+
+from function_registry import (
+    FunctionClass,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    deterministic_spec,
+    extension_mapping,
+    v0_1,
+)
+from functions.registry import function_registry, registered_application
+from projection_builders import ColumnExpr
+from substrait.function_extensions import SHA512_FUNCTION_ANCHOR
+
+
+@function_registry.add("sha512", deterministic_spec(
+    FunctionClass.Scalar,
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionNullBehavior.DependsOnInputs,
+    extension_mapping("sha512", SHA512_FUNCTION_ANCHOR),
+))
+pub def sha512(expr: ColumnExpr) -> ColumnExpr:
+    """
+    Build a SHA-512 hexadecimal digest expression.
+
+    Examples:
+        payload_digest = sha512(col("payload"))
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+    """
+    return registered_application("sha512", [expr])
+
+
+module tests:
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_sha512_builds_registered_application() -> None:
+        expr = sha512(col("payload"))
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "sha512"
+        assert column_expr_argument_count(expr) == 1
diff --git a/src/functions/mod.incn b/src/functions/mod.incn
index 9292041..deb93a6 100644
--- a/src/functions/mod.incn
+++ b/src/functions/mod.incn
@@ -81,6 +81,12 @@ pub from functions.windows.first_value import first_value
 pub from functions.windows.last_value import last_value
 pub from functions.windows.nth_value import nth_value
 pub from functions.windows.bounds import current_row, following, preceding, unbounded_following, unbounded_preceding
+pub from functions.hashing.md5 import md5
+pub from functions.hashing.sha2 import sha2
+pub from functions.hashing.sha224 import sha224
+pub from functions.hashing.sha256 import sha256
+pub from functions.hashing.sha384 import sha384
+pub from functions.hashing.sha512 import sha512
 pub from functions.operators.add import add
 pub from functions.operators.and_ import and_
 pub from functions.operators.div import div
diff --git a/src/lib.incn b/src/lib.incn
index 08cddb9..6aa2798 100644
--- a/src/lib.incn
+++ b/src/lib.incn
@@ -133,6 +133,12 @@ pub from functions.windows.first_value import first_value
 pub from functions.windows.last_value import last_value
 pub from functions.windows.nth_value import nth_value
 pub from functions.windows.bounds import current_row, following, preceding, unbounded_following, unbounded_preceding
+pub from functions.hashing.md5 import md5
+pub from functions.hashing.sha2 import sha2
+pub from functions.hashing.sha224 import sha224
+pub from functions.hashing.sha256 import sha256
+pub from functions.hashing.sha384 import sha384
+pub from functions.hashing.sha512 import sha512
 pub from functions.operators.add import add
 pub from functions.operators.and_ import and_
 pub from functions.operators.div import div
diff --git a/src/substrait/function_extensions.incn b/src/substrait/function_extensions.incn
index d0c0413..b34b3c1 100644
--- a/src/substrait/function_extensions.incn
+++ b/src/substrait/function_extensions.incn
@@ -88,6 +88,11 @@ pub const LEAD_FUNCTION_ANCHOR: u32 = 60
 pub const FIRST_VALUE_FUNCTION_ANCHOR: u32 = 61
 pub const LAST_VALUE_FUNCTION_ANCHOR: u32 = 62
 pub const NTH_VALUE_FUNCTION_ANCHOR: u32 = 63
+pub const MD5_FUNCTION_ANCHOR: u32 = 64
+pub const SHA224_FUNCTION_ANCHOR: u32 = 65
+pub const SHA256_FUNCTION_ANCHOR: u32 = 66
+pub const SHA384_FUNCTION_ANCHOR: u32 = 67
+pub const SHA512_FUNCTION_ANCHOR: u32 = 68
 pub const FUNCTION_EXTENSION_URI: str = "https://inql.io/extensions/v0.1/functions.yaml"
 pub const EXPLODE_EXTENSION_URI: str = "https://inql.io/extensions/v0.1/unnest.yaml#explode"
 pub const EXPLODE_OUTER_EXTENSION_URI: str = "https://inql.io/extensions/v0.1/unnest.yaml#explode_outer"
diff --git a/tests/test_function_registry.incn b/tests/test_function_registry.incn
index 0b06361..77baaad 100644
--- a/tests/test_function_registry.incn
+++ b/tests/test_function_registry.incn
@@ -86,6 +86,7 @@ from functions import (
     map_keys,
     map_values,
     max,
+    md5,
     min,
     modulo,
     mul,
@@ -105,6 +106,11 @@ from functions import (
     rank,
     round,
     row_number,
+    sha2,
+    sha224,
+    sha256,
+    sha384,
+    sha512,
     str_expr,
     str_lit,
     sub,
@@ -184,6 +190,7 @@ from substrait.function_extensions import (
     MAP_KEYS_FUNCTION_ANCHOR,
     MAP_VALUES_FUNCTION_ANCHOR,
     MAX_FUNCTION_ANCHOR,
+    MD5_FUNCTION_ANCHOR,
     MIN_FUNCTION_ANCHOR,
     MODULUS_FUNCTION_ANCHOR,
     MULTIPLY_FUNCTION_ANCHOR,
@@ -200,6 +207,10 @@ from substrait.function_extensions import (
     ROW_NUMBER_FUNCTION_ANCHOR,
     ROUND_FUNCTION_ANCHOR,
     STACK_EXTENSION_URI,
+    SHA224_FUNCTION_ANCHOR,
+    SHA256_FUNCTION_ANCHOR,
+    SHA384_FUNCTION_ANCHOR,
+    SHA512_FUNCTION_ANCHOR,
     SUBTRACT_FUNCTION_ANCHOR,
     SUM_FUNCTION_ANCHOR,
     EXPLODE_EXTENSION_URI,
@@ -269,12 +280,12 @@ def _local_entry_by_namespace_and_name_or_fail(
 
 def _expected_registry_names() -> list[str]:
     """Return the expected registered public helper names."""
-    return ["col", "lit", "sum", "count", "count_expr", "count_distinct", "count_if", "avg", "min", "max", "int_expr", "float_expr", "str_expr", "bool_expr", "add", "mul", "int_lit", "str_lit", "bool_lit", "always_true", "always_false", "eq", "gt", "cast", "try_cast", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "is_not_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "case_when", "in_", "between", "asc", "desc", "asc_nulls_first", "asc_nulls_last", "desc_nulls_first", "desc_nulls_last", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "range", "array_reverse", "array_slice", "array_sort", "array_union", "arrays_overlap", "cardinality", "element_at", "map_contains_key", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "explode", "explode_outer", "posexplode", "posexplode_outer", "inline", "inline_outer", "flatten", "stack", "window", "unbounded_preceding", "preceding", "current_row", "following", "unbounded_following", "row_number", "rank", "dense_rank", "percent_rank", "cume_dist", "ntile", "lag", "lead", "first_value", "last_value", "nth_value"]
+    return ["col", "lit", "sum", "count", "count_expr", "count_distinct", "count_if", "avg", "min", "max", "int_expr", "float_expr", "str_expr", "bool_expr", "add", "mul", "int_lit", "str_lit", "bool_lit", "always_true", "always_false", "eq", "gt", "cast", "try_cast", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "is_not_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "case_when", "in_", "between", "asc", "desc", "asc_nulls_first", "asc_nulls_last", "desc_nulls_first", "desc_nulls_last", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "range", "array_reverse", "array_slice", "array_sort", "array_union", "arrays_overlap", "cardinality", "element_at", "map_contains_key", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "explode", "explode_outer", "posexplode", "posexplode_outer", "inline", "inline_outer", "flatten", "stack", "window", "unbounded_preceding", "preceding", "current_row", "following", "unbounded_following", "row_number", "rank", "dense_rank", "percent_rank", "cume_dist", "ntile", "lag", "lead", "first_value", "last_value", "nth_value", "sha224", "sha256", "sha384", "sha512", "sha2", "md5"]
 
 
 def _expected_substrait_mapped_names() -> list[str]:
     """Return helpers with concrete Substrait extension-function mappings."""
-    return ["sum", "count", "avg", "min", "max", "add", "mul", "eq", "gt", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "between", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "range", "array_reverse", "array_slice", "array_sort", "array_union", "arrays_overlap", "cardinality", "element_at", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "row_number", "rank", "dense_rank", "percent_rank", "cume_dist", "ntile", "lag", "lead", "first_value", "last_value", "nth_value"]
+    return ["sum", "count", "avg", "min", "max", "add", "mul", "eq", "gt", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "between", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "range", "array_reverse", "array_slice", "array_sort", "array_union", "arrays_overlap", "cardinality", "element_at", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "row_number", "rank", "dense_rank", "percent_rank", "cume_dist", "ntile", "lag", "lead", "first_value", "last_value", "nth_value", "sha224", "sha256", "sha384", "sha512", "md5"]
 
 
 def _exercise_current_public_helpers() -> None:
@@ -384,6 +395,12 @@ def _exercise_current_public_helpers() -> None:
     first_value(amount)
     last_value(amount)
     nth_value(amount, 2)
+    sha224(status)
+    sha256(status)
+    sha384(status)
+    sha512(status)
+    sha2(status, 256)
+    md5(status)
     return
 
 
@@ -522,7 +539,7 @@ def test_function_registry__core_helpers_expose_portable_policy_metadata() -> No
     # -- Act / Assert --
     for entry in entries:
         assert entry.namespace == CORE_FUNCTION_NAMESPACE, f"{entry.function_ref} should live in the core function namespace"
-        if entry.canonical_name == "count_expr" or entry.canonical_name == "count_distinct" or entry.canonical_name == "count_if":
+        if entry.canonical_name == "count_expr" or entry.canonical_name == "count_distinct" or entry.canonical_name == "count_if" or entry.canonical_name == "sha2":
             assert entry.policy_category == FunctionPolicyCategory.CompatibilityAlias, f"{entry.canonical_name} should be marked as a compatibility helper"
             assert entry.alias_policy == FunctionAliasPolicy.OptInCompatibility, "compatibility helpers should be opt-in by policy"
             continue
@@ -703,6 +720,11 @@ def test_function_registry__substrait_extension_mappings_are_structured() -> Non
     _assert_extension_mapping("map_keys", "map_keys", MAP_KEYS_FUNCTION_ANCHOR)
     _assert_extension_mapping("map_values", "map_values", MAP_VALUES_FUNCTION_ANCHOR)
     _assert_extension_mapping("named_struct", "named_struct", NAMED_STRUCT_FUNCTION_ANCHOR)
+    _assert_extension_mapping("sha224", "sha224", SHA224_FUNCTION_ANCHOR)
+    _assert_extension_mapping("sha256", "sha256", SHA256_FUNCTION_ANCHOR)
+    _assert_extension_mapping("sha384", "sha384", SHA384_FUNCTION_ANCHOR)
+    _assert_extension_mapping("sha512", "sha512", SHA512_FUNCTION_ANCHOR)
+    _assert_extension_mapping("md5", "md5", MD5_FUNCTION_ANCHOR)
 
 
 def test_function_registry__generator_helpers_are_relation_extensions() -> None:
@@ -799,6 +821,10 @@ def test_function_registry__rewrite_mappings_identify_non_extension_helpers() ->
     assert always_false_entry.substrait.kind == SubstraitMappingKind.Rewrite, "always_false should lower as a literal rewrite"
     _assert_rewrite_mapping("is_not_nan", "not_(is_nan(expr))")
     _assert_rewrite_mapping("map_contains_key", "gt(cardinality(map_extract(map_expr, key)), int_expr(0))")
+    _assert_rewrite_mapping(
+        "sha2",
+        "sha2(expr, bits) -> sha224/sha256/sha384/sha512(expr) for supported literal bit lengths",
+    )
     assert always_true_entry.null_behavior == FunctionNullBehavior.Predicate, "predicate helpers should expose predicate null behavior"
     assert always_false_entry.null_behavior == FunctionNullBehavior.Predicate, "predicate helpers
should expose predicate null behavior" @@ -832,6 +858,7 @@ def test_function_registry__public_helpers_preserve_existing_behavior() -> None: status, [str_lit("paid"), str_lit("open")], ), lt(amount, int_lit(10)), lte(amount, int_lit(10)), modulo(amount, lit(2)), round(amount)] + hash_exprs = [md5(status), sha2(status, 256), sha224(status), sha256(status), sha384(status), sha512(status)] # -- Assert -- assert column_expr_kind(amount) == ColumnExprKind.Column, "col should still build a column reference" @@ -856,5 +883,7 @@ def test_function_registry__public_helpers_preserve_existing_behavior() -> None: assert column_expr_kind(gt_expr) == ColumnExprKind.ScalarFunction, "gt should use the shared scalar function kind" for core_expr in core_exprs: assert column_expr_kind(core_expr) != ColumnExprKind.Column, "core scalar helpers should build scalar expressions" + for hash_expr in hash_exprs: + assert column_expr_kind(hash_expr) == ColumnExprKind.ScalarFunction, "hash helpers should build scalar expressions" assert column_expr_kind(always_true()) == ColumnExprKind.BoolLiteral, "always_true should still build a bool literal" assert column_expr_kind(always_false()) == ColumnExprKind.BoolLiteral, "always_false should still build a bool literal" diff --git a/tests/test_hashing_functions.incn b/tests/test_hashing_functions.incn new file mode 100644 index 0000000..2cfc4c5 --- /dev/null +++ b/tests/test_hashing_functions.incn @@ -0,0 +1,55 @@ +"""Test: RFC 022 hashing helper surface.""" + +from std.testing import assert_raises +from functions import col, md5, sha2, sha224, sha256, sha384, sha512 +from function_registry import function_ref_for +from projection_builders import ( + ColumnExpr, + ColumnExprKind, + column_expr_argument_count, + column_expr_function_name, + column_expr_function_ref, + column_expr_kind, +) + + +def _assert_hash_application(expr: ColumnExpr, expected_name: str) -> None: + """Assert one hashing helper builds a registry-backed scalar application.""" + 
assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction, f"{expected_name} should use scalar application nodes" + assert column_expr_function_name(expr) == expected_name, f"{expected_name} should preserve its canonical name" + assert column_expr_function_ref(expr) == function_ref_for(expected_name), f"{expected_name} should preserve its function ref" + assert column_expr_argument_count(expr) == 1, f"{expected_name} should carry one string input expression" + + +def _call_sha2_with_unsupported_length() -> None: + """Call sha2 with an unsupported digest length for ValueError assertions.""" + sha2(col("payload"), 1) + return + + +def test_hashing_functions__concrete_helpers_share_scalar_application_node() -> None: + # -- Arrange -- + payload = col("payload") + + # -- Act / Assert -- + _assert_hash_application(md5(payload), "md5") + _assert_hash_application(sha224(payload), "sha224") + _assert_hash_application(sha256(payload), "sha256") + _assert_hash_application(sha384(payload), "sha384") + _assert_hash_application(sha512(payload), "sha512") + + +def test_hashing_functions__sha2_rewrites_to_concrete_sha2_helpers() -> None: + # -- Arrange -- + payload = col("payload") + + # -- Act / Assert -- + _assert_hash_application(sha2(payload, 224), "sha224") + _assert_hash_application(sha2(payload, 256), "sha256") + _assert_hash_application(sha2(payload, 384), "sha384") + _assert_hash_application(sha2(payload, 512), "sha512") + + +def test_hashing_functions__sha2_rejects_unsupported_digest_lengths() -> None: + # -- Arrange / Act / Assert -- + assert_raises[ValueError](_call_sha2_with_unsupported_length) diff --git a/tests/test_session_projection.incn b/tests/test_session_projection.incn index ee91282..bf03109 100644 --- a/tests/test_session_projection.incn +++ b/tests/test_session_projection.incn @@ -18,11 +18,16 @@ from functions import ( floor, gt, lit, + md5, modulo, mul, neg, nullif, round, + sha2, + sha224, + sha384, + sha512, sub, try_cast, cardinality, @@ -208,6 
+213,40 @@ def test_session_projection__collect_executes_common_math_scalar_projection_func assert payload.contains("3"), "round projection should include round(10 / 4.0)" +def test_session_projection__collect_executes_format_hashing_projection_functions() -> None: + """collect should execute the first RFC 022 hashing helpers through DataFusion.""" + # -- Arrange -- + mut session = Session.default() + + # -- Act -- + lazy: LazyFrame[AggregateOrder] = assert_is_ok( + session.read_csv("aggregate_orders", AGGREGATE_ORDERS_CSV_FIXTURE), + "aggregate orders fixture should load", + ) + projected = lazy.with_column("md5_abc", md5(lit("abc"))).with_column("sha224_abc", sha224(lit("abc"))).with_column( + "sha2_256_abc", + sha2(lit("abc"), 256), + ).with_column("sha384_abc", sha384(lit("abc"))).with_column("sha512_abc", sha512(lit("abc"))) + df = _collect_or_fail(session, projected) + payload = df.preview_text() + resolved = df.resolved_columns() + + # -- Assert -- + assert df.row_count() == 3, "hashing projections should preserve the input rows" + assert len(resolved) == 7, "projection should expose all appended hash outputs" + assert payload.contains("md5_abc"), "md5 projection should materialize its alias" + assert payload.contains("sha2_256_abc"), "sha2 compatibility projection should materialize its alias" + assert payload.contains("900150983cd24fb0d6963f7d28e17f72"), "md5 should return the lowercase hex digest for abc" + assert payload.contains("23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7"), "sha224 should return the lowercase hex digest for abc" + assert payload.contains("ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"), "sha2(..., 256) should rewrite to sha256" + assert payload.contains( + "cb00753f45a35e8bb5a03d699ac65007272c32ab0eded1631a8b605a43ff5bed8086072ba1e7cc2358baeca134c825a7", + ), "sha384 should return the lowercase hex digest for abc" + assert payload.contains( + 
"ddaf35a193617abacc417349ae20413112e6fa4e89a97ea20a9eeee64b55d39a2192992a274fc1a836ba3c23a3feebbd454d4423643ce80e2a9ac94fa54ca49f", + ), "sha512 should return the lowercase hex digest for abc" + + def test_session_projection__collect_executes_nested_scalar_projection_functions() -> None: """collect should execute RFC 020 nested scalar helpers through DataFusion.""" # -- Arrange -- diff --git a/tests/test_substrait_plan.incn b/tests/test_substrait_plan.incn index ff97d78..1ed878e 100644 --- a/tests/test_substrait_plan.incn +++ b/tests/test_substrait_plan.incn @@ -59,6 +59,7 @@ from functions import ( map_keys, map_values, max, + md5, min, modulo, mul, @@ -74,6 +75,11 @@ from functions import ( rank, round, row_number, + sha2, + sha224, + sha256, + sha384, + sha512, sub, sum, try_cast, @@ -454,6 +460,12 @@ def test_plan__core_scalar_extension_mappings_lower_to_substrait() -> None: _assert_scalar_expr_lowers(ceil(div(col("amount"), lit(4.0)))) _assert_scalar_expr_lowers(floor(div(col("amount"), lit(4.0)))) _assert_scalar_expr_lowers(round(div(col("amount"), lit(4.0)))) + _assert_scalar_expr_lowers(md5(col("status"))) + _assert_scalar_expr_lowers(sha224(col("status"))) + _assert_scalar_expr_lowers(sha256(col("status"))) + _assert_scalar_expr_lowers(sha384(col("status"))) + _assert_scalar_expr_lowers(sha512(col("status"))) + _assert_scalar_expr_lowers(sha2(col("status"), 256)) def test_plan__nested_scalar_extension_mappings_lower_to_substrait() -> None: From c7864432e917ea6261f348441d6999566d77ec6c Mon Sep 17 00:00:00 2001 From: Danny Meijer Date: Thu, 28 May 2026 09:10:01 +0200 Subject: [PATCH 02/13] feature - 39 complete RFC 022 format functions --- docs/language/reference/functions/format.md | 37 +- docs/language/reference/functions/index.md | 2 +- docs/release_notes/v0_1.md | 2 +- .../022_semi_structured_format_functions.md | 50 +- incan.lock | 67 +- incan.toml | 1 + rust/inql-datafusion-format/Cargo.toml | 15 + rust/inql-datafusion-format/src/lib.rs | 649 
++++++++++++++++++ src/functions/csv/from_csv.incn | 35 + src/functions/csv/mod.incn | 5 + src/functions/csv/schema_of_csv.incn | 34 + src/functions/csv/to_csv.incn | 32 + src/functions/hashing/crc32.incn | 32 + src/functions/hashing/md5.incn | 12 +- src/functions/hashing/sha1.incn | 32 + src/functions/hashing/sha2.incn | 22 +- src/functions/hashing/sha224.incn | 12 +- src/functions/hashing/sha256.incn | 12 +- src/functions/hashing/sha384.incn | 12 +- src/functions/hashing/sha512.incn | 12 +- src/functions/hashing/xxhash64.incn | 32 + src/functions/json/check_json.incn | 32 + src/functions/json/common.incn | 13 + src/functions/json/from_json.incn | 35 + src/functions/json/get_json_object.incn | 36 + src/functions/json/json_array_length.incn | 34 + .../json/json_extract_path_text.incn | 36 + src/functions/json/json_object_keys.incn | 34 + src/functions/json/mod.incn | 12 + src/functions/json/parse_json.incn | 34 + src/functions/json/schema_of_json.incn | 34 + src/functions/json/to_json.incn | 32 + src/functions/json/try_from_json.incn | 33 + src/functions/mod.incn | 20 + src/functions/urls/mod.incn | 6 + src/functions/urls/parse_url.incn | 40 ++ src/functions/urls/try_url_decode.incn | 32 + src/functions/urls/url_decode.incn | 34 + src/functions/urls/url_encode.incn | 32 + src/lib.incn | 20 + src/session/datafusion_backend.incn | 5 + src/substrait/function_extensions.incn | 20 + tests/test_format_functions.incn | 29 + tests/test_function_registry.incn | 99 ++- tests/test_hashing_functions.incn | 5 +- tests/test_session_projection.incn | 101 ++- tests/test_substrait_plan.incn | 40 ++ 47 files changed, 1852 insertions(+), 103 deletions(-) create mode 100644 rust/inql-datafusion-format/Cargo.toml create mode 100644 rust/inql-datafusion-format/src/lib.rs create mode 100644 src/functions/csv/from_csv.incn create mode 100644 src/functions/csv/mod.incn create mode 100644 src/functions/csv/schema_of_csv.incn create mode 100644 src/functions/csv/to_csv.incn create mode 
100644 src/functions/hashing/crc32.incn create mode 100644 src/functions/hashing/sha1.incn create mode 100644 src/functions/hashing/xxhash64.incn create mode 100644 src/functions/json/check_json.incn create mode 100644 src/functions/json/common.incn create mode 100644 src/functions/json/from_json.incn create mode 100644 src/functions/json/get_json_object.incn create mode 100644 src/functions/json/json_array_length.incn create mode 100644 src/functions/json/json_extract_path_text.incn create mode 100644 src/functions/json/json_object_keys.incn create mode 100644 src/functions/json/mod.incn create mode 100644 src/functions/json/parse_json.incn create mode 100644 src/functions/json/schema_of_json.incn create mode 100644 src/functions/json/to_json.incn create mode 100644 src/functions/json/try_from_json.incn create mode 100644 src/functions/urls/mod.incn create mode 100644 src/functions/urls/parse_url.incn create mode 100644 src/functions/urls/try_url_decode.incn create mode 100644 src/functions/urls/url_decode.incn create mode 100644 src/functions/urls/url_encode.incn create mode 100644 tests/test_format_functions.incn diff --git a/docs/language/reference/functions/format.md b/docs/language/reference/functions/format.md index 568e349..05572d4 100644 --- a/docs/language/reference/functions/format.md +++ b/docs/language/reference/functions/format.md @@ -3,29 +3,56 @@ Format helpers operate on scalar payloads that are already present in a relation. They do not read files, infer source schemas from external locations, or change relation cardinality. -The current implemented slice is deterministic string hashing: +The implemented RFC 022 surface covers deterministic hashes, URL helpers, JSON scalar payload helpers, and CSV scalar +payload helpers: | Function | Meaning | | --- | --- | | `md5(expr)` | Return the lowercase hexadecimal MD5 digest for one string expression. | +| `sha1(expr)` | Return the lowercase hexadecimal SHA-1 digest for one string expression. 
| | `sha224(expr)` | Return the lowercase hexadecimal SHA-224 digest for one string expression. | | `sha256(expr)` | Return the lowercase hexadecimal SHA-256 digest for one string expression. | | `sha384(expr)` | Return the lowercase hexadecimal SHA-384 digest for one string expression. | | `sha512(expr)` | Return the lowercase hexadecimal SHA-512 digest for one string expression. | | `sha2(expr, bit_length)` | Compatibility helper that rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths. | +| `crc32(expr)` | Return the lowercase eight-character hexadecimal CRC-32 digest for one string expression. | +| `xxhash64(expr)` | Return the lowercase sixteen-character hexadecimal xxHash64 digest for one string expression. | +| `parse_url(expr, part)` | Extract `scheme`, `host`, `path`, `query`, `fragment`, `port`, `username`, or `password` from a URL string. | +| `url_encode(expr)` | Percent-encode one URL component string. | +| `url_decode(expr)` | Strictly decode one percent-encoded URL component string and fail on malformed escapes. | +| `try_url_decode(expr)` | Decode one percent-encoded URL component string, returning null on malformed escapes. | +| `parse_json(expr)` | Strictly validate and normalize one JSON payload string. | +| `check_json(expr)` | Return whether one string expression contains valid JSON. | +| `schema_of_json(expr)` | Infer a deterministic schema description from one JSON payload string. | +| `json_array_length(expr)` | Return the number of array elements for a JSON array payload, or null for non-array payloads. | +| `json_object_keys(expr)` | Return object keys from a JSON object payload as a JSON array string. | +| `get_json_object(expr, path)` | Extract one JSON value at a literal path and return it as JSON text. | +| `json_extract_path_text(expr, path)` | Extract one JSON value at a literal path and return scalar strings as plain text. 
| +| `from_json(expr, schema)` | Validate JSON with an explicit schema description and return a normalized JSON payload string. | +| `try_from_json(expr, schema)` | Validate JSON with an explicit schema description and return null when the payload is invalid. | +| `to_json(expr)` | Serialize one scalar expression as JSON text. | +| `schema_of_csv(expr)` | Infer a deterministic schema description from one CSV row string. | +| `from_csv(expr, schema)` | Parse one CSV row string into a JSON payload string, using schema field names when provided. | +| `to_csv(expr)` | Serialize one scalar or JSON array/object payload as one CSV row string. | ```incan -from pub::inql.functions import col, md5, sha2 +from pub::inql.functions import col, from_json, get_json_object, parse_url, sha2, to_json projected = ( events .with_column("user_hash", sha2(col("user_id"), 256)) - .with_column("payload_md5", md5(col("payload"))) + .with_column("host", parse_url(col("landing_page"), "host")) + .with_column("event_type", get_json_object(col("payload"), "$.type")) + .with_column("payload_obj", from_json(col("payload"), "STRUCT")) + .with_column("payload_out", to_json(col("event_type"))) ) ``` Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal strings. `sha2(...)` accepts `224`, `256`, `384`, and `512`; unsupported digest lengths are rejected by the helper rather than being passed through to a backend. -JSON, CSV, URL, and dynamic-value predicate helpers remain future format-function slices until their schema arguments, -option records, path validation rules, and dynamic value model are specified. +JSON and CSV parser helpers are scalar payload helpers in the current InQL value model: they validate, normalize, and +project payload text. They do not read external files and do not introduce a dynamic variant value type. 
Names for +variant-style predicates such as `typeof(...)`, `is_array(...)`, `is_object(...)`, `is_integer(...)`, `is_timestamp(...)`, +and `is_null_value(...)` are reserved by RFC 022 and remain intentionally unavailable until InQL defines the variant value +model those predicates would inspect. diff --git a/docs/language/reference/functions/index.md b/docs/language/reference/functions/index.md index e34af0d..5b0615b 100644 --- a/docs/language/reference/functions/index.md +++ b/docs/language/reference/functions/index.md @@ -38,7 +38,7 @@ The registered helper surface currently includes: | `array(...)`, `cardinality(...)`, `array_contains(...)`, `arrays_overlap(...)`, `array_position(...)`, `array_range(...)`, `element_at(...)`, `array_sort(...)`, `array_distinct(...)`, `array_except(...)`, `array_intersect(...)`, `array_union(...)`, `array_join(...)`, `array_slice(...)`, `array_reverse(...)`, `array_flatten(...)`, `map_from_arrays(...)`, `map_extract(...)`, `map_contains_key(...)`, `map_keys(...)`, `map_values(...)`, `map_entries(...)`, `named_struct(...)` | scalar | registered nested scalar helpers backed by Substrait extension mappings; `array_range(...)` registers canonical `range` for positional generator lowering and `map_contains_key(...)` lowers as a documented predicate rewrite | | `explode(...)`, `explode_outer(...)`, `posexplode(...)`, `posexplode_outer(...)`, `inline(...)`, `inline_outer(...)`, `flatten(...)`, `stack(...)` | generator | relation-extension mappings consumed by `generate(...)`; positional forms use zero-based positions | | `window()`, `unbounded_preceding()`, `preceding(...)`, `current_row()`, `following(...)`, `unbounded_following()`, `row_number()`, `rank()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile(...)`, `lag(...)`, `lead(...)`, `first_value(...)`, `last_value(...)`, `nth_value(...)` | window | `window()` and bound helpers build structural window-spec metadata; placed ranking, distribution, offset, value, and 
aggregate-over-window helpers lower through `ConsistentPartitionWindowRel` and execute through the DataFusion session adapter | -| `md5(...)`, `sha224(...)`, `sha256(...)`, `sha384(...)`, `sha512(...)`, `sha2(...)` | scalar | registered format/hash helpers; concrete helpers lower through Substrait extension mappings, while `sha2(...)` rewrites to a supported concrete SHA-2 helper | +| `md5(...)`, `sha1(...)`, `sha224(...)`, `sha256(...)`, `sha384(...)`, `sha512(...)`, `sha2(...)`, `crc32(...)`, `xxhash64(...)`, JSON helpers, CSV helpers, URL helpers | scalar | registered RFC 022 format helpers; concrete helpers lower through Substrait extension mappings, while `sha2(...)` rewrites to a supported concrete SHA-2 helper | | `asc(...)`, `desc(...)`, `asc_nulls_first(...)`, `asc_nulls_last(...)`, `desc_nulls_first(...)`, `desc_nulls_last(...)` | ordering | structural sort-field helpers consumed by `order_by(...)` and lowered to Substrait `SortRel.sorts` | | `sum(...)`, `count(...)`, `count_expr(...)`, `count_distinct(...)`, `count_if(...)`, `avg(...)`, `min(...)`, `max(...)` | aggregate | registered Substrait extension functions for core aggregates plus compatibility rewrites for `count_expr(...)`, `count_distinct(...)`, and `count_if(...)`; core aggregates allow `DISTINCT` and aggregate-local `FILTER` where the aggregate shape is valid | diff --git a/docs/release_notes/v0_1.md b/docs/release_notes/v0_1.md index 355e9e4..5dc9a10 100644 --- a/docs/release_notes/v0_1.md +++ b/docs/release_notes/v0_1.md @@ -17,7 +17,7 @@ Entries will be filled in as work lands (link RFCs and PRs when applicable). - **Nested data functions:** RFC 020 adds registry-backed scalar helpers for array construction/access, cardinality, containment, overlap, sorting, set-like operations, joining, slicing, reversing, scalar array flattening, map construction/access, map key/value/entry extraction, map key containment, and named struct construction. 
These helpers lower through Substrait extension metadata without introducing generator semantics, with representative DataFusion-backed Session coverage for composable array projection paths. - **Generator functions:** RFC 021 adds registry-backed generator applications for `explode(...)`, `explode_outer(...)`, `posexplode(...)`, `posexplode_outer(...)`, `inline(...)`, `inline_outer(...)`, portable `flatten(...)`, and `stack(...)`. Generators remain relation-shaping operations applied with `generate(...)`; they preserve input columns, require explicit output aliases, lower through the current Substrait extension-relation gap encoding, and execute through the DataFusion Session adapter with concrete output-column materialization. - **Window functions:** RFC 019 adds `window()` specs, explicit row/range frame bounds, ranking and distribution helpers (`row_number`, `rank`, `dense_rank`, `percent_rank`, `cume_dist`, `ntile`), offset and value helpers (`lag`, `lead`, `first_value`, `last_value`, `nth_value`), and aggregate-over-window placement through `with_window_column(...)`. Portable window helpers require explicit ordering where appropriate, lower through Substrait `ConsistentPartitionWindowRel`, and execute through the DataFusion session adapter. -- **Format functions:** RFC 022 adds the first deterministic hashing slice with `md5(...)`, `sha224(...)`, `sha256(...)`, `sha384(...)`, `sha512(...)`, and `sha2(...)`. Hash helpers operate on UTF-8 string bytes, return lowercase hexadecimal strings, lower through registry-owned Substrait metadata, and execute through the DataFusion-backed Session path. +- **Format functions:** RFC 022 adds scalar payload helpers for deterministic hashes (`md5`, `sha1`, `sha224`, `sha256`, `sha384`, `sha512`, `sha2`, `crc32`, and `xxhash64`), URL parsing/encoding/decoding, JSON validation/path/schema helpers, and CSV row/schema helpers. 
Format helpers lower through registry-owned Substrait metadata and execute through adapter-owned DataFusion UDF registration; dynamic variant predicates remain reserved until InQL defines the variant value model they inspect. - **Function registry:** RFC 014 adds declaration-site registry decorators for the current public helper surface, including stable function references, checked signature projection, lifecycle metadata, behavior categories, alias policy, Substrait mapping categories, and checked API metadata drift validation. - **Function extension policy:** InQL RFC 024 policy metadata now distinguishes portable core functions, namespaced extension-only functions, opt-in compatibility aliases, engine-specific functions, and rejected compatibility requests without adding an extension plugin system or backend-owned semantics. - **Projection:** builder-based `with_column`, `add`, `mul`, and literal expression helpers now lower derived columns through Prism, Substrait, and Session execution. diff --git a/docs/rfcs/022_semi_structured_format_functions.md b/docs/rfcs/022_semi_structured_format_functions.md index a07650d..88cda15 100644 --- a/docs/rfcs/022_semi_structured_format_functions.md +++ b/docs/rfcs/022_semi_structured_format_functions.md @@ -11,7 +11,7 @@ - InQL RFC 014 (function registry and catalog governance) - InQL RFC 020 (nested data functions) - **Issue:** [InQL #39](https://github.com/dannys-code-corner/InQL/issues/39) -- **RFC PR:** — +- **RFC PR:** [InQL #49](https://github.com/dannys-code-corner/InQL/pull/49) - **Written against:** Incan v0.3-era InQL - **Shipped in:** — @@ -71,7 +71,7 @@ InQL should define hash functions including `crc32`, `md5`, `sha1`, `sha2`, and Where InQL supports variant-like dynamic values, it should define type inspection and predicate functions such as `typeof`, `is_array`, `is_object`, `is_integer`, `is_timestamp`, and `is_null_value`. These functions must not be accepted before the value model they inspect is defined. 
-Format functions that return structured values must return typed arrays, maps, structs, or declared model-compatible values according to InQL's nested type rules. +This RFC does not introduce a dynamic variant value model. Parser helpers that pass semi-structured payloads through the current scalar expression model return validated, normalized payload text. A later variant/value-model RFC may add typed semi-structured return values, but it must not silently change the meaning of the RFC 022 string-backed payload helpers. Schema inference helper functions must be deterministic for the same input values and options. They must not inspect external files or session state unless explicitly defined as source-discovery functions outside this RFC. @@ -110,46 +110,40 @@ This RFC is additive. It should not change existing CSV ingestion behavior. ## Layers affected - **InQL specification** — format functions must stay distinct from source and sink contracts. -- **InQL library package** — public helpers should expose JSON, CSV scalar, URL, and hash functions with option typing. -- **Incan compiler** — typechecking must validate structured return types and schema/model arguments. -- **Execution / interchange** — Prism and Substrait lowering must preserve parser options, hash encodings, and structured return values or diagnose unsupported functions. +- **InQL library package** — public helpers should expose JSON, CSV scalar, URL, and hash functions with explicit scalar argument contracts. +- **Incan compiler** — typechecking must validate current scalar argument shapes; no new compiler syntax is required by this RFC. +- **Execution / interchange** — Prism and Substrait lowering must preserve parser options, hash encodings, and scalar payload contracts or diagnose unsupported functions. - **Documentation** — docs should distinguish scalar format functions from session read/write APIs. ## Design Decisions ### Resolved -- The first implementation slice is deterministic hashing. 
JSON, CSV, URL, dynamic-value predicates, and structured parser helpers remain future slices because their schema arguments, option records, path validation, and dynamic value model are not settled here. -- Hash helpers in this slice operate on UTF-8 string bytes and return lowercase hexadecimal strings. -- Portable concrete hash helpers are `md5`, `sha224`, `sha256`, `sha384`, and `sha512`, each with an honest Substrait extension mapping and DataFusion-backed execution coverage. +- Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal strings. +- Portable concrete hash helpers are `md5`, `sha1`, `sha224`, `sha256`, `sha384`, `sha512`, `crc32`, and `xxhash64`, each with an honest Substrait extension mapping and DataFusion-backed execution coverage. - `sha2(expr, bit_length)` is a compatibility helper, not a separate backend mapping. It rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths and rejects unsupported values. -- `sha1`, `crc32`, and `xxhash64` are not implemented in the first slice because no honest Substrait/DataFusion mapping was validated for this branch. - -### Remaining - -- Should `from_json` accept model types directly as schema arguments, or only explicit schema values? -- Should invalid JSON path expressions be compile-time errors when literal and runtime errors otherwise? -- What option-record shape should CSV and JSON scalar parsers use? -- Should future binary-oriented hash helpers return binary values, lowercase hexadecimal strings, or an explicit typed encoding wrapper? -- Which variant-style type predicates are portable enough for InQL core, and which should stay in a Snowflake-compatibility extension? +- URL helpers accept scalar URL or component strings. `url_decode(...)` is strict and fails malformed percent escapes; `try_url_decode(...)` returns null for malformed percent escapes. +- JSON helpers accept scalar JSON payload strings. 
Strict helpers fail invalid JSON; `try_from_json(...)` returns null for invalid JSON; schema and path helpers are deterministic over the provided payload and literal schema/path arguments. +- CSV helpers accept scalar row strings and explicit schema-description strings. They operate on payload values only and do not replace the session CSV read/write contract. +- Dynamic-value predicate names such as `typeof`, `is_array`, `is_object`, `is_integer`, `is_timestamp`, and `is_null_value` are reserved by this RFC but intentionally not accepted until InQL defines a variant-like value model. This follows the reference rule above rather than creating fake string predicates. ## Implementation Plan 1. Add registry-backed hashing helpers under a logical function family. -2. Add stable Substrait extension anchors for concrete hash helpers. -3. Keep `sha2(...)` as a compatibility rewrite over concrete helpers rather than a second mapping. -4. Add focused helper, registry, Substrait lowering, and DataFusion session tests with concrete digest values. -5. Add user-facing format-function docs and release notes. -6. Leave parser, URL, and dynamic-value helpers for later RFC 022 slices once their remaining design questions are resolved. +2. Add registry-backed URL, JSON, and CSV scalar payload helpers under logical function families. +3. Add stable Substrait extension anchors for concrete helpers. +4. Keep `sha2(...)` as a compatibility rewrite over concrete helpers rather than a second mapping. +5. Add focused helper, registry, Substrait lowering, and DataFusion session tests with concrete output values. +6. Add user-facing format-function docs and release notes. +7. Reserve dynamic-value predicate names until a variant-like value model exists. ## Progress Checklist -- [x] RFC 022 moved to In Progress with a first implementation slice and recorded design decisions. -- [x] `md5`, `sha224`, `sha256`, `sha384`, `sha512`, and `sha2` helpers added under the function catalog. 
-- [x] Concrete hash helpers registered with Substrait extension metadata. +- [x] RFC 022 moved to In Progress with full scalar format helper design decisions recorded. +- [x] `md5`, `sha1`, `sha224`, `sha256`, `sha384`, `sha512`, `sha2`, `crc32`, and `xxhash64` helpers added under the function catalog. +- [x] JSON, CSV, and URL scalar payload helpers added under the function catalog. +- [x] Concrete helpers registered with Substrait extension metadata. - [x] `sha2(...)` implemented as a literal-bit-length rewrite with invalid-input diagnostics. - [x] Focused helper, registry, Substrait lowering, and DataFusion-backed session tests added. - [x] User-facing format-function docs and release notes added. -- [ ] JSON and CSV scalar parser helpers specified and implemented. -- [ ] URL helper semantics specified and implemented. -- [ ] Dynamic-value predicate semantics specified and implemented. +- [x] Dynamic-value predicate names reserved and rejected until a variant-like value model exists. 
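The strict versus `try_` split recorded in the URL design decision above can be sketched in a few lines of standalone Rust. This is an illustrative sketch that assumes only the standard library; the crate added later in this patch uses `percent-encoding` instead. The free functions `url_decode`, `try_url_decode`, and `hex_value` here are hypothetical stand-ins for the registered helpers, with `None` playing the role of SQL null.

```rust
/// Decode `%XX` escapes. Returns `None` for a malformed escape or for
/// decoded bytes that are not valid UTF-8 (the "null" outcome).
fn try_url_decode(input: &str) -> Option<String> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut idx = 0;
    while idx < bytes.len() {
        if bytes[idx] == b'%' {
            // Both hex digits must be present and valid, otherwise bail out.
            let hi = hex_value(*bytes.get(idx + 1)?)?;
            let lo = hex_value(*bytes.get(idx + 2)?)?;
            out.push(hi * 16 + lo);
            idx += 3;
        } else {
            out.push(bytes[idx]);
            idx += 1;
        }
    }
    String::from_utf8(out).ok()
}

/// Map one ASCII hex digit to its numeric value.
fn hex_value(byte: u8) -> Option<u8> {
    (byte as char).to_digit(16).map(|digit| digit as u8)
}

/// Strict variant: the same decoding, but malformed input is an error.
fn url_decode(input: &str) -> Result<String, String> {
    try_url_decode(input)
        .ok_or_else(|| format!("invalid percent-encoded URL component: {input}"))
}
```

Building the strict helper on top of the `try_` helper keeps a single definition of "malformed", so the two variants cannot disagree about which inputs are invalid.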
diff --git a/incan.lock b/incan.lock index 2a6d844..137c218 100644 --- a/incan.lock +++ b/incan.lock @@ -3,8 +3,8 @@ [incan] format = 1 -incan-version = "0.3.0-rc17" -deps-fingerprint = "sha256:424fd53b12f0e810ffe3ba4188a82203bb011b5dd0769ad2b33ab1b854027487" +incan-version = "0.3.0-rc21" +deps-fingerprint = "sha256:3610572303b9930ff93a467137924cf89c26e05467169061f3fb01f6175d8972" cargo-features = [] cargo-no-default-features = false cargo-all-features = false @@ -433,9 +433,9 @@ dependencies = [ [[package]] name = "brotli" -version = "8.0.2" +version = "8.0.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4bd8b9603c7aa97359dbd97ecf258968c95f3adddd6db2f7e7a5bef101c84560" +checksum = "8119e4516436f5708bbc474a9d395bf12f1b5395e93a92a56e647ac3388c8610" dependencies = [ "alloc-no-stdlib", "alloc-stdlib", @@ -444,9 +444,9 @@ dependencies = [ [[package]] name = "brotli-decompressor" -version = "5.0.0" +version = "5.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "874bb8112abecc98cbd6d81ea4fa7e94fb9449648c93cc89aa40c81c24d7de03" +checksum = "5962523e1b92ce1b5e793d9169b9943eece10d39f62550bc04bb605d75b94924" dependencies = [ "alloc-no-stdlib", "alloc-stdlib", @@ -1371,9 +1371,9 @@ dependencies = [ [[package]] name = "displaydoc" -version = "0.2.5" +version = "0.2.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "97369cbbc041bc366949bc74d34658d6cda5621039731c6310521892a3a20ae0" +checksum = "1ac70aa55017e108007fbaf5aa0f54b021c98f92ff8af59d42eda9da96e3dd4f" dependencies = [ "proc-macro2", "quote", @@ -1675,9 +1675,9 @@ checksum = "7f24254aa9a54b5c858eaee2f5bccdb46aaf0e486a595ed5fd8f86ba55232a70" [[package]] name = "http" -version = "1.4.0" +version = "1.4.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e3ba2a386d7f85a81f119ad7498ebe444d2e22c2af0b86b069416ace48b3311a" +checksum = "8be7462df143984c4598a256ef469b251d7d7f9e271135073e78fc535414f3d0" 
dependencies = [ "bytes", "itoa", @@ -1824,14 +1824,14 @@ dependencies = [ [[package]] name = "incan_core" -version = "0.3.0-rc17" +version = "0.3.0-rc21" dependencies = [ "serde", ] [[package]] name = "incan_derive" -version = "0.3.0-rc17" +version = "0.3.0-rc21" dependencies = [ "proc-macro2", "quote", @@ -1840,7 +1840,7 @@ dependencies = [ [[package]] name = "incan_stdlib" -version = "0.3.0-rc17" +version = "0.3.0-rc21" dependencies = [ "incan_core", "incan_derive", @@ -1862,7 +1862,7 @@ dependencies = [ [[package]] name = "inql" -version = "0.3.0-rc17" +version = "0.3.0-rc21" dependencies = [ "byteorder", "datafusion", @@ -1870,12 +1870,28 @@ dependencies = [ "encoding_rs", "incan_derive", "incan_stdlib", + "inql-datafusion-format", "prost", "prost-types", "rustix", "substrait 0.63.0", ] +[[package]] +name = "inql-datafusion-format" +version = "0.1.0" +dependencies = [ + "crc32fast", + "csv", + "datafusion", + "hex", + "percent-encoding", + "serde_json", + "sha1", + "url", + "xxhash-rust", +] + [[package]] name = "integer-encoding" version = "3.0.4" @@ -2068,9 +2084,9 @@ dependencies = [ [[package]] name = "memchr" -version = "2.8.0" +version = "2.8.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79" +checksum = "6b947ae49db0d222b1dbc6b113ce7248a3fc3a6ca21b696717bfc000ba4484d8" [[package]] name = "miniz_oxide" @@ -2731,6 +2747,17 @@ dependencies = [ "unsafe-libyaml", ] +[[package]] +name = "sha1" +version = "0.10.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e3bf829a2d51ab4a5ddf1352d8470c140cadc8301b2ae1789db023f01cedd6ba" +dependencies = [ + "cfg-if", + "cpufeatures 0.2.17", + "digest", +] + [[package]] name = "sha2" version = "0.10.9" @@ -3523,18 +3550,18 @@ dependencies = [ [[package]] name = "zerocopy" -version = "0.8.48" +version = "0.8.49" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"eed437bf9d6692032087e337407a86f04cd8d6a16a37199ed57949d415bd68e9" +checksum = "bce33a6288fa3f072a8c2c7d0f2fdbb90e28298f0135c1f99b96c3db2efcc60b" dependencies = [ "zerocopy-derive", ] [[package]] name = "zerocopy-derive" -version = "0.8.48" +version = "0.8.49" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "70e3cd084b1788766f53af483dd21f93881ff30d7320490ec3ef7526d203bad4" +checksum = "8fd425244944f4ab65ccff928e7323354c5a018c75838362fdce749dfad2ee1e" dependencies = [ "proc-macro2", "quote", diff --git a/incan.toml b/incan.toml index 30f2701..77d73bd 100644 --- a/incan.toml +++ b/incan.toml @@ -11,3 +11,4 @@ substrait = { version = "0.63", features = ["protoc"] } # Datafusion is InQL default query engine, and provides the Substrait to Datafusion translation. datafusion = "53" datafusion-substrait = { version = "53", features = ["protoc"] } +inql-datafusion-format = { path = "rust/inql-datafusion-format" } diff --git a/rust/inql-datafusion-format/Cargo.toml b/rust/inql-datafusion-format/Cargo.toml new file mode 100644 index 0000000..81ce0da --- /dev/null +++ b/rust/inql-datafusion-format/Cargo.toml @@ -0,0 +1,15 @@ +[package] +name = "inql-datafusion-format" +version = "0.1.0" +edition = "2024" + +[dependencies] +crc32fast = "1" +csv = "1" +datafusion = "53" +hex = "0.4" +percent-encoding = "2" +serde_json = "1" +sha1 = "0.10" +url = "2" +xxhash-rust = { version = "0.8", features = ["xxh64"] } diff --git a/rust/inql-datafusion-format/src/lib.rs b/rust/inql-datafusion-format/src/lib.rs new file mode 100644 index 0000000..7c79691 --- /dev/null +++ b/rust/inql-datafusion-format/src/lib.rs @@ -0,0 +1,649 @@ +use std::sync::Arc; + +use crc32fast::Hasher as Crc32Hasher; +use datafusion::arrow::array::{ + Array, ArrayRef, BooleanArray, Int64Array, LargeStringArray, StringArray, StringViewArray, +}; +use datafusion::arrow::datatypes::DataType; +use datafusion::common::{DataFusionError, Result, ScalarValue}; +use 
datafusion::execution::context::SessionContext; +use datafusion::logical_expr::{ColumnarValue, Volatility, create_udf}; +use percent_encoding::{AsciiSet, CONTROLS, percent_decode_str, utf8_percent_encode}; +use serde_json::{Map, Value}; +use sha1::{Digest, Sha1}; +use url::Url; +use xxhash_rust::xxh64::xxh64; + +const URL_COMPONENT_ENCODE_SET: &AsciiSet = &CONTROLS + .add(b' ') + .add(b'"') + .add(b'#') + .add(b'%') + .add(b'<') + .add(b'>') + .add(b'?') + .add(b'`') + .add(b'{') + .add(b'}'); + +pub fn register_format_udfs(ctx: SessionContext) { + for udf in [ + unary_string_udf("sha1", hash_sha1), + unary_string_udf("crc32", hash_crc32), + unary_string_udf("xxhash64", hash_xxhash64), + unary_string_udf("url_encode", encode_url_component), + unary_string_udf("url_decode", decode_url_component), + unary_string_udf("try_url_decode", try_decode_url_component), + binary_string_udf("parse_url", parse_url_part), + unary_string_udf("parse_json", parse_json), + unary_string_udf("check_json", check_json), + unary_string_udf("schema_of_json", schema_of_json), + unary_string_udf("json_array_length", json_array_length), + unary_string_udf("json_object_keys", json_object_keys), + binary_string_udf("get_json_object", get_json_object), + binary_string_udf("json_extract_path_text", json_extract_path_text), + binary_string_udf("from_json", from_json), + binary_string_udf("try_from_json", try_from_json), + unary_string_udf("to_json", to_json), + unary_string_udf("schema_of_csv", schema_of_csv), + binary_string_udf("from_csv", from_csv), + unary_string_udf("to_csv", to_csv), + ] { + ctx.register_udf(udf); + } +} + +fn unary_string_udf( + name: &'static str, + fun: fn(Option<&str>) -> Result<ScalarValue>, +) -> datafusion::logical_expr::ScalarUDF { + create_udf( + name, + vec![DataType::Utf8], + return_type_for(name), + Volatility::Immutable, + Arc::new(move |args| apply_unary_string(args, fun)), + ) +} + +fn binary_string_udf( + name: &'static str, + fun: fn(Option<&str>, Option<&str>) -> 
Result<ScalarValue>, +) -> datafusion::logical_expr::ScalarUDF { + create_udf( + name, + vec![DataType::Utf8, DataType::Utf8], + DataType::Utf8, + Volatility::Immutable, + Arc::new(move |args| apply_binary_string(args, fun)), + ) +} + +fn return_type_for(name: &str) -> DataType { + match name { + "check_json" => DataType::Boolean, + "json_array_length" => DataType::Int64, + _ => DataType::Utf8, + } +} + +fn apply_unary_string( + args: &[ColumnarValue], + fun: fn(Option<&str>) -> Result<ScalarValue>, +) -> Result<ColumnarValue> { + require_arity(args, 1)?; + if all_scalar(args) { + let value = scalar_string(args[0].clone())?; + return fun(value.as_deref()).map(ColumnarValue::Scalar); + } + + let len = row_count(args); + let mut string_values: Vec<Option<String>> = Vec::with_capacity(len); + let mut bool_values: Vec<Option<bool>> = Vec::with_capacity(len); + let mut int_values: Vec<Option<i64>> = Vec::with_capacity(len); + let mut result_kind: Option<&'static str> = None; + for idx in 0..len { + let value = string_at(&args[0], idx)?; + let scalar = fun(value.as_deref())?; + match scalar { + ScalarValue::Utf8(item) => { + result_kind = Some("string"); + string_values.push(item); + } + ScalarValue::Boolean(item) => { + result_kind = Some("bool"); + bool_values.push(item); + } + ScalarValue::Int64(item) => { + result_kind = Some("int"); + int_values.push(item); + } + _ => return internal_error("unexpected format UDF return type"), + } + } + array_result( + result_kind.unwrap_or("string"), + string_values, + bool_values, + int_values, + ) +} + +fn apply_binary_string( + args: &[ColumnarValue], + fun: fn(Option<&str>, Option<&str>) -> Result<ScalarValue>, +) -> Result<ColumnarValue> { + require_arity(args, 2)?; + if all_scalar(args) { + let left = scalar_string(args[0].clone())?; + let right = scalar_string(args[1].clone())?; + return fun(left.as_deref(), right.as_deref()).map(ColumnarValue::Scalar); + } + + let len = row_count(args); + let mut values: Vec<Option<String>> = Vec::with_capacity(len); + for idx in 0..len { + let left = string_at(&args[0], idx)?; + let right = string_at(&args[1], 
idx)?; + match fun(left.as_deref(), right.as_deref())? { + ScalarValue::Utf8(item) => values.push(item), + _ => return internal_error("unexpected binary format UDF return type"), + } + } + Ok(ColumnarValue::Array(Arc::new(StringArray::from(values)))) +} + +fn require_arity(args: &[ColumnarValue], expected: usize) -> Result<()> { + if args.len() == expected { + return Ok(()); + } + internal_error(format!( + "format UDF expected {expected} arguments, got {}", + args.len() + )) +} + +fn all_scalar(args: &[ColumnarValue]) -> bool { + args.iter() + .all(|arg| matches!(arg, ColumnarValue::Scalar(_))) +} + +fn row_count(args: &[ColumnarValue]) -> usize { + args.iter() + .find_map(|arg| match arg { + ColumnarValue::Array(array) => Some(array.len()), + ColumnarValue::Scalar(_) => None, + }) + .unwrap_or(1) +} + +fn array_result( + kind: &str, + string_values: Vec<Option<String>>, + bool_values: Vec<Option<bool>>, + int_values: Vec<Option<i64>>, +) -> Result<ColumnarValue> { + match kind { + "bool" => Ok(ColumnarValue::Array(Arc::new(BooleanArray::from( + bool_values, + )))), + "int" => Ok(ColumnarValue::Array(Arc::new(Int64Array::from(int_values)))), + _ => Ok(ColumnarValue::Array(Arc::new(StringArray::from( + string_values, + )))), + } +} + +fn scalar_string(arg: ColumnarValue) -> Result<Option<String>> { + match arg { + ColumnarValue::Scalar(ScalarValue::Utf8(value)) => Ok(value), + ColumnarValue::Scalar(ScalarValue::LargeUtf8(value)) => Ok(value), + ColumnarValue::Scalar(ScalarValue::Utf8View(value)) => Ok(value), + ColumnarValue::Scalar(ScalarValue::Null) => Ok(None), + ColumnarValue::Scalar(value) => Ok(Some(value.to_string())), + ColumnarValue::Array(array) => string_at(&ColumnarValue::Array(array), 0), + } +} + +fn string_at(arg: &ColumnarValue, idx: usize) -> Result<Option<String>> { + match arg { + ColumnarValue::Scalar(value) => scalar_string(ColumnarValue::Scalar(value.clone())), + ColumnarValue::Array(array) => string_array_value(array, idx), + } +} + +fn string_array_value(array: &ArrayRef, idx: usize) -> Result<Option<String>> { + if array.is_null(idx) { + 
return Ok(None); + } + match array.data_type() { + DataType::Utf8 => { + let strings = array + .as_any() + .downcast_ref::<StringArray>() + .ok_or_else(|| { + DataFusionError::Internal( + "format UDF could not downcast Utf8 array".to_string(), + ) + })?; + Ok(Some(strings.value(idx).to_string())) + } + DataType::LargeUtf8 => { + let strings = array + .as_any() + .downcast_ref::<LargeStringArray>() + .ok_or_else(|| { + DataFusionError::Internal( + "format UDF could not downcast LargeUtf8 array".to_string(), + ) + })?; + Ok(Some(strings.value(idx).to_string())) + } + DataType::Utf8View => { + let strings = array + .as_any() + .downcast_ref::<StringViewArray>() + .ok_or_else(|| { + DataFusionError::Internal( + "format UDF could not downcast Utf8View array".to_string(), + ) + })?; + Ok(Some(strings.value(idx).to_string())) + } + _ => Ok(Some(format!("{:?}", array))), + } +} + +fn hash_sha1(value: Option<&str>) -> Result<ScalarValue> { + match value { + Some(text) => { + let mut hasher = Sha1::new(); + hasher.update(text.as_bytes()); + Ok(ScalarValue::Utf8(Some(hex::encode(hasher.finalize())))) + } + None => Ok(ScalarValue::Utf8(None)), + } +} + +fn hash_crc32(value: Option<&str>) -> Result<ScalarValue> { + match value { + Some(text) => { + let mut hasher = Crc32Hasher::new(); + hasher.update(text.as_bytes()); + Ok(ScalarValue::Utf8(Some(format!( + "{:08x}", + hasher.finalize() + )))) + } + None => Ok(ScalarValue::Utf8(None)), + } +} + +fn hash_xxhash64(value: Option<&str>) -> Result<ScalarValue> { + match value { + Some(text) => Ok(ScalarValue::Utf8(Some(format!( + "{:016x}", + xxh64(text.as_bytes(), 0) + )))), + None => Ok(ScalarValue::Utf8(None)), + } +} + +fn encode_url_component(value: Option<&str>) -> Result<ScalarValue> { + Ok(ScalarValue::Utf8(value.map(|text| { + utf8_percent_encode(text, URL_COMPONENT_ENCODE_SET).to_string() + }))) +} + +fn decode_url_component(value: Option<&str>) -> Result<ScalarValue> { + if value.is_none() { + return Ok(ScalarValue::Utf8(None)); + } + match try_decode_url_text(value)? 
{ + Some(text) => Ok(ScalarValue::Utf8(Some(text))), + None => internal_error("invalid percent-encoded URL component"), + } +} + +fn try_decode_url_component(value: Option<&str>) -> Result<ScalarValue> { + Ok(ScalarValue::Utf8(try_decode_url_text(value)?)) +} + +fn try_decode_url_text(value: Option<&str>) -> Result<Option<String>> { + match value { + Some(text) => { + if !percent_escapes_are_valid(text) { + return Ok(None); + } + match percent_decode_str(text).decode_utf8() { + Ok(decoded) => Ok(Some(decoded.to_string())), + Err(_) => Ok(None), + } + } + None => Ok(None), + } +} + +fn percent_escapes_are_valid(value: &str) -> bool { + let bytes = value.as_bytes(); + let mut idx = 0; + while idx < bytes.len() { + if bytes[idx] == b'%' { + if idx + 2 >= bytes.len() + || !bytes[idx + 1].is_ascii_hexdigit() + || !bytes[idx + 2].is_ascii_hexdigit() + { + return false; + } + idx += 3; + } else { + idx += 1; + } + } + true +} + +fn parse_url_part(url: Option<&str>, part: Option<&str>) -> Result<ScalarValue> { + let Some(raw_url) = url else { + return Ok(ScalarValue::Utf8(None)); + }; + let Some(raw_part) = part else { + return Ok(ScalarValue::Utf8(None)); + }; + let parsed = Url::parse(raw_url).map_err(|err| DataFusionError::Execution(err.to_string()))?; + let value = match raw_part.to_ascii_lowercase().as_str() { + "scheme" => Some(parsed.scheme().to_string()), + "host" => parsed.host_str().map(str::to_string), + "path" => Some(parsed.path().to_string()), + "query" => parsed.query().map(str::to_string), + "fragment" => parsed.fragment().map(str::to_string), + "port" => parsed.port().map(|port| port.to_string()), + "username" => Some(parsed.username().to_string()), + "password" => parsed.password().map(str::to_string), + other => return internal_error(format!("unknown parse_url part `{other}`")), + }; + Ok(ScalarValue::Utf8(value)) +} + +fn parse_json(value: Option<&str>) -> Result<ScalarValue> { + match value { + Some(text) => Ok(ScalarValue::Utf8(Some(normalized_json(text)?))), + None => Ok(ScalarValue::Utf8(None)), + } +} 
+ +fn check_json(value: Option<&str>) -> Result<ScalarValue> { + match value { + Some(text) => Ok(ScalarValue::Boolean(Some( + serde_json::from_str::<Value>(text).is_ok(), + ))), + None => Ok(ScalarValue::Boolean(None)), + } +} + +fn schema_of_json(value: Option<&str>) -> Result<ScalarValue> { + match value { + Some(text) => { + let parsed = parse_json_value(text)?; + Ok(ScalarValue::Utf8(Some(json_schema(&parsed)))) + } + None => Ok(ScalarValue::Utf8(None)), + } +} + +fn json_array_length(value: Option<&str>) -> Result<ScalarValue> { + match value { + Some(text) => { + let parsed = parse_json_value(text)?; + let length = parsed.as_array().map(|items| items.len() as i64); + Ok(ScalarValue::Int64(length)) + } + None => Ok(ScalarValue::Int64(None)), + } +} + +fn json_object_keys(value: Option<&str>) -> Result<ScalarValue> { + match value { + Some(text) => { + let parsed = parse_json_value(text)?; + let keys: Vec<Value> = parsed + .as_object() + .map(|object| object.keys().cloned().map(Value::String).collect()) + .unwrap_or_default(); + Ok(ScalarValue::Utf8(Some(Value::Array(keys).to_string()))) + } + None => Ok(ScalarValue::Utf8(None)), + } +} + +fn get_json_object(value: Option<&str>, path: Option<&str>) -> Result<ScalarValue> { + json_path_value(value, path, false) +} + +fn json_extract_path_text(value: Option<&str>, path: Option<&str>) -> Result<ScalarValue> { + json_path_value(value, path, true) +} + +fn json_path_value( + value: Option<&str>, + path: Option<&str>, + text_mode: bool, +) -> Result<ScalarValue> { + let Some(raw_json) = value else { + return Ok(ScalarValue::Utf8(None)); + }; + let Some(raw_path) = path else { + return Ok(ScalarValue::Utf8(None)); + }; + let parsed = parse_json_value(raw_json)?; + let Some(selected) = select_json_path(&parsed, raw_path) else { + return Ok(ScalarValue::Utf8(None)); + }; + if text_mode { + if let Some(text) = selected.as_str() { + return Ok(ScalarValue::Utf8(Some(text.to_string()))); + } + } + Ok(ScalarValue::Utf8(Some(selected.to_string()))) +} + +fn from_json(value: Option<&str>, _schema: Option<&str>) -> Result<ScalarValue> { + 
parse_json(value) +} + +fn try_from_json(value: Option<&str>, _schema: Option<&str>) -> Result<ScalarValue> { + match value { + Some(text) => match normalized_json(text) { + Ok(parsed) => Ok(ScalarValue::Utf8(Some(parsed))), + Err(_) => Ok(ScalarValue::Utf8(None)), + }, + None => Ok(ScalarValue::Utf8(None)), + } +} + +fn to_json(value: Option<&str>) -> Result<ScalarValue> { + Ok(ScalarValue::Utf8( + value.map(|text| Value::String(text.to_string()).to_string()), + )) +} + +fn normalized_json(text: &str) -> Result<String> { + Ok(parse_json_value(text)?.to_string()) +} + +fn parse_json_value(text: &str) -> Result<Value> { + serde_json::from_str::<Value>(text).map_err(|err| DataFusionError::Execution(err.to_string())) +} + +fn select_json_path<'a>(value: &'a Value, path: &str) -> Option<&'a Value> { + if path == "$" || path.is_empty() { + return Some(value); + } + let mut current = value; + let path = path + .strip_prefix("$.") + .or_else(|| path.strip_prefix('$')) + .unwrap_or(path); + for segment in path.split('.') { + if segment.is_empty() { + continue; + } + current = select_json_segment(current, segment)?; + } + Some(current) +} + +fn select_json_segment<'a>(value: &'a Value, segment: &str) -> Option<&'a Value> { + if let Some((field, index_part)) = segment.split_once('[') { + let nested = if field.is_empty() { + value + } else { + value.as_object()?.get(field)? 
+ }; + let index_text = index_part.strip_suffix(']')?; + let index: usize = index_text.parse().ok()?; + return nested.as_array()?.get(index); + } + value.as_object()?.get(segment) +} + +fn json_schema(value: &Value) -> String { + match value { + Value::Null => "NULL".to_string(), + Value::Bool(_) => "BOOLEAN".to_string(), + Value::Number(number) => { + if number.is_i64() || number.is_u64() { + "BIGINT".to_string() + } else { + "DOUBLE".to_string() + } + } + Value::String(_) => "STRING".to_string(), + Value::Array(values) => { + let item_schema = values + .first() + .map(json_schema) + .unwrap_or_else(|| "UNKNOWN".to_string()); + format!("ARRAY<{item_schema}>") + } + Value::Object(object) => { + let fields = object + .iter() + .map(|(name, item)| format!("{name}: {}", json_schema(item))) + .collect::<Vec<_>>() + .join(", "); + format!("STRUCT<{fields}>") + } + } +} + +fn schema_of_csv(value: Option<&str>) -> Result<ScalarValue> { + let Some(text) = value else { + return Ok(ScalarValue::Utf8(None)); + }; + let record = first_csv_record(text)?; + let fields = record + .iter() + .enumerate() + .map(|(idx, item)| format!("_c{idx}: {}", scalar_schema(item))) + .collect::<Vec<_>>() + .join(", "); + Ok(ScalarValue::Utf8(Some(format!("STRUCT<{fields}>")))) +} + +fn from_csv(value: Option<&str>, schema: Option<&str>) -> Result<ScalarValue> { + let Some(text) = value else { + return Ok(ScalarValue::Utf8(None)); + }; + let record = first_csv_record(text)?; + let names = csv_schema_field_names(schema.unwrap_or("")); + if names.is_empty() { + let values = record + .iter() + .map(|field| Value::String(field.to_string())) + .collect(); + return Ok(ScalarValue::Utf8(Some(Value::Array(values).to_string()))); + } + let mut object = Map::new(); + for (idx, name) in names.iter().enumerate() { + let value = record.get(idx).unwrap_or(""); + object.insert(name.clone(), Value::String(value.to_string())); + } + Ok(ScalarValue::Utf8(Some(Value::Object(object).to_string()))) +} + +fn to_csv(value: Option<&str>) -> Result<ScalarValue> { + let 
Some(text) = value else { + return Ok(ScalarValue::Utf8(None)); + }; + let fields = match serde_json::from_str::<Value>(text) { + Ok(Value::Array(values)) => values.into_iter().map(csv_json_value).collect(), + Ok(Value::Object(object)) => object.into_values().map(csv_json_value).collect(), + _ => vec![text.to_string()], + }; + let mut writer = csv::WriterBuilder::new() + .has_headers(false) + .from_writer(vec![]); + writer + .write_record(fields) + .map_err(|err| DataFusionError::Execution(err.to_string()))?; + let bytes = writer + .into_inner() + .map_err(|err| DataFusionError::Execution(err.to_string()))?; + let row = + String::from_utf8(bytes).map_err(|err| DataFusionError::Execution(err.to_string()))?; + Ok(ScalarValue::Utf8(Some( + row.trim_end_matches(['\r', '\n']).to_string(), + ))) +} + +fn first_csv_record(text: &str) -> Result<csv::StringRecord> { + let mut reader = csv::ReaderBuilder::new() + .has_headers(false) + .from_reader(text.as_bytes()); + match reader.records().next() { + Some(record) => record.map_err(|err| DataFusionError::Execution(err.to_string())), + None => Ok(csv::StringRecord::new()), + } +} + +fn csv_schema_field_names(schema: &str) -> Vec<String> { + schema + .trim() + .trim_start_matches("STRUCT<") + .trim_end_matches('>') + .split(',') + .filter_map(|part| { + let name = part.trim().split([':', ' ']).next().unwrap_or("").trim(); + if name.is_empty() { + None + } else { + Some(name.to_string()) + } + }) + .collect() +} + +fn scalar_schema(value: &str) -> &'static str { + if value.parse::<i64>().is_ok() { + "BIGINT" + } else if value.parse::<f64>().is_ok() { + "DOUBLE" + } else if value.eq_ignore_ascii_case("true") || value.eq_ignore_ascii_case("false") { + "BOOLEAN" + } else { + "STRING" + } +} + +fn csv_json_value(value: Value) -> String { + match value { + Value::Null => String::new(), + Value::String(text) => text, + other => other.to_string(), + } +} + +fn internal_error<T>(message: impl Into<String>) -> Result<T> { + Err(DataFusionError::Internal(message.into())) +} diff --git 
a/src/functions/csv/from_csv.incn b/src/functions/csv/from_csv.incn new file mode 100644 index 0000000..0aef426 --- /dev/null +++ b/src/functions/csv/from_csv.incn @@ -0,0 +1,35 @@ +"""CSV scalar parsing helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionErrorBehavior, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr, str_expr +from substrait.function_extensions import FROM_CSV_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + substrait=extension_mapping("from_csv", FROM_CSV_FUNCTION_ANCHOR), +)) +pub def from_csv(expr: ColumnExpr, schema: str) -> ColumnExpr: + """ + Parse one CSV row string into a JSON payload using an optional schema description for field names. + + Examples: + row = from_csv(col("line"), "STRUCT") + + Parameters: + expr: String expression containing one CSV row. + schema: Explicit schema description used for output field names. 
+ """ + return registered_application("from_csv", [expr, str_expr(schema)]) diff --git a/src/functions/csv/mod.incn b/src/functions/csv/mod.incn new file mode 100644 index 0000000..2a4e1a2 --- /dev/null +++ b/src/functions/csv/mod.incn @@ -0,0 +1,5 @@ +"""CSV scalar format helpers.""" + +pub from functions.csv.from_csv import from_csv +pub from functions.csv.schema_of_csv import schema_of_csv +pub from functions.csv.to_csv import to_csv diff --git a/src/functions/csv/schema_of_csv.incn b/src/functions/csv/schema_of_csv.incn new file mode 100644 index 0000000..fbc53e5 --- /dev/null +++ b/src/functions/csv/schema_of_csv.incn @@ -0,0 +1,34 @@ +"""CSV scalar schema inference helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionErrorBehavior, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import SCHEMA_OF_CSV_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + substrait=extension_mapping("schema_of_csv", SCHEMA_OF_CSV_FUNCTION_ANCHOR), +)) +pub def schema_of_csv(expr: ColumnExpr) -> ColumnExpr: + """ + Infer a deterministic schema description from one CSV row string. + + Examples: + schema = schema_of_csv(col("line")) + + Parameters: + expr: String expression containing one CSV row. 
+ """ + return registered_application("schema_of_csv", [expr]) diff --git a/src/functions/csv/to_csv.incn b/src/functions/csv/to_csv.incn new file mode 100644 index 0000000..77924ad --- /dev/null +++ b/src/functions/csv/to_csv.incn @@ -0,0 +1,32 @@ +"""CSV scalar serialization helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import TO_CSV_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("to_csv", TO_CSV_FUNCTION_ANCHOR), +)) +pub def to_csv(expr: ColumnExpr) -> ColumnExpr: + """ + Serialize one scalar or JSON array/object payload as one CSV row string. + + Examples: + line = to_csv(col("payload")) + + Parameters: + expr: String-backed scalar or JSON array/object payload to serialize as one CSV row. 
+ """ + return registered_application("to_csv", [expr]) diff --git a/src/functions/hashing/crc32.incn b/src/functions/hashing/crc32.incn new file mode 100644 index 0000000..f2d0d97 --- /dev/null +++ b/src/functions/hashing/crc32.incn @@ -0,0 +1,32 @@ +"""CRC-32 hashing helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import CRC32_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("crc32", CRC32_FUNCTION_ANCHOR), +)) +pub def crc32(expr: ColumnExpr) -> ColumnExpr: + """ + Return the lowercase eight-character hexadecimal CRC-32 digest for one UTF-8 string expression. + + Examples: + digest = crc32(col("payload")) + + Parameters: + expr: String expression whose UTF-8 bytes should be hashed. 
+ """ + return registered_application("crc32", [expr]) diff --git a/src/functions/hashing/md5.incn b/src/functions/hashing/md5.incn index 6d4b1cc..69fe160 100644 --- a/src/functions/hashing/md5.incn +++ b/src/functions/hashing/md5.incn @@ -12,16 +12,16 @@ from function_registry import ( extension_mapping, v0_1, ) -from functions.registry import function_registry, registered_application +from functions.registry import register_function, registered_application from projection_builders import ColumnExpr from substrait.function_extensions import MD5_FUNCTION_ANCHOR -@function_registry.add("md5", deterministic_spec( - FunctionClass.Scalar, - FunctionLifecycle(since=v0_1, changed=[], deprecated=None), - FunctionNullBehavior.DependsOnInputs, - extension_mapping("md5", MD5_FUNCTION_ANCHOR), +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("md5", MD5_FUNCTION_ANCHOR), )) pub def md5(expr: ColumnExpr) -> ColumnExpr: """ diff --git a/src/functions/hashing/sha1.incn b/src/functions/hashing/sha1.incn new file mode 100644 index 0000000..ef82a2e --- /dev/null +++ b/src/functions/hashing/sha1.incn @@ -0,0 +1,32 @@ +"""SHA-1 hashing helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import SHA1_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("sha1", SHA1_FUNCTION_ANCHOR), +)) +pub def sha1(expr: ColumnExpr) -> ColumnExpr: + """ + Return 
the lowercase hexadecimal SHA-1 digest for one UTF-8 string expression. + + Examples: + digest = sha1(col("payload")) + + Parameters: + expr: String expression whose UTF-8 bytes should be hashed. + """ + return registered_application("sha1", [expr]) diff --git a/src/functions/hashing/sha2.incn b/src/functions/hashing/sha2.incn index 154d424..6871e1b 100644 --- a/src/functions/hashing/sha2.incn +++ b/src/functions/hashing/sha2.incn @@ -6,13 +6,13 @@ SHA-2 compatibility helper. from rust::incan_stdlib::errors import raise_value_error from function_registry import ( + CORE_FUNCTION_NAMESPACE, FunctionClass, FunctionDeterminism, FunctionErrorBehavior, FunctionLifecycle, FunctionNullBehavior, compatibility_alias_spec, - core_function_namespace, rewrite_mapping, v0_1, ) @@ -20,19 +20,19 @@ from functions.hashing.sha224 import sha224 from functions.hashing.sha256 import sha256 from functions.hashing.sha384 import sha384 from functions.hashing.sha512 import sha512 -from functions.registry import function_registry +from functions.registry import register_function from projection_builders import ColumnExpr -@function_registry.add("sha2", compatibility_alias_spec( - core_function_namespace(), - FunctionClass.Scalar, - ["sha224", "sha256", "sha384", "sha512"], - FunctionLifecycle(since=v0_1, changed=[], deprecated=None), - FunctionDeterminism.Deterministic, - FunctionNullBehavior.DependsOnInputs, - FunctionErrorBehavior.InvalidInputDiagnostic, - rewrite_mapping("sha2(expr, bits) -> sha224/sha256/sha384/sha512(expr) for supported literal bit lengths"), +@register_function(compatibility_alias_spec( + namespace=CORE_FUNCTION_NAMESPACE, + function_class=FunctionClass.Scalar, + aliases=["sha224", "sha256", "sha384", "sha512"], + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + determinism=FunctionDeterminism.Deterministic, + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + 
substrait=rewrite_mapping("sha2(expr, bits) -> sha224/sha256/sha384/sha512(expr) for supported literal bit lengths"), )) pub def sha2(expr: ColumnExpr, bit_length: int) -> ColumnExpr: """ diff --git a/src/functions/hashing/sha224.incn b/src/functions/hashing/sha224.incn index 4b209d1..a31807a 100644 --- a/src/functions/hashing/sha224.incn +++ b/src/functions/hashing/sha224.incn @@ -12,16 +12,16 @@ from function_registry import ( extension_mapping, v0_1, ) -from functions.registry import function_registry, registered_application +from functions.registry import register_function, registered_application from projection_builders import ColumnExpr from substrait.function_extensions import SHA224_FUNCTION_ANCHOR -@function_registry.add("sha224", deterministic_spec( - FunctionClass.Scalar, - FunctionLifecycle(since=v0_1, changed=[], deprecated=None), - FunctionNullBehavior.DependsOnInputs, - extension_mapping("sha224", SHA224_FUNCTION_ANCHOR), +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("sha224", SHA224_FUNCTION_ANCHOR), )) pub def sha224(expr: ColumnExpr) -> ColumnExpr: """ diff --git a/src/functions/hashing/sha256.incn b/src/functions/hashing/sha256.incn index 32d0963..39d11db 100644 --- a/src/functions/hashing/sha256.incn +++ b/src/functions/hashing/sha256.incn @@ -12,16 +12,16 @@ from function_registry import ( extension_mapping, v0_1, ) -from functions.registry import function_registry, registered_application +from functions.registry import register_function, registered_application from projection_builders import ColumnExpr from substrait.function_extensions import SHA256_FUNCTION_ANCHOR -@function_registry.add("sha256", deterministic_spec( - FunctionClass.Scalar, - FunctionLifecycle(since=v0_1, changed=[], deprecated=None), - FunctionNullBehavior.DependsOnInputs, - 
extension_mapping("sha256", SHA256_FUNCTION_ANCHOR), +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("sha256", SHA256_FUNCTION_ANCHOR), )) pub def sha256(expr: ColumnExpr) -> ColumnExpr: """ diff --git a/src/functions/hashing/sha384.incn b/src/functions/hashing/sha384.incn index c7afab1..de0e371 100644 --- a/src/functions/hashing/sha384.incn +++ b/src/functions/hashing/sha384.incn @@ -12,16 +12,16 @@ from function_registry import ( extension_mapping, v0_1, ) -from functions.registry import function_registry, registered_application +from functions.registry import register_function, registered_application from projection_builders import ColumnExpr from substrait.function_extensions import SHA384_FUNCTION_ANCHOR -@function_registry.add("sha384", deterministic_spec( - FunctionClass.Scalar, - FunctionLifecycle(since=v0_1, changed=[], deprecated=None), - FunctionNullBehavior.DependsOnInputs, - extension_mapping("sha384", SHA384_FUNCTION_ANCHOR), +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("sha384", SHA384_FUNCTION_ANCHOR), )) pub def sha384(expr: ColumnExpr) -> ColumnExpr: """ diff --git a/src/functions/hashing/sha512.incn b/src/functions/hashing/sha512.incn index 193fe54..bfc9780 100644 --- a/src/functions/hashing/sha512.incn +++ b/src/functions/hashing/sha512.incn @@ -12,16 +12,16 @@ from function_registry import ( extension_mapping, v0_1, ) -from functions.registry import function_registry, registered_application +from functions.registry import register_function, registered_application from projection_builders import ColumnExpr from substrait.function_extensions import SHA512_FUNCTION_ANCHOR 
-@function_registry.add("sha512", deterministic_spec( - FunctionClass.Scalar, - FunctionLifecycle(since=v0_1, changed=[], deprecated=None), - FunctionNullBehavior.DependsOnInputs, - extension_mapping("sha512", SHA512_FUNCTION_ANCHOR), +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("sha512", SHA512_FUNCTION_ANCHOR), )) pub def sha512(expr: ColumnExpr) -> ColumnExpr: """ diff --git a/src/functions/hashing/xxhash64.incn b/src/functions/hashing/xxhash64.incn new file mode 100644 index 0000000..4422427 --- /dev/null +++ b/src/functions/hashing/xxhash64.incn @@ -0,0 +1,32 @@ +"""xxHash64 hashing helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import XXHASH64_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("xxhash64", XXHASH64_FUNCTION_ANCHOR), +)) +pub def xxhash64(expr: ColumnExpr) -> ColumnExpr: + """ + Return the lowercase sixteen-character hexadecimal xxHash64 digest for one UTF-8 string expression. + + Examples: + digest = xxhash64(col("payload")) + + Parameters: + expr: String expression whose UTF-8 bytes should be hashed. 
+ """ + return registered_application("xxhash64", [expr]) diff --git a/src/functions/json/check_json.incn b/src/functions/json/check_json.incn new file mode 100644 index 0000000..da1ade3 --- /dev/null +++ b/src/functions/json/check_json.incn @@ -0,0 +1,32 @@ +"""JSON validity predicate helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import CHECK_JSON_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("check_json", CHECK_JSON_FUNCTION_ANCHOR), +)) +pub def check_json(expr: ColumnExpr) -> ColumnExpr: + """ + Return whether one string expression contains valid JSON. + + Examples: + valid = check_json(col("payload")) + + Parameters: + expr: String expression containing a JSON payload. + """ + return registered_application("check_json", [expr]) diff --git a/src/functions/json/common.incn b/src/functions/json/common.incn new file mode 100644 index 0000000..089a2c8 --- /dev/null +++ b/src/functions/json/common.incn @@ -0,0 +1,13 @@ +"""Shared validation helpers for JSON scalar payload functions.""" + +from projection_builders import ColumnExpr, str_expr +from rust::incan_stdlib::errors import raise_value_error + + +pub def json_path_expr(path: str) -> ColumnExpr: + """Build a path literal expression after validating the supported JSON path prefix.""" + if path == "$": + return str_expr(path) + if len(path) >= 2 and path[0] == "$" and (path[1] == "." 
or path[1] == "["): + return str_expr(path) + return raise_value_error("JSON path must be `$`, start with `$.`, or start with `$[`") diff --git a/src/functions/json/from_json.incn b/src/functions/json/from_json.incn new file mode 100644 index 0000000..1edc210 --- /dev/null +++ b/src/functions/json/from_json.incn @@ -0,0 +1,35 @@ +"""JSON parse helper with an explicit schema description.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionErrorBehavior, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr, str_expr +from substrait.function_extensions import FROM_JSON_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + substrait=extension_mapping("from_json", FROM_JSON_FUNCTION_ANCHOR), +)) +pub def from_json(expr: ColumnExpr, schema: str) -> ColumnExpr: + """ + Validate JSON using an explicit schema description and return a normalized JSON payload. + + Examples: + payload = from_json(col("payload"), "STRUCT") + + Parameters: + expr: String expression containing a JSON payload. + schema: Explicit schema description for the expected JSON payload. 
+ """ + return registered_application("from_json", [expr, str_expr(schema)]) diff --git a/src/functions/json/get_json_object.incn b/src/functions/json/get_json_object.incn new file mode 100644 index 0000000..29c9bbb --- /dev/null +++ b/src/functions/json/get_json_object.incn @@ -0,0 +1,36 @@ +"""JSON object-path extraction helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionErrorBehavior, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.json.common import json_path_expr +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import GET_JSON_OBJECT_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + substrait=extension_mapping("get_json_object", GET_JSON_OBJECT_FUNCTION_ANCHOR), +)) +pub def get_json_object(expr: ColumnExpr, path: str) -> ColumnExpr: + """ + Extract one JSON value at a literal object path and return it as JSON text. + + Examples: + event_type = get_json_object(col("payload"), "$.type") + + Parameters: + expr: String expression containing a JSON payload. + path: Literal JSON path that is exactly `$` or begins with `$.` or `$[`.
+ """ + return registered_application("get_json_object", [expr, json_path_expr(path)]) diff --git a/src/functions/json/json_array_length.incn b/src/functions/json/json_array_length.incn new file mode 100644 index 0000000..c5c5e76 --- /dev/null +++ b/src/functions/json/json_array_length.incn @@ -0,0 +1,34 @@ +"""JSON array length helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionErrorBehavior, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import JSON_ARRAY_LENGTH_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + substrait=extension_mapping("json_array_length", JSON_ARRAY_LENGTH_FUNCTION_ANCHOR), +)) +pub def json_array_length(expr: ColumnExpr) -> ColumnExpr: + """ + Return the number of elements in a JSON array, or null for non-array JSON values. + + Examples: + tag_count = json_array_length(col("tags_json")) + + Parameters: + expr: String expression containing a JSON array payload. 
+ """ + return registered_application("json_array_length", [expr]) diff --git a/src/functions/json/json_extract_path_text.incn b/src/functions/json/json_extract_path_text.incn new file mode 100644 index 0000000..33e5363 --- /dev/null +++ b/src/functions/json/json_extract_path_text.incn @@ -0,0 +1,36 @@ +"""JSON path text extraction helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionErrorBehavior, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.json.common import json_path_expr +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import JSON_EXTRACT_PATH_TEXT_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + substrait=extension_mapping("json_extract_path_text", JSON_EXTRACT_PATH_TEXT_FUNCTION_ANCHOR), +)) +pub def json_extract_path_text(expr: ColumnExpr, path: str) -> ColumnExpr: + """ + Extract one JSON value at a literal object path and return scalar strings as plain text. + + Examples: + event_type = json_extract_path_text(col("payload"), "$.type") + + Parameters: + expr: String expression containing a JSON payload. + path: Literal JSON path that is exactly `$` or begins with `$.` or `$[`.
+ """ + return registered_application("json_extract_path_text", [expr, json_path_expr(path)]) diff --git a/src/functions/json/json_object_keys.incn b/src/functions/json/json_object_keys.incn new file mode 100644 index 0000000..fb39d29 --- /dev/null +++ b/src/functions/json/json_object_keys.incn @@ -0,0 +1,34 @@ +"""JSON object key listing helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionErrorBehavior, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import JSON_OBJECT_KEYS_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + substrait=extension_mapping("json_object_keys", JSON_OBJECT_KEYS_FUNCTION_ANCHOR), +)) +pub def json_object_keys(expr: ColumnExpr) -> ColumnExpr: + """ + Return object keys from a JSON payload as a JSON array string. + + Examples: + keys = json_object_keys(col("payload")) + + Parameters: + expr: String expression containing a JSON object payload. 
+ """ + return registered_application("json_object_keys", [expr]) diff --git a/src/functions/json/mod.incn b/src/functions/json/mod.incn new file mode 100644 index 0000000..e72bc12 --- /dev/null +++ b/src/functions/json/mod.incn @@ -0,0 +1,12 @@ +"""JSON scalar format helpers.""" + +pub from functions.json.check_json import check_json +pub from functions.json.from_json import from_json +pub from functions.json.get_json_object import get_json_object +pub from functions.json.json_array_length import json_array_length +pub from functions.json.json_extract_path_text import json_extract_path_text +pub from functions.json.json_object_keys import json_object_keys +pub from functions.json.parse_json import parse_json +pub from functions.json.schema_of_json import schema_of_json +pub from functions.json.to_json import to_json +pub from functions.json.try_from_json import try_from_json diff --git a/src/functions/json/parse_json.incn b/src/functions/json/parse_json.incn new file mode 100644 index 0000000..afb757b --- /dev/null +++ b/src/functions/json/parse_json.incn @@ -0,0 +1,34 @@ +"""Strict JSON normalization helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionErrorBehavior, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import PARSE_JSON_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + substrait=extension_mapping("parse_json", PARSE_JSON_FUNCTION_ANCHOR), +)) +pub def parse_json(expr: ColumnExpr) -> ColumnExpr: + """ + Validate and normalize one JSON scalar payload. 
+ + Examples: + normalized = parse_json(col("payload")) + + Parameters: + expr: String expression containing a JSON payload. + """ + return registered_application("parse_json", [expr]) diff --git a/src/functions/json/schema_of_json.incn b/src/functions/json/schema_of_json.incn new file mode 100644 index 0000000..809f292 --- /dev/null +++ b/src/functions/json/schema_of_json.incn @@ -0,0 +1,34 @@ +"""JSON schema inference helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionErrorBehavior, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import SCHEMA_OF_JSON_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + substrait=extension_mapping("schema_of_json", SCHEMA_OF_JSON_FUNCTION_ANCHOR), +)) +pub def schema_of_json(expr: ColumnExpr) -> ColumnExpr: + """ + Infer a deterministic schema description from one JSON scalar payload. + + Examples: + schema = schema_of_json(col("payload")) + + Parameters: + expr: String expression containing a JSON payload. 
+ """ + return registered_application("schema_of_json", [expr]) diff --git a/src/functions/json/to_json.incn b/src/functions/json/to_json.incn new file mode 100644 index 0000000..345e065 --- /dev/null +++ b/src/functions/json/to_json.incn @@ -0,0 +1,32 @@ +"""JSON serialization helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import TO_JSON_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("to_json", TO_JSON_FUNCTION_ANCHOR), +)) +pub def to_json(expr: ColumnExpr) -> ColumnExpr: + """ + Serialize one scalar expression as JSON text. + + Examples: + payload = to_json(col("event_type")) + + Parameters: + expr: String-backed scalar expression to serialize as JSON text. 
+ """ + return registered_application("to_json", [expr]) diff --git a/src/functions/json/try_from_json.incn b/src/functions/json/try_from_json.incn new file mode 100644 index 0000000..2d9ecc5 --- /dev/null +++ b/src/functions/json/try_from_json.incn @@ -0,0 +1,33 @@ +"""Recoverable JSON parse helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr, str_expr +from substrait.function_extensions import TRY_FROM_JSON_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("try_from_json", TRY_FROM_JSON_FUNCTION_ANCHOR), +)) +pub def try_from_json(expr: ColumnExpr, schema: str) -> ColumnExpr: + """ + Parse JSON with an explicit schema description and return null when the payload is invalid. + + Examples: + payload = try_from_json(col("payload"), "STRUCT") + + Parameters: + expr: String expression containing a JSON payload. + schema: Explicit schema description for the expected JSON payload. 
+ """ + return registered_application("try_from_json", [expr, str_expr(schema)]) diff --git a/src/functions/mod.incn b/src/functions/mod.incn index deb93a6..6a2ae8b 100644 --- a/src/functions/mod.incn +++ b/src/functions/mod.incn @@ -82,11 +82,31 @@ pub from functions.windows.last_value import last_value pub from functions.windows.nth_value import nth_value pub from functions.windows.bounds import current_row, following, preceding, unbounded_following, unbounded_preceding pub from functions.hashing.md5 import md5 +pub from functions.hashing.crc32 import crc32 +pub from functions.hashing.sha1 import sha1 pub from functions.hashing.sha2 import sha2 pub from functions.hashing.sha224 import sha224 pub from functions.hashing.sha256 import sha256 pub from functions.hashing.sha384 import sha384 pub from functions.hashing.sha512 import sha512 +pub from functions.hashing.xxhash64 import xxhash64 +pub from functions.urls.parse_url import parse_url +pub from functions.urls.try_url_decode import try_url_decode +pub from functions.urls.url_decode import url_decode +pub from functions.urls.url_encode import url_encode +pub from functions.json.check_json import check_json +pub from functions.json.from_json import from_json +pub from functions.json.get_json_object import get_json_object +pub from functions.json.json_array_length import json_array_length +pub from functions.json.json_extract_path_text import json_extract_path_text +pub from functions.json.json_object_keys import json_object_keys +pub from functions.json.parse_json import parse_json +pub from functions.json.schema_of_json import schema_of_json +pub from functions.json.to_json import to_json +pub from functions.json.try_from_json import try_from_json +pub from functions.csv.from_csv import from_csv +pub from functions.csv.schema_of_csv import schema_of_csv +pub from functions.csv.to_csv import to_csv pub from functions.operators.add import add pub from functions.operators.and_ import and_ pub from 
functions.operators.div import div diff --git a/src/functions/urls/mod.incn b/src/functions/urls/mod.incn new file mode 100644 index 0000000..a664957 --- /dev/null +++ b/src/functions/urls/mod.incn @@ -0,0 +1,6 @@ +"""URL format helpers.""" + +pub from functions.urls.parse_url import parse_url +pub from functions.urls.try_url_decode import try_url_decode +pub from functions.urls.url_decode import url_decode +pub from functions.urls.url_encode import url_encode diff --git a/src/functions/urls/parse_url.incn b/src/functions/urls/parse_url.incn new file mode 100644 index 0000000..147708b --- /dev/null +++ b/src/functions/urls/parse_url.incn @@ -0,0 +1,40 @@ +"""URL component parsing helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionErrorBehavior, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr, str_expr +from rust::incan_stdlib::errors import raise_value_error +from substrait.function_extensions import PARSE_URL_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + substrait=extension_mapping("parse_url", PARSE_URL_FUNCTION_ANCHOR), +)) +pub def parse_url(expr: ColumnExpr, part: str) -> ColumnExpr: + """ + Extract one URL component such as `scheme`, `host`, `path`, `query`, `fragment`, `port`, `username`, or `password`. + + Examples: + host = parse_url(col("landing_page"), "host") + + Parameters: + expr: String expression containing one absolute URL. + part: URL component to extract. 
+ """ + if part == "scheme" or part == "host" or part == "path" or part == "query" or part == "fragment" or part == "port" or part == "username" or part == "password": + return registered_application("parse_url", [expr, str_expr(part)]) + return raise_value_error( + "parse_url part must be one of scheme, host, path, query, fragment, port, username, or password", + ) diff --git a/src/functions/urls/try_url_decode.incn b/src/functions/urls/try_url_decode.incn new file mode 100644 index 0000000..21c2748 --- /dev/null +++ b/src/functions/urls/try_url_decode.incn @@ -0,0 +1,32 @@ +"""Recoverable URL component decoding helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import TRY_URL_DECODE_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("try_url_decode", TRY_URL_DECODE_FUNCTION_ANCHOR), +)) +pub def try_url_decode(expr: ColumnExpr) -> ColumnExpr: + """ + Percent-decode one URL component and return null for malformed input. + + Examples: + decoded = try_url_decode(col("encoded_query")) + + Parameters: + expr: String expression containing one percent-encoded URL component. 
+ """ + return registered_application("try_url_decode", [expr]) diff --git a/src/functions/urls/url_decode.incn b/src/functions/urls/url_decode.incn new file mode 100644 index 0000000..e50fd39 --- /dev/null +++ b/src/functions/urls/url_decode.incn @@ -0,0 +1,34 @@ +"""Strict URL component decoding helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionErrorBehavior, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import URL_DECODE_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, + substrait=extension_mapping("url_decode", URL_DECODE_FUNCTION_ANCHOR), +)) +pub def url_decode(expr: ColumnExpr) -> ColumnExpr: + """ + Percent-decode one URL component and fail on malformed percent escapes or invalid UTF-8. + + Examples: + decoded = url_decode(col("encoded_query")) + + Parameters: + expr: String expression containing one percent-encoded URL component. 
+ """ + return registered_application("url_decode", [expr]) diff --git a/src/functions/urls/url_encode.incn b/src/functions/urls/url_encode.incn new file mode 100644 index 0000000..f6b4371 --- /dev/null +++ b/src/functions/urls/url_encode.incn @@ -0,0 +1,32 @@ +"""URL component encoding helper.""" + +from function_registry import ( + FunctionClass, + FunctionLifecycle, + FunctionNullBehavior, + deterministic_spec, + extension_mapping, + v0_1, +) +from functions.registry import register_function, registered_application +from projection_builders import ColumnExpr +from substrait.function_extensions import URL_ENCODE_FUNCTION_ANCHOR + + +@register_function(deterministic_spec( + function_class=FunctionClass.Scalar, + lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None), + null_behavior=FunctionNullBehavior.DependsOnInputs, + substrait=extension_mapping("url_encode", URL_ENCODE_FUNCTION_ANCHOR), +)) +pub def url_encode(expr: ColumnExpr) -> ColumnExpr: + """ + Percent-encode one UTF-8 string for URL component use. + + Examples: + encoded = url_encode(col("query_text")) + + Parameters: + expr: String expression to percent-encode as a URL component. 
+ """ + return registered_application("url_encode", [expr]) diff --git a/src/lib.incn b/src/lib.incn index 6aa2798..79a72eb 100644 --- a/src/lib.incn +++ b/src/lib.incn @@ -134,11 +134,31 @@ pub from functions.windows.last_value import last_value pub from functions.windows.nth_value import nth_value pub from functions.windows.bounds import current_row, following, preceding, unbounded_following, unbounded_preceding pub from functions.hashing.md5 import md5 +pub from functions.hashing.crc32 import crc32 +pub from functions.hashing.sha1 import sha1 pub from functions.hashing.sha2 import sha2 pub from functions.hashing.sha224 import sha224 pub from functions.hashing.sha256 import sha256 pub from functions.hashing.sha384 import sha384 pub from functions.hashing.sha512 import sha512 +pub from functions.hashing.xxhash64 import xxhash64 +pub from functions.urls.parse_url import parse_url +pub from functions.urls.try_url_decode import try_url_decode +pub from functions.urls.url_decode import url_decode +pub from functions.urls.url_encode import url_encode +pub from functions.json.check_json import check_json +pub from functions.json.from_json import from_json +pub from functions.json.get_json_object import get_json_object +pub from functions.json.json_array_length import json_array_length +pub from functions.json.json_extract_path_text import json_extract_path_text +pub from functions.json.json_object_keys import json_object_keys +pub from functions.json.parse_json import parse_json +pub from functions.json.schema_of_json import schema_of_json +pub from functions.json.to_json import to_json +pub from functions.json.try_from_json import try_from_json +pub from functions.csv.from_csv import from_csv +pub from functions.csv.schema_of_csv import schema_of_csv +pub from functions.csv.to_csv import to_csv pub from functions.operators.add import add pub from functions.operators.and_ import and_ pub from functions.operators.div import div diff --git 
a/src/session/datafusion_backend.incn b/src/session/datafusion_backend.incn index 8aa8c1d..633464d 100644 --- a/src/session/datafusion_backend.incn +++ b/src/session/datafusion_backend.incn @@ -72,6 +72,7 @@ from rust::datafusion::prelude import ( from rust::datafusion::dataframe import DataFrameWriteOptions from rust::datafusion_substrait::substrait::proto import Plan as ConsumerPlan from rust::datafusion_substrait::logical_plan::consumer import from_substrait_plan +from rust::inql_datafusion_format import register_format_udfs from backends import SourceKind, TableSource from dataset.materialization import DataFrameMaterialization from session.backend_types import BackendError, BackendErrorKind, BackendRegistration, backend_error @@ -146,6 +147,7 @@ pub async def datafusion_execute_async( ) -> Result[None, BackendError]: """Execute one Substrait plan via DataFusion and discard collected rows.""" ctx = SessionContext.new() + register_format_udfs(ctx.clone()) await _register_sources(ctx, registrations)? df = await _dataframe_from_plan(ctx, plan)? @@ -161,6 +163,7 @@ pub async def datafusion_collect_materialization_async( """Execute one Substrait plan via DataFusion and return structured DataFrame materialization.""" ctx = SessionContext.new() resolved_columns = root_names(plan.clone()) + register_format_udfs(ctx.clone()) await _register_sources(ctx, registrations)? df = await _dataframe_from_plan(ctx, plan)? @@ -189,6 +192,7 @@ pub async def datafusion_write_csv_async( ) -> Result[None, BackendError]: """Execute one plan and write result rows to a CSV sink URI via DataFusion.""" ctx = SessionContext.new() + register_format_udfs(ctx.clone()) await _register_sources(ctx, registrations)? df = await _dataframe_from_plan(ctx, plan)? 
@@ -204,6 +208,7 @@ pub async def datafusion_write_parquet_async( ) -> Result[None, BackendError]: """Execute one plan and write result rows to a Parquet sink URI via DataFusion.""" ctx = SessionContext.new() + register_format_udfs(ctx.clone()) await _register_sources(ctx, registrations)? df = await _dataframe_from_plan(ctx, plan)? diff --git a/src/substrait/function_extensions.incn b/src/substrait/function_extensions.incn index b34b3c1..f303df7 100644 --- a/src/substrait/function_extensions.incn +++ b/src/substrait/function_extensions.incn @@ -93,6 +93,26 @@ pub const SHA224_FUNCTION_ANCHOR: u32 = 65 pub const SHA256_FUNCTION_ANCHOR: u32 = 66 pub const SHA384_FUNCTION_ANCHOR: u32 = 67 pub const SHA512_FUNCTION_ANCHOR: u32 = 68 +pub const SHA1_FUNCTION_ANCHOR: u32 = 69 +pub const CRC32_FUNCTION_ANCHOR: u32 = 70 +pub const XXHASH64_FUNCTION_ANCHOR: u32 = 71 +pub const URL_ENCODE_FUNCTION_ANCHOR: u32 = 72 +pub const URL_DECODE_FUNCTION_ANCHOR: u32 = 73 +pub const TRY_URL_DECODE_FUNCTION_ANCHOR: u32 = 74 +pub const PARSE_URL_FUNCTION_ANCHOR: u32 = 75 +pub const PARSE_JSON_FUNCTION_ANCHOR: u32 = 76 +pub const CHECK_JSON_FUNCTION_ANCHOR: u32 = 77 +pub const SCHEMA_OF_JSON_FUNCTION_ANCHOR: u32 = 78 +pub const JSON_ARRAY_LENGTH_FUNCTION_ANCHOR: u32 = 79 +pub const JSON_OBJECT_KEYS_FUNCTION_ANCHOR: u32 = 80 +pub const GET_JSON_OBJECT_FUNCTION_ANCHOR: u32 = 81 +pub const JSON_EXTRACT_PATH_TEXT_FUNCTION_ANCHOR: u32 = 82 +pub const FROM_JSON_FUNCTION_ANCHOR: u32 = 83 +pub const TRY_FROM_JSON_FUNCTION_ANCHOR: u32 = 84 +pub const TO_JSON_FUNCTION_ANCHOR: u32 = 85 +pub const SCHEMA_OF_CSV_FUNCTION_ANCHOR: u32 = 86 +pub const FROM_CSV_FUNCTION_ANCHOR: u32 = 87 +pub const TO_CSV_FUNCTION_ANCHOR: u32 = 88 pub const FUNCTION_EXTENSION_URI: str = "https://inql.io/extensions/v0.1/functions.yaml" pub const EXPLODE_EXTENSION_URI: str = "https://inql.io/extensions/v0.1/unnest.yaml#explode" pub const EXPLODE_OUTER_EXTENSION_URI: str = 
"https://inql.io/extensions/v0.1/unnest.yaml#explode_outer" diff --git a/tests/test_format_functions.incn b/tests/test_format_functions.incn new file mode 100644 index 0000000..15b3033 --- /dev/null +++ b/tests/test_format_functions.incn @@ -0,0 +1,29 @@ +"""Test: RFC 022 scalar format helper validation.""" + +from std.testing import assert_raises +from functions import col, get_json_object, json_extract_path_text, parse_url + + +def _call_parse_url_with_unknown_part() -> None: + """Call parse_url with an unsupported literal part for ValueError assertions.""" + parse_url(col("landing_page"), "authority") + return + + +def _call_get_json_object_with_invalid_path() -> None: + """Call get_json_object with an unsupported literal path for ValueError assertions.""" + get_json_object(col("payload"), "type") + return + + +def _call_json_extract_path_text_with_invalid_path() -> None: + """Call json_extract_path_text with an unsupported literal path for ValueError assertions.""" + json_extract_path_text(col("payload"), "") + return + + +def test_format_functions__literal_parse_options_are_validated_early() -> None: + # -- Arrange / Act / Assert -- + assert_raises[ValueError](_call_parse_url_with_unknown_part) + assert_raises[ValueError](_call_get_json_object_with_invalid_path) + assert_raises[ValueError](_call_json_extract_path_text_with_invalid_path) diff --git a/tests/test_function_registry.incn b/tests/test_function_registry.incn index 77baaad..723a805 100644 --- a/tests/test_function_registry.incn +++ b/tests/test_function_registry.incn @@ -32,6 +32,7 @@ from functions import ( case_when, cast, ceil, + check_json, col, coalesce, cardinality, @@ -39,6 +40,7 @@ from functions import ( count_distinct, count_expr, count_if, + crc32, cume_dist, desc, desc_nulls_first, @@ -55,6 +57,8 @@ from functions import ( floor, float_expr, following, + from_csv, + from_json, function_registry_canonical_names, function_registry_entries, function_registry_entry, @@ -63,6 +67,7 @@ from 
functions import ( function_registry_function_refs, gt, gte, + get_json_object, in_, int_expr, int_lit, @@ -71,6 +76,9 @@ from functions import ( is_not_nan, is_not_null, is_null, + json_array_length, + json_extract_path_text, + json_object_keys, inline, inline_outer, lag, @@ -106,6 +114,11 @@ from functions import ( rank, round, row_number, + parse_json, + parse_url, + schema_of_csv, + schema_of_json, + sha1, sha2, sha224, sha256, @@ -117,9 +130,16 @@ from functions import ( sum, stack, try_cast, + try_from_json, + try_url_decode, + to_csv, + to_json, + url_decode, + url_encode, unbounded_following, unbounded_preceding, window, + xxhash64, ) from function_registry import ( CORE_FUNCTION_NAMESPACE, @@ -164,20 +184,28 @@ from substrait.function_extensions import ( BETWEEN_FUNCTION_ANCHOR, CARDINALITY_FUNCTION_ANCHOR, CEIL_FUNCTION_ANCHOR, + CHECK_JSON_FUNCTION_ANCHOR, COALESCE_FUNCTION_ANCHOR, COUNT_FUNCTION_ANCHOR, + CRC32_FUNCTION_ANCHOR, CUME_DIST_FUNCTION_ANCHOR, DENSE_RANK_FUNCTION_ANCHOR, DIVIDE_FUNCTION_ANCHOR, EQUAL_FUNCTION_ANCHOR, FIRST_VALUE_FUNCTION_ANCHOR, FLOOR_FUNCTION_ANCHOR, + FROM_CSV_FUNCTION_ANCHOR, + FROM_JSON_FUNCTION_ANCHOR, + GET_JSON_OBJECT_FUNCTION_ANCHOR, GT_FUNCTION_ANCHOR, GTE_FUNCTION_ANCHOR, IS_NAN_FUNCTION_ANCHOR, IS_NOT_DISTINCT_FROM_FUNCTION_ANCHOR, IS_NOT_NULL_FUNCTION_ANCHOR, IS_NULL_FUNCTION_ANCHOR, + JSON_ARRAY_LENGTH_FUNCTION_ANCHOR, + JSON_EXTRACT_PATH_TEXT_FUNCTION_ANCHOR, + JSON_OBJECT_KEYS_FUNCTION_ANCHOR, LT_FUNCTION_ANCHOR, LTE_FUNCTION_ANCHOR, LAG_FUNCTION_ANCHOR, @@ -207,12 +235,24 @@ from substrait.function_extensions import ( ROW_NUMBER_FUNCTION_ANCHOR, ROUND_FUNCTION_ANCHOR, STACK_EXTENSION_URI, + PARSE_JSON_FUNCTION_ANCHOR, + PARSE_URL_FUNCTION_ANCHOR, + SCHEMA_OF_CSV_FUNCTION_ANCHOR, + SCHEMA_OF_JSON_FUNCTION_ANCHOR, + SHA1_FUNCTION_ANCHOR, SHA224_FUNCTION_ANCHOR, SHA256_FUNCTION_ANCHOR, SHA384_FUNCTION_ANCHOR, SHA512_FUNCTION_ANCHOR, SUBTRACT_FUNCTION_ANCHOR, SUM_FUNCTION_ANCHOR, + TO_CSV_FUNCTION_ANCHOR, + 
TO_JSON_FUNCTION_ANCHOR, + TRY_FROM_JSON_FUNCTION_ANCHOR, + TRY_URL_DECODE_FUNCTION_ANCHOR, + URL_DECODE_FUNCTION_ANCHOR, + URL_ENCODE_FUNCTION_ANCHOR, + XXHASH64_FUNCTION_ANCHOR, EXPLODE_EXTENSION_URI, EXPLODE_OUTER_EXTENSION_URI, FLATTEN_EXTENSION_URI, @@ -280,12 +320,12 @@ def _local_entry_by_namespace_and_name_or_fail( def _expected_registry_names() -> list[str]: """Return the expected registered public helper names.""" - return ["col", "lit", "sum", "count", "count_expr", "count_distinct", "count_if", "avg", "min", "max", "int_expr", "float_expr", "str_expr", "bool_expr", "add", "mul", "int_lit", "str_lit", "bool_lit", "always_true", "always_false", "eq", "gt", "cast", "try_cast", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "is_not_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "case_when", "in_", "between", "asc", "desc", "asc_nulls_first", "asc_nulls_last", "desc_nulls_first", "desc_nulls_last", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "range", "array_reverse", "array_slice", "array_sort", "array_union", "arrays_overlap", "cardinality", "element_at", "map_contains_key", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "explode", "explode_outer", "posexplode", "posexplode_outer", "inline", "inline_outer", "flatten", "stack", "window", "unbounded_preceding", "preceding", "current_row", "following", "unbounded_following", "row_number", "rank", "dense_rank", "percent_rank", "cume_dist", "ntile", "lag", "lead", "first_value", "last_value", "nth_value", "sha224", "sha256", "sha384", "sha512", "sha2", "md5"] + return ["col", "lit", "sum", "count", "count_expr", "count_distinct", "count_if", "avg", "min", "max", "int_expr", "float_expr", "str_expr", "bool_expr", "add", "mul", "int_lit", "str_lit", "bool_lit", "always_true", "always_false", 
"eq", "gt", "cast", "try_cast", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "is_not_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "case_when", "in_", "between", "asc", "desc", "asc_nulls_first", "asc_nulls_last", "desc_nulls_first", "desc_nulls_last", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "range", "array_reverse", "array_slice", "array_sort", "array_union", "arrays_overlap", "cardinality", "element_at", "map_contains_key", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "explode", "explode_outer", "posexplode", "posexplode_outer", "inline", "inline_outer", "flatten", "stack", "window", "unbounded_preceding", "preceding", "current_row", "following", "unbounded_following", "row_number", "rank", "dense_rank", "percent_rank", "cume_dist", "ntile", "lag", "lead", "first_value", "last_value", "nth_value", "sha224", "sha256", "sha384", "sha512", "sha2", "md5", "sha1", "crc32", "xxhash64", "parse_url", "try_url_decode", "url_decode", "url_encode", "check_json", "from_json", "get_json_object", "json_array_length", "json_extract_path_text", "json_object_keys", "parse_json", "schema_of_json", "to_json", "try_from_json", "from_csv", "schema_of_csv", "to_csv"] def _expected_substrait_mapped_names() -> list[str]: """Return helpers with concrete Substrait extension-function mappings.""" - return ["sum", "count", "avg", "min", "max", "add", "mul", "eq", "gt", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "between", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "range", "array_reverse", "array_slice", "array_sort", "array_union", 
"arrays_overlap", "cardinality", "element_at", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "row_number", "rank", "dense_rank", "percent_rank", "cume_dist", "ntile", "lag", "lead", "first_value", "last_value", "nth_value", "sha224", "sha256", "sha384", "sha512", "md5"] + return ["sum", "count", "avg", "min", "max", "add", "mul", "eq", "gt", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "between", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "range", "array_reverse", "array_slice", "array_sort", "array_union", "arrays_overlap", "cardinality", "element_at", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "row_number", "rank", "dense_rank", "percent_rank", "cume_dist", "ntile", "lag", "lead", "first_value", "last_value", "nth_value", "sha224", "sha256", "sha384", "sha512", "md5", "sha1", "crc32", "xxhash64", "parse_url", "try_url_decode", "url_decode", "url_encode", "check_json", "from_json", "get_json_object", "json_array_length", "json_extract_path_text", "json_object_keys", "parse_json", "schema_of_json", "to_json", "try_from_json", "from_csv", "schema_of_csv", "to_csv"] def _exercise_current_public_helpers() -> None: @@ -401,6 +441,26 @@ def _exercise_current_public_helpers() -> None: sha512(status) sha2(status, 256) md5(status) + sha1(status) + crc32(status) + xxhash64(status) + parse_url(status, "host") + url_encode(status) + url_decode(status) + try_url_decode(status) + parse_json(status) + check_json(status) + schema_of_json(status) + get_json_object(status, "$.type") + json_extract_path_text(status, "$.type") + json_array_length(status) + json_object_keys(status) + from_json(status, "STRUCT") + try_from_json(status, "STRUCT") + to_json(status) + 
schema_of_csv(status) + from_csv(status, "STRUCT") + to_csv(status) return @@ -725,6 +785,26 @@ def test_function_registry__substrait_extension_mappings_are_structured() -> Non _assert_extension_mapping("sha384", "sha384", SHA384_FUNCTION_ANCHOR) _assert_extension_mapping("sha512", "sha512", SHA512_FUNCTION_ANCHOR) _assert_extension_mapping("md5", "md5", MD5_FUNCTION_ANCHOR) + _assert_extension_mapping("sha1", "sha1", SHA1_FUNCTION_ANCHOR) + _assert_extension_mapping("crc32", "crc32", CRC32_FUNCTION_ANCHOR) + _assert_extension_mapping("xxhash64", "xxhash64", XXHASH64_FUNCTION_ANCHOR) + _assert_extension_mapping("parse_url", "parse_url", PARSE_URL_FUNCTION_ANCHOR) + _assert_extension_mapping("url_encode", "url_encode", URL_ENCODE_FUNCTION_ANCHOR) + _assert_extension_mapping("url_decode", "url_decode", URL_DECODE_FUNCTION_ANCHOR) + _assert_extension_mapping("try_url_decode", "try_url_decode", TRY_URL_DECODE_FUNCTION_ANCHOR) + _assert_extension_mapping("parse_json", "parse_json", PARSE_JSON_FUNCTION_ANCHOR) + _assert_extension_mapping("check_json", "check_json", CHECK_JSON_FUNCTION_ANCHOR) + _assert_extension_mapping("schema_of_json", "schema_of_json", SCHEMA_OF_JSON_FUNCTION_ANCHOR) + _assert_extension_mapping("json_array_length", "json_array_length", JSON_ARRAY_LENGTH_FUNCTION_ANCHOR) + _assert_extension_mapping("json_object_keys", "json_object_keys", JSON_OBJECT_KEYS_FUNCTION_ANCHOR) + _assert_extension_mapping("get_json_object", "get_json_object", GET_JSON_OBJECT_FUNCTION_ANCHOR) + _assert_extension_mapping("json_extract_path_text", "json_extract_path_text", JSON_EXTRACT_PATH_TEXT_FUNCTION_ANCHOR) + _assert_extension_mapping("from_json", "from_json", FROM_JSON_FUNCTION_ANCHOR) + _assert_extension_mapping("try_from_json", "try_from_json", TRY_FROM_JSON_FUNCTION_ANCHOR) + _assert_extension_mapping("to_json", "to_json", TO_JSON_FUNCTION_ANCHOR) + _assert_extension_mapping("schema_of_csv", "schema_of_csv", SCHEMA_OF_CSV_FUNCTION_ANCHOR) + 
_assert_extension_mapping("from_csv", "from_csv", FROM_CSV_FUNCTION_ANCHOR) + _assert_extension_mapping("to_csv", "to_csv", TO_CSV_FUNCTION_ANCHOR) def test_function_registry__generator_helpers_are_relation_extensions() -> None: @@ -858,7 +938,18 @@ def test_function_registry__public_helpers_preserve_existing_behavior() -> None: status, [str_lit("paid"), str_lit("open")], ), lt(amount, int_lit(10)), lte(amount, int_lit(10)), modulo(amount, lit(2)), round(amount)] - hash_exprs = [md5(status), sha2(status, 256), sha224(status), sha256(status), sha384(status), sha512(status)] + hash_exprs = [md5(status), sha2(status, 256), sha224(status), sha256(status), sha384(status), sha512(status), sha1( + status, + ), crc32(status), xxhash64(status)] + format_exprs = [parse_url(status, "host"), url_encode(status), url_decode(status), try_url_decode(status), parse_json( + status, + ), check_json(status), schema_of_json(status), get_json_object(status, "$.type"), json_extract_path_text( + status, + "$.type", + ), json_array_length(status), json_object_keys(status), from_json(status, "STRUCT"), try_from_json( + status, + "STRUCT", + ), to_json(status), schema_of_csv(status), from_csv(status, "STRUCT"), to_csv(status)] # -- Assert -- assert column_expr_kind(amount) == ColumnExprKind.Column, "col should still build a column reference" @@ -885,5 +976,7 @@ def test_function_registry__public_helpers_preserve_existing_behavior() -> None: assert column_expr_kind(core_expr) != ColumnExprKind.Column, "core scalar helpers should build scalar expressions" for hash_expr in hash_exprs: assert column_expr_kind(hash_expr) == ColumnExprKind.ScalarFunction, "hash helpers should build scalar expressions" + for format_expr in format_exprs: + assert column_expr_kind(format_expr) == ColumnExprKind.ScalarFunction, "format helpers should build scalar expressions" assert column_expr_kind(always_true()) == ColumnExprKind.BoolLiteral, "always_true should still build a bool literal" assert 
column_expr_kind(always_false()) == ColumnExprKind.BoolLiteral, "always_false should still build a bool literal" diff --git a/tests/test_hashing_functions.incn b/tests/test_hashing_functions.incn index 2cfc4c5..7978b2e 100644 --- a/tests/test_hashing_functions.incn +++ b/tests/test_hashing_functions.incn @@ -1,7 +1,7 @@ """Test: RFC 022 hashing helper surface.""" from std.testing import assert_raises -from functions import col, md5, sha2, sha224, sha256, sha384, sha512 +from functions import col, crc32, md5, sha1, sha2, sha224, sha256, sha384, sha512, xxhash64 from function_registry import function_ref_for from projection_builders import ( ColumnExpr, @@ -37,6 +37,9 @@ def test_hashing_functions__concrete_helpers_share_scalar_application_node() -> _assert_hash_application(sha256(payload), "sha256") _assert_hash_application(sha384(payload), "sha384") _assert_hash_application(sha512(payload), "sha512") + _assert_hash_application(sha1(payload), "sha1") + _assert_hash_application(crc32(payload), "crc32") + _assert_hash_application(xxhash64(payload), "xxhash64") def test_hashing_functions__sha2_rewrites_to_concrete_sha2_helpers() -> None: diff --git a/tests/test_session_projection.incn b/tests/test_session_projection.incn index bf03109..d0e0d8a 100644 --- a/tests/test_session_projection.incn +++ b/tests/test_session_projection.incn @@ -11,8 +11,10 @@ from functions import ( case_when, cast, ceil, + check_json, coalesce, col, + crc32, desc, div, floor, @@ -24,14 +26,32 @@ from functions import ( neg, nullif, round, + sha1, sha2, sha224, sha384, sha512, sub, + xxhash64, try_cast, cardinality, element_at, + from_csv, + from_json, + get_json_object, + json_array_length, + json_extract_path_text, + json_object_keys, + parse_json, + parse_url, + schema_of_csv, + schema_of_json, + to_csv, + to_json, + try_from_json, + try_url_decode, + url_decode, + url_encode, ) from dataset import DataFrame, LazyFrame from session import Session, SessionErrorKind @@ -214,7 +234,7 @@ def 
test_session_projection__collect_executes_common_math_scalar_projection_func def test_session_projection__collect_executes_format_hashing_projection_functions() -> None: - """collect should execute the first RFC 022 hashing helpers through DataFusion.""" + """collect should execute RFC 022 hashing helpers through DataFusion.""" # -- Arrange -- mut session = Session.default() @@ -223,7 +243,10 @@ def test_session_projection__collect_executes_format_hashing_projection_function session.read_csv("aggregate_orders", AGGREGATE_ORDERS_CSV_FIXTURE), "aggregate orders fixture should load", ) - projected = lazy.with_column("md5_abc", md5(lit("abc"))).with_column("sha224_abc", sha224(lit("abc"))).with_column( + projected = lazy.with_column("md5_abc", md5(lit("abc"))).with_column("sha1_abc", sha1(lit("abc"))).with_column( + "crc32_abc", + crc32(lit("abc")), + ).with_column("xxhash64_abc", xxhash64(lit("abc"))).with_column("sha224_abc", sha224(lit("abc"))).with_column( "sha2_256_abc", sha2(lit("abc"), 256), ).with_column("sha384_abc", sha384(lit("abc"))).with_column("sha512_abc", sha512(lit("abc"))) @@ -233,10 +256,13 @@ def test_session_projection__collect_executes_format_hashing_projection_function # -- Assert -- assert df.row_count() == 3, "hashing projections should preserve the input rows" - assert len(resolved) == 7, "projection should expose all appended hash outputs" + assert len(resolved) == 10, "projection should expose all appended hash outputs" assert payload.contains("md5_abc"), "md5 projection should materialize its alias" assert payload.contains("sha2_256_abc"), "sha2 compatibility projection should materialize its alias" assert payload.contains("900150983cd24fb0d6963f7d28e17f72"), "md5 should return the lowercase hex digest for abc" + assert payload.contains("a9993e364706816aba3e25717850c26c9cd0d89d"), "sha1 should return the lowercase hex digest for abc" + assert payload.contains("352441c2"), "crc32 should return the lowercase hex digest for abc" + assert 
payload.contains("44bc2cf5ad770999"), "xxhash64 should return the lowercase hex digest for abc" assert payload.contains("23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7"), "sha224 should return the lowercase hex digest for abc" assert payload.contains("ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"), "sha2(..., 256) should rewrite to sha256" assert payload.contains( @@ -247,6 +273,75 @@ def test_session_projection__collect_executes_format_hashing_projection_function ), "sha512 should return the lowercase hex digest for abc" +def test_session_projection__collect_executes_json_url_and_csv_format_functions() -> None: + """collect should execute RFC 022 scalar format helpers through adapter-owned DataFusion UDFs.""" + # -- Arrange -- + mut session = Session.default() + + # -- Act -- + lazy: LazyFrame[AggregateOrder] = assert_is_ok( + session.read_csv("aggregate_orders_format", AGGREGATE_ORDERS_CSV_FIXTURE), + "aggregate orders fixture should load", + ) + json_payload = lit("{\"type\":\"click\",\"tags\":[\"paid\",\"web\"],\"user\":{\"id\":7}}") + csv_payload = lit("42,paid") + projected = lazy.with_column("json_valid", check_json(json_payload)).with_column( + "json_type", + json_extract_path_text(json_payload, "$.type"), + ).with_column("json_object", get_json_object(json_payload, "$.user")).with_column( + "json_tags", + json_array_length(lit("[\"paid\",\"web\"]")), + ).with_column("json_keys", json_object_keys(json_payload)).with_column("json_schema", schema_of_json(json_payload)).with_column( + "json_normalized", + parse_json(json_payload), + ).with_column("json_from", from_json(json_payload, "STRUCT")).with_column( + "json_try_invalid", + try_from_json(lit("{"), "STRUCT"), + ).with_column("json_string", to_json(lit("paid"))).with_column( + "url_host", + parse_url(lit("https://example.com/orders?id=7#top"), "host"), + ).with_column("url_encoded", url_encode(lit("a b"))).with_column("url_decoded", url_decode(lit("a%20b"))).with_column( + 
"url_try_invalid", + try_url_decode(lit("%zz")), + ).with_column("csv_schema", schema_of_csv(csv_payload)).with_column( + "csv_json", + from_csv(csv_payload, "STRUCT"), + ).with_column("csv_line", to_csv(lit("[\"42\",\"paid\"]"))) + df = _collect_or_fail(session, projected) + payload = df.preview_text() + resolved = df.resolved_columns() + + # -- Assert -- + assert df.row_count() == 3, "format projections should preserve input rows" + assert len(resolved) == 19, "projection should expose all appended format outputs" + assert payload.contains("click"), "JSON path extraction should return scalar text" + assert payload.contains("{\"id\":7}"), "get_json_object should return nested JSON text" + assert payload.contains("example.com"), "parse_url should extract the host" + assert payload.contains("a%20b"), "url_encode should percent-encode spaces" + assert payload.contains("a b"), "url_decode should decode valid percent escapes" + assert payload.contains("STRUCT<"), "schema helpers should emit deterministic schema text" + assert payload.contains("{\"id\":\"42\",\"status\":\"paid\"}"), "from_csv should use schema field names" + assert payload.contains("42,paid"), "to_csv should serialize JSON arrays as CSV rows" + + +def test_session_projection__collect_reports_strict_format_decode_errors() -> None: + """collect should surface strict format parsing failures as backend execution errors.""" + # -- Arrange -- + mut session = Session.default() + + # -- Act -- + lazy: LazyFrame[AggregateOrder] = assert_is_ok( + session.read_csv("aggregate_orders_bad_format", AGGREGATE_ORDERS_CSV_FIXTURE), + "aggregate orders fixture should load", + ) + projected = lazy.with_column("bad_url", url_decode(lit("%zz"))) + err = assert_is_err(session.collect(projected), "expected strict URL decoding to fail") + + # -- Assert -- + assert err.kind == SessionErrorKind.BackendExecutionError, err.error_message() + assert err.message.contains("invalid percent-encoded URL component"), err.error_message() + 
+ def test_session_projection__collect_executes_nested_scalar_projection_functions() -> None: """collect should execute RFC 020 nested scalar helpers through DataFusion.""" # -- Arrange -- diff --git a/tests/test_substrait_plan.incn b/tests/test_substrait_plan.incn index 1ed878e..56cf70f 100644 --- a/tests/test_substrait_plan.incn +++ b/tests/test_substrait_plan.incn @@ -26,12 +26,14 @@ from functions import ( case_when, cast, ceil, + check_json, col, cume_dist, coalesce, count, count_distinct, count_if, + crc32, dense_rank, desc, div, @@ -39,6 +41,9 @@ from functions import ( equal_null, floor, first_value, + from_csv, + from_json, + get_json_object, gt, gte, in_, @@ -46,6 +51,9 @@ from functions import ( is_not_nan, is_not_null, is_null, + json_array_length, + json_extract_path_text, + json_object_keys, lit, lag, last_value, @@ -71,10 +79,15 @@ from functions import ( nullif, ntile, or_, + parse_json, + parse_url, percent_rank, rank, round, row_number, + schema_of_csv, + schema_of_json, + sha1, sha2, sha224, sha256, @@ -82,7 +95,14 @@ from functions import ( sha512, sub, sum, + to_csv, + to_json, try_cast, + try_from_json, + try_url_decode, + url_decode, + url_encode, + xxhash64, cardinality, element_at, explode, @@ -466,6 +486,26 @@ def test_plan__core_scalar_extension_mappings_lower_to_substrait() -> None: _assert_scalar_expr_lowers(sha384(col("status"))) _assert_scalar_expr_lowers(sha512(col("status"))) _assert_scalar_expr_lowers(sha2(col("status"), 256)) + _assert_scalar_expr_lowers(sha1(col("status"))) + _assert_scalar_expr_lowers(crc32(col("status"))) + _assert_scalar_expr_lowers(xxhash64(col("status"))) + _assert_scalar_expr_lowers(parse_url(col("status"), "host")) + _assert_scalar_expr_lowers(url_encode(col("status"))) + _assert_scalar_expr_lowers(url_decode(col("status"))) + _assert_scalar_expr_lowers(try_url_decode(col("status"))) + _assert_scalar_expr_lowers(parse_json(col("status"))) + _assert_scalar_expr_lowers(check_json(col("status"))) + 
_assert_scalar_expr_lowers(schema_of_json(col("status"))) + _assert_scalar_expr_lowers(get_json_object(col("status"), "$.type")) + _assert_scalar_expr_lowers(json_extract_path_text(col("status"), "$.type")) + _assert_scalar_expr_lowers(json_array_length(col("status"))) + _assert_scalar_expr_lowers(json_object_keys(col("status"))) + _assert_scalar_expr_lowers(from_json(col("status"), "STRUCT")) + _assert_scalar_expr_lowers(try_from_json(col("status"), "STRUCT")) + _assert_scalar_expr_lowers(to_json(col("status"))) + _assert_scalar_expr_lowers(schema_of_csv(col("status"))) + _assert_scalar_expr_lowers(from_csv(col("status"), "STRUCT")) + _assert_scalar_expr_lowers(to_csv(col("status"))) def test_plan__nested_scalar_extension_mappings_lower_to_substrait() -> None: From f89e53ae34ed8f075cfd8f8830eaa83a9bc007b9 Mon Sep 17 00:00:00 2001 From: Danny Meijer Date: Thu, 28 May 2026 11:55:27 +0200 Subject: [PATCH 03/13] docs - mark RFC 022 implemented --- docs/rfcs/022_semi_structured_format_functions.md | 6 +++--- docs/rfcs/README.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/rfcs/022_semi_structured_format_functions.md b/docs/rfcs/022_semi_structured_format_functions.md index 88cda15..9772abf 100644 --- a/docs/rfcs/022_semi_structured_format_functions.md +++ b/docs/rfcs/022_semi_structured_format_functions.md @@ -1,6 +1,6 @@ # InQL RFC 022: Semi-structured and format functions -- **Status:** In Progress +- **Status:** Implemented - **Created:** 2026-04-27 - **Author(s):** Danny Meijer (@dannymeijer) - **Related:** @@ -13,7 +13,7 @@ - **Issue:** [InQL #39](https://github.com/dannys-code-corner/InQL/issues/39) - **RFC PR:** [InQL #49](https://github.com/dannys-code-corner/InQL/pull/49) - **Written against:** Incan v0.3-era InQL -- **Shipped in:** — +- **Shipped in:** v0.1 ## Summary @@ -139,7 +139,7 @@ This RFC is additive. It should not change existing CSV ingestion behavior. 
## Progress Checklist -- [x] RFC 022 moved to In Progress with full scalar format helper design decisions recorded. +- [x] RFC 022 marked Implemented with full scalar format helper design decisions recorded. - [x] `md5`, `sha1`, `sha224`, `sha256`, `sha384`, `sha512`, `sha2`, `crc32`, and `xxhash64` helpers added under the function catalog. - [x] JSON, CSV, and URL scalar payload helpers added under the function catalog. - [x] Concrete helpers registered with Substrait extension metadata. diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md index 619fe45..d7a1d9f 100644 --- a/docs/rfcs/README.md +++ b/docs/rfcs/README.md @@ -28,7 +28,7 @@ InQL uses its **own** RFC series (starting at 000), independent of the [Incan la | [019][rfc-019] | Implemented | Window functions | | | [020][rfc-020] | Implemented | Nested data functions | | | [021][rfc-021] | Implemented | Generator and table-valued functions | | -| [022][rfc-022] | In Progress | Semi-structured and format functions | | +| [022][rfc-022] | Implemented | Semi-structured and format functions | | | [023][rfc-023] | Draft | Approximate and sketch functions | | | [024][rfc-024] | Implemented | Function extension policy | | From eb930b4c59536db9c5a691c3f8ccd2848ca21d04 Mon Sep 17 00:00:00 2001 From: Danny Meijer Date: Thu, 28 May 2026 12:08:13 +0200 Subject: [PATCH 04/13] docs - clean format function reference prose --- docs/language/reference/functions/format.md | 57 ++++++++++----------- 1 file changed, 27 insertions(+), 30 deletions(-) diff --git a/docs/language/reference/functions/format.md b/docs/language/reference/functions/format.md index 05572d4..96c2cfc 100644 --- a/docs/language/reference/functions/format.md +++ b/docs/language/reference/functions/format.md @@ -1,39 +1,38 @@ # Format Functions (Reference) -Format helpers operate on scalar payloads that are already present in a relation. They do not read files, infer source -schemas from external locations, or change relation cardinality. 
+Format functions transform scalar values that are already present in a relation. Source discovery, file reads, and +relation reshaping belong to the session and relational APIs rather than this function family. -The implemented RFC 022 surface covers deterministic hashes, URL helpers, JSON scalar payload helpers, and CSV scalar -payload helpers: +The format catalog includes deterministic hashes, URL helpers, JSON helpers, and CSV helpers: | Function | Meaning | | --- | --- | -| `md5(expr)` | Return the lowercase hexadecimal MD5 digest for one string expression. | -| `sha1(expr)` | Return the lowercase hexadecimal SHA-1 digest for one string expression. | -| `sha224(expr)` | Return the lowercase hexadecimal SHA-224 digest for one string expression. | -| `sha256(expr)` | Return the lowercase hexadecimal SHA-256 digest for one string expression. | -| `sha384(expr)` | Return the lowercase hexadecimal SHA-384 digest for one string expression. | -| `sha512(expr)` | Return the lowercase hexadecimal SHA-512 digest for one string expression. | +| `md5(expr)` | Return the lowercase hexadecimal MD5 digest for a string expression. | +| `sha1(expr)` | Return the lowercase hexadecimal SHA-1 digest for a string expression. | +| `sha224(expr)` | Return the lowercase hexadecimal SHA-224 digest for a string expression. | +| `sha256(expr)` | Return the lowercase hexadecimal SHA-256 digest for a string expression. | +| `sha384(expr)` | Return the lowercase hexadecimal SHA-384 digest for a string expression. | +| `sha512(expr)` | Return the lowercase hexadecimal SHA-512 digest for a string expression. | | `sha2(expr, bit_length)` | Compatibility helper that rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths. | -| `crc32(expr)` | Return the lowercase eight-character hexadecimal CRC-32 digest for one string expression. | -| `xxhash64(expr)` | Return the lowercase sixteen-character hexadecimal xxHash64 digest for one string expression. 
| +| `crc32(expr)` | Return the lowercase eight-character hexadecimal CRC-32 digest for a string expression. | +| `xxhash64(expr)` | Return the lowercase sixteen-character hexadecimal xxHash64 digest for a string expression. | | `parse_url(expr, part)` | Extract `scheme`, `host`, `path`, `query`, `fragment`, `port`, `username`, or `password` from a URL string. | -| `url_encode(expr)` | Percent-encode one URL component string. | -| `url_decode(expr)` | Strictly decode one percent-encoded URL component string and fail on malformed escapes. | -| `try_url_decode(expr)` | Decode one percent-encoded URL component string, returning null on malformed escapes. | -| `parse_json(expr)` | Strictly validate and normalize one JSON payload string. | -| `check_json(expr)` | Return whether one string expression contains valid JSON. | -| `schema_of_json(expr)` | Infer a deterministic schema description from one JSON payload string. | +| `url_encode(expr)` | Percent-encode a URL component string. | +| `url_decode(expr)` | Decode a percent-encoded URL component string and fail on malformed escapes. | +| `try_url_decode(expr)` | Decode a percent-encoded URL component string, returning null on malformed escapes. | +| `parse_json(expr)` | Validate and normalize a JSON payload string. | +| `check_json(expr)` | Return whether a string expression contains valid JSON. | +| `schema_of_json(expr)` | Infer a deterministic schema description from a JSON payload string. | | `json_array_length(expr)` | Return the number of array elements for a JSON array payload, or null for non-array payloads. | | `json_object_keys(expr)` | Return object keys from a JSON object payload as a JSON array string. | -| `get_json_object(expr, path)` | Extract one JSON value at a literal path and return it as JSON text. | -| `json_extract_path_text(expr, path)` | Extract one JSON value at a literal path and return scalar strings as plain text. 
| +| `get_json_object(expr, path)` | Extract a JSON value at a literal path and return it as JSON text. | +| `json_extract_path_text(expr, path)` | Extract a JSON value at a literal path and return scalar strings as plain text. | | `from_json(expr, schema)` | Validate JSON with an explicit schema description and return a normalized JSON payload string. | | `try_from_json(expr, schema)` | Validate JSON with an explicit schema description and return null when the payload is invalid. | -| `to_json(expr)` | Serialize one scalar expression as JSON text. | -| `schema_of_csv(expr)` | Infer a deterministic schema description from one CSV row string. | -| `from_csv(expr, schema)` | Parse one CSV row string into a JSON payload string, using schema field names when provided. | -| `to_csv(expr)` | Serialize one scalar or JSON array/object payload as one CSV row string. | +| `to_json(expr)` | Serialize a scalar expression as JSON text. | +| `schema_of_csv(expr)` | Infer a deterministic schema description from a CSV row string. | +| `from_csv(expr, schema)` | Parse a CSV row string into a JSON payload string, using schema field names when provided. | +| `to_csv(expr)` | Serialize a scalar or JSON array/object payload as a CSV row string. | ```incan from pub::inql.functions import col, from_json, get_json_object, parse_url, sha2, to_json @@ -49,10 +48,8 @@ projected = ( ``` Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal strings. `sha2(...)` accepts `224`, `256`, -`384`, and `512`; unsupported digest lengths are rejected by the helper rather than being passed through to a backend. +`384`, and `512`; other digest lengths are rejected during expression construction. -JSON and CSV parser helpers are scalar payload helpers in the current InQL value model: they validate, normalize, and -project payload text. They do not read external files and do not introduce a dynamic variant value type. 
Names for -variant-style predicates such as `typeof(...)`, `is_array(...)`, `is_object(...)`, `is_integer(...)`, `is_timestamp(...)`, -and `is_null_value(...)` are reserved by RFC 022 and remain intentionally unavailable until InQL defines the variant value -model those predicates would inspect. +JSON and CSV helpers validate, normalize, and project payload text. They do not read external files or introduce a +dynamic variant value type. Predicates such as `typeof(...)`, `is_array(...)`, `is_object(...)`, `is_integer(...)`, +`is_timestamp(...)`, and `is_null_value(...)` are reserved for a future typed variant value model. From b2dd84d848b9ce8d870561a8524ee233db3a0a96 Mon Sep 17 00:00:00 2001 From: Danny Meijer Date: Thu, 28 May 2026 12:20:48 +0200 Subject: [PATCH 05/13] docs - add variant logical values RFC --- docs/language/reference/functions/format.md | 3 +- .../022_semi_structured_format_functions.md | 20 +-- .../026_semi_structured_variant_values.md | 144 ++++++++++++++++++ docs/rfcs/README.md | 2 + 4 files changed, 157 insertions(+), 12 deletions(-) create mode 100644 docs/rfcs/026_semi_structured_variant_values.md diff --git a/docs/language/reference/functions/format.md b/docs/language/reference/functions/format.md index 96c2cfc..b669ba4 100644 --- a/docs/language/reference/functions/format.md +++ b/docs/language/reference/functions/format.md @@ -51,5 +51,4 @@ Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal stri `384`, and `512`; other digest lengths are rejected during expression construction. JSON and CSV helpers validate, normalize, and project payload text. They do not read external files or introduce a -dynamic variant value type. Predicates such as `typeof(...)`, `is_array(...)`, `is_object(...)`, `is_integer(...)`, -`is_timestamp(...)`, and `is_null_value(...)` are reserved for a future typed variant value model. +dynamic variant value type. 
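The hash contract documented above (UTF-8 input bytes, lowercase hexadecimal output, and `sha2(...)` dispatching only on `224`, `256`, `384`, or `512`) can be sketched in plain Python. This is an illustrative reference model, not the InQL implementation: the function names `sha2_hex`, `md5_hex`, and `crc32_hex` are hypothetical, and the CRC-32 variant is assumed to be the standard IEEE polynomial that `zlib` implements.

```python
import hashlib
import zlib

# Hypothetical reference helpers: plain Python, not the InQL API.
# Supported SHA-2 digest lengths, mirroring the documented sha2(...) contract.
_SHA2_BY_BITS = {
    224: hashlib.sha224,
    256: hashlib.sha256,
    384: hashlib.sha384,
    512: hashlib.sha512,
}


def sha2_hex(text: str, bit_length: int) -> str:
    """Hash the UTF-8 bytes of text; reject unsupported digest lengths up front."""
    try:
        algorithm = _SHA2_BY_BITS[bit_length]
    except KeyError:
        raise ValueError(f"unsupported sha2 bit length: {bit_length}") from None
    return algorithm(text.encode("utf-8")).hexdigest()


def md5_hex(text: str) -> str:
    """Lowercase hexadecimal MD5 digest of the UTF-8 bytes of text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def crc32_hex(text: str) -> str:
    """Eight-character lowercase hex CRC-32 (assumed IEEE polynomial, as in zlib)."""
    return format(zlib.crc32(text.encode("utf-8")), "08x")
```

Rejecting an unsupported length inside the helper, rather than forwarding it to an engine, matches the documented behavior of failing during expression construction.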
diff --git a/docs/rfcs/022_semi_structured_format_functions.md b/docs/rfcs/022_semi_structured_format_functions.md index 9772abf..aed9ee2 100644 --- a/docs/rfcs/022_semi_structured_format_functions.md +++ b/docs/rfcs/022_semi_structured_format_functions.md @@ -10,6 +10,7 @@ - InQL RFC 013 (function catalog program) - InQL RFC 014 (function registry and catalog governance) - InQL RFC 020 (nested data functions) + - InQL RFC 026 (semi-structured variant logical values) - **Issue:** [InQL #39](https://github.com/dannys-code-corner/InQL/issues/39) - **RFC PR:** [InQL #49](https://github.com/dannys-code-corner/InQL/pull/49) - **Written against:** Incan v0.3-era InQL @@ -17,7 +18,7 @@ ## Summary -This RFC defines InQL's semi-structured and format-oriented function families: JSON value functions, CSV value functions, schema inference helpers, type predicates for dynamic values, URL helpers, and hashing functions. These functions are practical data-engineering tools, but they should live in explicit format families rather than the core scalar catalog. +This RFC defines InQL's semi-structured and format-oriented function families: JSON value functions, CSV value functions, schema inference helpers, URL helpers, and hashing functions. These functions are practical data-engineering tools, but they should live in explicit format families rather than the core scalar catalog. Semi-structured variant values and their type predicates are defined separately by InQL RFC 026. ## Motivation @@ -29,7 +30,7 @@ Without a separate RFC, format helpers risk leaking ingestion policy into the sc - Define JSON scalar and schema helper functions. - Define CSV scalar and schema helper functions. -- Define dynamic-value type predicates where InQL supports variant-like values. +- Keep string-backed format helpers compatible with the variant value model defined by InQL RFC 026. - Define URL parse/encode/decode helpers. - Define deterministic hash functions for data engineering. 
- Keep format functions separate from source reading and writing contracts. @@ -37,7 +38,8 @@ Without a separate RFC, format helpers risk leaking ingestion policy into the sc ## Non-Goals - Defining source discovery, file scanning, or format handler registration. -- Defining XML, variant, geospatial, crypto, or sketch functions. +- Defining XML, geospatial, crypto, or sketch functions. +- Defining semi-structured variant logical values or variant predicates; see InQL RFC 026. - Defining nested array/map/struct functions except as return values of parsing functions. - Defining physical input-file metadata functions. @@ -69,9 +71,7 @@ InQL should define URL functions including `parse_url`, `url_encode`, `url_decod InQL should define hash functions including `crc32`, `md5`, `sha1`, `sha2`, and `xxhash64`, with input encoding and output representation specified. -Where InQL supports variant-like dynamic values, it should define type inspection and predicate functions such as `typeof`, `is_array`, `is_object`, `is_integer`, `is_timestamp`, and `is_null_value`. These functions must not be accepted before the value model they inspect is defined. - -This RFC does not introduce a dynamic variant value model. Parser helpers that pass semi-structured payloads through the current scalar expression model return validated, normalized payload text. A later variant/value-model RFC may add typed semi-structured return values, but it must not silently change the meaning of the RFC 022 string-backed payload helpers. +InQL RFC 026 defines semi-structured variant logical values and their type inspection predicates. Parser helpers in this RFC return validated, normalized payload text. RFC 026 must not silently change the meaning of the RFC 022 string-backed payload helpers. Schema inference helper functions must be deterministic for the same input values and options. 
They must not inspect external files or session state unless explicitly defined as source-discovery functions outside this RFC. @@ -99,7 +99,7 @@ This RFC is additive. It should not change existing CSV ingestion behavior. - **Place all format helpers in the common scalar catalog.** Rejected because format parsing has option, schema, and I/O-adjacent concerns that deserve a separate boundary. - **Make JSON and CSV functions source-only.** Rejected because scalar payload parsing is common inside already-loaded datasets. -- **Add full XML and variant support in the same RFC.** Rejected because those need their own type and compatibility discussion, even though this RFC may reserve JSON parsing and dynamic-value predicate names. +- **Add full XML and variant support in the same RFC.** Rejected because XML and semi-structured variant values need their own type and compatibility discussion. InQL RFC 026 owns the variant value model. ## Drawbacks @@ -125,7 +125,7 @@ This RFC is additive. It should not change existing CSV ingestion behavior. - URL helpers accept scalar URL or component strings. `url_decode(...)` is strict and fails malformed percent escapes; `try_url_decode(...)` returns null for malformed percent escapes. - JSON helpers accept scalar JSON payload strings. Strict helpers fail invalid JSON; `try_from_json(...)` returns null for invalid JSON; schema and path helpers are deterministic over the provided payload and literal schema/path arguments. - CSV helpers accept scalar row strings and explicit schema-description strings. They operate on payload values only and do not replace the session CSV read/write contract. -- Dynamic-value predicate names such as `typeof`, `is_array`, `is_object`, `is_integer`, `is_timestamp`, and `is_null_value` are reserved by this RFC but intentionally not accepted until InQL defines a variant-like value model. This follows the reference rule above rather than creating fake string predicates. 
+- Semi-structured variant predicates such as `typeof`, `is_array`, `is_object`, `is_integer`, `is_timestamp`, and `is_null_value` belong to InQL RFC 026. RFC 022 does not accept them as JSON-text parser helpers. ## Implementation Plan @@ -135,7 +135,7 @@ This RFC is additive. It should not change existing CSV ingestion behavior. 4. Keep `sha2(...)` as a compatibility rewrite over concrete helpers rather than a second mapping. 5. Add focused helper, registry, Substrait lowering, and DataFusion session tests with concrete output values. 6. Add user-facing format-function docs and release notes. -7. Reserve dynamic-value predicate names until a variant-like value model exists. +7. Record semi-structured variant values as InQL RFC 026 rather than accepting fake JSON-text predicates in RFC 022. ## Progress Checklist @@ -146,4 +146,4 @@ This RFC is additive. It should not change existing CSV ingestion behavior. - [x] `sha2(...)` implemented as a literal-bit-length rewrite with invalid-input diagnostics. - [x] Focused helper, registry, Substrait lowering, and DataFusion-backed session tests added. - [x] User-facing format-function docs and release notes added. -- [x] Dynamic-value predicate names reserved and rejected until a variant-like value model exists. +- [x] Semi-structured variant predicates delegated to InQL RFC 026 instead of being accepted as JSON-text parser helpers. 
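The strict versus recoverable URL decoding contract resolved in RFC 022 (`url_decode(...)` fails on malformed percent escapes, `try_url_decode(...)` returns null) can be sketched in plain Python. This is a minimal model under stated assumptions, not the adapter's code: the names `url_decode_strict` and `try_url_decode` are hypothetical, `+` is left untouched rather than treated as a space, and decoded bytes that are not valid UTF-8 are assumed to count as malformed input.

```python
from typing import Optional

_HEX_DIGITS = set("0123456789abcdefABCDEF")


def url_decode_strict(component: str) -> str:
    """Decode percent escapes; raise ValueError on any malformed escape."""
    out = bytearray()
    i = 0
    while i < len(component):
        ch = component[i]
        if ch == "%":
            pair = component[i + 1:i + 3]
            # A valid escape is '%' followed by exactly two hexadecimal digits.
            if len(pair) != 2 or not all(c in _HEX_DIGITS for c in pair):
                raise ValueError(f"malformed percent escape at offset {i}")
            out.append(int(pair, 16))
            i += 3
        else:
            out.extend(ch.encode("utf-8"))
            i += 1
    # UnicodeDecodeError is a ValueError subclass, so invalid UTF-8 also fails
    # (an assumption of this sketch, not a documented InQL rule).
    return out.decode("utf-8")


def try_url_decode(component: str) -> Optional[str]:
    """Recoverable form: return None (standing in for SQL null) on malformed input."""
    try:
        return url_decode_strict(component)
    except ValueError:
        return None
```

Building the recoverable helper as a thin wrapper over the strict one keeps the two forms from drifting apart: they differ only in how a failure is reported.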
diff --git a/docs/rfcs/026_semi_structured_variant_values.md b/docs/rfcs/026_semi_structured_variant_values.md new file mode 100644 index 0000000..b813d46 --- /dev/null +++ b/docs/rfcs/026_semi_structured_variant_values.md @@ -0,0 +1,144 @@ +# InQL RFC 026: Semi-structured variant logical values + +- **Status:** Draft +- **Created:** 2026-05-28 +- **Author(s):** Danny Meijer (@dannymeijer) +- **Related:** + - InQL RFC 002 (Apache Substrait integration) + - InQL RFC 014 (function registry and catalog governance) + - InQL RFC 020 (nested data functions) + - InQL RFC 022 (semi-structured and format functions) + - InQL RFC 024 (function extension policy) +- **Issue:** — +- **RFC PR:** [InQL #49](https://github.com/dannys-code-corner/InQL/pull/49) +- **Written against:** Incan v0.3-era InQL +- **Shipped in:** — + +## Summary + +This RFC defines semi-structured variant logical values for InQL. A variant value is distinct from ordinary `str` and `bytes` payloads: it carries a logical kind such as null, boolean, integer, floating point, string, timestamp, array, or object, and InQL predicates inspect that logical value rather than reparsing arbitrary JSON text. + +## Core model + +1. A semi-structured variant is a first-class logical value, even when a backend stores it in an opaque native representation. +2. Variant null and relation-level SQL null are distinct. Variant predicates must not erase that distinction. +3. Type predicates such as `typeof`, `is_array`, `is_object`, `is_integer`, `is_timestamp`, and `is_null_value` operate on variant expressions, not raw strings. +4. RFC 022 string-backed JSON and CSV payload helpers remain stable; this RFC may add variant-returning parse helpers, but it must not silently change existing helper return types. +5. Backend adapters may implement, emulate, or reject variant operations, but backend-native variant semantics do not define the portable InQL contract. 
+ +## Motivation + +JSON and semi-structured payloads are common in data pipelines, but predicates such as `is_array(...)` and `typeof(...)` are ambiguous if InQL only has string payloads. `is_array(col("payload"))` could mean "parse this string as JSON and inspect the root value", or it could mean "inspect a typed semi-structured value that was already parsed by the logical plan." Those are different contracts with different null behavior, error behavior, and backend portability. + +If InQL accepts variant predicate names before defining variant values, it either squats on better names with string-parser semantics or leaves backend adapters to decide meaning at execution time. This RFC records the missing logical value model so those predicates have a precise home. + +## Goals + +- Define a semi-structured variant logical value family. +- Define the predicate and inspection functions that operate on variant values. +- Distinguish relation-level SQL null from semi-structured null values. +- Define how variant-returning parse helpers interact with RFC 022 string-backed payload helpers. +- Keep variant semantics backend-neutral across Prism, Substrait, and backend adapters. + +## Non-Goals + +- Changing RFC 022 string-backed helpers such as `parse_json`, `from_json`, `to_json`, `from_csv`, or `to_csv`. +- Defining XML, geospatial, sketch, or binary-format functions. +- Inferring timestamps, decimals, or other rich scalar kinds from untyped JSON strings without an explicit schema or parse option. +- Requiring every backend to support native variant storage. +- Defining query-block syntax for variant destructuring. + +## Guide-level explanation (how authors think about it) + +Authors should parse payload text into a variant value before using variant predicates. 
The exact helper names remain open in this Draft, but the shape is: + +```incan +from pub::inql.functions import col, is_array, is_null_value, parse_variant_json, typeof, variant_get + +events_with_payload = events.with_column("payload_value", parse_variant_json(col("payload"))) + +projected = ( + events_with_payload + .with_column("payload_kind", typeof(col("payload_value"))) + .with_column("is_items_array", is_array(variant_get(col("payload_value"), "$.items"))) + .with_column("deleted_was_json_null", is_null_value(variant_get(col("payload_value"), "$.deleted_at"))) +) +``` + +Authors who only need text validation or normalized JSON strings should keep using the RFC 022 helpers. Variant helpers are for plans that need logical semi-structured values and type-aware predicates. + +## Reference-level explanation (precise rules) + +InQL must define a variant logical value type that can represent at least null, boolean, integer, floating point, string, timestamp, array, and object values. Decimal, binary, date, and interval kinds may be added by design decision before this RFC moves beyond Draft. + +Variant predicates must accept variant expressions. They must not accept ordinary `str` expressions as an implicit parse-and-inspect shortcut. Authors must use an explicit variant parse or cast helper when starting from JSON text. + +`typeof(expr)` must return a stable lowercase kind name for any present variant value, that is, any input that is not SQL null; a present variant null reports the kind `null`. It must distinguish at least `null`, `boolean`, `integer`, `float`, `string`, `timestamp`, `array`, and `object`. It must not report `timestamp` for a plain JSON string unless an explicit schema or parse option produced a typed timestamp variant. + +`is_array(expr)`, `is_object(expr)`, `is_integer(expr)`, `is_timestamp(expr)`, and `is_null_value(expr)` must inspect the variant kind. `is_integer(...)` must be true only for integer variant values, not floating point values whose runtime value happens to have no fractional component. 
`is_null_value(...)` must be true only for semi-structured null values. + +SQL null must remain distinct from variant null. If a predicate input is SQL null rather than a present variant value, the predicate result must follow InQL's scalar null behavior for missing inputs rather than returning true for `is_null_value(...)`. + +Variant parse helpers must define strict and recoverable forms. Strict parse helpers must fail malformed payloads according to registry error metadata. Recoverable parse helpers must return SQL null or another explicitly documented recoverable result for malformed payloads. A JSON `null` payload must produce a present variant null, not SQL null. + +Variant field/path access must preserve whether a missing path produced SQL null, variant null, or a present value. If a backend cannot preserve that distinction, the adapter must reject the operation or require an explicit compatibility mode. + +Substrait lowering must preserve variant logical type identity through extension type metadata or reject the operation before execution. A backend adapter may map variant values and predicates to native functions only when it preserves the InQL variant contract. + +## Design details + +### Syntax + +This RFC does not require new language syntax. Variant values and predicates may use ordinary helper calls and dataframe method chains. Future query-block syntax must lower to the same variant expression model. + +### Semantics + +Variant arrays and objects are semi-structured values, not InQL relation shapes. They may be projected, inspected, and passed to variant-aware helpers. They must not change relation cardinality unless used with a generator or table-valued function that explicitly defines such a change. + +Variant ordering, equality, grouping, and serialization are not implicit. 
If InQL supports those operations for variants, the operation must define kind ordering, null behavior, and backend compatibility rather than inheriting an arbitrary backend default. + +### Interaction with other InQL surfaces + +RFC 020 nested data functions operate on typed array, map, and struct expressions. Variant arrays and objects may interoperate with those functions only through explicit conversion rules. + +RFC 022 format helpers that return normalized JSON or CSV text remain string-backed helpers. This RFC may add variant-returning helpers with different names or explicit options, but existing RFC 022 helpers must not silently change return type. + +Prism may use variant kind metadata for validation, projection pruning, and rewrite safety. Prism must not rewrite variant operations in ways that collapse SQL null and variant null, drop schema-directed typed scalar information, or turn variant predicates into text parser calls. + +The Substrait boundary remains between InQL semantics and backend execution. DataFusion or any other backend may be an implementation target, but backend-native variant names, path syntaxes, and null rules do not define the portable InQL semantics. + +### Compatibility / migration + +This RFC is additive. Existing string payload columns remain strings. Existing RFC 022 functions keep their current string-backed behavior. Authors who want variant semantics must opt into variant-returning helpers or explicit casts. + +## Alternatives considered + +- **Make `typeof` and `is_array` parse strings directly.** Rejected because it couples predicate semantics to JSON text parsing and makes typed variant values incompatible with the obvious function names. +- **Change RFC 022 JSON helpers to return variants.** Rejected because it would silently change the meaning of already-defined string-backed payload helpers and make simple normalization workflows depend on a richer value model. 
+- **Expose backend-native variant functions as compatibility aliases.** Rejected as the portable core because backend-native null behavior, path syntax, and type names differ. +- **Represent variants as ordinary structs or maps.** Rejected unless the value carries distinct variant logical type identity; ordinary nested values do not by themselves encode semi-structured null and kind semantics. + +## Drawbacks + +- Variant logical values add a new type family to expression planning and backend capability reporting. +- Cross-backend support will be uneven because engines differ in native semi-structured support. +- Keeping SQL null and variant null distinct requires careful documentation, tests, and adapter validation. +- Variant path access can become a large surface if not kept separate from parser and generator responsibilities. + +## Layers affected + +- **InQL specification** — variant values must be distinct from strings, bytes, typed nested values, and sketch values. +- **InQL library package** — public helpers must expose variant parse, path access, and predicate functions only with explicit registry metadata. +- **Incan compiler** — typechecking may need enough helper metadata to reject string expressions where variant expressions are required. +- **Execution / interchange** — Prism, Substrait lowering, and backend adapters must preserve variant type identity and SQL-null versus variant-null behavior or reject unsupported operations. +- **Documentation** — function references must present variant predicates as variant operations, not as JSON-text parser shortcuts. + +## Unresolved questions + +- What is the public type spelling for variant expressions? +- Should variant-returning JSON parse helpers use new names such as `parse_variant_json`, or should RFC 022 helpers gain explicit options that return variants? +- Which scalar kinds are required before Planned status: decimal, binary, date, interval, or only the minimum JSON-compatible set plus timestamp? 
+- What path expression grammar should variant access use, and should it match RFC 022 JSON path helper strings? +- Should missing object keys and out-of-range array indexes return SQL null, a missing sentinel, or an error in strict modes? + + diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md index d7a1d9f..fd30dfc 100644 --- a/docs/rfcs/README.md +++ b/docs/rfcs/README.md @@ -31,6 +31,7 @@ InQL uses its **own** RFC series (starting at 000), independent of the [Incan la | [022][rfc-022] | Implemented | Semi-structured and format functions | | | [023][rfc-023] | Draft | Approximate and sketch functions | | | [024][rfc-024] | Implemented | Function extension policy | | +| [026][rfc-026] | Draft | Semi-structured variant logical values | | @@ -67,4 +68,5 @@ New RFCs should follow [TEMPLATE.md] (aligned with Incan’s RFC structure, adap [rfc-022]: 022_semi_structured_format_functions.md [rfc-023]: 023_approximate_sketch_functions.md [rfc-024]: 024_function_extension_policy.md +[rfc-026]: 026_semi_structured_variant_values.md [incan-rfcs]: https://github.com/dannys-code-corner/incan/tree/main/workspaces/docs-site/docs/RFCs From 85a433fdd0d9a8ab98e86c1f86aa3790de8bf491 Mon Sep 17 00:00:00 2001 From: Danny Meijer Date: Thu, 28 May 2026 12:42:05 +0200 Subject: [PATCH 06/13] docs - link RFC 026 issue --- docs/rfcs/026_semi_structured_variant_values.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rfcs/026_semi_structured_variant_values.md b/docs/rfcs/026_semi_structured_variant_values.md index b813d46..65ff770 100644 --- a/docs/rfcs/026_semi_structured_variant_values.md +++ b/docs/rfcs/026_semi_structured_variant_values.md @@ -9,7 +9,7 @@ - InQL RFC 020 (nested data functions) - InQL RFC 022 (semi-structured and format functions) - InQL RFC 024 (function extension policy) -- **Issue:** — +- **Issue:** [InQL #52](https://github.com/dannys-code-corner/InQL/issues/52) - **RFC PR:** [InQL #49](https://github.com/dannys-code-corner/InQL/pull/49) - 
**Written against:** Incan v0.3-era InQL - **Shipped in:** — From 3f848e37dd65c656aadcb0c6c1c5338580d6ee39 Mon Sep 17 00:00:00 2001 From: Danny Meijer Date: Thu, 28 May 2026 17:47:10 +0200 Subject: [PATCH 07/13] feature - 39 move format UDFs to Incan bridge --- .github/workflows/ci.yml | 2 +- docs/language/reference/functions/format.md | 3 + docs/release_notes/v0_1.md | 2 +- .../022_semi_structured_format_functions.md | 4 +- incan.lock | 34 +- incan.toml | 8 +- rust/inql-datafusion-format/Cargo.toml | 15 - rust/inql-datafusion-format/src/lib.rs | 649 ------------- src/session/datafusion_backend.incn | 2 +- src/session/datafusion_format_functions.incn | 876 ++++++++++++++++++ tests/test_session_projection.incn | 30 +- 11 files changed, 920 insertions(+), 705 deletions(-) delete mode 100644 rust/inql-datafusion-format/Cargo.toml delete mode 100644 rust/inql-datafusion-format/src/lib.rs create mode 100644 src/session/datafusion_format_functions.incn diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 8582983..c9007fd 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -20,7 +20,7 @@ concurrency: env: CARGO_TERM_COLOR: always INCAN_REF: release/v0.3 - EXPECTED_INCAN_VERSION: 0.3.0-rc21 + EXPECTED_INCAN_VERSION: 0.3.0-rc23 RUST_BACKTRACE: 1 INCAN_NO_BANNER: 1 diff --git a/docs/language/reference/functions/format.md b/docs/language/reference/functions/format.md index b669ba4..ae282bc 100644 --- a/docs/language/reference/functions/format.md +++ b/docs/language/reference/functions/format.md @@ -52,3 +52,6 @@ Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal stri JSON and CSV helpers validate, normalize, and project payload text. They do not read external files or introduce a dynamic variant value type. + +The DataFusion adapter executes the full RFC 022 catalog with native DataFusion functions where available and +Incan-authored adapter callbacks for helpers that DataFusion does not expose natively. 
diff --git a/docs/release_notes/v0_1.md b/docs/release_notes/v0_1.md index 5dc9a10..ec232d9 100644 --- a/docs/release_notes/v0_1.md +++ b/docs/release_notes/v0_1.md @@ -17,7 +17,7 @@ Entries will be filled in as work lands (link RFCs and PRs when applicable). - **Nested data functions:** RFC 020 adds registry-backed scalar helpers for array construction/access, cardinality, containment, overlap, sorting, set-like operations, joining, slicing, reversing, scalar array flattening, map construction/access, map key/value/entry extraction, map key containment, and named struct construction. These helpers lower through Substrait extension metadata without introducing generator semantics, with representative DataFusion-backed Session coverage for composable array projection paths. - **Generator functions:** RFC 021 adds registry-backed generator applications for `explode(...)`, `explode_outer(...)`, `posexplode(...)`, `posexplode_outer(...)`, `inline(...)`, `inline_outer(...)`, portable `flatten(...)`, and `stack(...)`. Generators remain relation-shaping operations applied with `generate(...)`; they preserve input columns, require explicit output aliases, lower through the current Substrait extension-relation gap encoding, and execute through the DataFusion Session adapter with concrete output-column materialization. - **Window functions:** RFC 019 adds `window()` specs, explicit row/range frame bounds, ranking and distribution helpers (`row_number`, `rank`, `dense_rank`, `percent_rank`, `cume_dist`, `ntile`), offset and value helpers (`lag`, `lead`, `first_value`, `last_value`, `nth_value`), and aggregate-over-window placement through `with_window_column(...)`. Portable window helpers require explicit ordering where appropriate, lower through Substrait `ConsistentPartitionWindowRel`, and execute through the DataFusion session adapter. 
-- **Format functions:** RFC 022 adds scalar payload helpers for deterministic hashes (`md5`, `sha1`, `sha224`, `sha256`, `sha384`, `sha512`, `sha2`, `crc32`, and `xxhash64`), URL parsing/encoding/decoding, JSON validation/path/schema helpers, and CSV row/schema helpers. Format helpers lower through registry-owned Substrait metadata and execute through adapter-owned DataFusion UDF registration; dynamic variant predicates remain reserved until InQL defines the variant value model they inspect. +- **Format functions:** RFC 022 adds scalar payload helpers for deterministic hashes (`md5`, `sha1`, `sha224`, `sha256`, `sha384`, `sha512`, `sha2`, `crc32`, and `xxhash64`), URL parsing/encoding/decoding, JSON validation/path/schema helpers, and CSV row/schema helpers. Format helpers lower through registry-owned Substrait metadata; the DataFusion adapter executes the full helper set with native functions where available and Incan-authored adapter callbacks for non-native helpers. - **Function registry:** RFC 014 adds declaration-site registry decorators for the current public helper surface, including stable function references, checked signature projection, lifecycle metadata, behavior categories, alias policy, Substrait mapping categories, and checked API metadata drift validation. - **Function extension policy:** InQL RFC 024 policy metadata now distinguishes portable core functions, namespaced extension-only functions, opt-in compatibility aliases, engine-specific functions, and rejected compatibility requests without adding an extension plugin system or backend-owned semantics. - **Projection:** builder-based `with_column`, `add`, `mul`, and literal expression helpers now lower derived columns through Prism, Substrait, and Session execution. 
diff --git a/docs/rfcs/022_semi_structured_format_functions.md b/docs/rfcs/022_semi_structured_format_functions.md index aed9ee2..bcf4bea 100644 --- a/docs/rfcs/022_semi_structured_format_functions.md +++ b/docs/rfcs/022_semi_structured_format_functions.md @@ -120,7 +120,7 @@ This RFC is additive. It should not change existing CSV ingestion behavior. ### Resolved - Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal strings. -- Portable concrete hash helpers are `md5`, `sha1`, `sha224`, `sha256`, `sha384`, `sha512`, `crc32`, and `xxhash64`, each with an honest Substrait extension mapping and DataFusion-backed execution coverage. +- Portable concrete hash helpers are `md5`, `sha1`, `sha224`, `sha256`, `sha384`, `sha512`, `crc32`, and `xxhash64`, each with an honest Substrait extension mapping. The DataFusion adapter validates materialized execution for the full helper set, using native DataFusion functions where available and Incan-authored adapter callbacks where DataFusion has no built-in implementation. - `sha2(expr, bit_length)` is a compatibility helper, not a separate backend mapping. It rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths and rejects unsupported values. - URL helpers accept scalar URL or component strings. `url_decode(...)` is strict and fails malformed percent escapes; `try_url_decode(...)` returns null for malformed percent escapes. - JSON helpers accept scalar JSON payload strings. Strict helpers fail invalid JSON; `try_from_json(...)` returns null for invalid JSON; schema and path helpers are deterministic over the provided payload and literal schema/path arguments. @@ -133,7 +133,7 @@ This RFC is additive. It should not change existing CSV ingestion behavior. 2. Add registry-backed URL, JSON, and CSV scalar payload helpers under logical function families. 3. Add stable Substrait extension anchors for concrete helpers. 4. 
Keep `sha2(...)` as a compatibility rewrite over concrete helpers rather than a second mapping. -5. Add focused helper, registry, Substrait lowering, and DataFusion session tests with concrete output values. +5. Add focused helper, registry, Substrait lowering, and DataFusion-backed session tests with concrete output values across the full helper set. 6. Add user-facing format-function docs and release notes. 7. Record semi-structured variant values as InQL RFC 026 rather than accepting fake JSON-text predicates in RFC 022. diff --git a/incan.lock b/incan.lock index 137c218..97b484b 100644 --- a/incan.lock +++ b/incan.lock @@ -3,8 +3,8 @@ [incan] format = 1 -incan-version = "0.3.0-rc21" -deps-fingerprint = "sha256:3610572303b9930ff93a467137924cf89c26e05467169061f3fb01f6175d8972" +incan-version = "0.3.0-rc23" +deps-fingerprint = "sha256:825926ed0a97c47f458cb9e9ed6c142473cb37e7f047d3de1a32263f1cef5097" cargo-features = [] cargo-no-default-features = false cargo-all-features = false @@ -1824,14 +1824,14 @@ dependencies = [ [[package]] name = "incan_core" -version = "0.3.0-rc21" +version = "0.3.0-rc23" dependencies = [ "serde", ] [[package]] name = "incan_derive" -version = "0.3.0-rc21" +version = "0.3.0-rc23" dependencies = [ "proc-macro2", "quote", @@ -1840,10 +1840,12 @@ dependencies = [ [[package]] name = "incan_stdlib" -version = "0.3.0-rc21" +version = "0.3.0-rc23" dependencies = [ "incan_core", "incan_derive", + "serde", + "serde_json", "tokio", "xxhash-rust", ] @@ -1862,32 +1864,24 @@ dependencies = [ [[package]] name = "inql" -version = "0.3.0-rc21" +version = "0.3.0-rc23" dependencies = [ "byteorder", + "crc32fast", "datafusion", + "datafusion-common", + "datafusion-expr", "datafusion-substrait", "encoding_rs", "incan_derive", "incan_stdlib", - "inql-datafusion-format", "prost", "prost-types", "rustix", - "substrait 0.63.0", -] - -[[package]] -name = "inql-datafusion-format" -version = "0.1.0" -dependencies = [ - "crc32fast", - "csv", - "datafusion", - "hex", 
- "percent-encoding", - "serde_json", + "serde", "sha1", + "sha2", + "substrait 0.63.0", "url", "xxhash-rust", ] diff --git a/incan.toml b/incan.toml index 77d73bd..cada06b 100644 --- a/incan.toml +++ b/incan.toml @@ -10,5 +10,11 @@ prost-types = "0.14" substrait = { version = "0.63", features = ["protoc"] } # Datafusion is InQL default query engine, and provides the Substrait to Datafusion translation. datafusion = "53" +datafusion_common = { package = "datafusion-common", version = "53" } +datafusion_expr = { package = "datafusion-expr", version = "53" } datafusion-substrait = { version = "53", features = ["protoc"] } -inql-datafusion-format = { path = "rust/inql-datafusion-format" } +crc32fast = "1.5" +sha1 = "0.10" +sha2 = "0.10.9" +url = "2.5" +xxhash_rust = { package = "xxhash-rust", version = "0.8", features = ["xxh64"] } diff --git a/rust/inql-datafusion-format/Cargo.toml b/rust/inql-datafusion-format/Cargo.toml deleted file mode 100644 index 81ce0da..0000000 --- a/rust/inql-datafusion-format/Cargo.toml +++ /dev/null @@ -1,15 +0,0 @@ -[package] -name = "inql-datafusion-format" -version = "0.1.0" -edition = "2024" - -[dependencies] -crc32fast = "1" -csv = "1" -datafusion = "53" -hex = "0.4" -percent-encoding = "2" -serde_json = "1" -sha1 = "0.10" -url = "2" -xxhash-rust = { version = "0.8", features = ["xxh64"] } diff --git a/rust/inql-datafusion-format/src/lib.rs b/rust/inql-datafusion-format/src/lib.rs deleted file mode 100644 index 7c79691..0000000 --- a/rust/inql-datafusion-format/src/lib.rs +++ /dev/null @@ -1,649 +0,0 @@ -use std::sync::Arc; - -use crc32fast::Hasher as Crc32Hasher; -use datafusion::arrow::array::{ - Array, ArrayRef, BooleanArray, Int64Array, LargeStringArray, StringArray, StringViewArray, -}; -use datafusion::arrow::datatypes::DataType; -use datafusion::common::{DataFusionError, Result, ScalarValue}; -use datafusion::execution::context::SessionContext; -use datafusion::logical_expr::{ColumnarValue, Volatility, create_udf}; -use 
percent_encoding::{AsciiSet, CONTROLS, percent_decode_str, utf8_percent_encode}; -use serde_json::{Map, Value}; -use sha1::{Digest, Sha1}; -use url::Url; -use xxhash_rust::xxh64::xxh64; - -const URL_COMPONENT_ENCODE_SET: &AsciiSet = &CONTROLS - .add(b' ') - .add(b'"') - .add(b'#') - .add(b'%') - .add(b'<') - .add(b'>') - .add(b'?') - .add(b'`') - .add(b'{') - .add(b'}'); - -pub fn register_format_udfs(ctx: SessionContext) { - for udf in [ - unary_string_udf("sha1", hash_sha1), - unary_string_udf("crc32", hash_crc32), - unary_string_udf("xxhash64", hash_xxhash64), - unary_string_udf("url_encode", encode_url_component), - unary_string_udf("url_decode", decode_url_component), - unary_string_udf("try_url_decode", try_decode_url_component), - binary_string_udf("parse_url", parse_url_part), - unary_string_udf("parse_json", parse_json), - unary_string_udf("check_json", check_json), - unary_string_udf("schema_of_json", schema_of_json), - unary_string_udf("json_array_length", json_array_length), - unary_string_udf("json_object_keys", json_object_keys), - binary_string_udf("get_json_object", get_json_object), - binary_string_udf("json_extract_path_text", json_extract_path_text), - binary_string_udf("from_json", from_json), - binary_string_udf("try_from_json", try_from_json), - unary_string_udf("to_json", to_json), - unary_string_udf("schema_of_csv", schema_of_csv), - binary_string_udf("from_csv", from_csv), - unary_string_udf("to_csv", to_csv), - ] { - ctx.register_udf(udf); - } -} - -fn unary_string_udf( - name: &'static str, - fun: fn(Option<&str>) -> Result<ScalarValue>, -) -> datafusion::logical_expr::ScalarUDF { - create_udf( - name, - vec![DataType::Utf8], - return_type_for(name), - Volatility::Immutable, - Arc::new(move |args| apply_unary_string(args, fun)), - ) -} - -fn binary_string_udf( - name: &'static str, - fun: fn(Option<&str>, Option<&str>) -> Result<ScalarValue>, -) -> datafusion::logical_expr::ScalarUDF { - create_udf( - name, - vec![DataType::Utf8, DataType::Utf8], - DataType::Utf8, 
- Volatility::Immutable, - Arc::new(move |args| apply_binary_string(args, fun)), - ) -} - -fn return_type_for(name: &str) -> DataType { - match name { - "check_json" => DataType::Boolean, - "json_array_length" => DataType::Int64, - _ => DataType::Utf8, - } -} - -fn apply_unary_string( - args: &[ColumnarValue], - fun: fn(Option<&str>) -> Result<ScalarValue>, -) -> Result<ColumnarValue> { - require_arity(args, 1)?; - if all_scalar(args) { - let value = scalar_string(args[0].clone())?; - return fun(value.as_deref()).map(ColumnarValue::Scalar); - } - - let len = row_count(args); - let mut string_values: Vec<Option<String>> = Vec::with_capacity(len); - let mut bool_values: Vec<Option<bool>> = Vec::with_capacity(len); - let mut int_values: Vec<Option<i64>> = Vec::with_capacity(len); - let mut result_kind: Option<&'static str> = None; - for idx in 0..len { - let value = string_at(&args[0], idx)?; - let scalar = fun(value.as_deref())?; - match scalar { - ScalarValue::Utf8(item) => { - result_kind = Some("string"); - string_values.push(item); - } - ScalarValue::Boolean(item) => { - result_kind = Some("bool"); - bool_values.push(item); - } - ScalarValue::Int64(item) => { - result_kind = Some("int"); - int_values.push(item); - } - _ => return internal_error("unexpected format UDF return type"), - } - } - array_result( - result_kind.unwrap_or("string"), - string_values, - bool_values, - int_values, - ) -} - -fn apply_binary_string( - args: &[ColumnarValue], - fun: fn(Option<&str>, Option<&str>) -> Result<ScalarValue>, -) -> Result<ColumnarValue> { - require_arity(args, 2)?; - if all_scalar(args) { - let left = scalar_string(args[0].clone())?; - let right = scalar_string(args[1].clone())?; - return fun(left.as_deref(), right.as_deref()).map(ColumnarValue::Scalar); - } - - let len = row_count(args); - let mut values: Vec<Option<String>> = Vec::with_capacity(len); - for idx in 0..len { - let left = string_at(&args[0], idx)?; - let right = string_at(&args[1], idx)?; - match fun(left.as_deref(), right.as_deref())? 
{ - ScalarValue::Utf8(item) => values.push(item), - _ => return internal_error("unexpected binary format UDF return type"), - } - } - Ok(ColumnarValue::Array(Arc::new(StringArray::from(values)))) -} - -fn require_arity(args: &[ColumnarValue], expected: usize) -> Result<()> { - if args.len() == expected { - return Ok(()); - } - internal_error(format!( - "format UDF expected {expected} arguments, got {}", - args.len() - )) -} - -fn all_scalar(args: &[ColumnarValue]) -> bool { - args.iter() - .all(|arg| matches!(arg, ColumnarValue::Scalar(_))) -} - -fn row_count(args: &[ColumnarValue]) -> usize { - args.iter() - .find_map(|arg| match arg { - ColumnarValue::Array(array) => Some(array.len()), - ColumnarValue::Scalar(_) => None, - }) - .unwrap_or(1) -} - -fn array_result( - kind: &str, - string_values: Vec<Option<String>>, - bool_values: Vec<Option<bool>>, - int_values: Vec<Option<i64>>, -) -> Result<ColumnarValue> { - match kind { - "bool" => Ok(ColumnarValue::Array(Arc::new(BooleanArray::from( - bool_values, - )))), - "int" => Ok(ColumnarValue::Array(Arc::new(Int64Array::from(int_values)))), - _ => Ok(ColumnarValue::Array(Arc::new(StringArray::from( - string_values, - )))), - } -} - -fn scalar_string(arg: ColumnarValue) -> Result<Option<String>> { - match arg { - ColumnarValue::Scalar(ScalarValue::Utf8(value)) => Ok(value), - ColumnarValue::Scalar(ScalarValue::LargeUtf8(value)) => Ok(value), - ColumnarValue::Scalar(ScalarValue::Utf8View(value)) => Ok(value), - ColumnarValue::Scalar(ScalarValue::Null) => Ok(None), - ColumnarValue::Scalar(value) => Ok(Some(value.to_string())), - ColumnarValue::Array(array) => string_at(&ColumnarValue::Array(array), 0), - } -} - -fn string_at(arg: &ColumnarValue, idx: usize) -> Result<Option<String>> { - match arg { - ColumnarValue::Scalar(value) => scalar_string(ColumnarValue::Scalar(value.clone())), - ColumnarValue::Array(array) => string_array_value(array, idx), - } -} - -fn string_array_value(array: &ArrayRef, idx: usize) -> Result<Option<String>> { - if array.is_null(idx) { - return Ok(None); - } - match array.data_type() { - 
DataType::Utf8 => { - let strings = array - .as_any() - .downcast_ref::<StringArray>() - .ok_or_else(|| { - DataFusionError::Internal( - "format UDF could not downcast Utf8 array".to_string(), - ) - })?; - Ok(Some(strings.value(idx).to_string())) - } - DataType::LargeUtf8 => { - let strings = array - .as_any() - .downcast_ref::<LargeStringArray>() - .ok_or_else(|| { - DataFusionError::Internal( - "format UDF could not downcast LargeUtf8 array".to_string(), - ) - })?; - Ok(Some(strings.value(idx).to_string())) - } - DataType::Utf8View => { - let strings = array - .as_any() - .downcast_ref::<StringViewArray>() - .ok_or_else(|| { - DataFusionError::Internal( - "format UDF could not downcast Utf8View array".to_string(), - ) - })?; - Ok(Some(strings.value(idx).to_string())) - } - _ => Ok(Some(format!("{:?}", array))), - } -} - -fn hash_sha1(value: Option<&str>) -> Result<ScalarValue> { - match value { - Some(text) => { - let mut hasher = Sha1::new(); - hasher.update(text.as_bytes()); - Ok(ScalarValue::Utf8(Some(hex::encode(hasher.finalize())))) - } - None => Ok(ScalarValue::Utf8(None)), - } -} - -fn hash_crc32(value: Option<&str>) -> Result<ScalarValue> { - match value { - Some(text) => { - let mut hasher = Crc32Hasher::new(); - hasher.update(text.as_bytes()); - Ok(ScalarValue::Utf8(Some(format!( - "{:08x}", - hasher.finalize() - )))) - } - None => Ok(ScalarValue::Utf8(None)), - } -} - -fn hash_xxhash64(value: Option<&str>) -> Result<ScalarValue> { - match value { - Some(text) => Ok(ScalarValue::Utf8(Some(format!( - "{:016x}", - xxh64(text.as_bytes(), 0) - )))), - None => Ok(ScalarValue::Utf8(None)), - } -} - -fn encode_url_component(value: Option<&str>) -> Result<ScalarValue> { - Ok(ScalarValue::Utf8(value.map(|text| { - utf8_percent_encode(text, URL_COMPONENT_ENCODE_SET).to_string() - }))) -} - -fn decode_url_component(value: Option<&str>) -> Result<ScalarValue> { - if value.is_none() { - return Ok(ScalarValue::Utf8(None)); - } - match try_decode_url_text(value)? 
{ - Some(text) => Ok(ScalarValue::Utf8(Some(text))), - None => internal_error("invalid percent-encoded URL component"), - } -} - -fn try_decode_url_component(value: Option<&str>) -> Result<ScalarValue> { - Ok(ScalarValue::Utf8(try_decode_url_text(value)?)) -} - -fn try_decode_url_text(value: Option<&str>) -> Result<Option<String>> { - match value { - Some(text) => { - if !percent_escapes_are_valid(text) { - return Ok(None); - } - match percent_decode_str(text).decode_utf8() { - Ok(decoded) => Ok(Some(decoded.to_string())), - Err(_) => Ok(None), - } - } - None => Ok(None), - } -} - -fn percent_escapes_are_valid(value: &str) -> bool { - let bytes = value.as_bytes(); - let mut idx = 0; - while idx < bytes.len() { - if bytes[idx] == b'%' { - if idx + 2 >= bytes.len() - || !bytes[idx + 1].is_ascii_hexdigit() - || !bytes[idx + 2].is_ascii_hexdigit() - { - return false; - } - idx += 3; - } else { - idx += 1; - } - } - true -} - -fn parse_url_part(url: Option<&str>, part: Option<&str>) -> Result<ScalarValue> { - let Some(raw_url) = url else { - return Ok(ScalarValue::Utf8(None)); - }; - let Some(raw_part) = part else { - return Ok(ScalarValue::Utf8(None)); - }; - let parsed = Url::parse(raw_url).map_err(|err| DataFusionError::Execution(err.to_string()))?; - let value = match raw_part.to_ascii_lowercase().as_str() { - "scheme" => Some(parsed.scheme().to_string()), - "host" => parsed.host_str().map(str::to_string), - "path" => Some(parsed.path().to_string()), - "query" => parsed.query().map(str::to_string), - "fragment" => parsed.fragment().map(str::to_string), - "port" => parsed.port().map(|port| port.to_string()), - "username" => Some(parsed.username().to_string()), - "password" => parsed.password().map(str::to_string), - other => return internal_error(format!("unknown parse_url part `{other}`")), - }; - Ok(ScalarValue::Utf8(value)) -} - -fn parse_json(value: Option<&str>) -> Result<ScalarValue> { - match value { - Some(text) => Ok(ScalarValue::Utf8(Some(normalized_json(text)?))), - None => Ok(ScalarValue::Utf8(None)), - } -} 
- -fn check_json(value: Option<&str>) -> Result<ScalarValue> { - match value { - Some(text) => Ok(ScalarValue::Boolean(Some( - serde_json::from_str::<Value>(text).is_ok(), - ))), - None => Ok(ScalarValue::Boolean(None)), - } -} - -fn schema_of_json(value: Option<&str>) -> Result<ScalarValue> { - match value { - Some(text) => { - let parsed = parse_json_value(text)?; - Ok(ScalarValue::Utf8(Some(json_schema(&parsed)))) - } - None => Ok(ScalarValue::Utf8(None)), - } -} - -fn json_array_length(value: Option<&str>) -> Result<ScalarValue> { - match value { - Some(text) => { - let parsed = parse_json_value(text)?; - let length = parsed.as_array().map(|items| items.len() as i64); - Ok(ScalarValue::Int64(length)) - } - None => Ok(ScalarValue::Int64(None)), - } -} - -fn json_object_keys(value: Option<&str>) -> Result<ScalarValue> { - match value { - Some(text) => { - let parsed = parse_json_value(text)?; - let keys: Vec<Value> = parsed - .as_object() - .map(|object| object.keys().cloned().map(Value::String).collect()) - .unwrap_or_default(); - Ok(ScalarValue::Utf8(Some(Value::Array(keys).to_string()))) - } - None => Ok(ScalarValue::Utf8(None)), - } -} - -fn get_json_object(value: Option<&str>, path: Option<&str>) -> Result<ScalarValue> { - json_path_value(value, path, false) -} - -fn json_extract_path_text(value: Option<&str>, path: Option<&str>) -> Result<ScalarValue> { - json_path_value(value, path, true) -} - -fn json_path_value( - value: Option<&str>, - path: Option<&str>, - text_mode: bool, -) -> Result<ScalarValue> { - let Some(raw_json) = value else { - return Ok(ScalarValue::Utf8(None)); - }; - let Some(raw_path) = path else { - return Ok(ScalarValue::Utf8(None)); - }; - let parsed = parse_json_value(raw_json)?; - let Some(selected) = select_json_path(&parsed, raw_path) else { - return Ok(ScalarValue::Utf8(None)); - }; - if text_mode { - if let Some(text) = selected.as_str() { - return Ok(ScalarValue::Utf8(Some(text.to_string()))); - } - } - Ok(ScalarValue::Utf8(Some(selected.to_string()))) -} - -fn from_json(value: Option<&str>, _schema: Option<&str>) -> Result<ScalarValue> { - 
parse_json(value) -} - -fn try_from_json(value: Option<&str>, _schema: Option<&str>) -> Result<ScalarValue> { - match value { - Some(text) => match normalized_json(text) { - Ok(parsed) => Ok(ScalarValue::Utf8(Some(parsed))), - Err(_) => Ok(ScalarValue::Utf8(None)), - }, - None => Ok(ScalarValue::Utf8(None)), - } -} - -fn to_json(value: Option<&str>) -> Result<ScalarValue> { - Ok(ScalarValue::Utf8( - value.map(|text| Value::String(text.to_string()).to_string()), - )) -} - -fn normalized_json(text: &str) -> Result<String> { - Ok(parse_json_value(text)?.to_string()) -} - -fn parse_json_value(text: &str) -> Result<Value> { - serde_json::from_str::<Value>(text).map_err(|err| DataFusionError::Execution(err.to_string())) -} - -fn select_json_path<'a>(value: &'a Value, path: &str) -> Option<&'a Value> { - if path == "$" || path.is_empty() { - return Some(value); - } - let mut current = value; - let path = path - .strip_prefix("$.") - .or_else(|| path.strip_prefix('$')) - .unwrap_or(path); - for segment in path.split('.') { - if segment.is_empty() { - continue; - } - current = select_json_segment(current, segment)?; - } - Some(current) -} - -fn select_json_segment<'a>(value: &'a Value, segment: &str) -> Option<&'a Value> { - if let Some((field, index_part)) = segment.split_once('[') { - let nested = if field.is_empty() { - value - } else { - value.as_object()?.get(field)? 
- }; - let index_text = index_part.strip_suffix(']')?; - let index: usize = index_text.parse().ok()?; - return nested.as_array()?.get(index); - } - value.as_object()?.get(segment) -} - -fn json_schema(value: &Value) -> String { - match value { - Value::Null => "NULL".to_string(), - Value::Bool(_) => "BOOLEAN".to_string(), - Value::Number(number) => { - if number.is_i64() || number.is_u64() { - "BIGINT".to_string() - } else { - "DOUBLE".to_string() - } - } - Value::String(_) => "STRING".to_string(), - Value::Array(values) => { - let item_schema = values - .first() - .map(json_schema) - .unwrap_or_else(|| "UNKNOWN".to_string()); - format!("ARRAY<{item_schema}>") - } - Value::Object(object) => { - let fields = object - .iter() - .map(|(name, item)| format!("{name}: {}", json_schema(item))) - .collect::<Vec<_>>() - .join(", "); - format!("STRUCT<{fields}>") - } - } -} - -fn schema_of_csv(value: Option<&str>) -> Result<ScalarValue> { - let Some(text) = value else { - return Ok(ScalarValue::Utf8(None)); - }; - let record = first_csv_record(text)?; - let fields = record - .iter() - .enumerate() - .map(|(idx, item)| format!("_c{idx}: {}", scalar_schema(item))) - .collect::<Vec<_>>() - .join(", "); - Ok(ScalarValue::Utf8(Some(format!("STRUCT<{fields}>")))) -} - -fn from_csv(value: Option<&str>, schema: Option<&str>) -> Result<ScalarValue> { - let Some(text) = value else { - return Ok(ScalarValue::Utf8(None)); - }; - let record = first_csv_record(text)?; - let names = csv_schema_field_names(schema.unwrap_or("")); - if names.is_empty() { - let values = record - .iter() - .map(|field| Value::String(field.to_string())) - .collect(); - return Ok(ScalarValue::Utf8(Some(Value::Array(values).to_string()))); - } - let mut object = Map::new(); - for (idx, name) in names.iter().enumerate() { - let value = record.get(idx).unwrap_or(""); - object.insert(name.clone(), Value::String(value.to_string())); - } - Ok(ScalarValue::Utf8(Some(Value::Object(object).to_string()))) -} - -fn to_csv(value: Option<&str>) -> Result<ScalarValue> { - let 
Some(text) = value else { - return Ok(ScalarValue::Utf8(None)); - }; - let fields = match serde_json::from_str::<Value>(text) { - Ok(Value::Array(values)) => values.into_iter().map(csv_json_value).collect(), - Ok(Value::Object(object)) => object.into_values().map(csv_json_value).collect(), - _ => vec![text.to_string()], - }; - let mut writer = csv::WriterBuilder::new() - .has_headers(false) - .from_writer(vec![]); - writer - .write_record(fields) - .map_err(|err| DataFusionError::Execution(err.to_string()))?; - let bytes = writer - .into_inner() - .map_err(|err| DataFusionError::Execution(err.to_string()))?; - let row = - String::from_utf8(bytes).map_err(|err| DataFusionError::Execution(err.to_string()))?; - Ok(ScalarValue::Utf8(Some( - row.trim_end_matches(['\r', '\n']).to_string(), - ))) -} - -fn first_csv_record(text: &str) -> Result<csv::StringRecord> { - let mut reader = csv::ReaderBuilder::new() - .has_headers(false) - .from_reader(text.as_bytes()); - match reader.records().next() { - Some(record) => record.map_err(|err| DataFusionError::Execution(err.to_string())), - None => Ok(csv::StringRecord::new()), - } -} - -fn csv_schema_field_names(schema: &str) -> Vec<String> { - schema - .trim() - .trim_start_matches("STRUCT<") - .trim_end_matches('>') - .split(',') - .filter_map(|part| { - let name = part.trim().split([':', ' ']).next().unwrap_or("").trim(); - if name.is_empty() { - None - } else { - Some(name.to_string()) - } - }) - .collect() -} - -fn scalar_schema(value: &str) -> &'static str { - if value.parse::<i64>().is_ok() { - "BIGINT" - } else if value.parse::<f64>().is_ok() { - "DOUBLE" - } else if value.eq_ignore_ascii_case("true") || value.eq_ignore_ascii_case("false") { - "BOOLEAN" - } else { - "STRING" - } -} - -fn csv_json_value(value: Value) -> String { - match value { - Value::Null => String::new(), - Value::String(text) => text, - other => other.to_string(), - } -} - -fn internal_error<T>(message: impl Into<String>) -> Result<T> { - Err(DataFusionError::Internal(message.into())) -} diff --git 
a/src/session/datafusion_backend.incn b/src/session/datafusion_backend.incn index 633464d..920cf2f 100644 --- a/src/session/datafusion_backend.incn +++ b/src/session/datafusion_backend.incn @@ -72,7 +72,6 @@ from rust::datafusion::prelude import ( from rust::datafusion::dataframe import DataFrameWriteOptions from rust::datafusion_substrait::substrait::proto import Plan as ConsumerPlan from rust::datafusion_substrait::logical_plan::consumer import from_substrait_plan -from rust::inql_datafusion_format import register_format_udfs from backends import SourceKind, TableSource from dataset.materialization import DataFrameMaterialization from session.backend_types import BackendError, BackendErrorKind, BackendRegistration, backend_error @@ -96,6 +95,7 @@ from substrait.relations import project_rel_with_expressions, read_named_table_r from substrait.schema import RowColumnSpec, SubstraitPrimitiveKind from substrait.schema_registry import register_named_table_schema from substrait.window_metadata import WINDOW_NULL_TREATMENT_OPTION, WINDOW_OUTPUT_NAME_OPTION +from session.datafusion_format_functions import register_format_udfs @derive(Clone) diff --git a/src/session/datafusion_format_functions.incn b/src/session/datafusion_format_functions.incn new file mode 100644 index 0000000..57f7a97 --- /dev/null +++ b/src/session/datafusion_format_functions.incn @@ -0,0 +1,876 @@ +"""DataFusion execution bridge for InQL-owned scalar format functions.""" + +from rust::crc32fast import Hasher as Crc32Hasher +from rust::datafusion::arrow::array import ArrayRef, BooleanArray, Int64Array, StringArray +from rust::datafusion::arrow::datatypes import DataType +from rust::datafusion::execution::context import SessionContext +from rust::datafusion_common import DataFusionError, ScalarValue +from rust::datafusion_expr import ColumnarValue, ScalarFunctionImplementation, ScalarUDF, Volatility, create_udf +from rust::sha1 import Sha1 +from rust::sha2 import Digest +from rust::std::primitive import 
i64 as RustI64, usize as RustUsize +from rust::std::string import String as RustString +from rust::std::sync import Arc +from rust::url import Url +from rust::xxhash_rust::xxh64 import Xxh64 +from std.encoding.hex import hexlify +from std.json import JsonError, JsonKind, JsonValue + + +pub def register_format_udfs(ctx: SessionContext) -> None: + """Register RFC 022 scalar format functions that DataFusion does not expose natively.""" + # DataFusion stores UDF callbacks as 'static Rust closures. Keep the captured function identity as a literal inside + # each callback instead of borrowing a constructor-local name. + ctx.register_udf(_unary_string_udf("sha1", Arc.from((args) => _apply_unary_string("sha1", args.to_vec())))) + ctx.register_udf(_unary_string_udf("crc32", Arc.from((args) => _apply_unary_string("crc32", args.to_vec())))) + ctx.register_udf(_unary_string_udf("xxhash64", Arc.from((args) => _apply_unary_string("xxhash64", args.to_vec())))) + ctx.register_udf( + _unary_string_udf("url_encode", Arc.from((args) => _apply_unary_string("url_encode", args.to_vec()))), + ) + ctx.register_udf( + _unary_string_udf("url_decode", Arc.from((args) => _apply_unary_string("url_decode", args.to_vec()))), + ) + ctx.register_udf( + _unary_string_udf("try_url_decode", Arc.from((args) => _apply_unary_string("try_url_decode", args.to_vec()))), + ) + ctx.register_udf( + _binary_string_udf("parse_url", Arc.from((args) => _apply_binary_string("parse_url", args.to_vec()))), + ) + ctx.register_udf( + _unary_string_udf("parse_json", Arc.from((args) => _apply_unary_string("parse_json", args.to_vec()))), + ) + ctx.register_udf(_unary_bool_udf("check_json", Arc.from((args) => _apply_unary_bool("check_json", args.to_vec())))) + ctx.register_udf( + _unary_string_udf("schema_of_json", Arc.from((args) => _apply_unary_string("schema_of_json", args.to_vec()))), + ) + ctx.register_udf( + _unary_int_udf("json_array_length", Arc.from((args) => _apply_unary_int("json_array_length", 
args.to_vec()))), + ) + ctx.register_udf( + _unary_string_udf("json_object_keys", Arc.from((args) => _apply_unary_string("json_object_keys", args.to_vec()))), + ) + ctx.register_udf( + _binary_string_udf("get_json_object", Arc.from((args) => _apply_binary_string("get_json_object", args.to_vec()))), + ) + ctx.register_udf( + _binary_string_udf( + "json_extract_path_text", + Arc.from((args) => _apply_binary_string("json_extract_path_text", args.to_vec())), + ), + ) + ctx.register_udf( + _binary_string_udf("from_json", Arc.from((args) => _apply_binary_string("from_json", args.to_vec()))), + ) + ctx.register_udf( + _binary_string_udf("try_from_json", Arc.from((args) => _apply_binary_string("try_from_json", args.to_vec()))), + ) + ctx.register_udf(_unary_string_udf("to_json", Arc.from((args) => _apply_unary_string("to_json", args.to_vec())))) + ctx.register_udf( + _unary_string_udf("schema_of_csv", Arc.from((args) => _apply_unary_string("schema_of_csv", args.to_vec()))), + ) + ctx.register_udf(_binary_string_udf("from_csv", Arc.from((args) => _apply_binary_string("from_csv", args.to_vec())))) + ctx.register_udf(_unary_string_udf("to_csv", Arc.from((args) => _apply_unary_string("to_csv", args.to_vec())))) + + +def _unary_string_udf(name: str, implementation: ScalarFunctionImplementation) -> ScalarUDF: + """Create one unary string-to-string DataFusion UDF backed by Incan source.""" + return create_udf(name, [DataType.Utf8], DataType.Utf8, Volatility.Immutable, implementation) + + +def _unary_bool_udf(name: str, implementation: ScalarFunctionImplementation) -> ScalarUDF: + """Create one unary string-to-boolean DataFusion UDF backed by Incan source.""" + return create_udf(name, [DataType.Utf8], DataType.Boolean, Volatility.Immutable, implementation) + + +def _unary_int_udf(name: str, implementation: ScalarFunctionImplementation) -> ScalarUDF: + """Create one unary string-to-int DataFusion UDF backed by Incan source.""" + return create_udf(name, [DataType.Utf8], 
DataType.Int64, Volatility.Immutable, implementation) + + +def _binary_string_udf(name: str, implementation: ScalarFunctionImplementation) -> ScalarUDF: + """Create one binary string-to-string DataFusion UDF backed by Incan source.""" + return create_udf(name, [DataType.Utf8, DataType.Utf8], DataType.Utf8, Volatility.Immutable, implementation) + + +def _apply_unary_string(name: str, args: list[ColumnarValue]) -> Result[ColumnarValue, DataFusionError]: + """Apply one unary string function across scalar and array DataFusion inputs.""" + _require_arity(args, 1)? + if _all_scalar(args): + return Ok(ColumnarValue.Scalar(ScalarValue.Utf8(_eval_unary_string(name, _string_at(args[0].clone(), 0)?)?))) + + row_count = _row_count(args)? + mut values: list[Option[str]] = [] + for idx in range(row_count): + values.append(_eval_unary_string(name, _string_at(args[0].clone(), idx)?)?) + array_ref: ArrayRef = Arc.from(StringArray.from(values)) + return Ok(ColumnarValue.Array(array_ref)) + + +def _apply_unary_bool(name: str, args: list[ColumnarValue]) -> Result[ColumnarValue, DataFusionError]: + """Apply one unary string-to-bool function across scalar and array DataFusion inputs.""" + _require_arity(args, 1)? + if _all_scalar(args): + return Ok(ColumnarValue.Scalar(ScalarValue.Boolean(_eval_unary_bool(name, _string_at(args[0].clone(), 0)?)?))) + + row_count = _row_count(args)? + mut values: list[Option[bool]] = [] + for idx in range(row_count): + values.append(_eval_unary_bool(name, _string_at(args[0].clone(), idx)?)?) + array_ref: ArrayRef = Arc.from(BooleanArray.from(values)) + return Ok(ColumnarValue.Array(array_ref)) + + +def _apply_unary_int(name: str, args: list[ColumnarValue]) -> Result[ColumnarValue, DataFusionError]: + """Apply one unary string-to-int function across scalar and array DataFusion inputs.""" + _require_arity(args, 1)? + if _all_scalar(args): + value = _eval_unary_int(name, _string_at(args[0].clone(), 0)?)? 
+ match value: + Some(number) => return Ok(ColumnarValue.Scalar(ScalarValue.Int64(Some(_int_to_rust_i64(number)?)))) + None => return Ok(ColumnarValue.Scalar(ScalarValue.Int64(None))) + + row_count = _row_count(args)? + mut values: list[Option[int]] = [] + for idx in range(row_count): + values.append(_eval_unary_int(name, _string_at(args[0].clone(), idx)?)?) + array_ref: ArrayRef = Arc.from(Int64Array.from(values)) + return Ok(ColumnarValue.Array(array_ref)) + + +def _apply_binary_string(name: str, args: list[ColumnarValue]) -> Result[ColumnarValue, DataFusionError]: + """Apply one binary string function across scalar and array DataFusion inputs.""" + _require_arity(args, 2)? + if _all_scalar(args): + return Ok( + ColumnarValue.Scalar( + ScalarValue.Utf8( + _eval_binary_string(name, _string_at(args[0].clone(), 0)?, _string_at(args[1].clone(), 0)?)?, + ), + ), + ) + + row_count = _row_count(args)? + mut values: list[Option[str]] = [] + for idx in range(row_count): + values.append(_eval_binary_string(name, _string_at(args[0].clone(), idx)?, _string_at(args[1].clone(), idx)?)?) 
+ array_ref: ArrayRef = Arc.from(StringArray.from(values)) + return Ok(ColumnarValue.Array(array_ref)) + + +def _eval_unary_string(name: str, value: Option[str]) -> Result[Option[str], DataFusionError]: + """Dispatch unary string-returning format functions.""" + match name: + "sha1" => return _hash_sha1(value) + "crc32" => return _hash_crc32(value) + "xxhash64" => return _hash_xxhash64(value) + "url_encode" => return _url_encode(value) + "url_decode" => return _url_decode(value) + "try_url_decode" => return _try_url_decode(value) + "parse_json" => return _parse_json(value) + "schema_of_json" => return _schema_of_json(value) + "json_object_keys" => return _json_object_keys(value) + "to_json" => return _to_json(value) + "schema_of_csv" => return _schema_of_csv(value) + "to_csv" => return _to_csv(value) + _ => return _datafusion_error(f"unknown unary format function `{name}`") + + +def _eval_unary_bool(name: str, value: Option[str]) -> Result[Option[bool], DataFusionError]: + """Dispatch unary boolean-returning format functions.""" + match name: + "check_json" => return _check_json(value) + _ => return _datafusion_error(f"unknown boolean format function `{name}`") + + +def _eval_unary_int(name: str, value: Option[str]) -> Result[Option[int], DataFusionError]: + """Dispatch unary integer-returning format functions.""" + match name: + "json_array_length" => return _json_array_length(value) + _ => return _datafusion_error(f"unknown integer format function `{name}`") + + +def _eval_binary_string(name: str, left: Option[str], right: Option[str]) -> Result[Option[str], DataFusionError]: + """Dispatch binary string-returning format functions.""" + match name: + "parse_url" => return _parse_url(left, right) + "get_json_object" => return _json_path_value(left, right, false) + "json_extract_path_text" => return _json_path_value(left, right, true) + "from_json" => return _from_json(left, right) + "try_from_json" => return _try_from_json(left, right) + "from_csv" => return 
_from_csv(left, right) + _ => return _datafusion_error(f"unknown binary format function `{name}`") + + +def _hash_sha1(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Hash one optional string with SHA-1.""" + match value: + Some(text) => return Ok(Some(hexlify(_sha1_digest(_utf8_bytes(text))))) + None => return Ok(None) + + +def _hash_crc32(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Hash one optional string with CRC-32.""" + match value: + Some(text) => + mut hasher = Crc32Hasher.new() + hasher.update(_utf8_bytes(text).as_slice()) + return Ok(Some(hexlify(_u32_to_be_bytes(hasher.finalize())))) + None => return Ok(None) + + +def _hash_xxhash64(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Hash one optional string with xxHash64.""" + match value: + Some(text) => return Ok(Some(hexlify(_u64_to_be_bytes(_xxhash64(_utf8_bytes(text)))))) + None => return Ok(None) + + +def _url_encode(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Percent-encode one URL component.""" + match value: + Some(text) => + mut encoded = "" + for byte in _utf8_bytes(text): + if _url_byte_needs_encoding(byte): + encoded += "%" + _hex_byte(byte).upper() + else: + encoded += _byte_to_ascii_text(byte) + return Ok(Some(encoded)) + None => return Ok(None) + + +def _url_decode(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Strictly decode one percent-encoded URL component.""" + decoded = _try_url_decode(value)? 
+ match value: + Some(_) => + match decoded: + Some(_) => return Ok(decoded) + None => return _datafusion_error("invalid percent-encoded URL component") + None => return Ok(None) + + +def _try_url_decode(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Decode one percent-encoded URL component, returning null for malformed input.""" + match value: + Some(text) => + mut out: list[u8] = [] + mut idx = 0 + while idx < len(text): + ch = text[idx:idx + 1] + if ch == "%": + if idx + 2 >= len(text): + return Ok(None) + high = _hex_value(text[idx + 1:idx + 2]) + low = _hex_value(text[idx + 2:idx + 3]) + if high < 0 or low < 0: + return Ok(None) + _append_byte(out, high * 16 + low) + idx += 3 + else: + for byte in _utf8_bytes(ch): + _append_byte(out, byte) + idx += 1 + match RustString.from_utf8(out): + Ok(decoded) => return Ok(Some(decoded)) + Err(_) => return Ok(None) + None => return Ok(None) + + +def _parse_url(value: Option[str], part: Option[str]) -> Result[Option[str], DataFusionError]: + """Extract one URL part with the Rust URL parser through Incan interop.""" + match value: + Some(raw_url) => + match part: + Some(raw_part) => + match Url.parse(raw_url): + Ok(parsed) => + normalized = raw_part.lower() + match normalized: + "scheme" => return Ok(Some(f"{parsed.scheme()}")) + "host" => + match parsed.host_str(): + Some(host) => return Ok(Some(f"{host}")) + None => return Ok(None) + "path" => return Ok(Some(f"{parsed.path()}")) + "query" => + match parsed.query(): + Some(query) => return Ok(Some(f"{query}")) + None => return Ok(None) + "fragment" => + match parsed.fragment(): + Some(fragment) => return Ok(Some(f"{fragment}")) + None => return Ok(None) + "port" => + match parsed.port(): + Some(port) => return Ok(Some(str(port))) + None => return Ok(None) + "username" => return Ok(Some(f"{parsed.username()}")) + "password" => + match parsed.password(): + Some(password) => return Ok(Some(f"{password}")) + None => return Ok(None) + _ => return 
_datafusion_error(f"unknown parse_url part `{raw_part}`") + Err(err) => return _datafusion_error(err.to_string()) + None => return Ok(None) + None => return Ok(None) + + +def _parse_json(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Validate and normalize one JSON payload string.""" + match value: + Some(text) => + parsed = _parse_json_value(text)? + return Ok(Some(_json_text(parsed)?)) + None => return Ok(None) + + +def _check_json(value: Option[str]) -> Result[Option[bool], DataFusionError]: + """Return whether one optional payload is valid JSON.""" + match value: + Some(text) => + match JsonValue.parse(text): + Ok(_) => return Ok(Some(true)) + Err(_) => return Ok(Some(false)) + None => return Ok(None) + + +def _schema_of_json(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Infer a deterministic schema description for one JSON payload.""" + match value: + Some(text) => return Ok(Some(_json_schema(_parse_json_value(text)?))) + None => return Ok(None) + + +def _json_array_length(value: Option[str]) -> Result[Option[int], DataFusionError]: + """Return the length of a JSON array payload.""" + match value: + Some(text) => + parsed = _parse_json_value(text)? + match parsed.as_array(): + Some(values) => return Ok(Some(len(values))) + None => return Ok(None) + None => return Ok(None) + + +def _json_object_keys(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Return JSON object keys as a JSON array string.""" + match value: + Some(text) => + parsed = _parse_json_value(text)? 
+ mut keys: list[JsonValue] = [] + for key in parsed.keys(): + keys.append(JsonValue.string(key)) + return Ok(Some(_json_text(JsonValue.array(keys))?)) + None => return Ok(None) + + +def _json_path_value(value: Option[str], path: Option[str], text_mode: bool) -> Result[Option[str], DataFusionError]: + """Select one JSON path and return JSON text, with scalar strings unquoted in text mode.""" + match value: + Some(raw_json) => + match path: + Some(raw_path) => + parsed = _parse_json_value(raw_json)? + match _select_json_path(parsed, raw_path)?: + Some(selected) => + if text_mode: + match selected.as_str(): + Some(text) => return Ok(Some(text)) + None => pass + return Ok(Some(_json_text(selected)?)) + None => return Ok(None) + None => return Ok(None) + None => return Ok(None) + + +def _from_json(value: Option[str], _schema: Option[str]) -> Result[Option[str], DataFusionError]: + """Validate and normalize one JSON payload string.""" + return _parse_json(value) + + +def _try_from_json(value: Option[str], _schema: Option[str]) -> Result[Option[str], DataFusionError]: + """Validate JSON, returning null for malformed payloads.""" + match value: + Some(text) => + match _parse_json_value(text): + Ok(parsed) => return Ok(Some(_json_text(parsed)?)) + Err(_) => return Ok(None) + None => return Ok(None) + + +def _to_json(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Serialize a scalar string as JSON text.""" + match value: + Some(text) => return Ok(Some(_json_text(JsonValue.string(text))?)) + None => return Ok(None) + + +def _schema_of_csv(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Infer a deterministic schema description for one CSV row string.""" + match value: + Some(text) => + fields = _csv_fields(text)? 
+ mut specs: list[str] = [] + for idx, item in enumerate(fields): + specs.append(f"_c{idx}: {_scalar_schema(item)}") + return Ok(Some(f"STRUCT<{", ".join(specs)}>")) + None => return Ok(None) + + +def _from_csv(value: Option[str], schema: Option[str]) -> Result[Option[str], DataFusionError]: + """Parse one CSV row string into a JSON array or schema-named object string.""" + match value: + Some(text) => + fields = _csv_fields(text)? + names = _csv_schema_field_names(schema.unwrap_or("")) + if len(names) == 0: + mut values: list[JsonValue] = [] + for field in fields: + values.append(JsonValue.string(field)) + return Ok(Some(_json_text(JsonValue.array(values))?)) + + mut entries: Dict[str, JsonValue] = {} + for idx, name in enumerate(names): + if idx < len(fields): + entries[name] = JsonValue.string(fields[idx]) + else: + entries[name] = JsonValue.string("") + return Ok(Some(_json_text(JsonValue.object(entries))?)) + None => return Ok(None) + + +def _to_csv(value: Option[str]) -> Result[Option[str], DataFusionError]: + """Serialize one scalar or JSON array/object payload as a CSV row.""" + match value: + Some(text) => + match JsonValue.parse(text): + Ok(parsed) => + if let Some(values) = parsed.as_array(): + return Ok(Some(_csv_line([_json_csv_value(item)? for item in values]))) + if let Some(entries) = parsed.as_object(): + mut fields: list[str] = [] + for key in entries.keys(): + fields.append(_json_csv_value(entries[key])?) 
+ return Ok(Some(_csv_line(fields))) + return Ok(Some(_csv_line([_json_csv_value(parsed)?]))) + Err(_) => return Ok(Some(_csv_line([text]))) + None => return Ok(None) + + +def _parse_json_value(text: str) -> Result[JsonValue, DataFusionError]: + """Parse one JSON payload and convert std.json errors to DataFusion errors.""" + match JsonValue.parse(text): + Ok(value) => return Ok(value) + Err(err) => return _json_error(err) + + +def _json_error[T](err: JsonError) -> Result[T, DataFusionError]: + """Convert a std.json error to the DataFusion UDF error channel.""" + return _datafusion_error(err.message()) + + +def _json_text(value: JsonValue) -> Result[str, DataFusionError]: + """Serialize one JSON value and convert std.json errors to DataFusion errors.""" + match value.to_json(): + Ok(text) => return Ok(text) + Err(err) => return _json_error(err) + + +def _json_schema(value: JsonValue) -> str: + """Infer a compact deterministic schema descriptor for one dynamic JSON value.""" + match value.kind(): + JsonKind.Null => return "NULL" + JsonKind.Bool => return "BOOLEAN" + JsonKind.Int => return "BIGINT" + JsonKind.Float => return "DOUBLE" + JsonKind.String => return "STRING" + JsonKind.Array => + match value.as_array(): + Some(values) => + if len(values) == 0: + return "ARRAY" + return f"ARRAY<{_json_schema(values[0])}>" + None => return "ARRAY" + JsonKind.Object => + mut fields: list[str] = [] + for item in value.items(): + key, nested = item + fields.append(f"{key}: {_json_schema(nested)}") + return f"STRUCT<{", ".join(fields)}>" + + +def _select_json_path(value: JsonValue, path: str) -> Result[Option[JsonValue], DataFusionError]: + """Select a small JSONPath subset accepted by the public helper validators.""" + if path == "$" or path == "": + return Ok(Some(value)) + mut current = value + mut remaining = path + if remaining.startswith("$."): + remaining = remaining[2:] + elif remaining.startswith("$["): + remaining = remaining[1:] + elif remaining.startswith("$"): + 
remaining = remaining[1:] + for segment in remaining.split("."): + if segment == "": + continue + match _select_json_segment(current, segment)?: + Some(selected) => + current = selected + None => return Ok(None) + return Ok(Some(current)) + + +def _select_json_segment(value: JsonValue, segment: str) -> Result[Option[JsonValue], DataFusionError]: + """Select one object field plus optional array index suffixes.""" + mut current = value + mut remaining = segment + if not remaining.startswith("["): + field = _segment_field_name(remaining) + match current.get(field): + Some(selected) => + current = selected + None => return Ok(None) + remaining = remaining[len(field):] + while remaining.startswith("["): + close_idx = _find_closing_bracket(remaining)? + index_text = remaining[1:close_idx] + index = _parse_non_negative_int(index_text)? + match current.as_array(): + Some(values) => + if index >= len(values): + return Ok(None) + current = values[index] + None => return Ok(None) + remaining = remaining[close_idx + 1:] + if remaining != "": + return Ok(None) + return Ok(Some(current)) + + +def _segment_field_name(segment: str) -> str: + """Return the object-field prefix before any array index suffix.""" + mut idx = 0 + while idx < len(segment): + if segment[idx:idx + 1] == "[": + return segment[0:idx] + idx += 1 + return segment + + +def _find_closing_bracket(value: str) -> Result[int, DataFusionError]: + """Find the first closing bracket in one path segment.""" + for idx, ch in enumerate(value): + if ch == "]": + return Ok(idx) + return _datafusion_error(f"invalid JSON path segment `{value}`") + + +def _parse_non_negative_int(value: str) -> Result[int, DataFusionError]: + """Parse a non-negative integer used as a JSON array index.""" + if value == "": + return _datafusion_error("empty JSON array index") + mut out = 0 + for ch in value: + digit = _digit_value(ch) + if digit < 0: + return _datafusion_error(f"invalid JSON array index `{value}`") + out = out * 10 + digit + 
return Ok(out) + + +def _csv_fields(text: str) -> Result[list[str], DataFusionError]: + """Parse one CSV row with comma delimiter and double-quote escaping.""" + mut fields: list[str] = [] + mut field = "" + mut in_quotes = false + mut idx = 0 + while idx < len(text): + ch = text[idx:idx + 1] + if in_quotes: + if ch == "\"": + if idx + 1 < len(text) and text[idx + 1:idx + 2] == "\"": + field += "\"" + idx += 2 + else: + in_quotes = false + idx += 1 + else: + field += ch + idx += 1 + elif ch == ",": + fields.append(field) + field = str("") + idx += 1 + elif ch == "\"" and field == "": + in_quotes = true + idx += 1 + else: + field += ch + idx += 1 + if in_quotes: + return _datafusion_error("unterminated quoted CSV field") + fields.append(field) + return Ok(fields) + + +def _csv_schema_field_names(schema: str) -> list[str]: + """Extract field names from a compact `STRUCT` schema string.""" + mut trimmed = schema.strip() + if trimmed.startswith("STRUCT<") and trimmed.endswith(">"): + trimmed = trimmed[7:len(trimmed) - 1] + mut names: list[str] = [] + for part in trimmed.split(","): + name = part.strip().split(":")[0].strip() + if name != "": + names.append(name) + return names + + +def _scalar_schema(value: str) -> str: + """Infer a small scalar schema spelling from one CSV field.""" + if _is_integer_text(value): + return "BIGINT" + if _is_float_text(value): + return "DOUBLE" + lower = value.lower() + if lower == "true" or lower == "false": + return "BOOLEAN" + return "STRING" + + +def _is_integer_text(value: str) -> bool: + """Return whether text has an integer spelling.""" + if value == "": + return false + mut start = 0 + if value.startswith("-") or value.startswith("+"): + if len(value) == 1: + return false + start = 1 + for ch in value[start:]: + if _digit_value(ch) < 0: + return false + return true + + +def _is_float_text(value: str) -> bool: + """Return whether text has a simple decimal floating-point spelling.""" + if not value.contains("."): + return false + 
mut digit_count = 0 + mut start = 0 + if value.startswith("-") or value.startswith("+"): + start = 1 + for ch in value[start:]: + if ch == ".": + pass + elif _digit_value(ch) >= 0: + digit_count += 1 + else: + return false + return digit_count > 0 + + +def _json_csv_value(value: JsonValue) -> Result[str, DataFusionError]: + """Convert one JSON value to its scalar CSV field text.""" + if value.kind() == JsonKind.Null: + return Ok("") + match value.as_str(): + Some(text) => return Ok(text) + None => return Ok(_json_text(value)?) + + +def _csv_line(fields: list[str]) -> str: + """Serialize one CSV row without a trailing newline.""" + return ",".join([_escape_csv_field(field) for field in fields]) + + +def _escape_csv_field(value: str) -> str: + """Escape one CSV field when delimiters, quotes, or newlines require quoting.""" + if not value.contains(",") and not value.contains("\"") and not value.contains("\n") and not value.contains("\r"): + return value + return "\"" + value.replace("\"", "\"\"") + "\"" + + +def _require_arity(args: list[ColumnarValue], expected: int) -> Result[None, DataFusionError]: + """Validate a DataFusion UDF arity.""" + if len(args) == expected: + return Ok(None) + return _datafusion_error(f"format UDF expected {expected} arguments, got {len(args)}") + + +def _all_scalar(args: list[ColumnarValue]) -> bool: + """Return whether all DataFusion arguments are scalar values.""" + for arg in args: + match arg: + ColumnarValue.Array(_) => return false + ColumnarValue.Scalar(_) => pass + return true + + +def _row_count(args: list[ColumnarValue]) -> Result[int, DataFusionError]: + """Return the first array row count, or one for scalar-only arguments.""" + for arg in args: + match arg: + ColumnarValue.Array(array) => return _rust_usize_to_int(array.len()) + ColumnarValue.Scalar(_) => pass + return Ok(1) + + +def _string_at(arg: ColumnarValue, idx: int) -> Result[Option[str], DataFusionError]: + """Read one optional string value from a scalar or array 
DataFusion value.""" + array_result: Result[ArrayRef, DataFusionError] = arg.to_array(_int_to_rust_usize(idx + 1)?) + array = array_result? + scalar_result: Result[ScalarValue, DataFusionError] = ScalarValue.try_from_array( + array.as_ref(), + _int_to_rust_usize(idx)?, + ) + scalar = scalar_result? + return _scalar_value_string(scalar) + + +def _scalar_value_string(value: ScalarValue) -> Result[Option[str], DataFusionError]: + """Convert one DataFusion scalar to optional text for format-function execution.""" + match value: + ScalarValue.Utf8(text) => return Ok(text) + ScalarValue.LargeUtf8(text) => return Ok(text) + ScalarValue.Utf8View(text) => return Ok(text) + ScalarValue.Null => return Ok(None) + _ => return Ok(Some(value.to_string())) + + +def _utf8_bytes(value: str) -> bytes: + """Encode one Incan string as UTF-8 bytes.""" + return RustString.from(value).into_bytes() + + +def _sha1_digest(data: bytes) -> bytes: + """Return SHA-1 digest bytes.""" + mut hasher = Sha1.default() + Digest.update(hasher, data.as_slice()) + return hasher.finalize_reset().to_vec() + + +def _xxhash64(data: bytes) -> u64: + """Return xxHash64 with the same zero-seed contract as InQL's public helper.""" + mut hasher = Xxh64.default() + hasher.update(data.as_slice()) + return hasher.digest() + + +def _u32_to_be_bytes(value: u32) -> bytes: + """Encode a 32-bit integer as big-endian bytes.""" + number = int(value) + mut out: list[u8] = [] + _append_byte(out, (number // 16777216) % 256) + _append_byte(out, (number // 65536) % 256) + _append_byte(out, (number // 256) % 256) + _append_byte(out, number % 256) + return out + + +def _u64_to_be_bytes(value: u64) -> bytes: + """Encode a 64-bit integer as big-endian bytes.""" + mut out: list[u8] = [] + _append_byte(out, int((value // 72057594037927936) % 256)) + _append_byte(out, int((value // 281474976710656) % 256)) + _append_byte(out, int((value // 1099511627776) % 256)) + _append_byte(out, int((value // 4294967296) % 256)) + _append_byte(out, 
int((value // 16777216) % 256)) + _append_byte(out, int((value // 65536) % 256)) + _append_byte(out, int((value // 256) % 256)) + _append_byte(out, int(value % 256)) + return out + + +def _append_byte(mut out: list[u8], value: int) -> None: + """Append one wrapped byte to a byte list.""" + byte: u8 = value.wrapping_resize() + out.append(byte) + + +def _hex_byte(value: int) -> str: + """Return a two-character lowercase hex byte.""" + as_int = value + high = as_int // 16 + low = as_int - (high * 16) + return _hex_digit(high) + _hex_digit(low) + + +def _hex_digit(value: int) -> str: + """Return one lowercase hex digit.""" + digits = "0123456789abcdef" + return digits[value:value + 1] + + +def _url_byte_needs_encoding(value: int) -> bool: + """Return whether a byte belongs to the RFC 022 URL component encode set.""" + as_int = value + if as_int < 32 or as_int > 126: + return true + ch = _byte_to_ascii_text(value) + return ch == " " or ch == "\"" or ch == "#" or ch == "%" or ch == "<" or ch == ">" or ch == "?" 
or ch == "`" or ch == "{" or ch == "}" + + +def _byte_to_ascii_text(value: int) -> str: + """Convert a printable ASCII byte to one-character text.""" + as_int = value + if as_int < 32 or as_int > 126: + return "" + printable = " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~" + return printable[as_int - 32:as_int - 31] + + +def _hex_value(ch: str) -> int: + """Return a hex digit value, or -1 for non-hex input.""" + if ch >= "0" and ch <= "9": + return int(ch) + lower = ch.lower() + if lower == "a": + return 10 + if lower == "b": + return 11 + if lower == "c": + return 12 + if lower == "d": + return 13 + if lower == "e": + return 14 + if lower == "f": + return 15 + return -1 + + +def _digit_value(ch: str) -> int: + """Return a decimal digit value, or -1 for non-digits.""" + if ch >= "0" and ch <= "9": + return int(ch) + return -1 + + +def _rust_usize_to_int(value: RustUsize) -> Result[int, DataFusionError]: + """Convert Rust usize to Incan int through a checked i64 boundary.""" + match RustI64.try_from(value): + Ok(converted) => return Ok(converted.into()) + Err(_) => return _datafusion_error("row count does not fit Incan int") + + +def _int_to_rust_usize(value: int) -> Result[RustUsize, DataFusionError]: + """Convert an Incan int to Rust usize through a checked boundary.""" + match RustUsize.try_from(value): + Ok(converted) => return Ok(converted) + Err(_) => return _datafusion_error(f"index {value} does not fit Rust usize") + + +def _int_to_rust_i64(value: int) -> Result[RustI64, DataFusionError]: + """Convert an Incan int to Rust i64 through a checked boundary.""" + match RustI64.try_from(value): + Ok(converted) => return Ok(converted) + Err(_) => return _datafusion_error(f"value {value} does not fit Rust i64") + + +def _datafusion_error[T](message: str) -> Result[T, DataFusionError]: + """Build one DataFusion execution error.""" + return Err(DataFusionError.Execution(f"{message}")) diff --git 
a/tests/test_session_projection.incn b/tests/test_session_projection.incn index d0e0d8a..d22957f 100644 --- a/tests/test_session_projection.incn +++ b/tests/test_session_projection.incn @@ -18,40 +18,40 @@ from functions import ( desc, div, floor, + from_csv, + from_json, + get_json_object, gt, + json_array_length, + json_extract_path_text, + json_object_keys, lit, md5, modulo, mul, neg, nullif, + parse_json, + parse_url, round, + schema_of_csv, + schema_of_json, sha1, sha2, sha224, sha384, sha512, sub, - xxhash64, - try_cast, - cardinality, - element_at, - from_csv, - from_json, - get_json_object, - json_array_length, - json_extract_path_text, - json_object_keys, - parse_json, - parse_url, - schema_of_csv, - schema_of_json, to_csv, to_json, try_from_json, try_url_decode, + try_cast, url_decode, url_encode, + xxhash64, + cardinality, + element_at, ) from dataset import DataFrame, LazyFrame from session import Session, SessionErrorKind @@ -274,7 +274,7 @@ def test_session_projection__collect_executes_format_hashing_projection_function def test_session_projection__collect_executes_json_url_and_csv_format_functions() -> None: - """collect should execute RFC 022 scalar format helpers through adapter-owned DataFusion UDFs.""" + """collect should execute RFC 022 scalar format helpers through Incan-authored DataFusion callbacks.""" # -- Arrange -- mut session = Session.default() From 22a089efece8948a31a37395b0108aff4e2220e9 Mon Sep 17 00:00:00 2001 From: Danny Meijer Date: Thu, 28 May 2026 22:16:52 +0200 Subject: [PATCH 08/13] fix - 39 address format function review --- .github/workflows/ci.yml | 2 +- docs/language/reference/functions/format.md | 13 +- .../022_semi_structured_format_functions.md | 8 +- .../026_semi_structured_variant_values.md | 2 +- incan.lock | 43 +++-- incan.toml | 3 - src/functions/csv/from_csv.incn | 19 ++- src/functions/csv/schema_of_csv.incn | 15 ++ src/functions/csv/to_csv.incn | 15 ++ src/functions/hashing/crc32.incn | 15 ++ 
src/functions/hashing/sha1.incn | 15 ++ src/functions/hashing/xxhash64.incn | 15 ++ src/functions/json/check_json.incn | 15 ++ src/functions/json/common.incn | 14 ++ src/functions/json/from_json.incn | 15 ++ src/functions/json/get_json_object.incn | 15 ++ src/functions/json/json_array_length.incn | 15 ++ .../json/json_extract_path_text.incn | 15 ++ src/functions/json/json_object_keys.incn | 15 ++ src/functions/json/parse_json.incn | 15 ++ src/functions/json/schema_of_json.incn | 15 ++ src/functions/json/to_json.incn | 15 ++ src/functions/json/try_from_json.incn | 15 ++ src/functions/urls/parse_url.incn | 31 ++-- src/functions/urls/try_url_decode.incn | 15 ++ src/functions/urls/url_decode.incn | 15 ++ src/functions/urls/url_encode.incn | 15 ++ src/session/datafusion_format_functions.incn | 147 +++++++++--------- tests/test_format_functions.incn | 8 +- tests/test_function_registry.incn | 4 +- tests/test_session_projection.incn | 11 +- tests/test_substrait_plan.incn | 2 +- 32 files changed, 459 insertions(+), 118 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index c9007fd..0c92e87 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -20,7 +20,7 @@ concurrency: env: CARGO_TERM_COLOR: always INCAN_REF: release/v0.3 - EXPECTED_INCAN_VERSION: 0.3.0-rc23 + EXPECTED_INCAN_VERSION: 0.3.0-rc24 RUST_BACKTRACE: 1 INCAN_NO_BANNER: 1 diff --git a/docs/language/reference/functions/format.md b/docs/language/reference/functions/format.md index ae282bc..ed3c859 100644 --- a/docs/language/reference/functions/format.md +++ b/docs/language/reference/functions/format.md @@ -16,7 +16,7 @@ The format catalog includes deterministic hashes, URL helpers, JSON helpers, and | `sha2(expr, bit_length)` | Compatibility helper that rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths. | | `crc32(expr)` | Return the lowercase eight-character hexadecimal CRC-32 digest for a string expression. 
| | `xxhash64(expr)` | Return the lowercase sixteen-character hexadecimal xxHash64 digest for a string expression. | -| `parse_url(expr, part)` | Extract `scheme`, `host`, `path`, `query`, `fragment`, `port`, `username`, or `password` from a URL string. | +| `parse_url(expr, key)` | Extract the first query parameter value for `key` from a URL string, returning null when the key is absent. | | `url_encode(expr)` | Percent-encode a URL component string. | | `url_decode(expr)` | Decode a percent-encoded URL component string and fail on malformed escapes. | | `try_url_decode(expr)` | Decode a percent-encoded URL component string, returning null on malformed escapes. | @@ -31,18 +31,19 @@ The format catalog includes deterministic hashes, URL helpers, JSON helpers, and | `try_from_json(expr, schema)` | Validate JSON with an explicit schema description and return null when the payload is invalid. | | `to_json(expr)` | Serialize a scalar expression as JSON text. | | `schema_of_csv(expr)` | Infer a deterministic schema description from a CSV row string. | -| `from_csv(expr, schema)` | Parse a CSV row string into a JSON payload string, using schema field names when provided. | +| `from_csv(expr, schema)` | Parse a CSV row string into a logical map keyed by schema field names, or `_cN` positional keys when no schema is provided. | | `to_csv(expr)` | Serialize a scalar or JSON array/object payload as a CSV row string. 
| ```incan -from pub::inql.functions import col, from_json, get_json_object, parse_url, sha2, to_json +from pub::inql.functions import col, from_csv, from_json, get_json_object, parse_url, sha2, to_json projected = ( events .with_column("user_hash", sha2(col("user_id"), 256)) - .with_column("host", parse_url(col("landing_page"), "host")) + .with_column("campaign", parse_url(col("landing_page"), "utm_campaign")) .with_column("event_type", get_json_object(col("payload"), "$.type")) .with_column("payload_obj", from_json(col("payload"), "STRUCT")) + .with_column("row_fields", from_csv(col("csv_line"), "STRUCT")) .with_column("payload_out", to_json(col("event_type"))) ) ``` @@ -50,8 +51,8 @@ projected = ( Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal strings. `sha2(...)` accepts `224`, `256`, `384`, and `512`; other digest lengths are rejected during expression construction. -JSON and CSV helpers validate, normalize, and project payload text. They do not read external files or introduce a -dynamic variant value type. +JSON helpers validate, normalize, and project payload text. CSV parsing returns logical map values instead of JSON text. +These helpers do not read external files or introduce a dynamic variant value type. The DataFusion adapter executes the full RFC 022 catalog with native DataFusion functions where available and Incan-authored adapter callbacks for helpers that DataFusion does not expose natively. 
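The hash contract described above (UTF-8 input bytes, lowercase hexadecimal output, `sha2` dispatching on a literal bit length) can be modelled outside InQL. The following is a minimal Python sketch, not InQL or Incan code: the names `sha2_hex` and `crc32_hex` are hypothetical, and null propagation is assumed to match the adapter callbacks, which return null for null input.

```python
import hashlib
import zlib
from typing import Optional

# Hypothetical model of the documented `sha2(expr, bit_length)` dispatch.
_SHA2_BY_BITS = {
    224: hashlib.sha224,
    256: hashlib.sha256,
    384: hashlib.sha384,
    512: hashlib.sha512,
}


def sha2_hex(text: Optional[str], bit_length: int) -> Optional[str]:
    """Return the lowercase hex SHA-2 digest of the UTF-8 bytes of `text`."""
    # Unsupported lengths are rejected up front, before any value is hashed.
    if bit_length not in _SHA2_BY_BITS:
        raise ValueError(f"unsupported sha2 bit length: {bit_length}")
    if text is None:
        return None  # assumed: null in, null out
    return _SHA2_BY_BITS[bit_length](text.encode("utf-8")).hexdigest()


def crc32_hex(text: Optional[str]) -> Optional[str]:
    """Return the CRC-32 of the UTF-8 bytes as eight lowercase hex characters."""
    if text is None:
        return None
    # Zero-pad to eight characters so small checksums keep a fixed width.
    return format(zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF, "08x")
```

In this model the bit length is validated before the null check, which mirrors the documented behaviour that unsupported lengths fail during expression construction rather than per row.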
diff --git a/docs/rfcs/022_semi_structured_format_functions.md b/docs/rfcs/022_semi_structured_format_functions.md index bcf4bea..ae4303c 100644 --- a/docs/rfcs/022_semi_structured_format_functions.md +++ b/docs/rfcs/022_semi_structured_format_functions.md @@ -65,9 +65,9 @@ Reading a JSON file into a dataset remains a session/source operation, not a sca InQL should define JSON functions including `from_json`, `to_json`, `get_json_object`, `json_array_length`, `json_object_keys`, `schema_of_json`, `parse_json`, `check_json`, `json_extract_path_text`, and `try_from_json` where recoverable parse behavior is desired. -InQL should define CSV value functions including `from_csv`, `to_csv`, and `schema_of_csv` only insofar as they operate on scalar values or schema descriptions. These functions must not replace the session CSV read/write contract. +InQL should define CSV value functions including `from_csv`, `to_csv`, and `schema_of_csv` only insofar as they operate on scalar values or schema descriptions. `from_csv` returns a logical map keyed by schema field names, or `_cN` positional keys when no schema is supplied. These functions must not replace the session CSV read/write contract. -InQL should define URL functions including `parse_url`, `url_encode`, `url_decode`, and `try_url_decode`, with exact invalid-input behavior recorded in the registry. +InQL should define URL functions including `parse_url`, `url_encode`, `url_decode`, and `try_url_decode`, with exact invalid-input behavior recorded in the registry. `parse_url` extracts query parameter values by key. InQL should define hash functions including `crc32`, `md5`, `sha1`, `sha2`, and `xxhash64`, with input encoding and output representation specified. @@ -122,9 +122,9 @@ This RFC is additive. It should not change existing CSV ingestion behavior. - Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal strings. 
- Portable concrete hash helpers are `md5`, `sha1`, `sha224`, `sha256`, `sha384`, `sha512`, `crc32`, and `xxhash64`, each with an honest Substrait extension mapping. The DataFusion adapter validates materialized execution for the full helper set, using native DataFusion functions where available and Incan-authored adapter callbacks where DataFusion has no built-in implementation. - `sha2(expr, bit_length)` is a compatibility helper, not a separate backend mapping. It rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths and rejects unsupported values. -- URL helpers accept scalar URL or component strings. `url_decode(...)` is strict and fails malformed percent escapes; `try_url_decode(...)` returns null for malformed percent escapes. +- URL helpers accept scalar URL or component strings. `parse_url(...)` returns the first query parameter value for a literal key. `url_decode(...)` is strict and fails malformed percent escapes; `try_url_decode(...)` returns null for malformed percent escapes. - JSON helpers accept scalar JSON payload strings. Strict helpers fail invalid JSON; `try_from_json(...)` returns null for invalid JSON; schema and path helpers are deterministic over the provided payload and literal schema/path arguments. -- CSV helpers accept scalar row strings and explicit schema-description strings. They operate on payload values only and do not replace the session CSV read/write contract. +- CSV helpers accept scalar row strings and explicit schema-description strings. Parsed CSV rows are logical maps rather than JSON text. They operate on payload values only and do not replace the session CSV read/write contract. - Semi-structured variant predicates such as `typeof`, `is_array`, `is_object`, `is_integer`, `is_timestamp`, and `is_null_value` belong to InQL RFC 026. RFC 022 does not accept them as JSON-text parser helpers. 
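The URL decisions above distinguish a strict decoder from a null-returning one, and define `parse_url` as a first-match query lookup. The sketch below models those three behaviours in Python as an illustration only; it is not the InQL implementation. The function names mirror the helpers but are plain Python stand-ins, and whether InQL percent-decodes query values (as `parse_qsl` does here) is an assumption.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlsplit

_HEX_DIGITS = "0123456789abcdefABCDEF"


def try_url_decode(text: Optional[str]) -> Optional[str]:
    """Decode percent escapes, returning None for malformed input."""
    if text is None:
        return None
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "%":
            pair = text[i + 1:i + 3]
            # A valid escape is '%' followed by exactly two hex digits.
            if len(pair) != 2 or any(c not in _HEX_DIGITS for c in pair):
                return None
            out.append(int(pair, 16))
            i += 3
        else:
            out.extend(ch.encode("utf-8"))
            i += 1
    try:
        return out.decode("utf-8")
    except UnicodeDecodeError:
        # Decoded bytes that are not valid UTF-8 also count as malformed.
        return None


def url_decode(text: Optional[str]) -> Optional[str]:
    """Strict variant: null stays null, malformed input raises."""
    decoded = try_url_decode(text)
    if text is not None and decoded is None:
        raise ValueError("invalid percent-encoded URL component")
    return decoded


def parse_url(url: Optional[str], key: Optional[str]) -> Optional[str]:
    """Return the first query parameter value for `key`, or None if absent."""
    if url is None or key is None:
        return None
    for name, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        if name == key:
            return value
    return None
```

Building the strict decoder on top of the recoverable one keeps a single parsing path, so the two helpers can only differ in how they report malformed input.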
## Implementation Plan diff --git a/docs/rfcs/026_semi_structured_variant_values.md b/docs/rfcs/026_semi_structured_variant_values.md index 65ff770..2bf3090 100644 --- a/docs/rfcs/026_semi_structured_variant_values.md +++ b/docs/rfcs/026_semi_structured_variant_values.md @@ -10,7 +10,7 @@ - InQL RFC 022 (semi-structured and format functions) - InQL RFC 024 (function extension policy) - **Issue:** [InQL #52](https://github.com/dannys-code-corner/InQL/issues/52) -- **RFC PR:** [InQL #49](https://github.com/dannys-code-corner/InQL/pull/49) +- **RFC PR:** — - **Written against:** Incan v0.3-era InQL - **Shipped in:** — diff --git a/incan.lock b/incan.lock index 97b484b..dbda203 100644 --- a/incan.lock +++ b/incan.lock @@ -3,8 +3,8 @@ [incan] format = 1 -incan-version = "0.3.0-rc23" -deps-fingerprint = "sha256:825926ed0a97c47f458cb9e9ed6c142473cb37e7f047d3de1a32263f1cef5097" +incan-version = "0.3.0-rc24" +deps-fingerprint = "sha256:c857db33615865c432fa4c9b3947975ccb4255249143ecc1676c26e85981aafb" cargo-features = [] cargo-no-default-features = false cargo-all-features = false @@ -1824,14 +1824,14 @@ dependencies = [ [[package]] name = "incan_core" -version = "0.3.0-rc23" +version = "0.3.0-rc24" dependencies = [ "serde", ] [[package]] name = "incan_derive" -version = "0.3.0-rc23" +version = "0.3.0-rc24" dependencies = [ "proc-macro2", "quote", @@ -1840,7 +1840,7 @@ dependencies = [ [[package]] name = "incan_stdlib" -version = "0.3.0-rc23" +version = "0.3.0-rc24" dependencies = [ "incan_core", "incan_derive", @@ -1864,8 +1864,10 @@ dependencies = [ [[package]] name = "inql" -version = "0.3.0-rc23" +version = "0.3.0-rc24" dependencies = [ + "blake2", + "blake3", "byteorder", "crc32fast", "datafusion", @@ -1875,12 +1877,14 @@ dependencies = [ "encoding_rs", "incan_derive", "incan_stdlib", + "md-5", "prost", "prost-types", "rustix", "serde", "sha1", "sha2", + "sha3", "substrait 0.63.0", "url", "xxhash-rust", @@ -1929,6 +1933,15 @@ dependencies = [ "wasm-bindgen", ] 
+[[package]] +name = "keccak" +version = "0.1.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "cb26cec98cce3a3d96cbb7bced3c4b16e3d13f27ec56dbd62cbc8f39cfb9d653" +dependencies = [ + "cpufeatures 0.2.17", +] + [[package]] name = "leb128fmt" version = "0.1.0" @@ -2094,9 +2107,9 @@ dependencies = [ [[package]] name = "mio" -version = "1.2.0" +version = "1.2.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "50b7e5b27aa02a74bac8c3f23f448f8d87ff11f92d3aac1a6ed369ee08cc56c1" +checksum = "02bd0af71c67b473010cbbc60715ee815645a4dc942899111f494b4b737d6fda" dependencies = [ "libc", "wasi", @@ -2763,6 +2776,16 @@ dependencies = [ "digest", ] +[[package]] +name = "sha3" +version = "0.10.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "77fd7028345d415a4034cf8777cd4f8ab1851274233b45f84e3d955502d93874" +dependencies = [ + "digest", + "keccak", +] + [[package]] name = "shlex" version = "1.3.0" @@ -2807,9 +2830,9 @@ checksum = "1b6b67fb9a61334225b5b790716f609cd58395f895b3fe8b328786812a40bc3b" [[package]] name = "socket2" -version = "0.6.3" +version = "0.6.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3a766e1110788c36f4fa1c2b71b387a7815aa65f88ce0229841826633d93723e" +checksum = "52d1cfed4120b4d927bf7c0f86d2087a4a7d6027c906d9f9d525a80573b9be51" dependencies = [ "libc", "windows-sys", diff --git a/incan.toml b/incan.toml index cada06b..d0a33f6 100644 --- a/incan.toml +++ b/incan.toml @@ -14,7 +14,4 @@ datafusion_common = { package = "datafusion-common", version = "53" } datafusion_expr = { package = "datafusion-expr", version = "53" } datafusion-substrait = { version = "53", features = ["protoc"] } crc32fast = "1.5" -sha1 = "0.10" -sha2 = "0.10.9" url = "2.5" -xxhash_rust = { package = "xxhash-rust", version = "0.8", features = ["xxh64"] } diff --git a/src/functions/csv/from_csv.incn b/src/functions/csv/from_csv.incn index 0aef426..17c493c 100644 --- 
a/src/functions/csv/from_csv.incn +++ b/src/functions/csv/from_csv.incn @@ -23,13 +23,28 @@ from substrait.function_extensions import FROM_CSV_FUNCTION_ANCHOR )) pub def from_csv(expr: ColumnExpr, schema: str) -> ColumnExpr: """ - Parse one CSV row string into a JSON payload using an optional schema description for field names. + Parse one CSV row string into a logical map keyed by schema field names. Examples: row = from_csv(col("line"), "STRUCT") Parameters: expr: String expression containing one CSV row. - schema: Explicit schema description used for output field names. + schema: Explicit schema description used for output keys, or an empty string for `_cN` positional keys. """ return registered_application("from_csv", [expr, str_expr(schema)]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_from_csv_builds_registered_application() -> None: + expr = from_csv(col("line"), "STRUCT") + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "from_csv" + assert column_expr_argument_count(expr) == 2 diff --git a/src/functions/csv/schema_of_csv.incn b/src/functions/csv/schema_of_csv.incn index fbc53e5..97785fc 100644 --- a/src/functions/csv/schema_of_csv.incn +++ b/src/functions/csv/schema_of_csv.incn @@ -32,3 +32,18 @@ pub def schema_of_csv(expr: ColumnExpr) -> ColumnExpr: expr: String expression containing one CSV row. 
""" return registered_application("schema_of_csv", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_schema_of_csv_builds_registered_application() -> None: + expr = schema_of_csv(col("line")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "schema_of_csv" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/csv/to_csv.incn b/src/functions/csv/to_csv.incn index 77924ad..b32bab4 100644 --- a/src/functions/csv/to_csv.incn +++ b/src/functions/csv/to_csv.incn @@ -30,3 +30,18 @@ pub def to_csv(expr: ColumnExpr) -> ColumnExpr: expr: String-backed scalar or JSON array/object payload to serialize as one CSV row. """ return registered_application("to_csv", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_to_csv_builds_registered_application() -> None: + expr = to_csv(col("payload")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "to_csv" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/hashing/crc32.incn b/src/functions/hashing/crc32.incn index f2d0d97..ce9c48a 100644 --- a/src/functions/hashing/crc32.incn +++ b/src/functions/hashing/crc32.incn @@ -30,3 +30,18 @@ pub def crc32(expr: ColumnExpr) -> ColumnExpr: expr: String expression whose UTF-8 bytes should be hashed. 
""" return registered_application("crc32", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_crc32_builds_registered_application() -> None: + expr = crc32(col("payload")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "crc32" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/hashing/sha1.incn b/src/functions/hashing/sha1.incn index ef82a2e..2715cb3 100644 --- a/src/functions/hashing/sha1.incn +++ b/src/functions/hashing/sha1.incn @@ -30,3 +30,18 @@ pub def sha1(expr: ColumnExpr) -> ColumnExpr: expr: String expression whose UTF-8 bytes should be hashed. """ return registered_application("sha1", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_sha1_builds_registered_application() -> None: + expr = sha1(col("payload")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "sha1" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/hashing/xxhash64.incn b/src/functions/hashing/xxhash64.incn index 4422427..c595728 100644 --- a/src/functions/hashing/xxhash64.incn +++ b/src/functions/hashing/xxhash64.incn @@ -30,3 +30,18 @@ pub def xxhash64(expr: ColumnExpr) -> ColumnExpr: expr: String expression whose UTF-8 bytes should be hashed. 
""" return registered_application("xxhash64", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_xxhash64_builds_registered_application() -> None: + expr = xxhash64(col("payload")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "xxhash64" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/json/check_json.incn b/src/functions/json/check_json.incn index da1ade3..fdec6d7 100644 --- a/src/functions/json/check_json.incn +++ b/src/functions/json/check_json.incn @@ -30,3 +30,18 @@ pub def check_json(expr: ColumnExpr) -> ColumnExpr: expr: String expression containing a JSON payload. """ return registered_application("check_json", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_check_json_builds_registered_application() -> None: + expr = check_json(col("payload")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "check_json" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/json/common.incn b/src/functions/json/common.incn index 089a2c8..91a7f7f 100644 --- a/src/functions/json/common.incn +++ b/src/functions/json/common.incn @@ -11,3 +11,17 @@ pub def json_path_expr(path: str) -> ColumnExpr: if len(path) >= 2 and path[0] == "$" and (path[1] == "." 
or path[1] == "["): return str_expr(path) return raise_value_error("JSON path must be `$`, start with `$.`, or start with `$[`") + + +module tests: + from std.testing import assert_raises + from projection_builders import ColumnExprKind, column_expr_kind + def _invalid_json_path_expr() -> None: + json_path_expr("type") + return + def test_json_path_expr_accepts_json_path_prefixes() -> None: + assert column_expr_kind(json_path_expr("$")) == ColumnExprKind.StringLiteral + assert column_expr_kind(json_path_expr("$.type")) == ColumnExprKind.StringLiteral + assert column_expr_kind(json_path_expr("$[0]")) == ColumnExprKind.StringLiteral + def test_json_path_expr_rejects_relative_paths() -> None: + assert_raises[ValueError](_invalid_json_path_expr) diff --git a/src/functions/json/from_json.incn b/src/functions/json/from_json.incn index 1edc210..37984e9 100644 --- a/src/functions/json/from_json.incn +++ b/src/functions/json/from_json.incn @@ -33,3 +33,18 @@ pub def from_json(expr: ColumnExpr, schema: str) -> ColumnExpr: schema: Explicit schema description for the expected JSON payload. """ return registered_application("from_json", [expr, str_expr(schema)]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_from_json_builds_registered_application() -> None: + expr = from_json(col("payload"), "STRUCT") + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "from_json" + assert column_expr_argument_count(expr) == 2 diff --git a/src/functions/json/get_json_object.incn b/src/functions/json/get_json_object.incn index 29c9bbb..b52f57d 100644 --- a/src/functions/json/get_json_object.incn +++ b/src/functions/json/get_json_object.incn @@ -34,3 +34,18 @@ pub def get_json_object(expr: ColumnExpr, path: str) -> ColumnExpr: path: Literal JSON path beginning with `$`, `$.`, or `$[`. 
""" return registered_application("get_json_object", [expr, json_path_expr(path)]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_get_json_object_builds_registered_application() -> None: + expr = get_json_object(col("payload"), "$.type") + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "get_json_object" + assert column_expr_argument_count(expr) == 2 diff --git a/src/functions/json/json_array_length.incn b/src/functions/json/json_array_length.incn index c5c5e76..5a1976c 100644 --- a/src/functions/json/json_array_length.incn +++ b/src/functions/json/json_array_length.incn @@ -32,3 +32,18 @@ pub def json_array_length(expr: ColumnExpr) -> ColumnExpr: expr: String expression containing a JSON array payload. """ return registered_application("json_array_length", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_json_array_length_builds_registered_application() -> None: + expr = json_array_length(col("tags_json")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "json_array_length" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/json/json_extract_path_text.incn b/src/functions/json/json_extract_path_text.incn index 33e5363..553b969 100644 --- a/src/functions/json/json_extract_path_text.incn +++ b/src/functions/json/json_extract_path_text.incn @@ -34,3 +34,18 @@ pub def json_extract_path_text(expr: ColumnExpr, path: str) -> ColumnExpr: path: Literal JSON path beginning with `$`, `$.`, or `$[`. 
""" return registered_application("json_extract_path_text", [expr, json_path_expr(path)]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_json_extract_path_text_builds_registered_application() -> None: + expr = json_extract_path_text(col("payload"), "$.type") + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "json_extract_path_text" + assert column_expr_argument_count(expr) == 2 diff --git a/src/functions/json/json_object_keys.incn b/src/functions/json/json_object_keys.incn index fb39d29..8cc022d 100644 --- a/src/functions/json/json_object_keys.incn +++ b/src/functions/json/json_object_keys.incn @@ -32,3 +32,18 @@ pub def json_object_keys(expr: ColumnExpr) -> ColumnExpr: expr: String expression containing a JSON object payload. """ return registered_application("json_object_keys", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_json_object_keys_builds_registered_application() -> None: + expr = json_object_keys(col("payload")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "json_object_keys" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/json/parse_json.incn b/src/functions/json/parse_json.incn index afb757b..a94b4bb 100644 --- a/src/functions/json/parse_json.incn +++ b/src/functions/json/parse_json.incn @@ -32,3 +32,18 @@ pub def parse_json(expr: ColumnExpr) -> ColumnExpr: expr: String expression containing a JSON payload. 
""" return registered_application("parse_json", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_parse_json_builds_registered_application() -> None: + expr = parse_json(col("payload")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "parse_json" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/json/schema_of_json.incn b/src/functions/json/schema_of_json.incn index 809f292..36b2872 100644 --- a/src/functions/json/schema_of_json.incn +++ b/src/functions/json/schema_of_json.incn @@ -32,3 +32,18 @@ pub def schema_of_json(expr: ColumnExpr) -> ColumnExpr: expr: String expression containing a JSON payload. """ return registered_application("schema_of_json", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_schema_of_json_builds_registered_application() -> None: + expr = schema_of_json(col("payload")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "schema_of_json" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/json/to_json.incn b/src/functions/json/to_json.incn index 345e065..70aa566 100644 --- a/src/functions/json/to_json.incn +++ b/src/functions/json/to_json.incn @@ -30,3 +30,18 @@ pub def to_json(expr: ColumnExpr) -> ColumnExpr: expr: String-backed scalar expression to serialize as JSON text. 
""" return registered_application("to_json", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_to_json_builds_registered_application() -> None: + expr = to_json(col("event_type")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "to_json" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/json/try_from_json.incn b/src/functions/json/try_from_json.incn index 2d9ecc5..6cc3c83 100644 --- a/src/functions/json/try_from_json.incn +++ b/src/functions/json/try_from_json.incn @@ -31,3 +31,18 @@ pub def try_from_json(expr: ColumnExpr, schema: str) -> ColumnExpr: schema: Explicit schema description for the expected JSON payload. """ return registered_application("try_from_json", [expr, str_expr(schema)]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_try_from_json_builds_registered_application() -> None: + expr = try_from_json(col("payload"), "STRUCT") + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "try_from_json" + assert column_expr_argument_count(expr) == 2 diff --git a/src/functions/urls/parse_url.incn b/src/functions/urls/parse_url.incn index 147708b..0fc7d76 100644 --- a/src/functions/urls/parse_url.incn +++ b/src/functions/urls/parse_url.incn @@ -1,4 +1,4 @@ -"""URL component parsing helper.""" +"""URL query parameter parsing helper.""" from function_registry import ( FunctionClass, @@ -22,19 +22,32 @@ from substrait.function_extensions import PARSE_URL_FUNCTION_ANCHOR error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, substrait=extension_mapping("parse_url", PARSE_URL_FUNCTION_ANCHOR), )) -pub def parse_url(expr: ColumnExpr, part: str) -> ColumnExpr: +pub def 
parse_url(expr: ColumnExpr, key: str) -> ColumnExpr: """ - Extract one URL component such as `scheme`, `host`, `path`, `query`, `fragment`, `port`, `username`, or `password`. + Extract the first query parameter value for a key. Examples: - host = parse_url(col("landing_page"), "host") + page = parse_url(col("landing_page"), "page") Parameters: expr: String expression containing one absolute URL. - part: URL component to extract. + key: Query parameter key to extract. """ - if part == "scheme" or part == "host" or part == "path" or part == "query" or part == "fragment" or part == "port" or part == "username" or part == "password": - return registered_application("parse_url", [expr, str_expr(part)]) - return raise_value_error( - "parse_url part must be one of scheme, host, path, query, fragment, port, username, or password", + if key == "": + return raise_value_error("parse_url key must not be empty") + return registered_application("parse_url", [expr, str_expr(key)]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, ) + def test_parse_url_builds_registered_application_for_query_parameter() -> None: + expr = parse_url(col("landing_page"), "page") + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "parse_url" + assert column_expr_argument_count(expr) == 2 diff --git a/src/functions/urls/try_url_decode.incn b/src/functions/urls/try_url_decode.incn index 21c2748..658492a 100644 --- a/src/functions/urls/try_url_decode.incn +++ b/src/functions/urls/try_url_decode.incn @@ -30,3 +30,18 @@ pub def try_url_decode(expr: ColumnExpr) -> ColumnExpr: expr: String expression containing one percent-encoded URL component. 
""" return registered_application("try_url_decode", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_try_url_decode_builds_registered_application() -> None: + expr = try_url_decode(col("encoded_query")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "try_url_decode" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/urls/url_decode.incn b/src/functions/urls/url_decode.incn index e50fd39..5c2b8b0 100644 --- a/src/functions/urls/url_decode.incn +++ b/src/functions/urls/url_decode.incn @@ -32,3 +32,18 @@ pub def url_decode(expr: ColumnExpr) -> ColumnExpr: expr: String expression containing one percent-encoded URL component. """ return registered_application("url_decode", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_url_decode_builds_registered_application() -> None: + expr = url_decode(col("encoded_query")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "url_decode" + assert column_expr_argument_count(expr) == 1 diff --git a/src/functions/urls/url_encode.incn b/src/functions/urls/url_encode.incn index f6b4371..7c8d6a0 100644 --- a/src/functions/urls/url_encode.incn +++ b/src/functions/urls/url_encode.incn @@ -30,3 +30,18 @@ pub def url_encode(expr: ColumnExpr) -> ColumnExpr: expr: String expression to percent-encode as a URL component. 
""" return registered_application("url_encode", [expr]) + + +module tests: + from projection_builders import ( + ColumnExprKind, + col, + column_expr_argument_count, + column_expr_function_name, + column_expr_kind, + ) + def test_url_encode_builds_registered_application() -> None: + expr = url_encode(col("query_text")) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "url_encode" + assert column_expr_argument_count(expr) == 1 diff --git a/src/session/datafusion_format_functions.incn b/src/session/datafusion_format_functions.incn index 57f7a97..6c0df44 100644 --- a/src/session/datafusion_format_functions.incn +++ b/src/session/datafusion_format_functions.incn @@ -1,19 +1,19 @@ """DataFusion execution bridge for InQL-owned scalar format functions.""" from rust::crc32fast import Hasher as Crc32Hasher -from rust::datafusion::arrow::array import ArrayRef, BooleanArray, Int64Array, StringArray +from rust::datafusion::arrow::array import Array, ArrayRef, BooleanArray, Int64Array, MapArray, StringArray +from rust::datafusion::arrow::array::builder import MapBuilder, StringBuilder from rust::datafusion::arrow::datatypes import DataType +from rust::datafusion::arrow::error import ArrowError from rust::datafusion::execution::context import SessionContext from rust::datafusion_common import DataFusionError, ScalarValue from rust::datafusion_expr import ColumnarValue, ScalarFunctionImplementation, ScalarUDF, Volatility, create_udf -from rust::sha1 import Sha1 -from rust::sha2 import Digest from rust::std::primitive import i64 as RustI64, usize as RustUsize from rust::std::string import String as RustString from rust::std::sync import Arc from rust::url import Url -from rust::xxhash_rust::xxh64 import Xxh64 from std.encoding.hex import hexlify +from std.hash import sha1 as std_sha1, xxh64 as std_xxh64 from std.json import JsonError, JsonKind, JsonValue @@ -68,7 +68,7 @@ pub def register_format_udfs(ctx: SessionContext) 
-> None: ctx.register_udf( _unary_string_udf("schema_of_csv", Arc.from((args) => _apply_unary_string("schema_of_csv", args.to_vec()))), ) - ctx.register_udf(_binary_string_udf("from_csv", Arc.from((args) => _apply_binary_string("from_csv", args.to_vec())))) + ctx.register_udf(_binary_csv_map_udf("from_csv", Arc.from((args) => _apply_from_csv(args.to_vec())))) ctx.register_udf(_unary_string_udf("to_csv", Arc.from((args) => _apply_unary_string("to_csv", args.to_vec())))) @@ -92,6 +92,17 @@ def _binary_string_udf(name: str, implementation: ScalarFunctionImplementation) return create_udf(name, [DataType.Utf8, DataType.Utf8], DataType.Utf8, Volatility.Immutable, implementation) +def _binary_csv_map_udf(name: str, implementation: ScalarFunctionImplementation) -> ScalarUDF: + """Create one binary string-to-map DataFusion UDF for CSV row parsing.""" + return create_udf(name, [DataType.Utf8, DataType.Utf8], _csv_map_data_type(), Volatility.Immutable, implementation) + + +def _csv_map_data_type() -> DataType: + """Return the Arrow map type emitted by the CSV parser UDF.""" + mut builder = MapBuilder.new(None, StringBuilder.new(), StringBuilder.new()) + return builder.finish().data_type().clone() + + def _apply_unary_string(name: str, args: list[ColumnarValue]) -> Result[ColumnarValue, DataFusionError]: """Apply one unary string function across scalar and array DataFusion inputs.""" _require_arity(args, 1)? @@ -157,6 +168,28 @@ def _apply_binary_string(name: str, args: list[ColumnarValue]) -> Result[Columna return Ok(ColumnarValue.Array(array_ref)) +def _apply_from_csv(args: list[ColumnarValue]) -> Result[ColumnarValue, DataFusionError]: + """Apply the CSV row parser as an Arrow map-returning DataFusion UDF.""" + _require_arity(args, 2)? + if _all_scalar(args): + return Ok( + ColumnarValue.Scalar( + ScalarValue.Map( + Arc.from(_csv_map_array([_string_at(args[0].clone(), 0)?], [_string_at(args[1].clone(), 0)?])?), + ), + ), + ) + + row_count = _row_count(args)? 
+ mut values: list[Option[str]] = [] + mut schemas: list[Option[str]] = [] + for idx in range(row_count): + values.append(_string_at(args[0].clone(), idx)?) + schemas.append(_string_at(args[1].clone(), idx)?) + array_ref: ArrayRef = Arc.from(_csv_map_array(values, schemas)?) + return Ok(ColumnarValue.Array(array_ref)) + + def _eval_unary_string(name: str, value: Option[str]) -> Result[Option[str], DataFusionError]: """Dispatch unary string-returning format functions.""" match name: @@ -197,14 +230,13 @@ def _eval_binary_string(name: str, left: Option[str], right: Option[str]) -> Res "json_extract_path_text" => return _json_path_value(left, right, true) "from_json" => return _from_json(left, right) "try_from_json" => return _try_from_json(left, right) - "from_csv" => return _from_csv(left, right) _ => return _datafusion_error(f"unknown binary format function `{name}`") def _hash_sha1(value: Option[str]) -> Result[Option[str], DataFusionError]: - """Hash one optional string with SHA-1.""" + """Hash one optional string with std.hash SHA-1.""" match value: - Some(text) => return Ok(Some(hexlify(_sha1_digest(_utf8_bytes(text))))) + Some(text) => return Ok(Some(hexlify(std_sha1.digest(_utf8_bytes(text))))) None => return Ok(None) @@ -219,9 +251,9 @@ def _hash_crc32(value: Option[str]) -> Result[Option[str], DataFusionError]: def _hash_xxhash64(value: Option[str]) -> Result[Option[str], DataFusionError]: - """Hash one optional string with xxHash64.""" + """Hash one optional string with std.hash xxHash64.""" match value: - Some(text) => return Ok(Some(hexlify(_u64_to_be_bytes(_xxhash64(_utf8_bytes(text)))))) + Some(text) => return Ok(Some(hexlify(_u64_to_be_bytes(std_xxh64.hash_u64(_utf8_bytes(text)))))) None => return Ok(None) @@ -278,39 +310,18 @@ def _try_url_decode(value: Option[str]) -> Result[Option[str], DataFusionError]: def _parse_url(value: Option[str], part: Option[str]) -> Result[Option[str], DataFusionError]: - """Extract one URL part with the Rust URL parser 
through Incan interop.""" + """Extract the first query parameter value with the Rust URL parser.""" match value: Some(raw_url) => match part: Some(raw_part) => match Url.parse(raw_url): Ok(parsed) => - normalized = raw_part.lower() - match normalized: - "scheme" => return Ok(Some(f"{parsed.scheme()}")) - "host" => - match parsed.host_str(): - Some(host) => return Ok(Some(f"{host}")) - None => return Ok(None) - "path" => return Ok(Some(f"{parsed.path()}")) - "query" => - match parsed.query(): - Some(query) => return Ok(Some(f"{query}")) - None => return Ok(None) - "fragment" => - match parsed.fragment(): - Some(fragment) => return Ok(Some(f"{fragment}")) - None => return Ok(None) - "port" => - match parsed.port(): - Some(port) => return Ok(Some(str(port))) - None => return Ok(None) - "username" => return Ok(Some(f"{parsed.username()}")) - "password" => - match parsed.password(): - Some(password) => return Ok(Some(f"{password}")) - None => return Ok(None) - _ => return _datafusion_error(f"unknown parse_url part `{raw_part}`") + for pair in parsed.query_pairs(): + key, query_value = pair + if f"{key}" == raw_part: + return Ok(Some(f"{query_value}")) + return Ok(None) Err(err) => return _datafusion_error(err.to_string()) None => return Ok(None) None => return Ok(None) @@ -418,26 +429,36 @@ def _schema_of_csv(value: Option[str]) -> Result[Option[str], DataFusionError]: None => return Ok(None) -def _from_csv(value: Option[str], schema: Option[str]) -> Result[Option[str], DataFusionError]: - """Parse one CSV row string into a JSON array or schema-named object string.""" - match value: - Some(text) => - fields = _csv_fields(text)? 
- names = _csv_schema_field_names(schema.unwrap_or("")) - if len(names) == 0: - mut values: list[JsonValue] = [] - for field in fields: - values.append(JsonValue.string(field)) - return Ok(Some(_json_text(JsonValue.array(values))?)) - - mut entries: Dict[str, JsonValue] = {} - for idx, name in enumerate(names): - if idx < len(fields): - entries[name] = JsonValue.string(fields[idx]) +def _csv_map_array(values: list[Option[str]], schemas: list[Option[str]]) -> Result[MapArray, DataFusionError]: + """Parse CSV rows into Arrow maps keyed by schema names or positional `_cN` names.""" + mut builder = MapBuilder.new(None, StringBuilder.new(), StringBuilder.new()) + for row_idx, value in enumerate(values): + match value: + Some(text) => + fields = _csv_fields(text)? + names = _csv_schema_field_names(schemas[row_idx].unwrap_or("")) + if len(names) == 0: + for idx, field in enumerate(fields): + builder.keys().append_value(f"_c{idx}") + builder.values().append_value(field) else: - entries[name] = JsonValue.string("") - return Ok(Some(_json_text(JsonValue.object(entries))?)) - None => return Ok(None) + for idx, name in enumerate(names): + builder.keys().append_value(name) + if idx < len(fields): + builder.values().append_value(fields[idx]) + else: + empty_value = str("") + builder.values().append_value(empty_value) + _append_csv_map_slot(builder.append(true))? + None => _append_csv_map_slot(builder.append(false))? 
+ return Ok(builder.finish()) + + +def _append_csv_map_slot(result: Result[None, ArrowError]) -> Result[None, DataFusionError]: + """Convert Arrow map-builder append errors into the DataFusion UDF error channel.""" + match result: + Ok(_) => return Ok(None) + Err(err) => return _datafusion_error(err.to_string()) def _to_csv(value: Option[str]) -> Result[Option[str], DataFusionError]: @@ -746,20 +767,6 @@ def _utf8_bytes(value: str) -> bytes: return RustString.from(value).into_bytes() -def _sha1_digest(data: bytes) -> bytes: - """Return SHA-1 digest bytes.""" - mut hasher = Sha1.default() - Digest.update(hasher, data.as_slice()) - return hasher.finalize_reset().to_vec() - - -def _xxhash64(data: bytes) -> u64: - """Return xxHash64 with the same zero-seed contract as InQL's public helper.""" - mut hasher = Xxh64.default() - hasher.update(data.as_slice()) - return hasher.digest() - - def _u32_to_be_bytes(value: u32) -> bytes: """Encode a 32-bit integer as big-endian bytes.""" number = int(value) diff --git a/tests/test_format_functions.incn b/tests/test_format_functions.incn index 15b3033..3a8bfe3 100644 --- a/tests/test_format_functions.incn +++ b/tests/test_format_functions.incn @@ -4,9 +4,9 @@ from std.testing import assert_raises from functions import col, get_json_object, json_extract_path_text, parse_url -def _call_parse_url_with_unknown_part() -> None: - """Call parse_url with an unsupported literal part for ValueError assertions.""" - parse_url(col("landing_page"), "authority") +def _call_parse_url_with_empty_part() -> None: + """Call parse_url with an empty literal key for ValueError assertions.""" + parse_url(col("landing_page"), "") return @@ -24,6 +24,6 @@ def _call_json_extract_path_text_with_invalid_path() -> None: def test_format_functions__literal_parse_options_are_validated_early() -> None: # -- Arrange / Act / Assert -- - assert_raises[ValueError](_call_parse_url_with_unknown_part) + assert_raises[ValueError](_call_parse_url_with_empty_part) 
assert_raises[ValueError](_call_get_json_object_with_invalid_path) assert_raises[ValueError](_call_json_extract_path_text_with_invalid_path) diff --git a/tests/test_function_registry.incn b/tests/test_function_registry.incn index 723a805..c5982c1 100644 --- a/tests/test_function_registry.incn +++ b/tests/test_function_registry.incn @@ -444,7 +444,7 @@ def _exercise_current_public_helpers() -> None: sha1(status) crc32(status) xxhash64(status) - parse_url(status, "host") + parse_url(status, "page") url_encode(status) url_decode(status) try_url_decode(status) @@ -941,7 +941,7 @@ def test_function_registry__public_helpers_preserve_existing_behavior() -> None: hash_exprs = [md5(status), sha2(status, 256), sha224(status), sha256(status), sha384(status), sha512(status), sha1( status, ), crc32(status), xxhash64(status)] - format_exprs = [parse_url(status, "host"), url_encode(status), url_decode(status), try_url_decode(status), parse_json( + format_exprs = [parse_url(status, "page"), url_encode(status), url_decode(status), try_url_decode(status), parse_json( status, ), check_json(status), schema_of_json(status), get_json_object(status, "$.type"), json_extract_path_text( status, diff --git a/tests/test_session_projection.incn b/tests/test_session_projection.incn index d22957f..0f2c349 100644 --- a/tests/test_session_projection.incn +++ b/tests/test_session_projection.incn @@ -298,13 +298,13 @@ def test_session_projection__collect_executes_json_url_and_csv_format_functions( "json_try_invalid", try_from_json(lit("{"), "STRUCT"), ).with_column("json_string", to_json(lit("paid"))).with_column( - "url_host", - parse_url(lit("https://example.com/orders?id=7#top"), "host"), + "url_page", + parse_url(lit("https://example.com/orders?page=2&id=7#top"), "page"), ).with_column("url_encoded", url_encode(lit("a b"))).with_column("url_decoded", url_decode(lit("a%20b"))).with_column( "url_try_invalid", try_url_decode(lit("%zz")), ).with_column("csv_schema", 
schema_of_csv(csv_payload)).with_column( - "csv_json", + "csv_map", from_csv(csv_payload, "STRUCT"), ).with_column("csv_line", to_csv(lit("[\"42\",\"paid\"]"))) df = _collect_or_fail(session, projected) @@ -316,11 +316,12 @@ def test_session_projection__collect_executes_json_url_and_csv_format_functions( assert len(resolved) == 19, "projection should expose all appended format outputs" assert payload.contains("click"), "JSON path extraction should return scalar text" assert payload.contains("{\"id\":7}"), "get_json_object should return nested JSON text" - assert payload.contains("example.com"), "parse_url should extract the host" + assert payload.contains("2"), "parse_url should extract arbitrary query parameter keys" assert payload.contains("a%20b"), "url_encode should percent-encode spaces" assert payload.contains("a b"), "url_decode should decode valid percent escapes" assert payload.contains("STRUCT<"), "schema helpers should emit deterministic schema text" - assert payload.contains("{\"id\":\"42\",\"status\":\"paid\"}"), "from_csv should use schema field names" + assert payload.contains("id: 42"), "from_csv should use schema field names in a logical map value" + assert payload.contains("status: paid"), "from_csv should retain CSV field values in a logical map value" assert payload.contains("42,paid"), "to_csv should serialize JSON arrays as CSV rows" diff --git a/tests/test_substrait_plan.incn b/tests/test_substrait_plan.incn index 56cf70f..b49a8f8 100644 --- a/tests/test_substrait_plan.incn +++ b/tests/test_substrait_plan.incn @@ -489,7 +489,7 @@ def test_plan__core_scalar_extension_mappings_lower_to_substrait() -> None: _assert_scalar_expr_lowers(sha1(col("status"))) _assert_scalar_expr_lowers(crc32(col("status"))) _assert_scalar_expr_lowers(xxhash64(col("status"))) - _assert_scalar_expr_lowers(parse_url(col("status"), "host")) + _assert_scalar_expr_lowers(parse_url(col("status"), "page")) _assert_scalar_expr_lowers(url_encode(col("status"))) 
_assert_scalar_expr_lowers(url_decode(col("status"))) _assert_scalar_expr_lowers(try_url_decode(col("status"))) From 6124e4cc3b85f93b759a0de7e09b4be1a8c7a721 Mon Sep 17 00:00:00 2001 From: Danny Meijer Date: Fri, 29 May 2026 12:02:47 +0200 Subject: [PATCH 09/13] feature - 39 add model-derived format schemas --- .github/workflows/ci.yml | 2 +- docs/language/reference/functions/format.md | 24 +++++-- .../022_semi_structured_format_functions.md | 11 +++- incan.lock | 18 +++--- src/functions/csv/from_csv.incn | 22 ++++++- src/functions/format_schema.incn | 13 ++++ src/functions/json/from_json.incn | 18 +++++- src/functions/json/try_from_json.incn | 18 +++++- src/functions/mod.incn | 1 + src/lib.incn | 3 + src/substrait/schema.incn | 64 ++++++++++++++++++- tests/test_model_substrait_schema.incn | 37 +++++++++++ tests/test_session_projection.incn | 24 +++++-- 13 files changed, 222 insertions(+), 33 deletions(-) create mode 100644 src/functions/format_schema.incn diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 0c92e87..07a64c6 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -20,7 +20,7 @@ concurrency: env: CARGO_TERM_COLOR: always INCAN_REF: release/v0.3 - EXPECTED_INCAN_VERSION: 0.3.0-rc24 + EXPECTED_INCAN_VERSION: 0.3.0-rc25 RUST_BACKTRACE: 1 INCAN_NO_BANNER: 1 diff --git a/docs/language/reference/functions/format.md b/docs/language/reference/functions/format.md index ed3c859..826be08 100644 --- a/docs/language/reference/functions/format.md +++ b/docs/language/reference/functions/format.md @@ -27,23 +27,33 @@ The format catalog includes deterministic hashes, URL helpers, JSON helpers, and | `json_object_keys(expr)` | Return object keys from a JSON object payload as a JSON array string. | | `get_json_object(expr, path)` | Extract a JSON value at a literal path and return it as JSON text. | | `json_extract_path_text(expr, path)` | Extract a JSON value at a literal path and return scalar strings as plain text. 
| -| `from_json(expr, schema)` | Validate JSON with an explicit schema description and return a normalized JSON payload string. | -| `try_from_json(expr, schema)` | Validate JSON with an explicit schema description and return null when the payload is invalid. | +| `from_json(expr, schema)` | Validate JSON with a schema string or `RowShapeSpec` and return a normalized JSON payload string. | +| `try_from_json(expr, schema)` | Validate JSON with a schema string or `RowShapeSpec` and return null when the payload is invalid. | | `to_json(expr)` | Serialize a scalar expression as JSON text. | | `schema_of_csv(expr)` | Infer a deterministic schema description from a CSV row string. | -| `from_csv(expr, schema)` | Parse a CSV row string into a logical map keyed by schema field names, or `_cN` positional keys when no schema is provided. | +| `from_csv(expr, schema)` | Parse a CSV row string into a logical map keyed by schema field names from a schema string or `RowShapeSpec`, or `_cN` positional keys when no schema is provided. | | `to_csv(expr)` | Serialize a scalar or JSON array/object payload as a CSV row string. 
| ```incan from pub::inql.functions import col, from_csv, from_json, get_json_object, parse_url, sha2, to_json +from pub::inql import row_shape_from_model + +@derive(Clone) +model EventPayload: + type_ as "type": str + +@derive(Clone) +model CsvRowSchema: + id: int + status: str projected = ( events .with_column("user_hash", sha2(col("user_id"), 256)) .with_column("campaign", parse_url(col("landing_page"), "utm_campaign")) .with_column("event_type", get_json_object(col("payload"), "$.type")) - .with_column("payload_obj", from_json(col("payload"), "STRUCT")) - .with_column("row_fields", from_csv(col("csv_line"), "STRUCT")) + .with_column("payload_obj", from_json(col("payload"), row_shape_from_model(EventPayload(type="click")))) + .with_column("row_fields", from_csv(col("csv_line"), row_shape_from_model(CsvRowSchema(id=0, status="")))) .with_column("payload_out", to_json(col("event_type"))) ) ``` @@ -52,7 +62,9 @@ Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal stri `384`, and `512`; other digest lengths are rejected during expression construction. JSON helpers validate, normalize, and project payload text. CSV parsing returns logical map values instead of JSON text. -These helpers do not read external files or introduce a dynamic variant value type. +Explicit-schema JSON and CSV helpers accept either a compact schema string or a `RowShapeSpec`; `row_shape_from_model(...)` +derives that shape from compiler-provided Incan model field metadata. These helpers do not read external files or introduce +a dynamic variant value type. The DataFusion adapter executes the full RFC 022 catalog with native DataFusion functions where available and Incan-authored adapter callbacks for helpers that DataFusion does not expose natively. 
diff --git a/docs/rfcs/022_semi_structured_format_functions.md b/docs/rfcs/022_semi_structured_format_functions.md index ae4303c..3c70133 100644 --- a/docs/rfcs/022_semi_structured_format_functions.md +++ b/docs/rfcs/022_semi_structured_format_functions.md @@ -48,11 +48,16 @@ Without a separate RFC, format helpers risk leaking ingestion policy into the sc Authors should be able to parse and produce semi-structured scalar values inside relational transformations: ```incan +from pub::inql import row_shape_from_model from pub::inql.functions import col, from_json, get_json_object, sha2, to_json +@derive(Clone) +model EventPayload: + type_ as "type": str + events = ( raw_events - .with_column("payload_obj", from_json(col("payload"), EventPayload)) + .with_column("payload_obj", from_json(col("payload"), row_shape_from_model(EventPayload(type="click")))) .with_column("event_type", get_json_object(col("payload"), "$.type")) .with_column("user_hash", sha2(col("user_id"), 256)) .with_column("payload_out", to_json(col("payload_obj"))) @@ -63,9 +68,9 @@ Reading a JSON file into a dataset remains a session/source operation, not a sca ## Reference-level explanation (precise rules) -InQL should define JSON functions including `from_json`, `to_json`, `get_json_object`, `json_array_length`, `json_object_keys`, `schema_of_json`, `parse_json`, `check_json`, `json_extract_path_text`, and `try_from_json` where recoverable parse behavior is desired. +InQL should define JSON functions including `from_json`, `to_json`, `get_json_object`, `json_array_length`, `json_object_keys`, `schema_of_json`, `parse_json`, `check_json`, `json_extract_path_text`, and `try_from_json` where recoverable parse behavior is desired. Explicit-schema JSON helpers accept schema descriptions and model-derived row shapes. -InQL should define CSV value functions including `from_csv`, `to_csv`, and `schema_of_csv` only insofar as they operate on scalar values or schema descriptions. 
`from_csv` returns a logical map keyed by schema field names, or `_cN` positional keys when no schema is supplied. These functions must not replace the session CSV read/write contract. +InQL should define CSV value functions including `from_csv`, `to_csv`, and `schema_of_csv` only insofar as they operate on scalar values, schema descriptions, or model-derived row shapes. `from_csv` returns a logical map keyed by schema field names, or `_cN` positional keys when no schema is supplied. These functions must not replace the session CSV read/write contract. InQL should define URL functions including `parse_url`, `url_encode`, `url_decode`, and `try_url_decode`, with exact invalid-input behavior recorded in the registry. `parse_url` extracts query parameter values by key. diff --git a/incan.lock b/incan.lock index dbda203..938e3aa 100644 --- a/incan.lock +++ b/incan.lock @@ -3,7 +3,7 @@ [incan] format = 1 -incan-version = "0.3.0-rc24" +incan-version = "0.3.0-rc25" deps-fingerprint = "sha256:c857db33615865c432fa4c9b3947975ccb4255249143ecc1676c26e85981aafb" cargo-features = [] cargo-no-default-features = false @@ -481,9 +481,9 @@ dependencies = [ [[package]] name = "cc" -version = "1.2.62" +version = "1.2.63" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a1dce859f0832a7d088c4f1119888ab94ef4b5d6795d1ce05afb7fe159d79f98" +checksum = "556e016178bb5662a08681bbe0f00f8e17631781a4dfc8c45e466e4b185ec27f" dependencies = [ "find-msvc-tools", "jobserver", @@ -1824,14 +1824,14 @@ dependencies = [ [[package]] name = "incan_core" -version = "0.3.0-rc24" +version = "0.3.0-rc25" dependencies = [ "serde", ] [[package]] name = "incan_derive" -version = "0.3.0-rc24" +version = "0.3.0-rc25" dependencies = [ "proc-macro2", "quote", @@ -1840,7 +1840,7 @@ dependencies = [ [[package]] name = "incan_stdlib" -version = "0.3.0-rc24" +version = "0.3.0-rc25" dependencies = [ "incan_core", "incan_derive", @@ -1864,7 +1864,7 @@ dependencies = [ [[package]] name = "inql" 
-version = "0.3.0-rc24" +version = "0.3.0-rc25" dependencies = [ "blake2", "blake3", @@ -2788,9 +2788,9 @@ dependencies = [ [[package]] name = "shlex" -version = "1.3.0" +version = "2.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64" +checksum = "f8fadd59c855ef2080decdef8ff161eb6661b86933c9d82e5ba29dc602a55aba" [[package]] name = "simd-adler32" diff --git a/src/functions/csv/from_csv.incn b/src/functions/csv/from_csv.incn index 17c493c..2132869 100644 --- a/src/functions/csv/from_csv.incn +++ b/src/functions/csv/from_csv.incn @@ -10,6 +10,7 @@ from function_registry import ( v0_1, ) from functions.registry import register_function, registered_application +from functions.format_schema import FormatSchema, format_schema_description from projection_builders import ColumnExpr, str_expr from substrait.function_extensions import FROM_CSV_FUNCTION_ANCHOR @@ -21,18 +22,19 @@ from substrait.function_extensions import FROM_CSV_FUNCTION_ANCHOR error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, substrait=extension_mapping("from_csv", FROM_CSV_FUNCTION_ANCHOR), )) -pub def from_csv(expr: ColumnExpr, schema: str) -> ColumnExpr: +pub def from_csv(expr: ColumnExpr, schema: FormatSchema) -> ColumnExpr: """ Parse one CSV row string into a logical map keyed by schema field names. Examples: row = from_csv(col("line"), "STRUCT") + typed_row = from_csv(col("line"), row_shape_from_model(Customer(id=0, name=""))) Parameters: expr: String expression containing one CSV row. - schema: Explicit schema description used for output keys, or an empty string for `_cN` positional keys. + schema: Schema description or `RowShapeSpec` used for output keys, or an empty string for `_cN` positional keys. 
""" - return registered_application("from_csv", [expr, str_expr(schema)]) + return registered_application("from_csv", [expr, str_expr(format_schema_description(schema))]) module tests: @@ -43,8 +45,22 @@ module tests: column_expr_function_name, column_expr_kind, ) + from substrait.schema import RowColumnSpec, RowShapeSpec, SubstraitPrimitiveKind def test_from_csv_builds_registered_application() -> None: expr = from_csv(col("line"), "STRUCT") assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction assert column_expr_function_name(expr) == "from_csv" assert column_expr_argument_count(expr) == 2 + def test_from_csv_accepts_row_shape_schema() -> None: + shape = RowShapeSpec( + model_name="CsvOrder", + columns=[RowColumnSpec(name="id", kind=SubstraitPrimitiveKind.I64, nullable=false), RowColumnSpec( + name="status", + kind=SubstraitPrimitiveKind.String, + nullable=false, + )], + ) + expr = from_csv(col("line"), shape) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "from_csv" + assert column_expr_argument_count(expr) == 2 diff --git a/src/functions/format_schema.incn b/src/functions/format_schema.incn new file mode 100644 index 0000000..c15a4fc --- /dev/null +++ b/src/functions/format_schema.incn @@ -0,0 +1,13 @@ +"""Shared schema-argument normalization for scalar format helpers.""" + +from substrait.schema import RowShapeSpec, row_shape_to_schema_description + + +pub type FormatSchema = Union[str, RowShapeSpec] + + +pub def format_schema_description(schema: FormatSchema) -> str: + """Normalize compact schema strings and model-derived row shapes to one schema description.""" + match schema: + str(text) => return text + RowShapeSpec(shape) => return row_shape_to_schema_description(shape) diff --git a/src/functions/json/from_json.incn b/src/functions/json/from_json.incn index 37984e9..05ae9c9 100644 --- a/src/functions/json/from_json.incn +++ b/src/functions/json/from_json.incn @@ -9,6 +9,7 @@ from 
function_registry import ( extension_mapping, v0_1, ) +from functions.format_schema import FormatSchema, format_schema_description from functions.registry import register_function, registered_application from projection_builders import ColumnExpr, str_expr from substrait.function_extensions import FROM_JSON_FUNCTION_ANCHOR @@ -21,18 +22,19 @@ from substrait.function_extensions import FROM_JSON_FUNCTION_ANCHOR error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic, substrait=extension_mapping("from_json", FROM_JSON_FUNCTION_ANCHOR), )) -pub def from_json(expr: ColumnExpr, schema: str) -> ColumnExpr: +pub def from_json(expr: ColumnExpr, schema: FormatSchema) -> ColumnExpr: """ Validate JSON using an explicit schema description and return a normalized JSON payload. Examples: payload = from_json(col("payload"), "STRUCT") + typed_payload = from_json(col("payload"), row_shape_from_model(EventPayload(type="click"))) Parameters: expr: String expression containing a JSON payload. - schema: Explicit schema description for the expected JSON payload. + schema: Schema description or `RowShapeSpec` for the expected JSON payload. 
""" - return registered_application("from_json", [expr, str_expr(schema)]) + return registered_application("from_json", [expr, str_expr(format_schema_description(schema))]) module tests: @@ -43,8 +45,18 @@ module tests: column_expr_function_name, column_expr_kind, ) + from substrait.schema import RowColumnSpec, RowShapeSpec, SubstraitPrimitiveKind def test_from_json_builds_registered_application() -> None: expr = from_json(col("payload"), "STRUCT") assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction assert column_expr_function_name(expr) == "from_json" assert column_expr_argument_count(expr) == 2 + def test_from_json_accepts_row_shape_schema() -> None: + shape = RowShapeSpec( + model_name="EventPayload", + columns=[RowColumnSpec(name="type", kind=SubstraitPrimitiveKind.String, nullable=false)], + ) + expr = from_json(col("payload"), shape) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "from_json" + assert column_expr_argument_count(expr) == 2 diff --git a/src/functions/json/try_from_json.incn b/src/functions/json/try_from_json.incn index 6cc3c83..1e8aa73 100644 --- a/src/functions/json/try_from_json.incn +++ b/src/functions/json/try_from_json.incn @@ -8,6 +8,7 @@ from function_registry import ( extension_mapping, v0_1, ) +from functions.format_schema import FormatSchema, format_schema_description from functions.registry import register_function, registered_application from projection_builders import ColumnExpr, str_expr from substrait.function_extensions import TRY_FROM_JSON_FUNCTION_ANCHOR @@ -19,18 +20,19 @@ from substrait.function_extensions import TRY_FROM_JSON_FUNCTION_ANCHOR null_behavior=FunctionNullBehavior.DependsOnInputs, substrait=extension_mapping("try_from_json", TRY_FROM_JSON_FUNCTION_ANCHOR), )) -pub def try_from_json(expr: ColumnExpr, schema: str) -> ColumnExpr: +pub def try_from_json(expr: ColumnExpr, schema: FormatSchema) -> ColumnExpr: """ Parse JSON with an explicit schema 
description and return null when the payload is invalid. Examples: payload = try_from_json(col("payload"), "STRUCT") + typed_payload = try_from_json(col("payload"), row_shape_from_model(EventPayload(type="click"))) Parameters: expr: String expression containing a JSON payload. - schema: Explicit schema description for the expected JSON payload. + schema: Schema description or `RowShapeSpec` for the expected JSON payload. """ - return registered_application("try_from_json", [expr, str_expr(schema)]) + return registered_application("try_from_json", [expr, str_expr(format_schema_description(schema))]) module tests: @@ -41,8 +43,18 @@ module tests: column_expr_function_name, column_expr_kind, ) + from substrait.schema import RowColumnSpec, RowShapeSpec, SubstraitPrimitiveKind def test_try_from_json_builds_registered_application() -> None: expr = try_from_json(col("payload"), "STRUCT") assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction assert column_expr_function_name(expr) == "try_from_json" assert column_expr_argument_count(expr) == 2 + def test_try_from_json_accepts_row_shape_schema() -> None: + shape = RowShapeSpec( + model_name="EventPayload", + columns=[RowColumnSpec(name="type", kind=SubstraitPrimitiveKind.String, nullable=false)], + ) + expr = try_from_json(col("payload"), shape) + assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction + assert column_expr_function_name(expr) == "try_from_json" + assert column_expr_argument_count(expr) == 2 diff --git a/src/functions/mod.incn b/src/functions/mod.incn index 6a2ae8b..5294069 100644 --- a/src/functions/mod.incn +++ b/src/functions/mod.incn @@ -15,6 +15,7 @@ pub from functions.registry import ( function_registry_function_refs, registered_substrait_mapped_function_refs, ) +pub from functions.format_schema import FormatSchema pub from functions.references.col import col pub from functions.literals.always_false import always_false pub from functions.literals.always_true import always_true diff --git 
a/src/lib.incn b/src/lib.incn index 79a72eb..a039213 100644 --- a/src/lib.incn +++ b/src/lib.incn @@ -59,6 +59,7 @@ pub from projection_builders import ( ) pub from functions.registry import function_registry pub from functions import ( + FormatSchema, function_registry_canonical_names, function_registry_entries, function_registry_entry, @@ -254,6 +255,8 @@ pub from substrait.schema import ( demo_customer_row_shape, row_shape_to_named_struct, row_shape_field_names, + row_shape_from_model, + row_shape_to_schema_description, row_shape_to_substrait_struct_type, substrait_row_type_encoded_len, ) diff --git a/src/substrait/schema.incn b/src/substrait/schema.incn index efe73d8..a655b19 100644 --- a/src/substrait/schema.incn +++ b/src/substrait/schema.incn @@ -2,7 +2,7 @@ Describe an Incan row shape and lower it to Substrait schema structures. The demo model `DemoCustomer` is hand-authored; `demo_customer_row_shape()` lists columns in the same declaration order. -A future compiler pass would derive `RowShapeSpec` from `model` AST. +`row_shape_from_model(...)` derives `RowShapeSpec` from runtime reflection metadata on model or class values. `NamedStruct` (names + struct) is the full read-root shape; this module emits the inner column types and exposes names separately until Incan can set prost's `r#struct` field ergonomically. 
@@ -22,6 +22,7 @@ from rust::substrait::proto::type import ( String as SubstraitString, Struct, ) +from rust::incan_stdlib::errors import raise_value_error @derive(Clone) @@ -69,6 +70,67 @@ pub def row_shape_field_names(shape: RowShapeSpec) -> list[str]: return [col.name for col in shape.columns] +def _primitive_kind_from_type_name(type_name: str) -> SubstraitPrimitiveKind: + """Map Incan field reflection type names to the primitive schema kinds InQL can lower today.""" + if type_name == "bool": + return SubstraitPrimitiveKind.Bool + if type_name == "int": + return SubstraitPrimitiveKind.I64 + if type_name == "float": + return SubstraitPrimitiveKind.F64 + if type_name == "str": + return SubstraitPrimitiveKind.String + message = f"unsupported model field type `{type_name}` for InQL row shape" + return raise_value_error(message) + + +def _schema_type_name(kind: SubstraitPrimitiveKind) -> str: + """Return the compact schema spelling used by scalar format helpers.""" + match kind: + SubstraitPrimitiveKind.Bool => return "BOOLEAN" + SubstraitPrimitiveKind.I64 => return "BIGINT" + SubstraitPrimitiveKind.F64 => return "DOUBLE" + SubstraitPrimitiveKind.Timestamp => return "TIMESTAMP" + SubstraitPrimitiveKind.String => return "STRING" + + +def _field_type_name(raw_type_name: str) -> str: + """Return the non-optional part of one reflected field type name.""" + if raw_type_name.startswith("Option[") and raw_type_name.endswith("]"): + return raw_type_name[7:len(raw_type_name) - 1] + return raw_type_name + + +def _field_is_nullable(raw_type_name: str) -> bool: + """Return whether one reflected field type is optional.""" + return raw_type_name.startswith("Option[") and raw_type_name.endswith("]") + + +pub def row_shape_from_model[T](value: T) -> RowShapeSpec: + """Derive a flat row shape from an Incan model or class value using compiler-provided field metadata.""" + mut columns: list[RowColumnSpec] = [] + for field in value.__fields__(): + raw_type_name = str(field.type_name) + 
field_type_name = _field_type_name(raw_type_name) + columns.append( + RowColumnSpec( + name=str(field.wire_name), + kind=_primitive_kind_from_type_name(field_type_name), + nullable=_field_is_nullable(raw_type_name), + ), + ) + return RowShapeSpec(model_name=str(value.__class_name__()), columns=columns) + + +pub def row_shape_to_schema_description(shape: RowShapeSpec) -> str: + """Render a flat row shape as the compact `STRUCT` schema used by format helpers.""" + mut specs: list[str] = [] + for col in shape.columns: + specs.append(f"{col.name}: {_schema_type_name(col.kind)}") + fields = ", ".join(specs) + return f"STRUCT<{fields}>" + + def _type_from_primitive(kind: SubstraitPrimitiveKind, nullable: bool) -> Type: """Lower one primitive kind + nullability flag into a Substrait `Type`.""" mut n = Nullability.Required diff --git a/tests/test_model_substrait_schema.incn b/tests/test_model_substrait_schema.incn index c705cbb..cde7b90 100644 --- a/tests/test_model_substrait_schema.incn +++ b/tests/test_model_substrait_schema.incn @@ -2,12 +2,21 @@ from substrait.schema import ( demo_customer_row_shape, + row_shape_from_model, row_shape_field_names, + row_shape_to_schema_description, row_shape_to_substrait_struct_type, substrait_row_type_encoded_len, ) +@derive(Clone) +model CsvOrderSchema: + order_id as "id": int + status: str + discount: Option[float] + + def test_demo_customer_row_shape__field_order() -> None: # -- Arrange -- shape = demo_customer_row_shape() @@ -32,3 +41,31 @@ def test_row_shape_to_substrait_struct_type__encodes() -> None: # -- Assert -- assert encoded > 0, "protobuf encoding should be non-empty" + + +def test_row_shape_from_model__uses_reflected_wire_names_and_types() -> None: + # -- Arrange -- + row = CsvOrderSchema(id=42, status="paid", discount=None) + + # -- Act -- + shape = row_shape_from_model(row) + names = row_shape_field_names(shape) + + # -- Assert -- + assert shape.model_name == "CsvOrderSchema", "model name should come from reflection" + 
assert names[0] == "id", "alias should become the external schema field name" + assert names[1] == "status", "plain field name should be preserved" + assert names[2] == "discount", "optional field should be preserved" + assert shape.columns[2].nullable, "Option fields should be nullable" + + +def test_row_shape_to_schema_description__renders_csv_schema_string() -> None: + # -- Arrange -- + row = CsvOrderSchema(id=42, status="paid", discount=None) + shape = row_shape_from_model(row) + + # -- Act -- + schema = row_shape_to_schema_description(shape) + + # -- Assert -- + assert schema == "STRUCT<id: BIGINT, status: STRING, discount: DOUBLE>", "schema description" diff --git a/tests/test_session_projection.incn b/tests/test_session_projection.incn index 0f2c349..1f8eb8b 100644 --- a/tests/test_session_projection.incn +++ b/tests/test_session_projection.incn @@ -57,6 +57,7 @@ from dataset import DataFrame, LazyFrame from session import Session, SessionErrorKind from std.testing import assert_is_err, assert_is_ok, fail_t from substrait.inspect import relation_kind_name, root_names, root_rel, sort_field_count, sort_field_direction_name +from substrait.schema import row_shape_from_model @derive(Clone) @@ -65,6 +66,17 @@ pub model AggregateOrder: pub amount: int +@derive(Clone) +pub model CsvProjectionSchema: + pub id: int + pub status: str + + +@derive(Clone) +pub model JsonProjectionSchema: + pub type_ as "type": str + + const AGGREGATE_ORDERS_CSV_FIXTURE: str = "tests/fixtures/aggregate_orders.csv" @@ -294,9 +306,9 @@ def test_session_projection__collect_executes_json_url_and_csv_format_functions( ).with_column("json_keys", json_object_keys(json_payload)).with_column("json_schema", schema_of_json(json_payload)).with_column( "json_normalized", parse_json(json_payload), - ).with_column("json_from", from_json(json_payload, "STRUCT")).with_column( + ).with_column("json_from", from_json(json_payload, row_shape_from_model(JsonProjectionSchema(type="click")))).with_column( "json_try_invalid", - try_from_json(lit("{"), 
"STRUCT"), + try_from_json(lit("{"), row_shape_from_model(JsonProjectionSchema(type="click"))), ).with_column("json_string", to_json(lit("paid"))).with_column( "url_page", parse_url(lit("https://example.com/orders?page=2&id=7#top"), "page"), @@ -306,14 +318,17 @@ def test_session_projection__collect_executes_json_url_and_csv_format_functions( ).with_column("csv_schema", schema_of_csv(csv_payload)).with_column( "csv_map", from_csv(csv_payload, "STRUCT"), - ).with_column("csv_line", to_csv(lit("[\"42\",\"paid\"]"))) + ).with_column("csv_model_map", from_csv(csv_payload, row_shape_from_model(CsvProjectionSchema(id=0, status="")))).with_column( + "csv_line", + to_csv(lit("[\"42\",\"paid\"]")), + ) df = _collect_or_fail(session, projected) payload = df.preview_text() resolved = df.resolved_columns() # -- Assert -- assert df.row_count() == 3, "format projections should preserve input rows" - assert len(resolved) == 19, "projection should expose all appended format outputs" + assert len(resolved) == 20, "projection should expose all appended format outputs" assert payload.contains("click"), "JSON path extraction should return scalar text" assert payload.contains("{\"id\":7}"), "get_json_object should return nested JSON text" assert payload.contains("2"), "parse_url should extract arbitrary query parameter keys" @@ -322,6 +337,7 @@ def test_session_projection__collect_executes_json_url_and_csv_format_functions( assert payload.contains("STRUCT<"), "schema helpers should emit deterministic schema text" assert payload.contains("id: 42"), "from_csv should use schema field names in a logical map value" assert payload.contains("status: paid"), "from_csv should retain CSV field values in a logical map value" + assert payload.contains("csv_model_map"), "from_csv should accept a schema derived from an Incan model" assert payload.contains("42,paid"), "to_csv should serialize JSON arrays as CSV rows" From eeb0dd519b94b4c121333c841df7156e3cd272ba Mon Sep 17 00:00:00 2001 From: Danny 
Meijer Date: Fri, 29 May 2026 12:08:58 +0200 Subject: [PATCH 10/13] docs - 39 polish format schema examples --- src/functions/csv/from_csv.incn | 7 +++++-- src/functions/json/from_json.incn | 5 ++++- src/functions/json/try_from_json.incn | 5 ++++- 3 files changed, 13 insertions(+), 4 deletions(-) diff --git a/src/functions/csv/from_csv.incn b/src/functions/csv/from_csv.incn index 2132869..cf6d3d5 100644 --- a/src/functions/csv/from_csv.incn +++ b/src/functions/csv/from_csv.incn @@ -27,8 +27,11 @@ pub def from_csv(expr: ColumnExpr, schema: FormatSchema) -> ColumnExpr: Parse one CSV row string into a logical map keyed by schema field names. Examples: - row = from_csv(col("line"), "STRUCT") - typed_row = from_csv(col("line"), row_shape_from_model(Customer(id=0, name=""))) + row = from_csv(col("line"), "STRUCT") + + # With a `CsvRowSchema` model in scope: + schema = row_shape_from_model(CsvRowSchema(id=0, status="")) + typed_row = from_csv(col("line"), schema) Parameters: expr: String expression containing one CSV row. diff --git a/src/functions/json/from_json.incn b/src/functions/json/from_json.incn index 05ae9c9..8689f9a 100644 --- a/src/functions/json/from_json.incn +++ b/src/functions/json/from_json.incn @@ -28,7 +28,10 @@ pub def from_json(expr: ColumnExpr, schema: FormatSchema) -> ColumnExpr: Examples: payload = from_json(col("payload"), "STRUCT") - typed_payload = from_json(col("payload"), row_shape_from_model(EventPayload(type="click"))) + + # With an `EventPayload` model in scope: + schema = row_shape_from_model(EventPayload(type="click")) + typed_payload = from_json(col("payload"), schema) Parameters: expr: String expression containing a JSON payload. 
diff --git a/src/functions/json/try_from_json.incn b/src/functions/json/try_from_json.incn index 1e8aa73..20d9cbc 100644 --- a/src/functions/json/try_from_json.incn +++ b/src/functions/json/try_from_json.incn @@ -26,7 +26,10 @@ pub def try_from_json(expr: ColumnExpr, schema: FormatSchema) -> ColumnExpr: Examples: payload = try_from_json(col("payload"), "STRUCT") - typed_payload = try_from_json(col("payload"), row_shape_from_model(EventPayload(type="click"))) + + # With an `EventPayload` model in scope: + schema = row_shape_from_model(EventPayload(type="click")) + typed_payload = try_from_json(col("payload"), schema) Parameters: expr: String expression containing a JSON payload. From b1e48230d5b54aee0fbf72ea1289e5a7a07cc44d Mon Sep 17 00:00:00 2001 From: Danny Meijer Date: Sat, 30 May 2026 00:32:52 +0200 Subject: [PATCH 11/13] fix - 39 use model type schemas for format helpers --- .github/workflows/ci.yml | 2 +- docs/language/reference/functions/format.md | 20 +++--- .../022_semi_structured_format_functions.md | 12 ++-- incan.lock | 26 +++---- src/function_registry.incn | 72 ++++++++++++++++++- src/functions/csv/from_csv.incn | 34 +++------ src/functions/format_schema.incn | 13 ---- src/functions/json/from_json.incn | 29 +++----- src/functions/json/try_from_json.incn | 29 +++----- src/functions/mod.incn | 1 - src/functions/registry.incn | 6 +- src/lib.incn | 5 -- src/substrait/schema.incn | 36 +++------- src/substrait/schema_registry.incn | 15 +--- tests/test_function_registry.incn | 39 ++++++++-- tests/test_model_substrait_schema.incn | 54 ++++++-------- tests/test_session_projection.incn | 15 ++-- tests/test_substrait_plan.incn | 12 +++- 18 files changed, 204 insertions(+), 216 deletions(-) delete mode 100644 src/functions/format_schema.incn diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 07a64c6..86afab7 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -20,7 +20,7 @@ concurrency: env: CARGO_TERM_COLOR: always 
   INCAN_REF: release/v0.3
-  EXPECTED_INCAN_VERSION: 0.3.0-rc25
+  EXPECTED_INCAN_VERSION: 0.3.0-rc28
   RUST_BACKTRACE: 1
   INCAN_NO_BANNER: 1
 
diff --git a/docs/language/reference/functions/format.md b/docs/language/reference/functions/format.md
index 826be08..b8eb2fc 100644
--- a/docs/language/reference/functions/format.md
+++ b/docs/language/reference/functions/format.md
@@ -27,23 +27,20 @@ The format catalog includes deterministic hashes, URL helpers, JSON helpers, and
 | `json_object_keys(expr)` | Return object keys from a JSON object payload as a JSON array string. |
 | `get_json_object(expr, path)` | Extract a JSON value at a literal path and return it as JSON text. |
 | `json_extract_path_text(expr, path)` | Extract a JSON value at a literal path and return scalar strings as plain text. |
-| `from_json(expr, schema)` | Validate JSON with a schema string or `RowShapeSpec` and return a normalized JSON payload string. |
-| `try_from_json(expr, schema)` | Validate JSON with a schema string or `RowShapeSpec` and return null when the payload is invalid. |
+| `from_json[Model](expr)` | Validate JSON with a schema derived from an Incan model type and return a normalized JSON payload string. |
+| `try_from_json[Model](expr)` | Validate JSON with a schema derived from an Incan model type and return null when the payload is invalid. |
 | `to_json(expr)` | Serialize a scalar expression as JSON text. |
 | `schema_of_csv(expr)` | Infer a deterministic schema description from a CSV row string. |
-| `from_csv(expr, schema)` | Parse a CSV row string into a logical map keyed by schema field names from a schema string or `RowShapeSpec`, or `_cN` positional keys when no schema is provided. |
+| `from_csv[Model](expr)` | Parse a CSV row string into a logical map keyed by fields from an Incan model type. |
 | `to_csv(expr)` | Serialize a scalar or JSON array/object payload as a CSV row string. |
 
 ```incan
 from pub::inql.functions import col, from_csv, from_json, get_json_object, parse_url, sha2, to_json
-from pub::inql import row_shape_from_model
 
-@derive(Clone)
 model EventPayload:
     type_ as "type": str
 
-@derive(Clone)
-model CsvRowSchema:
+model CsvRow:
     id: int
     status: str
 
@@ -52,8 +49,8 @@ projected = (
     .with_column("user_hash", sha2(col("user_id"), 256))
     .with_column("campaign", parse_url(col("landing_page"), "utm_campaign"))
     .with_column("event_type", get_json_object(col("payload"), "$.type"))
-    .with_column("payload_obj", from_json(col("payload"), row_shape_from_model(EventPayload(type="click"))))
-    .with_column("row_fields", from_csv(col("csv_line"), row_shape_from_model(CsvRowSchema(id=0, status=""))))
+    .with_column("payload_obj", from_json[EventPayload](col("payload")))
+    .with_column("row_fields", from_csv[CsvRow](col("csv_line")))
     .with_column("payload_out", to_json(col("event_type")))
 )
 ```
@@ -62,9 +59,8 @@ Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal stri
 `384`, and `512`; other digest lengths are rejected during expression construction.
 
 JSON helpers validate, normalize, and project payload text. CSV parsing returns logical map values instead of JSON text.
-Explicit-schema JSON and CSV helpers accept either a compact schema string or a `RowShapeSpec`; `row_shape_from_model(...)`
-derives that shape from compiler-provided Incan model field metadata. These helpers do not read external files or introduce
-a dynamic variant value type.
+Explicit-schema JSON and CSV helpers derive their schema from Incan model type parameters. These helpers do not read
+external files or introduce a dynamic variant value type.
 
 The DataFusion adapter executes the full RFC 022 catalog with native DataFusion functions where available and
 Incan-authored adapter callbacks for helpers that DataFusion does not expose natively.
diff --git a/docs/rfcs/022_semi_structured_format_functions.md b/docs/rfcs/022_semi_structured_format_functions.md
index 3c70133..653c4f6 100644
--- a/docs/rfcs/022_semi_structured_format_functions.md
+++ b/docs/rfcs/022_semi_structured_format_functions.md
@@ -48,16 +48,14 @@ Without a separate RFC, format helpers risk leaking ingestion policy into the sc
 Authors should be able to parse and produce semi-structured scalar values inside relational transformations:
 
 ```incan
-from pub::inql import row_shape_from_model
 from pub::inql.functions import col, from_json, get_json_object, sha2, to_json
 
-@derive(Clone)
 model EventPayload:
     type_ as "type": str
 
 events = (
     raw_events
-    .with_column("payload_obj", from_json(col("payload"), row_shape_from_model(EventPayload(type="click"))))
+    .with_column("payload_obj", from_json[EventPayload](col("payload")))
     .with_column("event_type", get_json_object(col("payload"), "$.type"))
     .with_column("user_hash", sha2(col("user_id"), 256))
     .with_column("payload_out", to_json(col("payload_obj")))
@@ -68,9 +66,9 @@ Reading a JSON file into a dataset remains a session/source operation, not a sca
 
 ## Reference-level explanation (precise rules)
 
-InQL should define JSON functions including `from_json`, `to_json`, `get_json_object`, `json_array_length`, `json_object_keys`, `schema_of_json`, `parse_json`, `check_json`, `json_extract_path_text`, and `try_from_json` where recoverable parse behavior is desired. Explicit-schema JSON helpers accept schema descriptions and model-derived row shapes.
+InQL should define JSON functions including `from_json`, `to_json`, `get_json_object`, `json_array_length`, `json_object_keys`, `schema_of_json`, `parse_json`, `check_json`, `json_extract_path_text`, and `try_from_json` where recoverable parse behavior is desired. Explicit-schema JSON helpers derive schema descriptions from Incan model type parameters.
 
-InQL should define CSV value functions including `from_csv`, `to_csv`, and `schema_of_csv` only insofar as they operate on scalar values, schema descriptions, or model-derived row shapes. `from_csv` returns a logical map keyed by schema field names, or `_cN` positional keys when no schema is supplied. These functions must not replace the session CSV read/write contract.
+InQL should define CSV value functions including `from_csv`, `to_csv`, and `schema_of_csv` only insofar as they operate on scalar values or schema descriptions. `from_csv[Model](...)` returns a logical map keyed by fields from the supplied Incan model type. These functions must not replace the session CSV read/write contract.
 
 InQL should define URL functions including `parse_url`, `url_encode`, `url_decode`, and `try_url_decode`, with exact invalid-input behavior recorded in the registry. `parse_url` extracts query parameter values by key.
 
@@ -128,8 +126,8 @@ This RFC is additive. It should not change existing CSV ingestion behavior.
 - Portable concrete hash helpers are `md5`, `sha1`, `sha224`, `sha256`, `sha384`, `sha512`, `crc32`, and `xxhash64`, each with an honest Substrait extension mapping. The DataFusion adapter validates materialized execution for the full helper set, using native DataFusion functions where available and Incan-authored adapter callbacks where DataFusion has no built-in implementation.
 - `sha2(expr, bit_length)` is a compatibility helper, not a separate backend mapping. It rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths and rejects unsupported values.
 - URL helpers accept scalar URL or component strings. `parse_url(...)` returns the first query parameter value for a literal key. `url_decode(...)` is strict and fails malformed percent escapes; `try_url_decode(...)` returns null for malformed percent escapes.
-- JSON helpers accept scalar JSON payload strings. Strict helpers fail invalid JSON; `try_from_json(...)` returns null for invalid JSON; schema and path helpers are deterministic over the provided payload and literal schema/path arguments.
-- CSV helpers accept scalar row strings and explicit schema-description strings. Parsed CSV rows are logical maps rather than JSON text. They operate on payload values only and do not replace the session CSV read/write contract.
+- JSON helpers accept scalar JSON payload strings. Strict helpers fail invalid JSON; `try_from_json[Model](...)` returns null for invalid JSON; schema and path helpers are deterministic over the provided payload, model type parameter, and literal path arguments.
+- CSV helpers accept scalar row strings and explicit Incan model type parameters. Parsed CSV rows are logical maps rather than JSON text. They operate on payload values only and do not replace the session CSV read/write contract.
 - Semi-structured variant predicates such as `typeof`, `is_array`, `is_object`, `is_integer`, `is_timestamp`, and `is_null_value` belong to InQL RFC 026. RFC 022 does not accept them as JSON-text parser helpers.
 
 ## Implementation Plan
diff --git a/incan.lock b/incan.lock
index 938e3aa..a55fdf1 100644
--- a/incan.lock
+++ b/incan.lock
@@ -3,7 +3,7 @@
 
 [incan]
 format = 1
-incan-version = "0.3.0-rc25"
+incan-version = "0.3.0-rc28"
 deps-fingerprint = "sha256:c857db33615865c432fa4c9b3947975ccb4255249143ecc1676c26e85981aafb"
 cargo-features = []
 cargo-no-default-features = false
@@ -1824,14 +1824,14 @@ dependencies = [
 
 [[package]]
 name = "incan_core"
-version = "0.3.0-rc25"
+version = "0.3.0-rc28"
 dependencies = [
  "serde",
 ]
 
 [[package]]
 name = "incan_derive"
-version = "0.3.0-rc25"
+version = "0.3.0-rc28"
 dependencies = [
  "proc-macro2",
  "quote",
@@ -1840,7 +1840,7 @@ dependencies = [
 
 [[package]]
 name = "incan_stdlib"
-version = "0.3.0-rc25"
+version = "0.3.0-rc28"
 dependencies = [
  "incan_core",
  "incan_derive",
@@ -1864,7 +1864,7 @@ dependencies = [
 
 [[package]]
 name = "inql"
-version = "0.3.0-rc25"
+version = "0.3.0-rc28"
 dependencies = [
  "blake2",
  "blake3",
@@ -3110,9 +3110,9 @@ checksum = "9ea3136b675547379c4bd395ca6b938e5ad3c3d20fad76e7fe85f9e0d011419c"
 
 [[package]]
 name = "typenum"
-version = "1.20.0"
+version = "1.20.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "40ce102ab67701b8526c123c1bab5cbe42d7040ccfd0f64af1a385808d2f43de"
+checksum = "b6f5e870be6c3b371b77fe0ee0bafb859fa4964b4404c27de1d380043c4dda20"
 
 [[package]]
 name = "typify"
@@ -3211,9 +3211,9 @@ checksum = "b6c140620e7ffbb22c2dee59cafe6084a59b5ffc27a8859a5f0d494b5d52b6be"
 
 [[package]]
 name = "uuid"
-version = "1.23.1"
+version = "1.23.2"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "ddd74a9687298c6858e9b88ec8935ec45d22e8fd5e6394fa1bd4e99a87789c76"
+checksum = "d258b83ceec21034727ecee8c382cfa6c3e133699b0742c64571814fb420c9f7"
 dependencies = [
  "getrandom 0.4.2",
  "js-sys",
@@ -3567,18 +3567,18 @@ dependencies = [
 
 [[package]]
 name = "zerocopy"
-version = "0.8.49"
+version = "0.8.50"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "bce33a6288fa3f072a8c2c7d0f2fdbb90e28298f0135c1f99b96c3db2efcc60b"
+checksum = "3b065d4f0e55f82fae73202e189638116a87c55ab6b8e6c2721e13dd9d854ad1"
 dependencies = [
  "zerocopy-derive",
 ]
 
 [[package]]
 name = "zerocopy-derive"
-version = "0.8.49"
+version = "0.8.50"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "8fd425244944f4ab65ccff928e7323354c5a018c75838362fdce749dfad2ee1e"
+checksum = "0b631b19d36a892ab55420c92dbc83ccd79274f25be714855d3074aa71cab639"
 dependencies = [
  "proc-macro2",
  "quote",
diff --git a/src/function_registry.incn b/src/function_registry.incn
index f8af3da..f524e25 100644
--- a/src/function_registry.incn
+++ b/src/function_registry.incn
@@ -218,7 +218,9 @@ pub class FunctionRegistry:
         function_ref = namespaced_function_ref(spec.namespace, canonical_name)
 
         for entry in self.entries:
-            assert entry.function_ref != function_ref, f"duplicate function reference `{function_ref}`"
+            if entry.function_ref == function_ref:
+                assert _entry_matches_spec(entry, canonical_name, spec), f"conflicting duplicate function reference `{function_ref}`"
+                return
             assert entry.namespace != spec.namespace or entry.canonical_name != canonical_name, f"duplicate canonical function name `{canonical_name}` in namespace `{spec.namespace}`"
 
         self.entries.append(
@@ -275,6 +277,74 @@ pub class FunctionRegistry:
         return [entry.function_ref for entry in self.entries if entry.substrait.kind == SubstraitMappingKind.ExtensionFunction]
 
 
+def _same_string_list(left: list[str], right: list[str]) -> bool:
+    """Return whether two string lists contain the same ordered values."""
+    if len(left) != len(right):
+        return false
+    for idx in range(len(left)):
+        if left[idx] != right[idx]:
+            return false
+    return true
+
+
+def _same_version(left: FunctionVersion, right: FunctionVersion) -> bool:
+    """Return whether two version markers describe the same package version."""
+    return left.major == right.major and left.minor == right.minor
+
+
+def _same_deprecation(left: Option[FunctionDeprecation], right: Option[FunctionDeprecation]) -> bool:
+    """Return whether two optional deprecation markers describe the same policy."""
+    if let Some(left_value) = left:
+        if let Some(right_value) = right:
+            return _same_version(left_value.since, right_value.since) and left_value.note == right_value.note and left_value.replacement == right_value.replacement
+        return false
+    if let Some(_right_value) = right:
+        return false
+    return true
+
+
+def _same_lifecycle(left: FunctionLifecycle, right: FunctionLifecycle) -> bool:
+    """Return whether two lifecycle records expose the same versioned facts."""
+    if not _same_version(left.since, right.since):
+        return false
+    if len(left.changed) != len(right.changed):
+        return false
+    for idx in range(len(left.changed)):
+        left_change = left.changed[idx]
+        right_change = right.changed[idx]
+        if not _same_version(left_change.version, right_change.version) or left_change.note != right_change.note:
+            return false
+    return _same_deprecation(left.deprecated, right.deprecated)
+
+
+def _same_aggregate_modifiers(left: AggregateModifierPolicy, right: AggregateModifierPolicy) -> bool:
+    """Return whether two aggregate modifier policies are identical."""
+    return left.allows_distinct == right.allows_distinct and left.allows_filter == right.allows_filter and left.allows_ordered_input == right.allows_ordered_input
+
+
+def _same_substrait_mapping(left: SubstraitMapping, right: SubstraitMapping) -> bool:
+    """Return whether two Substrait mapping records are identical."""
+    return (
+        left.kind == right.kind
+        and left.uri == right.uri
+        and left.function_name == right.function_name
+        and left.anchor == right.anchor
+        and left.rewrite == right.rewrite
+        and left.detail == right.detail
+    )
+
+
+def _entry_matches_spec(entry: FunctionRegistryEntry, canonical_name: str, spec: FunctionSpec) -> bool:
+    """Return whether an existing registry entry is an idempotent repeat of one declaration-side spec."""
+    return (entry.function_ref == namespaced_function_ref(spec.namespace, canonical_name) and entry.namespace == spec.namespace and entry.canonical_name == canonical_name and entry.policy_category == spec.policy_category and entry.function_class == spec.function_class and _same_string_list(
+        entry.aliases,
+        spec.aliases,
+    ) and entry.alias_policy == spec.alias_policy and _same_lifecycle(entry.lifecycle, spec.lifecycle) and entry.determinism == spec.determinism and entry.null_behavior == spec.null_behavior and entry.error_behavior == spec.error_behavior and _same_aggregate_modifiers(
+        entry.aggregate_modifiers,
+        spec.aggregate_modifiers,
+    ) and _same_substrait_mapping(entry.substrait, spec.substrait))
+
+
 pub def function_ref_for(canonical_name: str) -> str:
     """Return the durable registry reference for one canonical InQL function name."""
     return namespaced_function_ref(CORE_FUNCTION_NAMESPACE, canonical_name)
diff --git a/src/functions/csv/from_csv.incn b/src/functions/csv/from_csv.incn
index cf6d3d5..4529424 100644
--- a/src/functions/csv/from_csv.incn
+++ b/src/functions/csv/from_csv.incn
@@ -10,9 +10,9 @@ from function_registry import (
     v0_1,
 )
 from functions.registry import register_function, registered_application
-from functions.format_schema import FormatSchema, format_schema_description
 from projection_builders import ColumnExpr, str_expr
 from substrait.function_extensions import FROM_CSV_FUNCTION_ANCHOR
+from substrait.schema import schema_description_for_type
 
 
 @register_function(deterministic_spec(
@@ -22,22 +22,17 @@ from substrait.function_extensions import FROM_CSV_FUNCTION_ANCHOR
     error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic,
     substrait=extension_mapping("from_csv", FROM_CSV_FUNCTION_ANCHOR),
 ))
-pub def from_csv(expr: ColumnExpr, schema: FormatSchema) -> ColumnExpr:
+pub def from_csv[T](expr: ColumnExpr) -> ColumnExpr:
     """
-    Parse one CSV row string into a logical map keyed by schema field names.
+    Parse one CSV row string into a logical map keyed by an Incan model schema.
 
     Examples:
-        row = from_csv(col("line"), "STRUCT")
-
-        # With a `CsvRowSchema` model in scope:
-        schema = row_shape_from_model(CsvRowSchema(id=0, status=""))
-        typed_row = from_csv(col("line"), schema)
+        row = from_csv[CsvRow](col("line"))
 
     Parameters:
         expr: String expression containing one CSV row.
-        schema: Schema description or `RowShapeSpec` used for output keys, or an empty string for `_cN` positional keys.
     """
-    return registered_application("from_csv", [expr, str_expr(format_schema_description(schema))])
+    return registered_application("from_csv", [expr, str_expr(schema_description_for_type[T]())])
 
 
 module tests:
@@ -48,22 +43,11 @@ module tests:
         column_expr_function_name,
         column_expr_kind,
     )
-    from substrait.schema import RowColumnSpec, RowShapeSpec, SubstraitPrimitiveKind
+    model CsvRow:
+        id: int
+        status: str
     def test_from_csv_builds_registered_application() -> None:
-        expr = from_csv(col("line"), "STRUCT")
-        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
-        assert column_expr_function_name(expr) == "from_csv"
-        assert column_expr_argument_count(expr) == 2
-    def test_from_csv_accepts_row_shape_schema() -> None:
-        shape = RowShapeSpec(
-            model_name="CsvOrder",
-            columns=[RowColumnSpec(name="id", kind=SubstraitPrimitiveKind.I64, nullable=false), RowColumnSpec(
-                name="status",
-                kind=SubstraitPrimitiveKind.String,
-                nullable=false,
-            )],
-        )
-        expr = from_csv(col("line"), shape)
+        expr = from_csv[CsvRow](col("line"))
         assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
         assert column_expr_function_name(expr) == "from_csv"
         assert column_expr_argument_count(expr) == 2
diff --git a/src/functions/format_schema.incn b/src/functions/format_schema.incn
deleted file mode 100644
index c15a4fc..0000000
--- a/src/functions/format_schema.incn
+++ /dev/null
@@ -1,13 +0,0 @@
-"""Shared schema-argument normalization for scalar format helpers."""
-
-from substrait.schema import RowShapeSpec, row_shape_to_schema_description
-
-
-pub type FormatSchema = Union[str, RowShapeSpec]
-
-
-pub def format_schema_description(schema: FormatSchema) -> str:
-    """Normalize compact schema strings and model-derived row shapes to one schema description."""
-    match schema:
-        str(text) => return text
-        RowShapeSpec(shape) => return row_shape_to_schema_description(shape)
diff --git a/src/functions/json/from_json.incn b/src/functions/json/from_json.incn
index 8689f9a..a59efd1 100644
--- a/src/functions/json/from_json.incn
+++ b/src/functions/json/from_json.incn
@@ -9,10 +9,10 @@ from function_registry import (
     extension_mapping,
     v0_1,
 )
-from functions.format_schema import FormatSchema, format_schema_description
 from functions.registry import register_function, registered_application
 from projection_builders import ColumnExpr, str_expr
 from substrait.function_extensions import FROM_JSON_FUNCTION_ANCHOR
+from substrait.schema import schema_description_for_type
 
 
 @register_function(deterministic_spec(
@@ -22,22 +22,17 @@ from substrait.function_extensions import FROM_JSON_FUNCTION_ANCHOR
     error_behavior=FunctionErrorBehavior.InvalidInputDiagnostic,
     substrait=extension_mapping("from_json", FROM_JSON_FUNCTION_ANCHOR),
 ))
-pub def from_json(expr: ColumnExpr, schema: FormatSchema) -> ColumnExpr:
+pub def from_json[T](expr: ColumnExpr) -> ColumnExpr:
     """
-    Validate JSON using an explicit schema description and return a normalized JSON payload.
+    Validate JSON using an Incan model schema and return a normalized JSON payload.
 
     Examples:
-        payload = from_json(col("payload"), "STRUCT")
-
-        # With an `EventPayload` model in scope:
-        schema = row_shape_from_model(EventPayload(type="click"))
-        typed_payload = from_json(col("payload"), schema)
+        payload = from_json[EventPayload](col("payload"))
 
     Parameters:
         expr: String expression containing a JSON payload.
-        schema: Schema description or `RowShapeSpec` for the expected JSON payload.
     """
-    return registered_application("from_json", [expr, str_expr(format_schema_description(schema))])
+    return registered_application("from_json", [expr, str_expr(schema_description_for_type[T]())])
 
 
 module tests:
@@ -48,18 +43,10 @@ module tests:
         column_expr_function_name,
         column_expr_kind,
     )
-    from substrait.schema import RowColumnSpec, RowShapeSpec, SubstraitPrimitiveKind
+    model EventPayload:
+        type_ as "type": str
     def test_from_json_builds_registered_application() -> None:
-        expr = from_json(col("payload"), "STRUCT")
-        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
-        assert column_expr_function_name(expr) == "from_json"
-        assert column_expr_argument_count(expr) == 2
-    def test_from_json_accepts_row_shape_schema() -> None:
-        shape = RowShapeSpec(
-            model_name="EventPayload",
-            columns=[RowColumnSpec(name="type", kind=SubstraitPrimitiveKind.String, nullable=false)],
-        )
-        expr = from_json(col("payload"), shape)
+        expr = from_json[EventPayload](col("payload"))
         assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
         assert column_expr_function_name(expr) == "from_json"
         assert column_expr_argument_count(expr) == 2
diff --git a/src/functions/json/try_from_json.incn b/src/functions/json/try_from_json.incn
index 20d9cbc..000877c 100644
--- a/src/functions/json/try_from_json.incn
+++ b/src/functions/json/try_from_json.incn
@@ -8,10 +8,10 @@ from function_registry import (
     extension_mapping,
     v0_1,
 )
-from functions.format_schema import FormatSchema, format_schema_description
 from functions.registry import register_function, registered_application
 from projection_builders import ColumnExpr, str_expr
 from substrait.function_extensions import TRY_FROM_JSON_FUNCTION_ANCHOR
+from substrait.schema import schema_description_for_type
 
 
 @register_function(deterministic_spec(
@@ -20,22 +20,17 @@ from substrait.function_extensions import TRY_FROM_JSON_FUNCTION_ANCHOR
     null_behavior=FunctionNullBehavior.DependsOnInputs,
     substrait=extension_mapping("try_from_json", TRY_FROM_JSON_FUNCTION_ANCHOR),
 ))
-pub def try_from_json(expr: ColumnExpr, schema: FormatSchema) -> ColumnExpr:
+pub def try_from_json[T](expr: ColumnExpr) -> ColumnExpr:
     """
-    Parse JSON with an explicit schema description and return null when the payload is invalid.
+    Parse JSON with an Incan model schema and return null when the payload is invalid.
 
     Examples:
-        payload = try_from_json(col("payload"), "STRUCT")
-
-        # With an `EventPayload` model in scope:
-        schema = row_shape_from_model(EventPayload(type="click"))
-        typed_payload = try_from_json(col("payload"), schema)
+        payload = try_from_json[EventPayload](col("payload"))
 
     Parameters:
         expr: String expression containing a JSON payload.
-        schema: Schema description or `RowShapeSpec` for the expected JSON payload.
     """
-    return registered_application("try_from_json", [expr, str_expr(format_schema_description(schema))])
+    return registered_application("try_from_json", [expr, str_expr(schema_description_for_type[T]())])
 
 
 module tests:
@@ -46,18 +41,10 @@ module tests:
         column_expr_function_name,
         column_expr_kind,
     )
-    from substrait.schema import RowColumnSpec, RowShapeSpec, SubstraitPrimitiveKind
+    model EventPayload:
+        type_ as "type": str
     def test_try_from_json_builds_registered_application() -> None:
-        expr = try_from_json(col("payload"), "STRUCT")
-        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
-        assert column_expr_function_name(expr) == "try_from_json"
-        assert column_expr_argument_count(expr) == 2
-    def test_try_from_json_accepts_row_shape_schema() -> None:
-        shape = RowShapeSpec(
-            model_name="EventPayload",
-            columns=[RowColumnSpec(name="type", kind=SubstraitPrimitiveKind.String, nullable=false)],
-        )
-        expr = try_from_json(col("payload"), shape)
+        expr = try_from_json[EventPayload](col("payload"))
         assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
         assert column_expr_function_name(expr) == "try_from_json"
         assert column_expr_argument_count(expr) == 2
diff --git a/src/functions/mod.incn b/src/functions/mod.incn
index 5294069..6a2ae8b 100644
--- a/src/functions/mod.incn
+++ b/src/functions/mod.incn
@@ -15,7 +15,6 @@ pub from functions.registry import (
     function_registry_function_refs,
     registered_substrait_mapped_function_refs,
 )
-pub from functions.format_schema import FormatSchema
 pub from functions.references.col import col
 pub from functions.literals.always_false import always_false
 pub from functions.literals.always_true import always_true
diff --git a/src/functions/registry.incn b/src/functions/registry.incn
index c4e8c7b..363bcb1 100644
--- a/src/functions/registry.incn
+++ b/src/functions/registry.incn
@@ -27,9 +27,9 @@ pub def register_function[F](spec: FunctionSpec) -> (F) -> F:
 
 
 def _register_pending_function[F](func: F) -> F:
-    """Register one pending declaration-side spec against the decorated helper name."""
-    pending_index = function_registry.entry_count()
-    assert pending_index < len(_pending_function_specs), "function registry decorator had no pending spec"
+    """Register the most recently queued declaration-side spec against the decorated helper name."""
+    pending_index = len(_pending_function_specs) - 1
+    assert pending_index >= 0, "function registry decorator had no pending spec"
     spec = _pending_function_specs[pending_index]
     mut canonical_name = func.__name__
     if let Some(name) = spec.canonical_name:
diff --git a/src/lib.incn b/src/lib.incn
index a039213..175a194 100644
--- a/src/lib.incn
+++ b/src/lib.incn
@@ -59,7 +59,6 @@ pub from projection_builders import (
 )
 pub from functions.registry import function_registry
 pub from functions import (
-    FormatSchema,
     function_registry_canonical_names,
     function_registry_entries,
     function_registry_entry,
@@ -248,15 +247,11 @@ pub from session.types import Session, SessionBuilder
 pub from session.errors import SessionError, SessionErrorKind, format_session_diagnostic, report_session_error
 pub from substrait.errors import SubstraitLoweringError, SubstraitLoweringErrorKind
 pub from substrait.schema import (
-    DemoCustomer,
     RowColumnSpec,
     RowShapeSpec,
     SubstraitPrimitiveKind,
-    demo_customer_row_shape,
     row_shape_to_named_struct,
     row_shape_field_names,
-    row_shape_from_model,
-    row_shape_to_schema_description,
     row_shape_to_substrait_struct_type,
     substrait_row_type_encoded_len,
 )
diff --git a/src/substrait/schema.incn b/src/substrait/schema.incn
index a655b19..936fb9e 100644
--- a/src/substrait/schema.incn
+++ b/src/substrait/schema.incn
@@ -1,9 +1,6 @@
 """
 Describe an Incan row shape and lower it to Substrait schema structures.
 
-The demo model `DemoCustomer` is hand-authored; `demo_customer_row_shape()` lists columns in the same declaration order.
-`row_shape_from_model(...)` derives `RowShapeSpec` from runtime reflection metadata on model or class values.
-
 `NamedStruct` (names + struct) is the full read-root shape; this module emits the inner column types and exposes names
 separately until Incan can set prost's `r#struct` field ergonomically.
 
@@ -25,12 +25,6 @@ from rust::substrait::proto::type import (
 from rust::incan_stdlib::errors import raise_value_error
 
 
-@derive(Clone)
-pub model DemoCustomer:
-    pub id: int
-    pub name: str
-
-
 @derive(Clone)
 pub enum SubstraitPrimitiveKind(str):
     Bool = "bool"
@@ -53,25 +44,13 @@ pub model RowShapeSpec:
     pub columns: list[RowColumnSpec]
 
 
-pub def demo_customer_row_shape() -> RowShapeSpec:
-    """Return the row shape for `DemoCustomer` (manual until compiler-derived)."""
-    return RowShapeSpec(
-        model_name=str("DemoCustomer"),
-        columns=[RowColumnSpec(name=str("id"), kind=SubstraitPrimitiveKind.I64, nullable=false), RowColumnSpec(
-            name=str("name"),
-            kind=SubstraitPrimitiveKind.String,
-            nullable=true,
-        )],
-    )
-
-
 pub def row_shape_field_names(shape: RowShapeSpec) -> list[str]:
     """Field names in index order (flat row; no nested structs in this prototype)."""
     return [col.name for col in shape.columns]
 
 
 def _primitive_kind_from_type_name(type_name: str) -> SubstraitPrimitiveKind:
-    """Map Incan field reflection type names to the primitive schema kinds InQL can lower today."""
+    """Map Incan field reflection type names to primitive schema kinds InQL can lower today."""
     if type_name == "bool":
         return SubstraitPrimitiveKind.Bool
     if type_name == "int":
@@ -106,10 +85,10 @@ def _field_is_nullable(raw_type_name: str) -> bool:
     return raw_type_name.startswith("Option[") and raw_type_name.endswith("]")
 
 
-pub def row_shape_from_model[T](value: T) -> RowShapeSpec:
-    """Derive a flat row shape from an Incan model or class value using compiler-provided field metadata."""
+pub def row_shape_from_type[T]() -> RowShapeSpec:
+    """Derive a flat row shape from an Incan model or class type parameter."""
     mut columns: list[RowColumnSpec] = []
-    for field in value.__fields__():
+    for field in T.__fields__():
         raw_type_name = str(field.type_name)
         field_type_name = _field_type_name(raw_type_name)
         columns.append(
@@ -119,7 +98,7 @@ pub def row_shape_from_model[T](value: T) -> RowShapeSpec:
                 nullable=_field_is_nullable(raw_type_name),
             ),
         )
-    return RowShapeSpec(model_name=str(value.__class_name__()), columns=columns)
+    return RowShapeSpec(model_name=str(T.__class_name__()), columns=columns)
 
 
 pub def row_shape_to_schema_description(shape: RowShapeSpec) -> str:
@@ -131,6 +110,11 @@ pub def row_shape_to_schema_description(shape: RowShapeSpec) -> str:
     return f"STRUCT<{fields}>"
 
 
+pub def schema_description_for_type[T]() -> str:
+    """Return the compact schema string for an Incan model or class type parameter."""
+    return row_shape_to_schema_description(row_shape_from_type[T]())
+
+
 def _type_from_primitive(kind: SubstraitPrimitiveKind, nullable: bool) -> Type:
     """Lower one primitive kind + nullability flag into a Substrait `Type`."""
     mut n = Nullability.Required
diff --git a/src/substrait/schema_registry.incn b/src/substrait/schema_registry.incn
index 2e6650e..c2b8652 100644
--- a/src/substrait/schema_registry.incn
+++ b/src/substrait/schema_registry.incn
@@ -6,7 +6,7 @@ reads. It does not build plans or relations beyond the schema envelopes themselv
 """
 
 from rust::substrait::proto import NamedStruct
-from substrait.schema import RowColumnSpec, RowShapeSpec, SubstraitPrimitiveKind, row_shape_to_named_struct
+from substrait.schema import RowColumnSpec, RowShapeSpec, row_shape_to_named_struct
 from schema_registry import (
     named_table_columns as logical_named_table_columns,
     register_named_table_schema as register_logical_named_table_schema,
@@ -14,19 +14,6 @@ from schema_registry import (
 )
 
 
-def _logical_shape(name: str) -> RowShapeSpec:
-    """Build one minimal placeholder logical row shape for helper-generated reads."""
-    return RowShapeSpec(
-        model_name=name,
-        columns=[RowColumnSpec(name="id", kind=SubstraitPrimitiveKind.I64, nullable=false)],
-    )
-
-
-def _default_named_struct(name: str) -> NamedStruct:
-    """Build one default named struct for helper-generated table reads."""
-    return row_shape_to_named_struct(_logical_shape(name))
-
-
 pub def unknown_named_struct() -> NamedStruct:
     """Build an unknown named struct placeholder when schema binding is unavailable."""
     return row_shape_to_named_struct(RowShapeSpec(model_name="__unknown__", columns=[]))
diff --git a/tests/test_function_registry.incn b/tests/test_function_registry.incn
index c5982c1..ffb250e 100644
--- a/tests/test_function_registry.incn
+++ b/tests/test_function_registry.incn
@@ -265,6 +265,12 @@ from substrait.function_extensions import (
 )
 
 
+@derive(Clone)
+model FormatRegistrySchema:
+    id: int
+    type_ as "type": str
+
+
 def _contains_text(items: list[str], expected: str) -> bool:
     """Return whether one list contains the expected string."""
     for item in items:
@@ -455,11 +461,11 @@ def _exercise_current_public_helpers() -> None:
     json_extract_path_text(status, "$.type")
     json_array_length(status)
     json_object_keys(status)
-    from_json(status, "STRUCT")
-    try_from_json(status, "STRUCT")
+    from_json[FormatRegistrySchema](status)
+    try_from_json[FormatRegistrySchema](status)
     to_json(status)
     schema_of_csv(status)
-    from_csv(status, "STRUCT")
+    from_csv[FormatRegistrySchema](status)
     to_csv(status)
     return
 
@@ -655,6 +661,28 @@ def test_function_registry__extension_policy_is_separate_from_scalar_class() ->
     assert extension_entry.substrait.kind == SubstraitMappingKind.ExtensionFunction, "extension-only entries may declare Substrait extension mappings"
 
 
+def test_function_registry__matching_duplicate_registration_is_idempotent() -> None:
+    """Assert repeated declaration-side registration keeps one entry when metadata matches exactly."""
+    # -- Arrange --
+    mut registry = FunctionRegistry.new()
+    spec = deterministic_spec(
+        function_class=FunctionClass.Scalar,
+        lifecycle=FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+        null_behavior=FunctionNullBehavior.DependsOnInputs,
+        substrait=extension_mapping("from_json", FROM_JSON_FUNCTION_ANCHOR),
+    )
+
+    # -- Act --
+    registry._add_entry("from_json", spec.clone())
+    registry._add_entry("from_json", spec)
+    entry = _local_entry_by_name_or_fail(registry, "from_json")
+
+    # -- Assert --
+    assert registry.entry_count() == 1, "matching duplicate declaration-side registration should be idempotent"
+    assert entry.function_ref == function_ref_for("from_json"), "idempotent registration should preserve the original function ref"
+    assert entry.substrait.function_name == "from_json", "idempotent registration should preserve mapping metadata"
+
+
 def test_function_registry__canonical_name_lookup_stays_core_scoped() -> None:
     """Assert explicit namespaces keep extension names from shadowing portable core names."""
     # -- Arrange --
@@ -946,10 +974,9 @@ def test_function_registry__public_helpers_preserve_existing_behavior() -> None:
     ), check_json(status), schema_of_json(status), get_json_object(status, "$.type"), json_extract_path_text(
         status,
         "$.type",
-    ), json_array_length(status), json_object_keys(status), from_json(status, "STRUCT"), try_from_json(
+    ), json_array_length(status), json_object_keys(status), from_json[FormatRegistrySchema](status), try_from_json[FormatRegistrySchema](
         status,
-        "STRUCT",
-    ), to_json(status), schema_of_csv(status), from_csv(status, "STRUCT"), to_csv(status)]
+    ), to_json(status), schema_of_csv(status), from_csv[FormatRegistrySchema](status), to_csv(status)]
 
     # -- Assert --
     assert column_expr_kind(amount) == ColumnExprKind.Column, "col should still build a column reference"
diff --git a/tests/test_model_substrait_schema.incn b/tests/test_model_substrait_schema.incn
index cde7b90..d1a977f 100644
--- a/tests/test_model_substrait_schema.incn
+++ b/tests/test_model_substrait_schema.incn
@@ -1,11 +1,11 @@
-"""Tests for model → Substrait NamedStruct prototype."""
+"""Tests for model type metadata to Substrait schema structures."""
 
 from substrait.schema import (
-    demo_customer_row_shape,
-    row_shape_from_model,
+    row_shape_from_type,
     row_shape_field_names,
     row_shape_to_schema_description,
     row_shape_to_substrait_struct_type,
+    schema_description_for_type,
     substrait_row_type_encoded_len,
 )
 
@@ -17,55 +17,41 @@ model CsvOrderSchema:
     discount: Option[float]
 
 
-def test_demo_customer_row_shape__field_order() -> None:
+def test_row_shape_from_type__uses_reflected_wire_names_and_types() -> None:
     # -- Arrange --
-    shape = demo_customer_row_shape()
+    shape = row_shape_from_type[CsvOrderSchema]()
 
     # -- Act --
     names = row_shape_field_names(shape)
 
     # -- Assert --
-    assert shape.model_name == "DemoCustomer", "model name"
-    assert len(names) == 2, "column count"
-    assert names[0] == "id", "first field is declaration order"
-    assert names[1] == "name", "second field"
-
-
-def test_row_shape_to_substrait_struct_type__encodes() -> None:
-    # -- Arrange --
-    shape = demo_customer_row_shape()
-
-    # -- Act --
-    row_ty = row_shape_to_substrait_struct_type(shape)
-    encoded = substrait_row_type_encoded_len(row_ty)
-
-    # -- Assert --
-    assert encoded > 0, "protobuf encoding should be non-empty"
+    assert shape.model_name == "CsvOrderSchema", "model name should come from the type parameter"
+    assert names[0] == "id", "alias should become the external schema field name"
+    assert names[1] == "status", "plain field name should be preserved"
+    assert names[2] == "discount", "optional field should be preserved"
+    assert shape.columns[2].nullable, "Option fields should be nullable"
 
 
-def test_row_shape_from_model__uses_reflected_wire_names_and_types() -> None:
+def test_schema_description_for_type__renders_compact_struct_schema() -> None:
     # -- Arrange --
-    row = CsvOrderSchema(id=42, status="paid", discount=None)
+    expected_schema = "STRUCT"
 
     # -- Act --
-    shape = row_shape_from_model(row)
-    names = row_shape_field_names(shape)
+    schema = schema_description_for_type[CsvOrderSchema]()
 
     # -- Assert --
-    assert shape.model_name == "CsvOrderSchema", "model name should come from reflection"
-    assert names[0] == "id", "alias should become the external schema field name"
-    assert names[1] == "status", "plain field name should be
preserved" - assert names[2] == "discount", "optional field should be preserved" - assert shape.columns[2].nullable, "Option fields should be nullable" + assert schema == expected_schema, "schema description" -def test_row_shape_to_schema_description__renders_csv_schema_string() -> None: +def test_row_shape_to_substrait_struct_type__encodes() -> None: # -- Arrange -- - row = CsvOrderSchema(id=42, status="paid", discount=None) - shape = row_shape_from_model(row) + shape = row_shape_from_type[CsvOrderSchema]() # -- Act -- + row_ty = row_shape_to_substrait_struct_type(shape) schema = row_shape_to_schema_description(shape) + encoded = substrait_row_type_encoded_len(row_ty) # -- Assert -- - assert schema == "STRUCT", "schema description" + assert schema.contains("id: BIGINT"), "schema should preserve reflected field names before lowering" + assert encoded > 0, "protobuf encoding should be non-empty" diff --git a/tests/test_session_projection.incn b/tests/test_session_projection.incn index 1f8eb8b..f48e165 100644 --- a/tests/test_session_projection.incn +++ b/tests/test_session_projection.incn @@ -57,7 +57,6 @@ from dataset import DataFrame, LazyFrame from session import Session, SessionErrorKind from std.testing import assert_is_err, assert_is_ok, fail_t from substrait.inspect import relation_kind_name, root_names, root_rel, sort_field_count, sort_field_direction_name -from substrait.schema import row_shape_from_model @derive(Clone) @@ -306,9 +305,9 @@ def test_session_projection__collect_executes_json_url_and_csv_format_functions( ).with_column("json_keys", json_object_keys(json_payload)).with_column("json_schema", schema_of_json(json_payload)).with_column( "json_normalized", parse_json(json_payload), - ).with_column("json_from", from_json(json_payload, row_shape_from_model(JsonProjectionSchema(type="click")))).with_column( + ).with_column("json_from", from_json[JsonProjectionSchema](json_payload)).with_column( "json_try_invalid", - try_from_json(lit("{"), 
row_shape_from_model(JsonProjectionSchema(type="click"))), + try_from_json[JsonProjectionSchema](lit("{")), ).with_column("json_string", to_json(lit("paid"))).with_column( "url_page", parse_url(lit("https://example.com/orders?page=2&id=7#top"), "page"), @@ -317,18 +316,15 @@ def test_session_projection__collect_executes_json_url_and_csv_format_functions( try_url_decode(lit("%zz")), ).with_column("csv_schema", schema_of_csv(csv_payload)).with_column( "csv_map", - from_csv(csv_payload, "STRUCT"), - ).with_column("csv_model_map", from_csv(csv_payload, row_shape_from_model(CsvProjectionSchema(id=0, status="")))).with_column( - "csv_line", - to_csv(lit("[\"42\",\"paid\"]")), - ) + from_csv[CsvProjectionSchema](csv_payload), + ).with_column("csv_line", to_csv(lit("[\"42\",\"paid\"]"))) df = _collect_or_fail(session, projected) payload = df.preview_text() resolved = df.resolved_columns() # -- Assert -- assert df.row_count() == 3, "format projections should preserve input rows" - assert len(resolved) == 20, "projection should expose all appended format outputs" + assert len(resolved) == 19, "projection should expose all appended format outputs" assert payload.contains("click"), "JSON path extraction should return scalar text" assert payload.contains("{\"id\":7}"), "get_json_object should return nested JSON text" assert payload.contains("2"), "parse_url should extract arbitrary query parameter keys" @@ -337,7 +333,6 @@ def test_session_projection__collect_executes_json_url_and_csv_format_functions( assert payload.contains("STRUCT<"), "schema helpers should emit deterministic schema text" assert payload.contains("id: 42"), "from_csv should use schema field names in a logical map value" assert payload.contains("status: paid"), "from_csv should retain CSV field values in a logical map value" - assert payload.contains("csv_model_map"), "from_csv should accept a schema derived from an Incan model" assert payload.contains("42,paid"), "to_csv should serialize JSON arrays as CSV 
rows" diff --git a/tests/test_substrait_plan.incn b/tests/test_substrait_plan.incn index b49a8f8..7531561 100644 --- a/tests/test_substrait_plan.incn +++ b/tests/test_substrait_plan.incn @@ -206,6 +206,12 @@ from substrait.conformance import ( ) +@derive(Clone) +model FormatPlanSchema: + id: int + type_ as "type": str + + def _core_keys() -> list[CoreScenarioKey]: return [CoreScenarioKey.ReadNamedTable, CoreScenarioKey.ReadLocalFiles, CoreScenarioKey.ReadVirtualTable, CoreScenarioKey.FilterRows, CoreScenarioKey.ProjectComputedColumns, CoreScenarioKey.JoinRelVariants, CoreScenarioKey.CrossRelCartesian, CoreScenarioKey.AggregateGroupingSets, CoreScenarioKey.SortRelOrdering, CoreScenarioKey.FetchRelLimitOffset, CoreScenarioKey.SetRelOperations, CoreScenarioKey.ReferenceRelSharedSubplan] @@ -500,11 +506,11 @@ def test_plan__core_scalar_extension_mappings_lower_to_substrait() -> None: _assert_scalar_expr_lowers(json_extract_path_text(col("status"), "$.type")) _assert_scalar_expr_lowers(json_array_length(col("status"))) _assert_scalar_expr_lowers(json_object_keys(col("status"))) - _assert_scalar_expr_lowers(from_json(col("status"), "STRUCT")) - _assert_scalar_expr_lowers(try_from_json(col("status"), "STRUCT")) + _assert_scalar_expr_lowers(from_json[FormatPlanSchema](col("status"))) + _assert_scalar_expr_lowers(try_from_json[FormatPlanSchema](col("status"))) _assert_scalar_expr_lowers(to_json(col("status"))) _assert_scalar_expr_lowers(schema_of_csv(col("status"))) - _assert_scalar_expr_lowers(from_csv(col("status"), "STRUCT")) + _assert_scalar_expr_lowers(from_csv[FormatPlanSchema](col("status"))) _assert_scalar_expr_lowers(to_csv(col("status"))) From 381de0bbc66c96a43d33c55e68d4be32a89e7de5 Mon Sep 17 00:00:00 2001 From: Danny Meijer Date: Sat, 30 May 2026 10:05:40 +0200 Subject: [PATCH 12/13] fix - 39 refactor DataFusion format UDF catalog --- src/session/datafusion_format_functions.incn | 421 +++++++++++++------ 1 file changed, 299 insertions(+), 122 deletions(-) diff 
--git a/src/session/datafusion_format_functions.incn b/src/session/datafusion_format_functions.incn index 6c0df44..c4fa86e 100644 --- a/src/session/datafusion_format_functions.incn +++ b/src/session/datafusion_format_functions.incn @@ -7,7 +7,7 @@ from rust::datafusion::arrow::datatypes import DataType from rust::datafusion::arrow::error import ArrowError from rust::datafusion::execution::context import SessionContext from rust::datafusion_common import DataFusionError, ScalarValue -from rust::datafusion_expr import ColumnarValue, ScalarFunctionImplementation, ScalarUDF, Volatility, create_udf +from rust::datafusion_expr import ColumnarValue, ScalarUDF, Volatility, create_udf from rust::std::primitive import i64 as RustI64, usize as RustUsize from rust::std::string import String as RustString from rust::std::sync import Arc @@ -17,84 +17,235 @@ from std.hash import sha1 as std_sha1, xxh64 as std_xxh64 from std.json import JsonError, JsonKind, JsonValue -pub def register_format_udfs(ctx: SessionContext) -> None: - """Register RFC 022 scalar format functions that DataFusion does not expose natively.""" - # DataFusion stores UDF callbacks as 'static Rust closures. Keep the captured function identity as a literal inside - # each callback instead of borrowing a constructor-local name. 
- ctx.register_udf(_unary_string_udf("sha1", Arc.from((args) => _apply_unary_string("sha1", args.to_vec())))) - ctx.register_udf(_unary_string_udf("crc32", Arc.from((args) => _apply_unary_string("crc32", args.to_vec())))) - ctx.register_udf(_unary_string_udf("xxhash64", Arc.from((args) => _apply_unary_string("xxhash64", args.to_vec())))) - ctx.register_udf( - _unary_string_udf("url_encode", Arc.from((args) => _apply_unary_string("url_encode", args.to_vec()))), - ) - ctx.register_udf( - _unary_string_udf("url_decode", Arc.from((args) => _apply_unary_string("url_decode", args.to_vec()))), - ) - ctx.register_udf( - _unary_string_udf("try_url_decode", Arc.from((args) => _apply_unary_string("try_url_decode", args.to_vec()))), - ) - ctx.register_udf( - _binary_string_udf("parse_url", Arc.from((args) => _apply_binary_string("parse_url", args.to_vec()))), - ) - ctx.register_udf( - _unary_string_udf("parse_json", Arc.from((args) => _apply_unary_string("parse_json", args.to_vec()))), - ) - ctx.register_udf(_unary_bool_udf("check_json", Arc.from((args) => _apply_unary_bool("check_json", args.to_vec())))) - ctx.register_udf( - _unary_string_udf("schema_of_json", Arc.from((args) => _apply_unary_string("schema_of_json", args.to_vec()))), - ) - ctx.register_udf( - _unary_int_udf("json_array_length", Arc.from((args) => _apply_unary_int("json_array_length", args.to_vec()))), - ) - ctx.register_udf( - _unary_string_udf("json_object_keys", Arc.from((args) => _apply_unary_string("json_object_keys", args.to_vec()))), - ) - ctx.register_udf( - _binary_string_udf("get_json_object", Arc.from((args) => _apply_binary_string("get_json_object", args.to_vec()))), - ) - ctx.register_udf( - _binary_string_udf( - "json_extract_path_text", - Arc.from((args) => _apply_binary_string("json_extract_path_text", args.to_vec())), - ), - ) - ctx.register_udf( - _binary_string_udf("from_json", Arc.from((args) => _apply_binary_string("from_json", args.to_vec()))), - ) - ctx.register_udf( - 
_binary_string_udf("try_from_json", Arc.from((args) => _apply_binary_string("try_from_json", args.to_vec()))), - ) - ctx.register_udf(_unary_string_udf("to_json", Arc.from((args) => _apply_unary_string("to_json", args.to_vec())))) - ctx.register_udf( - _unary_string_udf("schema_of_csv", Arc.from((args) => _apply_unary_string("schema_of_csv", args.to_vec()))), - ) - ctx.register_udf(_binary_csv_map_udf("from_csv", Arc.from((args) => _apply_from_csv(args.to_vec())))) - ctx.register_udf(_unary_string_udf("to_csv", Arc.from((args) => _apply_unary_string("to_csv", args.to_vec())))) - - -def _unary_string_udf(name: str, implementation: ScalarFunctionImplementation) -> ScalarUDF: - """Create one unary string-to-string DataFusion UDF backed by Incan source.""" - return create_udf(name, [DataType.Utf8], DataType.Utf8, Volatility.Immutable, implementation) - - -def _unary_bool_udf(name: str, implementation: ScalarFunctionImplementation) -> ScalarUDF: - """Create one unary string-to-boolean DataFusion UDF backed by Incan source.""" - return create_udf(name, [DataType.Utf8], DataType.Boolean, Volatility.Immutable, implementation) - - -def _unary_int_udf(name: str, implementation: ScalarFunctionImplementation) -> ScalarUDF: - """Create one unary string-to-int DataFusion UDF backed by Incan source.""" - return create_udf(name, [DataType.Utf8], DataType.Int64, Volatility.Immutable, implementation) +@derive(Clone) +enum DataFusionFormatFunction(str): + """Format UDFs that InQL must provide to DataFusion itself.""" + + Sha1 = "sha1" + Crc32 = "crc32" + Xxhash64 = "xxhash64" + UrlEncode = "url_encode" + UrlDecode = "url_decode" + TryUrlDecode = "try_url_decode" + ParseUrl = "parse_url" + ParseJson = "parse_json" + CheckJson = "check_json" + SchemaOfJson = "schema_of_json" + JsonArrayLength = "json_array_length" + JsonObjectKeys = "json_object_keys" + GetJsonObject = "get_json_object" + JsonExtractPathText = "json_extract_path_text" + FromJson = "from_json" + TryFromJson = 
"try_from_json" + ToJson = "to_json" + SchemaOfCsv = "schema_of_csv" + FromCsv = "from_csv" + ToCsv = "to_csv" -def _binary_string_udf(name: str, implementation: ScalarFunctionImplementation) -> ScalarUDF: - """Create one binary string-to-string DataFusion UDF backed by Incan source.""" - return create_udf(name, [DataType.Utf8, DataType.Utf8], DataType.Utf8, Volatility.Immutable, implementation) - - -def _binary_csv_map_udf(name: str, implementation: ScalarFunctionImplementation) -> ScalarUDF: - """Create one binary string-to-map DataFusion UDF for CSV row parsing.""" - return create_udf(name, [DataType.Utf8, DataType.Utf8], _csv_map_data_type(), Volatility.Immutable, implementation) +pub def register_format_udfs(ctx: SessionContext) -> None: + """Register RFC 022 scalar format functions that DataFusion does not expose natively.""" + for function in _format_udf_catalog(): + ctx.register_udf(_format_udf(function)) + + +def _format_udf_catalog() -> list[DataFusionFormatFunction]: + """Return the complete adapter-owned RFC 022 DataFusion UDF catalog.""" + mut functions: list[DataFusionFormatFunction] = [] + + # Keep this grouped by public function family so it can be compared against the RFC 022 catalog quickly. + functions.append(DataFusionFormatFunction.Sha1) + functions.append(DataFusionFormatFunction.Crc32) + functions.append(DataFusionFormatFunction.Xxhash64) + + # URL helpers are string-valued except parse_url, which accepts a selector key. + functions.append(DataFusionFormatFunction.UrlEncode) + functions.append(DataFusionFormatFunction.UrlDecode) + functions.append(DataFusionFormatFunction.TryUrlDecode) + functions.append(DataFusionFormatFunction.ParseUrl) + + # JSON helpers share text input, but return strings, booleans, ints, or selector-based strings. 
+ functions.append(DataFusionFormatFunction.ParseJson) + functions.append(DataFusionFormatFunction.CheckJson) + functions.append(DataFusionFormatFunction.SchemaOfJson) + functions.append(DataFusionFormatFunction.JsonArrayLength) + functions.append(DataFusionFormatFunction.JsonObjectKeys) + functions.append(DataFusionFormatFunction.GetJsonObject) + functions.append(DataFusionFormatFunction.JsonExtractPathText) + functions.append(DataFusionFormatFunction.FromJson) + functions.append(DataFusionFormatFunction.TryFromJson) + functions.append(DataFusionFormatFunction.ToJson) + + # CSV parsing returns an Arrow map so callers can inspect typed field names in materialized results. + functions.append(DataFusionFormatFunction.SchemaOfCsv) + functions.append(DataFusionFormatFunction.FromCsv) + functions.append(DataFusionFormatFunction.ToCsv) + return functions + + +def _format_udf(function: DataFusionFormatFunction) -> ScalarUDF: + """Create the DataFusion UDF for one adapter-owned format function.""" + match function: + DataFusionFormatFunction.Sha1 => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.Sha1, args.to_vec())), + ) + DataFusionFormatFunction.Crc32 => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.Crc32, args.to_vec())), + ) + DataFusionFormatFunction.Xxhash64 => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.Xxhash64, args.to_vec())), + ) + DataFusionFormatFunction.UrlEncode => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.UrlEncode, args.to_vec())), + ) + DataFusionFormatFunction.UrlDecode => 
+ return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.UrlDecode, args.to_vec())), + ) + DataFusionFormatFunction.TryUrlDecode => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.TryUrlDecode, args.to_vec())), + ) + DataFusionFormatFunction.ParseUrl => + return create_udf( + function.value(), + [DataType.Utf8, DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.ParseUrl, args.to_vec())), + ) + DataFusionFormatFunction.ParseJson => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.ParseJson, args.to_vec())), + ) + DataFusionFormatFunction.CheckJson => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Boolean, + Volatility.Immutable, + Arc.from((args) => _apply_unary_bool(DataFusionFormatFunction.CheckJson, args.to_vec())), + ) + DataFusionFormatFunction.SchemaOfJson => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.SchemaOfJson, args.to_vec())), + ) + DataFusionFormatFunction.JsonArrayLength => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Int64, + Volatility.Immutable, + Arc.from((args) => _apply_unary_int(DataFusionFormatFunction.JsonArrayLength, args.to_vec())), + ) + DataFusionFormatFunction.JsonObjectKeys => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.JsonObjectKeys, args.to_vec())), + ) + DataFusionFormatFunction.GetJsonObject => + return create_udf( + 
function.value(), + [DataType.Utf8, DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.GetJsonObject, args.to_vec())), + ) + DataFusionFormatFunction.JsonExtractPathText => + return create_udf( + function.value(), + [DataType.Utf8, DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.JsonExtractPathText, args.to_vec())), + ) + DataFusionFormatFunction.FromJson => + return create_udf( + function.value(), + [DataType.Utf8, DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.FromJson, args.to_vec())), + ) + DataFusionFormatFunction.TryFromJson => + return create_udf( + function.value(), + [DataType.Utf8, DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.TryFromJson, args.to_vec())), + ) + DataFusionFormatFunction.ToJson => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.ToJson, args.to_vec())), + ) + DataFusionFormatFunction.SchemaOfCsv => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.SchemaOfCsv, args.to_vec())), + ) + DataFusionFormatFunction.FromCsv => + return create_udf( + function.value(), + [DataType.Utf8, DataType.Utf8], + _csv_map_data_type(), + Volatility.Immutable, + Arc.from((args) => _apply_from_csv(args.to_vec())), + ) + DataFusionFormatFunction.ToCsv => + return create_udf( + function.value(), + [DataType.Utf8], + DataType.Utf8, + Volatility.Immutable, + Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.ToCsv, args.to_vec())), + ) def _csv_map_data_type() -> DataType: @@ -103,39 +254,50 @@ def _csv_map_data_type() -> 
DataType: return builder.finish().data_type().clone() -def _apply_unary_string(name: str, args: list[ColumnarValue]) -> Result[ColumnarValue, DataFusionError]: +def _apply_unary_string( + function: DataFusionFormatFunction, + args: list[ColumnarValue], +) -> Result[ColumnarValue, DataFusionError]: """Apply one unary string function across scalar and array DataFusion inputs.""" _require_arity(args, 1)? if _all_scalar(args): - return Ok(ColumnarValue.Scalar(ScalarValue.Utf8(_eval_unary_string(name, _string_at(args[0].clone(), 0)?)?))) + return Ok(ColumnarValue.Scalar(ScalarValue.Utf8(_eval_unary_string(function, _string_at(args[0].clone(), 0)?)?))) row_count = _row_count(args)? mut values: list[Option[str]] = [] for idx in range(row_count): - values.append(_eval_unary_string(name, _string_at(args[0].clone(), idx)?)?) + values.append(_eval_unary_string(function, _string_at(args[0].clone(), idx)?)?) array_ref: ArrayRef = Arc.from(StringArray.from(values)) return Ok(ColumnarValue.Array(array_ref)) -def _apply_unary_bool(name: str, args: list[ColumnarValue]) -> Result[ColumnarValue, DataFusionError]: +def _apply_unary_bool( + function: DataFusionFormatFunction, + args: list[ColumnarValue], +) -> Result[ColumnarValue, DataFusionError]: """Apply one unary string-to-bool function across scalar and array DataFusion inputs.""" _require_arity(args, 1)? if _all_scalar(args): - return Ok(ColumnarValue.Scalar(ScalarValue.Boolean(_eval_unary_bool(name, _string_at(args[0].clone(), 0)?)?))) + return Ok( + ColumnarValue.Scalar(ScalarValue.Boolean(_eval_unary_bool(function, _string_at(args[0].clone(), 0)?)?)), + ) row_count = _row_count(args)? mut values: list[Option[bool]] = [] for idx in range(row_count): - values.append(_eval_unary_bool(name, _string_at(args[0].clone(), idx)?)?) + values.append(_eval_unary_bool(function, _string_at(args[0].clone(), idx)?)?) 
array_ref: ArrayRef = Arc.from(BooleanArray.from(values)) return Ok(ColumnarValue.Array(array_ref)) -def _apply_unary_int(name: str, args: list[ColumnarValue]) -> Result[ColumnarValue, DataFusionError]: +def _apply_unary_int( + function: DataFusionFormatFunction, + args: list[ColumnarValue], +) -> Result[ColumnarValue, DataFusionError]: """Apply one unary string-to-int function across scalar and array DataFusion inputs.""" _require_arity(args, 1)? if _all_scalar(args): - value = _eval_unary_int(name, _string_at(args[0].clone(), 0)?)? + value = _eval_unary_int(function, _string_at(args[0].clone(), 0)?)? match value: Some(number) => return Ok(ColumnarValue.Scalar(ScalarValue.Int64(Some(_int_to_rust_i64(number)?)))) None => return Ok(ColumnarValue.Scalar(ScalarValue.Int64(None))) @@ -143,19 +305,22 @@ def _apply_unary_int(name: str, args: list[ColumnarValue]) -> Result[ColumnarVal row_count = _row_count(args)? mut values: list[Option[int]] = [] for idx in range(row_count): - values.append(_eval_unary_int(name, _string_at(args[0].clone(), idx)?)?) + values.append(_eval_unary_int(function, _string_at(args[0].clone(), idx)?)?) array_ref: ArrayRef = Arc.from(Int64Array.from(values)) return Ok(ColumnarValue.Array(array_ref)) -def _apply_binary_string(name: str, args: list[ColumnarValue]) -> Result[ColumnarValue, DataFusionError]: +def _apply_binary_string( + function: DataFusionFormatFunction, + args: list[ColumnarValue], +) -> Result[ColumnarValue, DataFusionError]: """Apply one binary string function across scalar and array DataFusion inputs.""" _require_arity(args, 2)? if _all_scalar(args): return Ok( ColumnarValue.Scalar( ScalarValue.Utf8( - _eval_binary_string(name, _string_at(args[0].clone(), 0)?, _string_at(args[1].clone(), 0)?)?, + _eval_binary_string(function, _string_at(args[0].clone(), 0)?, _string_at(args[1].clone(), 0)?)?, ), ), ) @@ -163,7 +328,9 @@ def _apply_binary_string(name: str, args: list[ColumnarValue]) -> Result[Columna row_count = _row_count(args)? 
mut values: list[Option[str]] = [] for idx in range(row_count): - values.append(_eval_binary_string(name, _string_at(args[0].clone(), idx)?, _string_at(args[1].clone(), idx)?)?) + values.append( + _eval_binary_string(function, _string_at(args[0].clone(), idx)?, _string_at(args[1].clone(), idx)?)?, + ) array_ref: ArrayRef = Arc.from(StringArray.from(values)) return Ok(ColumnarValue.Array(array_ref)) @@ -190,47 +357,51 @@ def _apply_from_csv(args: list[ColumnarValue]) -> Result[ColumnarValue, DataFusi return Ok(ColumnarValue.Array(array_ref)) -def _eval_unary_string(name: str, value: Option[str]) -> Result[Option[str], DataFusionError]: +def _eval_unary_string(function: DataFusionFormatFunction, value: Option[str]) -> Result[Option[str], DataFusionError]: """Dispatch unary string-returning format functions.""" - match name: - "sha1" => return _hash_sha1(value) - "crc32" => return _hash_crc32(value) - "xxhash64" => return _hash_xxhash64(value) - "url_encode" => return _url_encode(value) - "url_decode" => return _url_decode(value) - "try_url_decode" => return _try_url_decode(value) - "parse_json" => return _parse_json(value) - "schema_of_json" => return _schema_of_json(value) - "json_object_keys" => return _json_object_keys(value) - "to_json" => return _to_json(value) - "schema_of_csv" => return _schema_of_csv(value) - "to_csv" => return _to_csv(value) - _ => return _datafusion_error(f"unknown unary format function `{name}`") - - -def _eval_unary_bool(name: str, value: Option[str]) -> Result[Option[bool], DataFusionError]: + match function: + DataFusionFormatFunction.Sha1 => return _hash_sha1(value) + DataFusionFormatFunction.Crc32 => return _hash_crc32(value) + DataFusionFormatFunction.Xxhash64 => return _hash_xxhash64(value) + DataFusionFormatFunction.UrlEncode => return _url_encode(value) + DataFusionFormatFunction.UrlDecode => return _url_decode(value) + DataFusionFormatFunction.TryUrlDecode => return _try_url_decode(value) + DataFusionFormatFunction.ParseJson => 
return _parse_json(value) + DataFusionFormatFunction.SchemaOfJson => return _schema_of_json(value) + DataFusionFormatFunction.JsonObjectKeys => return _json_object_keys(value) + DataFusionFormatFunction.ToJson => return _to_json(value) + DataFusionFormatFunction.SchemaOfCsv => return _schema_of_csv(value) + DataFusionFormatFunction.ToCsv => return _to_csv(value) + _ => return _datafusion_error(f"invalid unary string UDF `{function.value()}`") + + +def _eval_unary_bool(function: DataFusionFormatFunction, value: Option[str]) -> Result[Option[bool], DataFusionError]: """Dispatch unary boolean-returning format functions.""" - match name: - "check_json" => return _check_json(value) - _ => return _datafusion_error(f"unknown boolean format function `{name}`") + match function: + DataFusionFormatFunction.CheckJson => return _check_json(value) + _ => return _datafusion_error(f"invalid unary boolean UDF `{function.value()}`") -def _eval_unary_int(name: str, value: Option[str]) -> Result[Option[int], DataFusionError]: +def _eval_unary_int(function: DataFusionFormatFunction, value: Option[str]) -> Result[Option[int], DataFusionError]: """Dispatch unary integer-returning format functions.""" - match name: - "json_array_length" => return _json_array_length(value) - _ => return _datafusion_error(f"unknown integer format function `{name}`") + match function: + DataFusionFormatFunction.JsonArrayLength => return _json_array_length(value) + _ => return _datafusion_error(f"invalid unary integer UDF `{function.value()}`") -def _eval_binary_string(name: str, left: Option[str], right: Option[str]) -> Result[Option[str], DataFusionError]: +def _eval_binary_string( + function: DataFusionFormatFunction, + left: Option[str], + right: Option[str], +) -> Result[Option[str], DataFusionError]: """Dispatch binary string-returning format functions.""" - match name: - "parse_url" => return _parse_url(left, right) - "get_json_object" => return _json_path_value(left, right, false) - 
"json_extract_path_text" => return _json_path_value(left, right, true) - "from_json" => return _from_json(left, right) - "try_from_json" => return _try_from_json(left, right) - _ => return _datafusion_error(f"unknown binary format function `{name}`") + match function: + DataFusionFormatFunction.ParseUrl => return _parse_url(left, right) + DataFusionFormatFunction.GetJsonObject => return _json_path_value(left, right, false) + DataFusionFormatFunction.JsonExtractPathText => return _json_path_value(left, right, true) + DataFusionFormatFunction.FromJson => return _from_json(left, right) + DataFusionFormatFunction.TryFromJson => return _try_from_json(left, right) + _ => return _datafusion_error(f"invalid binary string UDF `{function.value()}`") def _hash_sha1(value: Option[str]) -> Result[Option[str], DataFusionError]: @@ -876,8 +1047,14 @@ def _int_to_rust_i64(value: int) -> Result[RustI64, DataFusionError]: match RustI64.try_from(value): Ok(converted) => return Ok(converted) Err(_) => return _datafusion_error(f"value {value} does not fit Rust i64") + # Hashing helpers. + # URL helpers. + # JSON helpers. + # CSV helpers. def _datafusion_error[T](message: str) -> Result[T, DataFusionError]: """Build one DataFusion execution error.""" return Err(DataFusionError.Execution(f"{message}")) + # DataFusion stores UDF callbacks as 'static Rust closures. Keep the enum identity as a literal inside each callback + # instead of capturing the loop-local catalog item by reference. 
From 1120a6ab7526c57c549c3656b21560c441830f86 Mon Sep 17 00:00:00 2001 From: Danny Meijer Date: Sat, 30 May 2026 15:12:09 +0200 Subject: [PATCH 13/13] fix - 39 use keyword args for DataFusion UDF registration --- src/session/datafusion_format_functions.incn | 200 +++++++++---------- 1 file changed, 100 insertions(+), 100 deletions(-) diff --git a/src/session/datafusion_format_functions.incn b/src/session/datafusion_format_functions.incn index c4fa86e..1c65a1d 100644 --- a/src/session/datafusion_format_functions.incn +++ b/src/session/datafusion_format_functions.incn @@ -88,163 +88,163 @@ def _format_udf(function: DataFusionFormatFunction) -> ScalarUDF: match function: DataFusionFormatFunction.Sha1 => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.Sha1, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.Sha1, args.to_vec())), ) DataFusionFormatFunction.Crc32 => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.Crc32, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.Crc32, args.to_vec())), ) DataFusionFormatFunction.Xxhash64 => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.Xxhash64, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.Xxhash64, args.to_vec())), ) 
DataFusionFormatFunction.UrlEncode => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.UrlEncode, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.UrlEncode, args.to_vec())), ) DataFusionFormatFunction.UrlDecode => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.UrlDecode, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.UrlDecode, args.to_vec())), ) DataFusionFormatFunction.TryUrlDecode => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.TryUrlDecode, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.TryUrlDecode, args.to_vec())), ) DataFusionFormatFunction.ParseUrl => return create_udf( - function.value(), - [DataType.Utf8, DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.ParseUrl, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8, DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.ParseUrl, args.to_vec())), ) DataFusionFormatFunction.ParseJson => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => 
_apply_unary_string(DataFusionFormatFunction.ParseJson, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.ParseJson, args.to_vec())), ) DataFusionFormatFunction.CheckJson => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Boolean, - Volatility.Immutable, - Arc.from((args) => _apply_unary_bool(DataFusionFormatFunction.CheckJson, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Boolean, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_bool(DataFusionFormatFunction.CheckJson, args.to_vec())), ) DataFusionFormatFunction.SchemaOfJson => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.SchemaOfJson, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.SchemaOfJson, args.to_vec())), ) DataFusionFormatFunction.JsonArrayLength => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Int64, - Volatility.Immutable, - Arc.from((args) => _apply_unary_int(DataFusionFormatFunction.JsonArrayLength, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Int64, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_int(DataFusionFormatFunction.JsonArrayLength, args.to_vec())), ) DataFusionFormatFunction.JsonObjectKeys => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.JsonObjectKeys, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + 
volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.JsonObjectKeys, args.to_vec())), ) DataFusionFormatFunction.GetJsonObject => return create_udf( - function.value(), - [DataType.Utf8, DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.GetJsonObject, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8, DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.GetJsonObject, args.to_vec())), ) DataFusionFormatFunction.JsonExtractPathText => return create_udf( - function.value(), - [DataType.Utf8, DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.JsonExtractPathText, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8, DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.JsonExtractPathText, args.to_vec())), ) DataFusionFormatFunction.FromJson => return create_udf( - function.value(), - [DataType.Utf8, DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.FromJson, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8, DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.FromJson, args.to_vec())), ) DataFusionFormatFunction.TryFromJson => return create_udf( - function.value(), - [DataType.Utf8, DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.TryFromJson, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8, DataType.Utf8], + return_type=DataType.Utf8, + 
volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_binary_string(DataFusionFormatFunction.TryFromJson, args.to_vec())), ) DataFusionFormatFunction.ToJson => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.ToJson, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.ToJson, args.to_vec())), ) DataFusionFormatFunction.SchemaOfCsv => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.SchemaOfCsv, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.SchemaOfCsv, args.to_vec())), ) DataFusionFormatFunction.FromCsv => return create_udf( - function.value(), - [DataType.Utf8, DataType.Utf8], - _csv_map_data_type(), - Volatility.Immutable, - Arc.from((args) => _apply_from_csv(args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8, DataType.Utf8], + return_type=_csv_map_data_type(), + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_from_csv(args.to_vec())), ) DataFusionFormatFunction.ToCsv => return create_udf( - function.value(), - [DataType.Utf8], - DataType.Utf8, - Volatility.Immutable, - Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.ToCsv, args.to_vec())), + name=function.value(), + input_types=[DataType.Utf8], + return_type=DataType.Utf8, + volatility=Volatility.Immutable, + fun=Arc.from((args) => _apply_unary_string(DataFusionFormatFunction.ToCsv, args.to_vec())), )