Skip to content

feat: arrow_upper/array_lower UDFs for datafusion 54#362

Merged
sunng87 merged 17 commits into
dependabot/cargo/datafusion-54.0.0from
feat/array-upper-lower-udf
Jun 24, 2026
Merged

feat: arrow_upper/array_lower UDFs for datafusion 54#362
sunng87 merged 17 commits into
dependabot/cargo/datafusion-54.0.0from
feat/array-upper-lower-udf

Conversation

@sunng87

@sunng87 sunng87 commented Jun 22, 2026

Copy link
Copy Markdown
Member

No description provided.

dependabot Bot and others added 10 commits June 12, 2026 11:07
Bumps [rust_decimal](https://github.com/paupino/rust-decimal) from 1.42.0 to 1.42.1.
- [Release notes](https://github.com/paupino/rust-decimal/releases)
- [Changelog](https://github.com/paupino/rust-decimal/blob/master/CHANGELOG.md)
- [Commits](paupino/rust-decimal@1.42.0...1.42.1)

---
updated-dependencies:
- dependency-name: rust_decimal
  dependency-version: 1.42.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [getset](https://github.com/jbaublitz/getset) from 0.1.6 to 0.1.7.
- [Release notes](https://github.com/jbaublitz/getset/releases)
- [Commits](jbaublitz/getset@v0.1.6...v0.1.7)

---
updated-dependencies:
- dependency-name: getset
  dependency-version: 0.1.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [postgres-types](https://github.com/rust-postgres/rust-postgres) from 0.2.13 to 0.2.14.
- [Release notes](https://github.com/rust-postgres/rust-postgres/releases)
- [Commits](rust-postgres/rust-postgres@postgres-types-v0.2.13...postgres-types-v0.2.14)

---
updated-dependencies:
- dependency-name: postgres-types
  dependency-version: 0.2.14
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
DataFusion ships `array_length` but not `array_upper`/`array_lower`, so
Postgres client queries using them only "worked" via hardcoded blacklist
token substitution -- any non-byte-matching variant failed at planning
with `Invalid function 'array_upper'`.

Implement both as scalar UDFs (`array_bounds_udf.rs`) with Postgres
semantics:
  * arrow lists are always 1-based, so array_lower(arr,1) -> 1 and
    array_upper(arr,1) -> array_length(arr,1)
  * NULL for null array, out-of-range dim, or empty array
  * accepts List/LargeList/FixedSizeList + Int32/Int64 dims, returns Int32

Register them in `setup_pg_catalog`.

The dbeaver startup test went from FAILED -> PASS (18/19 queries now run
for real). The two dbeaver queries that still can't run are blacklisted:
one needs dynamic `generate_series` bounds (DF only takes literal Int64),
the other needs Postgres regnamespace/oid-alias string resolution.
Neither is fixable by adding the UDF; both get type-compatible NULL stubs.

Includes unit + end-to-end integration tests.
These are all Postgres `oid` (int4) alias types: casting a *name* to one
yields the matching catalog oid. DataFusion has no such types, so a bare
`'trigger'::regtype` was stripped to the string `'trigger'` and then
crashed an oid comparison with "Cannot cast string 'trigger' to Int32".

Generalize the existing regclass-only rewrite rule into
`RewriteRegCastToSubquery`, with a registry of name->oid lookup queries:
  * regclass      -> pg_class.oid      (schema-qualified relation name)
  * regnamespace  -> pg_namespace.oid  (schema name)
  * regtype       -> pg_type.oid       (schema-qualified type name)
  * regproc       -> pg_proc.oid       (schema-qualified function name)

Direction guard: only the *forward* direction (name -> oid) is rewritten,
detected by a string-literal or `$1` placeholder operand. The *reverse*
direction (`<oid-column>::regtype`, e.g. `prorettype::regtype::text` for
display, or the left side of `prorettype::regtype != 'trigger'::regtype`)
has a column operand and is left untouched; it is stripped to the bare
oid column by RemoveUnsupportedTypes, which is correct since the column
already is an oid. `regclass` is re-added to the unsupported-types set so
its reverse direction is stripped too (it was removed earlier for the
forward case, now handled by the rewrite).

Results -- client startup tests, base branch -> this branch:
  * pgcli : FAILED -> PASS (the `'trigger'::regtype` crash is fixed)
  * dbeaver: PASS (already green from the array_upper work)
  * psql  : 3 failures -> 1 failure; the two regclass/regtype-related
            queries now pass. The remaining failure is a DataFusion
            internal "ScalarSubqueryExpr evaluated before the subquery was
            executed" bug in `IN (subquery UNION ALL VALUES(...))`, which
            is blacklisted here as it can't be fixed at the SQL-rewrite layer.

The full workspace test suite is green.
Bumps [log](https://github.com/rust-lang/log) from 0.4.32 to 0.4.33.
- [Release notes](https://github.com/rust-lang/log/releases)
- [Changelog](https://github.com/rust-lang/log/blob/master/CHANGELOG.md)
- [Commits](rust-lang/log@0.4.32...0.4.33)

---
updated-dependencies:
- dependency-name: log
  dependency-version: 0.4.33
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fix ordered_param_types and add unit test

Refactor sorting of parameter types to sort numerically based on placeholder keys. Add unit test to verify correct sorting of parameter types.

* fix cargo fmt
Bumps [pgwire](https://github.com/sunng87/pgwire) from 0.40.1 to 0.40.3.
- [Release notes](https://github.com/sunng87/pgwire/releases)
- [Changelog](https://github.com/sunng87/pgwire/blob/master/CHANGELOG.md)
- [Commits](sunng87/pgwire@v0.40.1...v0.40.3)

---
updated-dependencies:
- dependency-name: pgwire
  dependency-version: 0.40.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
The previous commit added 3 BLACKLIST_SQL_MAPPING entries to keep client
startup tests green. Two of them can be replaced with general, correct
rewrite rules; only the one that genuinely needs schema awareness remains.

1. dbeaver type lookup (REMOVED):
   `generate_series(1, array_upper(current_schemas(false), 1))` failed
   because our array_upper UDF returns Int32 (correct Postgres int4) but
   DataFusion's generate_series wants Int64. Postgres implicitly coerces
   int4->int8 here; DF does not. New rule `CastArrayBoundsForGenerateSeries`
   wraps array_upper/array_lower args of generate_series in ::bigint, the
   way Postgres would.

2. psql foreign-key \d lookup (REMOVED):
   `VALUES ('16417'::pg_catalog.regclass)` broke DF's subquery
   decorrelator because the regclass rewrite embedded a (SELECT ...) inside
   VALUES. Fix it at the source: a *numeric* string operand like '16417'
   should resolve directly to that oid (Postgres behavior), not do a name
   lookup. RewriteRegCastToSubquery now emits a literal oid for numeric
   operands, which both is more correct and avoids the nested subquery.

3. dbeaver relation-size lookup (KEPT):
   `WHERE c.relnamespace='public'` compares an oid column to a bare string,
   needing Postgres' implicit string->oid coercion. The explicit
   `'public'::regnamespace` form is already handled, but detecting that a
   bare string needs coercing requires knowing the column is an oid column
   -- schema awareness the AST-rewrite layer lacks. Hardcoding pg_catalog
   column names would risk regressing user tables with same-named columns,
   so this single entry stays, with an accurate comment.

Net: BLACKLIST_SQL_MAPPING goes from +3 (previous commit) to +1 vs the
base branch, and that one is the genuinely hard case. Full test suite +
clippy are green.
@sunng87 sunng87 changed the title feat: arrow_upper/array_lower UDFs feat: arrow_upper/array_lower UDFs for datafusion 54 Jun 22, 2026
sunng87 and others added 7 commits June 22, 2026 17:19
Bumps [bytes](https://github.com/tokio-rs/bytes) from 1.11.1 to 1.12.0.
- [Release notes](https://github.com/tokio-rs/bytes/releases)
- [Changelog](https://github.com/tokio-rs/bytes/blob/master/CHANGELOG.md)
- [Commits](tokio-rs/bytes@v1.11.1...v1.12.0)

---
updated-dependencies:
- dependency-name: bytes
  dependency-version: 1.12.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat: postgres-correct oid/string comparison via Field metadata + analyzer rule

Postgres catalog columns like pg_class.relnamespace or pg_attribute.atttypid
are object identifiers stored as Int32, but clients compare them against
string literals (WHERE relnamespace = 'public', WHERE atttypid = '23') and
expect Postgres name/numeric oid resolution. DataFusion has no notion of
"oid": it coerces Int32_col = Utf8('public') to CAST(col AS Utf8) = 'public',
silently returning zero rows for name strings.

This implements the design as a standalone feature:

1. oid_field.rs -- annotate oid/oid-alias columns via Arrow Field metadata
   (key "pg.oid_alias" -> kind: oid/regclass/regnamespace/regtype/regproc).
   The metadata survives into the logical plan schema, readable via
   ExprSchema::metadata(). pg_namespace/pg_class/pg_attribute oid columns
   are annotated.

2. oid_coercion_rule.rs -- an AnalyzerRule (OidStringCoercion) that rewrites
   oid-alias column comparisons against string literals using Postgres
   semantics:
     - numeric string -> literal Int32 oid (e.g. '16417' -> 16417)
     - name string    -> scalar subquery SELECT oid FROM <table>
       WHERE <namecol> = $s (regnamespace->pg_namespace, regclass->pg_class,
       regproc->pg_proc)
   The rule is order-independent (handles both the pre-coercion `col = 'x'`
   form and the post-coercion `CAST(col AS Utf8) = 'x'` form) and supports
   =, <>, and IN (...).

   Name resolution works in the sync analyzer because
   PgCatalogSchemaProvider can build a table provider synchronously
   (dynamic tables defer data fetch to execution via a streaming source).
   The rule captures an OidLookupProvider handle to build the lookup
   subquery plans; data materializes lazily at execution.

3. Registered in setup_pg_catalog.

Verified end-to-end: relnamespace = '<name>' and relnamespace = '<oid>'
return the same rows as the int-oid baseline (which previously returned
empty). 29 lib tests + full workspace green.

* feat: stamp oid-alias metadata into pg_catalog feather exports

The oid-coercion analyzer rule relies on Arrow field metadata
(`pg.oid_alias=<kind>`) to know which Int32 columns are object identifiers.
Previously only the handful of dynamic tables (pg_namespace, pg_class,
pg_attribute) were annotated by hand. The ~50 static tables loaded from the
embedded feather files (pg_type, pg_proc, pg_aggregate, pg_operator, ...)
had no metadata, so string-vs-oid comparisons on them silently returned
zero rows.

This moves the annotation to the source of truth: the export scripts that
read the real schema from PostgreSQL.

export_pg_catalog_arrow.sh (embedded exporter):
  - Map every oid-alias type (oid, regproc, regprocedure, regoper,
    regoperator, regclass, regtype, regnamespace, regrole, regconfig,
    regdictionary, regcollation) to int32 (the raw oid), not string.
  - reg* columns are cast with ::oid at SELECT time so Postgres returns
    the integer, not its display (name) form.
  - Attach field metadata `pg.oid_alias=<pg_type>` to those columns.
    Arrow IPC round-trips this metadata faithfully (verified pyarrow ->
    feather -> arrow-rs).

pg_to_arrow.py (generic exporter): same treatment, detecting oid-alias
columns from pg_type at runtime.

Regenerated all 65 feather files. 251 oid-alias columns are now annotated
(216 oid, 35 regproc); the 21 reg* columns previously stored as string
(e.g. pg_type.typinput, pg_aggregate.aggfnoid) are now int32 + metadata.

The dynamic tables (pg_namespace/pg_class/pg_attribute) keep their
hand-annotated oid_field() calls since they build schemas at runtime; the
feather metadata covers everything else transparently.

Added end-to-end test feather_metadata_reaches_rule_for_static_tables:
pg_type.typinput (regproc, annotated solely via the feather file) resolves
a proc-name string via pg_proc name->oid lookup, matching the int-oid
baseline. 60 tests pass.

* chore: format and cleanup

* refactor: remove RewriteRegclassCastToSubquery, rely on oid-coercion rule

The oid-coercion analyzer rule + Field metadata (added in be40d6b/0300310)
now handle `'name'::regclass::oid` casts in comparison predicates natively
at the logical-plan layer, making the SQL-rewrite rule redundant for that
purpose. Remove it.

To support this, fix RemoveUnsupportedTypes to peel nested casts to a
fixed point. Previously the visitor replaced an `Expr::Cast` with its inner
expression in a single pass and never re-visited the replacement, so
`'x'::regclass::oid` became `'x'::regclass` (the residual REGCLASS then
being rejected by DataFusion as an unsupported SQL type at plan time).
Looping to a fixed point fully unwinds it to `'x'`, which the oid-coercion
rule resolves against any oid-alias column it is compared with.

Net: -162 lines. The `'x'::regclass::oid` pattern in comparisons (the only
shape exercised by client queries -- pgadbc, psql) is now covered by:
  1. RemoveUnsupportedTypes  ->  peels to bare 'x'
  2. OidStringCoercion rule  ->  resolves 'x' via pg_namespace/pg_class/pg_proc

Trade-off: schema-qualified names in casts (e.g.
`oid = 'pg_catalog.pg_type'::regclass::oid`) no longer split into
schema+name. The old rule attempted this via parse_ident()+current_schema(),
but that path was never verified to resolve correctly and relies on UDFs
DataFusion may not support. Bare-oid columns never supported name
resolution in either approach by design (no names-for-oid table).

All 59 remaining tests pass (the removed rule's own test was deleted).
Added regression assertions to test_remove_unsupported_types pinning the
full-peel behavior.

* refactor: unify oid-alias type names, rename RemoveUnsupportedTypes

Two cleanups to the oid machinery:

1. Single source of truth. The list of oid-alias type names previously lived
   in two places that could drift:
     * RemoveUnsupportedTypes had a hardcoded set of 6 names to strip
       (regclass, regproc, regtype, regtype[], regnamespace, oid)
     * oid_field::kind had 5 named constants for metadata annotation
   Both must enumerate the same Postgres oid-alias family, so consolidate
   them into one constant `oid_field::OID_ALIAS_TYPE_NAMES`, now covering
   the full pg_catalog oid-alias set (oid, regproc, regprocedure, regoper,
   regoperator, regclass, regtype, regnamespace, regrole, regconfig,
   regdictionary, regcollation) -- matching what the feather export scripts
   stamp as metadata. RemoveOidTypeCast derives its strip set from it,
   generating the bare / `pg_catalog.`-qualified / array spellings.

   Net effect: cast types like `::regrole` that DF rejects are now stripped
   too (previously would error); no client query regressed (over-stripping
   is harmless since these casts can only ever resolve to oid behavior).

2. Rename RemoveUnsupportedTypes -> RemoveOidTypeCast. The old name was a
   misnomer inherited from when the rule was generic; it now exists for one
   specific purpose: stripping oid-alias cast type names so they survive
   DataFusion's sql_to_plan (which has no notion of oid types) and reach the
   oid-coercion analyzer rule at the logical-plan layer.

   Visitor renamed accordingly. oid_field()'s debug_assert now validates
   against OID_ALIAS_TYPE_NAMES instead of a separate match.

Added regression assertions: newly-covered `::regrole` and the array variant
`::regtype[]` (the latter used by pgcli's `proallargtypes::regtype[]`).
59 tests pass.

* chore: format
Merging master's oid-coercion work (PR #363) into the DF54 branch broke
compilation and the dbeaver e2e test. Two distinct problems, both rooted in
the DF53->DF54 upgrade:

1. DataFusion 54 API: Cast/TryCast lost `data_type` in favor of
   `field: FieldRef`. Updated the oid-coercion-rule tests; the production
   code already used `Cast { expr, .. }`.

2. Type-coercion direction change: DF53 coerced `Int32 = Utf8` leniently
   (column -> string, returning empty); DF54 coerces the other way
   (string -> Int32) and ERRORS on non-numeric names. Master removed the
   SQL-layer RewriteRegCastToSubquery relying on DF53's leniency, but on
   DF54 that removal makes explicit `'name'::regclass` casts (stripped to
   bare strings by RemoveOidTypeCast) hit the broken coercion path.

   Restore RewriteRegCastToSubquery (run before RemoveOidTypeCast) so
   explicit forward oid-alias casts resolve to an oid (literal for numeric
   strings, name-lookup subquery otherwise) in ANY position -- projection,
   join ON, filter. This is complementary to master's oid-coercion
   analyzer rule, which handles the implicit `oid_col = 'name'` case via
   Field metadata.

Also:
- Extend the oid-coercion analyzer rule to rewrite LogicalPlan::Join
  filters (not just Filter), and to peek through CAST('str' AS int) so it
  stays order-independent under DF54's appended-after-TypeCoercion ordering.
- Remove stale RewriteRegCastToSubquery tests left mis-merged, then
  re-add them against the restored rule; refresh the now-inaccurate
  parser.rs blacklist comment.
- Drop redundant PostgreSqlDialect/Parser imports from the test module
  (now at crate top level).

70 tests pass (38 lib + 32 integration), no new clippy warnings.
…pePlanner

Eliminates RewriteRegCastToSubquery and RemoveOidTypeCast (two SQL/AST-layer
rewrite rules, ~570 lines) in favor of a single, metadata-aware mechanism
that resolves oid-alias casts and comparisons at the analyzer layer -- the
architectural goal of pushing oid resolution off the AST layer entirely.

Key insight: DataFusion 54 exposes a TypePlanner extension point consulted
*before* the default type conversion that rejects `regclass`/`regtype`/etc.
Registering a TypePlanner that maps each oid-alias type name to an int4 Field
carrying `pg.oid_alias` metadata means DF accepts the casts (no more
"Unsupported SQL type") AND the kind survives into the logical plan inside
the resulting Expr::Cast.field. The oid-coercion analyzer rule then reads
that metadata to resolve names -> oids.

Changes:
* oid_type_planner.rs (new): PgOidTypePlanner maps regclass/regproc/regtype/
  regnamespace/oid (+ pg_catalog.-prefixed) to int4 Field with kind metadata.
  Registered via SessionStateBuilder in setup_pg_catalog.
* oid_coercion_rule.rs: PlanRewriter now uses map_expressions (covers every
  plan node -- projections, filters, join ON filters -- not just Filter/Join).
  Added a Cast/TryCast arm that reads the cast's OWN field metadata (not just
  the column's), so oid-column casts like `classoid = 'pg_namespace'::regclass`
  resolve even though classoid is bare `oid`-kind. Added regtype -> pg_type
  name resolution. Added kind_from_field + immediate_str_literal helpers.
* rules.rs: deleted RewriteRegCastToSubquery, RemoveOidTypeCast, and their
  4 test functions.
* parser.rs: dropped both rules from the registration list + their imports.
* oid_field.rs: refreshed OID_ALIAS_TYPE_NAMES doc (set is now consumed by
  the TypePlanner, not a strip rule).

Net: -387 lines, 69 tests pass (was 70; the 4 deleted SQL-rewrite tests are
superseded by analyzer-rule + type-planner tests), zero new clippy warnings.

This completes the goal of removing all SQL-level oid-cast rewriting: both
the cast-stripping (RemoveOidTypeCast) and the cast-rebuilding
(RewriteRegCastToSubquery) are gone, leaving resolution entirely to the
metadata-aware analyzer rule.
The oid-coercion analyzer rule (added with the TypePlanner work) now resolves
`c.relnamespace = 'public'` -- a bare string compared against an oid-alias
Int32 column -- to the namespace oid via a pg_namespace lookup, the way
Postgres does. That was the sole reason this query was blacklisted (the rest
of it -- pg_total_relation_size / pg_relation_size -- already worked), so the
NULL-fallback entry is no longer needed: the real query executes directly.

6 of 7 blacklist entries remain. Each is blocked by an unrelated, non-oid
DataFusion limitation, verified by probing the exact original SQL through the
full pipeline:
  - pgcli FK columns:    unnest() of a subquery result
  - pgcli types:         correlated scalar subqueries
  - psql \d policies:    `array(SELECT ...)` constructor unsupported
  - statistics:          `any(stxkind)` -> array_has type mismatch
  - publications:        duplicate NULL projection names
  - grafana search_path: missing current_setting() UDF

69 tests pass (dbeaver sends the un-blacklisted query and succeeds).
@sunng87 sunng87 merged commit 2f71534 into dependabot/cargo/datafusion-54.0.0 Jun 24, 2026
3 of 7 checks passed
@sunng87 sunng87 deleted the feat/array-upper-lower-udf branch June 24, 2026 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants