feat: arrow_upper/array_lower UDFs for datafusion 54#362
Merged
sunng87 merged 17 commits intoJun 24, 2026
Conversation
Bumps [rust_decimal](https://github.com/paupino/rust-decimal) from 1.42.0 to 1.42.1. - [Release notes](https://github.com/paupino/rust-decimal/releases) - [Changelog](https://github.com/paupino/rust-decimal/blob/master/CHANGELOG.md) - [Commits](paupino/rust-decimal@1.42.0...1.42.1) --- updated-dependencies: - dependency-name: rust_decimal dependency-version: 1.42.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [getset](https://github.com/jbaublitz/getset) from 0.1.6 to 0.1.7. - [Release notes](https://github.com/jbaublitz/getset/releases) - [Commits](jbaublitz/getset@v0.1.6...v0.1.7) --- updated-dependencies: - dependency-name: getset dependency-version: 0.1.7 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [postgres-types](https://github.com/rust-postgres/rust-postgres) from 0.2.13 to 0.2.14. - [Release notes](https://github.com/rust-postgres/rust-postgres/releases) - [Commits](rust-postgres/rust-postgres@postgres-types-v0.2.13...postgres-types-v0.2.14) --- updated-dependencies: - dependency-name: postgres-types dependency-version: 0.2.14 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
DataFusion ships `array_length` but not `array_upper`/`array_lower`, so
Postgres client queries using them only "worked" via hardcoded blacklist
token substitution -- any non-byte-matching variant failed at planning
with `Invalid function 'array_upper'`.
Implement both as scalar UDFs (`array_bounds_udf.rs`) with Postgres
semantics:
* arrow lists are always 1-based, so array_lower(arr,1) -> 1 and
array_upper(arr,1) -> array_length(arr,1)
* NULL for null array, out-of-range dim, or empty array
* accepts List/LargeList/FixedSizeList + Int32/Int64 dims, returns Int32
Register them in `setup_pg_catalog`.
The dbeaver startup test went from FAILED -> PASS (18/19 queries now run
for real). The two dbeaver queries that still can't run are blacklisted:
one needs dynamic `generate_series` bounds (DF only takes literal Int64),
the other needs Postgres regnamespace/oid-alias string resolution.
Neither is fixable by adding the UDF; both get type-compatible NULL stubs.
Includes unit + end-to-end integration tests.
These are all Postgres `oid` (int4) alias types: casting a *name* to one
yields the matching catalog oid. DataFusion has no such types, so a bare
`'trigger'::regtype` was stripped to the string `'trigger'` and then
crashed an oid comparison with "Cannot cast string 'trigger' to Int32".
Generalize the existing regclass-only rewrite rule into
`RewriteRegCastToSubquery`, with a registry of name->oid lookup queries:
* regclass -> pg_class.oid (schema-qualified relation name)
* regnamespace -> pg_namespace.oid (schema name)
* regtype -> pg_type.oid (schema-qualified type name)
* regproc -> pg_proc.oid (schema-qualified function name)
Direction guard: only the *forward* direction (name -> oid) is rewritten,
detected by a string-literal or `$1` placeholder operand. The *reverse*
direction (`<oid-column>::regtype`, e.g. `prorettype::regtype::text` for
display, or the left side of `prorettype::regtype != 'trigger'::regtype`)
has a column operand and is left untouched; it is stripped to the bare
oid column by RemoveUnsupportedTypes, which is correct since the column
already is an oid. `regclass` is re-added to the unsupported-types set so
its reverse direction is stripped too (it was removed earlier for the
forward case, now handled by the rewrite).
Results -- client startup tests, base branch -> this branch:
* pgcli : FAILED -> PASS (the `'trigger'::regtype` crash is fixed)
* dbeaver: PASS (already green from the array_upper work)
* psql : 3 failures -> 1 failure; the two regclass/regtype-related
queries now pass. The remaining failure is a DataFusion
internal "ScalarSubqueryExpr evaluated before the subquery was
executed" bug in `IN (subquery UNION ALL VALUES(...))`, which
is blacklisted here as it can't be fixed at the SQL-rewrite layer.
The full workspace test suite is green.
Bumps [log](https://github.com/rust-lang/log) from 0.4.32 to 0.4.33. - [Release notes](https://github.com/rust-lang/log/releases) - [Changelog](https://github.com/rust-lang/log/blob/master/CHANGELOG.md) - [Commits](rust-lang/log@0.4.32...0.4.33) --- updated-dependencies: - dependency-name: log dependency-version: 0.4.33 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fix ordered_param_types and add unit test Refactor sorting of parameter types to sort numerically based on placeholder keys. Add unit test to verify correct sorting of parameter types. * fix cargo fmt
Bumps [pgwire](https://github.com/sunng87/pgwire) from 0.40.1 to 0.40.3. - [Release notes](https://github.com/sunng87/pgwire/releases) - [Changelog](https://github.com/sunng87/pgwire/blob/master/CHANGELOG.md) - [Commits](sunng87/pgwire@v0.40.1...v0.40.3) --- updated-dependencies: - dependency-name: pgwire dependency-version: 0.40.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
The previous commit added 3 BLACKLIST_SQL_MAPPING entries to keep client
startup tests green. Two of them can be replaced with general, correct
rewrite rules; only the one that genuinely needs schema awareness remains.
1. dbeaver type lookup (REMOVED):
`generate_series(1, array_upper(current_schemas(false), 1))` failed
because our array_upper UDF returns Int32 (correct Postgres int4) but
DataFusion's generate_series wants Int64. Postgres implicitly coerces
int4->int8 here; DF does not. New rule `CastArrayBoundsForGenerateSeries`
wraps array_upper/array_lower args of generate_series in ::bigint, the
way Postgres would.
2. psql foreign-key \d lookup (REMOVED):
`VALUES ('16417'::pg_catalog.regclass)` broke DF's subquery
decorrelator because the regclass rewrite embedded a (SELECT ...) inside
VALUES. Fix it at the source: a *numeric* string operand like '16417'
should resolve directly to that oid (Postgres behavior), not do a name
lookup. RewriteRegCastToSubquery now emits a literal oid for numeric
operands, which both is more correct and avoids the nested subquery.
3. dbeaver relation-size lookup (KEPT):
`WHERE c.relnamespace='public'` compares an oid column to a bare string,
needing Postgres' implicit string->oid coercion. The explicit
`'public'::regnamespace` form is already handled, but detecting that a
bare string needs coercing requires knowing the column is an oid column
-- schema awareness the AST-rewrite layer lacks. Hardcoding pg_catalog
column names would risk regressing user tables with same-named columns,
so this single entry stays, with an accurate comment.
Net: BLACKLIST_SQL_MAPPING goes from +3 (previous commit) to +1 vs the
base branch, and that one is the genuinely hard case. Full test suite +
clippy are green.
Bumps [bytes](https://github.com/tokio-rs/bytes) from 1.11.1 to 1.12.0. - [Release notes](https://github.com/tokio-rs/bytes/releases) - [Changelog](https://github.com/tokio-rs/bytes/blob/master/CHANGELOG.md) - [Commits](tokio-rs/bytes@v1.11.1...v1.12.0) --- updated-dependencies: - dependency-name: bytes dependency-version: 1.12.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat: postgres-correct oid/string comparison via Field metadata + analyzer rule
Postgres catalog columns like pg_class.relnamespace or pg_attribute.atttypid
are object identifiers stored as Int32, but clients compare them against
string literals (WHERE relnamespace = 'public', WHERE atttypid = '23') and
expect Postgres name/numeric oid resolution. DataFusion has no notion of
"oid": it coerces Int32_col = Utf8('public') to CAST(col AS Utf8) = 'public',
silently returning zero rows for name strings.
This implements the design as a standalone feature:
1. oid_field.rs -- annotate oid/oid-alias columns via Arrow Field metadata
(key "pg.oid_alias" -> kind: oid/regclass/regnamespace/regtype/regproc).
The metadata survives into the logical plan schema, readable via
ExprSchema::metadata(). pg_namespace/pg_class/pg_attribute oid columns
are annotated.
2. oid_coercion_rule.rs -- an AnalyzerRule (OidStringCoercion) that rewrites
oid-alias column comparisons against string literals using Postgres
semantics:
- numeric string -> literal Int32 oid (e.g. '16417' -> 16417)
- name string -> scalar subquery SELECT oid FROM <table>
WHERE <namecol> = $s (regnamespace->pg_namespace, regclass->pg_class,
regproc->pg_proc)
The rule is order-independent (handles both the pre-coercion `col = 'x'`
form and the post-coercion `CAST(col AS Utf8) = 'x'` form) and supports
=, <>, and IN (...).
Name resolution works in the sync analyzer because
PgCatalogSchemaProvider can build a table provider synchronously
(dynamic tables defer data fetch to execution via a streaming source).
The rule captures an OidLookupProvider handle to build the lookup
subquery plans; data materializes lazily at execution.
3. Registered in setup_pg_catalog.
Verified end-to-end: relnamespace = '<name>' and relnamespace = '<oid>'
return the same rows as the int-oid baseline (which previously returned
empty). 29 lib tests + full workspace green.
* feat: stamp oid-alias metadata into pg_catalog feather exports
The oid-coercion analyzer rule relies on Arrow field metadata
(`pg.oid_alias=<kind>`) to know which Int32 columns are object identifiers.
Previously only the handful of dynamic tables (pg_namespace, pg_class,
pg_attribute) were annotated by hand. The ~50 static tables loaded from the
embedded feather files (pg_type, pg_proc, pg_aggregate, pg_operator, ...)
had no metadata, so string-vs-oid comparisons on them silently returned
zero rows.
This moves the annotation to the source of truth: the export scripts that
read the real schema from PostgreSQL.
export_pg_catalog_arrow.sh (embedded exporter):
- Map every oid-alias type (oid, regproc, regprocedure, regoper,
regoperator, regclass, regtype, regnamespace, regrole, regconfig,
regdictionary, regcollation) to int32 (the raw oid), not string.
- reg* columns are cast with ::oid at SELECT time so Postgres returns
the integer, not its display (name) form.
- Attach field metadata `pg.oid_alias=<pg_type>` to those columns.
Arrow IPC round-trips this metadata faithfully (verified pyarrow ->
feather -> arrow-rs).
pg_to_arrow.py (generic exporter): same treatment, detecting oid-alias
columns from pg_type at runtime.
Regenerated all 65 feather files. 251 oid-alias columns are now annotated
(216 oid, 35 regproc); the 21 reg* columns previously stored as string
(e.g. pg_type.typinput, pg_aggregate.aggfnoid) are now int32 + metadata.
The dynamic tables (pg_namespace/pg_class/pg_attribute) keep their
hand-annotated oid_field() calls since they build schemas at runtime; the
feather metadata covers everything else transparently.
Added end-to-end test feather_metadata_reaches_rule_for_static_tables:
pg_type.typinput (regproc, annotated solely via the feather file) resolves
a proc-name string via pg_proc name->oid lookup, matching the int-oid
baseline. 60 tests pass.
* chore: format and cleanup
* refactor: remove RewriteRegclassCastToSubquery, rely on oid-coercion rule
The oid-coercion analyzer rule + Field metadata (added in be40d6b/0300310)
now handle `'name'::regclass::oid` casts in comparison predicates natively
at the logical-plan layer, making the SQL-rewrite rule redundant for that
purpose. Remove it.
To support this, fix RemoveUnsupportedTypes to peel nested casts to a
fixed point. Previously the visitor replaced an `Expr::Cast` with its inner
expression in a single pass and never re-visited the replacement, so
`'x'::regclass::oid` became `'x'::regclass` (the residual REGCLASS then
being rejected by DataFusion as an unsupported SQL type at plan time).
Looping to a fixed point fully unwinds it to `'x'`, which the oid-coercion
rule resolves against any oid-alias column it is compared with.
Net: -162 lines. The `'x'::regclass::oid` pattern in comparisons (the only
shape exercised by client queries -- pgadbc, psql) is now covered by:
1. RemoveUnsupportedTypes -> peels to bare 'x'
2. OidStringCoercion rule -> resolves 'x' via pg_namespace/pg_class/pg_proc
Trade-off: schema-qualified names in casts (e.g.
`oid = 'pg_catalog.pg_type'::regclass::oid`) no longer split into
schema+name. The old rule attempted this via parse_ident()+current_schema(),
but that path was never verified to resolve correctly and relies on UDFs
DataFusion may not support. Bare-oid columns never supported name
resolution in either approach by design (no names-for-oid table).
All 59 remaining tests pass (the removed rule's own test was deleted).
Added regression assertions to test_remove_unsupported_types pinning the
full-peel behavior.
* refactor: unify oid-alias type names, rename RemoveUnsupportedTypes
Two cleanups to the oid machinery:
1. Single source of truth. The list of oid-alias type names previously lived
in two places that could drift:
* RemoveUnsupportedTypes had a hardcoded set of 6 names to strip
(regclass, regproc, regtype, regtype[], regnamespace, oid)
* oid_field::kind had 5 named constants for metadata annotation
Both must enumerate the same Postgres oid-alias family, so consolidate
them into one constant `oid_field::OID_ALIAS_TYPE_NAMES`, now covering
the full pg_catalog oid-alias set (oid, regproc, regprocedure, regoper,
regoperator, regclass, regtype, regnamespace, regrole, regconfig,
regdictionary, regcollation) -- matching what the feather export scripts
stamp as metadata. RemoveOidTypeCast derives its strip set from it,
generating the bare / `pg_catalog.`-qualified / array spellings.
Net effect: cast types like `::regrole` that DF rejects are now stripped
too (previously would error); no client query regressed (over-stripping
is harmless since these casts can only ever resolve to oid behavior).
2. Rename RemoveUnsupportedTypes -> RemoveOidTypeCast. The old name was a
misnomer inherited from when the rule was generic; it now exists for one
specific purpose: stripping oid-alias cast type names so they survive
DataFusion's sql_to_plan (which has no notion of oid types) and reach the
oid-coercion analyzer rule at the logical-plan layer.
Visitor renamed accordingly. oid_field()'s debug_assert now validates
against OID_ALIAS_TYPE_NAMES instead of a separate match.
Added regression assertions: newly-covered `::regrole` and the array variant
`::regtype[]` (the latter used by pgcli's `proallargtypes::regtype[]`).
59 tests pass.
* chore: format
Merging master's oid-coercion work (PR #363) into the DF54 branch broke compilation and the dbeaver e2e test. Two distinct problems, both rooted in the DF53->DF54 upgrade: 1. DataFusion 54 API: Cast/TryCast lost `data_type` in favor of `field: FieldRef`. Updated the oid-coercion-rule tests; the production code already used `Cast { expr, .. }`. 2. Type-coercion direction change: DF53 coerced `Int32 = Utf8` leniently (column -> string, returning empty); DF54 coerces the other way (string -> Int32) and ERRORS on non-numeric names. Master removed the SQL-layer RewriteRegCastToSubquery relying on DF53's leniency, but on DF54 that removal makes explicit `'name'::regclass` casts (stripped to bare strings by RemoveOidTypeCast) hit the broken coercion path. Restore RewriteRegCastToSubquery (run before RemoveOidTypeCast) so explicit forward oid-alias casts resolve to an oid (literal for numeric strings, name-lookup subquery otherwise) in ANY position -- projection, join ON, filter. This is complementary to master's oid-coercion analyzer rule, which handles the implicit `oid_col = 'name'` case via Field metadata. Also: - Extend the oid-coercion analyzer rule to rewrite LogicalPlan::Join filters (not just Filter), and to peek through CAST('str' AS int) so it stays order-independent under DF54's appended-after-TypeCoercion ordering. - Remove stale RewriteRegCastToSubquery tests left mis-merged, then re-add them against the restored rule; refresh the now-inaccurate parser.rs blacklist comment. - Drop redundant PostgreSqlDialect/Parser imports from the test module (now at crate top level). 70 tests pass (38 lib + 32 integration), no new clippy warnings.
…pePlanner Eliminates RewriteRegCastToSubquery and RemoveOidTypeCast (two SQL/AST-layer rewrite rules, ~570 lines) in favor of a single, metadata-aware mechanism that resolves oid-alias casts and comparisons at the analyzer layer -- the architectural goal of pushing oid resolution off the AST layer entirely. Key insight: DataFusion 54 exposes a TypePlanner extension point consulted *before* the default type conversion that rejects `regclass`/`regtype`/etc. Registering a TypePlanner that maps each oid-alias type name to an int4 Field carrying `pg.oid_alias` metadata means DF accepts the casts (no more "Unsupported SQL type") AND the kind survives into the logical plan inside the resulting Expr::Cast.field. The oid-coercion analyzer rule then reads that metadata to resolve names -> oids. Changes: * oid_type_planner.rs (new): PgOidTypePlanner maps regclass/regproc/regtype/ regnamespace/oid (+ pg_catalog.-prefixed) to int4 Field with kind metadata. Registered via SessionStateBuilder in setup_pg_catalog. * oid_coercion_rule.rs: PlanRewriter now uses map_expressions (covers every plan node -- projections, filters, join ON filters -- not just Filter/Join). Added a Cast/TryCast arm that reads the cast's OWN field metadata (not just the column's), so oid-column casts like `classoid = 'pg_namespace'::regclass` resolve even though classoid is bare `oid`-kind. Added regtype -> pg_type name resolution. Added kind_from_field + immediate_str_literal helpers. * rules.rs: deleted RewriteRegCastToSubquery, RemoveOidTypeCast, and their 4 test functions. * parser.rs: dropped both rules from the registration list + their imports. * oid_field.rs: refreshed OID_ALIAS_TYPE_NAMES doc (set is now consumed by the TypePlanner, not a strip rule). Net: -387 lines, 69 tests pass (was 70; the 4 deleted SQL-rewrite tests are superseded by analyzer-rule + type-planner tests), zero new clippy warnings. This completes the goal of removing all SQL-level oid-cast rewriting: both the cast-stripping (RemoveOidTypeCast) and the cast-rebuilding (RewriteRegCastToSubquery) are gone, leaving resolution entirely to the metadata-aware analyzer rule.
The oid-coercion analyzer rule (added with the TypePlanner work) now resolves `c.relnamespace = 'public'` -- a bare string compared against an oid-alias Int32 column -- to the namespace oid via a pg_namespace lookup, the way Postgres does. That was the sole reason this query was blacklisted (the rest of it -- pg_total_relation_size / pg_relation_size -- already worked), so the NULL-fallback entry is no longer needed: the real query executes directly. 6 of 7 blacklist entries remain. Each is blocked by an unrelated, non-oid DataFusion limitation, verified by probing the exact original SQL through the full pipeline: - pgcli FK columns: unnest() of a subquery result - pgcli types: correlated scalar subqueries - psql \d policies: `array(SELECT ...)` constructor unsupported - statistics: `any(stxkind)` -> array_has type mismatch - publications: duplicate NULL projection names - grafana search_path: missing current_setting() UDF 69 tests pass (dbeaver sends the un-blacklisted query and succeeds).
2f71534
into
dependabot/cargo/datafusion-54.0.0
3 of 7 checks passed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.