Skip to content

pgstac sql changes for updated table layout#444

Draft
bitner wants to merge 11 commits into
mainfrom
v010-pr1-load-path-core
Draft

pgstac sql changes for updated table layout#444
bitner wants to merge 11 commits into
mainfrom
v010-pr1-load-path-core

Conversation

@bitner
Copy link
Copy Markdown
Collaborator

@bitner bitner commented May 20, 2026

No description provided.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates PgSTAC’s SQL schema and associated tests to support the v0.10 “split” item storage layout (top-level STAC fields split into dedicated columns), adds fragment-backed deduplication and a field registry, and updates queryables/CQL translation to prefer native promoted columns for better indexing.

Changes:

  • Introduces split columns on items, plus item_fragments (dedup) and item_field_registry (path/type sampling).
  • Updates queryables + CQL generation to resolve promoted fields to native columns and generate native-column index DDL.
  • Updates pgTap/basic SQL fixtures and expected outputs to match the new storage model and null-handling semantics.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/pgstac/sql/003a_items.sql Implements split items schema, fragment storage/dedup, field registry functions, and staging ingest changes.
src/pgstac/sql/002_collections.sql Adds collections.fragment_config and derives defaults from item_assets.
src/pgstac/sql/002a_queryables.sql Resolves queryables to split columns/native promoted paths; updates managed index generation and field matching.
src/pgstac/sql/002b_cql.sql Avoids applying cast wrappers to native promoted columns to preserve index-only scans.
src/pgstac/sql/004_search.sql Switches full-text search inputs from content->properties to split properties column.
src/pgstac/sql/004a_collectionsearch.sql Enhances collections_asitems to expose collection metadata via a properties column for CQL2 filtering.
src/pgstac/sql/998_idempotent_post.sql Registers promoted native-column queryables for the split schema and backfills older installs.
src/pgstac/sql/001a_jsonutils.sql Updates merge_jsonb/strip_jsonb to preserve explicit JSON null values.
src/pgstac/pgstac.sql Regenerated aggregate SQL reflecting the new schema/functions.
src/pgstac/tests/pgtap/003_items.sql Updates items tests for split columns, fragments, registry, and JSON null preservation.
src/pgstac/tests/pgtap/004_search.sql Updates search tests for promoted columns / split storage expectations.
src/pgstac/tests/pgtap/002a_queryables.sql Adds assertions for promoted/native queryable resolution and index DDL.
src/pgstac/tests/pgtap.sql Updates pgTap plan count for new tests.
src/pgstac/tests/basic/crud_functions.sql Updates basic CRUD script to query split columns.
src/pgstac/tests/basic/crud_functions.sql.out Updates expected output for CRUD basic test.
src/pgstac/tests/basic/cql2_searches.sql.out Updates expected hydrated output (e.g., stac_version presence).
src/pgstac/tests/basic/hydration.sql Adds basic test covering JSON null survival through dehydrate/hydrate.
src/pgstac/tests/basic/hydration.sql.out Expected output for the new hydration test.
scripts/container-scripts/test Updates DB setup for fragment_config; changes default test-suite selection.
scripts/benchmark_dehydrate.sql Adds a benchmarking script for dehydrate variants and hashing costs.
scripts/benchmark_dehydrate_v2.sql Adds an alternative benchmarking harness for dehydrate variants.
docker/pgstac/Dockerfile Adds plprofiler packages and preloading; adjusts environment defaults.

Comment thread src/pgstac/sql/003a_items.sql Outdated
Comment thread src/pgstac/sql/003a_items.sql Outdated
Comment thread src/pgstac/sql/003a_items.sql
Comment thread src/pgstac/sql/003a_items.sql Outdated
Comment thread src/pgstac/tests/pgtap/004_search.sql Outdated
Comment thread scripts/container-scripts/test
Copy link
Copy Markdown
Contributor

Copilot AI commented May 20, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • apt.postgresql.org
    • Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http (dns block)
  • astral.sh
    • Triggering command: /usr/bin/curl curl -LsSf REDACTED git conf�� amd64.deb --global /opt/pipx_bin/git committer.email (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot finished work on behalf of bitner May 20, 2026 20:06
Comment thread src/pgstac/sql/001a_jsonutils.sql Outdated
-- '"𒍟※"' _a → return SQL NULL (sentinel used by strip_jsonb to mark removed values)
-- Objects are merged key-by-key recursively; same-length arrays are merged element-by-element.
-- Any other type: _a wins.
CREATE OR REPLACE FUNCTION merge_jsonb(_a jsonb, _b jsonb) RETURNS jsonb AS $$
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot do we still need merge_jsonb and strip_jsonb with the new fragments setup? or at least, do we need to modify things so that we aren't using the null marker any more? are there any further optimizations that we can do here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the v0.10 fragment layout neither function is called from any live code path — confirmed by grep across src/pgstac/sql/:

  • strip_jsonb is already marked "RETAINED FOR USE BY MIGRATION SCRIPTS ONLY" in its own comment. The new ingest path uses strip_fragment_col() instead, which extracts shared keys into item_fragments without the sentinel; strip_jsonb can be dropped once the migration path is finalised.

  • merge_jsonb is also unreachable from v0.10 SQL: content_hydrate now merges with the plain || operator (COALESCE(frag_content->'assets','{}') || COALESCE(_item.assets,'{}')) and the sentinel case in merge_jsonb will never be triggered because no new code writes '𒍟※' into the database. It can be dropped at the same time.

  • The null marker / sentinel (𒍟※): it only lives in strip_jsonb/merge_jsonb on the SQL side — both dead paths above. On the Python side, hydration.py's dehydrate() / hydrate_py() still use DO_NOT_MERGE_MARKER and load.py still calls dehydrate(), but that whole Python load path targets the old content column that no longer exists in the v0.10 items table, so it too is pending redesign.

Suggested cleanup when ready:

  1. Remove merge_jsonb from 001a_jsonutils.sql (no callers in v0.10 SQL).
  2. Remove strip_jsonb from 001a_jsonutils.sql once the migration path no longer needs it.
  3. Drop DO_NOT_MERGE_MARKER, dehydrate(), and apply_marked_keys() from hydration.py when load.py is updated for v0.10 split-column ingest (they're only meaningful with the old content column).

No further SQL optimisations are needed here for now; the two functions are simply dead weight that will be cleaned up as part of the pypgstac/migration work.

Copy link
Copy Markdown
Contributor

Copilot AI commented May 20, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • apt.postgresql.org
    • Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http (dns block)
  • astral.sh
    • Triggering command: /usr/bin/curl curl -LsSf REDACTED git (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot finished work on behalf of bitner May 20, 2026 20:07
Copy link
Copy Markdown
Contributor

Copilot AI commented May 20, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • apt.postgresql.org
    • Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http minimal_3.13.5-1--control pes_13.0.0_all.d/tmp/apt-dpkg-install-OjeaYT/75-python3-setuptools-whl_78.1.1-0.1_all.deb 6.5_all.deb 3.4.�� eadline-common_8.2-6_all.deb -common_8.2-6_all.deb ine8t64_8.2-6_amd64.deb n3.13-stdlib_3.1sed 13_3.13.5-2+deb1-e n3-stdlib_3.13.5s/postgresql\/17\/man/man/ 3.13.5-1_amd64.deb 3.15�� .0-1_amd64.deb libgdbm-compat4t64_1.24-2_amd64.deb -tree committer.name (dns block)
    • Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http (dns block)
  • astral.sh
    • Triggering command: /usr/bin/curl curl -LsSf REDACTED git conf�� amd64.deb --global k/_temp/copilot-developer-action-main/dist/ripgrep/bin/linux-x64/rg user.email (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

-- collection_fragment_config_default: Derive a fragment_config text[] from item_assets if present.
-- Returns one serialized path per item_asset key (e.g. 'assets.thumbnail'), or NULL if no item_assets.
-- This is used when creating/upserting a collection to auto-populate fragment_config.
CREATE OR REPLACE FUNCTION collection_fragment_config_default(content jsonb) RETURNS text[] AS $$
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot we need to make sure that we have a good default here. this should include ignoring base fields like stac_extensions and stac_version and any other fields like the proj fields that are likely to be the same for every item in a collection and for things in item_assets, it should look for things that are not going to change from item to item, but we should make sure not to include things like href that will be different per item. look at various collections and items on the microsoft planetary computer (https://planetarycomputer.microsoft.com/api/stac/v1/openapi.json) and come up with a plan so that we can more smartly identify a good base set of paths to allow stripping that highly duplicated data.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in db79a72. collection_fragment_config_default now:

  1. Always includes stac_version and stac_extensions — always identical across a collection.
  2. When item_assets is present, generates depth-3 paths for each stable asset sub-key (e.g. assets.thumbnail.type, assets.thumbnail.roles, assets.thumbnail.title, assets.image.eo:bands) so that the fragment stores the shared per-asset metadata while href and other per-item fields stay in the per-item assets column where they belong.

The following sub-keys are excluded from fragmentation:
href, file:size, file:checksum, file:local_path, alternate, storage:path, storage:platform, storage:region, storage:requester_pays, storage:tier.

This required extending the whole fragment tooling to handle arbitrary-depth paths (see same commit) — extract_fragment, strip_fragment_col, and content_hydrate all now support depth-3+.

Comment thread src/pgstac/sql/002_collections.sql Outdated


-- create_collection: Insert a new collection and derive fragment_config from item_assets.
CREATE OR REPLACE FUNCTION create_collection(data jsonb) RETURNS VOID AS $$
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot, let's provide an overload that defaults to null to specify the partition_trunc and another one to specify the fragment_config for this function and the other crud functions here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in commit 5fb916a. create_collection, update_collection, and upsert_collection now each have a no-arg overload (for backwards compatibility) plus a full-param variant with optional _partition_trunc text DEFAULT NULL and _fragment_config text[] DEFAULT NULL parameters. When either is NULL the collection-level defaults are used.




CREATE OR REPLACE FUNCTION queryable(
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot does it still make sense to have this function that returns all the separate outs or should we break this into smaller functions that each only return a single output. it seems like some of the other functions introduced here might already be working towards that

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multi-output form of queryable() still makes sense for the main CQL translation path — all four outputs (path, expression, wrapper, nulled_wrapper) are consumed together in a single predicate, so a single lookup is more efficient than four separate calls.

The newer single-purpose helpers (queryable_uses_native_path, queryable_index_field, indexdef) were carved off because they serve distinct, non-translation concerns (index DDL generation, index-consistency views). That pattern is the right direction: add focused single-output wrappers only when there are callers that genuinely only need one output and the table lookup cost matters. For now, the core translation path is better served by the multi-output function.

Comment thread src/pgstac/sql/998_idempotent_post.sql Outdated
-- CQL2 queries and auto-created indexes will use the native column, not JSONB extraction.
-- First block: insert new rows only. Second block below: update existing rows that have
-- property_path=NULL (handles databases upgraded from pre-v0.10 without re-installing).
WITH promoted_queryables(name, definition, property_path, property_wrapper) AS (
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot we are repeating ourselves a lot here, can we do something so that we only specify this values list once?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consolidated in commit 5fb916a. The column/value list is now defined once in a promoted_queryables_defaults() SETOF function in 002a_queryables.sql; the INSERT … ON CONFLICT DO NOTHING in 998_idempotent_post.sql uses INSERT INTO queryables SELECT * FROM promoted_queryables_defaults() so the values are specified in exactly one place.

Copy link
Copy Markdown
Contributor

Copilot AI commented May 20, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • apt.postgresql.org
    • Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http (dns block)
    • Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http git flex bison (dns block)
    • Triggering command: /usr/lib/apt/methods/http /usr/lib/apt/methods/http run --bundle dpkg-deb --fs�� /tmp/apt-dpkg-install-XQMS62/0-python3-minimal_3.13.5-1_amd64.deb (dns block)
  • astral.sh
    • Triggering command: /usr/bin/curl curl -LsSf REDACTED git conf�� amd64.deb --global ptables credential.usern/usr/bin/python3.13 (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

bitner added 5 commits May 27, 2026 17:34
Introduce shared promoted-property metadata, collection fragment configuration, and the queryable plumbing needed for split-storage items.
Store fragment-backed item content in dedicated columns, remove the non-hydrated search path, and fold in the follow-up trigger and test hardening for the split-storage flow.
Add checked-in Planetary Computer fixtures, loader entrypoints, and supporting tooling so load-path comparisons can be reproduced locally and in CI.
Refresh repo instructions, changelog notes, and the container test entrypoint to match the new pgstac-migrate packaging and sample-data workflow.
Ignore Compound Engineering workspace files and local planning docs so they do not keep reappearing in feature branches.
@bitner bitner force-pushed the v010-pr1-load-path-core branch from 2413faa to ef9ba25 Compare May 28, 2026 13:35
Previously, hydration used a shallow one-level merge on assets and properties
that would lose data when both fragment and per-item had nested values under
the same parent object key. This affected depth-3+ fragment configurations.

Changes:
- Add jsonb_merge_recursive() in 001a_jsonutils.sql for deep-merge with item
  precedence; objects merge recursively while non-objects default to item value
- Switch content_hydrate() to use jsonb_merge_recursive for both assets and
  properties, preserving the split-storage invariant
- Add guard tests for depth-2, depth-3, depth-4, and mixed root+deep fragments
  to catch future regressions on arbitrary-depth paths
- Regenerate pgstac.sql via pgpkg stageversion

These changes ensure round-trip equality holds even when fragment_config
specifies paths like assets.image.meta.checksum (depth-3) or
properties.deep.l1.l2.l3 (depth-4) that coexist with sibling per-item values
under shared parent objects.
bitner added 5 commits May 28, 2026 10:41
…n requirements

Removed in this commit:

1. collection_base_item(jsonb) and collection_base_item(text) functions
   - Existed only for pre-v0.10 GENERATED column collections.base_item
   - Dead code in v0.10; will be dropped in incremental migration
   - Migration: pgstac--0.9.11--0.10.0.sql will include DROP FUNCTION

2. jsonb_merge_shallow(jsonb, jsonb) function and grant
   - Unused; all hydration now uses jsonb_merge_recursive for depth-3+ paths
   - Removed 30 lines + grant line; regenerated pgstac.sql

3. Misleading backwards-compat comment from 002_collections.sql
   - Fragment config already uses JSON array format in v0.10
   - Legacy dot-delimited support in fragment_path_array retained for
     operators with existing operator-configured values only

Documentation added to .plans/V0.10.0_PLAN.md section 5.8.9:

- Detailed migration handling plan for each removed item
- Clear statement of what remains (query_to_cql2, fragment_path_array legacy)
- Migration test requirements (scripts/test --migrations)

Removals are safe because:
- collection_base_item: not called by any code; will be cleanly dropped
- jsonb_merge_shallow: completely unused; replaced by recursive merge
- Tests still pass: PGTap 334/334, all basic SQL tests pass
- pgstac.sql regenerated via pgpkg stageversion

This cleanup improves maintainability and reduces confusion about what is
currently active vs. what is only for migration compatibility.
…ystem

Fragment paths are brand new in this PR with zero deployments, so there is no legacy format to support. Removed:

- fragment_path_array() legacy dot-delimited format fallback code
- Comments claiming legacy format support
- Misleading "dot-delimited" descriptor in extract_fragment documentation

All fragment paths now use exclusively JSON-array serialization format (e.g., [\"assets\",\"B1\",\"type\"])
…n format

All fragment_config settings in tests now use the canonical JSON-array serialization (e.g., [\"properties\",\"eo:cloud_cover\"]) instead of legacy dot-delimited format (e.g., properties.eo:cloud_cover).
…ay fragment paths

The fragment_config now stores JSON-array serialized paths like [\"stac_version\"]. Updated the logic that checks whether these root-level keys should be stripped from per-item storage to properly parse fragment_path_array() results.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants