Add item field registry and fragment-based split storage #442

Draft
bitner wants to merge 41 commits into main from v010-pr3-pr4-item-registry-and-fragments

Conversation

@bitner (Collaborator) commented May 18, 2026

Summary

  • add fragment-based split storage for items together with fragment dedup and garbage collection
  • add per-collection item field registry tracking and refresh/update helpers
  • expand PGTap coverage for split-storage hydration, fragment reuse, registry updates, and legacy fallback behavior
  • regenerate staged SQL and migrations, and keep the incremental migration compatible with partitioned-table foreign key limitations
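
To illustrate the fragment dedup and garbage collection idea from the first bullet, here is a minimal Python sketch, not the PR's SQL. It assumes a fragment is a JSON object identified by a hash of its canonical serialization. The in-memory `fragments` dict, the SHA-256 choice, and the helper signatures are hypothetical stand-ins; only the function names echo those listed in the commit messages.

```python
import hashlib
import json

# Hypothetical in-memory stand-in for a fragment table keyed by (collection, hash).
fragments = {}

def hash_fragment(fragment):
    """Hash a fragment via canonical JSON so key order does not matter.

    SHA-256 is an assumption; the PR does not state which hash it uses.
    """
    canonical = json.dumps(fragment, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def get_or_create_fragment(collection, fragment):
    """Return the key of an identical stored fragment, storing it first if new."""
    key = (collection, hash_fragment(fragment))
    fragments.setdefault(key, fragment)
    return key

def gc_fragments(referenced_keys):
    """Delete every stored fragment that no item references; return the count."""
    unused = [key for key in fragments if key not in referenced_keys]
    for key in unused:
        del fragments[key]
    return len(unused)

# Two items share the same fragment content (key order differs); a third does not.
items = {
    "item-1": get_or_create_fragment(
        "c1", {"license": "CC-BY-4.0", "platform": "sentinel-2a"}
    ),
    "item-2": get_or_create_fragment(
        "c1", {"platform": "sentinel-2a", "license": "CC-BY-4.0"}
    ),
    "item-3": get_or_create_fragment("c1", {"license": "proprietary"}),
}
stored_before_gc = len(fragments)

# Dropping item-3 leaves its fragment unreferenced, so GC can remove it.
del items["item-3"]
removed = gc_fragments(set(items.values()))
```

In this sketch the two identical fragments collapse into one stored entry, and the cleanup pass removes only the entry that no remaining item points at.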

Validation

  • scripts/runinpypgstac test --pgtap --basicsql
  • scripts/runinpypgstac test --pypgstac
  • scripts/runinpypgstac --no-cache --build test --migrations

Notes

  • merged latest origin/main into this branch after PR2 landed
  • keeping this as draft pending final review of the broader restructure sequence

bitner and others added 30 commits May 5, 2026 17:00
Co-authored-by: Pete Gadomski <pete.gadomski@gmail.com>
Co-authored-by: Pete Gadomski <pete.gadomski@gmail.com>
…ions

- Update pgstac-migrate pyproject.toml to require pgpkg>=0.1.1 (includes routine body-change detection)
- Regenerate migrations with pgpkg 0.1.1 which correctly includes search/search_query replacements
- Suppress unsafe DROP FUNCTION statements for routines that exist in target schema
- Fix PGTap test 116 to check column names in alphabetical order (migration adds columns at end)
- Update test plan count from 229 to 248 (tests added for GC, context_count, statslastupdated)
- Validate migration chain end-to-end with all tests passing
- All precommit hooks passing (migrations, pgtap, pypgstac)
- expand pgstac-migrate README with full CLI/API/env var docs and troubleshooting
- make psycopg[binary] mandatory in pgstac-migrate and pypgstac
- make psycopg-pool mandatory in pypgstac
- remove redundant psycopg optional/group wiring and update test script flags
- remove pgstac-migrate upper bound in pypgstac dependency
- update release workflow paths and uv setup/build step
- refresh docs/changelog references for pgpkg>=0.1.1
- regenerate uv lockfiles
…ash-and-dead-code-rerun

# Conflicts:
#	src/pgstac-migrate/pyproject.toml
#	src/pgstac-migrate/uv.lock
…d-code-rerun

# Conflicts:
#	.github/instructions/scripts.instructions.md
#	.gitignore
#	AGENTS.md
#	CLAUDE.md
#	src/pgstac/migrations/pgstac--0.9.11--unreleased.sql
bitner added 11 commits May 13, 2026 14:56
…ge columns and registry sampling

Phase 1+2+3+4 combined:
- Add item_fragments table (collection, hash, content) for fragment dedup
- Add item_field_registry table (collection, path, is_leaf, value_kinds) for field discovery
- Extend items table with split columns: bbox, links, assets, properties, extra, fragment_id FK
- Add 6 promoted float8 columns: eo_cloud_cover, eo_snow_cover, gsd, view_off_nadir, view_sun_azimuth, view_sun_elevation
- Update content_dehydrate() to populate split columns with dual-write to legacy content
- Update content_hydrate() to prefer split columns over legacy content when populated
- Add jsonb_field_rows() recursive walker (IMMUTABLE PARALLEL SAFE)
- Add update_field_registry_from_sample() for batch path registration
- Add update_field_registry_from_items() with BERNOULLI(5%)/LIMIT 1000 sampling
- Add refresh_field_registry() maintenance function for path aging
- Add extract_fragment(), pgstac_hash_fragment(), get_or_create_fragment() with dedup
- Add gc_fragments() maintenance function for unused fragment cleanup
- Extend staging trigger to assign fragment_id and queue registry updates via run_or_queue
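
The dual-write and "prefer split columns" behavior described in the list above can be sketched in Python as follows. This is a simplified model and not the PR's implementation: the row is a plain dict, the rule that unmapped top-level keys go to `extra` is an assumption, and the sketch ignores the promoted float8 columns, `fragment_id`, and any collection-level base item handling.

```python
# Column names taken from the commit message; everything else is hypothetical.
SPLIT_COLUMNS = ("bbox", "links", "assets", "properties")

def dehydrate(item):
    """Split an item into column-like fields and keep a legacy copy (dual-write)."""
    row = {col: item.get(col) for col in SPLIT_COLUMNS}
    # Assumption: every other top-level key lands in "extra".
    row["extra"] = {k: v for k, v in item.items() if k not in SPLIT_COLUMNS}
    row["content"] = dict(item)  # legacy dual-write
    return row

def hydrate(row):
    """Rebuild the item, preferring split columns and falling back to legacy content."""
    populated = any(
        row.get(col) is not None for col in SPLIT_COLUMNS + ("extra",)
    )
    if not populated:
        # Legacy row written before split storage existed.
        return dict(row["content"])
    item = dict(row.get("extra") or {})
    for col in SPLIT_COLUMNS:
        if row.get(col) is not None:
            item[col] = row[col]
    return item

item = {
    "id": "a",
    "type": "Feature",
    "bbox": [0, 0, 1, 1],
    "links": [],
    "assets": {"thumb": {"href": "t.png"}},
    "properties": {"gsd": 10},
}
round_trip = hydrate(dehydrate(item))

# A row that only carries legacy content still hydrates correctly.
legacy = hydrate({"content": {"id": "b", "properties": {}}})
```

The point of the dual-write is that both branches of `hydrate` return the same item while the legacy column is still being populated, which is what makes the fallback safe during a migration.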
…egistry functions

- Add item_fragments table for deduplicated fragment storage (collection, hash, content)
- Add item_field_registry table for tracking JSONB field paths per collection
- Add split columns to items table (bbox, links, assets, properties, extra,
  eo_cloud_cover, eo_snow_cover, gsd, view_off_nadir, view_sun_azimuth,
  view_sun_elevation, fragment_id)
- content_dehydrate: populates split columns + dual-write legacy content field
- content_hydrate: prefers split columns when fragment_id IS NOT NULL
- Batch fragment ops in staging trigger: one bulk INSERT plus one UPDATE JOIN per batch (a constant number of statements rather than one per row)
- extract_fragment: pure SQL function (IMMUTABLE PARALLEL SAFE)
- get_or_create_fragment: INSERT-first 2-query pattern
- gc_fragments: single set-based DELETE (no FOR LOOP)
- jsonb_field_rows: recursive JSONB path walker with max_depth guard
- update_field_registry_from_sample: batch UPSERT from caller-supplied array
- update_field_registry_from_items: BERNOULLI(5) sampling for large tables
- refresh_field_registry: expire stale paths older than retention_interval
- items_before_update_trigger: WHEN guard prevents re-hashing on non-content updates
- Fix staging trigger DELETE: use TG_TABLE_NAME instead of hard-coded items_staging
- Migration: fix items_fragment_id_fkey to not use NOT VALID (unsupported on partitioned tables)
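
As a rough illustration of the recursive path walker and the batch registry update listed above, here is a Python sketch. It is not the PR's SQL: the dotted path format, the default `max_depth`, the choice to report a depth-limited object as a leaf, and the registry's dict shape are all assumptions made for the example.

```python
def json_kind(value):
    """Name a value's JSON type, mirroring the labels jsonb_typeof returns."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"

def field_rows(doc, max_depth=8, _prefix="", _depth=0):
    """Yield (path, is_leaf, kind) for each key, descending at most max_depth levels."""
    for key, value in doc.items():
        path = f"{_prefix}.{key}" if _prefix else key
        # Only non-empty objects within the depth budget are walked further.
        descend = isinstance(value, dict) and bool(value) and _depth + 1 < max_depth
        yield path, not descend, json_kind(value)
        if descend:
            yield from field_rows(value, max_depth, path, _depth + 1)

def update_registry(registry, collection, sample):
    """Upsert every path seen in the sample, accumulating observed value kinds."""
    for doc in sample:
        for path, is_leaf, kind in field_rows(doc):
            entry = registry.setdefault(
                (collection, path), {"is_leaf": is_leaf, "value_kinds": set()}
            )
            entry["is_leaf"] = entry["is_leaf"] and is_leaf
            entry["value_kinds"].add(kind)
    return registry

sample = [
    {"properties": {"eo:cloud_cover": 12.5, "proj": {"epsg": 32633}}},
    {"properties": {"eo:cloud_cover": None}},
]
registry = update_registry({}, "c1", sample)
shallow_paths = [path for path, _, _ in field_rows({"a": {"b": {"c": 1}}}, max_depth=2)]
```

Because the walker stops at the depth limit instead of recursing, a pathological or deeply nested document cannot blow up the registry. The sampled update only has to see each path once for it to be registered.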
…istry-and-fragments

# Conflicts:
#	src/pgstac/migrations/pgstac--0.9.11--unreleased.sql
#	src/pgstac/migrations/pgstac--unreleased.sql
#	src/pgstac/pgstac.sql
#	src/pgstac/sql/003a_items.sql
#	src/pgstac/tests/pgtap/003_items.sql