Summary
Reorganize the server data model so the data record is a first-class citizen that maps to a group of columns defined by a user-defined, versioned Schema, building toward a project-level extraction table that every record maps into. The inherited Argilla model was both too general-purpose and too static: a dataset's "schema" is only an implicit bag of Field rows, records store extracted data as an unbound records.fields JSON blob, and the Pandera DataFrameSchema that actually describes the extraction shape lives only in the SDK/object storage, so the server never validates against it. LanceDB (with native schema evolution and vector/FTS indexes) replaces Elasticsearch.
Design and phased plan are committed under docs/superpowers/:
- Spec: docs/superpowers/specs/2026-06-27-schema-centric-data-model-design.md
- Phase 1 plan: docs/superpowers/plans/2026-06-27-schema-registry-and-versioning.md
Core model (decided)
- Postgres is source of truth for record cell-values; LanceDB is a derived index (replaces ES entirely).
- Schema ↔ Dataset is 1:1, collapsed into one primitive: the schema row is the extraction table (UI: "Dataset"; model: Schema).
- Schema body = Pandera DataFrameSchema in object storage, versioned via native version_id/etag; a thin DB registry row carries identity + version pointer + lineage + a denormalized columns_cache.
- A Record pins its own schema_version_id; reference (the document) is the cross-schema join key, so a document's full extraction = union of its records across schemas.
- Question is the per-cell review-config primitive, bound to schema column(s): 1 normally, N when type=table. Suggestion/Response layer on top.
- The Queue (persisted; reference-ordered, cross-schema, distribution/overlap) becomes the first-class UI surface; the Dataset recedes.
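
To make the relationships above concrete, here is a minimal sketch in plain Python, not the actual models/v2 code. Every class, field, and function name (SchemaRegistryRow, parent_version_id, full_extraction, the sample schema ids) is a hypothetical stand-in chosen for illustration; only schema_version_id, reference, and columns_cache come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class SchemaRegistryRow:
    """Thin DB registry row; the Pandera body itself lives in object storage."""
    id: str
    name: str
    current_version_id: str           # pointer to the object-store version_id
    parent_version_id: Optional[str]  # lineage (assumed representation)
    columns_cache: tuple[str, ...]    # denormalized column names


@dataclass
class Record:
    reference: str          # the document; cross-schema join key
    schema_id: str
    schema_version_id: str  # pinned by the record itself
    cells: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Question:
    name: str
    type: str
    columns: tuple[str, ...]

    def __post_init__(self) -> None:
        # One bound column normally; several only when type == "table".
        if not self.columns:
            raise ValueError("a question must bind at least one column")
        if self.type != "table" and len(self.columns) != 1:
            raise ValueError("non-table questions bind exactly one column")


def full_extraction(reference: str, records: list[Record]) -> dict[str, dict[str, Any]]:
    """Union of a document's records across schemas, keyed by schema_id."""
    merged: dict[str, dict[str, Any]] = {}
    for record in records:
        if record.reference == reference:
            merged.setdefault(record.schema_id, {}).update(record.cells)
    return merged


# Two schemas extract different column groups from the same document.
records = [
    Record("doc-1", "invoice", "v3", {"total": 120.0}),
    Record("doc-1", "parties", "v1", {"buyer": "ACME"}),
    Record("doc-2", "invoice", "v3", {"total": 80.0}),
]
extraction = full_extraction("doc-1", records)
```

In this sketch the join happens purely on reference, so a new schema contributes columns to a document's extraction without touching records that belong to other schemas.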
Delivery
Approach A — isolated models/v2 + contexts/v2 + api/v2 mounted at /api/v2, additive Alembic tables alongside v1. v1 back-compat is not a goal: v1 is kept only as a migration source, then deleted (server + frontend retirement ledger in spec §9).
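
As a rough illustration of what "additive tables alongside v1" could mean, the sketch below uses the standard-library sqlite3 module as a stand-in for Postgres and Alembic. The table and column names (records, v2_schemas, v2_records) and the upgrade_v2 helper are assumptions made for the example, not the actual migration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for an existing v1 table (name and shape are illustrative).
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, fields TEXT)")
conn.execute("INSERT INTO records (fields) VALUES ('{\"total\": 120.0}')")


def upgrade_v2(db: sqlite3.Connection) -> None:
    """Additive step: only CREATE statements, no ALTER or DROP on v1 tables."""
    db.execute(
        "CREATE TABLE v2_schemas ("
        " id TEXT PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " current_version_id TEXT NOT NULL)"
    )
    db.execute(
        "CREATE TABLE v2_records ("
        " id INTEGER PRIMARY KEY,"
        " reference TEXT NOT NULL,"
        " schema_id TEXT NOT NULL REFERENCES v2_schemas(id),"
        " schema_version_id TEXT NOT NULL)"
    )


upgrade_v2(conn)

# v1 data is untouched and the v2 tables now exist next to it.
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
v1_rows = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

Because the step never alters v1 tables, v1 can keep serving as a migration source until it is deleted.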
Phases
- … schema_version binding, validated bulk-upsert, reference grouping)
- … search_engine/ (ES/OpenSearch)
First target: server data model + API. SDK and frontend follow.