Schema-centric v2 data model: make the Record a first-class citizen mapped to a versioned Schema #218

Description

@JonnyTran

Summary

Reorganize the server data model so the data record is a first-class citizen that maps to a group of columns defined by a user-defined, versioned Schema. This builds toward a project-level extraction table that every record maps to. The inherited Argilla model is both too general-purpose and too static: a dataset's "schema" is only an implicit bag of Field rows, records store extracted data as an unbound records.fields JSON blob, and the Pandera DataFrameSchema that actually describes the extraction shape lives only in the SDK/object storage, so the server never validates against it. LanceDB (with native schema evolution + vector/FTS indexes) replaces Elasticsearch.

Design and phased plan are committed under docs/superpowers/:

  • Spec: docs/superpowers/specs/2026-06-27-schema-centric-data-model-design.md
  • Phase 1 plan: docs/superpowers/plans/2026-06-27-schema-registry-and-versioning.md

Core model (decided)

  • Postgres is source of truth for record cell-values; LanceDB is a derived index (replaces ES entirely).
  • Schema ↔ Dataset is 1:1, collapsed into one primitive — the schema row is the extraction table (UI: "Dataset"; model: Schema).
  • Schema body = Pandera DataFrameSchema in object storage, versioned via native version_id/etag; a thin DB registry row carries identity + version pointer + lineage + a denormalized columns_cache.
  • A Record pins its own schema_version_id; reference (the document) is the cross-schema join key — a document's full extraction = union of its records across schemas.
  • Question is the per-cell review-config primitive, bound to schema column(s): 1 normally, N when type=table. Suggestion/Response layer on top.
  • The Queue (persisted; reference-ordered, cross-schema, distribution/overlap) becomes the first-class UI surface; the Dataset recedes.
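
The relationships in the list above can be sketched with plain dataclasses. This is only an illustration under stated assumptions, not the committed v2 models: the class names, field names (`current_version_id`, `parent_version_id`, `values`, `kind`) and the `full_extraction` helper are hypothetical, and the real models would be SQLAlchemy tables with the schema body in object storage.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class SchemaRow:
    """Thin registry row: identity, version pointer, lineage, cached columns."""
    id: str
    name: str
    current_version_id: str          # assumed: object-storage version_id/etag of the body
    parent_version_id: str | None    # assumed: lineage pointer to the previous version
    columns_cache: tuple[str, ...]   # denormalized column names


@dataclass(frozen=True)
class Record:
    """A record pins the schema version it was extracted under."""
    reference: str                   # the document; cross-schema join key
    schema_id: str
    schema_version_id: str
    values: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Question:
    """Per-cell review config bound to one column, or N columns when kind == 'table'."""
    name: str
    kind: str
    columns: tuple[str, ...]

    def __post_init__(self) -> None:
        if not self.columns:
            raise ValueError("a question must bind at least one column")
        if self.kind != "table" and len(self.columns) != 1:
            raise ValueError("only table questions may bind several columns")


def full_extraction(reference: str, records: list[Record]) -> dict[str, list[dict[str, Any]]]:
    """Union of one document's records across schemas, grouped by schema id."""
    by_schema: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for rec in records:
        if rec.reference == reference:
            by_schema[rec.schema_id].append(rec.values)
    return dict(by_schema)
```

Grouping by `schema_id` keeps each schema's cell-values separate, so two schemas that happen to share a column name do not overwrite each other when a document's records are combined.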

Delivery

Approach A — isolated models/v2 + contexts/v2 + api/v2 mounted at /api/v2, additive Alembic tables alongside v1. v1 back-compat is not a goal: v1 is kept only as a migration source, then deleted (server + frontend retirement ledger in spec §9).

Phases

  • Phase 1 — Schema registry & versioning + Pandera validation (this effort)
  • Phase 2 — Records (schema_version binding, validated bulk-upsert, reference grouping)
  • Phase 3 — LanceDB index engine + sync + search; retire search_engine/ (ES/OpenSearch)
  • Phase 4 — Annotation v2 (question column-binding, suggestion, response)
  • Phase 5 — Persisted Queue + distribution
  • Phase 6 — v1→v2 migrator + server/frontend retirements
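
As a rough illustration of what "validated bulk-upsert" in Phase 2 could mean, the sketch below checks every row against the column set of the pinned schema version before writing anything. It is a simplification under several assumptions: the in-memory `store` dict stands in for Postgres, validation is reduced to a column-name check rather than a full Pandera validation, and the `(reference, version_id)` upsert key, `SchemaVersion`, and `bulk_upsert` are hypothetical names not taken from the spec.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SchemaVersion:
    """Hypothetical stand-in for one pinned schema version."""
    version_id: str
    columns: frozenset[str]


def bulk_upsert(
    store: dict[tuple[str, str], dict[str, Any]],
    version: SchemaVersion,
    rows: list[tuple[str, dict[str, Any]]],
) -> list[str]:
    """Validate all rows first, then upsert all-or-nothing.

    Returns a list of error messages; the store is modified only when
    that list is empty.
    """
    errors: list[str] = []
    for i, (reference, values) in enumerate(rows):
        unknown = set(values) - version.columns
        if unknown:
            errors.append(f"row {i} ({reference}): unknown columns {sorted(unknown)}")
    if errors:
        return errors
    for reference, values in rows:
        # Assumed upsert key: one record per document per schema version.
        store[(reference, version.version_id)] = dict(values)
    return []
```

Validating the whole batch before the first write means a single bad row rejects the batch instead of leaving a partial upsert behind; whether the real endpoint should be all-or-nothing or report per-row failures is a design choice the spec would need to settle.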

First target: server data model + API. SDK and frontend follow.
