Schema-centric v2 data model: make the Record a first-class citizen mapped to a versioned Schema

## Summary

Reorganize the server data model so the **data record is a first-class citizen** that maps to a group of columns defined by a **user-defined, versioned Schema**, building toward a **project-level extraction table that every record maps into**. The inherited Argilla model is both too general-purpose and too static: a dataset's "schema" is only an implicit bag of `Field` rows, records store extracted data as an unbound `records.fields` JSON blob, and the Pandera `DataFrameSchema` that actually describes the extraction shape lives only in the SDK/object storage, so the server never validates against it. LanceDB (with native schema evolution and vector/FTS indexes) replaces Elasticsearch.

Design and phased plan are committed under `docs/superpowers/`:
- Spec: `docs/superpowers/specs/2026-06-27-schema-centric-data-model-design.md`
- Phase 1 plan: `docs/superpowers/plans/2026-06-27-schema-registry-and-versioning.md`

## Core model (decided)

- **Postgres is source of truth** for record cell-values; **LanceDB is a derived index** (replaces ES entirely).
- **Schema ↔ Dataset is 1:1, collapsed into one primitive** — the `schema` row *is* the extraction table (UI: "Dataset"; model: `Schema`).
- Schema body = Pandera `DataFrameSchema` in **object storage**, versioned via native `version_id`/`etag`; a thin **DB registry** row carries identity + version pointer + lineage + a denormalized `columns_cache`.
- A **Record** pins its own `schema_version_id`; `reference` (the document) is the **cross-schema join key** — a document's full extraction = union of its records across schemas.
- **Question** is the per-cell review-config primitive, bound to schema column(s): 1 normally, N when `type=table`. Suggestion/Response layer on top.
- The **Queue** (persisted; reference-ordered, cross-schema, distribution/overlap) becomes the first-class UI surface; the Dataset recedes.

## Delivery

Approach A — isolated `models/v2` + `contexts/v2` + `api/v2` mounted at `/api/v2`, additive Alembic tables alongside v1. **v1 back-compat is not a goal**: v1 is kept only as a migration source, then deleted (server + frontend retirement ledger in spec §9).

## Phases

- [ ] **Phase 1 — Schema registry & versioning + Pandera validation** (this effort)
- [ ] Phase 2 — Records (`schema_version` binding, validated bulk-upsert, `reference` grouping)
- [ ] Phase 3 — LanceDB index engine + sync + search; retire `search_engine/` (ES/OpenSearch)
- [ ] Phase 4 — Annotation v2 (question column-binding, suggestion, response)
- [ ] Phase 5 — Persisted Queue + distribution
- [ ] Phase 6 — v1→v2 migrator + server/frontend retirements

First target: **server data model + API**. SDK and frontend follow.
