
schemas not replicated uniformly across cluster nodes #58

@fabracht

Description

Summary

Sibling of #57. The same broadcast plumbing that fails to deliver constraints uniformly also covers schemas: DB_SCHEMA is in the broadcast entity list at crates/mqdb-cluster/src/cluster/node_controller/replication_ops.rs:81-85 and :345-349. The 0.3.2 CHANGELOG already flagged "schema/constraint replication topology" as a known gap; this issue tracks the schema half explicitly so it isn't lost behind the constraint fix.

Symptoms

In the cluster-rebalance-stores E2E run (PR #56), an existing observation block already prints whether schema get posts and schema get comments succeed via node 4 after a rebalance-driven join. Across runs the result varies with whether schema_partition(entity) happens to land on node 4; when the partition isn't there, node 4 reaches the schema only by forwarding to whichever node holds it locally. Tracing on the constraint side (issue #57) directly showed db_constraints.list_all() returning different counts per node (node 1: 2, node 2: 0, node 3: 1, node 4: 0). Schemas use the identical write/replication path, so the same shape applies.
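
To make the partition-dependence concrete, here is a minimal self-contained sketch of why schema get succeeds or fails depending on placement. Both schema_partition (a stand-in hash, not mqdb's real function) and the partition-to-node assignment are assumptions for illustration only:

```rust
// Hypothetical sketch: whether node 4 can answer `schema get <entity>` locally
// depends entirely on where schema_partition(entity) landed after rebalancing.
// The hash and the assignment below are illustrative, not mqdb's real code.

fn schema_partition(entity: &str, num_partitions: u64) -> u64 {
    // Stand-in for mqdb's actual partition hash.
    entity
        .bytes()
        .fold(0u64, |h, b| h.wrapping_mul(31).wrapping_add(b as u64))
        % num_partitions
}

/// True if `node` holds the entity's schema partition locally under `assignment`
/// (a partition-index -> node-id map produced by a rebalance).
fn served_locally(entity: &str, node: u64, assignment: &[u64]) -> bool {
    let p = schema_partition(entity, assignment.len() as u64) as usize;
    assignment[p] == node
}

fn main() {
    // Arbitrary example: 8 partitions spread over nodes 1..=4 after a rebalance.
    let assignment: [u64; 8] = [1, 2, 3, 4, 1, 2, 3, 4];
    for entity in ["posts", "comments"] {
        println!(
            "{entity}: served locally by node 4 = {}",
            served_locally(entity, 4, &assignment)
        );
    }
}
```

Without explicit forwarding, a false result here is exactly the "schema get via node 4 fails" shape the observation block prints.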

Why it matters specifically for schemas

  • Write validation. SchemaStore::is_valid_for_write (crates/mqdb-cluster/src/cluster/db/schema_store.rs) is consulted on every per-entity write. A node that doesn't hold the schema locally either has to forward the validation check (extra hop) or skip it (silent acceptance of malformed records).
  • Schema state machine. Transitions through SchemaState::{Active, PendingAdd, PendingDrop, Dropped} are point-in-time decisions. If node A still sees Active while node B has advanced to PendingDrop, writes routed to one node can be accepted while the other rejects them — observable as flapping behavior on the same record.
  • mqdb schema list / schema get. Both queries return only what the receiving node knows locally unless explicit forwarding is wired in, so the visible schema set depends on which node the client connects to.
  • Snapshot delivery on join (PR #54, "add partition snapshot exports for schema/index/unique/fk/constraint stores"). A new node's partition snapshot only includes schemas where c.partition() == partition. Schemas whose natural partition didn't rebalance to the new node never arrive in its SchemaStore, same as constraints.
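
The state-machine bullet above can be sketched in a few lines. The variant names mirror SchemaState::{Active, PendingAdd, PendingDrop, Dropped} from the issue, but the acceptance rule is an assumption, not mqdb's actual validation logic:

```rust
// Hedged sketch of the flapping behavior: two nodes with divergent schema state
// give opposite answers for the same write. The accepts_write rule is assumed.

#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum SchemaState {
    Active,
    PendingAdd,
    PendingDrop,
    Dropped,
}

/// Assumed rule: writes are accepted only while the schema is Active.
fn accepts_write(state: SchemaState) -> bool {
    matches!(state, SchemaState::Active)
}

fn main() {
    // Node A missed the PendingDrop broadcast; node B applied it.
    let node_a = SchemaState::Active;
    let node_b = SchemaState::PendingDrop;
    // The same record flaps depending on which node the write routes to.
    println!("node A accepts: {}", accepts_write(node_a));
    println!("node B accepts: {}", accepts_write(node_b));
}
```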

Root cause (shared with #57)

Schemas are written via a ReplicationWrite with the entity's own schema_partition — so the broadcast loop at replication_ops.rs:350-369 should send to every alive node. Empirically (per the constraint trace in #57) those broadcasts don't reliably reach every node, and joining nodes only catch up via per-partition snapshots which are scoped to the partition being snapshotted. The two failure modes:

  1. Live broadcast misses. A node marked alive by heartbeat may still drop or fail to apply incoming WriteRequests during warmup, with no retry. handle_write_request logs and continues on apply_write failure (replication_ops.rs:95-97).
  2. Late-join replay. A new node receives partition snapshots only for partitions assigned to it. Schemas living in unassigned partitions are never delivered.
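
Failure mode 1 can be reduced to a small self-contained model. Node, apply_write, and the warmup flag are illustrative stand-ins; only the "log and continue on failure" shape mirrors handle_write_request at replication_ops.rs:95-97:

```rust
// Minimal model of a fire-and-forget broadcast: a node that errors during
// warmup simply misses the write, and nothing ever retries it.

use std::collections::HashMap;

struct Node {
    warming_up: bool,
    schemas: HashMap<String, String>,
}

impl Node {
    fn apply_write(&mut self, entity: &str, schema: &str) -> Result<(), &'static str> {
        if self.warming_up {
            return Err("not ready"); // write is lost with no retry
        }
        self.schemas.insert(entity.to_string(), schema.to_string());
        Ok(())
    }
}

fn broadcast(nodes: &mut [Node], entity: &str, schema: &str) {
    for (i, node) in nodes.iter_mut().enumerate() {
        if let Err(e) = node.apply_write(entity, schema) {
            // Equivalent of the replication_ops.rs:95-97 behavior: log, move on.
            eprintln!("node {}: apply_write failed: {e}", i + 1);
        }
    }
}

fn main() {
    let mut nodes = vec![
        Node { warming_up: false, schemas: HashMap::new() },
        Node { warming_up: true, schemas: HashMap::new() }, // joining node
    ];
    broadcast(&mut nodes, "posts", "{...}");
    // The warmed-up node has the schema; the joining node silently lacks it.
    for (i, n) in nodes.iter().enumerate() {
        println!("node {} has posts schema: {}", i + 1, n.schemas.contains_key("posts"));
    }
}
```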

Suggested directions (same options as #57)

  1. Promote schemas to true broadcast state — apply locally first, then dual-write to every alive node WITH retry/idempotency, identical to how topic_index / wildcards / client_locations are handled. Schemas are small and rarely change; the cost is negligible.
  2. Replicate via Raft. Schemas are cluster-management data, not per-partition workload. Routing through Raft (which already replicates the log to every node) gives uniform delivery for free, at the cost of coupling schema mutations to leader-elected availability.
  3. On node join, request a full schema/constraint catalog from the leader — independent of partition snapshots. A small explicit catch-up step alongside the snapshot stream, instead of relying on partition assignments to carry metadata.
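
Direction 1 could look roughly like the following, under stated assumptions: schema writes carry a monotonically increasing version so re-delivery is idempotent, and the sender retries each peer until the apply is acknowledged. All names here are hypothetical, not mqdb's API:

```rust
// Sketch of retry + idempotency for schema broadcast. Duplicate or stale
// versions are no-ops, so a bounded retry loop is safe to repeat.

use std::collections::HashMap;

#[derive(Default)]
struct SchemaStore {
    versions: HashMap<String, u64>, // entity -> highest applied version
}

impl SchemaStore {
    /// Idempotent apply: returns true only when the version advances.
    fn apply(&mut self, entity: &str, version: u64) -> bool {
        let v = self.versions.entry(entity.to_string()).or_insert(0);
        if version > *v {
            *v = version;
            true
        } else {
            false
        }
    }
}

/// Retry each peer up to `max_attempts`; `deliver` models a possibly flaky
/// send (peer index, attempt number) -> delivered?
fn broadcast_with_retry<F>(
    peers: &mut [SchemaStore],
    entity: &str,
    version: u64,
    max_attempts: u32,
    mut deliver: F,
) -> bool
where
    F: FnMut(usize, u32) -> bool,
{
    peers.iter_mut().enumerate().all(|(i, peer)| {
        (1..=max_attempts).any(|attempt| {
            if deliver(i, attempt) {
                peer.apply(entity, version); // safe even if re-delivered
                true
            } else {
                false
            }
        })
    })
}

fn main() {
    let mut peers = vec![SchemaStore::default(), SchemaStore::default()];
    // Peer 1 drops the first attempt (simulated warmup), succeeds on retry.
    let ok = broadcast_with_retry(&mut peers, "posts", 7, 3, |peer, attempt| {
        !(peer == 1 && attempt == 1)
    });
    println!("all peers converged: {ok}");
}
```

The same versioning would also make direction 3's catalog catch-up trivially safe to re-request.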

A combined design touching both schemas and constraints would close #57 and this together.

Repro

Same as #57: run examples/cluster-rebalance-stores/run.sh with an enterprise license. The "OBSERVATION (missing): schema get comments via node 4" output line in failing runs reflects this gap. To prove the local-store gap directly, add a tracing::info! in SchemaStore::apply_replicated printing the count after each apply and re-run.
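
A self-contained sketch of that probe is below. The real patch would put a tracing::info! inside SchemaStore::apply_replicated; here the store is a stub and eprintln! stands in so the sketch runs without the tracing crate:

```rust
// Illustrative stub of the suggested probe: log the local schema count after
// every replicated apply, so per-node divergence shows up directly in the logs.

use std::collections::HashMap;

struct SchemaStore {
    node_id: u64,
    schemas: HashMap<String, String>,
}

impl SchemaStore {
    fn apply_replicated(&mut self, entity: &str, schema: &str) -> usize {
        self.schemas.insert(entity.to_string(), schema.to_string());
        // In mqdb this would be something like:
        // tracing::info!(node = self.node_id, count = self.schemas.len(),
        //                "schema store after replicated apply");
        eprintln!(
            "node {} schema count after apply: {}",
            self.node_id,
            self.schemas.len()
        );
        self.schemas.len()
    }
}

fn main() {
    let mut store = SchemaStore { node_id: 4, schemas: HashMap::new() };
    store.apply_replicated("posts", "{...}");
    store.apply_replicated("comments", "{...}");
}
```

Diverging counts across nodes after the same sequence of applies would confirm the local-store gap without relying on the E2E observation block.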
