Summary
Sibling of #57. The same broadcast plumbing that fails to deliver constraints uniformly also covers schemas: DB_SCHEMA is in the broadcast entity list at crates/mqdb-cluster/src/cluster/node_controller/replication_ops.rs:81-85 and :345-349. The 0.3.2 CHANGELOG already flagged "schema/constraint replication topology" as a known gap; this issue tracks the schema half explicitly so it isn't lost behind the constraint fix.
Symptoms
In the cluster-rebalance-stores E2E run (PR #56), an existing observation block already prints whether schema get posts and schema get comments succeed via node 4 after a rebalance-driven join. Across runs the result varies depending on which schema_partition(entity) happens to land on node 4 — when the partition isn't there, node 4 reaches the schema only by forwarding to whichever node holds it locally. Tracing on the constraint side (issue #57) directly showed db_constraints.list_all() returning different counts per node (node 1: 2, node 2: 0, node 3: 1, node 4: 0). Schemas use the identical write/replication path, so the same shape applies.
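A toy sketch of that routing dependence, under the assumption that schema placement is hash-based; schema_partition, NUM_PARTITIONS, and the hash-mod scheme here are illustrative stand-ins, not mqdb's actual implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_PARTITIONS: u64 = 16; // hypothetical partition count

/// Hypothetical "home" partition for an entity's schema record.
fn schema_partition(entity: &str) -> u64 {
    let mut h = DefaultHasher::new();
    entity.hash(&mut h);
    h.finish() % NUM_PARTITIONS
}

fn main() {
    for entity in ["posts", "comments"] {
        let p = schema_partition(entity);
        // Whether node 4 can answer `schema get <entity>` locally depends
        // entirely on whether partition `p` was rebalanced onto node 4.
        println!("schema for {entity:?} lives in partition {p}");
    }
}
```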
Why it matters specifically for schemas
Write validation. SchemaStore::is_valid_for_write (crates/mqdb-cluster/src/cluster/db/schema_store.rs) is consulted on every per-entity write. A node that doesn't hold the schema locally either has to forward the validation check (extra hop) or skip it (silent acceptance of malformed records).
Schema state machine. Transitions through SchemaState::{Active, PendingAdd, PendingDrop, Dropped} are point-in-time decisions. If node A still sees Active while node B has advanced to PendingDrop, writes routed to one node can be accepted while the other rejects them — observable as flapping behavior on the same record (a minimal sketch follows this list).
mqdb schema list / schema get. Both queries return only what the receiving node knows locally unless explicit forwarding is wired in, so the visible schema set depends on which node the client connects to.
Snapshot delivery on join (PR #54, "add partition snapshot exports for schema/index/unique/fk/constraint stores"). A new node's partition snapshot only includes schemas where c.partition() == partition. Schemas whose natural partition didn't rebalance to the new node never arrive in its SchemaStore — same as constraints.
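The state-machine divergence in miniature; the SchemaState variants come from the issue text, while accepts_writes and the two-node comparison are illustrative assumptions, not mqdb's actual API:

```rust
// Only the variant names are taken from the real store.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum SchemaState {
    Active,
    PendingAdd,
    PendingDrop,
    Dropped,
}

/// Point-in-time write gate: only an Active schema accepts writes.
fn accepts_writes(state: SchemaState) -> bool {
    matches!(state, SchemaState::Active)
}

fn main() {
    // Node A never saw the drop broadcast; node B applied it.
    let node_a = SchemaState::Active;
    let node_b = SchemaState::PendingDrop;

    // The same record flaps between accepted and rejected depending on
    // which node the write is routed to.
    println!("node A accepts: {}", accepts_writes(node_a)); // true
    println!("node B accepts: {}", accepts_writes(node_b)); // false
}
```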
Root cause (shared with #57)
Schemas are written via a ReplicationWrite with the entity's own schema_partition — so the broadcast loop at replication_ops.rs:350-369 should send to every alive node. Empirically (per the constraint trace in #57) those broadcasts don't reliably reach every node, and joining nodes only catch up via per-partition snapshots, which are scoped to the partition being snapshotted. The two failure modes (the first is sketched after this list):
Live broadcast misses. A node marked alive by heartbeat may still drop or fail to apply incoming WriteRequests during warmup, with no retry. handle_write_request logs and continues on apply_write failure (replication_ops.rs:95-97).
Late-join replay. A new node receives partition snapshots only for partitions assigned to it. Schemas living in unassigned partitions are never delivered.
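The first failure mode in miniature; Node, warming_up, and broadcast_schema_write are illustrative stand-ins, and the only faithful part is the shape of the loop: log and continue on apply failure, no retry, matching the described behavior at replication_ops.rs:95-97.

```rust
struct Node {
    id: u64,
    alive: bool,      // what the heartbeat reports
    warming_up: bool, // heartbeat says alive, but applies still fail
}

impl Node {
    fn apply_write(&self, entity: &str) -> Result<(), String> {
        if self.warming_up {
            return Err(format!("node {} not ready for {entity}", self.id));
        }
        Ok(())
    }
}

fn broadcast_schema_write(nodes: &[Node], entity: &str) {
    for node in nodes.iter().filter(|n| n.alive) {
        if let Err(e) = node.apply_write(entity) {
            // Mirrors handle_write_request: log and continue, no retry,
            // so the warming-up node permanently misses this schema.
            eprintln!("apply_write failed: {e}");
        }
    }
}

fn main() {
    let nodes = vec![
        Node { id: 1, alive: true, warming_up: false },
        Node { id: 4, alive: true, warming_up: true }, // just joined
    ];
    broadcast_schema_write(&nodes, "posts");
}
```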
Suggested directions (same options as #57)
Promote schemas to true broadcast state — apply locally first, then dual-write to every alive node with retry/idempotency, identical to how topic_index / wildcards / client_locations are handled. Schemas are small and rarely change; the cost is negligible. (A retry/idempotency sketch follows this list.)
Replicate via Raft. Schemas are cluster-management data, not per-partition workload. Routing through Raft (which already replicates the log to every node) gives uniform delivery for free, at the cost of coupling schema mutations to leader-elected availability.
On node join, request a full schema/constraint catalog from the leader — independent of partition snapshots. This would be a small explicit catch-up step alongside the snapshot stream, instead of relying on partition assignments to carry metadata.
A combined design touching both schemas and constraints would close #57 and this together.
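For the first option, a sketch of what retry plus idempotency could look like. SchemaWrite, RemoteStore, and the version-as-idempotency-key scheme are hypothetical, not the existing topic_index plumbing:

```rust
use std::collections::HashMap;
use std::thread::sleep;
use std::time::Duration;

struct SchemaWrite {
    entity: String,
    version: u64, // monotonically increasing: the idempotency key
}

struct RemoteStore {
    applied: HashMap<String, u64>, // entity -> highest version applied
    failures_left: u32,            // simulates transient warmup errors
}

impl RemoteStore {
    fn apply(&mut self, w: &SchemaWrite) -> Result<(), String> {
        if self.failures_left > 0 {
            self.failures_left -= 1;
            return Err("still warming up".into());
        }
        // Idempotent apply: replaying a duplicate or stale version is a no-op.
        let v = self.applied.entry(w.entity.clone()).or_insert(0);
        if w.version > *v {
            *v = w.version;
        }
        Ok(())
    }
}

/// Retry each peer with backoff until the apply sticks; because applies
/// are idempotent, retrying after an ambiguous failure is always safe.
fn broadcast_with_retry(peers: &mut [RemoteStore], w: &SchemaWrite) {
    for peer in peers.iter_mut() {
        for attempt in 0u32..5 {
            match peer.apply(w) {
                Ok(()) => break,
                Err(e) => {
                    eprintln!("attempt {attempt} failed: {e}; retrying");
                    sleep(Duration::from_millis(50 << attempt));
                }
            }
        }
    }
}

fn main() {
    let mut peers = vec![
        RemoteStore { applied: HashMap::new(), failures_left: 0 },
        RemoteStore { applied: HashMap::new(), failures_left: 2 }, // warming up
    ];
    let w = SchemaWrite { entity: "posts".into(), version: 3 };
    broadcast_with_retry(&mut peers, &w);
    broadcast_with_retry(&mut peers, &w); // duplicate delivery is harmless
}
```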
Repro
Same as #57: run examples/cluster-rebalance-stores/run.sh with an enterprise license. The "OBSERVATION (missing): schema get comments via node 4" output line in failing runs reflects this gap. To prove the local-store gap directly, add a tracing::info! in SchemaStore::apply_replicated printing the count after each apply and re-run.
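A hypothetical shape for that probe; the SchemaStore body below is a minimal stand-in (and assumes a tracing subscriber is installed), with only the tracing::info! line being the suggested change:

```rust
use std::collections::HashMap;

struct Schema {
    entity: String,
}

struct SchemaStore {
    schemas: HashMap<String, Schema>,
}

impl SchemaStore {
    fn apply_replicated(&mut self, schema: Schema) {
        self.schemas.insert(schema.entity.clone(), schema);
        // Per-node counts should agree after every apply; counts that
        // diverge across nodes (as in the #57 constraint trace) prove
        // the local-store gap without needing the full E2E harness.
        tracing::info!(count = self.schemas.len(), "schema applied");
    }
}
```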
Related
PR #56 ("rebuild fk reverse index after partition snapshot import") — closes the FK reverse-index snapshot gap; that fix is correct in its own scope, but cascade-through-rebalanced-node behavior depends on the constraint half of this pair landing.
0.3.2 CHANGELOG — original mention of "schema/constraint replication topology" as a known gap.