
constraints not replicated uniformly across cluster nodes #57

@fabracht


Summary

Discovered while running `examples/cluster-rebalance-stores/run.sh` on PR #56. Constraints (both `unique` and `foreign_key`) end up in the local `ConstraintStore` of only a subset of cluster nodes. FK enforcement is therefore non-uniform: cascades, RESTRICT, and orphan rejection work fully only on whichever nodes happen to hold the constraint locally.

Evidence

A 3-node cluster with a 4th node joining via rebalance, a unique constraint on `posts.title`, and an FK constraint `comments.post_id → posts.id` registered through node 1. Tracing in `StoreManager::rebuild_fk_indexes_after_import` (added during the PR #56 investigation, since removed) showed `db_constraints.list_all()` returning, per node (a reconstruction of the trace is sketched after the table):

| Node | Total constraints | FK constraints |
| --- | --- | --- |
| 1 (leader, where constraint was registered) | 2 | 1 |
| 2 | 0 | 0 |
| 3 | 1 (FK only, missing unique) | 1 |
| 4 (rebalance target) | 0 | 0 |
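
A minimal reconstruction of what that since-removed trace could have looked like. Only `list_all()` comes from the issue; `ConstraintKind`, `trace_local_constraints`, and the surrounding types are illustrative stand-ins, not the crate's real API:

```rust
#[derive(Clone, Copy, PartialEq)]
enum ConstraintKind {
    Unique,
    ForeignKey,
}

struct ConstraintStore {
    constraints: Vec<ConstraintKind>,
}

impl ConstraintStore {
    fn list_all(&self) -> &[ConstraintKind] {
        &self.constraints
    }
}

fn trace_local_constraints(node_id: u64, store: &ConstraintStore) {
    let all = store.list_all();
    let fks = all.iter().filter(|c| **c == ConstraintKind::ForeignKey).count();
    eprintln!("node {node_id}: total={} fk={}", all.len(), fks);
}

fn main() {
    // Per-node stores matching the counts in the table above.
    let nodes = [
        (1, vec![ConstraintKind::Unique, ConstraintKind::ForeignKey]),
        (2, vec![]),
        (3, vec![ConstraintKind::ForeignKey]),
        (4, vec![]),
    ];
    for (id, constraints) in nodes {
        trace_local_constraints(id, &ConstraintStore { constraints });
    }
}
```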

E2E observation in the same run: cascade-via-node-4 leaves up to 12 of 12 eligible children alive in runs where `schema_partition("comments")` doesn't land on node 4. Children whose data partition is on node 4 are missed because no node knows to scatter the FK reverse-lookup to node 4.

Root cause

`StoreManager::constraint_add_replicated` writes to `PartitionId::ZERO` and entity `DB_CONSTRAINT`. The async replication pipeline only delivers that write to nodes that are replicas of the constraint's natural partition (`schema_partition(entity)` for entity-scoped constraints, or `PartitionId::ZERO` for the legacy path). Constraints are reachable from any node via forwarding but are absent from every non-replica's local `ConstraintStore`.
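
A minimal sketch of that topology gap under a simplified placement model. Only `PartitionId::ZERO` and `DB_CONSTRAINT` come from the code above; `replicas_of`, `Write`, and the node assignments are hypothetical stand-ins for the real pipeline:

```rust
type NodeId = u64;

#[derive(Clone, Copy, PartialEq)]
struct PartitionId(u32);

impl PartitionId {
    const ZERO: PartitionId = PartitionId(0);
}

struct Write {
    partition: PartitionId,
    entity: &'static str,
}

// Stand-in for replica placement: partition 0 happens to live on nodes 1 and 3.
fn replicas_of(p: PartitionId) -> Vec<NodeId> {
    match p.0 {
        0 => vec![1, 3],
        _ => vec![2, 4],
    }
}

fn main() {
    // constraint_add_replicated targets the constraint's natural partition,
    // so the async pipeline fans the write out to that partition's replicas
    // and to nobody else.
    let w = Write { partition: PartitionId::ZERO, entity: "DB_CONSTRAINT" };
    println!("{} write delivered to nodes {:?}", w.entity, replicas_of(w.partition));
    // Nodes 2 and 4 never receive it: their local ConstraintStore stays
    // empty even though the constraint is reachable by forwarding.
}
```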

Code references:

  • Constraint write construction: `crates/mqdb-cluster/src/cluster/store_manager/constraint_ops.rs:13-37`
  • Constraint apply path: `crates/mqdb-cluster/src/cluster/store_manager/apply.rs:208-221`
  • Constraint partitioning: `crates/mqdb-cluster/src/cluster/db/constraint_store.rs:184-186` (entity-scoped, `schema_partition(entity_str())`)
  • Snapshot delivery (PR #54, "add partition snapshot exports for schema/index/unique/fk/constraint stores"): `crates/mqdb-cluster/src/cluster/db/constraint_store.rs::export_for_partition` only includes constraints where `c.partition() == partition`, so constraints don't ride along on unrelated partition snapshots either.
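
The shape of that snapshot filter, as a self-contained sketch. Only the `c.partition() == partition` predicate is taken from the reference above; the `Constraint` type and names here are simplified stand-ins:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct PartitionId(u32);

#[derive(Clone, Debug)]
struct Constraint {
    name: &'static str,
    partition: PartitionId,
}

// Mirrors the predicate in export_for_partition: a constraint is exported
// only when the snapshot is for its own natural partition.
fn export_for_partition(all: &[Constraint], partition: PartitionId) -> Vec<Constraint> {
    all.iter().filter(|c| c.partition == partition).cloned().collect()
}

fn main() {
    let all = [
        Constraint { name: "unique:posts.title", partition: PartitionId(7) },
        Constraint { name: "fk:comments.post_id", partition: PartitionId(3) },
    ];
    // A snapshot for an unrelated data partition carries no constraints,
    // so rebalance imports can't close the gap either.
    assert!(export_for_partition(&all, PartitionId(11)).is_empty());
    println!("nothing exported for partition 11");
}
```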

This is the same shape as the schema replication topology gap first noted in the 0.3.2 CHANGELOG entry.

Why it matters

Cascade and RESTRICT enforcement is non-uniform across the cluster:

  • A delete processed on a node without the FK constraint locally cannot scatter the reverse-lookup correctly, so children on other nodes' primaries are missed.
  • An FK orphan check on a non-constraint-holding node may need a forward, increasing latency and creating a window for inconsistent state.
  • Rebalance compounds the issue: a node freshly promoted to primary for some data partitions may not hold the FK constraints relevant to that data, so the constraint-aware paths (`update_fk_reverse_index`, ON DELETE handling) silently no-op for writes hitting that node (see the sketch after this list).
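
A minimal illustration of that silent no-op, under assumed names: `fk_constraints_for` and `on_delete` are hypothetical, standing in for whatever lookup the real `update_fk_reverse_index` / ON DELETE paths perform.

```rust
use std::collections::HashMap;

// Simplified local store: entity name -> FK constraints held locally.
struct ConstraintStore {
    by_entity: HashMap<&'static str, Vec<&'static str>>,
}

impl ConstraintStore {
    fn fk_constraints_for(&self, entity: &str) -> &[&'static str] {
        self.by_entity.get(entity).map(Vec::as_slice).unwrap_or(&[])
    }
}

fn on_delete(store: &ConstraintStore, entity: &str) {
    // On a node whose local store never received the FK constraint,
    // this loop body never runs: no cascade scatter, no RESTRICT check,
    // and no error either.
    for fk in store.fk_constraints_for(entity) {
        println!("would scatter reverse-lookup for {fk}");
    }
}

fn main() {
    // e.g. node 4 after rebalance: primary for data, empty constraint store.
    let empty = ConstraintStore { by_entity: HashMap::new() };
    on_delete(&empty, "posts"); // prints nothing
}
```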

PR #56 closed the `FkReverseIndex` snapshot gap (the rebuild step is correct for whatever constraints the importing node holds locally), but the broader cascade-through-rebalanced-node behavior won't be fully correct until constraints are reliably present on every node.

Suggested directions

Two options worth weighing:

  1. Cluster-wide broadcast state. Treat `db_constraints` the way `topic_index`, `wildcards`, and `client_locations` are treated: apply locally first, then forward to all alive nodes. Constraints are small and rarely change, so the overhead is negligible. This would make `db_constraints.list_all()` authoritative on every node (sketched after this list).

  2. Replicate via Raft instead of the async pipeline. Constraints are cluster-management state, not per-partition data; routing them through Raft (which already replicates to every node) would deliver them uniformly without new broadcast machinery. The downside is coupling constraint adds to leader-elected Raft availability (also sketched below).
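
A minimal sketch of option 1 under assumed names; the `Transport` trait and `alive_peers`/`forward` are hypothetical, standing in for however the existing broadcast stores propagate:

```rust
#[derive(Clone, Debug)]
struct ConstraintWrite {
    name: String,
}

// Hypothetical transport abstraction over the cluster's node-to-node links.
trait Transport {
    fn alive_peers(&self) -> Vec<u64>;
    fn forward(&self, node: u64, w: &ConstraintWrite);
}

fn constraint_add_broadcast<T: Transport>(
    local: &mut Vec<ConstraintWrite>,
    transport: &T,
    w: ConstraintWrite,
) {
    // Apply locally first so the originating node is immediately correct,
    // then fan out to every alive node, not just natural-partition replicas.
    local.push(w.clone());
    for peer in transport.alive_peers() {
        transport.forward(peer, &w);
    }
}

struct LoggingTransport;

impl Transport for LoggingTransport {
    fn alive_peers(&self) -> Vec<u64> {
        vec![2, 3, 4]
    }
    fn forward(&self, node: u64, w: &ConstraintWrite) {
        println!("forward {} to node {node}", w.name);
    }
}

fn main() {
    let mut local = Vec::new();
    let w = ConstraintWrite { name: "fk:comments.post_id".into() };
    constraint_add_broadcast(&mut local, &LoggingTransport, w);
}
```

And a sketch of option 2 under the same caveat: `ClusterCommand` and the `RaftHandle` trait are illustrative, not the crate's real API.

```rust
#[derive(Clone, Debug)]
enum ClusterCommand {
    ConstraintAdd { name: String },
}

// Hypothetical handle onto an existing Raft group.
trait RaftHandle {
    // Resolves once the entry is committed on a quorum; every node then
    // applies it through its state machine, so delivery is uniform.
    fn propose(&self, cmd: ClusterCommand) -> Result<(), String>;
}

fn constraint_add_via_raft<R: RaftHandle>(raft: &R, name: &str) -> Result<(), String> {
    // The downside named above: with no elected leader this fails
    // instead of applying locally.
    raft.propose(ClusterCommand::ConstraintAdd { name: name.to_string() })
}

struct AlwaysCommits;

impl RaftHandle for AlwaysCommits {
    fn propose(&self, cmd: ClusterCommand) -> Result<(), String> {
        println!("committed {cmd:?}");
        Ok(())
    }
}

fn main() {
    constraint_add_via_raft(&AlwaysCommits, "fk:comments.post_id").unwrap();
}
```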

Whichever option is chosen, schemas need the same fix for the same reason.

Repro

```sh
# generate a license (private key in laboverwire/mqdb-license-gen):
mqdb-license-gen generate --private-key /path/to/private.pem \
    --customer test@local --tier enterprise --features cluster --duration 1d > /tmp/lic.key

export MQDB_LICENSE_FILE=/tmp/lic.key
./examples/cluster-rebalance-stores/run.sh
```

The run produces an OBSERVATION line of the form `FK cascade via node 4: N of M eligible children removed`, with M usually less than 21; the missing children are those whose data-partition primaries don't hold the FK constraint locally. Re-running yields different counts depending on hash distribution.
