feat(validator): direct-DB writes for crown_holders + live current crown #390

Merged
entrius merged 17 commits into test from feat/validator-direct-db-storage on May 28, 2026

anderdc commented May 27, 2026

Summary

The validator owns its dashboard-write surface directly instead of being mirrored by the alw-utils sync-validator-state daemon. Originally two changes; now also includes the cadence fix that makes the direct path actually produce gap-free coverage, plus a merge of test.

1. Round-end direct write

Replaces the alw-utils daemon's port-and-mirror pattern with direct writes from the validator's scoring path:

  • New allways/validator/storage/ module (psycopg v3) mirroring the gittensor layout (database/queries/repository/storage).
  • replay_crown_time_window gains an optional intervals_out capture list — exposes the per-interval crown data it was already computing internally; no change to return value or scoring math.
  • calculate_miner_rewards expands captured intervals to per-block crown_holders rows, collects rate events for the window, and flushes everything plus sync_cursor advancement in one transaction at end-of-round.
  • Eliminates the validator-vs-daemon drift class.
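The interval-to-row expansion described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `CrownInterval` shape, the row dict keys, and the equal `1/k` split among tied holders are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CrownInterval:
    """Hypothetical shape of one captured interval covering blocks [start_block, end_block)."""
    direction: str
    start_block: int
    end_block: int
    holders: Tuple[str, ...]  # hotkeys tied for the crown over the whole interval


def expand_to_block_rows(intervals: List[CrownInterval]) -> List[Dict[str, object]]:
    """Expand each interval into one row per block per holder."""
    rows: List[Dict[str, object]] = []
    for interval in intervals:
        if not interval.holders:
            continue  # nobody held the crown, so these blocks produce no rows
        credit = 1.0 / len(interval.holders)  # assumed: ties split credit equally
        for block in range(interval.start_block, interval.end_block):
            for hotkey in interval.holders:
                rows.append(
                    {
                        "direction": interval.direction,
                        "block": block,
                        "hotkey": hotkey,
                        "credit": credit,
                    }
                )
    return rows
```

A three-block interval with one holder followed by a two-block interval with two tied holders yields 3 + 4 = 7 rows, which is the shape a single end-of-round transaction would then flush.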

2. Per-forward live current-crown write

The dashboard refreshes the current-crown header within one forward pass instead of waiting for the next round flush:

  • snapshot_current_crown_holders(self) reconstructs rates/busy/active at self.block and evaluates crown_holders_at_instant once per direction. No event-stream walk.
  • DatabaseStorage.upsert_current_crown_snapshot atomically replaces each direction's rows (delete then insert), preserving k-way tie semantics.
  • Called at the tail of forward(), after deadline-sensitive RPC, wrapped in try/except so DB outages can't propagate.
  • Now applies the reservation-pin overlay and the executability filter from #395 (Filter unexecutable rates from crown eligibility), so the live table credits exactly the same holder as the scoring/ledger path and no longer shows an out-of-bounds-rate or pre-pin holder that the ledger drops. Bounds come from the 300-block-TTL bounds_cache, so it stays a zero-RPC, sub-ms per-step write.
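The delete-then-insert replacement with k-way tie rows can be illustrated with a small sketch. To keep it runnable without a Postgres instance, it uses the standard-library `sqlite3` module as a stand-in for psycopg v3; the column set (`direction`, `hotkey`, `credit`) and the function names are assumptions, not the real schema or API.

```python
import sqlite3
from typing import List


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory table that loosely mimics current_crown_holders."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE current_crown_holders ("
        " direction TEXT NOT NULL,"
        " hotkey TEXT NOT NULL,"
        " credit REAL NOT NULL,"
        " PRIMARY KEY (direction, hotkey))"
    )
    return conn


def replace_direction_rows(conn: sqlite3.Connection, direction: str, holders: List[str]) -> None:
    """Replace one direction's rows in a single transaction.

    A k-way tie writes k rows with credit = 1/k; an empty holder list
    simply clears the direction.
    """
    credit = 1.0 / len(holders) if holders else 0.0
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "DELETE FROM current_crown_holders WHERE direction = ?", (direction,)
        )
        conn.executemany(
            "INSERT INTO current_crown_holders (direction, hotkey, credit)"
            " VALUES (?, ?, ?)",
            [(direction, hotkey, credit) for hotkey in holders],
        )
```

Because both statements run inside one transaction, a reader never observes a direction with the old rows deleted and the new rows not yet written.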

3. Block-based scoring cadence + gap-free window tiling ⚠️ behavior change

This is the fix that makes the direct-write path correct, and it changes reward distribution — review accordingly.

The scoring round fired on a forward-step count (step % SCORING_WINDOW_BLOCKS) and wrote only the trailing SCORING_WINDOW_BLOCKS window. But a forward pass spans several blocks (~5 on mainnet, more on testnet under RPC limits). At ~5 blocks/step a 600-block window fired only every ~3,000 blocks, so:

  • the per-block crown_holders ledger was ~20% covered (a regression vs the continuous alw-utils indexer this PR replaces — the dashboard grid would look ~80% empty on prod), and
  • reward scoring sampled ~2h out of every ~10h instead of running continuously, and the intended "~2h cadence" was really ~10h.
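The numbers above follow from simple arithmetic, reproduced here as a short sketch. The function name is made up for illustration; the 12-second block time is taken from the "600 blocks ≈ 2h" figure used elsewhere in this PR.

```python
BLOCK_SECONDS = 12  # 600 blocks is roughly 2h, so about 12s per block


def step_gated_coverage(window_blocks: int, blocks_per_step: int) -> tuple:
    """Model the old gate, which fired every `window_blocks` forward steps.

    Returns (blocks between rounds, fraction of blocks covered by a
    trailing window of `window_blocks`).
    """
    blocks_between_rounds = window_blocks * blocks_per_step
    coverage = window_blocks / blocks_between_rounds
    return blocks_between_rounds, coverage


def blocks_to_hours(blocks: int) -> float:
    """Convert a block count to wall-clock hours at BLOCK_SECONDS per block."""
    return blocks * BLOCK_SECONDS / 3600


if __name__ == "__main__":
    interval, coverage = step_gated_coverage(window_blocks=600, blocks_per_step=5)
    print(interval, coverage)          # blocks between rounds, covered fraction
    print(blocks_to_hours(600))        # length of one scored window, in hours
    print(blocks_to_hours(interval))   # actual time between rounds, in hours
```

With a 600-block window and ~5 blocks per step, rounds land 3,000 blocks apart, so only 600 of every 3,000 blocks (20%) are scored: a 2h sample every 10h.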

Fix (root cause = step-vs-block unit mismatch):

  • due_for_scoring() — block-based gate: fire once on a fresh process, then every SCORING_WINDOW_BLOCKS blocks, independent of forward-pass length.
  • scoring_window_bounds() — anchor window_start to the last-scored block so consecutive rounds tile with no per-block gap; capped at MAX_SCORING_BACKFILL_BLOCKS so a post-outage catch-up round can't replay an unbounded span.
  • Validator seeds last_scored_block one window back (preserving the "one trailing window on fresh start" behavior); score_and_reward_miners advances it only after a round completes, so a failed round retries its window rather than skipping it.
  • Window shortened 2h → 1h (SCORING_WINDOW_BLOCKS 600 → 300). Halves miner feedback lag — scores and the per-block ledger refresh every ~1h. The window is the only smoothing (SCORING_EMA_ALPHA = 1.0); 1h is near-identical to 2h for steady miners, just tracks recent activity tighter. Every lookback (prune cutoffs, event-watcher backfill, the 2× cap) follows the constant.

Both helpers are pure and unit-tested (tiling, overshoot, backfill cap, fresh-seed). The replay/flush/storage machinery is unchanged — it was already parameterized on window_start/window_end.
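A rough sketch of how the two helpers could behave, based only on the description above. The signatures are assumptions and may differ from the real ones; the constants use the values stated in this PR (a 300-block window and a 2× backfill cap).

```python
from typing import Tuple

SCORING_WINDOW_BLOCKS = 300
MAX_SCORING_BACKFILL_BLOCKS = 2 * SCORING_WINDOW_BLOCKS


def seed_last_scored_block(current_block: int) -> int:
    """Seed one window back so a fresh process scores one trailing window at once."""
    return current_block - SCORING_WINDOW_BLOCKS


def due_for_scoring(current_block: int, last_scored_block: int) -> bool:
    """Block-based gate: true once a full window of blocks has elapsed."""
    return current_block - last_scored_block >= SCORING_WINDOW_BLOCKS


def scoring_window_bounds(current_block: int, last_scored_block: int) -> Tuple[int, int]:
    """Return the half-open window [window_start, window_end) for this round.

    window_start is anchored to the last-scored block so rounds tile with no
    gap, but it never reaches further back than the backfill cap.
    """
    window_end = current_block
    window_start = max(last_scored_block, window_end - MAX_SCORING_BACKFILL_BLOCKS)
    return window_start, window_end
```

In this sketch the caller sets `last_scored_block = window_end` only after a round completes, so a failed round sees the same `window_start` on the next attempt and retries its window.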

Consensus note: this shifts reward scoring from a sparse ~20% sample to full contiguous coverage every ~1h. Emissions distribution will move accordingly. Intended and more correct (crown time held between the old sparse samples was previously uncredited), but it is a live-network behavior change.

Housekeeping

Both DB writes remain gated by STORE_DB_RESULTS — dark by default; disabled validators run identically to before.

Companion PRs:

  • schema: entrius/allways-db#21: adds the current_crown_holders table
  • dashboard reader: entrius/das-allways#28: getCurrent() reads the new table with fallback to legacy path

Rollout order

  1. Apply allways-db#21 migration
  2. Deploy das-allways#28 (falls back to legacy path while the new table is empty)
  3. Deploy this PR
  4. Set STORE_DB_RESULTS=1 on a validator — verify rows in crown_holders, rate_history, current_crown_holders, and that crown_holders coverage is contiguous (no gaps between scoring rounds)
  5. After a roll-out window, stop the alw-utils/sync-validator-state daemon

Test plan

  • 616 tests pass (7 new: block-based gate + gap-free tiling + backfill cap)
  • pre-commit (ruff lint + format) clean
  • Staging validator with STORE_DB_RESULTS=1: current_crown_holders has 1+ rows/direction within one forward step
  • Verify crown_holders tiles gap-free across consecutive scoring rounds (round N window_end == round N+1 window_start)
  • Confirm emissions move as expected under continuous (vs sampled) scoring before wide rollout
  • Force a DB outage mid-forward — verify forward continues and votes still land

anderdc and others added 3 commits May 26, 2026 16:32

Replaces the alw-utils/sync-validator-state daemon's port-and-mirror
pattern with direct writes from the validator's scoring path. The
daemon was a hand-maintained Python copy of replay_crown_time_window
that ran on a 12s poll and re-derived crown winners from raw events;
two copies of the same algorithm are now collapsed to one.

Mechanics:
- New `allways/validator/storage/` module mirrors the gittensor
  storage layout (database.py / queries.py / repository.py /
  storage.py). Uses psycopg v3 to match gittensor.
- `replay_crown_time_window` gains an optional `intervals_out`
  capture list — the per-interval data the function was already
  computing internally, exposed as a side-channel without changing
  its return value or the scoring math.
- `calculate_miner_rewards` expands captured intervals to per-block
  crown_holders rows, collects rate events for the window from
  state_store, and flushes both plus sync_cursor advancement in a
  single transaction at the end of the round.
- `DatabaseStorage` is instantiated on the Validator alongside
  state_store; closed on validator exit.

Default-disabled: STORE_DB_RESULTS env var gates everything. When
unset, `is_enabled()` returns False, `intervals_out` stays None,
the tee is a no-op, and no DB connection is attempted. Validators
that don't write to the dashboard DB run identically to before.
The `allways_network` block in docker-compose.vali.yml is also
commented out by default, mirroring gittensor.

All 563 existing tests pass; the scoring test fixture's
SimpleNamespace gained a disabled DatabaseStorage stub.

The round-end flush (every 600 blocks ≈ 2h) is the right cadence for
the per-block scoring ledger in crown_holders, but it makes the
dashboard's "who holds the crown right now" widget stale for hours
at a time. Users have complained that crown holders aren't really
"known" between rounds.

New live writer:
- snapshot_current_crown_holders(self) reconstructs rates/busy/active
  at self.block and evaluates crown_holders_at_instant once per
  direction. No event-stream walk — sub-millisecond per call.
- DatabaseStorage.upsert_current_crown_snapshot replaces each
  direction's rows in current_crown_holders atomically (delete then
  insert), preserving tie semantics: a k-way tie writes k rows with
  credit=1/k. Readers that want a single winner pick one (e.g. first
  by hotkey).
- Called at the tail of forward(), after vote_initiate / finalize /
  timeout-extension RPC so DB latency can't push deadline-sensitive
  votes past their block windows. Wrapped in try/except so any DB
  hiccup is logged and forward continues.
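The "log and continue" guard around the tail-of-forward write might look roughly like this. It is a generic sketch: `run_best_effort` and the logger name are hypothetical, not identifiers from this PR.

```python
import logging
from typing import Callable

logger = logging.getLogger("validator.forward")  # hypothetical logger name


def run_best_effort(write: Callable[[], None], what: str) -> bool:
    """Run a non-critical write; log and swallow any failure.

    Returns True if the write succeeded and False if it raised, so the
    caller can carry on with the forward pass either way.
    """
    try:
        write()
        return True
    except Exception:
        # logger.exception records the traceback alongside the message
        logger.exception("%s failed; continuing forward pass", what)
        return False
```

Placing the call after the deadline-sensitive RPC work means a slow or failing write can cost time only after the votes have already been submitted.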

Same STORE_DB_RESULTS gate as the round-end flush; disabled validators
pay no cost (is_enabled() short-circuits everything).

Requires the current_crown_holders table — see corresponding
allways-db migration.

All 563 existing tests pass; no test changes needed since the new
path is gated by the existing MagicMock stub returning is_enabled=False.

anderdc and others added 10 commits May 27, 2026 16:59

Applies four fixes flagged by an independent audit of the validator
direct-DB-storage rollout (replacing alw-utils/sync-validator-state):

1. DB ceilings + clearer startup logging. New connect_timeout=2s and
   server-side statement_timeout=2000ms cap every individual write so
   a slow/down Postgres can never stall the forward loop. Single
   connect attempt at validator init — no retry, no backoff. The
   DatabaseStorage constructor now logs the gating decision and the
   connection result explicitly:
     - "STORE_DB_RESULTS not set — validator DB storage disabled"
     - "STORE_DB_RESULTS=1 — connecting to Postgres for dashboard writes"
     - "Validator DB storage enabled"  (on success)
     - "STORE_DB_RESULTS=1 but Postgres connection failed — dashboard
        writes disabled for this process" (on failure)

2. Halt-path DB writes. The daemon explicitly cleared crown_holders
   for the halted window and advanced sync_cursor. Without this, the
   dashboard kept showing pre-halt holders while the validator had
   recycled the pool. Adds DatabaseStorage.flush_halt_window and
   wires it into score_and_reward_miners' halted branch.

3. Live current-crown halt-awareness. snapshot_current_crown_holders
   now accepts halted=True and returns empty rows for every direction;
   the per-forward writer in forward.py checks contract_is_halted
   (renamed from _contract_is_halted to make it module-public) and
   passes it through. During a halt the live table is cleared so the
   UI matches the recycle semantics.

4. Off-by-one sync_cursor. flush_scoring_window writes range
   [window_start, window_end) exclusive of window_end, so the last
   block actually flushed is window_end - 1. The cursor was claiming
   window_end — readers were lied to about freshness by one block.
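The off-by-one in fix 4 comes down to half-open range arithmetic, shown here as a tiny sketch (the function names are illustrative only):

```python
def blocks_flushed(window_start: int, window_end: int) -> range:
    """Blocks written by a flush of the half-open window [window_start, window_end)."""
    return range(window_start, window_end)


def sync_cursor_after_flush(window_start: int, window_end: int) -> int:
    """The cursor should name the last block actually written, window_end - 1."""
    if window_end <= window_start:
        raise ValueError("empty window: nothing was flushed")
    return window_end - 1
```

For the window [100, 103) the flushed blocks are 100, 101 and 102, so the cursor should read 102; reporting 103 would claim a block that has not been written yet.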

Skipping the multi-validator cursor GREATEST fix from the audit per
operator: only a single validator writes the dashboard DB in practice.

All 563 existing tests pass — no test changes needed since the new
paths are gated by the existing MagicMock stub returning is_enabled=False.

Two perf reductions on the forward-step cost introduced by the
direct-DB-storage rollout, both targeting calls that fire every ~12s
under STORE_DB_RESULTS=1.

1. Halt RPC cache. contract_is_halted now delegates to
   bounds_cache.halted() with a 5-block (~60s) TTL. The per-forward
   live-crown snapshot was hitting the substrate RPC every step;
   halt only flips via admin tx so a short cache cuts ~50-200ms
   off most forward steps. RPC failures fall back to the last
   cached value (or False if none) so transient flakes don't
   churn the live-crown writer.

2. Batched rate reconstruction. state_store gains
   get_latest_rates_before(from_chain, to_chain, block) — one
   query per direction with ROW_NUMBER() OVER (PARTITION BY hotkey
   ORDER BY block DESC, id DESC) instead of N point lookups.
   reconstruct_window_start_state filters the dict by membership
   in rewardable_hotkeys in Python. Same tie-break semantics as
   the single-row form. Saves ~10-30ms on every snapshot, more
   as miner count grows.

   New SQL lives in allways/validator/state_store_queries.py — first
   step toward migrating state_store SQL out of inline method bodies
   per the project's "SQL goes in queries.py" convention.
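The batched "latest rate per hotkey" query from item 2 can be sketched against an in-memory SQLite database. The table name `rate_events`, its columns, and the inclusive `block <= ?` cutoff are assumptions for illustration; only the `ROW_NUMBER() OVER (PARTITION BY hotkey ORDER BY block DESC, id DESC)` shape comes from the commit message. Window functions need SQLite 3.25 or newer.

```python
import sqlite3
from typing import Dict, Set

LATEST_RATES_SQL = """
SELECT hotkey, rate FROM (
    SELECT hotkey, rate,
           ROW_NUMBER() OVER (
               PARTITION BY hotkey ORDER BY block DESC, id DESC
           ) AS rn
    FROM rate_events
    WHERE from_chain = ? AND to_chain = ? AND block <= ?
)
WHERE rn = 1
"""


def open_demo_store() -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical rate-event layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rate_events ("
        " id INTEGER PRIMARY KEY, hotkey TEXT, from_chain TEXT,"
        " to_chain TEXT, block INTEGER, rate REAL)"
    )
    return conn


def get_latest_rates_before(
    conn: sqlite3.Connection, from_chain: str, to_chain: str, block: int
) -> Dict[str, float]:
    """One query per direction: the newest rate per hotkey at or before `block`."""
    return dict(conn.execute(LATEST_RATES_SQL, (from_chain, to_chain, block)).fetchall())


def filter_rewardable(rates: Dict[str, float], rewardable: Set[str]) -> Dict[str, float]:
    """Membership filter applied in Python rather than in SQL."""
    return {hotkey: rate for hotkey, rate in rates.items() if hotkey in rewardable}
```

The `id DESC` tie-break means that when a hotkey posts two rates in the same block, the later insert wins, which matches the single-row lookup's semantics as described.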

All 563 tests pass.

contract_is_halted was running once per forward step (~12s) to gate
the live-crown snapshot's halt-aware path. Halt is rare and the
dashboard already signals it loudly via the HaltBanner + top-right
"paused" indicator (both fed by /halt off contract_events, not the
contract RPC), so a ~2h lag between an actual halt and the live
current_crown_holders table being cleared is acceptable. It is not worth
burning 1 RPC per forward step (or even 1 every 5 blocks under the
cache) just to detect a state that flips a few times a year.

Changes:
- forward.py per-step snapshot block no longer calls
  contract_is_halted; snapshot_current_crown_holders called bare.
- snapshot_current_crown_holders loses the halted= parameter (dead
  with no caller).
- _flush_halt_window now also clears current_crown_holders by
  calling upsert_current_crown_snapshot with empty rows per
  direction. One clear per scoring round at halt-detection time.

Net halt-RPC frequency returns to pre-direct-DB-storage cadence
(~1 per scoring round, ~every 2h), even though the bounds_cache.halted()
helper stays in place for any future caller. All 563 tests pass.

…-db-storage

# Conflicts:
#	allways/validator/scoring.py

The queries-module convention isn't worth a separate file for a single
constant; inline the SQL at its one call site (matching the singular
get_latest_rate_before form) and remove the module.

snapshot_current_crown_holders now passes executable_rate_check into
crown_holders_at_instant, matching the scoring/ledger path (#395). Bounds
come from the 300-block-TTL bounds_cache, so no per-step RPC; both-zero on
read failure is the unset sentinel (permissive), preserving prior behavior.
Closes the gap where the live current_crown_holders table could show an
out-of-bounds-rate holder the historical ledger drops.

The scoring round fired on a forward-step count (step % SCORING_WINDOW_BLOCKS)
and wrote only the trailing SCORING_WINDOW_BLOCKS window. A forward pass spans
several blocks (~5 on mainnet, more on testnet under RPC limits), so rounds
fired far less often than once per window — at ~5 blocks/step a 600-block
window fired every ~3000 blocks, leaving ~80% of blocks unscored. That made the
per-block crown_holders ledger sparse (a regression vs the continuous alw-utils
indexer this PR replaces) and meant reward scoring sampled ~2h out of every
~10h instead of running continuously.

Fix:
- due_for_scoring(): block-based gate — fire once on a fresh process, then
  every SCORING_WINDOW_BLOCKS *blocks*, independent of forward-pass length.
- scoring_window_bounds(): anchor window_start to the last-scored block so
  consecutive rounds tile with no per-block gap, capped at
  MAX_SCORING_BACKFILL_BLOCKS so a post-outage catch-up round can't replay an
  unbounded span.
- Validator seeds last_scored_block one window back; score_and_reward_miners
  advances it only after a round completes (a failed round retries its window).

Both helpers are pure and unit-tested for tiling, overshoot, and the backfill
cap. Replay/flush/storage machinery is unchanged; it is already parameterized on the window.

anderdc force-pushed the feat/validator-direct-db-storage branch from b919097 to 6bf3c8d on May 28, 2026 21:15
anderdc added 4 commits May 28, 2026 16:19

A long stall isn't worth recovering — the event-watcher only reconstructs
~one window back on restart anyway. Cap at 2*SCORING_WINDOW_BLOCKS: enough
headroom over one window that forward-pass overshoot can't re-open a gap,
while a post-stall round just resumes a couple hours back instead of ~16h.

Halves the miner feedback lag: scores, the per-block crown_holders ledger,
and weights-driving data all refresh every ~1h instead of ~2h. The window is
the only smoothing (EMA alpha=1.0), and 1h is near-identical to 2h for steady
miners while tracking recent activity tighter; the main trade is a slightly
sharper busy-while-fulfilling dip, a pre-existing crown/busy property.

SCORING_WINDOW_BLOCKS 600 -> 300; every lookback (window_start, prune cutoffs,
event-watcher backfill, last_scored seed) and the 2x backfill cap follow the
constant automatically. Tests that hardcoded blocks for the 600-wide window
repositioned to fit 300 (no assertion changes).

This PR is the wiring: the module is now called from the scoring run and
the per-forward snapshot, so the 'follow-up' notes were inaccurate.

entrius merged commit eeb2991 into test May 28, 2026
3 checks passed
entrius deleted the feat/validator-direct-db-storage branch May 28, 2026 22:04