feat(validator): direct-DB writes for crown_holders + live current crown#390
Merged
Replaces the alw-utils/sync-validator-state daemon's port-and-mirror pattern with direct writes from the validator's scoring path. The daemon was a hand-maintained Python copy of replay_crown_time_window that ran on a 12s poll and re-derived crown winners from raw events; two copies of the same algorithm are now collapsed to one.
Mechanics:
- New `allways/validator/storage/` module mirrors the gittensor storage layout (database.py / queries.py / repository.py / storage.py). Uses psycopg v3 to match gittensor.
- `replay_crown_time_window` gains an optional `intervals_out` capture list: the per-interval data the function was already computing internally, exposed as a side-channel without changing its return value or the scoring math.
- `calculate_miner_rewards` expands captured intervals to per-block crown_holders rows, collects rate events for the window from state_store, and flushes both plus sync_cursor advancement in a single transaction at the end of the round.
- `DatabaseStorage` is instantiated on the Validator alongside state_store; closed on validator exit.
Default-disabled: STORE_DB_RESULTS env var gates everything. When unset, `is_enabled()` returns False, `intervals_out` stays None, the tee is a no-op, and no DB connection is attempted. Validators that don't write to the dashboard DB run identically to before. The `allways_network` block in docker-compose.vali.yml is also commented out by default, mirroring gittensor.
All 563 existing tests pass; the scoring test fixture's SimpleNamespace gained a disabled DatabaseStorage stub.
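The interval-to-row expansion described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `CrownInterval` shape, the half-open block range, the direction label, and the helper name `expand_to_block_rows` are all assumptions made for the example. The only behavior taken from the commit messages is that a k-way tie produces k rows with credit 1/k.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class CrownInterval:
    """Hypothetical shape of one captured interval (not the real type)."""
    start_block: int            # inclusive
    end_block: int              # exclusive (assumed half-open)
    direction: str              # illustrative label, e.g. "tao->btc"
    holders: Tuple[str, ...]    # hotkeys tied for the crown in this interval


def expand_to_block_rows(
    intervals: Iterable[CrownInterval],
) -> Iterator[Tuple[int, str, str, float]]:
    """Yield one (block, direction, hotkey, credit) row per block and holder."""
    for interval in intervals:
        if not interval.holders:
            continue  # nobody held the crown, so there is nothing to write
        credit = 1.0 / len(interval.holders)  # a k-way tie splits credit as 1/k
        for block in range(interval.start_block, interval.end_block):
            for hotkey in interval.holders:
                yield (block, interval.direction, hotkey, credit)


rows = list(expand_to_block_rows([
    CrownInterval(100, 102, "tao->btc", ("hk_a",)),
    CrownInterval(102, 103, "tao->btc", ("hk_a", "hk_b")),
]))
for row in rows:
    print(row)
```

Keeping the expansion a pure generator means it can run over the captured list after scoring finishes, which fits the "side-channel" framing: the scoring math never sees these rows.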
The round-end flush (every 600 blocks ≈ 2h) is the right cadence for the per-block scoring ledger in crown_holders, but it makes the dashboard's "who holds the crown right now" widget stale for hours at a time. Users have complained that crown holders aren't really "known" between rounds.
New live writer:
- snapshot_current_crown_holders(self) reconstructs rates/busy/active at self.block and evaluates crown_holders_at_instant once per direction. No event-stream walk; sub-millisecond per call.
- DatabaseStorage.upsert_current_crown_snapshot replaces each direction's rows in current_crown_holders atomically (delete then insert), preserving tie semantics: a k-way tie writes k rows with credit=1/k. Readers that want a single winner pick one (e.g. first by hotkey).
- Called at the tail of forward(), after vote_initiate / finalize / timeout-extension RPC so DB latency can't push deadline-sensitive votes past their block windows. Wrapped in try/except so any DB hiccup is logged and forward continues.
Same STORE_DB_RESULTS gate as the round-end flush; disabled validators pay no cost (is_enabled() short-circuits everything).
Requires the current_crown_holders table; see corresponding allways-db migration.
All 563 existing tests pass; no test changes needed since the new path is gated by the existing MagicMock stub returning is_enabled=False.
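The delete-then-insert replacement described above can be sketched like this. The PR writes to Postgres through psycopg v3; the sketch below swaps in the standard-library `sqlite3` module so it runs anywhere. The table columns and the function name `replace_direction_rows` are assumptions for illustration, not the real schema or method.

```python
import sqlite3
from typing import Sequence


def replace_direction_rows(
    conn: sqlite3.Connection, direction: str, holders: Sequence[str]
) -> None:
    """Replace one direction's rows in a single transaction (delete, then insert)."""
    credit = 1.0 / len(holders) if holders else 0.0  # a k-way tie writes credit 1/k
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "DELETE FROM current_crown_holders WHERE direction = ?", (direction,)
        )
        conn.executemany(
            "INSERT INTO current_crown_holders (direction, hotkey, credit) "
            "VALUES (?, ?, ?)",
            [(direction, hotkey, credit) for hotkey in holders],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE current_crown_holders ("
    "direction TEXT, hotkey TEXT, credit REAL, PRIMARY KEY (direction, hotkey))"
)
replace_direction_rows(conn, "tao->btc", ["hk_a"])          # single holder
replace_direction_rows(conn, "tao->btc", ["hk_b", "hk_c"])  # tie replaces the old row
snapshot = conn.execute(
    "SELECT hotkey, credit FROM current_crown_holders ORDER BY hotkey"
).fetchall()
print(snapshot)
```

Passing an empty `holders` list leaves the direction with no rows, which is one way to express "nobody currently holds the crown" without a separate code path.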
6 tasks
Applies four fixes flagged by an independent audit of the validator
direct-DB-storage rollout (replacing alw-utils/sync-validator-state):
1. DB ceilings + clearer startup logging. New connect_timeout=2s and
server-side statement_timeout=2000ms cap every individual write so
a slow/down Postgres can never stall the forward loop. Single
connect attempt at validator init — no retry, no backoff. The
DatabaseStorage constructor now logs the gating decision and the
connection result explicitly:
- "STORE_DB_RESULTS not set — validator DB storage disabled"
- "STORE_DB_RESULTS=1 — connecting to Postgres for dashboard writes"
- "Validator DB storage enabled" (on success)
- "STORE_DB_RESULTS=1 but Postgres connection failed — dashboard
writes disabled for this process" (on failure)
2. Halt-path DB writes. The daemon explicitly cleared crown_holders
for the halted window and advanced sync_cursor. Without this, the
dashboard kept showing pre-halt holders while the validator had
recycled the pool. Adds DatabaseStorage.flush_halt_window and
wires it into score_and_reward_miners' halted branch.
3. Live current-crown halt-awareness. snapshot_current_crown_holders
now accepts halted=True and returns empty rows for every direction;
the per-forward writer in forward.py checks contract_is_halted
(renamed from _contract_is_halted to make it module-public) and
passes it through. During a halt the live table is cleared so the
UI matches the recycle semantics.
4. Off-by-one sync_cursor. flush_scoring_window writes range
[window_start, window_end) exclusive of window_end, so the last
block actually flushed is window_end - 1. The cursor was claiming
window_end — readers were lied to about freshness by one block.
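Fix 4 is half-open range arithmetic. A small sketch, using hypothetical helper names, of why the cursor should be `window_end - 1`:

```python
from typing import Optional


def flushed_blocks(window_start: int, window_end: int) -> range:
    """Blocks written by a flush over the half-open range [window_start, window_end)."""
    return range(window_start, window_end)


def cursor_after_flush(window_start: int, window_end: int) -> Optional[int]:
    """Last block actually flushed, or None if the window was empty."""
    blocks = flushed_blocks(window_start, window_end)
    return blocks[-1] if len(blocks) else None


print(cursor_after_flush(1200, 1800))
```

For the window `[1200, 1800)` the last flushed block is 1799, so a cursor of 1800 would claim one block that was never written.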
Skipping the multi-validator cursor GREATEST fix from the audit per
operator: only a single validator writes the dashboard DB in practice.
All 563 existing tests pass — no test changes needed since the new
paths are gated by the existing MagicMock stub returning is_enabled=False.
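The DB ceilings in fix 1 map onto standard libpq connection parameters. The sketch below shows one way to pass them with psycopg v3; the helper names and log wording are assumptions, and the real `DatabaseStorage` constructor may build its connection differently. `connect_timeout` is in seconds, and `statement_timeout` without a unit is read by Postgres as milliseconds.

```python
import logging
from typing import Any, Dict, Optional

log = logging.getLogger("validator.storage")


def build_connect_kwargs() -> Dict[str, Any]:
    """libpq parameters that cap connection time and per-statement time."""
    return {
        "connect_timeout": 2,                      # seconds to wait for a connection
        "options": "-c statement_timeout=2000",    # server-side cap, milliseconds
    }


def try_connect(dsn: str) -> Optional[Any]:
    """Single connect attempt: no retry, no backoff. Returns None on any failure."""
    try:
        import psycopg  # imported lazily so disabled validators never need it

        return psycopg.connect(dsn, **build_connect_kwargs())
    except Exception as exc:  # includes ImportError and connection errors
        log.warning("Postgres connection failed, dashboard writes disabled: %s", exc)
        return None


print(build_connect_kwargs())
```

Returning `None` rather than raising keeps the failure local to the storage layer, matching the "dashboard writes disabled for this process" outcome in the log lines above.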
Two perf reductions on the forward-step cost introduced by the direct-DB-storage rollout, both targeting calls that fire every ~12s under STORE_DB_RESULTS=1.
1. Halt RPC cache. contract_is_halted now delegates to bounds_cache.halted() with a 5-block (~60s) TTL. The per-forward live-crown snapshot was hitting the substrate RPC every step; halt only flips via admin tx so a short cache cuts ~50-200ms off most forward steps. RPC failures fall back to the last cached value (or False if none) so transient flakes don't churn the live-crown writer.
2. Batched rate reconstruction. state_store gains get_latest_rates_before(from_chain, to_chain, block): one query per direction with ROW_NUMBER() OVER (PARTITION BY hotkey ORDER BY block DESC, id DESC) instead of N point lookups. reconstruct_window_start_state filters the dict by membership in rewardable_hotkeys in Python. Same tie-break semantics as the single-row form. Saves ~10-30ms on every snapshot, more as miner count grows.
New SQL lives in allways/validator/state_store_queries.py, a first step toward migrating state_store SQL out of inline method bodies per the project's "SQL goes in queries.py" convention.
All 563 tests pass.
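The halt cache in item 1 can be approximated with a block-keyed TTL cache that keeps its last value when the RPC fails. The class below is a sketch under stated assumptions: `HaltCache`, its constructor arguments, and the fake RPC are invented for the example and do not reflect the actual `bounds_cache` implementation.

```python
from typing import Callable, List, Optional


class HaltCache:
    """Caches a boolean RPC result for ttl_blocks blocks (hypothetical helper)."""

    def __init__(self, fetch: Callable[[], bool], ttl_blocks: int = 5) -> None:
        self._fetch = fetch
        self._ttl = ttl_blocks
        self._value: Optional[bool] = None
        self._fetched_at: Optional[int] = None

    def halted(self, block: int) -> bool:
        fresh = self._fetched_at is not None and block - self._fetched_at < self._ttl
        if not fresh:
            try:
                self._value = bool(self._fetch())
                self._fetched_at = block
            except Exception:
                pass  # RPC flake: keep the last cached value and retry next call
        return bool(self._value)  # False when nothing has ever been cached


calls: List[int] = []


def fake_rpc() -> bool:
    """Stand-in for the contract RPC; the second call simulates a failure."""
    calls.append(1)
    if len(calls) == 2:
        raise ConnectionError("simulated RPC flake")
    return True


cache = HaltCache(fake_rpc)
results = [cache.halted(block) for block in (100, 102, 105, 106)]
print(results, len(calls))
```

Because a failed refresh does not update the fetch block, the next call retries immediately instead of waiting out another TTL.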
contract_is_halted was running once per forward step (~12s) to gate the live-crown snapshot's halt-aware path. Halt is rare and the dashboard already signals it loudly via the HaltBanner + top-right "paused" indicator (both fed by /halt off contract_events, not the contract RPC), so an ~2h lag between an actual halt and the live current_crown_holders table being cleared is acceptable, and worth not burning 1 RPC per minute (or even 1 every 5 blocks under the cache) just to detect a state that flips a few times a year.
Changes:
- forward.py per-step snapshot block no longer calls contract_is_halted; snapshot_current_crown_holders called bare.
- snapshot_current_crown_holders loses the halted= parameter (dead with no caller).
- _flush_halt_window now also clears current_crown_holders by calling upsert_current_crown_snapshot with empty rows per direction. One clear per scoring round at halt-detection time.
Net halt-RPC frequency returns to pre-direct-DB-storage cadence (~1 per scoring round, ~every 2h), even though the bounds_cache.halted() helper stays in place for any future caller.
All 563 tests pass.
…-db-storage # Conflicts: # allways/validator/scoring.py
The queries-module convention isn't worth a separate file for a single constant; inline the SQL at its one call site (matching the singular get_latest_rate_before form) and remove the module.
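For reference, the batched "latest rate per hotkey" lookup that this commit inlines can be sketched as below. The `rate_events` table, its columns, and the inclusive `block <= ?` comparison are assumptions for illustration (the real query may compare strictly); only the `ROW_NUMBER() OVER (PARTITION BY hotkey ORDER BY block DESC, id DESC)` shape comes from the commit messages. The sketch uses `sqlite3`, which supports window functions from SQLite 3.25 onward.

```python
import sqlite3

# Hypothetical table and columns; only the window-function shape is from the PR.
BATCH_LATEST_RATES_BEFORE = """
SELECT hotkey, rate FROM (
    SELECT hotkey, rate,
           ROW_NUMBER() OVER (
               PARTITION BY hotkey ORDER BY block DESC, id DESC
           ) AS rn
    FROM rate_events
    WHERE from_chain = ? AND to_chain = ? AND block <= ?  -- inclusive is assumed
)
WHERE rn = 1
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rate_events (id INTEGER PRIMARY KEY, hotkey TEXT, "
    "from_chain TEXT, to_chain TEXT, rate REAL, block INTEGER)"
)
conn.executemany(
    "INSERT INTO rate_events VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "hk_a", "tao", "btc", 1.0, 10),
        (2, "hk_a", "tao", "btc", 1.2, 20),
        (3, "hk_a", "tao", "btc", 1.3, 20),  # same block as id 2; higher id wins
        (4, "hk_b", "tao", "btc", 0.9, 15),
        (5, "hk_a", "tao", "btc", 2.0, 40),  # after the cutoff, ignored
    ],
)

latest = dict(conn.execute(BATCH_LATEST_RATES_BEFORE, ("tao", "btc", 30)).fetchall())

# Filtering by rewardable hotkeys happens in Python, as the commit describes.
rewardable_hotkeys = {"hk_a"}
filtered = {hk: rate for hk, rate in latest.items() if hk in rewardable_hotkeys}
print(latest, filtered)
```

One query per direction replaces one lookup per miner, so the cost no longer grows with the number of round trips.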
snapshot_current_crown_holders now passes executable_rate_check into crown_holders_at_instant, matching the scoring/ledger path (#395). Bounds come from the 300-block-TTL bounds_cache, so no per-step RPC; both-zero on read failure is the unset sentinel (permissive), preserving prior behavior. Closes the gap where the live current_crown_holders table could show an out-of-bounds-rate holder the historical ledger drops.
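The "both-zero means unset" convention can be illustrated with a small bounds check. This is only a sketch of the sentinel handling: how the real `executable_rate_check` derives an amount from a rate is not described here, so the example takes the amount as a plain integer, and the function name and parameters are hypothetical.

```python
def passes_swap_bounds(amount_rao: int, min_swap_rao: int, max_swap_rao: int) -> bool:
    """Return True when the amount is within bounds, or when bounds are unset."""
    if min_swap_rao == 0 and max_swap_rao == 0:
        return True  # unset sentinel (e.g. after a failed read): be permissive
    return min_swap_rao <= amount_rao <= max_swap_rao


print(passes_swap_bounds(500, 0, 0))         # bounds unset
print(passes_swap_bounds(500, 100, 1_000))   # inside the bounds
print(passes_swap_bounds(5_000, 100, 1_000)) # above the maximum
```

Treating the sentinel as permissive means a failed bounds read degrades to the behavior from before the filter existed instead of dropping every holder.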
The scoring round fired on a forward-step count (step % SCORING_WINDOW_BLOCKS) and wrote only the trailing SCORING_WINDOW_BLOCKS window. A forward pass spans several blocks (~5 on mainnet, more on testnet under RPC limits), so rounds fired far less often than once per window: at ~5 blocks/step a 600-block window fired every ~3000 blocks, leaving ~80% of blocks unscored. That made the per-block crown_holders ledger sparse (a regression vs the continuous alw-utils indexer this PR replaces) and meant reward scoring sampled ~2h out of every ~10h instead of running continuously.
Fix:
- due_for_scoring(): block-based gate; fire once on a fresh process, then every SCORING_WINDOW_BLOCKS *blocks*, independent of forward-pass length.
- scoring_window_bounds(): anchor window_start to the last-scored block so consecutive rounds tile with no per-block gap, capped at MAX_SCORING_BACKFILL_BLOCKS so a post-outage catch-up round can't replay an unbounded span.
- Validator seeds last_scored_block one window back; score_and_reward_miners advances it only after a round completes (a failed round retries its window).
Both helpers are pure and unit-tested for tiling, overshoot, and the backfill cap. Replay/flush/storage machinery is unchanged; it's already window-param'd.
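A rough sketch of the two helpers, assuming simple integer signatures (the real functions may take different arguments). The constant values are illustrative: 600 matches the window at this point in the history, and the cap value is a placeholder. The loop simulates forward passes landing on arbitrary blocks, including an overshoot and a long outage.

```python
from typing import List, Tuple

SCORING_WINDOW_BLOCKS = 600          # window size at this commit
MAX_SCORING_BACKFILL_BLOCKS = 1200   # illustrative cap, not necessarily the real value


def due_for_scoring(current_block: int, last_scored_block: int) -> bool:
    """Block-based gate: due once a full window of blocks has elapsed."""
    return current_block - last_scored_block >= SCORING_WINDOW_BLOCKS


def scoring_window_bounds(current_block: int, last_scored_block: int) -> Tuple[int, int]:
    """Start where the last round ended, but never reach back past the cap."""
    window_end = current_block
    window_start = max(last_scored_block, window_end - MAX_SCORING_BACKFILL_BLOCKS)
    return window_start, window_end


start_block = 10_000
last_scored = start_block - SCORING_WINDOW_BLOCKS  # seed one window back
windows: List[Tuple[int, int]] = []
for block in (10_000, 10_300, 10_598, 10_603, 20_000):  # blocks seen by forward passes
    if due_for_scoring(block, last_scored):
        window = scoring_window_bounds(block, last_scored)
        windows.append(window)
        last_scored = window[1]  # advance only after the round completes
print(windows)
```

Seeding one window back makes the first pass due immediately, the second round starts exactly where the first ended despite the 3-block overshoot, and the post-outage round is clipped by the cap.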
b919097 to
6bf3c8d
Compare
added 4 commits
May 28, 2026 16:19
A long stall isn't worth recovering — the event-watcher only reconstructs ~one window back on restart anyway. Cap at 2*SCORING_WINDOW_BLOCKS: enough headroom over one window that forward-pass overshoot can't re-open a gap, while a post-stall round just resumes a couple hours back instead of ~16h.
Halves the miner feedback lag — scores, the per-block crown_holders ledger, and weights-driving data all refresh every ~1h instead of ~2h. The window is the only smoothing (EMA alpha=1.0), and 1h is near-identical to 2h for steady miners while tracking recent activity tighter; the main trade is a slightly sharper busy-while-fulfilling dip, a pre-existing crown/busy property. SCORING_WINDOW_BLOCKS 600 -> 300; every lookback (window_start, prune cutoffs, event-watcher backfill, last_scored seed) and the 2x backfill cap follow the constant automatically. Tests that hardcoded blocks for the 600-wide window repositioned to fit 300 (no assertion changes).
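Two bits of arithmetic behind this commit, sketched with hypothetical helper names. The ~12s block time is the figure used throughout this PR, and the EMA form below is the standard one; the validator's actual update code is not shown here.

```python
BLOCK_TIME_SECONDS = 12  # approximate block time used throughout this PR


def window_hours(window_blocks: int) -> float:
    """Wall-clock length of a scoring window, in hours."""
    return window_blocks * BLOCK_TIME_SECONDS / 3600


def ema_update(previous: float, observed: float, alpha: float) -> float:
    """Standard exponential moving average step."""
    return alpha * observed + (1.0 - alpha) * previous


print(window_hours(600), window_hours(300))
print(ema_update(previous=0.9, observed=0.25, alpha=1.0))
```

With alpha = 1.0 the previous score carries no weight, so each round's score is exactly that window's observation. That is why the window length is the only smoothing knob.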
This PR is the wiring — the module is now called from the scoring run and the per-forward snapshot, so the 'follow-up' notes were inaccurate.
entrius
approved these changes
May 28, 2026
Summary
The validator owns its dashboard-write surface directly instead of being mirrored by the alw-utils `sync-validator-state` daemon. Originally two changes; now also includes the cadence fix that makes the direct path actually produce gap-free coverage, plus a merge of `test`.
1. Round-end direct write
Replaces the alw-utils daemon's port-and-mirror pattern with direct writes from the validator's scoring path:
- New `allways/validator/storage/` module (psycopg v3) mirroring the gittensor layout (database/queries/repository/storage).
- `replay_crown_time_window` gains an optional `intervals_out` capture list that exposes the per-interval crown data it was already computing internally; no change to return value or scoring math.
- `calculate_miner_rewards` expands captured intervals to per-block `crown_holders` rows, collects rate events for the window, and flushes everything plus sync_cursor advancement in one transaction at end-of-round.
2. Per-forward live current-crown write
The dashboard refreshes the current-crown header within one forward pass instead of waiting for the next round flush:
- `snapshot_current_crown_holders(self)` reconstructs rates/busy/active at `self.block` and evaluates `crown_holders_at_instant` once per direction. No event-stream walk.
- `DatabaseStorage.upsert_current_crown_snapshot` atomically replaces each direction's rows (delete then insert), preserving k-way tie semantics.
- Called at the tail of `forward()`, after deadline-sensitive RPC, wrapped in try/except so DB outages can't propagate.
- The executability filter's bounds come from `bounds_cache`, so it stays a zero-RPC, sub-ms per-step write.
3. Block-based scoring cadence + gap-free window tiling ⚠️ behavior change
This is the fix that makes the direct-write path correct, and it changes reward distribution — review accordingly.
The scoring round fired on a forward-step count (`step % SCORING_WINDOW_BLOCKS`) and wrote only the trailing `SCORING_WINDOW_BLOCKS` window. But a forward pass spans several blocks (~5 on mainnet, more on testnet under RPC limits). At ~5 blocks/step a 600-block window fired only every ~3,000 blocks, so:
- the per-block `crown_holders` ledger was ~20% covered (a regression vs the continuous alw-utils indexer this PR replaces; the dashboard grid would look ~80% empty on prod), and
- reward scoring sampled ~2h out of every ~10h instead of running continuously.
Fix (root cause = step-vs-block unit mismatch):
- `due_for_scoring()`: block-based gate. Fire once on a fresh process, then every `SCORING_WINDOW_BLOCKS` blocks, independent of forward-pass length.
- `scoring_window_bounds()`: anchor `window_start` to the last-scored block so consecutive rounds tile with no per-block gap; capped at `MAX_SCORING_BACKFILL_BLOCKS` so a post-outage catch-up round can't replay an unbounded span.
- Validator seeds `last_scored_block` one window back (preserving the "one trailing window on fresh start" behavior); `score_and_reward_miners` advances it only after a round completes, so a failed round retries its window rather than skipping it.
- Scoring window halved (`SCORING_WINDOW_BLOCKS` 600 → 300). Halves miner feedback lag: scores and the per-block ledger refresh every ~1h. The window is the only smoothing (`SCORING_EMA_ALPHA = 1.0`); 1h is near-identical to 2h for steady miners, just tracks recent activity tighter. Every lookback (prune cutoffs, event-watcher backfill, the 2× cap) follows the constant.
Both helpers are pure and unit-tested (tiling, overshoot, backfill cap, fresh-seed). The replay/flush/storage machinery is unchanged; it was already parameterized on `window_start`/`window_end`.
Consensus note: this shifts reward scoring from a sparse ~20% sample to full contiguous coverage every ~1h. Emissions distribution will move accordingly. Intended and more correct (crown time held between the old sparse samples was previously uncredited), but it is a live-network behavior change.
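The ~20% figure follows directly from the step-vs-block mismatch. A short worked calculation, using a hypothetical helper name and the ~5 blocks/step estimate quoted in this description:

```python
from typing import Tuple


def step_gated_coverage(window_blocks: int, blocks_per_step: int) -> Tuple[int, float]:
    """Blocks between rounds and fraction of blocks scored under the old step gate."""
    # The old gate fired every `window_blocks` forward steps, and each step
    # spans `blocks_per_step` blocks, so rounds were this many blocks apart.
    blocks_between_rounds = window_blocks * blocks_per_step
    # Each round only wrote the trailing `window_blocks` blocks.
    coverage = window_blocks / blocks_between_rounds
    return blocks_between_rounds, coverage


gap, coverage = step_gated_coverage(window_blocks=600, blocks_per_step=5)
print(gap, coverage)
```

Coverage reduces to `1 / blocks_per_step`, so it gets worse as forward passes get slower, which matches the note that testnet under RPC limits was affected more.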
Housekeeping
- Merged `origin/test` (brings pinned-rate from #391 "Crown calc uses pinned rate during the reserved-not-busy window"; the executability filter from #395 "Filter unexecutable rates from crown eligibility" and #420 "Fix swap quote row selection and close TAO→BTC crown loophole"; plus #421 "Tighten timeout-extension evidence + persist dest-tip snapshot" and #422 "Refactor: extract lock+commit boilerplate from ValidatorStateStore"). Conflict in `scoring.py` resolved by combining `intervals_out` + `min/max_swap_rao` + the new `pinned_rates` return.
- Inlined the `BATCH_LATEST_RATES_BEFORE` query and removed the single-constant `state_store_queries.py` module.
Both DB writes remain gated by `STORE_DB_RESULTS`: dark by default; disabled validators run identically to before.
Companion PRs:
- `entrius/allways-db#21`: adds the `current_crown_holders` table
- `entrius/das-allways#28`: `getCurrent()` reads the new table with fallback to legacy path
Rollout order
1. Apply the `allways-db#21` migration
2. Deploy `das-allways#28` (falls back to legacy path while the new table is empty)
3. Set `STORE_DB_RESULTS=1` on a validator and verify rows in `crown_holders`, `rate_history`, `current_crown_holders`, and that `crown_holders` coverage is contiguous (no gaps between scoring rounds)
4. Retire the `alw-utils/sync-validator-state` daemon
Test plan
- `STORE_DB_RESULTS=1`: `current_crown_holders` has 1+ rows/direction within one forward step
- `crown_holders` tiles gap-free across consecutive scoring rounds (round N `window_end` == round N+1 `window_start`)
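The tiling item in the test plan can be checked mechanically once the per-round window bounds are known. A minimal sketch, assuming the bounds are available as a list of `(window_start, window_end)` pairs in round order; the helper name is hypothetical and how the bounds are collected (logs, DB query) is left open.

```python
from typing import List, Tuple


def find_tiling_gaps(windows: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return (prev_end, next_start) for every pair of rounds that does not tile."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(windows, windows[1:]):
        if prev_end != next_start:
            gaps.append((prev_end, next_start))
    return gaps


print(find_tiling_gaps([(0, 300), (300, 605), (605, 900)]))  # contiguous rounds
print(find_tiling_gaps([(0, 300), (310, 600)]))              # ten unscored blocks
```

An empty result means round N `window_end` equals round N+1 `window_start` for every consecutive pair. Note that a deliberately capped post-outage round would also show up as a gap here.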