feat(validator): direct-DB writes for crown_holders + live current crown by anderdc · Pull Request #390 · entrius/allways

anderdc · 2026-05-27T19:29:05Z

Summary

The validator owns its dashboard-write surface directly instead of being mirrored by the alw-utils sync-validator-state daemon. Originally two changes; now also includes the cadence fix that makes the direct path actually produce gap-free coverage, plus a merge of the test branch.

1. Round-end direct write

Replaces the alw-utils daemon's port-and-mirror pattern with direct writes from the validator's scoring path:

New allways/validator/storage/ module (psycopg v3) mirroring the gittensor layout (database/queries/repository/storage).
replay_crown_time_window gains an optional intervals_out capture list — exposes the per-interval crown data it was already computing internally; no change to return value or scoring math.
calculate_miner_rewards expands captured intervals to per-block crown_holders rows, collects rate events for the window, and flushes everything plus sync_cursor advancement in one transaction at end-of-round.
Eliminates the validator-vs-daemon drift class.
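The interval-to-row expansion described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `CrownInterval` shape, the half-open `[start_block, end_block)` convention, the row tuple layout, and the 1/k credit split for per-block rows are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class CrownInterval:
    """Hypothetical shape of one captured interval, covering [start_block, end_block)."""
    direction: str
    start_block: int
    end_block: int  # exclusive
    holders: Tuple[str, ...]  # hotkeys tied for the crown over this span


def expand_to_block_rows(intervals: Iterable[CrownInterval]) -> List[tuple]:
    """Expand each interval into one row per (block, holder), splitting credit on ties."""
    rows = []
    for iv in intervals:
        if not iv.holders:
            continue  # nobody held the crown, so this span produces no rows
        credit = 1.0 / len(iv.holders)  # assumed tie split: k holders get 1/k each
        for block in range(iv.start_block, iv.end_block):
            for hotkey in iv.holders:
                rows.append((iv.direction, block, hotkey, credit))
    return rows
```

A two-way tie over three blocks yields six rows with credit 0.5 each; a span with no holder yields none.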

2. Per-forward live current-crown write

The dashboard refreshes the current-crown header within one forward pass instead of waiting for the next round flush:

snapshot_current_crown_holders(self) reconstructs rates/busy/active at self.block and evaluates crown_holders_at_instant once per direction. No event-stream walk.
DatabaseStorage.upsert_current_crown_snapshot atomically replaces each direction's rows (delete then insert), preserving k-way tie semantics.
Called at the tail of forward(), after deadline-sensitive RPC, wrapped in try/except so DB outages can't propagate.
Now applies the reservation-pin overlay and the #395 executability filter (Filter unexecutable rates from crown eligibility), so the live table credits the exact same holder the scoring/ledger path does; it no longer shows an out-of-bounds-rate or pre-pin holder that the ledger drops. Bounds come from the 300-block-TTL bounds_cache, so it stays a zero-RPC, sub-ms per-step write.
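A rough sketch of the delete-then-insert replacement, for readers who want to see the tie handling concretely. It uses the standard-library `sqlite3` module as a stand-in for Postgres/psycopg so it runs anywhere; the table columns and the `upsert_snapshot` signature are hypothetical, not the real `DatabaseStorage.upsert_current_crown_snapshot`.

```python
import sqlite3
from typing import Dict, List


def upsert_snapshot(conn: sqlite3.Connection, block: int,
                    holders_by_direction: Dict[str, List[str]]) -> None:
    """Replace each direction's rows in one transaction (delete, then insert)."""
    with conn:  # commits on success, rolls back if any statement raises
        for direction, hotkeys in holders_by_direction.items():
            conn.execute(
                "DELETE FROM current_crown_holders WHERE direction = ?",
                (direction,),
            )
            if not hotkeys:
                continue  # no current holder: the direction is left empty
            credit = 1.0 / len(hotkeys)  # a k-way tie writes k rows with credit 1/k
            conn.executemany(
                "INSERT INTO current_crown_holders (direction, hotkey, credit, block) "
                "VALUES (?, ?, ?, ?)",
                [(direction, hk, credit, block) for hk in hotkeys],
            )
```

Because the delete and the inserts share a transaction, a reader never observes a direction with a partial set of tied holders.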

3. Block-based scoring cadence + gap-free window tiling ⚠️ behavior change

This is the fix that makes the direct-write path correct, and it changes reward distribution — review accordingly.

The scoring round fired on a forward-step count (step % SCORING_WINDOW_BLOCKS) and wrote only the trailing SCORING_WINDOW_BLOCKS window. But a forward pass spans several blocks (~5 on mainnet, more on testnet under RPC limits). At ~5 blocks/step a 600-block window fired only every ~3,000 blocks, so:

the per-block crown_holders ledger was ~20% covered (a regression vs the continuous alw-utils indexer this PR replaces — the dashboard grid would look ~80% empty on prod), and
reward scoring sampled ~2h out of every ~10h instead of running continuously, and the intended "~2h cadence" was really ~10h.
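The numbers above follow from simple arithmetic, worked here as a small sketch. The 12-second block time is taken from the "600 blocks ≈ 2h" figure elsewhere in this PR; the function name is illustrative only.

```python
BLOCK_SECONDS = 12  # implied by 600 blocks being about 2h


def step_gated_coverage(window_blocks: int, blocks_per_step: int):
    """Coverage when a round fires every `window_blocks` forward steps, not blocks."""
    # The old gate counted steps, so a round fired once per window_blocks steps,
    # and each step spans blocks_per_step blocks.
    blocks_between_rounds = window_blocks * blocks_per_step
    # Each round only wrote the trailing window_blocks blocks.
    coverage = window_blocks / blocks_between_rounds
    return blocks_between_rounds, coverage


between, coverage = step_gated_coverage(window_blocks=600, blocks_per_step=5)
hours_between_rounds = between * BLOCK_SECONDS / 3600
print(between, f"{coverage:.0%}", hours_between_rounds)  # 3000 20% 10.0
```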

Fix (root cause = step-vs-block unit mismatch):

due_for_scoring() — block-based gate: fire once on a fresh process, then every SCORING_WINDOW_BLOCKS blocks, independent of forward-pass length.
scoring_window_bounds() — anchor window_start to the last-scored block so consecutive rounds tile with no per-block gap; capped at MAX_SCORING_BACKFILL_BLOCKS so a post-outage catch-up round can't replay an unbounded span.
Validator seeds last_scored_block one window back (preserving the "one trailing window on fresh start" behavior); score_and_reward_miners advances it only after a round completes, so a failed round retries its window rather than skipping it.
Window shortened 2h → 1h (SCORING_WINDOW_BLOCKS 600 → 300). Halves miner feedback lag — scores and the per-block ledger refresh every ~1h. The window is the only smoothing (SCORING_EMA_ALPHA = 1.0); 1h is near-identical to 2h for steady miners, just tracks recent activity tighter. Every lookback (prune cutoffs, event-watcher backfill, the 2× cap) follows the constant.

Both helpers are pure and unit-tested (tiling, overshoot, backfill cap, fresh-seed). The replay/flush/storage machinery is unchanged — it was already parameterized on window_start/window_end.
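One plausible shape for the two helpers, written as a sketch under assumptions: the real signatures and internals are not shown in this description, so the argument lists, the `simulate` driver, and the choice of `window_end = current_block` are hypothetical. The 2x backfill cap and the one-window-back seed come from the PR text.

```python
from typing import List, Tuple

SCORING_WINDOW_BLOCKS = 300
MAX_SCORING_BACKFILL_BLOCKS = 2 * SCORING_WINDOW_BLOCKS


def due_for_scoring(current_block: int, last_scored_block: int) -> bool:
    """Block-based gate: true once a full window of blocks has elapsed."""
    return current_block - last_scored_block >= SCORING_WINDOW_BLOCKS


def scoring_window_bounds(current_block: int, last_scored_block: int) -> Tuple[int, int]:
    """Anchor the window to the last-scored block, capped by the backfill limit."""
    window_end = current_block
    window_start = max(last_scored_block, window_end - MAX_SCORING_BACKFILL_BLOCKS)
    return window_start, window_end


def simulate(start_block: int, blocks_per_step: int, steps: int) -> List[Tuple[int, int]]:
    """Drive the gate with multi-block forward steps and collect scored windows."""
    last_scored = start_block - SCORING_WINDOW_BLOCKS  # seed one window back
    windows = []
    block = start_block
    for _ in range(steps):
        if due_for_scoring(block, last_scored):
            start, end = scoring_window_bounds(block, last_scored)
            windows.append((start, end))
            last_scored = end  # advance only after the round completes
        block += blocks_per_step
    return windows
```

With 7-block forward steps the gate overshoots each 300-block boundary slightly, but because every window starts where the previous one ended, the windows still tile without gaps.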

Consensus note: this shifts reward scoring from a sparse ~20% sample to full contiguous coverage every ~1h. Emissions distribution will move accordingly. Intended and more correct (crown time held between the old sparse samples was previously uncredited), but it is a live-network behavior change.

Housekeeping

Merged origin/test, which brings the #391 pinned-rate change (Crown calc uses pinned rate during the reserved-not-busy window), the #395/#420 executability filter (Filter unexecutable rates from crown eligibility / Fix swap quote row selection and close TAO→BTC crown loophole), and #421/#422 (Tighten timeout-extension evidence + persist dest-tip snapshot / Refactor: extract lock+commit boilerplate from ValidatorStateStore). Conflict in scoring.py resolved by combining intervals_out + min/max_swap_rao + the new pinned_rates return.
Inlined the lone BATCH_LATEST_RATES_BEFORE query and removed the single-constant state_store_queries.py module.

Both DB writes remain gated by STORE_DB_RESULTS — dark by default; disabled validators run identically to before.

Companion PRs:

schema: entrius/allways-db#21 — adds the current_crown_holders table
dashboard reader: entrius/das-allways#28 — getCurrent() reads the new table with fallback to legacy path

Rollout order

Apply allways-db#21 migration
Deploy das-allways#28 (falls back to legacy path while the new table is empty)
Deploy this PR
Set STORE_DB_RESULTS=1 on a validator — verify rows in crown_holders, rate_history, current_crown_holders, and that crown_holders coverage is contiguous (no gaps between scoring rounds)
After a roll-out window, stop the alw-utils/sync-validator-state daemon

Test plan

616 tests pass (7 new: block-based gate + gap-free tiling + backfill cap)
pre-commit (ruff lint + format) clean
Staging validator with STORE_DB_RESULTS=1: current_crown_holders has 1+ rows/direction within one forward step
Verify crown_holders tiles gap-free across consecutive scoring rounds (round N window_end == round N+1 window_start)
Confirm emissions move as expected under continuous (vs sampled) scoring before wide rollout
Force a DB outage mid-forward — verify forward continues and votes still land
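The contiguity check in the test plan can be done by hand, but a small helper makes it repeatable. This is a sketch that assumes the block numbers have already been fetched from crown_holders for one direction; `find_block_gaps` is a hypothetical name. Note that if blocks with no crown holder produce no rows at all, they will show up here as gaps even though scoring covered them.

```python
from typing import Iterable, List, Tuple


def find_block_gaps(blocks: Iterable[int]) -> List[Tuple[int, int]]:
    """Return inclusive (first_missing, last_missing) ranges between observed blocks."""
    ordered = sorted(set(blocks))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps
```

For example, `find_block_gaps([100, 101, 102, 105, 106])` reports blocks 103 through 104 as missing, while a fully tiled range reports nothing.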

Replaces the alw-utils/sync-validator-state daemon's port-and-mirror pattern with direct writes from the validator's scoring path. The daemon was a hand-maintained Python copy of replay_crown_time_window that ran on a 12s poll and re-derived crown winners from raw events; two copies of the same algorithm are now collapsed to one.

Mechanics:
- New `allways/validator/storage/` module mirrors the gittensor storage layout (database.py / queries.py / repository.py / storage.py). Uses psycopg v3 to match gittensor.
- `replay_crown_time_window` gains an optional `intervals_out` capture list: the per-interval data the function was already computing internally, exposed as a side-channel without changing its return value or the scoring math.
- `calculate_miner_rewards` expands captured intervals to per-block crown_holders rows, collects rate events for the window from state_store, and flushes both plus sync_cursor advancement in a single transaction at the end of the round.
- `DatabaseStorage` is instantiated on the Validator alongside state_store; closed on validator exit.

Default-disabled: STORE_DB_RESULTS env var gates everything. When unset, `is_enabled()` returns False, `intervals_out` stays None, the tee is a no-op, and no DB connection is attempted. Validators that don't write to the dashboard DB run identically to before. The `allways_network` block in docker-compose.vali.yml is also commented out by default, mirroring gittensor.

All 563 existing tests pass; the scoring test fixture's SimpleNamespace gained a disabled DatabaseStorage stub.

The round-end flush (every 600 blocks ≈ 2h) is the right cadence for the per-block scoring ledger in crown_holders, but it makes the dashboard's "who holds the crown right now" widget stale for hours at a time. Users have complained that crown holders aren't really "known" between rounds.

New live writer:
- snapshot_current_crown_holders(self) reconstructs rates/busy/active at self.block and evaluates crown_holders_at_instant once per direction. No event-stream walk; sub-millisecond per call.
- DatabaseStorage.upsert_current_crown_snapshot replaces each direction's rows in current_crown_holders atomically (delete then insert), preserving tie semantics: a k-way tie writes k rows with credit=1/k. Readers that want a single winner pick one (e.g. first by hotkey).
- Called at the tail of forward(), after vote_initiate / finalize / timeout-extension RPC so DB latency can't push deadline-sensitive votes past their block windows. Wrapped in try/except so any DB hiccup is logged and forward continues.

Same STORE_DB_RESULTS gate as the round-end flush; disabled validators pay no cost (is_enabled() short-circuits everything).

Requires the current_crown_holders table; see the corresponding allways-db migration.

All 563 existing tests pass; no test changes needed since the new path is gated by the existing MagicMock stub returning is_enabled=False.

Applies four fixes flagged by an independent audit of the validator direct-DB-storage rollout (replacing alw-utils/sync-validator-state):

1. DB ceilings + clearer startup logging. New connect_timeout=2s and server-side statement_timeout=2000ms cap every individual write so a slow/down Postgres can never stall the forward loop. Single connect attempt at validator init: no retry, no backoff. The DatabaseStorage constructor now logs the gating decision and the connection result explicitly:
- "STORE_DB_RESULTS not set — validator DB storage disabled"
- "STORE_DB_RESULTS=1 — connecting to Postgres for dashboard writes"
- "Validator DB storage enabled" (on success)
- "STORE_DB_RESULTS=1 but Postgres connection failed — dashboard writes disabled for this process" (on failure)
2. Halt-path DB writes. The daemon explicitly cleared crown_holders for the halted window and advanced sync_cursor. Without this, the dashboard kept showing pre-halt holders while the validator had recycled the pool. Adds DatabaseStorage.flush_halt_window and wires it into score_and_reward_miners' halted branch.
3. Live current-crown halt-awareness. snapshot_current_crown_holders now accepts halted=True and returns empty rows for every direction; the per-forward writer in forward.py checks contract_is_halted (renamed from _contract_is_halted to make it module-public) and passes it through. During a halt the live table is cleared so the UI matches the recycle semantics.
4. Off-by-one sync_cursor. flush_scoring_window writes range [window_start, window_end) exclusive of window_end, so the last block actually flushed is window_end - 1. The cursor was claiming window_end, so readers were lied to about freshness by one block.

Skipping the multi-validator cursor GREATEST fix from the audit per operator: only a single validator writes the dashboard DB in practice.

All 563 existing tests pass; no test changes needed since the new paths are gated by the existing MagicMock stub returning is_enabled=False.
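The off-by-one in fix 4 is easy to see with a half-open range. The sketch below is illustrative only; `flushed_blocks` and `cursor_after_flush` are hypothetical helper names, not functions from the storage module.

```python
def flushed_blocks(window_start: int, window_end: int) -> range:
    """Blocks written by one flush: the half-open range [window_start, window_end)."""
    return range(window_start, window_end)


def cursor_after_flush(window_start: int, window_end: int) -> int:
    """The cursor should name the last block actually written, not window_end."""
    if window_end <= window_start:
        raise ValueError("empty window: nothing was flushed")
    return window_end - 1
```

With a window of `[700, 1000)`, the last flushed block is 999, so the cursor is 999 and the next round's `window_start` of 1000 is exactly `cursor + 1`.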

Two perf reductions on the forward-step cost introduced by the direct-DB-storage rollout, both targeting calls that fire every ~12s under STORE_DB_RESULTS=1.

1. Halt RPC cache. contract_is_halted now delegates to bounds_cache.halted() with a 5-block (~60s) TTL. The per-forward live-crown snapshot was hitting the substrate RPC every step; halt only flips via admin tx so a short cache cuts ~50-200ms off most forward steps. RPC failures fall back to the last cached value (or False if none) so transient flakes don't churn the live-crown writer.
2. Batched rate reconstruction. state_store gains get_latest_rates_before(from_chain, to_chain, block): one query per direction with ROW_NUMBER() OVER (PARTITION BY hotkey ORDER BY block DESC, id DESC) instead of N point lookups. reconstruct_window_start_state filters the dict by membership in rewardable_hotkeys in Python. Same tie-break semantics as the single-row form. Saves ~10-30ms on every snapshot, more as miner count grows.

New SQL lives in allways/validator/state_store_queries.py, a first step toward migrating state_store SQL out of inline method bodies per the project's "SQL goes in queries.py" convention.

All 563 tests pass.
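The batched lookup in item 2 can be illustrated with a small runnable sketch. It uses the standard-library `sqlite3` module rather than the validator's state store, and the `rate_events` table, its columns, and the inclusive `block <= ?` cutoff are assumptions for the example; only the `ROW_NUMBER()` ordering is taken from the commit message.

```python
import sqlite3
from typing import Dict

# Sketch query, not the PR's actual SQL: latest row per hotkey at or before a block.
SKETCH_LATEST_RATES_BEFORE = """
SELECT hotkey, rate FROM (
    SELECT hotkey, rate,
           ROW_NUMBER() OVER (
               PARTITION BY hotkey ORDER BY block DESC, id DESC
           ) AS rn
    FROM rate_events
    WHERE from_chain = ? AND to_chain = ? AND block <= ?
) WHERE rn = 1
"""


def get_latest_rates_before(conn: sqlite3.Connection, from_chain: str,
                            to_chain: str, block: int) -> Dict[str, float]:
    """One query per direction instead of one point lookup per hotkey."""
    cur = conn.execute(SKETCH_LATEST_RATES_BEFORE, (from_chain, to_chain, block))
    return {hotkey: rate for hotkey, rate in cur}


def filter_rewardable(rates: Dict[str, float], rewardable: set) -> Dict[str, float]:
    """Membership filtering happens in Python, after the batched query."""
    return {hk: r for hk, r in rates.items() if hk in rewardable}
```

The `id DESC` tie-break means that when a hotkey posts two rates in the same block, the later-inserted row wins, matching the single-row form's behavior as described above.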

contract_is_halted was running once per forward step (~12s) to gate the live-crown snapshot's halt-aware path. Halt is rare and the dashboard already signals it loudly via the HaltBanner + top-right "paused" indicator (both fed by /halt off contract_events, not the contract RPC), so an ~2h lag between an actual halt and the live current_crown_holders table being cleared is acceptable, and worth not burning 1 RPC per minute (or even 1 every 5 blocks under the cache) just to detect a state that flips a few times a year.

Changes:
- forward.py per-step snapshot block no longer calls contract_is_halted; snapshot_current_crown_holders called bare.
- snapshot_current_crown_holders loses the halted= parameter (dead with no caller).
- _flush_halt_window now also clears current_crown_holders by calling upsert_current_crown_snapshot with empty rows per direction. One clear per scoring round at halt-detection time.

Net halt-RPC frequency returns to pre-direct-DB-storage cadence (~1 per scoring round, ~every 2h), even though the bounds_cache.halted() helper stays in place for any future caller.

All 563 tests pass.

…-db-storage # Conflicts: # allways/validator/scoring.py

The queries-module convention isn't worth a separate file for a single constant; inline the SQL at its one call site (matching the singular get_latest_rate_before form) and remove the module.

snapshot_current_crown_holders now passes executable_rate_check into crown_holders_at_instant, matching the scoring/ledger path (#395). Bounds come from the 300-block-TTL bounds_cache, so no per-step RPC; both-zero on read failure is the unset sentinel (permissive), preserving prior behavior. Closes the gap where the live current_crown_holders table could show an out-of-bounds-rate holder the historical ledger drops.

The scoring round fired on a forward-step count (step % SCORING_WINDOW_BLOCKS) and wrote only the trailing SCORING_WINDOW_BLOCKS window. A forward pass spans several blocks (~5 on mainnet, more on testnet under RPC limits), so rounds fired far less often than once per window: at ~5 blocks/step a 600-block window fired every ~3000 blocks, leaving ~80% of blocks unscored. That made the per-block crown_holders ledger sparse (a regression vs the continuous alw-utils indexer this PR replaces) and meant reward scoring sampled ~2h out of every ~10h instead of running continuously.

Fix:
- due_for_scoring(): block-based gate; fire once on a fresh process, then every SCORING_WINDOW_BLOCKS *blocks*, independent of forward-pass length.
- scoring_window_bounds(): anchor window_start to the last-scored block so consecutive rounds tile with no per-block gap, capped at MAX_SCORING_BACKFILL_BLOCKS so a post-outage catch-up round can't replay an unbounded span.
- Validator seeds last_scored_block one window back; score_and_reward_miners advances it only after a round completes (a failed round retries its window).

Both helpers are pure and unit-tested for tiling, overshoot, and the backfill cap. Replay/flush/storage machinery is unchanged; it's already window-param'd.

A long stall isn't worth recovering — the event-watcher only reconstructs ~one window back on restart anyway. Cap at 2*SCORING_WINDOW_BLOCKS: enough headroom over one window that forward-pass overshoot can't re-open a gap, while a post-stall round just resumes a couple hours back instead of ~16h.

Halves the miner feedback lag — scores, the per-block crown_holders ledger, and weights-driving data all refresh every ~1h instead of ~2h. The window is the only smoothing (EMA alpha=1.0), and 1h is near-identical to 2h for steady miners while tracking recent activity tighter; the main trade is a slightly sharper busy-while-fulfilling dip, a pre-existing crown/busy property. SCORING_WINDOW_BLOCKS 600 -> 300; every lookback (window_start, prune cutoffs, event-watcher backfill, last_scored seed) and the 2x backfill cap follow the constant automatically. Tests that hardcoded blocks for the 600-wide window repositioned to fit 300 (no assertion changes).

This PR is the wiring — the module is now called from the scoring run and the per-forward snapshot, so the 'follow-up' notes were inaccurate.

anderdc and others added 3 commits May 26, 2026 16:32

style: auto-fix pre-commit hooks

2570fa8

anderdc mentioned this pull request May 27, 2026

feat(crown): note that round-bounded panels refresh every ~2h entrius/allways-ui#103

Merged

6 tasks

anderdc and others added 10 commits May 27, 2026 16:59

style: auto-fix pre-commit hooks

6cd9757

style: auto-fix pre-commit hooks

ac3d06c

style: auto-fix pre-commit hooks

bce5c54

Merge remote-tracking branch 'origin/test' into feat/validator-direct…

249ba98

…-db-storage # Conflicts: # allways/validator/scoring.py

Inline BATCH_LATEST_RATES_BEFORE, drop state_store_queries.py

c24e0ab


anderdc force-pushed the feat/validator-direct-db-storage branch from b919097 to 6bf3c8d Compare May 28, 2026 21:15

anderdc added 4 commits May 28, 2026 16:19

Trim verbose comments on the scoring-cadence change

dc9b713

Fix stale 'not yet wired up' comments in storage module

556e9ac


entrius approved these changes May 28, 2026


entrius merged commit eeb2991 into test May 28, 2026
3 checks passed

entrius deleted the feat/validator-direct-db-storage branch May 28, 2026 22:04

LandynDev mentioned this pull request May 29, 2026

Log crown holder UID + rate per forward step #429

Open

5 tasks

entrius merged 17 commits into test from feat/validator-direct-db-storage
