Skip to content

Tiredsolid scoreboard (internal)#669

Open
armcconnell wants to merge 33 commits into
mainfrom
tiredsolid-scoreboard
Open

Tiredsolid scoreboard (internal)#669
armcconnell wants to merge 33 commits into
mainfrom
tiredsolid-scoreboard

Conversation

@armcconnell

Copy link
Copy Markdown
Contributor

Internal-only scoreboard for a new market-data venue (sibling to the Edge scoreboard). Gated to allowed-domain Google users.

@github-actions

Copy link
Copy Markdown

🔗 Preview: https://pr-669.data.malbeclabs.com

The page-cache refresh aggregated the feeds summary table with uniqExact over
the 6-column race key and quantileExact over lead times. Both buffer per-group
state proportional to the window's race count (~7.5M matchup rows/hour, ~768 MiB
peak). On the preview/dev ClickHouse the feeds tables are remoteSecure() proxies,
so the GROUP BY runs on the small local instance; under the 7.2 GiB server limit
with the parallel page-cache refreshes, the OvercommitTracker killed the query
(code 241), spamming ERR cache-refresh-failed logs.

Switch to bounded-memory estimators (uniqCombined, quantileTDigest): peak state
drops ~43x (768 -> 18 MiB) on real data with identical results and runtime, at a
sub-1% approximation invisible on the scoreboard. Does not change rows read (the
event_ts filter still does not prune through the proxy; deferred separately).
The 24h composite-latency query grouped every feed source by a source-keyed
tuple over the raw observations table in a first CTE, then re-grouped. On the
memory-constrained proxy ClickHouse (feeds is a remoteSecure() proxy, so the
GROUP BY runs locally) that hash table tripped the OvercommitTracker (code 241),
logging a WARN every refresh.

Collapse to a single GROUP BY over (metro, symbol, block): min(recv) is
associative so the per-source pre-aggregation was redundant, uniqExact(source)>=2
enforces the '>=2 DZ feeds' rule directly, and filtering to tob_ feeds in WHERE
drops competitor rows before the hash table. Add max_bytes_before_external_group_by
as a disk-spill safety net. Verified equivalent to the old form over 1h; the 24h
query now peaks ~13 MiB (was OOMing at 7.2 GiB) in ~7s.
HypeRPC (hyperpc_shared_bbo) delivers quotes a median ~100s stale during
recurring multi-hour backlog episodes (vs ~300ms for the other feeds), which
made DoubleZero appear to win by seconds-to-minutes. That is a HypeRPC-side feed
fault, not a real result, so withhold it from every measurement (per-competitor,
per-node, headline totals, recent races) via a documented hyperliquidExcludedFeeds
list until the feed is fixed.
Add a per-symbol 'updated Xs ago' line derived from the newest race event_ts,
ticking with the existing 5s clock, so a stale or gap-filled symbol is visible
at a glance.
@armcconnell armcconnell force-pushed the tiredsolid-scoreboard branch from c9de871 to 1472274 Compare July 1, 2026 02:26
The scoreboard ran two full-table scans over the proxied summary table (compQ
for per-competitor, nodeQ for per-node). On the memory-constrained preview
ClickHouse the refresh needed both scans to survive the OvercommitTracker in the
same attempt while other page-cache queries had the 7.2 GiB server near its
ceiling, so it was killed as a collateral victim nearly every cycle and its
cached blob went stale (never picking up composite_latency).

A race key includes measurement_node_id, so per-competitor totals are the
per-node cells rolled up over the node dimension. Replace the two queries with
one 'GROUP BY competitor, measurement_node_id WITH ROLLUP' scan: the detail rows
give the per-vantage breakdown and the rolled-up rows give per-competitor totals.
Halves the ClickHouse work and, more importantly, makes the refresh one atomic
query that succeeds-or-retries as a unit. Verified equivalent to the old form
against live data (within uniqCombined's sub-1% approximation).
The headline feed-latency stat took the earliest receive per (metro, symbol,
block), so slower metros inflated it (p50 ~219ms). Scope it to the Tokyo vantage
point — DZ's fastest metro, nearest Hyperliquid — matching Grafana's 'tob_*
(first-arrival, all DZ)' panel: min receive across the tob_ DZ feeds at Tokyo per
block, still requiring >=2 feeds delivered (which keeps p99 realistic). Now
p50/p90/p99 read ~162/187/243 over 24h (~161/180/220 over 3h, matching Grafana's
160/176/208). Relabel the UI 'Composite feed latency' -> 'Tokyo feed latency'
with a blocktime->receive / first-arrival-across-DZ-Tokyo-feeds description.
@armcconnell armcconnell force-pushed the tiredsolid-scoreboard branch from 47eae3f to 440e04c Compare July 1, 2026 13:22

@nikw9944 nikw9944 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated correctness + ClickHouse performance review. Findings are labeled by severity (🔴 high / 🟠 medium / 🟢 low). Nice feature overall — SQL is safely parameterized and it degrades gracefully when the proxy table is absent.

Comment thread api/handlers/hyperliquid_scoreboard.go Outdated
SELECT competitor, measurement_node_id, any(location_code) AS location_code,
uniqCombinedIf(rk, dz_won = 1) AS dz_wins,
uniqCombinedIf(rk, dz_won = 0) AS dz_losses,
toFloat64(quantileTDigestIf(0.5)(lead_ms, dz_won = 1)) AS lead_p50,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 High — NaN quantile breaks the entire scoreboard response

These percentiles are computed only over DZ-won rows (dz_won = 1). For any (competitor, node) cell — or the per-competitor rollup — where DZ won zero races in the window, the If predicate matches no rows and ClickHouse's quantileTDigest over an empty set returns NaN. That NaN is scanned into LeadP50Ms/LeadP95Ms (float64), and writeJSONjson.NewEncoder().Encode() errors on NaN (json: unsupported value: NaN), producing a broken/empty 200 body.

Failure scenario: at a single vantage node far from Tokyo (or in a thin 1h window) DZ loses all its head-to-head races vs one competitor → that one cell is NaN → the entire scoreboard fails to encode for all users. It also breaks the page-cache worker: FetchHyperliquidScoreboardDataWritePageCache marshaling fails the same way, so the cache never populates and every request falls through to the (also-failing) live query. No test covers the 0-DZ-wins case.

Fix: coalesce NaN in SQL (e.g. wrap in ifNull/replace isNaN→0) or sanitize the float64s in Go before encoding.

FROM %[2]s.hyperliquid_bbo_feed_race_summary
WHERE feed != loser_feed AND loser_feed != ''
AND (startsWith(feed,'tob_') != startsWith(loser_feed,'tob_')) %[3]s
AND event_ts >= now() - INTERVAL %[4]s

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Medium — 24h/7d (and symbol-filtered) requests do full multi-day scans on the request path

This filters on event_ts >= now() - INTERVAL <window> and symbol IN (…), but the table is ORDER BY (measurement_node_id, symbol, …) with PARTITION BY toDate(event_ts). With no measurement_node_id predicate, neither filter uses the primary index — the time filter only prunes at day granularity and symbol (2nd sort column) can't skip granules without the leading key.

hyperliquidScoreboardCacheKey returns "" for any window other than 1h and for any symbol request, so those bypass the page cache and run live under the 60s request timeout. A 7d window therefore scans ~7–8 full day-partitions (order of ~10⁹ matchup rows at the ~7.5M rows/hour noted in the code) computing uniqCombined/tdigest per group, on the memory-constrained proxy — likely very slow or a timeout per click.

Consider caching 24h/7d too (with a longer refresh interval), or bounding/rejecting them.

Comment thread api/handlers/hyperliquid_scoreboard.go Outdated

// hyperliquidScoreboardCacheKey returns the cache key for the default request shape, or "".
func hyperliquidScoreboardCacheKey(r *http.Request) string {
if r.URL.Query().Get("since_ts") != "" || r.URL.Query().Get("symbol") != "" {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low — since_ts param bypasses cache but is never applied

This skips the cache when since_ts is present (implying support), but GetHyperliquidScoreboard never parses since_ts and FetchHyperliquidScoreboardData hardcodes fetchHyperliquidRecentRaces(ctx, time.Time{}, 10) (line 370). A ?since_ts=… request therefore pays for an uncached live query yet still returns the default last-15-minutes races. Either wire it through or drop the since_ts check here. (Frontend doesn't send it today, so impact is latent.)

Comment thread api/handlers/hyperliquid_scoreboard.go Outdated
symbol := r.URL.Query().Get("symbol")

// Degrade gracefully if the proxy table isn't available (e.g. local dev).
if !a.hyperliquidFeedsTableExists(ctx) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low — redundant EXISTS TABLE round-trip per uncached request

On the cache-MISS path this calls hyperliquidFeedsTableExists, then immediately calls FetchHyperliquidScoreboardData, which calls it again (line 208) — two EXISTS TABLE queries to ClickHouse per request. Cheap individually, but a pointless extra round-trip. Drop this handler-level check (Fetch already degrades gracefully) or pass the result through.

Comment thread api/handlers/hyperliquid_scoreboard.go Outdated
resp.DZWinSharePct = 100.0 * float64(globalWins) / float64(globalRaces)
}

recent, err := a.fetchHyperliquidRecentRaces(ctx, time.Time{}, 10)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low — three independent scans run sequentially on the request path

The main aggregation, fetchHyperliquidRecentRaces (15-min scan), and fetchHyperliquidPrices (1-min observations scan) run sequentially. They're independent and could run concurrently (errgroup). Harmless for the cached 1h default (runs on the worker), but on the uncached 24h/7d live path their latencies stack under the same 60s budget.

const isEdgeRoute = isShredsRoute
const isHyperliquidScoreboardRoute = location.pathname === '/dz/hyperliquid/scoreboard'
const isHyperliquidRoute = location.pathname.startsWith('/dz/hyperliquid')
const isEdgeRoute = isShredsRoute || isHyperliquidRoute

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low / cosmetic — sidebar highlight inconsistency for the Hyperliquid route

isEdgeRoute = isShredsRoute || isHyperliquidRoute makes the collapsed "Edge" icon highlight on /dz/hyperliquid (line 286), but the expanded Edge parent link (line 559) keys off isShredsRoute only, so it won't reflect the hyperliquid route. Purely visual; worth confirming the intended grouping (Hyperliquid also renders as its own expanded section).

- Guard NaN lead-time percentiles: a (competitor, node) cell where DoubleZero
  won zero races made quantileTDigestIf aggregate over an empty set, returning
  NaN and failing JSON encoding of the entire response (and poisoning the page
  cache). Coalesce to 0 via ifNotFinite; add a regression test.
- Cache the 24h/7d scoreboards on the slow background refresher so those windows
  are served from cache instead of running a multi-day scan on the request path.
- Run the three independent request-path scans (main ROLLUP, recent races,
  live prices) concurrently via errgroup instead of sequentially.
- Drop the dead since_ts cache-key check (never parsed or applied) and the
  redundant per-request EXISTS TABLE round-trip (Fetch already degrades).
@armcconnell

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review — all findings addressed in 2792d91, with two small follow-ups after. Mapping each:

🔴 NaN percentile → broken response — Fixed. Both lead-time percentiles are now wrapped in ifNotFinite(…, 0), so a (competitor, node) cell where DoubleZero won zero races serializes as 0 instead of NaN. Added a regression test (TestHyperliquidScoreboard_ZeroDZWins_Encodable) that reproduces the exact json: unsupported value: NaN failure and now passes.

🟠 24h/7d full scans on the request path — Fixed. The 24h/7d windows are now precomputed on the slow background refresher (StartHyperliquidBackgroundRefresher) and served from the page cache under per-window keys (hyperliquidScoreboardWindowKey), so the request path no longer runs a multi-day scan per click. 1h stays on the page-cache worker; a cold-cache request falls back to the same bounded live path as before.

🟢 since_ts bypasses cache but never applied — Fixed. Dropped the dead since_ts branch from hyperliquidScoreboardCacheKey (it was never parsed).

🟢 Redundant EXISTS TABLE per request — Fixed. Removed the handler-level existence check; FetchHyperliquidScoreboardData already degrades to an empty-but-valid response, so it's a single round-trip now (still covered by TestGetHyperliquidScoreboard_MissingTable).

🟢 Three sequential scans — Fixed. The main ROLLUP scan, recent-races scan, and price lookup now run concurrently via errgroup (verified race-clean).

🟢 Sidebar highlight — No change; I think this is a false positive. Line 559 is the Shreds item, not an "Edge parent" — Edge is a section-header <span>, and Shreds/<the new venue> are siblings that each highlight on their own route. Keying the Shreds link off the other route would incorrectly highlight Shreds while viewing that page; the collapsed group icon intentionally highlights for either. Happy to adjust if I've misread the intended grouping.

Two follow-ups also in the branch since the review: added Dwellir as a competitor, and reordered the per-vantage table (TYO/CHI/NYC) with datacenter facility names + hover detail.

Verified: go build, go vet, gofmt, all scoreboard unit tests pass under -race; tsc clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants