Tiredsolid scoreboard (internal)#669
Conversation
|
🔗 Preview: https://pr-669.data.malbeclabs.com |
…ymbol recent races
The page-cache refresh aggregated the feeds summary table with uniqExact over the 6-column race key and quantileExact over lead times. Both buffer per-group state proportional to the window's race count (~7.5M matchup rows/hour, ~768 MiB peak). On the preview/dev ClickHouse the feeds tables are remoteSecure() proxies, so the GROUP BY runs on the small local instance; under the 7.2 GiB server limit with the parallel page-cache refreshes, the OvercommitTracker killed the query (code 241), spamming ERR cache-refresh-failed logs. Switch to bounded-memory estimators (uniqCombined, quantileTDigest): peak state drops ~43x (768 -> 18 MiB) on real data with identical results and runtime, at a sub-1% approximation invisible on the scoreboard. Does not change rows read (the event_ts filter still does not prune through the proxy; deferred separately).
The 24h composite-latency query grouped every feed source by a source-keyed tuple over the raw observations table in a first CTE, then re-grouped. On the memory-constrained proxy ClickHouse (feeds is a remoteSecure() proxy, so the GROUP BY runs locally) that hash table tripped the OvercommitTracker (code 241), logging a WARN every refresh. Collapse to a single GROUP BY over (metro, symbol, block): min(recv) is associative so the per-source pre-aggregation was redundant, uniqExact(source)>=2 enforces the '>=2 DZ feeds' rule directly, and filtering to tob_ feeds in WHERE drops competitor rows before the hash table. Add max_bytes_before_external_group_by as a disk-spill safety net. Verified equivalent to the old form over 1h; the 24h query now peaks ~13 MiB (was OOMing at 7.2 GiB) in ~7s.
HypeRPC (hyperpc_shared_bbo) delivers quotes a median ~100s stale during recurring multi-hour backlog episodes (vs ~300ms for the other feeds), which made DoubleZero appear to win by seconds-to-minutes. That is a HypeRPC-side feed fault, not a real result, so withhold it from every measurement (per-competitor, per-node, headline totals, recent races) via a documented hyperliquidExcludedFeeds list until the feed is fixed.
Add a per-symbol 'updated Xs ago' line derived from the newest race event_ts, ticking with the existing 5s clock, so a stale or gap-filled symbol is visible at a glance.
c9de871 to
1472274
Compare
The scoreboard ran two full-table scans over the proxied summary table (compQ for per-competitor, nodeQ for per-node). On the memory-constrained preview ClickHouse the refresh needed both scans to survive the OvercommitTracker in the same attempt while other page-cache queries had the 7.2 GiB server near its ceiling, so it was killed as a collateral victim nearly every cycle and its cached blob went stale (never picking up composite_latency). A race key includes measurement_node_id, so per-competitor totals are the per-node cells rolled up over the node dimension. Replace the two queries with one 'GROUP BY competitor, measurement_node_id WITH ROLLUP' scan: the detail rows give the per-vantage breakdown and the rolled-up rows give per-competitor totals. Halves the ClickHouse work and, more importantly, makes the refresh one atomic query that succeeds-or-retries as a unit. Verified equivalent to the old form against live data (within uniqCombined's sub-1% approximation).
The headline feed-latency stat took the earliest receive per (metro, symbol, block), so slower metros inflated it (p50 ~219ms). Scope it to the Tokyo vantage point — DZ's fastest metro, nearest Hyperliquid — matching Grafana's 'tob_* (first-arrival, all DZ)' panel: min receive across the tob_ DZ feeds at Tokyo per block, still requiring >=2 feeds delivered (which keeps p99 realistic). Now p50/p90/p99 read ~162/187/243 over 24h (~161/180/220 over 3h, matching Grafana's 160/176/208). Relabel the UI 'Composite feed latency' -> 'Tokyo feed latency' with a blocktime->receive / first-arrival-across-DZ-Tokyo-feeds description.
47eae3f to
440e04c
Compare
nikw9944
left a comment
There was a problem hiding this comment.
Automated correctness + ClickHouse performance review. Findings are labeled by severity (🔴 high / 🟠 medium / 🟢 low). Nice feature overall — SQL is safely parameterized and it degrades gracefully when the proxy table is absent.
| SELECT competitor, measurement_node_id, any(location_code) AS location_code, | ||
| uniqCombinedIf(rk, dz_won = 1) AS dz_wins, | ||
| uniqCombinedIf(rk, dz_won = 0) AS dz_losses, | ||
| toFloat64(quantileTDigestIf(0.5)(lead_ms, dz_won = 1)) AS lead_p50, |
There was a problem hiding this comment.
🔴 High — NaN quantile breaks the entire scoreboard response
These percentiles are computed only over DZ-won rows (dz_won = 1). For any (competitor, node) cell — or the per-competitor rollup — where DZ won zero races in the window, the If predicate matches no rows and ClickHouse's quantileTDigest over an empty set returns NaN. That NaN is scanned into LeadP50Ms/LeadP95Ms (float64), and writeJSON → json.NewEncoder().Encode() errors on NaN (json: unsupported value: NaN), producing a broken/empty 200 body.
Failure scenario: at a single vantage node far from Tokyo (or in a thin 1h window) DZ loses all its head-to-head races vs one competitor → that one cell is NaN → the entire scoreboard fails to encode for all users. It also breaks the page-cache worker: FetchHyperliquidScoreboardData → WritePageCache marshaling fails the same way, so the cache never populates and every request falls through to the (also-failing) live query. No test covers the 0-DZ-wins case.
Fix: coalesce NaN in SQL (e.g. wrap in ifNull/replace isNaN→0) or sanitize the float64s in Go before encoding.
| FROM %[2]s.hyperliquid_bbo_feed_race_summary | ||
| WHERE feed != loser_feed AND loser_feed != '' | ||
| AND (startsWith(feed,'tob_') != startsWith(loser_feed,'tob_')) %[3]s | ||
| AND event_ts >= now() - INTERVAL %[4]s |
There was a problem hiding this comment.
🟠 Medium — 24h/7d (and symbol-filtered) requests do full multi-day scans on the request path
This filters on event_ts >= now() - INTERVAL <window> and symbol IN (…), but the table is ORDER BY (measurement_node_id, symbol, …) with PARTITION BY toDate(event_ts). With no measurement_node_id predicate, neither filter uses the primary index — the time filter only prunes at day granularity and symbol (2nd sort column) can't skip granules without the leading key.
hyperliquidScoreboardCacheKey returns "" for any window other than 1h and for any symbol request, so those bypass the page cache and run live under the 60s request timeout. A 7d window therefore scans ~7–8 full day-partitions (order of ~10⁹ matchup rows at the ~7.5M rows/hour noted in the code) computing uniqCombined/tdigest per group, on the memory-constrained proxy — likely very slow or a timeout per click.
Consider caching 24h/7d too (with a longer refresh interval), or bounding/rejecting them.
|
|
||
| // hyperliquidScoreboardCacheKey returns the cache key for the default request shape, or "". | ||
| func hyperliquidScoreboardCacheKey(r *http.Request) string { | ||
| if r.URL.Query().Get("since_ts") != "" || r.URL.Query().Get("symbol") != "" { |
There was a problem hiding this comment.
🟢 Low — since_ts param bypasses cache but is never applied
This skips the cache when since_ts is present (implying support), but GetHyperliquidScoreboard never parses since_ts and FetchHyperliquidScoreboardData hardcodes fetchHyperliquidRecentRaces(ctx, time.Time{}, 10) (line 370). A ?since_ts=… request therefore pays for an uncached live query yet still returns the default last-15-minutes races. Either wire it through or drop the since_ts check here. (Frontend doesn't send it today, so impact is latent.)
| symbol := r.URL.Query().Get("symbol") | ||
|
|
||
| // Degrade gracefully if the proxy table isn't available (e.g. local dev). | ||
| if !a.hyperliquidFeedsTableExists(ctx) { |
There was a problem hiding this comment.
🟢 Low — redundant EXISTS TABLE round-trip per uncached request
On the cache-MISS path this calls hyperliquidFeedsTableExists, then immediately calls FetchHyperliquidScoreboardData, which calls it again (line 208) — two EXISTS TABLE queries to ClickHouse per request. Cheap individually, but a pointless extra round-trip. Drop this handler-level check (Fetch already degrades gracefully) or pass the result through.
| resp.DZWinSharePct = 100.0 * float64(globalWins) / float64(globalRaces) | ||
| } | ||
|
|
||
| recent, err := a.fetchHyperliquidRecentRaces(ctx, time.Time{}, 10) |
There was a problem hiding this comment.
🟢 Low — three independent scans run sequentially on the request path
The main aggregation, fetchHyperliquidRecentRaces (15-min scan), and fetchHyperliquidPrices (1-min observations scan) run sequentially. They're independent and could run concurrently (errgroup). Harmless for the cached 1h default (runs on the worker), but on the uncached 24h/7d live path their latencies stack under the same 60s budget.
| const isEdgeRoute = isShredsRoute | ||
| const isHyperliquidScoreboardRoute = location.pathname === '/dz/hyperliquid/scoreboard' | ||
| const isHyperliquidRoute = location.pathname.startsWith('/dz/hyperliquid') | ||
| const isEdgeRoute = isShredsRoute || isHyperliquidRoute |
There was a problem hiding this comment.
🟢 Low / cosmetic — sidebar highlight inconsistency for the Hyperliquid route
isEdgeRoute = isShredsRoute || isHyperliquidRoute makes the collapsed "Edge" icon highlight on /dz/hyperliquid (line 286), but the expanded Edge parent link (line 559) keys off isShredsRoute only, so it won't reflect the hyperliquid route. Purely visual; worth confirming the intended grouping (Hyperliquid also renders as its own expanded section).
- Guard NaN lead-time percentiles: a (competitor, node) cell where DoubleZero won zero races made quantileTDigestIf aggregate over an empty set, returning NaN and failing JSON encoding of the entire response (and poisoning the page cache). Coalesce to 0 via ifNotFinite; add a regression test. - Cache the 24h/7d scoreboards on the slow background refresher so those windows are served from cache instead of running a multi-day scan on the request path. - Run the three independent request-path scans (main ROLLUP, recent races, live prices) concurrently via errgroup instead of sequentially. - Drop the dead since_ts cache-key check (never parsed or applied) and the redundant per-request EXISTS TABLE round-trip (Fetch already degrades).
|
Thanks for the thorough review — all findings addressed in 🔴 NaN percentile → broken response — Fixed. Both lead-time percentiles are now wrapped in 🟠 24h/7d full scans on the request path — Fixed. The 24h/7d windows are now precomputed on the slow background refresher ( 🟢 🟢 Redundant 🟢 Three sequential scans — Fixed. The main ROLLUP scan, recent-races scan, and price lookup now run concurrently via 🟢 Sidebar highlight — No change; I think this is a false positive. Line 559 is the Shreds item, not an "Edge parent" — Two follow-ups also in the branch since the review: added Dwellir as a competitor, and reordered the per-vantage table (TYO/CHI/NYC) with datacenter facility names + hover detail. Verified: |
Internal-only scoreboard for a new market-data venue (sibling to the Edge scoreboard). Gated to allowed-domain Google users.