observability: Overview redesign + Database dashboard + ECS / Prometheus wiring #4699

Open

lukemelia wants to merge 23 commits into main from grafana-stack/06-overview-and-service-stubs

Conversation


@lukemelia lukemelia commented May 7, 2026

Summary

Rebuilds the Overview dashboard around per-service KPIs and ECS resource cards, adds a new Database dashboard, wires CloudWatch ECS metrics (with a local-dev placeholder fallback so empty panels don't look broken), and moves per-realm reindex from Indexing to Realms (where the realm context lives). Most of this PR is the Overview reshaping; the rest is supporting plumbing.

Screenshots

Overview

Overview dashboard

Top to bottom: Realms / Users cumulative-growth KPIs · Web Requests / Job Queue / Synapse rate KPIs (smooth gradient lines, mean in legend, drilldown icons) · Realm Server / Prerender / Prerender Mgr / Worker ECS panels showing the local-dev placeholder · Synapse ECS panel pulling from local Prometheus (CPU / Mem / Up) · Postgres DB card (txn/min, drills to the new Database dashboard) · Active Alerts · indexing pipeline stats and throughput.

Database (new dashboard)

Database dashboard

KPIs (DB size, active connections, idle in txn, txn/min, cache hit ratio gauge, long-running queries) · connections-by-state stacked bar · transactions/min · top tables by size · cumulative writes by table · idle-in-txn backends · long-running queries · tables overdue for vacuum · markdown footer naming what's NOT covered without pg_stat_statements / pgstattuple / pg_exporter.

Realms (with the moved reindex panel)

Realms dashboard

Per-realm stat row · Grant Permission · Realm Permissions table · Indexing status · Operator Actions: Reindex this realm (new — moved here from Indexing). The realm comes from the existing ${realm_url} template variable; blast-radius gating disables the button while a job is in flight.

Indexing (trimmed to Full reindex only)

Indexing dashboard, trimmed

The Operator Actions panel now holds only Pending full-reindex indicator + Reindex ALL realms button. h: 11 → 4. The per-realm reindex moved to the Realms dashboard.

Overview dashboard (overview.json) — full reshape

| Row | Content |
| --- | --- |
| 0 | Realms + Users stat panels with cumulative-growth sparklines (drill to realms / users dashboards) |
| 1 | Web Requests (HTTP req/min + err/min, smooth gradient line) · Job Queue (Jobs/min + Failures/min, gradient line, drills to Job Queue dashboard) · Synapse (Events/min, gradient line, drills to vendored Synapse dashboard) |
| 2 | Five ECS resource cards (Realm Server / Prerender / Prerender Mgr / Worker / Synapse) showing CPU / Memory / Tasks · Postgres DB card (txn/min, drills to new Database dashboard) |
| 3 | Active Alerts — alertlist (absorbs the deleted worker-status.json) |
| 4 | Four indexing-pipeline stats (Pending / In-flight / Oldest Pending / Errors) |
| 5 | Indexing throughput timeseries |
| 6 | Markdown text panel linking to other dashboards by category |

Notable shape choices:

  • Boxel brand palette applied to series colors via field overrides (purple / blue / teal / red / yellow per boxel-ui/addon/src/styles/variables.css). Threshold-based panels stay on Grafana's named green/yellow/red since those colors carry semantic meaning.
  • Zero-fill on rate panels — LogQL queries append or vector(0); Postgres queries use $__timeGroupAlias(col, '1m', 0) fill-mode (see the SQL sketch after this list). Combined with lineInterpolation: smooth and gradientMode: opacity, the lines flow continuously through idle periods rather than breaking into segments.
  • Synapse Row 2 panel queries Prometheus, not CloudWatch — synapse exposes process/_synapse metrics via the synapse-prometheus datasource (local Prometheus locally, AMP in staging/production), giving us finer-grained data than CloudWatch's MemoryUtilization / RunningTaskCount.
  • ECS panels show local-dev placeholder — a second jq pass in apply.sh, gated on env_name == "local", swaps any panel with datasource.type == "cloudwatch" for a text markdown placeholder explaining the panel only works in staging/production. Committed JSON keeps real CloudWatch queries; staging/prod apply pushes them unchanged.
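A minimal sketch of the Postgres side of that zero-fill, assuming a `jobs` table with a `finished_at` timestamp and resolved/rejected statuses (illustrative names; the committed queries may differ):

```sql
-- $__timeGroupAlias(finished_at, '1m', 0) expands to a 1-minute bucket aliased "time";
-- the third argument is the Grafana Postgres fill-mode, which emits 0 for empty buckets
-- so the smooth gradient line stays continuous through idle periods.
SELECT
  $__timeGroupAlias(finished_at, '1m', 0),
  count(*) FILTER (WHERE status = 'resolved') AS "Jobs/min",
  count(*) FILTER (WHERE status = 'rejected') AS "Failures/min"
FROM jobs
WHERE $__timeFilter(finished_at)
GROUP BY 1
ORDER BY 1;
```

On the LogQL side, `or vector(0)` plays the same role: `vector(0)` is a constant zero-valued series, and `or` falls through to it whenever the left-hand query returns nothing.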

Database dashboard (database.json — new, uid boxeldatabase1)

SQL-only dashboard for the boxel application Postgres, queried directly via the existing boxel-db datasource — no pg_exporter dependency. Inspired by the upstream pg_exporter community dashboard but trimmed to panels operators act on:

  • KPIs: Database size, Active connections, Idle in Txn, Txn/min, Cache hit ratio gauge, Long-running queries
  • Connections by state stacked-bar timeseries
  • Top 20 tables by size (with dead-tuple ratio + last vacuum/analyze)
  • Cumulative writes by table + Idle-in-txn backends
  • Long-running queries + Tables overdue for vacuum
  • Markdown panel naming what's NOT covered (slow queries need pg_stat_statements; bloat needs pgstattuple; WAL/replication need pg_exporter)

Overview's Postgres DB stat drills here.
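Two of the KPI-row queries, sketched against the standard pg_stat views (the shape is illustrative, not the exact dashboard SQL):

```sql
-- Cache hit ratio: share of block reads served from shared buffers.
SELECT round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname = current_database();

-- Idle-in-transaction backends, as surfaced in the idle-in-txn panels.
SELECT pid, usename, now() - xact_start AS txn_age, left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
```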

Per-service stub dashboards (service-*.json)

Minimal per-service deep-dives tagged service:<name>, one each for realm-server / prerender-server / prerender-manager / worker. Each has a service-specific top-row chart (HTTP request rate from httpLogging exit lines for the three HTTP services; indexed-files/sec + error-rate from [indexing-progress] events for worker) plus a Loki-filtered logs panel below.

Overview's Row 2 ECS cards link into these.

env constant template variable

New env constant template variable (__ENV__) substituted by apply.sh to local / staging / production at apply time. Used in CloudWatch dimension values like boxel-realm-server-${env} and the local-only placeholder gate. Same substitution mechanism as the existing __REALM_SERVER_URL__ and REPLACE_AT_APPLY_TIME placeholders.
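A standalone sketch of that substitution (the real apply.sh folds it into its existing jq walk; the file names here are hypothetical):

```sh
# __ENV__ -> local | staging | production, applied to every string in the dashboard JSON.
ENV_NAME=local   # from --env
jq --arg env "$ENV_NAME" '
  walk(if type == "string" then gsub("__ENV__"; $env) else . end)
' dashboard.json > dashboard.resolved.json
```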

Reindex panel relocation

The per-realm reindex action moves from the system-wide Indexing dashboard to the realm-scoped Realms dashboard. The "Reindex ALL realms" system action stays on Indexing. Both keep the disable-while-in-flight blast-radius gating.

The Realms dashboard already had a ${realm_url} template variable feeding the rest of its panels — the new operator-action panel just reads from that.

Bug fixes (drive-by)

  • $__timeGroupAlias(col, '30s') AS time was a SQL syntax error (the macro already adds AS "time"). Fixed in both Overview's throughput timeseries and the pre-existing copy in indexing.json from #4697 (observability: split Boxel Jobs into Indexing + Job Queue dashboards); the fix is sketched below.
  • boxel-synapse → synapse Alloy relabel rule so dashboards filtering on service="synapse" work locally (matching the staging/prod ECS task family naming).
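The alias fix in miniature (`boxel_index` stands in for the real table and column; only the trailing alias changes):

```sql
-- Broken: $__timeGroupAlias already expands to floor(...) AS "time", so the
-- extra alias yields `... AS "time" AS time` and Postgres rejects it with
-- "syntax error at or near AS":
--   SELECT $__timeGroupAlias(created_at, '30s') AS time, count(*) ...
-- Fixed: let the macro supply the alias.
SELECT $__timeGroupAlias(created_at, '30s'), count(*) AS indexed
FROM boxel_index
WHERE $__timeFilter(created_at)
GROUP BY 1
ORDER BY 1;
```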

Three governing principles

  1. System dashboards aggregate; entity dashboards filter.
  2. Dashboard = context; action lives in its context. Operator actions never centralize — per-realm reindex on Realms, system reindex on Indexing.
  3. Drill-downs go down or sideways, never up. Overview links down; deep-dives link sideways to peers; Logs is the universal terminal with entity:* sideways links that pass realm_url / matrix_user_id to pivot rather than drill back up.

Test plan

  • Apply locally: cd packages/observability && ./scripts/apply.sh --env local
  • Open Overview — Row 0 shows Realms + Users with sparklines, Row 1 shows the three KPI gradient lines, Row 2 ECS cards show local-dev placeholder text (not CloudWatch error triangles), Postgres DB card shows current txn/min, Synapse Row 2 card shows real CPU / Mem / Up from local Prometheus.
  • Click each drill-down icon — Web Requests → realm-server stub, Job Queue → Job Queue dashboard, Synapse → vendored Synapse, Postgres DB → Database, Realms → Realms, Users → Users.
  • Open Database — KPIs + tables render against local boxel-db; "Top Tables" shows boxel_index, boxel_index_working etc.
  • Open Realms, pick a realm → "Operator Actions: Reindex this realm" panel renders with realm-scoped pending/in-flight; clicking the "Reindex {realm}" button hits /_grafana-reindex?realm=....
  • Open Indexing — "Operator Actions: Full reindex" shows the system-wide button only.
  • Apply against a real hosted Grafana (staging dry-run) — apply.sh --env staging --dry-run succeeds; the CloudWatch placeholder transformation is not applied (env_name guard); the env constant substitutes to staging.

Screenshot host: a dedicated pr-4699-screenshots orphan branch holds the four PNGs. Safe to delete the branch once this PR merges.

🤖 Generated with Claude Code


github-actions Bot commented May 7, 2026

Observability diff (vs staging)

No dashboard / folder changes detected against the staging Grafana.

(Run: https://github.com/cardstack/boxel/actions/runs/25528541266)


Copilot AI left a comment


Pull request overview

Adds a top-level Grafana “Overview” dashboard plus new per-service “stub” dashboards (realm-server, prerender-server, prerender-manager, worker) to establish a navigable dashboard tree for Boxel operational monitoring, and removes the now-redundant Worker Status dashboard.

Changes:

  • Added overview.json with service liveness stats, an alertlist panel, indexing pipeline stats, indexing throughput, and a dashboard directory panel.
  • Added per-service deep-dive stub dashboards under boxel-status/ that focus on request/activity rate + filtered logs.
  • Deleted worker-status.json (its alertlist is replaced by Overview’s “Active Alerts”).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| packages/observability/grafanactl/resources/dashboards/boxel-status/worker-status.json | Removes the old single-panel Worker Status dashboard. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-worker.json | Adds a worker stub with activity-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-realm-server.json | Adds a realm-server stub with HTTP request-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-prerender-server.json | Adds a prerender-server stub with HTTP request-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-prerender-manager.json | Adds a prerender-manager stub with HTTP request-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/overview.json | Adds a new top-level Overview dashboard that links to service stubs and key workflow dashboards. |


@lukemelia lukemelia force-pushed the grafana-stack/06-overview-and-service-stubs branch from 88004b7 to 6cddc7b on May 7, 2026 18:47
@lukemelia lukemelia force-pushed the grafana-stack/05-realms-and-users-entities branch from 26934fb to c027383 on May 7, 2026 19:24
@lukemelia lukemelia changed the base branch from grafana-stack/05-realms-and-users-entities to main on May 7, 2026 19:58
@lukemelia lukemelia force-pushed the grafana-stack/06-overview-and-service-stubs branch from 6cddc7b to 8c9b92d on May 7, 2026 20:10
@lukemelia lukemelia marked this pull request as ready for review on May 7, 2026 20:12
@lukemelia lukemelia requested review from a team and backspace on May 7, 2026 20:12
lukemelia and others added 12 commits May 7, 2026 18:59
The navigation backbone for the rationalized dashboard tree. Lands last
because it links to dashboards introduced in #4696, #4697, #4698.

* `overview.json` — top-of-tree dashboard tagged `overview`. Five
  service-health stats (Loki log volume in last 5m as a liveness proxy,
  with drill-down panel-link to each service's stub), full-width
  alertlist (replaces the deleted worker-status.json), four indexing
  pipeline stats, an indexing throughput timeseries, and a markdown
  text panel listing the other dashboards by category.

* `service-{realm-server,prerender-server,prerender-manager,worker}.json`
  — minimal per-service deep-dives tagged `service:<name>`. Each is a
  Loki-filtered logs panel plus a service-specific top-row chart:
    * realm-server: HTTP request rate (total/4xx/5xx) parsed from the
      `httpLogging` exit-log lines (`--> METHOD ACCEPT URL: STATUS`)
    * prerender / prerender-manager: same HTTP rate chart
    * worker: indexed-files-per-second + error-rate from the
      indexing-progress event stream
  CloudWatch ECS metrics (CPU / memory / RunningTaskCount) are
  intentionally deferred until cluster + task-family naming is
  standardized in observability config — these stubs are the natural
  home to add them.

* `worker-status.json` deleted — its sole alertlist panel is folded
  into Overview row 2.

The three governing principles for the rationalized tree are:

  1. System dashboards aggregate; entity dashboards filter.
  2. Dashboard = context; action lives in its context.
  3. Drill-downs go down or sideways, never up. Overview is the only
     dashboard that links downward; deep-dives link sideways to peers.
     Logs is the universal forensics terminal, with `entity:*` sideways
     links that pass `realm_url` / `matrix_user_id` to pivot rather
     than drill back up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small follow-ups after applying the new Overview dashboard locally:

* Drop the redundant `AS time` after `$__timeGroupAlias(col, '30s')` in
  the throughput timeseries queries. The macro already expands to
  `floor(...) AS "time"`, so appending `AS time` produces
  `... AS "time" AS time` — Postgres rejects with "syntax error at or
  near AS". Same bug fixed in `indexing.json` (pre-existing on main from
  #4697) so the Indexing dashboard's throughput panel renders too.

* Service-health stat panels: change the `null` threshold step from red
  to transparent and add a `0` step at red, so "no streams matched"
  (e.g. local dev with mise tasks not running) shows neutral, while
  "stream exists but silent for 5m" still alarms red.

* `Other dashboards` markdown panel: bump height 4 → 8 so the bullet
  list isn't clipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…w to $__range

* Alloy: rewrite the local docker-compose container name `boxel-synapse`
  to `service=synapse` so dashboards using `service="synapse"` work
  identically against local and staging/prod (where the ECS task family
  is already `synapse`).

* Overview service-health stats: change `count_over_time(... [5m])` to
  `count_over_time(... [$__range])`. The 5m window was a strict liveness
  proxy that only fits chatty staging/prod services — local-dev mise
  tasks emit in bursts (active clicking → 1k+ lines/min, idle → silent
  for minutes). With $__range the stat reflects log volume over the
  visible dashboard window, so a 1h panel catches bursty local activity
  while operators on staging/prod can narrow the time picker for strict
  liveness. Panel descriptions updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- textMode "value_and_name" → "value": the field name "Value #A" was
  Grafana's auto-name for an instant query result, with no semantic
  meaning. The panel title already names the service; the count below
  is what matters.
- "Prerender Server" → "Prerender" and "Prerender Manager" →
  "Prerender Mgr" so both fit in their stat columns instead of
  truncating to "Prerender ...".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…service metrics

Log-line count was a meaningless number; replace each service stat with
metrics an operator actually acts on:

* Realm Server / Prerender: HTTP request rate (httpLogging --> exit
  lines) and error rate (lines matching error|exception|fatal).
* Prerender Mgr: request rate from "proxying ..." entry lines (the
  manager doesn't emit httpLogging), plus error rate.
* Worker: jobs/min resolved, jobs/min rejected, queue depth (unfulfilled
  jobs with no active reservation). Single Postgres query, three columns.
* Synapse: PUT /send/m.room.message rate from synapse.access.http
  lines — the actual chat-send signal, not internal Synapse chatter.

All Loki rates use rate([5m]) * 60 for smoothed per-minute. Stat panels
display vertically with field name + value (textMode "value_and_name");
field-level overrides invert thresholds so error/failure fields turn red
on any nonzero, while activity rates stay neutral when idle. Panel
height bumped 4→5 to fit 2-3 stacked stats; everything below shifts +1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… Synapse)

Replace the 5 single-stat panels in row 1 with 3 timeseries KPIs at
w=8 each. Each shows two metrics + the mean over the visible time
range in the legend:

* Realm Server (line chart): HTTP req/min and HTTP err/min from
  Loki httpLogging exit lines.
* Worker (stacked bars): Jobs/min (resolved, green) and Failures/min
  (rejected, red) from Postgres `jobs` grouped by 1m bucket.
* Synapse (bars): Msgs/min from synapse access-log PUT /send/m.room.message.

Prerender + Prerender Mgr drop out of row 1 — they'll come back in
row 2 as ECS resource-utilization panels (CPU / Memory / Instances).

Active Alerts and the indexing-stat row shift down by 3 to make room
for the taller timeseries panels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the resource-utilisation row beneath the KPI graphs:

* 5 service stat panels (Realm Server, Prerender, Prerender Mgr, Worker,
  Synapse) at w=4 each, each showing CPU / Memory / Tasks from
  CloudWatch AWS/ECS metrics. ClusterName=${env}, ServiceName matches
  staging/production naming:
    boxel-realm-server-${env}
    boxel-prerender-server-${env}
    boxel-prerender-manager-${env}
    boxel-worker-${env}
    synapse-${env}                 ← no boxel- prefix
  Field overrides rename CloudWatch metric names to short labels (CPU,
  Mem, Tasks) and apply per-field thresholds — CPU/Mem yellow at 70%,
  red at 90%; Tasks red at 0 (service down) green at 1+.

* 1 DB stat panel (w=4) — Req/min computed as `(xact_commit + xact_rollback)`
  divided by the time elapsed since `pg_postmaster_start_time()`, against
  the boxel database via the existing boxel-db Postgres datasource.
  This is "average since postmaster start", not a rolling window.

`env` is a new constant template variable substituted at apply time:
__ENV__ → local | staging | production. apply.sh's existing jq walk
gets a third substitution clause; render-config-time docs updated.

Locally the 5 CloudWatch panels render "No data" (no AWS access);
staging/production wire up automatically once this lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xn/min"

The metric is transactions per minute (xact_commit + xact_rollback) — calling
it "Req/min" was sloppy. Title bumped to "Postgres DB" so it's clear at a
glance which datastore the panel covers (vs. e.g. Redis or Synapse's
internal Postgres).
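Spelled out as SQL, the stat is roughly the following (a hedged sketch; the committed query may differ in shape):

```sql
-- Average transactions per minute since postmaster start; greatest() guards
-- against division by zero right after a restart.
SELECT (xact_commit + xact_rollback)
       / greatest(extract(epoch FROM now() - pg_postmaster_start_time()) / 60, 1)
       AS txn_per_min
FROM pg_stat_database
WHERE datname = current_database();
```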

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The KPI (jobs/min + failures/min) reflects all queue activity, not just
indexing — "Job Queue" reads cleaner. Row 2 still has a "Worker" panel
showing the ECS task's resource utilisation, which is genuinely
worker-specific.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…equests"

The req-rate / err-rate KPI is fundamentally about HTTP traffic into the
realm-server — naming it "Web Requests" reflects what's being measured.
Row 2 keeps a "Realm Server" panel for the ECS-task-level CPU/Mem/Tasks,
which is genuinely realm-server-specific.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `boxel-status/database.json` (uid `boxeldatabase1`) — Postgres
health for the boxel application database, queried directly via the
existing boxel-db datasource. No `pg_exporter` dependency.

Inspired by the upstream pg_exporter community dashboard (24298) but
trimmed to the panels operators actually act on:

* Row 1 (KPIs): Database size, Active connections, Idle-in-transaction
  count, Txn/min, Cache hit ratio gauge, Long-running queries count
* Row 2: Connections by state (stacked bar), Transactions/min over time
* Row 3: Top 20 tables by total size — incl. dead-tuple ratio + last
  vacuum/analyze
* Row 4: Cumulative writes per table (top 8) + Idle-in-txn backends
  (table)
* Row 5: Long-running queries (table) + Tables overdue for vacuum
  (table)
* Row 6: A markdown panel naming what's NOT covered (slow queries
  need pg_stat_statements; bloat needs pgstattuple; WAL/replication
  need pg_exporter — none currently enabled).

Overview's Postgres DB stat now has a panel link → /d/boxeldatabase1
so operators can drill from the single-number health stat into the
detailed view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the Grafana timeseries gradient modes
(grafana.com/docs/grafana/latest/visualizations/panels-visualizations/visualizations/time-series/#gradient-mode):

* gradientMode: none → opacity. Each series fades from full line color
  down through transparent — gives the panel a glow under each line
  rather than a flat-color fill.
* fillOpacity: 10 → 50. The opacity gradient needs a bit of fill height
  to show; 50 makes the gradient legible without drowning the line.
* lineInterpolation: linear → smooth. Web Requests is a rate (smoothed
  over 1m already); a smooth curve reads more naturally than zigzag.

Other timeseries panels (Job Queue stacked bars, Synapse bar, Indexing
throughput, Connections-by-state stacked bar) intentionally untouched —
gradients on stacked bars get visually muddy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lukemelia lukemelia force-pushed the grafana-stack/06-overview-and-service-stubs branch from d8dfcfe to 2e2314a on May 7, 2026 22:59
lukemelia and others added 5 commits May 7, 2026 19:16
Boxel uses Matrix as an event bus, not just for chat — most synapse
traffic locally is `app.boxel.realm-event` (realm sync), not
`m.room.message` (chat). Filtering only on `m.room.message` made the
panel show 'No data' even when synapse was busy.

Match all `Received request: PUT .../send/` access-log lines instead.
Legend label "Msgs/min" → "Events/min" to reflect the broader count.
Description spells out what the count includes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Boxel realm-event traffic is bursty — events arrive on the order of
1 per minute during normal dev activity. A 1-minute rate window often
catches zero hits and shows 'No data'; 5 minutes matches the
smoothing already used by the Realm Server (Web Requests) panel and
gives the Synapse panel something to render whenever there's been
activity in the last few minutes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match the Web Requests panel style: smooth line, fillOpacity 50,
gradientMode "opacity". Synapse traffic is sparse and bursty locally,
so the bar look ended up as scattered single bars; a smooth line with
a fading area underneath communicates the rate trend better.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add visible drilldown link icon (top-left of each panel title) on the
three Row 1 timeseries panels:

* Web Requests → /d/boxel-svc-realm-server
* Job Queue    → /d/boxel-svc-worker
* Synapse      → /d/000000012

For stat panels, fieldConfig.defaults.links renders as a clickable
panel header link. For timeseries panels, the same field becomes a
data link that's only visible when you hover/click an individual point.
Adding a top-level panel.links array gives an always-visible drilldown
icon, restoring the click-through experience the stat versions had
before the row 1 redesign.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Append `or vector(0)` to the Synapse panel's LogQL so timestamps with
no matching log lines come back as 0 instead of null. Combined with
the existing `lineInterpolation: smooth`, the line now flows
continuously through idle periods and only spikes when there's
real activity, instead of breaking into disjoint segments.

`or vector(0)` is the standard PromQL idiom for "no-data → 0":
vector(0) returns a constant 0-valued series at every evaluation
timestamp, and `or` falls through to it whenever the left side has
no result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lukemelia and others added 6 commits May 7, 2026 19:27
* Web Requests: append `or vector(0)` to both LogQL exprs (HTTP req/min
  + HTTP err/min) so idle 1-minute windows render as 0 instead of
  null.

* Job Queue: same shape change. SQL gets `$__timeGroupAlias(col, '1m', 0)`
  — the third macro arg is Postgres datasource fill-mode, "0" expands
  empty buckets to zero rows. Bar viz → smooth line with opacity
  gradient (drawStyle line, fillOpacity 50, gradientMode opacity,
  lineInterpolation smooth, lineWidth 2). Stacking flipped to none —
  with two lines the visual stack adds no info, while overlay lets
  Failures/min red show through against Jobs/min green.

Result: all three Row 1 KPI panels (Web Requests, Job Queue, Synapse)
share the same gradient-line aesthetic and continuously flow through
idle periods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without AWS credentials, the 5 ECS resource panels (Realm Server,
Prerender, Prerender Mgr, Worker, Synapse) on the Overview show a
"No data" + query-error triangle locally. Visually indistinguishable
from a real broken panel.

Add a second jq pass in apply.sh, gated on `env_name == "local"`, that
finds any panel whose datasource.type is `cloudwatch` and replaces it
with a `text` markdown panel containing an explanatory message:

  ☁️ AWS CloudWatch — staging/production only
  The boxel-cloudwatch datasource has no AWS credentials in local dev.
  ECS resource utilisation (CPU / Memory / Tasks) renders correctly
  when this dashboard is applied to a hosted Grafana.

The id, title, and gridPos are preserved, so the layout is unchanged.
Match condition uses both `datasource.type == "cloudwatch"` AND a
gridPos object presence so we only catch panel-level objects, not
target-level objects (which also carry a datasource ref but no
gridPos).

Committed JSON is unchanged — staging/production still push real
CloudWatch queries through the existing flow.
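A standalone sketch of that second pass (the in-tree version runs inside apply.sh's existing pipeline, and the markdown body is abbreviated here):

```sh
# Swap any panel-level cloudwatch panel for a text placeholder, keeping id /
# title / gridPos so the layout is unchanged. Panel-level objects are told
# apart from target-level ones by the presence of gridPos.
jq '
  walk(
    if type == "object" and has("gridPos")
       and ((.datasource // {}) | type == "object" and .type == "cloudwatch")
    then { id, title, gridPos,
           type: "text",
           options: { mode: "markdown",
                      content: "☁️ AWS CloudWatch — staging/production only" } }
    else .
    end
  )
' dashboard.json > dashboard.local.json
```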

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tead of CloudWatch

The vendored Synapse exposes process_* and synapse_* metrics on
/_synapse/metrics. The synapse-prometheus datasource scrapes those
(local Prometheus locally, AMP in staging/production), so we don't
need CloudWatch ECS metrics for this service — we already have
finer-grained data straight from the process.

Swap the 3 panel targets:

  CPU  : rate(process_cpu_seconds_total{job="synapse"}[5m]) * 100
  Mem  : process_resident_memory_bytes{job="synapse"}
  Up   : count(up{job="synapse"} == 1)

Mem unit changes from `percent` (CloudWatch MemoryUtilization) to
`decbytes` (raw RSS) — synapse's prometheus_client doesn't know the
container memory limit so a percent ratio isn't available without
extra plumbing. "Tasks" renamed to "Up" because count(up==1) is a
healthy-scrape-target count, not an ECS RunningTaskCount.

The Synapse panel is now functional locally too — no longer caught
by apply.sh's CloudWatch-→-placeholder transform since it uses a
`prometheus` datasource.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new stat panels at the top of the Overview, each showing the
running total (big number) plus a cumulative-growth sparkline:

* Realms (id 21, x=0..11): SELECT MIN(indexed_at) per realm_url from
  realm_meta, then ROW_NUMBER() over the sorted creation timestamps
  for the cumulative count (sketched below). Drill-through: /d/boxelrealms001.
* Users  (id 22, x=12..23): same shape, ordered by users.created_at.
  Drill-through: /d/boxelusers0001.

Both panels carry a `timeFrom: "5y"` panel-level override so the
sparkline always shows the full growth history regardless of where
the dashboard's time picker is set — narrowing to "Last 1h" would
otherwise leave the sparkline empty if no creations happened in
the window.

Existing rows shift +8 to make room. Also fixes the Job Queue panel
drill-down: was pointing at /d/boxel-svc-worker (the Worker service
deep-dive); now points at /d/boxeljobqueue1, the Job Queue dashboard
that actually owns generic worker-queue metrics across all job types.
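The cumulative-count query sketched per the Realms bullet above (the committed SQL may differ):

```sql
-- One row per realm at its first-indexed timestamp; ROW_NUMBER() over the
-- sorted timestamps yields the running total the sparkline plots.
SELECT created AS "time",
       row_number() OVER (ORDER BY created) AS "Realms"
FROM (
  SELECT min(indexed_at) AS created
  FROM realm_meta
  GROUP BY realm_url
) first_seen
ORDER BY created;
```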

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Boxel-ui defines its brand palette in packages/boxel-ui/addon/src/styles/variables.css.
Wire the same hex codes into Overview panel field overrides so the
KPI series read as branded rather than Grafana-default:

* Realms          → #6638ff (boxel-purple — brand primary)
* Users           → #00ffba (boxel-teal — secondary accent)
* HTTP req/min    → #0069f9 (boxel-blue — info)
* HTTP err/min    → #ff5050 (boxel-red — danger)
* Jobs/min        → #37eb77 (boxel-green — success)
* Failures/min    → #ff5050 (boxel-red — danger)
* Events/min      → #6638ff (boxel-purple — synapse maps to brand)
* Indexing arrived/started/completed → blue/yellow/teal

Threshold-based panels (ECS resource cards, indexing pipeline stats)
keep Grafana's named green/yellow/red — those colors are semantically
correct ("this is bad/ok/great") and close enough to the boxel palette
that swapping hex codes adds visual noise without clarifying intent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the "operator actions stay contextual" principle: the per-realm
reindex action belongs on the Realms dashboard (which is realm-scoped
via the `${realm_url}` template variable), not on the system-wide
Indexing dashboard.

Indexing dashboard (Operator Actions panel, id 4):
  * Drop: realm_picker, pending, in_flight, oldest_pending_human,
    last_reindex_status, btn_reindex_realm form elements
  * Drop: the realm-registry SELECT target that fed the picker
  * Drop: elementValueChanged hook that mirrored picker → URL var
  * Keep: pending_full_reindex indicator + btn_reindex_all button
  * SQL trims to a single COUNT(*) for full-reindex jobs in flight (sketched below)
  * Title: "Operator Actions" → "Operator Actions: Full reindex"
  * h=11 → h=4 (form had 8 elements, now has 2; no longer needs the
    deep visual footprint that overlapped the stat row below)

Realms dashboard (new panel id 100):
  * New "Operator Actions: Reindex this realm" volkovlabs-form-panel
    placed at y=22 between "Indexing status (this realm)" and
    "Recent Indexing Errors"
  * Reads the realm from the existing `${realm_url}` template variable
    (no realm_picker form element needed — the dashboard already has
    a Realm dropdown at the top)
  * SQL adapted from the Indexing version: ${full_index_realm} →
    ${realm_url}, dropped pending_full_reindex column
  * Keeps the same blast-radius (pending / in_flight / oldest_pending)
    + last_reindex_status indicators and disable-while-in-flight guard
  * Layout below shifts +8 to make room
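The trimmed full-reindex count, roughly (the column names and the job-type value are assumptions, not the committed SQL):

```sql
-- Single blast-radius indicator: how many system-wide reindex jobs are
-- currently unfinished. Drives the disable-while-in-flight guard.
SELECT count(*) AS full_reindex_in_flight
FROM jobs
WHERE job_type = 'full-reindex'   -- assumed discriminator
  AND finished_at IS NULL;        -- assumed "still in flight" condition
```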

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lukemelia lukemelia changed the title from "observability: add Overview + per-service stub dashboards" to "observability: Overview redesign + Database dashboard + ECS / Prometheus wiring" on May 7, 2026

github-actions Bot commented May 8, 2026

Host Test Results

1 file, 1 suite, 1h 44m 32s ⏱️
2,634 tests: 2,619 ✅ · 15 💤 · 0 ❌
2,653 runs: 2,638 ✅ · 15 💤 · 0 ❌

Results for commit 4e3a2c4.

Realm Server Test Results

1 file (±0), 1 suite (+1), 17m 57s ⏱️ (+17m 57s)
1,285 tests (+1,285): 1,285 ✅ (+1,285) · 0 💤 (±0) · 0 ❌ (±0)
1,364 runs (+1,364): 1,364 ✅ (+1,364) · 0 💤 (±0) · 0 ❌ (±0)

Results for commit 4e3a2c4. ± Comparison against earlier commit d5b9bc5.
