observability: Overview redesign + Database dashboard + ECS / Prometheus wiring#4699
Conversation
Observability diff (vs staging): No dashboard / folder changes detected against the staging Grafana. (Run: https://github.com/cardstack/boxel/actions/runs/25528541266)
Pull request overview
Adds a top-level Grafana “Overview” dashboard plus new per-service “stub” dashboards (realm-server, prerender-server, prerender-manager, worker) to establish a navigable dashboard tree for Boxel operational monitoring, and removes the now-redundant Worker Status dashboard.
Changes:
- Added `overview.json` with service liveness stats, an alertlist panel, indexing pipeline stats, indexing throughput, and a dashboard directory panel.
- Added per-service deep-dive stub dashboards under `boxel-status/` that focus on request/activity rate + filtered logs.
- Deleted `worker-status.json` (its alertlist is replaced by Overview's "Active Alerts").
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| packages/observability/grafanactl/resources/dashboards/boxel-status/worker-status.json | Removes the old single-panel Worker Status dashboard. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-worker.json | Adds a worker stub with activity-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-realm-server.json | Adds a realm-server stub with HTTP request-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-prerender-server.json | Adds a prerender-server stub with HTTP request-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-prerender-manager.json | Adds a prerender-manager stub with HTTP request-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/overview.json | Adds a new top-level Overview dashboard that links to service stubs and key workflow dashboards. |
The navigation backbone for the rationalized dashboard tree. Lands last because it links to dashboards introduced in #4696, #4697, #4698.

* `overview.json` — top-of-tree dashboard tagged `overview`. Five service-health stats (Loki log volume in the last 5m as a liveness proxy, with a drill-down panel link to each service's stub), a full-width alertlist (replaces the deleted worker-status.json), four indexing pipeline stats, an indexing throughput timeseries, and a markdown text panel listing the other dashboards by category.
* `service-{realm-server,prerender-server,prerender-manager,worker}.json` — minimal per-service deep-dives tagged `service:<name>`. Each is a Loki-filtered logs panel plus a service-specific top-row chart:
  * realm-server: HTTP request rate (total/4xx/5xx) parsed from the `httpLogging` exit-log lines (`--> METHOD ACCEPT URL: STATUS`)
  * prerender / prerender-manager: same HTTP rate chart
  * worker: indexed-files-per-second + error-rate from the indexing-progress event stream

  CloudWatch ECS metrics (CPU / memory / RunningTaskCount) are intentionally deferred until cluster + task-family naming is standardized in observability config — these stubs are the natural home to add them.
* `worker-status.json` deleted — its sole alertlist panel is folded into Overview row 2.

The three governing principles for the rationalized tree:

1. System dashboards aggregate; entity dashboards filter.
2. Dashboard = context; action lives in its context.
3. Drill-downs go down or sideways, never up. Overview is the only dashboard that links downward; deep-dives link sideways to peers. Logs is the universal forensics terminal, with `entity:*` sideways links that pass `realm_url` / `matrix_user_id` to pivot rather than drill back up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small follow-ups after applying the new Overview dashboard locally:

* Drop the redundant `AS time` after `$__timeGroupAlias(col, '30s')` in the throughput timeseries queries. The macro already expands to `floor(...) AS "time"`, so appending `AS time` produces `... AS "time" AS time` — Postgres rejects it with "syntax error at or near AS". The same bug is fixed in `indexing.json` (pre-existing on main from #4697) so the Indexing dashboard's throughput panel renders too.
* Service-health stat panels: change the `null` threshold step from red to transparent and add a `0` step at red, so "no streams matched" (e.g. local dev with mise tasks not running) shows neutral, while "stream exists but silent for 5m" still alarms red.
* `Other dashboards` markdown panel: bump height 4 → 8 so the bullet list isn't clipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
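For reference, the failing vs. fixed shape looks roughly like this (table and column names are illustrative, not taken from the dashboards):

```sql
-- Broken: $__timeGroupAlias already emits its own alias (... AS "time"),
-- so the trailing `AS time` expands to `AS "time" AS time` → syntax error.
SELECT $__timeGroupAlias(created_at, '30s') AS time, COUNT(*) AS files
FROM some_index_table GROUP BY 1 ORDER BY 1;

-- Fixed: let the macro supply the alias.
SELECT $__timeGroupAlias(created_at, '30s'), COUNT(*) AS files
FROM some_index_table GROUP BY 1 ORDER BY 1;
```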
…w to $__range

* Alloy: rewrite the local docker-compose container name `boxel-synapse` to `service=synapse` so dashboards using `service="synapse"` work identically against local and staging/prod (where the ECS task family is already `synapse`).
* Overview service-health stats: change `count_over_time(... [5m])` to `count_over_time(... [$__range])`. The 5m window was a strict liveness proxy that only fits chatty staging/prod services — local-dev mise tasks emit in bursts (active clicking → 1k+ lines/min, idle → silent for minutes). With `$__range` the stat reflects log volume over the visible dashboard window, so a 1h panel catches bursty local activity while operators on staging/prod can narrow the time picker for strict liveness. Panel descriptions updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- textMode `value_and_name` → `value`: the field name "Value #A" was Grafana's auto-name for an instant query result, with no semantic meaning. The panel title already names the service; the count below it is what matters.
- "Prerender Server" → "Prerender" and "Prerender Manager" → "Prerender Mgr" so both fit in their stat columns instead of truncating to "Prerender ...".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…service metrics

Log-line count was a meaningless number; replace each service stat with metrics an operator actually acts on:

* Realm Server / Prerender: HTTP request rate (httpLogging `-->` exit lines) and error rate (lines matching `error|exception|fatal`).
* Prerender Mgr: request rate from "proxying ..." entry lines (the manager doesn't emit httpLogging), plus error rate.
* Worker: jobs/min resolved, jobs/min rejected, queue depth (unfulfilled jobs with no active reservation). Single Postgres query, three columns.
* Synapse: `PUT /send/m.room.message` rate from synapse.access.http lines — the actual chat-send signal, not internal Synapse chatter.

All Loki rates use `rate([5m]) * 60` for a smoothed per-minute figure. Stat panels display vertically with field name + value (textMode `value_and_name`); field-level overrides invert thresholds so error/failure fields turn red on any nonzero value, while activity rates stay neutral when idle. Panel height bumped 4 → 5 to fit 2-3 stacked stats; everything below shifts +1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… Synapse)

Replace the 5 single-stat panels in row 1 with 3 timeseries KPIs at w=8 each. Each shows two metrics plus the mean over the visible time range in the legend:

* Realm Server (line chart): HTTP req/min and HTTP err/min from Loki httpLogging exit lines.
* Worker (stacked bars): Jobs/min (resolved, green) and Failures/min (rejected, red) from Postgres `jobs` grouped into 1m buckets.
* Synapse (bars): Msgs/min from synapse access-log `PUT /send/m.room.message`.

Prerender + Prerender Mgr drop out of row 1 — they'll come back in row 2 as ECS resource-utilization panels (CPU / Memory / Instances). Active Alerts and the indexing-stat row shift down by 3 to make room for the taller timeseries panels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the resource-utilisation row beneath the KPI graphs:
* 5 service stat panels (Realm Server, Prerender, Prerender Mgr, Worker,
Synapse) at w=4 each, each showing CPU / Memory / Tasks from
CloudWatch AWS/ECS metrics. ClusterName=${env}, ServiceName matches
staging/production naming:
boxel-realm-server-${env}
boxel-prerender-server-${env}
boxel-prerender-manager-${env}
boxel-worker-${env}
synapse-${env} ← no boxel- prefix
Field overrides rename CloudWatch metric names to short labels (CPU,
Mem, Tasks) and apply per-field thresholds — CPU/Mem go yellow at 70%
and red at 90%; Tasks goes red at 0 (service down), green at 1+.
* 1 DB stat panel (w=4) — Req/min computed as
`(xact_commit + xact_rollback) / pg_postmaster_start_time()` against
the boxel database via the existing boxel-db Postgres datasource.
This is "average since postmaster start", not a rolling window.
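The stat described above amounts to something like the following query; this is a sketch against the standard `pg_stat_database` view, not the committed panel SQL:

```sql
-- Transactions per minute, averaged over the whole uptime
-- ("average since postmaster start", not a rolling window).
SELECT (xact_commit + xact_rollback)
       / GREATEST(EXTRACT(EPOCH FROM now() - pg_postmaster_start_time()) / 60, 1)
       AS txn_per_min
FROM pg_stat_database
WHERE datname = 'boxel';
```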
`env` is a new constant template variable substituted at apply time:
__ENV__ → local | staging | production. apply.sh's existing jq walk
gets a third substitution clause; render-config-time docs updated.
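The substitution itself is a recursive walk over every string in the dashboard JSON. A Python sketch of what apply.sh's jq pass does (function and variable names here are illustrative, not from apply.sh):

```python
# Recursively replace the __ENV__ placeholder in every string value
# of a parsed dashboard JSON document.
def substitute_env(node, env_name):
    if isinstance(node, str):
        return node.replace("__ENV__", env_name)
    if isinstance(node, list):
        return [substitute_env(v, env_name) for v in node]
    if isinstance(node, dict):
        return {k: substitute_env(v, env_name) for k, v in node.items()}
    return node

panel = {"dimensions": {"ServiceName": "boxel-realm-server-__ENV__"}}
print(substitute_env(panel, "staging"))
```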
Locally the 5 CloudWatch panels render "No data" (no AWS access);
staging/production wire up automatically once this lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xn/min" The metric is transactions per minute (xact_commit + xact_rollback) — calling it "Req/min" was sloppy. Title bumped to "Postgres DB" so it's clear at a glance which datastore the panel covers (vs. e.g. Redis or Synapse's internal Postgres). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The KPI (jobs/min + failures/min) reflects all queue activity, not just indexing — "Job Queue" reads cleaner. Row 2 still has a "Worker" panel showing the ECS task's resource utilisation, which is genuinely worker-specific. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…equests" The req-rate / err-rate KPI is fundamentally about HTTP traffic into the realm-server — naming it "Web Requests" reflects what's being measured. Row 2 keeps a "Realm Server" panel for the ECS-task-level CPU/Mem/Tasks, which is genuinely realm-server-specific. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `boxel-status/database.json` (uid `boxeldatabase1`) — Postgres health for the boxel application database, queried directly via the existing boxel-db datasource. No `pg_exporter` dependency. Inspired by the upstream pg_exporter community dashboard (24298) but trimmed to the panels operators actually act on:

* Row 1 (KPIs): Database size, Active connections, Idle-in-transaction count, Txn/min, Cache hit ratio gauge, Long-running queries count
* Row 2: Connections by state (stacked bar), Transactions/min over time
* Row 3: Top 20 tables by total size — incl. dead-tuple ratio + last vacuum/analyze
* Row 4: Cumulative writes per table (top 8) + Idle-in-txn backends (table)
* Row 5: Long-running queries (table) + Tables overdue for vacuum (table)
* Row 6: A markdown panel naming what's NOT covered (slow queries need pg_stat_statements; bloat needs pgstattuple; WAL/replication need pg_exporter — none currently enabled).

Overview's Postgres DB stat now has a panel link → /d/boxeldatabase1 so operators can drill from the single-number health stat into the detailed view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the Grafana timeseries gradient-mode docs (grafana.com/docs/grafana/latest/visualizations/panels-visualizations/visualizations/time-series/#gradient-mode):

* gradientMode: none → opacity. Each series fades from full line color down through transparent — the panel gets a glow under each line rather than a flat-color fill.
* fillOpacity: 10 → 50. The opacity gradient needs a bit of fill height to show; 50 makes the gradient legible without drowning the line.
* lineInterpolation: linear → smooth. Web Requests is a rate (already smoothed over 1m); a smooth curve reads more naturally than a zigzag.

Other timeseries panels (Job Queue stacked bars, Synapse bar, Indexing throughput, Connections-by-state stacked bar) are intentionally untouched — gradients on stacked bars get visually muddy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
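In the dashboard JSON, these three settings live under the panel's custom field options, roughly (a fragment, not the whole fieldConfig):

```json
{
  "fieldConfig": {
    "defaults": {
      "custom": {
        "drawStyle": "line",
        "lineInterpolation": "smooth",
        "fillOpacity": 50,
        "gradientMode": "opacity"
      }
    }
  }
}
```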
Boxel uses Matrix as an event bus, not just for chat — most synapse traffic locally is `app.boxel.realm-event` (realm sync), not `m.room.message` (chat). Filtering only on `m.room.message` made the panel show 'No data' even when synapse was busy. Match all `Received request: PUT .../send/` access-log lines instead. Legend label "Msgs/min" → "Events/min" to reflect the broader count. Description spells out what the count includes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Boxel realm-event traffic is bursty — events arrive on the order of 1 per minute during normal dev activity. A 1-minute rate window often catches zero hits and shows 'No data'; 5 minutes matches the smoothing already used by the Realm Server (Web Requests) panel and gives the Synapse panel something to render whenever there's been activity in the last few minutes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match the Web Requests panel style: smooth line, fillOpacity 50, gradientMode "opacity". Synapse traffic is sparse and bursty locally, so the bar look ended up as scattered single bars; a smooth line with a fading area underneath communicates the rate trend better. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add visible drilldown link icon (top-left of each panel title) on the three Row 1 timeseries panels: * Web Requests → /d/boxel-svc-realm-server * Job Queue → /d/boxel-svc-worker * Synapse → /d/000000012 For stat panels, fieldConfig.defaults.links renders as a clickable panel header link. For timeseries panels, the same field becomes a data link that's only visible when you hover/click an individual point. Adding a top-level panel.links array gives an always-visible drilldown icon, restoring the click-through experience the stat versions had before the row 1 redesign. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Append `or vector(0)` to the Synapse panel's LogQL so timestamps with no matching log lines come back as 0 instead of null. Combined with the existing `lineInterpolation: smooth`, the line now flows continuously through idle periods and only spikes when there's real activity, instead of breaking into disjoint segments. `or vector(0)` is the standard PromQL idiom for "no-data → 0": vector(0) returns a constant 0-valued series at every evaluation timestamp, and `or` falls through to it whenever the left side has no result. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
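The resulting expression has roughly this shape (the stream selector and line filter are illustrative, not the exact panel expr):

```logql
sum(rate({service="synapse"} |= "Received request: PUT" [5m])) * 60
  or vector(0)
```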
* Web Requests: append `or vector(0)` to both LogQL exprs (HTTP req/min + HTTP err/min) so idle 1-minute windows render as 0 instead of null.
* Job Queue: same shape change. The SQL gets `$__timeGroupAlias(col, '1m', 0)` — the third macro arg is the Postgres datasource fill mode; `0` expands empty buckets to zero rows. Bar viz → smooth line with opacity gradient (drawStyle line, fillOpacity 50, gradientMode opacity, lineInterpolation smooth, lineWidth 2). Stacking flipped to none — with two lines the visual stack adds no info, while overlay lets the Failures/min red show through against the Jobs/min green.

Result: all three Row 1 KPI panels (Web Requests, Job Queue, Synapse) share the same gradient-line aesthetic and flow continuously through idle periods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without AWS credentials, the 5 ECS resource panels (Realm Server, Prerender, Prerender Mgr, Worker, Synapse) on the Overview show "No data" + a query-error triangle locally — visually indistinguishable from a genuinely broken panel.

Add a second jq pass in apply.sh, gated on `env_name == "local"`, that finds any panel whose datasource.type is `cloudwatch` and replaces it with a `text` markdown panel containing an explanatory message:

> ☁️ AWS CloudWatch — staging/production only
> The boxel-cloudwatch datasource has no AWS credentials in local dev. ECS resource utilisation (CPU / Memory / Tasks) renders correctly when this dashboard is applied to a hosted Grafana.

The id, title, and gridPos are preserved, so the layout is unchanged. The match condition uses both `datasource.type == "cloudwatch"` AND the presence of a gridPos object, so we only catch panel-level objects, not target-level objects (which also carry a datasource ref but no gridPos). Committed JSON is unchanged — staging/production still push real CloudWatch queries through the existing flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
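A hypothetical sketch of that jq pass (the filter in apply.sh may differ in detail; the placeholder text is from the message above):

```jq
# Replace panel-level CloudWatch objects (has gridPos) with a text panel,
# keeping id/title/gridPos so the layout is unchanged.
walk(
  if type == "object" and has("gridPos")
     and ((.datasource | type) == "object")
     and .datasource.type == "cloudwatch"
  then {
    id, title, gridPos,
    type: "text",
    options: {
      mode: "markdown",
      content: "☁️ AWS CloudWatch — staging/production only"
    }
  }
  else .
  end
)
```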
…tead of CloudWatch
The vendored Synapse exposes process_* and synapse_* metrics on
/_synapse/metrics. The synapse-prometheus datasource scrapes those
(local Prometheus locally, AMP in staging/production), so we don't
need CloudWatch ECS metrics for this service — we already have
finer-grained data straight from the process.
Swap the 3 panel targets:

```
CPU : rate(process_cpu_seconds_total{job="synapse"}[5m]) * 100
Mem : process_resident_memory_bytes{job="synapse"}
Up  : count(up{job="synapse"} == 1)
```
Mem unit changes from `percent` (CloudWatch MemoryUtilization) to
`decbytes` (raw RSS) — synapse's prometheus_client doesn't know the
container memory limit so a percent ratio isn't available without
extra plumbing. "Tasks" renamed to "Up" because count(up==1) is a
healthy-scrape-target count, not an ECS RunningTaskCount.
The Synapse panel is now functional locally too — no longer caught
by apply.sh's CloudWatch-→-placeholder transform since it uses a
`prometheus` datasource.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new stat panels at the top of the Overview, each showing the running total (big number) plus a cumulative-growth sparkline:

* Realms (id 21, x=0..11): SELECT MIN(indexed_at) per realm_url from realm_meta, then ROW_NUMBER() over the sorted creation timestamps for the cumulative count. Drill-through: /d/boxelrealms001.
* Users (id 22, x=12..23): same shape, ordered by users.created_at. Drill-through: /d/boxelusers0001.

Both panels carry a `timeFrom: "5y"` panel-level override so the sparkline always shows the full growth history regardless of where the dashboard's time picker is set — narrowing to "Last 1h" would otherwise leave the sparkline empty if no creations happened in the window. Existing rows shift +8 to make room.

Also fixes the Job Queue panel drill-down: it was pointing at /d/boxel-svc-worker (the Worker service deep-dive); it now points at /d/boxeljobqueue1, the Job Queue dashboard that actually owns generic worker-queue metrics across all job types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
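A sketch of the Realms cumulative-growth query (`realm_meta` / `indexed_at` come from the commit message; the rest is assumed, and the committed SQL may differ):

```sql
-- First-indexed timestamp per realm, then a running count over those
-- timestamps to produce the cumulative-growth series.
SELECT created AS time,
       ROW_NUMBER() OVER (ORDER BY created) AS realms
FROM (
  SELECT realm_url, MIN(indexed_at) AS created
  FROM realm_meta
  GROUP BY realm_url
) AS first_indexed
ORDER BY created;
```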
Boxel-ui defines its brand palette in packages/boxel-ui/addon/src/styles/variables.css.
Wire the same hex codes into Overview panel field overrides so the
KPI series read as branded rather than Grafana-default:
* Realms → #6638ff (boxel-purple — brand primary)
* Users → #00ffba (boxel-teal — secondary accent)
* HTTP req/min → #0069f9 (boxel-blue — info)
* HTTP err/min → #ff5050 (boxel-red — danger)
* Jobs/min → #37eb77 (boxel-green — success)
* Failures/min → #ff5050 (boxel-red — danger)
* Events/min → #6638ff (boxel-purple — synapse maps to brand)
* Indexing arrived/started/completed → blue/yellow/teal
Threshold-based panels (ECS resource cards, indexing pipeline stats)
keep Grafana's named green/yellow/red — those colors are semantically
correct ("this is bad/ok/great") and close enough to the boxel palette
that swapping hex codes adds visual noise without clarifying intent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the "operator actions stay contextual" principle: the per-realm
reindex action belongs on the Realms dashboard (which is realm-scoped
via the `${realm_url}` template variable), not on the system-wide
Indexing dashboard.
Indexing dashboard (Operator Actions panel, id 4):
* Drop: realm_picker, pending, in_flight, oldest_pending_human,
last_reindex_status, btn_reindex_realm form elements
* Drop: the realm-registry SELECT target that fed the picker
* Drop: elementValueChanged hook that mirrored picker → URL var
* Keep: pending_full_reindex indicator + btn_reindex_all button
* SQL trims to a single COUNT(*) for full-reindex jobs in flight
* Title: "Operator Actions" → "Operator Actions: Full reindex"
* h=11 → h=4 (form had 8 elements, now has 2; no longer needs the
deep visual footprint that overlapped the stat row below)
Realms dashboard (new panel id 100):
* New "Operator Actions: Reindex this realm" volkovlabs-form-panel
placed at y=22 between "Indexing status (this realm)" and
"Recent Indexing Errors"
* Reads the realm from the existing `${realm_url}` template variable
(no realm_picker form element needed — the dashboard already has
a Realm dropdown at the top)
* SQL adapted from the Indexing version: ${full_index_realm} →
${realm_url}, dropped pending_full_reindex column
* Keeps the same blast-radius (pending / in_flight / oldest_pending)
+ last_reindex_status indicators and disable-while-in-flight guard
* Layout below shifts +8 to make room
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Host Test Results: 1 files, 1 suites, 1h 44m 32s ⏱️ — results for commit 4e3a2c4.
Realm Server Test Results: 1 files ±0, 1 suites +1, 17m 57s ⏱️ (+17m 57s) — results for commit 4e3a2c4. ± Comparison against earlier commit d5b9bc5.
Summary
Rebuilds the Overview dashboard around per-service KPIs and ECS resource cards, adds a new Database dashboard, wires CloudWatch ECS metrics (with a local-dev placeholder fallback so empty panels don't look broken), and moves per-realm reindex from Indexing to Realms (where the realm context lives). Most of this PR is the Overview reshaping; the rest is supporting plumbing.
Screenshots
Overview
Top to bottom: Realms / Users cumulative-growth KPIs · Web Requests / Job Queue / Synapse rate KPIs (smooth gradient lines, mean in legend, drilldown icons) · Realm Server / Prerender / Prerender Mgr / Worker ECS panels showing the local-dev placeholder · Synapse ECS panel pulling from local Prometheus (CPU / Mem / Up) · Postgres DB card (`txn/min`, drills to the new Database dashboard) · Active Alerts · indexing pipeline stats and throughput.

Database (new dashboard)
KPIs (DB size, active connections, idle in txn, txn/min, cache hit ratio gauge, long-running queries) · connections-by-state stacked bar · transactions/min · top tables by size · cumulative writes by table · idle-in-txn backends · long-running queries · tables overdue for vacuum · markdown footer naming what's NOT covered without `pg_stat_statements` / `pgstattuple` / `pg_exporter`.

Realms (with the moved reindex panel)
Per-realm stat row · Grant Permission · Realm Permissions table · Indexing status · Operator Actions: Reindex this realm (new — moved here from Indexing). The realm comes from the existing `${realm_url}` template variable; blast-radius gating disables the button while a job is in flight.

Indexing (trimmed to Full reindex only)
The Operator Actions panel now holds only the `Pending full-reindex` indicator + `Reindex ALL realms` button. h: 11 → 4. The per-realm reindex moved to the Realms dashboard.

Overview dashboard (`overview.json`) — full reshape (replaces the deleted `worker-status.json`)

Notable shape choices:
* KPI series use the boxel brand palette from `boxel-ui/addon/src/styles/variables.css`. Threshold-based panels stay on Grafana's named green/yellow/red since those colors carry semantic meaning.
* Loki exprs append `or vector(0)`; Postgres queries use `$__timeGroupAlias(col, '1m', 0)` fill-mode. Combined with `lineInterpolation: smooth` and `gradientMode: opacity`, the lines flow continuously through idle periods rather than breaking into segments.
* A local-only jq pass in `apply.sh`, gated on `env_name == "local"`, swaps any panel with `datasource.type == "cloudwatch"` for a `text` markdown placeholder explaining the panel only works in staging/production. Committed JSON keeps real CloudWatch queries; staging/prod apply pushes them unchanged.

Database dashboard (`database.json` — new, uid `boxeldatabase1`)

SQL-only dashboard for the boxel application Postgres, queried directly via the existing `boxel-db` datasource — no `pg_exporter` dependency. Inspired by the upstream pg_exporter community dashboard but trimmed to the panels operators act on (a markdown footer names what's NOT covered: slow queries need `pg_stat_statements`; bloat needs `pgstattuple`; WAL/replication need `pg_exporter`).

Overview's Postgres DB stat drills here.
Per-service stub dashboards (`service-*.json`)

Minimal per-service deep-dives tagged `service:<name>`, one each for realm-server / prerender-server / prerender-manager / worker. Each has a service-specific top-row chart (HTTP request rate from httpLogging exit lines for the three HTTP services; indexed-files/sec + error-rate from `[indexing-progress]` events for worker) plus a Loki-filtered logs panel below.

Overview's Row 2 ECS cards link into these.
`env` constant template variable

New `env` constant template variable (`__ENV__`) substituted by `apply.sh` to `local` / `staging` / `production` at apply time. Used in CloudWatch dimension values like `boxel-realm-server-${env}` and in the local-only placeholder gate. Same substitution mechanism as the existing `__REALM_SERVER_URL__` and `REPLACE_AT_APPLY_TIME` placeholders.

Reindex panel relocation
The per-realm reindex action moves from the system-wide Indexing dashboard to the realm-scoped Realms dashboard. The "Reindex ALL realms" system action stays on Indexing. Both keep the disable-while-in-flight blast-radius gating.
The Realms dashboard already had a `${realm_url}` template variable feeding the rest of its panels — the new operator-action panel just reads from that.

Bug fixes (drive-by)

* `$__timeGroupAlias(col, '30s') AS time` was a SQL syntax error (the macro already adds `AS "time"`). Fixed in both Overview's throughput timeseries and the pre-existing copy in `indexing.json` from #4697 (observability: split Boxel Jobs into Indexing + Job Queue dashboards).
* New `boxel-synapse → synapse` Alloy relabel rule so dashboards filtering on `service="synapse"` work locally (matching the staging/prod ECS task family naming).

Three governing principles

1. System dashboards aggregate; entity dashboards filter.
2. Dashboard = context; action lives in its context.
3. Drill-downs go down or sideways, never up. Overview is the only dashboard that links downward; deep-dives link sideways to peers. Logs is the universal forensics terminal, with `entity:*` sideways links that pass `realm_url` / `matrix_user_id` to pivot rather than drill back up.

Test plan
* `cd packages/observability && ./scripts/apply.sh --env local`
* `/_grafana-reindex?realm=...`
* `apply.sh --env staging --dry-run` succeeds; the CloudWatch placeholder transformation is not applied (`env_name` guard); the env constant substitutes to `staging`.

🤖 Generated with Claude Code