observability: Overview redesign + Database dashboard + ECS / Prometheus wiring #4699

Open

lukemelia wants to merge 23 commits into main from grafana-stack/06-overview-and-service-stubs

Conversation


@lukemelia lukemelia commented May 7, 2026

Summary

Rebuilds the Overview dashboard around per-service KPIs and ECS resource cards, adds a new Database dashboard, wires CloudWatch ECS metrics (with a local-dev placeholder fallback so empty panels don't look broken), and moves per-realm reindex from Indexing to Realms (where the realm context lives). Most of this PR is the Overview reshaping; the rest is supporting plumbing.

Screenshots

Overview

Overview dashboard

Top to bottom: Realms / Users cumulative-growth KPIs · Web Requests / Job Queue / Synapse rate KPIs (smooth gradient lines, mean in legend, drilldown icons) · Realm Server / Prerender / Prerender Mgr / Worker ECS panels showing the local-dev placeholder · Synapse ECS panel pulling from local Prometheus (CPU / Mem / Up) · Postgres DB card (txn/min, drills to the new Database dashboard) · Active Alerts · indexing pipeline stats and throughput.

Database (new dashboard)

Database dashboard

KPIs (DB size, active connections, idle in txn, txn/min, cache hit ratio gauge, long-running queries) · connections-by-state stacked bar · transactions/min · top tables by size · cumulative writes by table · idle-in-txn backends · long-running queries · tables overdue for vacuum · markdown footer naming what's NOT covered without pg_stat_statements / pgstattuple / pg_exporter.

Realms (with the moved reindex panel)

Realms dashboard

Per-realm stat row · Grant Permission · Realm Permissions table · Indexing status · Operator Actions: Reindex this realm (new — moved here from Indexing). The realm comes from the existing ${realm_url} template variable; blast-radius gating disables the button while a job is in flight.

Indexing (trimmed to Full reindex only)

Indexing dashboard, trimmed

The Operator Actions panel now holds only Pending full-reindex indicator + Reindex ALL realms button. h: 11 → 4. The per-realm reindex moved to the Realms dashboard.

Overview dashboard (overview.json) — full reshape

| Row | Content |
| --- | --- |
| 0 | Realms + Users stat panels with cumulative-growth sparklines (drill to realms / users dashboards) |
| 1 | Web Requests (HTTP req/min + err/min, smooth gradient line) · Job Queue (Jobs/min + Failures/min, gradient line, drills to Job Queue dashboard) · Synapse (Events/min, gradient line, drills to vendored Synapse dashboard) |
| 2 | Five ECS resource cards (Realm Server / Prerender / Prerender Mgr / Worker / Synapse) showing CPU / Memory / Tasks · Postgres DB card (txn/min, drills to new Database dashboard) |
| 3 | Active Alerts — alertlist (absorbs the deleted worker-status.json) |
| 4 | Four indexing-pipeline stats (Pending / In-flight / Oldest Pending / Errors) |
| 5 | Indexing throughput timeseries |
| 6 | Markdown text panel linking to other dashboards by category |

Notable shape choices:

  • Boxel brand palette applied to series colors via field overrides (purple / blue / teal / red / yellow per boxel-ui/addon/src/styles/variables.css). Threshold-based panels stay on Grafana's named green/yellow/red since those colors carry semantic meaning.
  • Zero-fill on rate panels — LogQL queries append or vector(0); Postgres queries use $__timeGroupAlias(col, '1m', 0) fill-mode (see the SQL sketch after this list). Combined with lineInterpolation: smooth and gradientMode: opacity, the lines flow continuously through idle periods rather than breaking into segments.
  • Synapse Row 2 panel queries Prometheus, not CloudWatch — synapse exposes process/_synapse metrics via the synapse-prometheus datasource (local Prometheus locally, AMP in staging/production), giving us finer-grained data than CloudWatch's MemoryUtilization / RunningTaskCount.
  • ECS panels show local-dev placeholder — a second jq pass in apply.sh, gated on env_name == "local", swaps any panel with datasource.type == "cloudwatch" for a text markdown placeholder explaining the panel only works in staging/production. Committed JSON keeps real CloudWatch queries; staging/prod apply pushes them unchanged.
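A minimal sketch of the Postgres side of that zero-fill, assuming a `jobs` table with a `finished_at` timestamp and resolved/rejected statuses (illustrative names; the committed queries may differ):

```sql
-- $__timeGroupAlias(finished_at, '1m', 0) expands to a 1-minute bucket aliased "time";
-- the third argument is the Grafana Postgres fill-mode, which emits 0 for empty buckets
-- so the smooth gradient line stays continuous through idle periods.
SELECT
  $__timeGroupAlias(finished_at, '1m', 0),
  count(*) FILTER (WHERE status = 'resolved') AS "Jobs/min",
  count(*) FILTER (WHERE status = 'rejected') AS "Failures/min"
FROM jobs
WHERE $__timeFilter(finished_at)
GROUP BY 1
ORDER BY 1;
```

On the LogQL side, `or vector(0)` plays the same role: `vector(0)` is a constant zero-valued series, and `or` falls through to it whenever the left-hand query returns nothing.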

Database dashboard (database.json — new, uid boxeldatabase1)

SQL-only dashboard for the boxel application Postgres, queried directly via the existing boxel-db datasource — no pg_exporter dependency. Inspired by the upstream pg_exporter community dashboard but trimmed to panels operators act on:

  • KPIs: Database size, Active connections, Idle in Txn, Txn/min, Cache hit ratio gauge, Long-running queries
  • Connections by state stacked-bar timeseries
  • Top 20 tables by size (with dead-tuple ratio + last vacuum/analyze)
  • Cumulative writes by table + Idle-in-txn backends
  • Long-running queries + Tables overdue for vacuum
  • Markdown panel naming what's NOT covered (slow queries need pg_stat_statements; bloat needs pgstattuple; WAL/replication need pg_exporter)

Overview's Postgres DB stat drills here.
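Two of the KPI-row queries, sketched against the standard pg_stat views (the shape is illustrative, not the exact dashboard SQL):

```sql
-- Cache hit ratio: share of block reads served from shared buffers.
SELECT round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname = current_database();

-- Idle-in-transaction backends, as surfaced in the idle-in-txn panels.
SELECT pid, usename, now() - xact_start AS txn_age, left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
```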

Per-service stub dashboards (service-*.json)

Minimal per-service deep-dives tagged service:<name>, one each for realm-server / prerender-server / prerender-manager / worker. Each has a service-specific top-row chart (HTTP request rate from httpLogging exit lines for the three HTTP services; indexed-files/sec + error-rate from [indexing-progress] events for worker) plus a Loki-filtered logs panel below.

Overview's Row 2 ECS cards link into these.

env constant template variable

New env constant template variable (__ENV__) substituted by apply.sh to local / staging / production at apply time. Used in CloudWatch dimension values like boxel-realm-server-${env} and the local-only placeholder gate. Same substitution mechanism as the existing __REALM_SERVER_URL__ and REPLACE_AT_APPLY_TIME placeholders.
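A standalone sketch of that substitution (the real apply.sh folds it into its existing jq walk; the file names here are hypothetical):

```sh
# __ENV__ -> local | staging | production, applied to every string in the dashboard JSON.
ENV_NAME=local   # from --env
jq --arg env "$ENV_NAME" '
  walk(if type == "string" then gsub("__ENV__"; $env) else . end)
' dashboard.json > dashboard.resolved.json
```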

Reindex panel relocation

The per-realm reindex action moves from the system-wide Indexing dashboard to the realm-scoped Realms dashboard. The "Reindex ALL realms" system action stays on Indexing. Both keep the disable-while-in-flight blast-radius gating.

The Realms dashboard already had a ${realm_url} template variable feeding the rest of its panels — the new operator-action panel just reads from that.

Bug fixes (drive-by)

  • $__timeGroupAlias(col, '30s') AS time was a SQL syntax error (the macro already adds AS "time"). Fixed in both Overview's throughput timeseries and the pre-existing copy in indexing.json from #4697 (observability: split Boxel Jobs into Indexing + Job Queue dashboards); the fix is sketched below.
  • boxel-synapse → synapse Alloy relabel rule so dashboards filtering on service="synapse" work locally (matching the staging/prod ECS task family naming).
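The alias fix in miniature (`boxel_index` stands in for the real table and column; only the trailing alias changes):

```sql
-- Broken: $__timeGroupAlias already expands to floor(...) AS "time", so the
-- extra alias yields `... AS "time" AS time` and Postgres rejects it with
-- "syntax error at or near AS":
--   SELECT $__timeGroupAlias(created_at, '30s') AS time, count(*) ...
-- Fixed: let the macro supply the alias.
SELECT $__timeGroupAlias(created_at, '30s'), count(*) AS indexed
FROM boxel_index
WHERE $__timeFilter(created_at)
GROUP BY 1
ORDER BY 1;
```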

Three governing principles

  1. System dashboards aggregate; entity dashboards filter.
  2. Dashboard = context; action lives in its context. Operator actions never centralize — per-realm reindex on Realms, system reindex on Indexing.
  3. Drill-downs go down or sideways, never up. Overview links down; deep-dives link sideways to peers; Logs is the universal terminal with entity:* sideways links that pass realm_url / matrix_user_id to pivot rather than drill back up.

Test plan

  • Apply locally: cd packages/observability && ./scripts/apply.sh --env local
  • Open Overview — Row 0 shows Realms + Users with sparklines, Row 1 shows the three KPI gradient lines, Row 2 ECS cards show local-dev placeholder text (not CloudWatch error triangles), Postgres DB card shows current txn/min, Synapse Row 2 card shows real CPU / Mem / Up from local Prometheus.
  • Click each drill-down icon — Web Requests → realm-server stub, Job Queue → Job Queue dashboard, Synapse → vendored Synapse, Postgres DB → Database, Realms → Realms, Users → Users.
  • Open Database — KPIs + tables render against local boxel-db; "Top Tables" shows boxel_index, boxel_index_working etc.
  • Open Realms, pick a realm → "Operator Actions: Reindex this realm" panel renders with realm-scoped pending/in-flight; clicking the "Reindex {realm}" button hits /_grafana-reindex?realm=....
  • Open Indexing — "Operator Actions: Full reindex" shows the system-wide button only.
  • Apply against a real hosted Grafana (staging dry-run) — apply.sh --env staging --dry-run succeeds; the CloudWatch placeholder transformation is not applied (env_name guard); the env constant substitutes to staging.

Screenshot host: a dedicated pr-4699-screenshots orphan branch holds the four PNGs. Safe to delete the branch once this PR merges.

🤖 Generated with Claude Code


github-actions Bot commented May 7, 2026

Observability diff (vs staging)

No dashboard / folder changes detected against the staging Grafana.

(Run: https://github.com/cardstack/boxel/actions/runs/25528541266)


Copilot AI left a comment


Pull request overview

Adds a top-level Grafana “Overview” dashboard plus new per-service “stub” dashboards (realm-server, prerender-server, prerender-manager, worker) to establish a navigable dashboard tree for Boxel operational monitoring, and removes the now-redundant Worker Status dashboard.

Changes:

  • Added overview.json with service liveness stats, an alertlist panel, indexing pipeline stats, indexing throughput, and a dashboard directory panel.
  • Added per-service deep-dive stub dashboards under boxel-status/ that focus on request/activity rate + filtered logs.
  • Deleted worker-status.json (its alertlist is replaced by Overview’s “Active Alerts”).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| packages/observability/grafanactl/resources/dashboards/boxel-status/worker-status.json | Removes the old single-panel Worker Status dashboard. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-worker.json | Adds a worker stub with activity-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-realm-server.json | Adds a realm-server stub with HTTP request-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-prerender-server.json | Adds a prerender-server stub with HTTP request-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/service-prerender-manager.json | Adds a prerender-manager stub with HTTP request-rate + filtered Loki logs. |
| packages/observability/grafanactl/resources/dashboards/boxel-status/overview.json | Adds a new top-level Overview dashboard that links to service stubs and key workflow dashboards. |


@lukemelia lukemelia force-pushed the grafana-stack/06-overview-and-service-stubs branch from 88004b7 to 6cddc7b on May 7, 2026 18:47
@lukemelia lukemelia force-pushed the grafana-stack/05-realms-and-users-entities branch from 26934fb to c027383 on May 7, 2026 19:24
@lukemelia lukemelia changed the base branch from grafana-stack/05-realms-and-users-entities to main on May 7, 2026 19:58
@lukemelia lukemelia force-pushed the grafana-stack/06-overview-and-service-stubs branch from 6cddc7b to 8c9b92d on May 7, 2026 20:10
@lukemelia lukemelia marked this pull request as ready for review on May 7, 2026 20:12
@lukemelia lukemelia requested review from a team and backspace on May 7, 2026 20:12
lukemelia and others added 12 commits May 7, 2026 18:59
The navigation backbone for the rationalized dashboard tree. Lands last
because it links to dashboards introduced in #4696, #4697, #4698.

* `overview.json` — top-of-tree dashboard tagged `overview`. Five
  service-health stats (Loki log volume in last 5m as a liveness proxy,
  with drill-down panel-link to each service's stub), full-width
  alertlist (replaces the deleted worker-status.json), four indexing
  pipeline stats, an indexing throughput timeseries, and a markdown
  text panel listing the other dashboards by category.

* `service-{realm-server,prerender-server,prerender-manager,worker}.json`
  — minimal per-service deep-dives tagged `service:<name>`. Each is a
  Loki-filtered logs panel plus a service-specific top-row chart:
    * realm-server: HTTP request rate (total/4xx/5xx) parsed from the
      `httpLogging` exit-log lines (`--> METHOD ACCEPT URL: STATUS`)
    * prerender / prerender-manager: same HTTP rate chart
    * worker: indexed-files-per-second + error-rate from the
      indexing-progress event stream
  CloudWatch ECS metrics (CPU / memory / RunningTaskCount) are
  intentionally deferred until cluster + task-family naming is
  standardized in observability config — these stubs are the natural
  home to add them.

* `worker-status.json` deleted — its sole alertlist panel is folded
  into Overview row 2.

The three governing principles for the rationalized tree are:

  1. System dashboards aggregate; entity dashboards filter.
  2. Dashboard = context; action lives in its context.
  3. Drill-downs go down or sideways, never up. Overview is the only
     dashboard that links downward; deep-dives link sideways to peers.
     Logs is the universal forensics terminal, with `entity:*` sideways
     links that pass `realm_url` / `matrix_user_id` to pivot rather
     than drill back up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small follow-ups after applying the new Overview dashboard locally:

* Drop the redundant `AS time` after `$__timeGroupAlias(col, '30s')` in
  the throughput timeseries queries. The macro already expands to
  `floor(...) AS "time"`, so appending `AS time` produces
  `... AS "time" AS time` — Postgres rejects with "syntax error at or
  near AS". Same bug fixed in `indexing.json` (pre-existing on main from
  #4697) so the Indexing dashboard's throughput panel renders too.

* Service-health stat panels: change the `null` threshold step from red
  to transparent and add a `0` step at red, so "no streams matched"
  (e.g. local dev with mise tasks not running) shows neutral, while
  "stream exists but silent for 5m" still alarms red.

* `Other dashboards` markdown panel: bump height 4 → 8 so the bullet
  list isn't clipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…w to $__range

* Alloy: rewrite the local docker-compose container name `boxel-synapse`
  to `service=synapse` so dashboards using `service="synapse"` work
  identically against local and staging/prod (where the ECS task family
  is already `synapse`).

* Overview service-health stats: change `count_over_time(... [5m])` to
  `count_over_time(... [$__range])`. The 5m window was a strict liveness
  proxy that only fits chatty staging/prod services — local-dev mise
  tasks emit in bursts (active clicking → 1k+ lines/min, idle → silent
  for minutes). With $__range the stat reflects log volume over the
  visible dashboard window, so a 1h panel catches bursty local activity
  while operators on staging/prod can narrow the time picker for strict
  liveness. Panel descriptions updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- textMode "value_and_name" → "value": the field name "Value #A" was
  Grafana's auto-name for an instant query result, with no semantic
  meaning. The panel title already names the service; the count below
  is what matters.
- "Prerender Server" → "Prerender" and "Prerender Manager" →
  "Prerender Mgr" so both fit in their stat columns instead of
  truncating to "Prerender ...".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…service metrics

Log-line count was a meaningless number; replace each service stat with
metrics an operator actually acts on:

* Realm Server / Prerender: HTTP request rate (httpLogging --> exit
  lines) and error rate (lines matching error|exception|fatal).
* Prerender Mgr: request rate from "proxying ..." entry lines (the
  manager doesn't emit httpLogging), plus error rate.
* Worker: jobs/min resolved, jobs/min rejected, queue depth (unfulfilled
  jobs with no active reservation). Single Postgres query, three columns.
* Synapse: PUT /send/m.room.message rate from synapse.access.http
  lines — the actual chat-send signal, not internal Synapse chatter.

All Loki rates use rate([5m]) * 60 for smoothed per-minute. Stat panels
display vertically with field name + value (textMode "value_and_name");
field-level overrides invert thresholds so error/failure fields turn red
on any nonzero, while activity rates stay neutral when idle. Panel
height bumped 4→5 to fit 2-3 stacked stats; everything below shifts +1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… Synapse)

Replace the 5 single-stat panels in row 1 with 3 timeseries KPIs at
w=8 each. Each shows two metrics + the mean over the visible time
range in the legend:

* Realm Server (line chart): HTTP req/min and HTTP err/min from
  Loki httpLogging exit lines.
* Worker (stacked bars): Jobs/min (resolved, green) and Failures/min
  (rejected, red) from Postgres `jobs` grouped by 1m bucket.
* Synapse (bars): Msgs/min from synapse access-log PUT /send/m.room.message.

Prerender + Prerender Mgr drop out of row 1 — they'll come back in
row 2 as ECS resource-utilization panels (CPU / Memory / Instances).

Active Alerts and the indexing-stat row shift down by 3 to make room
for the taller timeseries panels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the resource-utilisation row beneath the KPI graphs:

* 5 service stat panels (Realm Server, Prerender, Prerender Mgr, Worker,
  Synapse) at w=4 each, each showing CPU / Memory / Tasks from
  CloudWatch AWS/ECS metrics. ClusterName=${env}, ServiceName matches
  staging/production naming:
    boxel-realm-server-${env}
    boxel-prerender-server-${env}
    boxel-prerender-manager-${env}
    boxel-worker-${env}
    synapse-${env}                 ← no boxel- prefix
  Field overrides rename CloudWatch metric names to short labels (CPU,
  Mem, Tasks) and apply per-field thresholds — CPU/Mem yellow at 70%,
  red at 90%; Tasks red at 0 (service down) green at 1+.

* 1 DB stat panel (w=4) — Req/min computed as `(xact_commit + xact_rollback)`
  divided by the time elapsed since `pg_postmaster_start_time()`, against
  the boxel database via the existing boxel-db Postgres datasource.
  This is "average since postmaster start", not a rolling window.

`env` is a new constant template variable substituted at apply time:
__ENV__ → local | staging | production. apply.sh's existing jq walk
gets a third substitution clause; render-config-time docs updated.

Locally the 5 CloudWatch panels render "No data" (no AWS access);
staging/production wire up automatically once this lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xn/min"

The metric is transactions per minute (xact_commit + xact_rollback) — calling
it "Req/min" was sloppy. Title bumped to "Postgres DB" so it's clear at a
glance which datastore the panel covers (vs. e.g. Redis or Synapse's
internal Postgres).
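Spelled out as SQL, the stat is roughly the following (a hedged sketch; the committed query may differ in shape):

```sql
-- Average transactions per minute since postmaster start; greatest() guards
-- against division by zero right after a restart.
SELECT (xact_commit + xact_rollback)
       / greatest(extract(epoch FROM now() - pg_postmaster_start_time()) / 60, 1)
       AS txn_per_min
FROM pg_stat_database
WHERE datname = current_database();
```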

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The KPI (jobs/min + failures/min) reflects all queue activity, not just
indexing — "Job Queue" reads cleaner. Row 2 still has a "Worker" panel
showing the ECS task's resource utilisation, which is genuinely
worker-specific.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…equests"

The req-rate / err-rate KPI is fundamentally about HTTP traffic into the
realm-server — naming it "Web Requests" reflects what's being measured.
Row 2 keeps a "Realm Server" panel for the ECS-task-level CPU/Mem/Tasks,
which is genuinely realm-server-specific.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `boxel-status/database.json` (uid `boxeldatabase1`) — Postgres
health for the boxel application database, queried directly via the
existing boxel-db datasource. No `pg_exporter` dependency.

Inspired by the upstream pg_exporter community dashboard (24298) but
trimmed to the panels operators actually act on:

* Row 1 (KPIs): Database size, Active connections, Idle-in-transaction
  count, Txn/min, Cache hit ratio gauge, Long-running queries count
* Row 2: Connections by state (stacked bar), Transactions/min over time
* Row 3: Top 20 tables by total size — incl. dead-tuple ratio + last
  vacuum/analyze
* Row 4: Cumulative writes per table (top 8) + Idle-in-txn backends
  (table)
* Row 5: Long-running queries (table) + Tables overdue for vacuum
  (table)
* Row 6: A markdown panel naming what's NOT covered (slow queries
  need pg_stat_statements; bloat needs pgstattuple; WAL/replication
  need pg_exporter — none currently enabled).

Overview's Postgres DB stat now has a panel link → /d/boxeldatabase1
so operators can drill from the single-number health stat into the
detailed view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the Grafana timeseries gradient modes
(grafana.com/docs/grafana/latest/visualizations/panels-visualizations/visualizations/time-series/#gradient-mode):

* gradientMode: none → opacity. Each series fades from full line color
  down through transparent — gives the panel a glow under each line
  rather than a flat-color fill.
* fillOpacity: 10 → 50. The opacity gradient needs a bit of fill height
  to show; 50 makes the gradient legible without drowning the line.
* lineInterpolation: linear → smooth. Web Requests is a rate (smoothed
  over 1m already); a smooth curve reads more naturally than zigzag.

Other timeseries panels (Job Queue stacked bars, Synapse bar, Indexing
throughput, Connections-by-state stacked bar) intentionally untouched —
gradients on stacked bars get visually muddy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lukemelia lukemelia force-pushed the grafana-stack/06-overview-and-service-stubs branch from d8dfcfe to 2e2314a on May 7, 2026 22:59
lukemelia and others added 5 commits May 7, 2026 19:16
Boxel uses Matrix as an event bus, not just for chat — most synapse
traffic locally is `app.boxel.realm-event` (realm sync), not
`m.room.message` (chat). Filtering only on `m.room.message` made the
panel show 'No data' even when synapse was busy.

Match all `Received request: PUT .../send/` access-log lines instead.
Legend label "Msgs/min" → "Events/min" to reflect the broader count.
Description spells out what the count includes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Boxel realm-event traffic is bursty — events arrive on the order of
1 per minute during normal dev activity. A 1-minute rate window often
catches zero hits and shows 'No data'; 5 minutes matches the
smoothing already used by the Realm Server (Web Requests) panel and
gives the Synapse panel something to render whenever there's been
activity in the last few minutes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match the Web Requests panel style: smooth line, fillOpacity 50,
gradientMode "opacity". Synapse traffic is sparse and bursty locally,
so the bar look ended up as scattered single bars; a smooth line with
a fading area underneath communicates the rate trend better.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add visible drilldown link icon (top-left of each panel title) on the
three Row 1 timeseries panels:

* Web Requests → /d/boxel-svc-realm-server
* Job Queue    → /d/boxel-svc-worker
* Synapse      → /d/000000012

For stat panels, fieldConfig.defaults.links renders as a clickable
panel header link. For timeseries panels, the same field becomes a
data link that's only visible when you hover/click an individual point.
Adding a top-level panel.links array gives an always-visible drilldown
icon, restoring the click-through experience the stat versions had
before the row 1 redesign.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Append `or vector(0)` to the Synapse panel's LogQL so timestamps with
no matching log lines come back as 0 instead of null. Combined with
the existing `lineInterpolation: smooth`, the line now flows
continuously through idle periods and only spikes when there's
real activity, instead of breaking into disjoint segments.

`or vector(0)` is the standard PromQL idiom for "no-data → 0":
vector(0) returns a constant 0-valued series at every evaluation
timestamp, and `or` falls through to it whenever the left side has
no result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lukemelia and others added 6 commits May 7, 2026 19:27
* Web Requests: append `or vector(0)` to both LogQL exprs (HTTP req/min
  + HTTP err/min) so idle 1-minute windows render as 0 instead of
  null.

* Job Queue: same shape change. SQL gets `$__timeGroupAlias(col, '1m', 0)`
  — the third macro arg is Postgres datasource fill-mode, "0" expands
  empty buckets to zero rows. Bar viz → smooth line with opacity
  gradient (drawStyle line, fillOpacity 50, gradientMode opacity,
  lineInterpolation smooth, lineWidth 2). Stacking flipped to none —
  with two lines the visual stack adds no info, while overlay lets
  Failures/min red show through against Jobs/min green.

Result: all three Row 1 KPI panels (Web Requests, Job Queue, Synapse)
share the same gradient-line aesthetic and continuously flow through
idle periods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without AWS credentials, the 5 ECS resource panels (Realm Server,
Prerender, Prerender Mgr, Worker, Synapse) on the Overview show a
"No data" + query-error triangle locally. Visually indistinguishable
from a real broken panel.

Add a second jq pass in apply.sh, gated on `env_name == "local"`, that
finds any panel whose datasource.type is `cloudwatch` and replaces it
with a `text` markdown panel containing an explanatory message:

  ☁️ AWS CloudWatch — staging/production only
  The boxel-cloudwatch datasource has no AWS credentials in local dev.
  ECS resource utilisation (CPU / Memory / Tasks) renders correctly
  when this dashboard is applied to a hosted Grafana.

The id, title, and gridPos are preserved, so the layout is unchanged.
Match condition uses both `datasource.type == "cloudwatch"` AND a
gridPos object presence so we only catch panel-level objects, not
target-level objects (which also carry a datasource ref but no
gridPos).

Committed JSON is unchanged — staging/production still push real
CloudWatch queries through the existing flow.
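A standalone sketch of that second pass (the in-tree version runs inside apply.sh's existing pipeline, and the markdown body is abbreviated here):

```sh
# Swap any panel-level cloudwatch panel for a text placeholder, keeping id /
# title / gridPos so the layout is unchanged. Panel-level objects are told
# apart from target-level ones by the presence of gridPos.
jq '
  walk(
    if type == "object" and has("gridPos")
       and ((.datasource // {}) | type == "object" and .type == "cloudwatch")
    then { id, title, gridPos,
           type: "text",
           options: { mode: "markdown",
                      content: "☁️ AWS CloudWatch — staging/production only" } }
    else .
    end
  )
' dashboard.json > dashboard.local.json
```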

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tead of CloudWatch

The vendored Synapse exposes process_* and synapse_* metrics on
/_synapse/metrics. The synapse-prometheus datasource scrapes those
(local Prometheus locally, AMP in staging/production), so we don't
need CloudWatch ECS metrics for this service — we already have
finer-grained data straight from the process.

Swap the 3 panel targets:

  CPU  : rate(process_cpu_seconds_total{job="synapse"}[5m]) * 100
  Mem  : process_resident_memory_bytes{job="synapse"}
  Up   : count(up{job="synapse"} == 1)

Mem unit changes from `percent` (CloudWatch MemoryUtilization) to
`decbytes` (raw RSS) — synapse's prometheus_client doesn't know the
container memory limit so a percent ratio isn't available without
extra plumbing. "Tasks" renamed to "Up" because count(up==1) is a
healthy-scrape-target count, not an ECS RunningTaskCount.

The Synapse panel is now functional locally too — no longer caught
by apply.sh's CloudWatch-→-placeholder transform since it uses a
`prometheus` datasource.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new stat panels at the top of the Overview, each showing the
running total (big number) plus a cumulative-growth sparkline:

* Realms (id 21, x=0..11): SELECT MIN(indexed_at) per realm_url from
  realm_meta, then ROW_NUMBER() over the sorted creation timestamps
  for the cumulative count (sketched below). Drill-through: /d/boxelrealms001.
* Users  (id 22, x=12..23): same shape, ordered by users.created_at.
  Drill-through: /d/boxelusers0001.

Both panels carry a `timeFrom: "5y"` panel-level override so the
sparkline always shows the full growth history regardless of where
the dashboard's time picker is set — narrowing to "Last 1h" would
otherwise leave the sparkline empty if no creations happened in
the window.

Existing rows shift +8 to make room. Also fixes the Job Queue panel
drill-down: was pointing at /d/boxel-svc-worker (the Worker service
deep-dive); now points at /d/boxeljobqueue1, the Job Queue dashboard
that actually owns generic worker-queue metrics across all job types.
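The cumulative-count query sketched per the Realms bullet above (the committed SQL may differ):

```sql
-- One row per realm at its first-indexed timestamp; ROW_NUMBER() over the
-- sorted timestamps yields the running total the sparkline plots.
SELECT created AS "time",
       row_number() OVER (ORDER BY created) AS "Realms"
FROM (
  SELECT min(indexed_at) AS created
  FROM realm_meta
  GROUP BY realm_url
) first_seen
ORDER BY created;
```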

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Boxel-ui defines its brand palette in packages/boxel-ui/addon/src/styles/variables.css.
Wire the same hex codes into Overview panel field overrides so the
KPI series read as branded rather than Grafana-default:

* Realms          → #6638ff (boxel-purple — brand primary)
* Users           → #00ffba (boxel-teal — secondary accent)
* HTTP req/min    → #0069f9 (boxel-blue — info)
* HTTP err/min    → #ff5050 (boxel-red — danger)
* Jobs/min        → #37eb77 (boxel-green — success)
* Failures/min    → #ff5050 (boxel-red — danger)
* Events/min      → #6638ff (boxel-purple — synapse maps to brand)
* Indexing arrived/started/completed → blue/yellow/teal

Threshold-based panels (ECS resource cards, indexing pipeline stats)
keep Grafana's named green/yellow/red — those colors are semantically
correct ("this is bad/ok/great") and close enough to the boxel palette
that swapping hex codes adds visual noise without clarifying intent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the "operator actions stay contextual" principle: the per-realm
reindex action belongs on the Realms dashboard (which is realm-scoped
via the `${realm_url}` template variable), not on the system-wide
Indexing dashboard.

Indexing dashboard (Operator Actions panel, id 4):
  * Drop: realm_picker, pending, in_flight, oldest_pending_human,
    last_reindex_status, btn_reindex_realm form elements
  * Drop: the realm-registry SELECT target that fed the picker
  * Drop: elementValueChanged hook that mirrored picker → URL var
  * Keep: pending_full_reindex indicator + btn_reindex_all button
  * SQL trims to a single COUNT(*) for full-reindex jobs in flight (sketched below)
  * Title: "Operator Actions" → "Operator Actions: Full reindex"
  * h=11 → h=4 (form had 8 elements, now has 2; no longer needs the
    deep visual footprint that overlapped the stat row below)

Realms dashboard (new panel id 100):
  * New "Operator Actions: Reindex this realm" volkovlabs-form-panel
    placed at y=22 between "Indexing status (this realm)" and
    "Recent Indexing Errors"
  * Reads the realm from the existing `${realm_url}` template variable
    (no realm_picker form element needed — the dashboard already has
    a Realm dropdown at the top)
  * SQL adapted from the Indexing version: ${full_index_realm} →
    ${realm_url}, dropped pending_full_reindex column
  * Keeps the same blast-radius (pending / in_flight / oldest_pending)
    + last_reindex_status indicators and disable-while-in-flight guard
  * Layout below shifts +8 to make room
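The trimmed full-reindex count, roughly (the column names and the job-type value are assumptions, not the committed SQL):

```sql
-- Single blast-radius indicator: how many system-wide reindex jobs are
-- currently unfinished. Drives the disable-while-in-flight guard.
SELECT count(*) AS full_reindex_in_flight
FROM jobs
WHERE job_type = 'full-reindex'   -- assumed discriminator
  AND finished_at IS NULL;        -- assumed "still in flight" condition
```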

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lukemelia lukemelia changed the title from "observability: add Overview + per-service stub dashboards" to "observability: Overview redesign + Database dashboard + ECS / Prometheus wiring" on May 7, 2026

github-actions Bot commented May 8, 2026

Host Test Results

1 file, 1 suite, 1h 44m 32s ⏱️
2,634 tests: 2,619 ✅ · 15 💤 · 0 ❌
2,653 runs: 2,638 ✅ · 15 💤 · 0 ❌

Results for commit 4e3a2c4.

Realm Server Test Results

1 file (±0), 1 suite (+1), 17m 57s ⏱️ (+17m 57s)
1,285 tests (+1,285): 1,285 ✅ (+1,285) · 0 💤 (±0) · 0 ❌ (±0)
1,364 runs (+1,364): 1,364 ✅ (+1,364) · 0 💤 (±0) · 0 ❌ (±0)

Results for commit 4e3a2c4. ± Comparison against earlier commit d5b9bc5.
