Problem
Aperio operators have no visibility into Aperio itself. Critical operational questions today require ad-hoc SQL or log greps:
"When did each connector last successfully sync, and is the data fresh?"
"Are SIEM deliveries succeeding, retrying, or dead-lettered?"
"Which detection rules ran in the last hour, and how long did each take?"
"Did the hourly scheduled job for tenant X actually run?"
"Is the ingestion queue backing up?"
The data is mostly already in Postgres (IntegrationConnection.lastSyncAt, SiemDelivery.status/attempts, IngestionJob.status/attempts, SiemDestination.deliveriesOk/Fail); it just isn't surfaced.
Operators need to trust the data before they act on findings; today that trust is implicit and unverifiable.
Goals
Customer-facing rule-run audit log: every check Aperio runs against a tenant is queryable ("rule X ran at T, evaluated N events, opened M findings, took P milliseconds").
Prometheus + OpenTelemetry exposition for ops teams running their own monitoring.
Status page at /status (and JSON) for embedding into upstream dashboards.
Non-goals
Not building a generic APM (Datadog, Honeycomb territory): Aperio exposes its own health, and customers point their existing tools at it.
Not building a full SLO management system (Nobl9 territory): just expose primitives.
Proposed design
New schema
Prometheus / OpenTelemetry exposition
/metrics endpoint (Prometheus exposition format) exposes:
OTel SDK initialized in the Go server + workers; OTLP exporter configurable via standard OTEL_EXPORTER_OTLP_ENDPOINT. Spans on every connector sync + every rule evaluation.
Health & freshness dashboard
/admin/health page (admin role required):
Connectors panel: row per IntegrationConnection showing provider icon, last sync, freshness traffic light (green < 1h, yellow < 6h, red > 6h or configurable per-provider), last 5 syncs sparkline, last error.
SIEM destinations panel: row per SiemDestination showing delivery success rate, queue depth, last delivery, last error.
Rule-run audit log (customer-facing)
/admin/audit/rules (visible to ADMIN + SECURITY_ANALYST):
Status page
/status (public, with optional auth):
RuleRun/ConnectorSyncRun failure rates).
/status.json for embedding.
Alerting hooks
Self-monitoring rules that fire as Aperio findings (severity tagged aperio.platform.*) when:
These show up in the existing finding flow and route through the existing SIEM/workflow dispatchers, so we eat our own dog food.
Phasing
RuleRun + ConnectorSyncRun schemas; instrument workers to write them; /admin/health page (connectors + ingestion + SIEM panels)
/metrics Prometheus exposition; OTel SDK + OTLP exporter; rule-run audit log UI
Open questions
RuleRun and ConnectorSyncRun: these grow fast; downsample after N days or rely on org dataRetentionDays?
Should /metrics be unauthenticated (Prometheus convention) or gated by API token (Add worker leases for durable queues #8)?