Product observability: connector health, SIEM delivery, rule-run audit, /metrics #54

@dcoln25-writer

Problem

Aperio operators have no visibility into Aperio itself. Critical operational questions today require ad-hoc SQL or log greps:

  • "When did each connector last successfully sync, and is the data fresh?"
  • "Are SIEM deliveries succeeding, retrying, or dead-lettered?"
  • "Which detection rules ran in the last hour, and how long did each take?"
  • "Did the hourly scheduled job for tenant X actually run?"
  • "Is the ingestion queue backing up?"

The data is mostly already in Postgres (IntegrationConnection.lastSyncAt, SiemDelivery.status/attempts, IngestionJob.status/attempts, SiemDestination.deliveriesOk/Fail) — it just isn't surfaced.

Operators need to trust the data before they act on findings; today that trust is implicit and unverifiable.

Goals

  1. Connector health & freshness dashboard — per-connector last-sync, latency, error rate, scope freshness.
  2. SIEM delivery health — per-destination success rate, retry depth, dead-letter queue depth.
  3. Ingestion pipeline health — queue depth, processing rate, dead-letter rate, per-rule execution time.
  4. Customer-facing rule-run audit log — every check Aperio runs against a tenant is queryable ("rule X ran at T, evaluated N events, opened M findings, took P milliseconds").
  5. Prometheus + OpenTelemetry exposition for ops teams running their own monitoring.
  6. Status page at /status (and JSON) for embedding into upstream dashboards.

Non-goals

  • Not building a generic APM (Datadog, Honeycomb territory): Aperio exposes its own health, and customers point their existing tools at it.
  • Not building a full SLO management system (Nobl9 territory) — just expose primitives.

Proposed design

New schema

enum RuleRunStatus {
  STARTED
  SUCCEEDED
  FAILED
  SKIPPED
}

model RuleRun {
  id                String   @id @default(cuid())
  organizationId    String   @map("organization_id")
  ruleKey           String   @map("rule_key") @db.VarChar(160)
  ruleVersion       String?  @map("rule_version") @db.VarChar(32)
  integrationId     String?  @map("integration_id")
  status            RuleRunStatus
  eventsEvaluated   Int      @default(0) @map("events_evaluated")
  findingsCreated   Int      @default(0) @map("findings_created")
  findingsReopened  Int      @default(0) @map("findings_reopened")
  findingsResolved  Int      @default(0) @map("findings_resolved")
  durationMs        Int?     @map("duration_ms")     // null while status is STARTED
  errorMessage      String?  @map("error_message") @db.VarChar(500)
  startedAt         DateTime @map("started_at")
  finishedAt        DateTime? @map("finished_at")    // null while status is STARTED
  organization      Organization @relation(...)
  @@index([organizationId, startedAt])
  @@index([organizationId, ruleKey, startedAt])
  @@index([organizationId, status, startedAt])
  @@map("rule_runs")
}

model ConnectorSyncRun {
  id                  String   @id @default(cuid())
  organizationId      String   @map("organization_id")
  integrationId       String   @map("integration_id")
  triggeredBy         String   @map("triggered_by") @db.VarChar(40)  // "scheduled" | "manual" | "webhook"
  status              String   @db.VarChar(20)                       // "succeeded" | "failed" | "partial"
  eventsIngested      Int      @default(0) @map("events_ingested")
  newAssetsObserved   Int      @default(0) @map("new_assets_observed")
  apiCallsMade        Int      @default(0) @map("api_calls_made")
  durationMs          Int      @map("duration_ms")
  errorMessage        String?  @map("error_message") @db.VarChar(500)
  startedAt           DateTime @map("started_at")
  finishedAt          DateTime @map("finished_at")
  organization        Organization @relation(...)
  integration         IntegrationConnection @relation(...)
  @@index([organizationId, integrationId, startedAt])
  @@index([organizationId, status, startedAt])
  @@map("connector_sync_runs")
}

Prometheus / OpenTelemetry exposition

/metrics endpoint (Prometheus exposition format) exposes:

aperio_connector_last_sync_seconds{org, provider, integration_id}
aperio_connector_sync_duration_seconds{org, provider, integration_id, status}
aperio_ingestion_queue_depth{org, status}
aperio_ingestion_processed_total{org, status}
aperio_siem_delivery_attempts_total{org, destination_kind, status}
aperio_siem_delivery_queue_depth{org, status}
aperio_rule_execution_duration_seconds{org, rule_key, status}    # histogram
aperio_rule_findings_created_total{org, rule_key}
aperio_workflow_delivery_attempts_total{org, destination_kind, status}  # post-#6
aperio_api_token_invocations_total{org, scope, status}                  # post-#8
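
To make the wire format concrete, here is a stdlib-only sketch that renders one gauge family in the Prometheus text exposition format. The `Sample` type, `renderGauge`, `escapeLabel`, the help string, and the label values are hypothetical; a real implementation would more likely use a Prometheus client library, which this issue does not name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is one labelled gauge value (hypothetical type for this sketch).
type Sample struct {
	Labels map[string]string
	Value  float64
}

// escapeLabel escapes backslash, double quote, and newline in label values,
// as the text exposition format requires.
func escapeLabel(v string) string {
	return strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`).Replace(v)
}

// renderGauge writes one metric family: HELP and TYPE lines, then one line per sample.
// Help text is assumed to contain no backslashes or newlines.
func renderGauge(name, help string, samples []Sample) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n# TYPE %s gauge\n", name, help, name)
	for _, s := range samples {
		keys := make([]string, 0, len(s.Labels))
		for k := range s.Labels {
			keys = append(keys, k)
		}
		sort.Strings(keys) // sorted label order keeps the output deterministic
		pairs := make([]string, 0, len(keys))
		for _, k := range keys {
			pairs = append(pairs, fmt.Sprintf(`%s="%s"`, k, escapeLabel(s.Labels[k])))
		}
		fmt.Fprintf(&b, "%s{%s} %g\n", name, strings.Join(pairs, ","), s.Value)
	}
	return b.String()
}

func main() {
	// "acme" is a made-up org label value.
	fmt.Print(renderGauge("aperio_ingestion_queue_depth", "Jobs in the ingestion queue.", []Sample{
		{Labels: map[string]string{"org": "acme", "status": "queued"}, Value: 42},
	}))
}
```

Each sample becomes one `name{labels} value` line, so label cardinality (org × integration_id, for example) directly determines the size of every scrape.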

The OTel SDK is initialized in the Go server and workers; the OTLP exporter is configured via the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable. Spans are emitted for every connector sync and every rule evaluation.

Health & freshness dashboard

/admin/health page (admin role required):

  • Connectors panel: row per IntegrationConnection showing provider icon, last sync, freshness traffic light (green < 1h, yellow < 6h, red ≥ 6h by default; thresholds configurable per provider), last 5 syncs sparkline, last error.
  • SIEM destinations panel: row per SiemDestination showing delivery success rate, queue depth, last delivery, last error.
  • Workflow destinations panel (post-#6): same shape for ticketing/chatops.
  • Ingestion panel: queue depth gauge, processing-rate sparkline, dead-letter count with drill-down.
  • Detection rules panel: top 5 slowest rules (p95), top 5 noisiest rules (most findings/hour), rules that haven't run in 24h (possibly broken).

Rule-run audit log (customer-facing)

/admin/audit/rules (visible to ADMIN + SECURITY_ANALYST):

  • Filter by rule key, integration, status, time window.
  • Per-run drill-down: trigger event, evaluation context (truncated), result.
  • Export to CSV/JSON for compliance evidence (feeds into #5's evidence pack).

Status page

/status (public, with optional auth):

  • Per-component status: API, ingestion worker, SIEM dispatcher, workflow dispatcher (#6), web console.
  • Recent incidents (manual posting + auto-detected from RuleRun / ConnectorSyncRun failure rates).
  • JSON variant at /status.json for embedding.
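
One possible shape for the /status.json response, sketched with the standard library only. The JSON field names, the status vocabulary ("operational", "degraded", "down"), and the `statusHandler`, `overall`, and `check` names are assumptions for illustration; the issue does not define the payload.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ComponentStatus is one row of the status page (hypothetical payload shape).
type ComponentStatus struct {
	Name   string `json:"name"`
	Status string `json:"status"` // assumed vocabulary: "operational", "degraded", "down"
}

// StatusResponse is the assumed body of /status.json.
type StatusResponse struct {
	Overall    string            `json:"overall"`
	Components []ComponentStatus `json:"components"`
}

// overall reports the worst status across components; an empty list is "operational".
func overall(cs []ComponentStatus) string {
	rank := map[string]int{"operational": 0, "degraded": 1, "down": 2}
	worst := "operational"
	for _, c := range cs {
		if rank[c.Status] > rank[worst] {
			worst = c.Status
		}
	}
	return worst
}

// statusHandler serves the JSON variant; check supplies the current component states.
func statusHandler(check func() []ComponentStatus) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		cs := check()
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(StatusResponse{Overall: overall(cs), Components: cs})
	}
}

func main() {
	h := statusHandler(func() []ComponentStatus {
		return []ComponentStatus{
			{Name: "api", Status: "operational"},
			{Name: "siem-dispatcher", Status: "degraded"},
		}
	})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/status.json", nil))
	fmt.Print(rec.Body.String())
}
```

Deriving the overall state from the worst component keeps the JSON easy to embed: an upstream dashboard can read a single field and only drill into components when it is not "operational".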

Alerting hooks

Self-monitoring rules that fire as Aperio findings (severity tagged aperio.platform.*) when:

  • Connector hasn't synced in 6h for a CONNECTED integration.
  • SIEM destination DLQ has > 10 messages.
  • Ingestion queue depth > N or oldest queued job > 1h.
  • Detection rule duration p95 > 5s.

These show up in the existing finding flow and route through the existing SIEM/workflow dispatchers, so Aperio dogfoods its own alerting path.

Phasing

Phase  Scope
P1     RuleRun + ConnectorSyncRun schemas; instrument workers to write them; /admin/health page (connectors + ingestion + SIEM panels)
P2     /metrics Prometheus exposition; OTel SDK + OTLP exporter; rule-run audit log UI
P3     Self-monitoring platform-finding rules; status page; CSV export of rule runs
P4     Per-tenant SLI dashboard customization; Grafana dashboard JSONs shipped in repo

Open questions

  • Retention for RuleRun and ConnectorSyncRun — these grow fast; downsample after N days or rely on org dataRetentionDays?
  • Should /metrics be unauthenticated (Prometheus convention) or gated by API token (#8)?
  • Per-rule sample-rate for span emission to keep OTel volume sane (e.g. 1% of high-frequency rules).
  • Status page — do we ship a hosted version (statuspage.io-style) or strictly self-hosted?

Metadata

Labels: enhancement, observability, tier-3-operator-dx
