Product observability: connector health, SIEM delivery, rule-run audit, /metrics

## Problem

Aperio operators have no visibility into Aperio itself. Critical operational questions today require ad-hoc SQL or log greps:

- "When did each connector last successfully sync, and is the data fresh?"
- "Are SIEM deliveries succeeding, retrying, or dead-lettered?"
- "Which detection rules ran in the last hour, and how long did each take?"
- "Did the hourly scheduled job for tenant X actually run?"
- "Is the ingestion queue backing up?"

The data is mostly already in Postgres (`IntegrationConnection.lastSyncAt`, `SiemDelivery.status/attempts`, `IngestionJob.status/attempts`, `SiemDestination.deliveriesOk/Fail`) — it just isn't surfaced.

Operators need to **trust** the data before they act on findings; today that trust is implicit and unverifiable.

## Goals

1. **Connector health & freshness dashboard** — per-connector last-sync, latency, error rate, scope freshness.
2. **SIEM delivery health** — per-destination success rate, retry depth, dead-letter queue depth.
3. **Ingestion pipeline health** — queue depth, processing rate, dead-letter rate, per-rule execution time.
4. **Customer-facing rule-run audit log** — every check Aperio runs against a tenant is queryable ("rule X ran at T, evaluated N events, opened M findings, took P milliseconds").
5. **Prometheus + OpenTelemetry exposition** for ops teams running their own monitoring.
6. **Status page** at `/status` (and JSON) for embedding into upstream dashboards.

## Non-goals

- Not building a generic APM (Datadog, Honeycomb territory) — Aperio exposes its own health, customers point their existing tools at it.
- Not building a full SLO management system (Nobl9 territory) — just expose primitives.

## Proposed design

### New schema

```prisma
enum RuleRunStatus {
  STARTED
  SUCCEEDED
  FAILED
  SKIPPED
}

model RuleRun {
  id                String   @id @default(cuid())
  organizationId    String   @map("organization_id")
  ruleKey           String   @map("rule_key") @db.VarChar(160)
  ruleVersion       String?  @map("rule_version") @db.VarChar(32)
  integrationId     String?  @map("integration_id")
  status            RuleRunStatus
  eventsEvaluated   Int      @default(0) @map("events_evaluated")
  findingsCreated   Int      @default(0) @map("findings_created")
  findingsReopened  Int      @default(0) @map("findings_reopened")
  findingsResolved  Int      @default(0) @map("findings_resolved")
  durationMs        Int?     @map("duration_ms")    // null while status is STARTED
  errorMessage      String?  @map("error_message") @db.VarChar(500)
  startedAt         DateTime @map("started_at")
  finishedAt        DateTime? @map("finished_at")   // null while status is STARTED
  organization      Organization @relation(...)
  @@index([organizationId, startedAt])
  @@index([organizationId, ruleKey, startedAt])
  @@index([organizationId, status, startedAt])
  @@map("rule_runs")
}

model ConnectorSyncRun {
  id                  String   @id @default(cuid())
  organizationId      String   @map("organization_id")
  integrationId       String   @map("integration_id")
  triggeredBy         String   @map("triggered_by") @db.VarChar(40)  // "scheduled" | "manual" | "webhook"
  status              String   @db.VarChar(20)                       // "succeeded" | "failed" | "partial"
  eventsIngested      Int      @default(0) @map("events_ingested")
  newAssetsObserved   Int      @default(0) @map("new_assets_observed")
  apiCallsMade        Int      @default(0) @map("api_calls_made")
  durationMs          Int      @map("duration_ms")
  errorMessage        String?  @map("error_message") @db.VarChar(500)
  startedAt           DateTime @map("started_at")
  finishedAt          DateTime @map("finished_at")
  organization        Organization @relation(...)
  integration         IntegrationConnection @relation(...)
  @@index([organizationId, integrationId, startedAt])
  @@index([organizationId, status, startedAt])
  @@map("connector_sync_runs")
}
```

### Prometheus / OpenTelemetry exposition

The `/metrics` endpoint (Prometheus exposition format) exposes:

```
aperio_connector_last_sync_seconds{org, provider, integration_id}
aperio_connector_sync_duration_seconds{org, provider, integration_id, status}
aperio_ingestion_queue_depth{org, status}
aperio_ingestion_processed_total{org, status}
aperio_siem_delivery_attempts_total{org, destination_kind, status}
aperio_siem_delivery_queue_depth{org, status}
aperio_rule_execution_duration_seconds{org, rule_key, status}    # histogram
aperio_rule_findings_created_total{org, rule_key}
aperio_workflow_delivery_attempts_total{org, destination_kind, status}  # post-#6
aperio_api_token_invocations_total{org, scope, status}                  # post-#8
```

The OTel SDK is initialized in the Go server and workers; the OTLP exporter is configured through the standard `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable. Spans are emitted for every connector sync and every rule evaluation.
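To make the exposition format concrete, here is a minimal standard-library sketch of how one of the gauges above could be rendered as text. A real implementation would more likely use a Prometheus client library; `formatSample`, `renderGauge`, and the `sample` type are hypothetical names, and the label values in the usage note are invented.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelEscaper escapes the characters that the text format requires
// escaping inside label values: backslash, double quote, and newline.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// sample is one labelled value of a metric.
type sample struct {
	labels map[string]string
	value  float64
}

// formatSample renders a single sample line. Labels are sorted by name so
// the output is stable between scrapes.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf(`%s="%s"`, k, labelEscaper.Replace(labels[k])))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

// renderGauge renders the TYPE header followed by one line per sample.
func renderGauge(name string, samples []sample) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
	for _, s := range samples {
		b.WriteString(formatSample(name, s.labels, s.value))
		b.WriteByte('\n')
	}
	return b.String()
}
```

For example, `formatSample("aperio_ingestion_queue_depth", map[string]string{"org": "acme", "status": "queued"}, 42)` yields `aperio_ingestion_queue_depth{org="acme",status="queued"} 42`.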

### Health & freshness dashboard

`/admin/health` page (admin role required):

- **Connectors panel**: row per `IntegrationConnection` showing provider icon, last sync, freshness traffic light (green < 1h, yellow < 6h, red ≥ 6h; thresholds configurable per provider), last 5 syncs sparkline, last error.
- **SIEM destinations panel**: row per `SiemDestination` showing delivery success rate, queue depth, last delivery, last error.
- **Workflow destinations panel** (post-#6): same shape for ticketing/chatops.
- **Ingestion panel**: queue depth gauge, processing-rate sparkline, dead-letter count with drill-down.
- **Detection rules panel**: top 5 slowest rules (p95), top 5 noisiest rules (most findings/hour), rules that haven't run in 24h (broken?).

### Rule-run audit log (customer-facing)

`/admin/audit/rules` (visible to ADMIN + SECURITY_ANALYST):

- Filter by rule key, integration, status, time window.
- Per-run drill-down: trigger event, evaluation context (truncated), result.
- Export to CSV/JSON for compliance evidence (feeds into #5's evidence pack).

### Status page

`/status` (public, with optional auth):

- Per-component status: API, ingestion worker, SIEM dispatcher, workflow dispatcher (#6), web console.
- Recent incidents (manual posting + auto-detected from `RuleRun` / `ConnectorSyncRun` failure rates).
- JSON variant at `/status.json` for embedding.
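One possible shape for the `/status.json` payload is sketched below. The field names, the three status values (`operational`, `degraded`, `down`), and the rule that the overall status is the worst component status are all assumptions for illustration; none of them are fixed by this proposal.

```go
package main

import "encoding/json"

// ComponentStatus is the state of one component (API, ingestion worker, ...).
type ComponentStatus struct {
	Name   string `json:"name"`
	Status string `json:"status"` // "operational" | "degraded" | "down" (assumed values)
}

// StatusDocument is the hypothetical top-level /status.json document.
type StatusDocument struct {
	Overall    string            `json:"overall"`
	Components []ComponentStatus `json:"components"`
}

// severity ranks the assumed status values; unknown values rank as 0.
var severity = map[string]int{"operational": 0, "degraded": 1, "down": 2}

// buildStatusJSON derives the overall status as the worst component status
// and serializes the document.
func buildStatusJSON(components []ComponentStatus) ([]byte, error) {
	overall := "operational"
	for _, c := range components {
		if severity[c.Status] > severity[overall] {
			overall = c.Status
		}
	}
	return json.Marshal(StatusDocument{Overall: overall, Components: components})
}
```

Keeping the document flat, with one entry per component, should make it straightforward to embed in upstream dashboards without custom parsing.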

### Alerting hooks

Self-monitoring rules that fire as Aperio findings (severity tagged `aperio.platform.*`) when:

- Connector hasn't synced in 6h for a CONNECTED integration.
- SIEM destination DLQ has > 10 messages.
- Ingestion queue depth > N or oldest queued job > 1h.
- Detection rule duration p95 > 5s.

These show up in the existing finding flow and route through the existing SIEM/workflow dispatchers — eat our own dog food.

## Phasing

| Phase | Scope |
|---|---|
| **P1** | `RuleRun` + `ConnectorSyncRun` schemas; instrument workers to write them; `/admin/health` page (connectors + ingestion + SIEM panels) |
| **P2** | `/metrics` Prometheus exposition; OTel SDK + OTLP exporter; rule-run audit log UI |
| **P3** | Self-monitoring platform-finding rules; status page; CSV export of rule runs |
| **P4** | Per-tenant SLI dashboard customization; Grafana dashboard JSONs shipped in repo |

## Open questions

- Retention for `RuleRun` and `ConnectorSyncRun` — these grow fast; downsample after N days or rely on org `dataRetentionDays`?
- Should `/metrics` be unauthenticated (Prometheus convention) or gated by API token (#8)?
- Per-rule sample-rate for span emission to keep OTel volume sane (e.g. 1% of high-frequency rules).
- Status page — do we ship a hosted version (statuspage.io-style) or strictly self-hosted?
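On the span sample-rate question, one option to weigh is a deterministic per-rule sampler, sketched below. It hashes a run identifier so the decision for a given run is repeatable across processes. `shouldSample`, the `rates` map, and the choice of FNV-1a are assumptions for illustration; the OTel SDK also ships its own samplers, which may be a better fit.

```go
package main

import "hash/fnv"

// shouldSample decides whether to emit a span for one rule run.
// rates maps a rule key to a sampling rate in [0, 1]; rules without an
// entry are always sampled. The decision is deterministic per runID.
func shouldSample(rates map[string]float64, ruleKey, runID string) bool {
	rate, ok := rates[ruleKey]
	if !ok || rate >= 1 {
		return true
	}
	if rate <= 0 {
		return false
	}
	h := fnv.New32a()
	h.Write([]byte(runID))
	// Bucket the hash into 10,000 slots and keep the lowest rate*10,000.
	return h.Sum32()%10000 < uint32(rate*10000)
}
```

With `rates["some.rule"] = 0.01`, roughly 1% of that rule's runs would produce spans, while every run would still be recorded in `RuleRun`.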

