feat(alerts): per-org config with hierarchical merge #99

Draft
edospadoni wants to merge 4 commits into main from feat/alerts-config-refactor

@edospadoni
Member

@edospadoni edospadoni commented May 12, 2026

Test instance

Summary

End-to-end refactor of the per-organization alerting configuration. The previous shape (global lists + per-severity overrides + per-system overrides + per-tenant email_template_lang) was hard to consume from the UI and hard to extend; this PR replaces it with a flat, recipient-centric model where each recipient carries its own scope (severities) and rendering hints (language, format for email).

The merge across the org hierarchy stays server-side only: /alerts/config exposes the caller's own layer and nothing else. No inherited view and no merged-effective preview ever leaves the backend, so the secrets and routing intent of an upstream org never leak to descendants.

The branch also includes the rebuild of the alerts list, history, silences, activity timeline, and aggregations (commit 1 of 2). It is kept in the same PR because both pieces were untested on QA, share the OpenAPI surface, and the new list code consumes types defined by the config commit.

Frontend hand-off

The frontend AlertingView and its types will be rewritten by the frontend developer directly on this branch — the existing frontend code does not match this API and is intentionally left out of this PR.

The new shape mirrors what POST /alerts/config accepts and what GET /alerts/config returns (data is the layer itself plus updated_by_name/updated_at):

{
  "enabled": { "email": true, "webhook": null, "telegram": null },
  "email_recipients": [
    { "address": "noc@org.example", "severities": ["critical","warning"], "language": "it", "format": "html" }
  ],
  "webhook_recipients": [
    { "name": "ops-slack", "url": "https://hooks.slack.com/...", "severities": ["critical"] }
  ],
  "telegram_recipients": [
    { "bot_token": "123:ABC", "chat_id": -1001234567890, "severities": [] }
  ]
}

severities=[] means the recipient applies to all severities. enabled.X = null means no opinion at this layer (inherit). Only the Owner can set enabled.X = false; an explicit false from a non-Owner layer is normalised to null on save.

Refer to backend/openapi.yaml for the full schema, the 6 request examples on POST /alerts/config, and response examples on the alert/silence endpoints (added in the openapi sections of both commits).

What the backend does internally

  • Merge engine (services/alerting/merge.go): walks the chain Owner → … → tenant; unions recipients per channel with dedup keys (email→address, webhook→url, telegram→(bot_token,chat_id)); on a dedup hit, severities are unioned and "[] widens to all".
  • Template renderer (services/alerting/template.go): fans out one Alertmanager receiver per severity (critical/warning/info); each email recipient emits its own email_configs entry referencing per-language dispatcher templates (alert_<lang>.html|txt|subject). format=plain emits html: '' explicitly so Alertmanager's default HTML body does not override ours.
  • Templates (services/alerting/templates/): both en and it are always shipped to every tenant; the dispatcher routes firing/resolved to the right language fragment.
  • Provision (services/alerting/provision.go): on new org creation, pushes the effective merged config to Mimir so any ancestor layers take effect immediately.
  • Redaction (services/alerting/redaction.go): minimal helper used only for audit-log snapshots (webhook URL paths and Telegram tokens are scrubbed before the layer goes to LogBusinessOperationDetails). API responses never use this.
  • RBAC: /alerts/config* is gated on the dedicated alerts resource (read:alerts for GET, manage:alerts for POST/DELETE) — admin/super only. The list endpoints stay gated on the existing read:systems / manage:systems.
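As a rough illustration of the merge rule for one channel, the sketch below unions email recipients across layers, deduplicates on address, unions severities on a hit, and lets an empty list widen the result to all severities. It is not the code in services/alerting/merge.go: the `EmailRecipient` type and `mergeEmail` function are hypothetical, and language/format are left out because the description does not say which layer wins for those fields.

```go
package main

import (
	"fmt"
	"sort"
)

// EmailRecipient is a reduced, hypothetical recipient shape.
type EmailRecipient struct {
	Address    string
	Severities []string // empty means "all severities"
}

// mergeEmail unions recipients from layers ordered Owner first, tenant last.
func mergeEmail(layers ...[]EmailRecipient) []EmailRecipient {
	var order []string                   // dedup keys in first-seen order
	sevs := map[string]map[string]bool{} // address -> set of severities
	all := map[string]bool{}             // address -> widened to all severities
	for _, layer := range layers {
		for _, r := range layer {
			if _, seen := sevs[r.Address]; !seen {
				sevs[r.Address] = map[string]bool{}
				order = append(order, r.Address)
			}
			if len(r.Severities) == 0 {
				all[r.Address] = true // "[] widens to all"
			}
			for _, s := range r.Severities {
				sevs[r.Address][s] = true
			}
		}
	}
	out := make([]EmailRecipient, 0, len(order))
	for _, addr := range order {
		merged := EmailRecipient{Address: addr, Severities: []string{}}
		if !all[addr] {
			for s := range sevs[addr] {
				merged.Severities = append(merged.Severities, s)
			}
			sort.Strings(merged.Severities) // deterministic output
		}
		out = append(out, merged)
	}
	return out
}

func main() {
	owner := []EmailRecipient{{Address: "noc@org.example", Severities: []string{"critical"}}}
	tenant := []EmailRecipient{
		{Address: "noc@org.example", Severities: []string{"warning"}},
		{Address: "ops@tenant.example"},
	}
	fmt.Println(mergeEmail(owner, tenant))
}
```

The same pattern would apply to the other channels with their own keys (url for webhooks, the (bot_token, chat_id) pair for Telegram).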

Migration

Migration 024_add_alert_config_layers.sql creates the table for this layer. No data carry-over from any previous shape — the table starts empty and operators reconfigure /alerts/config after deploy. Migration 023_add_alert_activity.sql creates the alert_activity table backing the per-alert audit timeline.

@edospadoni edospadoni deployed to feat/alerts-config-refactor - my-backend-qa PR #99 May 12, 2026 09:28 — with Render Active
@github-actions
Contributor

🔗 Redirect URIs Added to Logto

The following redirect URIs have been automatically added to the Logto application configuration:

Redirect URIs:

  • https://my-proxy-qa-pr-99.onrender.com/login-redirect

Post-logout redirect URIs:

  • https://my-proxy-qa-pr-99.onrender.com/login

These will be automatically removed when the PR is closed or merged.

@github-actions
Contributor

github-actions Bot commented May 12, 2026

🚨 Breaking My API change detected

Preview documentation

Structural change details

Added (1)

  • GET /alerts/{fingerprint}/activity

Modified (10)

  • DELETE /alerts/config
    • Response modified: 200
      • Content type modified: application/json
        • [Breaking] Property modified: data
          • Type went from object | null to object [Breaking]
          • Properties added: affected_tenants, propagated_to, warnings
    • [Breaking] Query parameter removed: organization_id
      • Removing a resource is always breaking unless it was deprecated before [Breaking]
  • GET /alerts
    • [Breaking] Query parameter modified: organization_id
      • Type went from string to array[string] [Breaking]
    • Response modified: 200
      • Content type modified: application/json
        • Property modified: data
          • Property modified: alerts
    • Query parameters added: sort_by, sort_direction
  • GET /alerts/config
    • Response modified: 200
      • Content type modified: application/json
        • Property modified: data
          • [Breaking] Property removed: config
            • Removing a resource is always breaking unless it was deprecated before [Breaking]
          • Properties added: enabled, email_recipients, webhook_recipients, telegram_recipients, updated_by_name, updated_at
    • [Breaking] Query parameters removed: organization_id, format
      • Removing a resource is always breaking unless it was deprecated before [Breaking]
    • [Breaking] Responses removed: 400, 500
      • Removing a resource is always breaking unless it was deprecated before [Breaking]
  • GET /alerts/history
    • [Breaking] Query parameter modified: organization_id
      • Type went from string to array[string] [Breaking]
    • Response modified: 200
      • Content type modified: application/json
        • Property modified: data
          • Property modified: alerts
  • GET /alerts/stats
    • [Breaking] Query parameter modified: organization_id
      • Type went from string to array[string] [Breaking]
  • GET /alerts/totals
    • [Breaking] Query parameter modified: organization_id
      • Type went from string to array[string] [Breaking]
    • Response modified: 200
      • Content type modified: application/json
        • Property modified: data
          • Property added: muted
  • GET /alerts/trend
    • [Breaking] Query parameter modified: organization_id
      • Type went from string to array[string] [Breaking]
  • GET /systems/{id}/alerts
    • Response modified: 200
      • Content type modified: application/json
        • Property modified: data
          • Property modified: alerts
  • GET /systems/{id}/alerts/history
    • Response modified: 200
      • Content type modified: application/json
        • Property modified: data
          • Property modified: alerts
  • POST /alerts/config
    • Content type modified: application/json
      • [Breaking] Properties removed: mail_enabled, webhook_enabled, telegram_enabled, mail_addresses, webhook_receivers, telegram_receivers, severities, systems, email_template_lang
        • Removing a resource is always breaking unless it was deprecated before [Breaking]
      • Properties added: enabled, email_recipients, webhook_recipients, telegram_recipients
    • Response modified: 200
      • Content type modified: application/json
        • [Breaking] Property modified: data
          • Type went from object | null to object [Breaking]
          • Properties added: affected_tenants, propagated_to, warnings
    • [Breaking] Query parameter removed: organization_id
      • Removing a resource is always breaking unless it was deprecated before [Breaking]
    • Response added: 413
Powered by Bump.sh

@edospadoni edospadoni force-pushed the feat/alerts-config-refactor branch from 83406ab to b102ce8 Compare May 12, 2026 09:48
@edospadoni edospadoni temporarily deployed to feat/alerts-config-refactor - my-backend-qa PR #99 May 12, 2026 09:48 — with Render Destroyed
@edospadoni
Member Author

update deploy

@github-actions
Contributor

🚀 Build triggers updated!

All .render-build-trigger files have been automatically updated to ensure fresh deployments of all services in the PR preview environment.

@edospadoni edospadoni deployed to feat/alerts-config-refactor - my-backend-qa PR #99 May 12, 2026 09:49 — with Render Active
@edospadoni edospadoni deployed to feat/alerts-config-refactor - my-collect-qa PR #99 May 12, 2026 09:49 — with Render Active
@edospadoni edospadoni deployed to feat/alerts-config-refactor - my-mimir-qa PR #99 May 12, 2026 09:49 — with Render Active
@edospadoni edospadoni deployed to feat/alerts-config-refactor - my-frontend-qa PR #99 May 12, 2026 09:49 — with Render Active
Builds the operational alerts surface on top of Mimir Alertmanager: a
single paginated list endpoint plus per-system silence management,
resolved-alert history, and aggregations the UI uses to render the
overview page.

Endpoints:
  - GET /alerts (cross-hierarchy / single-tenant / sub-tree scoping,
    multi-value label filters, sorting on starts_at/severity/alertname,
    pagination with stable fingerprint tiebreaker)
  - GET /alerts/history (paginated alert_history rows with date range)
  - GET /alerts/totals / /trend / /stats (severity buckets, time-series
    deltas, top-N alertname/system_key, MTTR/MTBF)
  - GET /alerts/{fingerprint}/activity (silence/unsilence audit timeline,
    populated transparently by the silence endpoints)
  - GET /systems/{id}/alerts and friends scoped to a single system

Each alert in the list is enriched with a local-DB system object
(id/name/type) so the frontend doesn't need a per-row round-trip.
Per-tenant fan-out failures are surfaced as warnings rather than
failing the whole request.

Gated on the existing read:systems / manage:systems permissions:
read for the list endpoints, manage for silence create/update/delete.

Adds POST/GET/DELETE /alerts/config: every organization saves its own
layer; the effective Mimir YAML for any tenant is the server-side merge
of all layers walking up the hierarchy from the tenant to the Owner.
The merge stays internal: /alerts/config exposes only the caller's own
row, never an inherited or merged view (no leakage of upstream
recipients or secrets to descendants).

Model is flat and recipient-centric:
  enabled:             {email, webhook, telegram}   tri-state per layer
  email_recipients:    [{address, severities[], language, format}]
  webhook_recipients:  [{name, url, severities[]}]
  telegram_recipients: [{bot_token, chat_id, severities[]}]

Per-recipient severities=[] means "all severities". Email recipients
additionally carry language (en|it) and format (html|plain) which the
template renderer turns into per-email_configs overrides:
  - format=html emits our html template + our text fallback (multipart)
  - format=plain emits our text template plus html: '' (the empty html
    is mandatory — Alertmanager otherwise falls back to its built-in
    HTML body and overrides ours with the generic "Sent by Alertmanager")

Rendering fans out a receiver per severity (critical/warning/info);
recipients with severities=[] land on every per-severity receiver. The
builtin alert-history webhook is always attached at the top of the
routes (continue: true) so history persists regardless of config.
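
The per-severity fan-out described above can be pictured with the sketch below. It only computes which addresses land on which per-severity receiver; it does not emit Alertmanager YAML, and the `Recipient` type and `fanOut` function are hypothetical names rather than the contents of services/alerting/template.go.

```go
package main

import "fmt"

// The three severities the description lists for per-severity receivers.
var severities = []string{"critical", "warning", "info"}

// Recipient is a reduced, hypothetical recipient shape.
type Recipient struct {
	Address    string
	Severities []string // empty means "all severities"
}

func contains(list []string, v string) bool {
	for _, s := range list {
		if s == v {
			return true
		}
	}
	return false
}

// fanOut returns, for each severity, the addresses attached to that receiver.
// Recipients with an empty severities list appear on every receiver.
func fanOut(recipients []Recipient) map[string][]string {
	out := map[string][]string{}
	for _, sev := range severities {
		out[sev] = []string{}
		for _, r := range recipients {
			if len(r.Severities) == 0 || contains(r.Severities, sev) {
				out[sev] = append(out[sev], r.Address)
			}
		}
	}
	return out
}

func main() {
	receivers := fanOut([]Recipient{
		{Address: "all@org.example"},
		{Address: "oncall@org.example", Severities: []string{"critical"}},
	})
	for _, sev := range severities {
		fmt.Println(sev, receivers[sev])
	}
}
```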

Additive-only contract: descendants can ADD recipients but cannot
disable channels enabled by ancestors. The server normalises any
explicit false in enabled.{email,webhook,telegram} from non-Owner
layers to null on storage. Save+propagate is serialised per-org via an
in-process mutex; per-tenant push failures land in warnings[] without
failing the save. Body capped at 1 MiB; oversized requests get 413.
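
One way to get the 1 MiB cap with a 413 response is sketched below using only the Go standard library. This is an assumption-laden illustration, not the contents of middleware/body_limit.go: the web framework used by the backend is not shown in this PR, and `limitBody`, `saveConfig` and `post` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

const maxBody = 1 << 20 // 1 MiB

// limitBody wraps the request body so reads past maxBody fail.
func limitBody(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, maxBody)
		next.ServeHTTP(w, r)
	})
}

// saveConfig is a stand-in handler that maps an oversized body to 413.
func saveConfig(w http.ResponseWriter, r *http.Request) {
	_, err := io.ReadAll(r.Body)
	var tooBig *http.MaxBytesError // available since Go 1.19
	if errors.As(err, &tooBig) {
		http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
		return
	}
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// post sends a body of the given size through the middleware and
// returns the resulting status code.
func post(size int) int {
	body := strings.NewReader(strings.Repeat("a", size))
	req := httptest.NewRequest(http.MethodPost, "/alerts/config", body)
	rec := httptest.NewRecorder()
	limitBody(http.HandlerFunc(saveConfig)).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(post(1024), post(maxBody+1))
}
```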

Gated on the dedicated alerts resource (read:alerts for GET,
manage:alerts for POST/DELETE) — admin/super only by default.

Includes:
  - models/alerting.go: flat shape + Validate
  - services/alerting/{merge,template,embed,effective,provision,redaction}.go
  - migration 024_add_alert_config_layers
  - entities/local_alert_config_layers.go (repo)
  - middleware/body_limit.go
  - logger/helpers.go: LogBusinessOperationDetails for audit snapshots
  - methods/alerting.go: ConfigureAlerts/GetAlertingConfig/DisableAlerts
  - methods/{customers,distributors,resellers}.go: provision sig change
  - openapi.yaml: schemas + endpoints + 6 request examples + response examples
  - templates: per-language dispatchers (alert_<lang>.html|txt|subject)
    plus telegram_<lang>.message

The user-facing alerting docs and the AGENTS reference were stuck on
the previous shape (global mail_addresses/webhook_receivers + per-
severity + per-system overrides + per-tenant email_template_lang).
Rewrite the 'Alerting Configuration' section in both en and it locales
to describe the new layer model:

  - flat shape: enabled tri-state + email_recipients/webhook_recipients/
    telegram_recipients with per-recipient severities[]
  - email recipients additionally carry language (en|it) and format
    (html|plain)
  - merge across the org hierarchy stays server-side; /alerts/config
    returns only the caller's own layer (no inherited / merged view)
  - additive-only contract; non-Owner explicit false on enabled.X is
    normalised to null at save time
  - RBAC: the Alerting Configuration tab is gated on read:alerts /
    manage:alerts (admin/super only); the alerts list stays on
    read:systems / manage:systems

Refresh the Telegram step-3 example to use the new shape and update
the email-notifications section to reflect per-recipient language and
format. Realign AGENTS.md §3.5 with the same wording.
@edospadoni edospadoni force-pushed the feat/alerts-config-refactor branch from a5d2789 to 5119691 Compare May 12, 2026 11:06
@edospadoni edospadoni deployed to feat/alerts-config-refactor - my-backend-qa PR #99 May 12, 2026 11:06 — with Render Active
…GET /alerts

Stamp system_type at ingest (collect) alongside the other system_* labels
and drop the per-request DB lookup that enriched each alert with a separate
system object. Saves a SELECT on every GET /alerts and removes a redundant
field the frontend never read.