device-health-oracle: detect link impairment from monitoring data and update link.health #2652

Description

@nikw9944

Context

We have a steady stream of link health issues on the network, but the device-health-oracle currently has zero link criteria configured — links auto-advance from Pending to ReadyForService without checks, and ReadyForService is treated as a terminal state with no demotion path. Meanwhile, mainnet has activated links with ISIS down and 100% packet loss that are not reflected in link.health.

Goal

Enable the device-health-oracle to:

  1. Detect impaired links from monitoring data and write LinkHealth = Impaired onchain
  2. Detect recovered links and restore LinkHealth = ReadyForService
  3. Support bidirectional transitions: ReadyForService ↔ Impaired

Approach

Data source: link_rollup_5m

Query the existing link_rollup_5m ClickHouse table, which already aggregates per-link health into 5-minute buckets:

  • link_pk — link public key (no joins needed)
  • isis_down (bool) — whether ISIS adjacency is down
  • a_loss_pct / z_loss_pct (float) — packet loss percentage per direction

This table is in the lake ClickHouse instance the DHO already connects to.

Impairment criteria

A link is impaired if the most recent link_rollup_5m bucket shows:

  • isis_down = true, OR
  • a_loss_pct > threshold OR z_loss_pct > threshold (default threshold: 5%, configurable via flag)

Recovery criteria

A link has recovered if ALL link_rollup_5m buckets within the recovery window are clean (ISIS up AND loss ≤ threshold). The recovery window is derived from ledger slots (using the existing DrainedSlotCount parameter, resolved to wall-clock time via GetBlockTime), consistent with how device burn-in windows work. This asymmetry — fast impairment detection, slow recovery — prevents flapping.

Evaluator changes

Extend LinkHealthEvaluator.Evaluate() to support three paths:

  • ReadyForService: check impairment criteria against the most recent bucket → demote to Impaired if any criterion is met
  • Impaired: check recovery criteria over the recovery window → promote to ReadyForService if every bucket is clean
  • Pending/Unknown: check promotion criteria → advance to ReadyForService (existing behavior)
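The three paths can be read as a small state machine. The sketch below is illustrative only: `LinkHealth`, `Signals`, and `Next` are hypothetical names, the constant values are not the onchain encoding, and the real `LinkHealthEvaluator.Evaluate()` signature is not shown in this issue.

```go
package main

// LinkHealth lists the states named in this issue.
// The numeric values are arbitrary and do not reflect the onchain encoding.
type LinkHealth int

const (
	Unknown LinkHealth = iota
	Pending
	ReadyForService
	Impaired
)

// Signals carries criteria results computed elsewhere (hypothetical type).
type Signals struct {
	ImpairedNow    bool // latest bucket meets an impairment criterion
	CleanForWindow bool // every bucket in the recovery window is clean
	PromotionOK    bool // existing promotion criteria pass
}

// Next returns the health value to write, or the current one if nothing changes.
func Next(cur LinkHealth, s Signals) LinkHealth {
	switch cur {
	case ReadyForService:
		if s.ImpairedNow {
			return Impaired // fast demotion on a single bad bucket
		}
	case Impaired:
		if s.CleanForWindow {
			return ReadyForService // slow promotion after a clean window
		}
	case Pending, Unknown:
		if s.PromotionOK {
			return ReadyForService // existing behavior
		}
	}
	return cur
}
```

Keeping the transition logic free of I/O like this would make the impairment and recovery paths straightforward to cover in criteria_test.go.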

Add a LinkBurnIn helper (parallel to DeviceBurnIn) for slot-based window resolution.

Onchain effect

Writing LinkHealth = Impaired updates the health field but does not automatically change link.status — the serviceability program's check_status_transition() is gated behind a "waiting for health oracle" comment. This is intentional: the health field serves as a signal to operators and dashboards. Automatic status transitions can be enabled later.

Implementation

See implementation plan for detailed tasks and code.

Files to change

| Action | File | What |
|--------|------|------|
| Modify | internal/worker/criteria.go | Bidirectional LinkHealthEvaluator, LinkBurnIn helper |
| Modify | internal/worker/criteria_test.go | Tests for impairment/recovery transitions |
| Create | internal/worker/link_health.go | LinkHealthCriterion — queries link_rollup_5m |
| Create | internal/worker/link_health_test.go | Unit tests |
| Modify | internal/worker/clickhouse.go | LinkHealthChecker interface, LinkHealthRecent query |
| Modify | cmd/device-health-oracle/main.go | Wire up criterion, add --link-loss-threshold flag |
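For the `--link-loss-threshold` flag, a minimal sketch using the Go standard library `flag` package. Whether main.go uses `flag` or another flag library is an assumption, and `ParseLossThreshold` and the range check are hypothetical; only the flag name and the 5% default come from this issue.

```go
package main

import (
	"errors"
	"flag"
)

// ParseLossThreshold reads --link-loss-threshold from args.
// It defaults to 5.0 (percent) and rejects values outside 0..100.
func ParseLossThreshold(args []string) (float64, error) {
	fs := flag.NewFlagSet("device-health-oracle", flag.ContinueOnError)
	threshold := fs.Float64("link-loss-threshold", 5.0,
		"packet loss percentage above which a link is considered impaired")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *threshold < 0 || *threshold > 100 {
		return 0, errors.New("link-loss-threshold must be between 0 and 100")
	}
	return *threshold, nil
}
```

Calling `ParseLossThreshold(os.Args[1:])` from main and passing the result to the link health criterion would keep the threshold out of the criterion's own code.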
