Context
We have a steady stream of link health issues on the network, but the device-health-oracle currently has zero link criteria configured — links auto-advance from Pending to ReadyForService without checks, and ReadyForService is treated as a terminal state with no demotion path. Meanwhile, mainnet has activated links with ISIS down and 100% packet loss that are not reflected in link.health.
Goal
Enable the device-health-oracle to:
- Detect impaired links from monitoring data and write LinkHealth = Impaired onchain
- Detect recovered links and restore LinkHealth = ReadyForService
- Support bidirectional transitions: ReadyForService ↔ Impaired
Approach
Data source: link_rollup_5m
Query the existing link_rollup_5m ClickHouse table, which already aggregates per-link health into 5-minute buckets:
- link_pk: link public key (no joins needed)
- isis_down (bool): whether the ISIS adjacency is down
- a_loss_pct / z_loss_pct (float): packet loss percentage per direction
This table is in the lake ClickHouse instance the DHO already connects to.
Impairment criteria
A link is impaired if the most recent link_rollup_5m bucket shows:
- isis_down = true, OR
- a_loss_pct > threshold OR z_loss_pct > threshold (default threshold: 5%, configurable via flag)
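The impairment check above can be sketched as a single predicate over one bucket. This is a minimal illustration, not the actual implementation: the LinkRollup struct and the IsImpaired name are assumptions, and only the column names come from the table description.

```go
package main

import "fmt"

// LinkRollup mirrors the link_rollup_5m columns the criteria use.
// The struct and its field names are illustrative assumptions.
type LinkRollup struct {
	LinkPK   string
	ISISDown bool
	ALossPct float64
	ZLossPct float64
}

// IsImpaired reports whether a single bucket fails the impairment criteria:
// ISIS down, or loss strictly above the threshold in either direction.
func IsImpaired(b LinkRollup, lossThresholdPct float64) bool {
	return b.ISISDown || b.ALossPct > lossThresholdPct || b.ZLossPct > lossThresholdPct
}

func main() {
	const threshold = 5.0 // default from this proposal
	latest := LinkRollup{LinkPK: "example-link", ALossPct: 0.2, ZLossPct: 12.5}
	// The Z-side loss exceeds the threshold, so the link counts as impaired.
	fmt.Println(IsImpaired(latest, threshold))
}
```

Loss exactly at the threshold is treated as clean, which matches the "loss ≤ threshold" wording of the recovery criteria.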
Recovery criteria
A link has recovered if ALL link_rollup_5m buckets within the recovery window are clean (ISIS up AND loss ≤ threshold). The recovery window is derived from ledger slots (using the existing DrainedSlotCount parameter, resolved to wall-clock time via GetBlockTime), consistent with how device burn-in windows work. This asymmetry — fast impairment detection, slow recovery — prevents flapping.
Evaluator changes
Extend LinkHealthEvaluator.Evaluate() to support three paths:
- ReadyForService: check impairment criteria against the most recent bucket → demote to Impaired if any fail
- Impaired: check recovery criteria over the recovery window → promote to ReadyForService if every bucket is clean
- Pending/Unknown: check promotion criteria → advance to ReadyForService (existing behavior)
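The three paths above amount to a small state machine. The sketch below illustrates that shape only. The LinkHealth constants, the Signals struct and NextHealth are hypothetical names, and the real LinkHealthEvaluator.Evaluate() signature is not shown in this description.

```go
package main

import "fmt"

// LinkHealth is an illustrative stand-in for the onchain health enum.
type LinkHealth int

const (
	Unknown LinkHealth = iota
	Pending
	ReadyForService
	Impaired
)

// Signals carries the pre-computed criteria results for one link.
type Signals struct {
	LatestImpaired bool // most recent bucket fails the impairment criteria
	WindowClean    bool // every bucket in the recovery window is clean
	PromotionOK    bool // existing promotion criteria passed
}

// NextHealth returns the health value to write, or the current value
// when no transition applies.
func NextHealth(cur LinkHealth, s Signals) LinkHealth {
	switch cur {
	case ReadyForService:
		if s.LatestImpaired {
			return Impaired // demotion path
		}
	case Impaired:
		if s.WindowClean {
			return ReadyForService // recovery path
		}
	case Pending, Unknown:
		if s.PromotionOK {
			return ReadyForService // existing promotion behavior
		}
	}
	return cur
}

func main() {
	next := NextHealth(ReadyForService, Signals{LatestImpaired: true})
	fmt.Println(next == Impaired)
}
```

Keeping the transition logic free of I/O like this would let the table of transitions be unit-tested without ClickHouse or the ledger.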
Add a LinkBurnIn helper (parallel to DeviceBurnIn) for slot-based window resolution.
Onchain effect
Writing LinkHealth = Impaired updates the health field but does not automatically change link.status — the serviceability program's check_status_transition() is gated behind a "waiting for health oracle" comment. This is intentional: the health field serves as a signal to operators and dashboards. Automatic status transitions can be enabled later.
Implementation
See implementation plan for detailed tasks and code.
Files to change
| Action | File | What |
|---|---|---|
| Modify | internal/worker/criteria.go | Bidirectional LinkHealthEvaluator, LinkBurnIn helper |
| Modify | internal/worker/criteria_test.go | Tests for impairment/recovery transitions |
| Create | internal/worker/link_health.go | LinkHealthCriterion, which queries link_rollup_5m |
| Create | internal/worker/link_health_test.go | Unit tests |
| Modify | internal/worker/clickhouse.go | LinkHealthChecker interface, LinkHealthRecent query |
| Modify | cmd/device-health-oracle/main.go | Wire up criterion, add --link-loss-threshold flag |