multicast: report devices with no mroute telemetry as unknown, not unhealthy #662

Open
bgm-malbeclabs wants to merge 2 commits into main from multicast-no-telemetry-unknown-status

@bgm-malbeclabs

Contributor

Summary of Changes

  • Multicast publishers/subscribers on a device that exports no mroute telemetry now resolve to health_status='unknown' ("no mroute telemetry observed from <device>") instead of a false unhealthy. A genuine RPF mismatch on a reporting device still resolves to unhealthy.
  • Reuses the existing unknown status — already wired through the rate view (which defaults to it), API counts, and web badges/sort — so no API or schema changes were needed.
  • Updates the Multicast health UI copy: the unknown definition now covers the no-telemetry case, and the under-development banner explains that non-reporting devices show unknown (not unhealthy).
  • Pins the ClickHouse testcontainer image to 25.12 (matching docker-compose.yml / k8s/base/clickhouse.yaml) instead of :latest, which had drifted to 26.3 and broke a test; also moves SQL -- comments out of INSERT ... VALUES blocks that the newer Values parser rejects.

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
|---|---|---|---|
| Core logic | 2 | +438 / -7 | +431 |
| Tests | 2 | +51 / -33 | +18 |
| Scaffolding | 2 | +8 / -2 | +6 |
| Total | 6 | +497 / -42 | +455 |

The line count is dominated by the migration re-declaring the full health_multicast_user view in its Up and Down blocks; the actual semantic change is two added multiIf branches (health_status and mismatch_reason).
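
The branch order described above can be sketched in Go to make the precedence explicit. This is only an illustration, not the view itself: the real logic lives in SQL `multiIf` expressions, and the names `userHealth` and `resolveHealth`, the `healthy` fallback, and the exact wording of the unhealthy reason are assumptions made for the example. Only the `unknown` reason text and the "RPF interface" substring come from this PR.

```go
package main

import "fmt"

// userHealth is a hypothetical Go mirror of the two derived columns,
// health_status and mismatch_reason.
type userHealth struct {
	Status string
	Reason string
}

// resolveHealth sketches the precedence of the two added branches: the
// telemetry gap is checked first, so a device that reports no mroutes can
// never fall through to the RPF mismatch branch.
func resolveHealth(device string, deviceHasMroutes, rpfMatches bool) userHealth {
	switch {
	case !deviceHasMroutes:
		// New branch: missing telemetry is not evidence of a fault.
		return userHealth{"unknown", "no mroute telemetry observed from " + device}
	case !rpfMatches:
		// Existing behavior on a reporting device (reason wording is illustrative).
		return userHealth{"unhealthy", "tunnel is not the RPF interface on " + device}
	default:
		// Assumed fallback status; the real view may have further branches.
		return userHealth{"healthy", ""}
	}
}

func main() {
	fmt.Printf("%+v\n", resolveHealth("dev-a", false, false))
	fmt.Printf("%+v\n", resolveHealth("dev-b", true, false))
}
```

Checking the telemetry flag before the RPF condition is what keeps a non-reporting device from being labeled with a forwarding fault it cannot confirm.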

Key files
  • indexer/db/clickhouse/migrations/20260616000001_health_multicast_user_no_telemetry_reason.sql — adds a devices_with_mroutes flag to health_multicast_user; a device with zero observed mroutes resolves health_status to unknown with a "no mroute telemetry observed" reason.
  • indexer/pkg/dz/mroute/health_multicast_user_test.go — adds a publisher on a non-reporting device (asserts unknown + telemetry-gap reason) and a contrasting reporting-device fault case (asserts unhealthy + "RPF interface").
  • web/src/components/multicast-group-health-tab.tsx — broadens the unknown status definition and rewrites the under-development banner to reflect the new behavior.
  • indexer/pkg/clickhouse/testing/db.go, api/testing/clickhouse.go — pin the testcontainer ClickHouse image to 25.12.
  • indexer/pkg/dz/mroute/health_multicast_user_rate_test.go — relocate embedded -- comments out of INSERT ... VALUES.

Testing Verification

  • health_multicast_user_test.go covers both new paths: a publisher on a zero-mroute device resolves to unknown with "no mroute telemetry observed from …" (and not "RPF interface"), while a publisher whose reporting device lacks its tunnel still resolves to unhealthy.
  • Confirmed the root cause of the pre-existing rate-test failure against ClickHouse 26.3 (the Values parser rejects -- comments between VALUES tuples); mroute package and API Multicast tests pass on the pinned 25.12 image.

…healthy

A multicast publisher/subscriber on a device that exports no mroute
telemetry previously rendered as a confirmed 'unhealthy' fault with a
reason implying its tunnel was missing as the RPF interface. The real
cause is missing telemetry, not a forwarding problem.

The health_multicast_user view now resolves the device-reports-nothing
case to health_status='unknown' (an existing status, already wired
through the rate view, API counts, and web badges) with the reason
'no mroute telemetry observed from <device>'. A genuine RPF mismatch on
a reporting device still resolves to 'unhealthy'.
…SERT VALUES

The testcontainer image was clickhouse/clickhouse-server:latest, which
drifted to 26.3, whose Values parser rejects '--' comments embedded
between INSERT ... VALUES tuples, breaking TestHealthMulticastUserRate.

Pin the test image to 25.12 (matching docker-compose.yml and
k8s/base/clickhouse.yaml) so tests track the version we actually run,
and move the embedded SQL comments in the rate test out to Go comments.
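
A small guard can make this kind of drift visible early. The following is a sketch under assumptions: `imageTag` and `isPinned` are hypothetical helpers, not code from this PR, and the constant simply combines the image name and version stated above.

```go
package main

import (
	"fmt"
	"strings"
)

// clickhouseImage is the pinned reference described in the commit message.
const clickhouseImage = "clickhouse/clickhouse-server:25.12"

// imageTag returns the tag of an image reference, or "latest" when the
// reference has no tag (Docker's default). Only a colon after the last
// slash counts, so a registry port is not mistaken for a tag.
func imageTag(image string) string {
	slash := strings.LastIndex(image, "/")
	if i := strings.LastIndex(image, ":"); i > slash {
		return image[i+1:]
	}
	return "latest"
}

// isPinned reports whether the reference names an explicit, non-latest tag.
func isPinned(image string) bool {
	return imageTag(image) != "latest"
}

func main() {
	fmt.Println(imageTag(clickhouseImage), isPinned(clickhouseImage))
	fmt.Println(isPinned("clickhouse/clickhouse-server:latest"))
}
```

A test helper could call `isPinned` on the image constant before starting the container, so a floating tag fails fast instead of surfacing later as a parser error.
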
@github-actions

🔗 Preview: https://pr-662.data.malbeclabs.com