Skip to content

fix(router): blame intermediate not dst when id_reservation fails on a hop#2750

Merged
0pcom merged 1 commit into
skycoin:developfrom
0pcom:fix/breaker-no-poison-from-intermediates
May 20, 2026
Merged

fix(router): blame intermediate not dst when id_reservation fails on a hop#2750
0pcom merged 1 commit into
skycoin:developfrom
0pcom:fix/breaker-no-poison-from-intermediates

Conversation

@0pcom
Copy link
Copy Markdown
Collaborator

@0pcom 0pcom commented May 20, 2026

Summary

Disjoint-mux scenario (N parallel routes through N different intermediates) exposed a circuit-breaker attribution bug: when an intermediate hop failed to dial during id_reservation, the destination's breaker accumulated the hit. Five bad intermediates tripped the destination's breaker even though the destination was reachable via the other N-5 healthy intermediates. All subsequent setup attempts to the dst then fast-failed for circuitOpenDuration (~60s).

Symmetric to the source-side fix already in place: source-side dial failures don't poison the dst breaker either. Intermediates needed the same treatment.

Changes

  • pkg/router/setupmetrics/stats.go:

    • finish() default branch (intermediate hop failed): set reason = ReasonIntermediateUnreachable, blameDst = false, and record the breaker hit against the intermediate's PK (not the dst's).
    • New AllowIntermediate accessor mirrors AllowDestination; both share an allowPK helper.
    • Adds ReasonIntermediateUnreachable FailureReason.
  • pkg/router/setupnode.go: CreateRouteGroup also consults AllowIntermediate for each hop on the forward path. A known-bad intermediate now short-circuits the setup with the same ErrCircuitOpen sentinel + the intermediate's PK in the reason string, saving the ~10s id_reservation timeout per known-bad hop.

  • pkg/router/setupmetrics/stats_test.go: 2 new tests.

    • TestCollector_CircuitBreaker_IntermediateUnreachable — 5+ intermediate-fails do NOT trip the dst breaker, DO trip the intermediate's breaker, are reclassified to intermediate_unreachable, and don't blame the dst Failed counter.
    • TestCollector_CircuitBreaker_IntermediateBreakerNotPoisoningDst — asserts the asymmetry that makes disjoint-mux work: bad intermediate denied, good intermediate + dst still allowed.

Empirical motivation

Captured during #2746 followup. mux-bw --routes 8 --min-hops 2 against a peer failed all 8 routes with destination circuit breaker open despite the route-finder returning 7+ valid candidates through different intermediates. Each route's intermediate-dial failure ticked the dst breaker; once 5 in a 5-minute window had hit, the dst was locked out and the remaining ~3 healthy candidates were never even tried.

Why this matters now

Pairs with #2746 (disjoint mux) and #2749 (--min-hops plumbing). Once mux-bw actually routes through intermediates, intermediate flakiness no longer destroys the dst breaker semantics — disjoint-mux is now actually viable against deployment intermediates that may individually be flaky.

Test plan

  • go test ./pkg/router/setupmetrics/ -count=1 -v — all 13 subtests pass (11 pre-existing + 2 new).
  • go test ./pkg/router/ -count=1 -short — passes.
  • go build ./... — clean.
  • gofmt — clean.
  • CI lanes (linux, darwin, windows, nix).
  • Post-merge: re-run mux-bw --routes 8 --min-hops 2 against a peer with known-flaky intermediates and confirm routes through healthy intermediates succeed instead of all-fail.

…a hop

Disjoint-mux scenario (N parallel routes through N different
intermediates) exposed a circuit-breaker attribution bug: when an
intermediate hop failed to dial during id_reservation, the
destination's breaker accumulated the hit. Five bad intermediates
tripped the destination's breaker even though the destination was
reachable via the other N-5 healthy intermediates. All subsequent
setup attempts to the dst then fast-failed for circuitOpenDuration.

Symmetric to the source-side fix already in place (lines 423-426
pre-change): source-side dial failures don't poison the dst breaker
either. Intermediates needed the same treatment.

Changes
- pkg/router/setupmetrics/stats.go: in finish() default branch
  (intermediate hop failed), set reason=ReasonIntermediateUnreachable,
  blameDst=false, and record the breaker hit against the intermediate's
  PK (not the dst's). New AllowIntermediate accessor mirrors
  AllowDestination; both share an allowPK helper. Adds
  ReasonIntermediateUnreachable FailureReason.

- pkg/router/setupnode.go: CreateRouteGroup also consults
  AllowIntermediate for each hop on the forward path. A known-bad
  intermediate now short-circuits the setup with the same
  ErrCircuitOpen sentinel + the intermediate's PK in the reason
  string, saving the ~10s id_reservation timeout.

- pkg/router/setupmetrics/stats_test.go: 2 new tests.
  TestCollector_CircuitBreaker_IntermediateUnreachable asserts that
  5+ intermediate-fails do NOT trip the dst breaker, DO trip the
  intermediate's breaker, are reclassified to intermediate_unreachable,
  and don't blame the dst Failed counter.
  TestCollector_CircuitBreaker_IntermediateBreakerNotPoisoningDst
  asserts the asymmetry that makes disjoint-mux work: bad
  intermediate denied, good intermediate + dst still allowed.

Empirical motivation
Captured in PR skycoin#2746 followup investigation: mux-bw --routes 8
--min-hops 2 against Beta failed all 8 routes with "destination
circuit breaker open" despite route-finder returning 7+ valid
candidates through different intermediates. Each route's
intermediate-dial failure ticked the dst breaker; once 5 in a
5-minute window had hit, the dst was locked out and the remaining
~3 healthy candidates were never even tried.

Pairs with skycoin#2746 (disjoint mux) and skycoin#2749 (--min-hops plumbing): once
mux-bw actually routes through intermediates, intermediate flakiness
no longer destroys the dst breaker semantics.
@0pcom 0pcom merged commit ab40b4c into skycoin:develop May 20, 2026
3 of 4 checks passed
@0pcom
Copy link
Copy Markdown
Collaborator Author

0pcom commented May 20, 2026

Empirical validation — caught the bug in real time

Validated post-#2751 auto-deploy. With the multi-hop path now genuinely reachable, I observed the BUG 2 scenario this PR fixes play out live:

Run 1: N=1 --min-hops 2

Run 2: N=8 --min-hops 2 (same target, same minute)

  • 2 routes succeed via DIFFERENT intermediates (03eb5a33... 3.3s, 02e40731... 4.3s) — confirms disjoint-mux machinery picks distinct intermediates.
  • 6 routes time out at setup-timeout.
  • The 6 failed id_reservation calls poison the destination's circuit breaker (5 failures in 5min window → breaker trips OPEN).

Run 3: N=1 --min-hops 2 (immediately after Run 2)

  • TIMEOUT at 15s. The same destination that succeeded 2 minutes ago is now unreachable.

Run 4: N=2 --min-hops 2

  • BOTH TIMEOUT.

The destination is healthy (Run 1 proved it). The intermediate(s) that succeeded in Run 2 are healthy. But the destination's circuit breaker is now open from Run 2's intermediate failures, so all subsequent setups fast-fail regardless of which intermediate they'd traverse.

This is precisely the path this PR fixes:

  • Pre-fix: 6 intermediate-dial failures → dst breaker trips → healthy routes (through the still-good intermediates 03eb5a33, 02e40731, 030c894f) get short-circuited.
  • Post-fix: 6 intermediate-dial failures → the FAILED intermediates' breakers trip → dst breaker stays closed → subsequent attempts through good intermediates proceed.

The new TestCollector_CircuitBreaker_IntermediateBreakerNotPoisoningDst test in this PR is the regression guard for exactly the asymmetry that the empirical sequence above is missing today.

Side note (out of scope for this PR)

mux-bw's setupWg.Wait() waits for ALL routes to settle (success or fail) before starting pump phase. When some routes time out at e.g. 30s and others succeed at 4s, the successful routes idle for ~26s before pump starts. By then a) the keepalive may have rotated b) timing-sensitive measurements are skewed. This is design surface independent of the breaker — worth a follow-up.

Darwin CI

The darwin lane failed on TestPeerSubMissesAfterReattach in pkg/cxo/treestore — completely unrelated to this PR's surface. Test passes 3x locally. Will request re-run when linux settles.

0pcom added a commit that referenced this pull request May 21, 2026
… + MinHops (#2757)

Beta's #2756 MuxRouteFailure event surfaced the smoking gun on
mux-bw --routes N --min-hops 2: one route would establish but its
pump loop immediately failed with

  no ping connection for <pk>#0, call DialPing first

while MuxRouteEstablished named route_index = 2 for the same
route. The lookup PingRouteRef.Index was 0 even though the
pump goroutine had RouteIndex = 2.

ROOT CAUSE
The rpcgrpc PingConf → visor PingConfig adapter in
pkg/visor/init_apps.go's visorPingAdapter forwards RouteIndex
correctly for DialPing (line 238) and PingOnce (line 249), but
PingOnceWithEcho's adapter (lines 316-322) was missing the
field — same for MinHops. Aux-route pumps therefore degraded to
a primary-route (Index=0) lookup, finding nothing because
DialPing had registered the conn at the matching aux Index.

FIX
Add `RouteIndex: conf.RouteIndex` and `MinHops: conf.MinHops` to
the visorPingAdapter.PingOnceWithEcho conversion. Three-line
adapter parity fix.

DMSG adapter (DmsgPingOnceWithEcho) intentionally untouched —
v.dmsgPing.conns is keyed by PK alone, not by PingRouteRef, so
no DMSG path consumes RouteIndex.

TEST
TestPingAdapter_PingOnceWithEcho_ForwardsRouteIndex pins the
adapter contract: with no ping connection registered, the visor's
PingOnceWithEcho returns

  no ping connection for %s#%d, call DialPing first

The "%d" portion is conf.RouteIndex post-adapter. The test calls
the adapter with RouteIndex in {0, 1, 2, 7} and asserts the error
message contains the matching `#<idx>,` — so a regression that
drops RouteIndex (or stuffs in MinHops by accident) surfaces in
the error string directly. No mock VisorAPI, no fixtures.

EMPIRICAL CHAIN
This is the third bug in the mux-bw --min-hops measurement chain
to surface via the wire-event observability landed in #2746#2749#2751#2752#2750#2753#2754#2756. The pattern
holds: each event-surface fix lights up the next bug downstream.
The operator's "mux > direct" hypothesis test should now finally
be measurable end-to-end once #2756 (route_failure event) +
this PR auto-deploy.
0pcom added a commit that referenced this pull request May 21, 2026
* fix(visor/ping): narrow ping.mu/dmsgPing.mu critical section to map lookup

DOMINANT BOTTLENECK for mux-bw bandwidth measurements. The visor's
Ping/PingOnce/PingOnceWithEcho (and dmsg twins) held v.ping.mu (a
single visor-global *sync.Mutex) for the ENTIRE wire roundtrip:

  v.ping.mu.Lock()
  defer v.ping.mu.Unlock()
  pingEntry, ok := v.ping.conns[ref]
  // ... ~287ms of wire I/O at 2-hop with 32 KB payloads ...

mux-bw's N pump goroutines all call PingOnceWithEcho on DIFFERENT
PingRouteRefs. They each look up their OWN conn via the map; the
wire I/O is independent. But the global mutex serialized them
through one ~287ms slot each. So:

  - Aggregate throughput across N routes didn't scale with N
  - Per-route avg pinned at ~351 kbps even though single-call peak
    was ~1.7 Mbps (1 RTT × 32 KB)
  - --probe-rtt latency probes during a loaded pump measured
    "probe-mutex-wait + network RTT" instead of network RTT,
    swamping the queueing-delay signal at short hop counts
  - Bidirectional simultaneous mux-bw measurements showed
    mutual-starvation that LOOKED like shared-link contention
    but was actually mutex contention on each side's ping state

ROOT CAUSE

The mutex's actual job is to protect v.ping.conns (the map) from
concurrent insert (DialPing) and delete (StopPingRoute). The
wire I/O on the chosen conn does NOT need the map mutex held —
each mux-bw pump goroutine owns its own conn via its RouteIndex,
no aliasing.

FIX

Shrink the critical section to just the map lookup:

  v.ping.mu.Lock()
  pingEntry, ok := v.ping.conns[ref]
  v.ping.mu.Unlock()
  if !ok { ... }
  // ... wire I/O on pingEntry.conn WITHOUT holding the mutex ...

Applied to Ping, PingOnce, PingOnceWithEcho, DmsgPing,
DmsgPingOnce, DmsgPingOnceWithEcho. BandwidthTest already had the
correct narrow scope.

CONCURRENT-CLOSE SEMANTICS

Pre-fix: StopPing concurrent with PingOnceWithEcho serialized via
the mutex — they took turns, no race. Post-fix: StopPing can close
the conn while PingOnceWithEcho is doing wire I/O. The Read/Write
on the closed conn returns ErrClosed cleanly. mux-bw's pump loop
already handles read/write errors by exiting the pump goroutine;
the resulting failure is surfaced via Beta's MuxRouteFailure event
(#2756) so the operator sees the cause instead of an indefinite
block.

The same-PingRouteRef-from-multiple-goroutines case (always
undefined behavior on the underlying net.Conn) is unchanged —
callers must serialize themselves. mux-bw enforces one
goroutine per RouteIndex natively.

TESTS

pkg/visor/ping_mu_concurrency_test.go:
  - TestPingOnceWithEcho_DoesNotSerializeAcrossRouteIndexes:
    200 concurrent calls with distinct PingRouteRefs (no registered
    conns) complete in << 1s. A regression that re-introduces
    wire-I/O-under-lock would either time out or take orders of
    magnitude longer.
  - TestPingMu_NotHeldDuringConnAbsentCallpath: after
    PingOnceWithEcho returns, the mutex must be immediately
    acquirable from another goroutine. Catches the defer-on-entry
    pattern directly.

EMPIRICAL PREDICTION

Once this auto-deploys, the operator's "mux > direct" hypothesis
becomes testable WITHOUT --hops-via intermediate pinning. Per-
route avg should rise from ~351 kbps toward the single-call peak
of ~1.7 Mbps, and N=2..8 disjoint mux should aggregate roughly
linearly (modulo per-intermediate quality variance) instead of
flat-lining at single-route throughput.

CHAIN

The mux-bw measurement loop has now closed:
  #2745 per-route teardown
  #2746 disjoint-intermediate routing
  #2749 plumb --min-hops through DialPing
  #2750 stop poisoning dst breaker from intermediate failures
  #2751 tryDirectPingDial gate on MinHops
  #2752 honor caller SetupTimeout
  #2756 MuxRouteFailure pump-phase event
  #2757 PingOnceWithEcho adapter forwards RouteIndex
  this PR: ping.mu doesn't serialize parallel pumps

If the mux > direct hypothesis is real, it should be visible
in measurements after this lands.

* fix(visor/ping): errcheck on discarded PingOnceWithEcho returns in test

golangci-lint errcheck flagged both test sites that discard the
4-tuple return of v.PingOnceWithEcho via _, _, _, _. The discards
are intentional — the concurrency test asserts on wall-clock
serialization behavior, not on per-call success; the mutex-release
test asserts that the lock is acquirable post-return regardless of
whether the call itself succeeded or returned ErrNoPingConnection.

Add //nolint:errcheck comments explaining the intent. No behavior
change; CI lint pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant