Skip to content

fix(rpcgrpc): mux-bw probe-after-pump race + per-level LevelDone counters#2754

Merged
0pcom merged 2 commits into
skycoin:developfrom
0pcom:fix/mux-bw-probe-cleanup-and-levelDone-counters
May 21, 2026
Merged

fix(rpcgrpc): mux-bw probe-after-pump race + per-level LevelDone counters#2754
0pcom merged 2 commits into
skycoin:developfrom
0pcom:fix/mux-bw-probe-cleanup-and-levelDone-counters

Conversation

@0pcom
Copy link
Copy Markdown
Collaborator

@0pcom 0pcom commented May 21, 2026

Summary

Two small unrelated fixes against the rpcgrpc handlers; bundled because the diffs are tiny and ship in the same package.

1. mux-bw probe-after-pump race

In `muxBwProbeLoop`, the `select { ... ticker.C ... }` loop can land in the ticker-fired branch in the same scheduler window as the pump's `ctx.Done` — by the time `PingOnce` fires, the pump's defer has already torn down the route via `StopPingRoute`. Result: 1-2 trailing "no ping connection for ... call DialPing first" probe errors at every mux-bw run end.

Cosmetic but noisy. Fix: re-check `ctx.Err()` at the top of the ticker case so the probe bails out cleanly when the pump is shutting down.

2. LevelDone counters never populated

In `server_ping_tree.go`, `levelDoneFor()` emitted only `Attempted`; `Succeeded` / `Failed` / `SkippedCached` were always zero because no per-level counters were tracked anywhere. Operator-visible: `tree-stream`'s `level_done` human row read `attempted=509 succeeded=0 failed=0 skipped_cached=0` even when 484 entries came back from the transport-summary cache.

Fix: new `levelStats` struct (per-level int32 atomics), accumulated alongside the existing `pingTreeTotals` counters in `pingTreePingLevel`, returned from the level function, read into `PingTreeLevelDone`. Both call sites updated; dry-run path populates SkippedCached locally.

Test plan

  • `go build ./...` clean
  • `go test ./pkg/visor/rpcgrpc/...` pass
  • `golangci-lint run` clean
  • Live: `mux-bw` run-end no longer logs trailing "no ping connection" probes
  • Live: `tree-stream` level_done rows show real succeeded/failed/cached counts that sum to attempted

0pcom added 2 commits May 20, 2026 18:57
The local route-calc fallback in calculateLocalRoutes only knows
how to build 1-hop (direct) and 2-hop routes. When the global
route-finder fails to produce a path (transient timeout, stale
TPD, ErrNoSuitableTransport), DialRoutes falls back to
calculateLocalRoutes — which would silently return a 2-hop route
even when the caller demanded min-hops >= 3.

Empirically reproduced:
  cli visor ping mux-bw <peer> --min-hops 3  → hops_count=2
  cli visor ping mux-bw <peer> --min-hops 4  → hops_count=2
  cli visor ping mux-bw <peer> --min-hops 5  → hops_count=5 (correct)

The route-finder service does honor min-hops correctly when
queried directly (`cli route find <peer> -n 3 -x 3` returns
proper 3-hop paths). The bug is in the local fallback: it
returns whatever 2-hop path it can build, regardless of opts.

FIX: short-circuit calculateLocalRoutes with an error when
dialOpts.MinHops > 2. The router-level retry loop above will then
either re-attempt the route-finder query or return the error to
the caller — both of which are correct behaviors when the caller
explicitly demands a multi-hop path the local cache can't
satisfy.

Effect: `mux-bw --min-hops N` with N >= 3 either gets a route
that actually has >= N hops (when the route-finder service is
healthy) or fails cleanly with a useful error — no more silent
constraint violations.

Build / gofmt / golangci-lint clean. The earlier MinHops > 1
guard (suppressing the direct-1-hop probe) added in skycoin#2749 is
unchanged; this PR layers a second guard for the 2-hop case.
…ters

Two small unrelated fixes against the rpcgrpc handlers; bundling
because the diffs are tiny and ship in the same package.

1. mux-bw probe-after-pump race

In muxBwProbeLoop, the `for { select { ... ticker.C ... } }` loop
can land in the ticker-fired branch in the same scheduler window
as the pump's ctx.Done — by the time PingOnce fires, the pump
goroutine's defer has already torn down the route via
StopPingRoute. Result: 1-2 trailing "no ping connection for ...
call DialPing first" probe errors at every mux-bw run end.

Cosmetic but noisy in the operator's NDJSON output. Fix: re-check
ctx.Err() at the top of the ticker case so the probe bails out
cleanly when the pump is already shutting down.

2. LevelDone counters never populated for cache hits / failures

In server_ping_tree.go, levelDoneFor() emitted only the
candidate-count under Attempted; Succeeded / Failed / SkippedCached
were always zero because no per-level counters were tracked
anywhere. Operator-visible: tree-stream's level_done human row
read "attempted=509 succeeded=0 failed=0 skipped_cached=0" even
when 484 entries actually came back from the transport-summary
cache.

Fix: introduce a levelStats struct (per-level atomic int32
triple), accumulate alongside the existing pingTreeTotals
counters in pingTreePingLevel, return it from the level function,
and read into PingTreeLevelDone.{Succeeded,Failed,SkippedCached}.
Both call sites (level-1 + level-N loop) updated; dry-run path
populates SkippedCached locally.

Build / gofmt / golangci-lint clean. Existing rpcgrpc tests pass.
@0pcom 0pcom merged commit 5732f6d into skycoin:develop May 21, 2026
3 of 4 checks passed
0pcom added a commit that referenced this pull request May 21, 2026
… + MinHops (#2757)

Beta's #2756 MuxRouteFailure event surfaced the smoking gun on
mux-bw --routes N --min-hops 2: one route would establish but its
pump loop immediately failed with

  no ping connection for <pk>#0, call DialPing first

while MuxRouteEstablished named route_index = 2 for the same
route. The lookup PingRouteRef.Index was 0 even though the
pump goroutine had RouteIndex = 2.

ROOT CAUSE
The rpcgrpc PingConf → visor PingConfig adapter in
pkg/visor/init_apps.go's visorPingAdapter forwards RouteIndex
correctly for DialPing (line 238) and PingOnce (line 249), but
PingOnceWithEcho's adapter (lines 316-322) was missing the
field — same for MinHops. Aux-route pumps therefore degraded to
a primary-route (Index=0) lookup, finding nothing because
DialPing had registered the conn at the matching aux Index.

FIX
Add `RouteIndex: conf.RouteIndex` and `MinHops: conf.MinHops` to
the visorPingAdapter.PingOnceWithEcho conversion. Three-line
adapter parity fix.

DMSG adapter (DmsgPingOnceWithEcho) intentionally untouched —
v.dmsgPing.conns is keyed by PK alone, not by PingRouteRef, so
no DMSG path consumes RouteIndex.

TEST
TestPingAdapter_PingOnceWithEcho_ForwardsRouteIndex pins the
adapter contract: with no ping connection registered, the visor's
PingOnceWithEcho returns

  no ping connection for %s#%d, call DialPing first

The "%d" portion is conf.RouteIndex post-adapter. The test calls
the adapter with RouteIndex in {0, 1, 2, 7} and asserts the error
message contains the matching `#<idx>,` — so a regression that
drops RouteIndex (or stuffs in MinHops by accident) surfaces in
the error string directly. No mock VisorAPI, no fixtures.

EMPIRICAL CHAIN
This is the third bug in the mux-bw --min-hops measurement chain
to surface via the wire-event observability landed in #2746#2749#2751#2752#2750#2753#2754#2756. The pattern
holds: each event-surface fix lights up the next bug downstream.
The operator's "mux > direct" hypothesis test should now finally
be measurable end-to-end once #2756 (route_failure event) +
this PR auto-deploy.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant