fix(visor/ping): narrow ping.mu critical section to map lookup#2761
Merged
0pcom merged 2 commits intoMay 21, 2026
Conversation
…ookup
DOMINANT BOTTLENECK for mux-bw bandwidth measurements. The visor's
Ping/PingOnce/PingOnceWithEcho (and dmsg twins) held v.ping.mu (a
single visor-global *sync.Mutex) for the ENTIRE wire roundtrip:
v.ping.mu.Lock()
defer v.ping.mu.Unlock()
pingEntry, ok := v.ping.conns[ref]
// ... ~287ms of wire I/O at 2-hop with 32 KB payloads ...
mux-bw's N pump goroutines all call PingOnceWithEcho on DIFFERENT
PingRouteRefs. They each look up their OWN conn via the map; the
wire I/O is independent. But the global mutex serialized them
through one ~287ms slot each. So:
- Aggregate throughput across N routes didn't scale with N
- Per-route avg pinned at ~351 kbps even though single-call peak
was ~1.7 Mbps (1 RTT × 32 KB)
- --probe-rtt latency probes during a loaded pump measured
"probe-mutex-wait + network RTT" instead of network RTT,
swamping the queueing-delay signal at short hop counts
- Bidirectional simultaneous mux-bw measurements showed
mutual-starvation that LOOKED like shared-link contention
but was actually mutex contention on each side's ping state
ROOT CAUSE
The mutex's actual job is to protect v.ping.conns (the map) from
concurrent insert (DialPing) and delete (StopPingRoute). The
wire I/O on the chosen conn does NOT need the map mutex held —
each mux-bw pump goroutine owns its own conn via its RouteIndex,
no aliasing.
FIX
Shrink the critical section to just the map lookup:
v.ping.mu.Lock()
pingEntry, ok := v.ping.conns[ref]
v.ping.mu.Unlock()
if !ok { ... }
// ... wire I/O on pingEntry.conn WITHOUT holding the mutex ...
Applied to Ping, PingOnce, PingOnceWithEcho, DmsgPing,
DmsgPingOnce, DmsgPingOnceWithEcho. BandwidthTest already had the
correct narrow scope.
CONCURRENT-CLOSE SEMANTICS
Pre-fix: StopPing concurrent with PingOnceWithEcho serialized via
the mutex — they took turns, no race. Post-fix: StopPing can close
the conn while PingOnceWithEcho is doing wire I/O. The Read/Write
on the closed conn returns ErrClosed cleanly. mux-bw's pump loop
already handles read/write errors by exiting the pump goroutine;
the resulting failure is surfaced via Beta's MuxRouteFailure event
(skycoin#2756) so the operator sees the cause instead of an indefinite
block.
The same-PingRouteRef-from-multiple-goroutines case (always
undefined behavior on the underlying net.Conn) is unchanged —
callers must serialize themselves. mux-bw enforces one
goroutine per RouteIndex natively.
TESTS
pkg/visor/ping_mu_concurrency_test.go:
- TestPingOnceWithEcho_DoesNotSerializeAcrossRouteIndexes:
200 concurrent calls with distinct PingRouteRefs (no registered
conns) complete in << 1s. A regression that re-introduces
wire-I/O-under-lock would either time out or take orders of
magnitude longer.
- TestPingMu_NotHeldDuringConnAbsentCallpath: after
PingOnceWithEcho returns, the mutex must be immediately
acquirable from another goroutine. Catches the defer-on-entry
pattern directly.
EMPIRICAL PREDICTION
Once this auto-deploys, the operator's "mux > direct" hypothesis
becomes testable WITHOUT --hops-via intermediate pinning. Per-
route avg should rise from ~351 kbps toward the single-call peak
of ~1.7 Mbps, and N=2..8 disjoint mux should aggregate roughly
linearly (modulo per-intermediate quality variance) instead of
flat-lining at single-route throughput.
CHAIN
The mux-bw measurement loop has now closed:
skycoin#2745 per-route teardown
skycoin#2746 disjoint-intermediate routing
skycoin#2749 plumb --min-hops through DialPing
skycoin#2750 stop poisoning dst breaker from intermediate failures
skycoin#2751 tryDirectPingDial gate on MinHops
skycoin#2752 honor caller SetupTimeout
skycoin#2756 MuxRouteFailure pump-phase event
skycoin#2757 PingOnceWithEcho adapter forwards RouteIndex
this PR: ping.mu doesn't serialize parallel pumps
If the mux > direct hypothesis is real, it should be visible
in measurements after this lands.
golangci-lint errcheck flagged both test sites that discard the 4-tuple return of v.PingOnceWithEcho via _, _, _, _. The discards are intentional — the concurrency test asserts on wall-clock serialization behavior, not on per-call success; the mutex-release test asserts that the lock is acquirable post-return regardless of whether the call itself succeeded or returned ErrNoPingConnection. Add //nolint:errcheck comments explaining the intent. No behavior change; CI lint pass.
4 tasks
0pcom
added a commit
that referenced
this pull request
May 21, 2026
…ssthrough (#2762) Two related fixes that emerged from the post-#2761 measurement work: 1. SILENT PUMP STALL (#127) — pkg/visor/api_ping.go PingOnceWithEcho set read deadlines (30s) but NO write deadlines. If the underlying TCP send buffer filled (route under load, slow intermediate, etc.) conn.Write would block indefinitely. Empirical symptom: mux-bw runs with route_established success + 0 bytes pumped + 0 MuxRouteFailure events emitted, because the pump goroutine was stuck inside Write rather than returning with an error. In my Phase 6 sweep against Gamma post-#2761: 10 of 12 cells showed exactly this pattern (idle baseline n=40-46 probes, loaded n=0-2, sent=recv=0B for the full pump duration). Fix: SetWriteDeadline mirrors the SetReadDeadline calls already in place (30s). On deadline, Write returns net.ErrDeadlineExceeded and the pump's existing error-return path fires a MuxRouteFailure event (#2756) — making the stall observable instead of silent. 2. PROBE HARDCODED RouteIndex 0 (#130) — Gamma's find at 20:20Z muxBwProbeLoop's probeConf didn't set RouteIndex, so every probe used PingRouteRef{PK, Index: 0}. When N>1 and the first established route was at a non-zero index (Gamma's Phase D: only R2 of N=8 came up), every probe failed with "no ping connection". Same class as #2757's adapter bug but in the probe path. Fix: muxBwProbeLoop takes a routeIndex parameter. Both callers (idle baseline + loaded probe) pick the first established route and pass its index. The idle phase needs the same selection logic as the loaded phase — previously it just ran the loop blindly, which would also hit RouteIndex 0 even when no route 0 had been established. For #127, the deferred SetReadDeadline reset paths gain the matching SetWriteDeadline reset so the conn returns to its deadline-free state on all return paths. For #130, the loaded-phase probe selection logic at line 257-265 is preserved verbatim; the only change is passing probeTarget.index into the goroutine. The idle-phase selection gets new "find first established route" logic that mirrors the loaded-phase pattern. Build / gofmt / golangci-lint clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Dominant bottleneck for mux-bw bandwidth measurements. The visor's Ping / PingOnce / PingOnceWithEcho (and dmsg twins) held
v.ping.mu— a single visor-global*sync.Mutex— for the entire wire roundtrip (~287ms at 2-hop with 32 KB payloads):mux-bw's N pump goroutines each look up their own conn via the map; the wire I/O is independent. But the global mutex serialized them through one ~287ms slot each. So:
--probe-rttlatency probes during a loaded pump measured probe-mutex-wait + network RTT instead of network RTT — swamping the queueing-delay signal at short hop counts (Alpha's lane confirmed this independently)Beta and I converged on the same diagnosis independently this afternoon.
Fix
The mutex's actual job is to protect
v.ping.conns(the map). Wire I/O on the chosen conn doesn't need the map mutex held — each mux-bw pump goroutine owns its conn via itsRouteIndex, no aliasing.Shrink the critical section to just the map lookup:
Applied to
Ping,PingOnce,PingOnceWithEcho,DmsgPing,DmsgPingOnce,DmsgPingOnceWithEcho.BandwidthTestalready had the correct narrow scope.Concurrent-close semantics
Pre-fix:
StopPingconcurrent withPingOnceWithEchoserialized via the mutex — they took turns. Post-fix:StopPingcan close the conn while wire I/O is in flight. The Read/Write on the closed conn returns ErrClosed cleanly. mux-bw's pump loop exits on err and surfaces the cause via Beta's MuxRouteFailure event (#2756) so the operator sees the failure mode instead of an indefinite block.Same-PingRouteRef-from-multiple-goroutines (always undefined behavior on the underlying net.Conn) is unchanged — callers must serialize themselves. mux-bw enforces one goroutine per RouteIndex natively.
Tests
pkg/visor/ping_mu_concurrency_test.go:TestPingOnceWithEcho_DoesNotSerializeAcrossRouteIndexes— 200 concurrent calls with distinctPingRouteRefs (no registered conns) complete in ≪ 1s. A regression that reintroduces wire-I/O-under-lock would either time out or take orders of magnitude longer.TestPingMu_NotHeldDuringConnAbsentCallpath— after PingOnceWithEcho returns, the mutex must be immediately acquirable from another goroutine. Catches thedefer ... Unlock()pattern directly.Empirical prediction
Once this auto-deploys, the operator's "mux > direct" hypothesis becomes testable without
--hops-viaintermediate pinning. Per-route avg should rise from ~351 kbps toward the single-call peak of ~1.7 Mbps, and N=2..8 disjoint mux should aggregate roughly linearly (modulo per-intermediate quality variance) instead of flat-lining at single-route throughput.Chain
The mux-bw measurement loop has now fully closed:
--min-hopsthrough DialPingIf the mux-over-intermediates hypothesis is real, it should now be measurable.
Test plan
go test ./pkg/visor/... -count=1 -short— cleango build ./...— clean (via pkg/visor build path)go vet ./pkg/visor/...— cleangofmt -l pkg/visor/— clean