node/cn: replace per-peer known-tx/block/bid LRU caches with a pointer-free FIFO set #910
Merged
CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅
The per-peer knownTxsCache/knownBlocksCache/knownBidsCache were golang-lru
caches created with IsScaled:true, so on large-RAM nodes calculateScale()
multiplied their size by TotalPhysicalMemGB/16 (x8 on 128GB), e.g. knownTxs
32768 -> 262144 entries per peer. golang-lru stores each entry in a
container/list element keyed by a map, and the generic common.Cache wrapper
boxes the key into an interface: ~3 pointer-rich live objects per entry. With
one set per peer x dozens of peers, that is tens of millions of pointers the
GC mark phase must traverse every cycle. On a long-running endpoint node,
live-heap profiling showed GC at ~57% of CPU (runtime.findObject ~27% flat),
and the per-peer sets filled slowly over ~a day -- a steady post-restart CPU
creep that only a restart reset.
These sets only need membership + oldest-first eviction (FIFO); access-recency
(LRU) was never used (the caches were configured FIFO, i.e. Get == Peek).
Replace them with knownHashSet: a preallocated ring ([]common.Hash) for O(1)
FIFO eviction plus a map[common.Hash]struct{} for O(1) membership. Both are
pointer-free ("noscan"), so the GC traverses no pointers for these structures
regardless of how many entries they hold.
Benchmarks (go test -bench BenchmarkKnown -benchmem ./node/cn):
  Add (steady state):   golang-lru    112.8 ns/op   113 B/op   3 allocs/op
                        knownHashSet   48.5 ns/op     0 B/op   0 allocs/op
  GC mark (live+full), n=1,000,000:
                        golang-lru    11.08 ms
                        knownHashSet   0.80 ms   (~14x)
    knownHashSet GC time is flat in n; golang-lru scales linearly.
  Footprint:            golang-lru 3.00 live objects/entry; knownHashSet ~0.
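As a rough check of the scaling arithmetic in the commit message above, the sketch below reproduces the per-peer sizes. scaledSize is a hypothetical helper that models only the TotalPhysicalMemGB/16 factor quoted above; it is not the real calculateScale(), and clamping the factor to a minimum of 1 is an assumption made for illustration.

```go
package main

import "fmt"

// scaledSize models the RAM scaling quoted in the commit message: the base
// bound is multiplied by totalMemGB/16. Hypothetical helper for illustration,
// not the real calculateScale().
func scaledSize(base, totalMemGB int) int {
	scale := totalMemGB / 16
	if scale < 1 {
		scale = 1 // assumption: small hosts keep the base bound
	}
	return base * scale
}

// liveObjects estimates GC-visible objects for a golang-lru backed set,
// using the ~3 live objects per entry reported by the benchmark footprint.
func liveObjects(entries int) int {
	return entries * 3
}

func main() {
	const maxKnownTxs = 32768 // base bound from the PR

	for _, memGB := range []int{16, 128, 256} {
		entries := scaledSize(maxKnownTxs, memGB)
		fmt.Printf("%3d GB host: %6d entries per peer, ~%d live objects per peer\n",
			memGB, entries, liveObjects(entries))
	}
}
```

Multiplying the per-peer figure by the number of connected peers gives the total the GC mark phase has to walk on every cycle.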
03c142f to ef2c4e9
I have read the CLA Document and I hereby sign the CLA
hyunsooda approved these changes May 26, 2026
ian0371 approved these changes May 27, 2026
Thanks for the contribution!
Proposed changes
Background: what these caches are
Every peer connection keeps three "known hashes" sets: knownTxsCache, knownBlocksCache, and knownBidsCache (node/cn/peer.go). They exist purely to de-duplicate gossip: don't send a peer an item it already has.
A hash is added both when we send an item to the peer (SendTransactions → AddToKnownTxs) and when we receive it from the peer (handleTxMsg → AddToKnownTxs); in both cases the peer is known to have it.
peer_set.go skips any peer for which KnowsTx(hash)/KnowsBlock(hash) is already true.
So each set is a per-peer, best-effort record of "hashes this peer already has," used only to avoid redundant sends. The size constant is even commented "prevent DOS": bounding is a requirement, not a tuning knob.
Why a small, fixed-size bound is correct (and a large one is pure cost)
This is the crux, so spelling it out:
The set is an optimization, never a correctness mechanism. It only ever suppresses a send. Evicting an entry can therefore only cause a redundant send (we forget a peer has X, so we send X again); the peer already has X, dedups it locally, and does not re-announce it — so there is no loop, no missed delivery, no protocol error. There is no failure mode where forgetting an entry breaks anything; the worst case is one wasted message.
An entry is only useful during the propagation window. Gossip dedup matters while an item is actively spreading (seconds for a tx; until a block is a few deep). Once a tx is mined/dropped or a block is old, it is no longer broadcast, so remembering its hash can never match a future send again. Old entries are dead weight. FIFO eviction (drop oldest-inserted) is exactly right: it discards the entries that have already aged out of the window. This is also why LRU is unnecessary here: relevance is set at insert time (when we learned the peer has it), not at access time; the existing cache was already configured FIFO (fifoCache.Get == Peek).
The bound only needs to exceed the number of distinct items in flight during that window. Concretely: at, say, 100 tx/s with a ~10 s propagation window, only ~1,000 distinct hashes are ever "in flight" at once. The existing maxKnownTxs = 32768 covers ~300+ seconds of such traffic, already ~30× the window. Anything beyond that is remembering hashes that will provably never be re-broadcast. So 32768 is generous; there is no dedup benefit to making it larger.
In short: a fixed bound that comfortably covers the propagation window gives the full dedup benefit, and over-sizing it cannot improve dedup; it only adds cost.
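The sizing argument above is plain arithmetic, and a small sketch makes it easy to re-run with other traffic figures. The function names inFlight and coverageSec are hypothetical, and the 100 tx/s rate and 10 s window are the illustrative figures from the text, not measurements.

```go
package main

import "fmt"

// inFlight estimates how many distinct hashes are propagating at once:
// arrival rate (items per second) times the propagation window (seconds).
func inFlight(ratePerSec, windowSec int) int {
	return ratePerSec * windowSec
}

// coverageSec estimates how many seconds of traffic a FIFO bound of the
// given size remembers before its oldest entry is evicted.
func coverageSec(bound, ratePerSec int) int {
	return bound / ratePerSec
}

func main() {
	const (
		maxKnownTxs = 32768 // bound from the PR
		rate        = 100   // tx/s, illustrative
		window      = 10    // seconds, illustrative
	)

	covered := coverageSec(maxKnownTxs, rate)
	fmt.Println("hashes in flight:", inFlight(rate, window)) // 1000
	fmt.Println("seconds covered: ", covered)                // 327
	fmt.Println("windows covered: ", covered/window)         // 32
}
```

With these inputs the bound already spans roughly thirty propagation windows, which is the margin the text describes as generous.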
The bug: the bound is scaled by host RAM
The three sets are created with common.NewCache(common.FIFOCacheConfig{… IsScaled: true}). IsScaled: true multiplies the size by calculateScale(), i.e. by TotalPhysicalMemGB/16.
On a 128 GB host this is ×8, so knownTxs becomes 32768 → 262144 entries per peer (×16 at 256 GB; more with --cache.level). Per the reasoning above this extra capacity yields zero dedup benefit: it just remembers ~tens of minutes of stale hashes. But it is not free: golang-lru stores every entry as a container/list element keyed by a map, and the generic common.Cache wrapper boxes the common.Hash key into an interface, giving ~3 pointer-rich live objects per entry (measured below). With one set per peer × dozens of peers, the RAM-scaled bound produces tens of millions of pointers the GC mark phase must traverse on every cycle.
Production impact (128 GB host, 54 peers)
With IsScaled: true each peer's knownTxs set holds 262,144 entries, so across 54 peers the known sets reach up to 14.2M entries ≈ 42M live, pointer-rich objects (~3 objects/entry, measured) that the GC must re-scan on every mark cycle. The symptom was a slow, monotonic CPU climb after each restart. A heap + CPU profile of the live node showed:
knownTxs entries and climbing, ~17M live objects in the AddToKnownTxs path.
GC at ~57% of CPU (gcBgMarkWorker/gcDrain), about 3.1 of the ~5.5 active cores doing nothing but collecting, with runtime.findObject ≈ 27% (~1.5 cores) pointer-chasing alone, on a 32-vCPU node.
The fix
Replace the three golang-lru-backed sets with knownHashSet: a preallocated ring ([]common.Hash, O(1) FIFO eviction) plus a map[common.Hash]struct{} (O(1) membership), guarded by a mutex (same concurrency contract as before). common.Hash is [32]byte and the value is struct{}, so both the ring and the map are pointer-free ("noscan"): the GC traverses no pointers for these structures no matter how many entries they hold.
Semantics are unchanged: membership + oldest-first (FIFO) eviction, same fixed bounds (maxKnownTxs=32768, maxKnownBlocks=1024, maxKnownBids=2048, also the upstream go-ethereum defaults). Re-adding a present hash is a no-op (insertion order, and thus eviction order, is preserved). No protocol/consensus behavior changes, only how a peer remembers which hashes it has already exchanged.
Benchmarks
go test -run '^$' -bench BenchmarkKnown -benchmem ./node/cn (Go 1.25):
The "objects/entry" figure is the mechanism: golang-lru keeps 3 pointer-rich objects per entry that GC must scan; knownHashSet keeps a constant handful, so its mark cost does not grow with the set.
Production verification
Deployed to the same endpoint node (128 GB, ~54 peers). Comparing the first ~13.5 h of uptime against the previous build (same instance/config):
GCCPUFraction is 1.48% and flat (1.49% at 3 h → 1.48% at 13.5 h); the old build's GC climbed as the caches filled (its 31 h CPU profile showed gcBgMarkWorker ~57%). A fresh CPU profile shows runtime.findObject down from 26.8% to 9.9% (44.6 s → 10.0 s per 30 s), i.e. the linked-list pointer-scanning is largely gone.
Types of changes
Checklist
$ make test)
Related issues
Further comments
Why not just IsScaled: false? That divides the problem by 8 but the structure is still container/list-based, so GC cost stays O(entries) and still scales with RAM-driven config elsewhere. The pointer-free set removes the cost categorically (O(1) in pointers scanned), independent of host RAM.
Why not fastcache? It also avoids GC scanning (off-heap data), but it targets a few large shared caches; one instance per peer carries its fixed 512-bucket overhead and coarse chunk eviction, a poor fit for many small per-peer sets. The noscan ring+map is simpler, on-heap, exact-FIFO, dependency-free, and benchmarks equally GC-cheap.
node/sc/bridgepeer.go has the identical knownTxsCache pattern (IsScaled: true). It is the service-chain bridge peer (lower traffic) and outside the profiling above; happy to convert it here (extracting knownHashSet to a shared location) or in a follow-up, at the reviewer's preference.
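For readers who want to see the shape of the structure described under "The fix", here is a minimal sketch of a ring-plus-map FIFO set. It is not the code from this PR: the Hash type stands in for common.Hash so the example is self-contained, and the constructor and method names (newKnownHashSet, Add, Contains, Len) are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Hash stands in for common.Hash ([32]byte) to keep the sketch self-contained.
type Hash [32]byte

// knownHashSet is a bounded FIFO set. The ring records insertion order for
// eviction and the map answers membership; neither holds pointers.
type knownHashSet struct {
	mu   sync.Mutex
	ring []Hash            // preallocated slots, reused in a circle
	set  map[Hash]struct{} // membership index
	next int               // slot the next new hash is written to
}

// newKnownHashSet (hypothetical name) allocates both structures up front.
func newKnownHashSet(capacity int) *knownHashSet {
	if capacity < 1 {
		capacity = 1 // assumption: guard against a zero-length ring
	}
	return &knownHashSet{
		ring: make([]Hash, capacity),
		set:  make(map[Hash]struct{}, capacity),
	}
}

// Add inserts h, evicting the oldest-inserted hash when the set is full.
// Re-adding a present hash is a no-op, so eviction order is unchanged.
func (s *knownHashSet) Add(h Hash) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.set[h]; ok {
		return
	}
	if len(s.set) == len(s.ring) {
		// Once full, s.next points at the oldest entry.
		delete(s.set, s.ring[s.next])
	}
	s.ring[s.next] = h
	s.set[h] = struct{}{}
	s.next = (s.next + 1) % len(s.ring)
}

// Contains reports whether h is currently remembered.
func (s *knownHashSet) Contains(h Hash) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.set[h]
	return ok
}

// Len returns the number of remembered hashes.
func (s *knownHashSet) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.set)
}

func main() {
	s := newKnownHashSet(3)
	s.Add(Hash{1})
	s.Add(Hash{2})
	s.Add(Hash{3})
	s.Add(Hash{1}) // already present: insertion order is kept
	s.Add(Hash{4}) // set is full: evicts Hash{1}, the oldest-inserted

	fmt.Println(s.Contains(Hash{1}), s.Contains(Hash{2}), s.Len())
}
```

Because eviction only looks at the ring slot under next and membership is a single map lookup, both operations stay O(1) and steady-state Add needs no new allocation, which matches the behaviour the PR describes.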