node/cn: replace per-peer known-tx/block/bid LRU caches with a pointer-free FIFO set #910

Merged
ian0371 merged 1 commit into kaiachain:dev from blockswords-io:known-cache-fifo-set
May 27, 2026

@swapscanner-ryan
Contributor

Proposed changes

Background: what these caches are

Every peer connection keeps three "known hashes" sets — knownTxsCache, knownBlocksCache, knownBidsCache (node/cn/peer.go). They exist purely to de-duplicate gossip: don't send a peer an item it already has.

  • We add a hash to a peer's set when we send that item to the peer (SendTransactionsAddToKnownTxs) and when we receive it from the peer (handleTxMsgAddToKnownTxs) — in both cases the peer is known to have it.
  • Before broadcasting/announcing, we filter: peer_set.go skips any peer for which KnowsTx(hash) / KnowsBlock(hash) is already true.

So each set is a per-peer, best-effort record of "hashes this peer already has," used only to avoid redundant sends. The size constant is even commented "prevent DOS" — bounding is a requirement, not a tuning knob.
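The add-then-filter flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in node/cn: the `peer` struct, `peersWithoutTx`, and `broadcast` are hypothetical names, `Hash` stands in for common.Hash, and the set is an unbounded map here because only the dedup logic is being shown.

```go
package main

import "fmt"

// Hash stands in for common.Hash ([32]byte in the real code).
type Hash [32]byte

// peer is a hypothetical, simplified peer holding one known-hash set.
// The real sets are bounded; this one is not, to keep the sketch short.
type peer struct {
	id    string
	known map[Hash]struct{}
}

func newPeer(id string) *peer { return &peer{id: id, known: map[Hash]struct{}{}} }

// KnowsTx reports whether the peer is already recorded as having h.
func (p *peer) KnowsTx(h Hash) bool { _, ok := p.known[h]; return ok }

// AddToKnownTxs records that the peer has h (called on send and on receive).
func (p *peer) AddToKnownTxs(h Hash) { p.known[h] = struct{}{} }

// peersWithoutTx returns the peers not yet known to have h.
func peersWithoutTx(peers []*peer, h Hash) []*peer {
	var out []*peer
	for _, p := range peers {
		if !p.KnowsTx(h) {
			out = append(out, p)
		}
	}
	return out
}

// broadcast "sends" h to every peer that lacks it, marks it as known for
// those peers, and returns how many sends happened.
func broadcast(peers []*peer, h Hash) int {
	targets := peersWithoutTx(peers, h)
	for _, p := range targets {
		p.AddToKnownTxs(h) // after a send, the peer is known to have h
	}
	return len(targets)
}

func main() {
	a, b := newPeer("a"), newPeer("b")
	h := Hash{0x01}
	a.AddToKnownTxs(h)                       // we received h from a
	fmt.Println(broadcast([]*peer{a, b}, h)) // 1: only b needs it
	fmt.Println(broadcast([]*peer{a, b}, h)) // 0: both are now known to have it
}
```

If an entry were evicted between the two broadcasts, the second call would return a non-zero count again, which is the "one wasted message" case and nothing worse.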

Why a small, fixed-size bound is correct (and a large one is pure cost)

This is the crux, so it is worth spelling out:

  1. The set is an optimization, never a correctness mechanism. It only ever suppresses a send. Evicting an entry can therefore only cause a redundant send (we forget a peer has X, so we send X again); the peer already has X, dedups it locally, and does not re-announce it — so there is no loop, no missed delivery, no protocol error. There is no failure mode where forgetting an entry breaks anything; the worst case is one wasted message.

  2. An entry is only useful during the propagation window. Gossip dedup matters while an item is actively spreading (seconds for a tx; until a block is a few deep). Once a tx is mined/dropped or a block is old, it is no longer broadcast, so remembering its hash can never match a future send again. Old entries are dead weight. FIFO eviction (drop oldest-inserted) is exactly right: it discards the entries that have already aged out of the window. This is also why LRU is unnecessary here — relevance is set at insert time (when we learned the peer has it), not at access time; the existing cache was already configured FIFO (fifoCache.Get == Peek).

  3. The bound only needs to exceed the number of distinct items in flight during that window. Concretely: at, say, 100 tx/s with a ~10 s propagation window, only ~1,000 distinct hashes are ever "in flight" at once. The existing maxKnownTxs = 32768 covers ~327 seconds of such traffic (32768 / 100), already ~30× the window. Anything beyond that is remembering hashes that are no longer being broadcast. So 32768 is generous; there is no dedup benefit to making it larger.

In short: a fixed bound that comfortably covers the propagation window gives the full dedup benefit, and over-sizing it cannot improve dedup — it only adds cost.
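As a quick check of the sizing argument, the arithmetic can be written out. The 100 tx/s rate and 10 s window are the illustrative figures from the text, not measurements, and the helper names are hypothetical.

```go
package main

import "fmt"

// inFlight estimates how many distinct hashes are propagating at once:
// arrival rate multiplied by the propagation window.
func inFlight(txPerSec, windowSec int) int { return txPerSec * windowSec }

// coverageSec is how many seconds of traffic a set of `bound` entries
// can remember at the given rate (integer division, rounds down).
func coverageSec(bound, txPerSec int) int { return bound / txPerSec }

func main() {
	const maxKnownTxs = 32768
	const rate, window = 100, 10 // example figures from the text

	fmt.Println(inFlight(rate, window))                  // 1000
	fmt.Println(coverageSec(maxKnownTxs, rate))          // 327
	fmt.Println(coverageSec(maxKnownTxs, rate) / window) // 32
}
```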

The bug: the bound is scaled by host RAM

The three sets are created with common.NewCache(common.FIFOCacheConfig{… IsScaled: true}). IsScaled: true multiplies the size by calculateScale():

CacheScale * ScaleByCacheUsageLevel * TotalPhysicalMemGB / minimumMemorySize / 100 / 100

On a 128 GB host this is ×8, so knownTxs becomes 32768 → 262144 entries per peer (×16 at 256 GB; more with --cache.level). Per the reasoning above this extra capacity yields zero dedup benefit: at the 100 tx/s example rate it just remembers roughly 44 minutes of stale hashes. But it is not free:

golang-lru stores every entry as a container/list element keyed by a map, and the generic common.Cache wrapper boxes the common.Hash key into an interface — ~3 pointer-rich live objects per entry (measured below). With one set per peer × dozens of peers, the RAM-scaled bound produces tens of millions of pointers the GC mark phase must traverse on every cycle.
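The scaling arithmetic above can be reproduced with a small sketch. It assumes CacheScale and ScaleByCacheUsageLevel are both 100 (percent) and minimumMemorySize is 16 GB; the 16 matches the TotalPhysicalMemGB/16 figure in the commit message, while the two 100s are assumptions chosen to reproduce the ×8 result. The function names are hypothetical and this is not the body of calculateScale().

```go
package main

import "fmt"

// Assumed values for illustration only; see the lead-in.
const (
	cacheScale             = 100 // percent (assumed)
	scaleByCacheUsageLevel = 100 // percent (assumed)
	minimumMemorySize      = 16  // GB
)

// scaleFactor applies the quoted formula with integer arithmetic.
func scaleFactor(totalPhysicalMemGB int) int {
	return cacheScale * scaleByCacheUsageLevel * totalPhysicalMemGB /
		minimumMemorySize / 100 / 100
}

// scaledSize is the per-peer capacity after scaling.
func scaledSize(base, totalPhysicalMemGB int) int {
	return base * scaleFactor(totalPhysicalMemGB)
}

// retainedObjects estimates live heap objects across all peers when the
// sets are full, given a measured objects-per-entry figure.
func retainedObjects(peers, entriesPerPeer, objectsPerEntry int) int {
	return peers * entriesPerPeer * objectsPerEntry
}

func main() {
	const maxKnownTxs = 32768
	fmt.Println(scaleFactor(128), scaledSize(maxKnownTxs, 128)) // 8 262144
	fmt.Println(scaleFactor(256), scaledSize(maxKnownTxs, 256)) // 16 524288
	fmt.Println(retainedObjects(54, 262144, 3))                 // 42467328
}
```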

Production impact (128 GB host, 54 peers)

With IsScaled:true each peer's knownTxs set holds 262,144 entries, so across 54 peers the knownTxs sets alone reach up to 14.2M entries (54 × 262,144) ≈ 42M live, pointer-rich objects (~3 objects/entry, measured) that the GC must re-scan on every mark cycle. The symptom was a slow, monotonic CPU climb after each restart. A heap + CPU profile of the live node showed:

  • At ~31 h uptime (sets not yet full): ~5.8M knownTxs entries and climbing, ~17M live objects in the AddToKnownTxs path.
  • GC ≈ 57% of the node's busy CPU (gcBgMarkWorker/gcDrain) — about 3.1 of the ~5.5 active cores doing nothing but collecting — with runtime.findObject ≈ 27% (~1.5 cores) pointer-chasing alone, on a 32-vCPU node.
  • These per-peer sets were the single largest retained, pointer-dense structure and accounted for ~83% of live-object growth over the measured window.
  • The sets fill slowly (a peer learns only a few hashes/s), ramping toward the ×8 cap over ~a day — exactly the shape of the post-restart CPU creep; a restart reset it.

The fix

Replace the three golang-lru-backed sets with knownHashSet: a preallocated ring ([]common.Hash, O(1) FIFO eviction) plus a map[common.Hash]struct{} (O(1) membership), guarded by a mutex (same concurrency contract as before). common.Hash is [32]byte and the value is struct{}, so both the ring and the map are pointer-free ("noscan") — the GC traverses no pointers for these structures no matter how many entries they hold.

Semantics are unchanged: membership + oldest-first (FIFO) eviction, same fixed bounds (maxKnownTxs=32768, maxKnownBlocks=1024, maxKnownBids=2048 — also the upstream go-ethereum defaults). Re-adding a present hash is a no-op (insertion order, and thus eviction order, is preserved). No protocol/consensus behavior changes — only how a peer remembers which hashes it has already exchanged.
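A minimal sketch of the ring-plus-map structure described above, for readers who want the shape without opening the diff. It is an illustration written from the description, not the merged knownHashSet: `Hash` stands in for common.Hash, and the method names (`Add`, `Contains`, `Len`) and the capacity clamp are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// Hash stands in for common.Hash; a fixed-size byte array holds no pointers.
type Hash [32]byte

// knownHashSet is a bounded FIFO set: a preallocated ring records insertion
// order and a map answers membership. Neither contains pointers.
type knownHashSet struct {
	mu   sync.Mutex
	ring []Hash            // slots in insertion order, preallocated
	set  map[Hash]struct{} // membership index
	head int               // index of the oldest entry (next to evict)
	n    int               // number of live entries
}

func newKnownHashSet(capacity int) *knownHashSet {
	if capacity < 1 {
		capacity = 1 // assumption: clamp rather than fail on a bad size
	}
	return &knownHashSet{
		ring: make([]Hash, capacity),
		set:  make(map[Hash]struct{}, capacity),
	}
}

// Add inserts h. Re-adding a present hash is a no-op, so insertion order
// (and therefore eviction order) is preserved.
func (s *knownHashSet) Add(h Hash) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.set[h]; ok {
		return
	}
	if s.n == len(s.ring) {
		// Full: drop the oldest entry and reuse its slot for h.
		delete(s.set, s.ring[s.head])
		s.ring[s.head] = h
		s.head = (s.head + 1) % len(s.ring)
	} else {
		s.ring[(s.head+s.n)%len(s.ring)] = h
		s.n++
	}
	s.set[h] = struct{}{}
}

// Contains reports whether h is currently in the set.
func (s *knownHashSet) Contains(h Hash) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.set[h]
	return ok
}

// Len returns the number of entries currently held.
func (s *knownHashSet) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.n
}

func main() {
	s := newKnownHashSet(2)
	a, b, c := Hash{1}, Hash{2}, Hash{3}
	s.Add(a)
	s.Add(b)
	s.Add(a) // no-op: a stays the oldest entry
	s.Add(c) // set is full, so a is evicted
	fmt.Println(s.Contains(a), s.Contains(b), s.Contains(c), s.Len()) // false true true 2
}
```

Because a re-add does not move the entry, `a` remains the oldest and is the one evicted when `c` arrives; an LRU would have evicted `b` instead.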

Benchmarks

go test -run '^$' -bench BenchmarkKnown -benchmem ./node/cn (Go 1.25):

Add (cache full → evict+insert, the hot path):
  golang-lru      112.8 ns/op    113 B/op    3 allocs/op
  knownHashSet     48.5 ns/op      0 B/op    0 allocs/op

GC mark — time per full runtime.GC() with the structure live and full:
  n=32768          golang-lru   0.89 ms   knownHashSet 0.61 ms
  n=262144         golang-lru   3.26 ms   knownHashSet 0.66 ms   (~5×)
  n=1,000,000      golang-lru  11.08 ms   knownHashSet 0.80 ms   (~14×)
  → knownHashSet GC time is flat in n; golang-lru scales linearly with entries.

Production scale (BenchmarkKnownGCMarkProd: 54 peers × 262144 = 14.2M entries,
as 54 separate instances, all live):
  golang-lru   ~126 ms per GC
  knownHashSet  ~4.5 ms per GC   (~28×)

Footprint (n=262144):
  golang-lru     ~177 B/entry,  3.00 live objects/entry
  knownHashSet   ~112 B/entry,  ~0 objects/entry (≈1k objects total, independent of n)

The "objects/entry" figure is the mechanism: golang-lru keeps 3 pointer-rich objects per entry that GC must scan; knownHashSet keeps a constant handful, so its mark cost does not grow with the set.
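One way to observe the objects-per-entry difference is to compare runtime.MemStats.HeapObjects before and after building each structure. The sketch below is not the PR's benchmark: it uses container/list directly in place of golang-lru, and `buildListBacked`, `buildFlat`, `liveObjects`, and `hashOf` are hypothetical helpers. Exact counts vary by Go version, so no specific output is claimed.

```go
package main

import (
	"container/list"
	"fmt"
	"runtime"
)

// Hash stands in for common.Hash.
type Hash [32]byte

// hashOf derives a distinct Hash from i (enough for a size experiment).
func hashOf(i int) Hash {
	var h Hash
	h[0], h[1], h[2], h[3] = byte(i), byte(i>>8), byte(i>>16), byte(i>>24)
	return h
}

// listBacked imitates a list-plus-map cache with interface-typed keys.
type listBacked struct {
	order *list.List
	index map[any]*list.Element
}

func buildListBacked(n int) any {
	s := &listBacked{order: list.New(), index: make(map[any]*list.Element, n)}
	for i := 0; i < n; i++ {
		h := hashOf(i)
		s.index[h] = s.order.PushBack(h) // one list element per entry, plus boxing
	}
	return s
}

// flat is the pointer-free layout: a slice of hashes and a map keyed by hash.
type flat struct {
	ring []Hash
	set  map[Hash]struct{}
}

func buildFlat(n int) any {
	s := &flat{ring: make([]Hash, n), set: make(map[Hash]struct{}, n)}
	for i := 0; i < n; i++ {
		h := hashOf(i)
		s.ring[i] = h
		s.set[h] = struct{}{}
	}
	return s
}

// liveObjects returns how many heap objects stay live after building the
// structure, measured as the HeapObjects delta across two full collections.
func liveObjects(n int, build func(int) any) int64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	v := build(n)
	runtime.GC()
	runtime.ReadMemStats(&after)
	runtime.KeepAlive(v) // keep the structure reachable until after the reading
	return int64(after.HeapObjects) - int64(before.HeapObjects)
}

func main() {
	const n = 100_000
	l := liveObjects(n, buildListBacked)
	f := liveObjects(n, buildFlat)
	fmt.Printf("list-backed: %d objects (%.2f per entry)\n", l, float64(l)/n)
	fmt.Printf("flat:        %d objects (%.4f per entry)\n", f, float64(f)/n)
}
```

The list-backed variant retains at least one object per entry (the list element), while the flat layout retains a number of objects that depends on the map's internal bucket layout rather than on one allocation per hash.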

Production verification

Deployed to the same endpoint node (128 GB, ~54 peers). Comparing the first ~13.5 h of uptime against the previous build (same instance/config):

  • The post-restart CPU creep is gone. Baseline CPU is flat — mean 3.69 cores (3.9 → 3.86) — whereas the old build had already crept to mean 4.18 (3.9 → 4.3) by the same 13.5 h, and went on to ~5.5 cores by 31 h. RSS 9.9 GB vs 10.5 GB.
  • GC is no longer dominant, and does not creep. Lifetime GCCPUFraction is 1.48% and flat (1.49% at 3 h → 1.48% at 13.5 h) — the old build's GC climbed as the caches filled (its 31 h CPU profile showed gcBgMarkWorker ~57%). A fresh CPU profile shows runtime.findObject down from 26.8% to 9.9% (44.6 s → 10.0 s per 30 s), i.e. the linked-list pointer-scanning is largely gone.
  • The per-peer known caches no longer appear among retained heap objects. Live 32 B (boxed-hash) objects dropped 19M → 2.9M; the live-heap-object floor is ~31% lower and grows ~4× slower (residual growth is the unrelated trie clean-cache fill, not the known sets).

Types of changes

  • 🐛 Bug fix
  • ✨ Non-hardfork changes (node upgrade not required)
  • 💥 Hardfork / consensus-breaking changes
  • 🧪 Test improvements
  • 🧰 CI / build tool
  • ♻️ Chore / Refactor / Non-functional changes

Checklist

  • 📖 I have read the CONTRIBUTING GUIDELINES doc
  • 📝 I have signed the CLA
  • 🟢 Lint and unit tests pass locally with my changes ($ make test)

Related issues

Further comments

Why not just IsScaled: false? On a 128 GB host that divides the problem by 8, but the structure is still container/list-based, so GC cost stays O(entries) and grows again wherever the configured size is raised. The pointer-free set removes the cost categorically (O(1) in pointers scanned), independent of host RAM.

Why not fastcache? It also avoids GC scanning (off-heap data), but it targets a few large shared caches; one instance per peer carries its fixed 512-bucket overhead and coarse chunk eviction — a poor fit for many small per-peer sets. The noscan ring+map is simpler, on-heap, exact-FIFO, dependency-free, and benchmarks equally GC-cheap.

node/sc/bridgepeer.go has the identical knownTxsCache pattern (IsScaled: true). It is the service-chain bridge peer (lower traffic) and outside the profiling above; happy to convert it here (extracting knownHashSet to a shared location) or in a follow-up — reviewer's preference.

@github-actions

github-actions Bot commented May 25, 2026

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

…r-free FIFO set

The per-peer knownTxsCache/knownBlocksCache/knownBidsCache were golang-lru
caches created with IsScaled:true, so on large-RAM nodes calculateScale()
multiplied their size by TotalPhysicalMemGB/16 (x8 on 128GB), e.g. knownTxs
32768 -> 262144 entries per peer. golang-lru stores each entry in a
container/list element keyed by a map, and the generic common.Cache wrapper
boxes the key into an interface: ~3 pointer-rich live objects per entry. With
one set per peer x dozens of peers, that is tens of millions of pointers the
GC mark phase must traverse every cycle. On a long-running endpoint node,
live-heap profiling showed GC at ~57% of CPU (runtime.findObject ~27% flat),
and the per-peer sets filled slowly over ~a day -- a steady post-restart CPU
creep that only a restart reset.

These sets only need membership + oldest-first eviction (FIFO); access-recency
(LRU) was never used (the caches were configured FIFO, i.e. Get == Peek).
Replace them with knownHashSet: a preallocated ring ([]common.Hash) for O(1)
FIFO eviction plus a map[common.Hash]struct{} for O(1) membership. Both are
pointer-free ("noscan"), so the GC traverses no pointers for these structures
regardless of how many entries they hold.

Benchmarks (go test -bench BenchmarkKnown -benchmem ./node/cn):
  Add (steady state):  golang-lru   112.8 ns/op  113 B/op  3 allocs/op
                       knownHashSet  48.5 ns/op    0 B/op  0 allocs/op
  GC mark (live+full): n=1,000,000  golang-lru 11.08 ms  knownHashSet 0.80 ms (~14x)
                       knownHashSet GC time is flat in n; golang-lru scales linearly.
  Footprint:           golang-lru 3.00 live objects/entry; knownHashSet ~0.
@swapscanner-ryan
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@ian0371
Collaborator

ian0371 commented May 27, 2026

Thanks for the contribution!

@ian0371 ian0371 merged commit 73d1f8f into kaiachain:dev May 27, 2026
9 of 10 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators May 27, 2026