node/cn: replace per-peer known-tx/block/bid LRU caches with a pointer-free FIFO set by swapscanner-ryan · Pull Request #910 · kaiachain/kaia

swapscanner-ryan · 2026-05-25T07:24:56Z

Proposed changes

Background: what these caches are

Every peer connection keeps three "known hashes" sets — knownTxsCache, knownBlocksCache, knownBidsCache (node/cn/peer.go). They exist purely to de-duplicate gossip: don't send a peer an item it already has.

We add a hash to a peer's set when we send that item to the peer (SendTransactions → AddToKnownTxs) and when we receive it from the peer (handleTxMsg → AddToKnownTxs) — in both cases the peer is known to have it.
Before broadcasting/announcing, we filter: peer_set.go skips any peer for which KnowsTx(hash) / KnowsBlock(hash) is already true.

So each set is a per-peer, best-effort record of "hashes this peer already has," used only to avoid redundant sends. The size constant is even commented "prevent DOS" — bounding is a requirement, not a tuning knob.
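The filter described above can be sketched roughly as follows. This is an illustration, not the code in peer_set.go: the `peer` struct, `peersWithoutTx`, and `broadcastTx` are hypothetical names, `Hash` stands in for common.Hash, and the set is an unbounded map only to keep the sketch short.

```go
package main

import "fmt"

// Hash stands in for common.Hash (a 32-byte array).
type Hash [32]byte

// peer is a hypothetical, stripped-down peer; only the known-tx set matters here.
// The map is unbounded for brevity; the real sets are bounded.
type peer struct {
	id       string
	knownTxs map[Hash]struct{}
}

func (p *peer) KnowsTx(h Hash) bool  { _, ok := p.knownTxs[h]; return ok }
func (p *peer) AddToKnownTxs(h Hash) { p.knownTxs[h] = struct{}{} }

// peersWithoutTx returns the peers that are not yet known to have h.
func peersWithoutTx(peers []*peer, h Hash) []*peer {
	var out []*peer
	for _, p := range peers {
		if !p.KnowsTx(h) {
			out = append(out, p)
		}
	}
	return out
}

// broadcastTx "sends" h to every peer not known to have it, records that the
// peer now has it, and returns how many sends happened.
func broadcastTx(peers []*peer, h Hash) int {
	targets := peersWithoutTx(peers, h)
	for _, p := range targets {
		p.AddToKnownTxs(h) // mark on send
	}
	return len(targets)
}

func main() {
	a := &peer{id: "a", knownTxs: map[Hash]struct{}{}}
	b := &peer{id: "b", knownTxs: map[Hash]struct{}{}}
	h := Hash{1}
	b.AddToKnownTxs(h) // mark on receive: b sent us h, so b has it

	fmt.Println(broadcastTx([]*peer{a, b}, h)) // 1: only a is sent h
	fmt.Println(broadcastTx([]*peer{a, b}, h)) // 0: both peers are now known to have h
}
```

If an entry were evicted between the two calls, the only effect would be one extra send to a peer that already has the item.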

Why a small, fixed-size bound is correct (and a large one is pure cost)

This is the crux, so spelling it out:

The set is an optimization, never a correctness mechanism. It only ever suppresses a send. Evicting an entry can therefore only cause a redundant send (we forget a peer has X, so we send X again); the peer already has X, dedups it locally, and does not re-announce it — so there is no loop, no missed delivery, no protocol error. There is no failure mode where forgetting an entry breaks anything; the worst case is one wasted message.
An entry is only useful during the propagation window. Gossip dedup matters while an item is actively spreading (seconds for a tx; until a block is a few deep). Once a tx is mined/dropped or a block is old, it is no longer broadcast, so remembering its hash can never match a future send again. Old entries are dead weight. FIFO eviction (drop oldest-inserted) is exactly right: it discards the entries that have already aged out of the window. This is also why LRU is unnecessary here — relevance is set at insert time (when we learned the peer has it), not at access time; the existing cache was already configured FIFO (fifoCache.Get == Peek).
The bound only needs to exceed the number of distinct items in flight during that window. Concretely: at, say, 100 tx/s with a ~10 s propagation window, only ~1,000 distinct hashes are ever "in flight" at once. The existing maxKnownTxs = 32768 covers ~300+ seconds of such traffic — already ~30× the window. Anything beyond that is remembering hashes that will provably never be re-broadcast. So 32768 is generous; there is no dedup benefit to making it larger.

In short: a fixed bound that comfortably covers the propagation window gives the full dedup benefit, and over-sizing it cannot improve dedup — it only adds cost.
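The sizing argument is plain arithmetic; a small sketch makes it explicit. The 100 tx/s rate and 10 s window are the illustrative figures used above, not measurements, and `inFlight` and `coverageSeconds` are hypothetical helper names.

```go
package main

import "fmt"

// inFlight estimates how many distinct hashes are spreading at once,
// given an arrival rate and a propagation window.
func inFlight(txPerSec, windowSec int) int { return txPerSec * windowSec }

// coverageSeconds reports how many seconds of traffic a bound of
// `bound` entries can remember at the given arrival rate.
func coverageSeconds(bound, txPerSec int) int { return bound / txPerSec }

func main() {
	const maxKnownTxs = 32768

	fmt.Println(inFlight(100, 10))                      // 1000 hashes in flight
	fmt.Println(coverageSeconds(maxKnownTxs, 100))      // 327 s at the fixed bound
	fmt.Println(coverageSeconds(8*maxKnownTxs, 100)/60) // 43 min at the x8 scaled bound
}
```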

The bug: the bound is scaled by host RAM

The three sets are created with common.NewCache(common.FIFOCacheConfig{… IsScaled: true}). IsScaled: true multiplies the size by calculateScale():

CacheScale * ScaleByCacheUsageLevel * TotalPhysicalMemGB / minimumMemorySize / 100 / 100

On a 128 GB host this is ×8, so knownTxs becomes 32768 → 262144 entries per peer (×16 at 256 GB; more with --cache.level). Per the reasoning above this extra capacity yields zero dedup benefit — it just remembers ~tens of minutes of stale hashes. But it is not free:

golang-lru stores every entry as a container/list element keyed by a map, and the generic common.Cache wrapper boxes the common.Hash key into an interface — ~3 pointer-rich live objects per entry (measured below). With one set per peer × dozens of peers, the RAM-scaled bound produces tens of millions of pointers the GC mark phase must traverse on every cycle.
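To make the scaling concrete, here is a sketch of the formula above. The constant values are assumptions chosen to reproduce the ×8 figure on a 128 GB host (CacheScale and ScaleByCacheUsageLevel at 100 percent, minimumMemorySize of 16 GB, as implied by the "TotalPhysicalMemGB/16" wording in the commit message). The real calculateScale() may round or clamp differently.

```go
package main

import "fmt"

// Assumed values, not read from the kaia source.
const (
	cacheScale             = 100 // percent (assumed default)
	scaleByCacheUsageLevel = 100 // percent (assumed default)
	minimumMemorySize      = 16  // GB (implied by "TotalPhysicalMemGB/16")
)

// scale mirrors the formula quoted in the PR, using integer arithmetic.
func scale(totalPhysicalMemGB int) int {
	return cacheScale * scaleByCacheUsageLevel * totalPhysicalMemGB / minimumMemorySize / 100 / 100
}

func main() {
	const maxKnownTxs = 32768
	const peers = 54

	perPeer := maxKnownTxs * scale(128)
	fmt.Println(scale(128))          // 8
	fmt.Println(perPeer)             // 262144 entries per peer
	fmt.Println(peers * perPeer)     // 14155776 entries across 54 peers
	fmt.Println(peers * perPeer * 3) // 42467328 objects at ~3 objects/entry
	fmt.Println(scale(256))          // 16
}
```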

Production impact (128 GB host, 54 peers)

With IsScaled: true, each peer's knownTxs set can hold up to 262,144 entries, so across 54 peers the knownTxs sets alone can reach 14.2M entries ≈ 42M live, pointer-rich objects (~3 objects/entry, measured) that the GC must re-scan on every mark cycle. The symptom was a slow, monotonic CPU climb after each restart. A heap + CPU profile of the live node showed:

At ~31 h uptime (sets not yet full): ~5.8M knownTxs entries and climbing, ~17M live objects in the AddToKnownTxs path.
GC ≈ 57% of the node's busy CPU (gcBgMarkWorker/gcDrain) — about 3.1 of the ~5.5 active cores doing nothing but collecting — with runtime.findObject ≈ 27% (~1.5 cores) pointer-chasing alone, on a 32-vCPU node.
These per-peer sets were the single largest retained, pointer-dense structure and accounted for ~83% of live-object growth over the measured window.
The sets fill slowly (a peer learns only a few hashes/s), ramping toward the ×8 cap over ~a day — exactly the shape of the post-restart CPU creep; a restart reset it.

The fix

Replace the three golang-lru-backed sets with knownHashSet: a preallocated ring ([]common.Hash, O(1) FIFO eviction) plus a map[common.Hash]struct{} (O(1) membership), guarded by a mutex (same concurrency contract as before). common.Hash is [32]byte and the value is struct{}, so both the ring and the map are pointer-free ("noscan") — the GC traverses no pointers for these structures no matter how many entries they hold.

Semantics are unchanged: membership + oldest-first (FIFO) eviction, same fixed bounds (maxKnownTxs=32768, maxKnownBlocks=1024, maxKnownBids=2048 — also the upstream go-ethereum defaults). Re-adding a present hash is a no-op (insertion order, and thus eviction order, is preserved). No protocol/consensus behavior changes — only how a peer remembers which hashes it has already exchanged.
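A minimal sketch of the ring + map structure described above, for readers who want the shape without opening the diff. It is not the merged code: the method names (`Add`, `Contains`, `Len`) and field names are assumptions, `Hash` stands in for common.Hash, and the capacity is assumed to be greater than zero.

```go
package main

import (
	"fmt"
	"sync"
)

// Hash stands in for common.Hash (a 32-byte array, no pointers).
type Hash [32]byte

// knownHashSet is a bounded set with oldest-first (FIFO) eviction.
// Neither the ring nor the map holds pointers.
type knownHashSet struct {
	mu   sync.Mutex
	ring []Hash            // preallocated, insertion-ordered storage
	next int               // slot to write next; once full, this is the oldest entry
	set  map[Hash]struct{} // membership index
}

// newKnownHashSet assumes capacity > 0.
func newKnownHashSet(capacity int) *knownHashSet {
	return &knownHashSet{
		ring: make([]Hash, capacity),
		set:  make(map[Hash]struct{}, capacity),
	}
}

// Add inserts h. Re-adding a present hash is a no-op, so insertion order
// (and therefore eviction order) is preserved.
func (s *knownHashSet) Add(h Hash) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.set[h]; ok {
		return
	}
	if len(s.set) == len(s.ring) {
		delete(s.set, s.ring[s.next]) // full: evict the oldest-inserted hash
	}
	s.ring[s.next] = h
	s.set[h] = struct{}{}
	s.next = (s.next + 1) % len(s.ring)
}

// Contains reports whether h is currently in the set.
func (s *knownHashSet) Contains(h Hash) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.set[h]
	return ok
}

// Len returns the number of hashes currently held.
func (s *knownHashSet) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.set)
}

func main() {
	s := newKnownHashSet(2)
	s.Add(Hash{1})
	s.Add(Hash{2})
	s.Add(Hash{1}) // no-op: does not refresh Hash{1}
	s.Add(Hash{3}) // evicts Hash{1}, the oldest-inserted entry

	fmt.Println(s.Contains(Hash{1}), s.Contains(Hash{2}), s.Contains(Hash{3})) // false true true
}
```

Because the set only ever removes entries through eviction, the write cursor always points at the oldest entry once the ring is full, which is what makes a single index sufficient.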

Benchmarks

go test -run '^$' -bench BenchmarkKnown -benchmem ./node/cn (Go 1.25):

Add (cache full → evict+insert, the hot path):
  golang-lru      112.8 ns/op    113 B/op    3 allocs/op
  knownHashSet     48.5 ns/op      0 B/op    0 allocs/op

GC mark — time per full runtime.GC() with the structure live and full:
  n=32768          golang-lru   0.89 ms   knownHashSet 0.61 ms
  n=262144         golang-lru   3.26 ms   knownHashSet 0.66 ms   (~5×)
  n=1,000,000      golang-lru  11.08 ms   knownHashSet 0.80 ms   (~14×)
  → knownHashSet GC time is flat in n; golang-lru scales linearly with entries.

Production scale (BenchmarkKnownGCMarkProd: 54 peers × 262144 = 14.2M entries,
as 54 separate instances, all live):
  golang-lru   ~126 ms per GC
  knownHashSet  ~4.5 ms per GC   (~28×)

Footprint (n=262144):
  golang-lru     ~177 B/entry,  3.00 live objects/entry
  knownHashSet   ~112 B/entry,  ~0 objects/entry (≈1k objects total, independent of n)

The "objects/entry" figure is the mechanism: golang-lru keeps 3 pointer-rich objects per entry that GC must scan; knownHashSet keeps a constant handful, so its mark cost does not grow with the set.
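For anyone who wants to reproduce the shape of the GC-mark comparison outside the repository, the following is a rough, self-contained sketch. It is not the PR's BenchmarkKnown code: `pointerRich` only imitates the layout described above (a container/list plus a map keyed by a boxed hash), `pointerFree` imitates the ring + map layout, and all names here are hypothetical. Absolute timings will differ by machine and Go version.

```go
package main

import (
	"container/list"
	"encoding/binary"
	"fmt"
	"runtime"
	"time"
)

// Hash stands in for common.Hash.
type Hash [32]byte

func hashOf(i int) Hash {
	var h Hash
	binary.BigEndian.PutUint64(h[:8], uint64(i))
	return h
}

// pointerRich imitates a list-backed cache with interface-boxed keys.
type pointerRich struct {
	order *list.List
	index map[any]*list.Element
}

func buildPointerRich(n int) *pointerRich {
	p := &pointerRich{order: list.New(), index: make(map[any]*list.Element, n)}
	for i := 0; i < n; i++ {
		h := hashOf(i)
		p.index[h] = p.order.PushBack(h) // both the key and the list value are boxed
	}
	return p
}

// pointerFree imitates the ring + map layout: no pointers in keys or values.
type pointerFree struct {
	ring []Hash
	set  map[Hash]struct{}
}

func buildPointerFree(n int) *pointerFree {
	f := &pointerFree{ring: make([]Hash, n), set: make(map[Hash]struct{}, n)}
	for i := 0; i < n; i++ {
		f.ring[i] = hashOf(i)
		f.set[f.ring[i]] = struct{}{}
	}
	return f
}

// gcTime returns the average wall time of a forced full GC while live is reachable.
func gcTime(rounds int, live any) time.Duration {
	runtime.GC() // settle the heap before timing
	start := time.Now()
	for i := 0; i < rounds; i++ {
		runtime.GC()
	}
	d := time.Since(start) / time.Duration(rounds)
	runtime.KeepAlive(live)
	return d
}

func main() {
	const n = 262144
	fmt.Println("pointer-rich:", gcTime(5, buildPointerRich(n)))
	fmt.Println("pointer-free:", gcTime(5, buildPointerFree(n)))
}
```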

Production verification

Deployed to the same endpoint node (128 GB, ~54 peers). Comparing the first ~13.5 h of uptime against the previous build (same instance/config):

The post-restart CPU creep is gone. Baseline CPU is flat — mean 3.69 cores (3.9 → 3.86) — whereas the old build had already crept to mean 4.18 (3.9 → 4.3) by the same 13.5 h, and went on to ~5.5 cores by 31 h. RSS 9.9 GB vs 10.5 GB.
GC is no longer dominant, and does not creep. Lifetime GCCPUFraction is 1.48% and flat (1.49% at 3 h → 1.48% at 13.5 h) — the old build's GC climbed as the caches filled (its 31 h CPU profile showed gcBgMarkWorker ~57%). A fresh CPU profile shows runtime.findObject down from 26.8% to 9.9% (44.6 s → 10.0 s per 30 s), i.e. the linked-list pointer-scanning is largely gone.
The per-peer known caches no longer appear among retained heap objects. Live 32 B (boxed-hash) objects dropped 19M → 2.9M; the live-heap-object floor is ~31% lower and grows ~4× slower (residual growth is the unrelated trie clean-cache fill, not the known sets).

Types of changes

🐛 Bug fix
✨ Non-hardfork changes (node upgrade not required)
💥 Hardfork / consensus-breaking changes
🧪 Test improvements
🧰 CI / build tool
♻️ Chore / Refactor / Non-functional changes

Checklist

📖 I have read the CONTRIBUTING GUIDELINES doc
📝 I have signed the CLA
🟢 Lint and unit tests pass locally with my changes ($ make test)

Related issues

Further comments

Why not just IsScaled: false? On a 128 GB host that divides the problem by 8, but the structure is still container/list-based, so GC cost stays O(entries) and still scales with RAM-driven config elsewhere. The pointer-free set removes the cost categorically (O(1) in pointers scanned), independent of host RAM.

Why not fastcache? It also avoids GC scanning (off-heap data), but it targets a few large shared caches; one instance per peer carries its fixed 512-bucket overhead and coarse chunk eviction — a poor fit for many small per-peer sets. The noscan ring+map is simpler, on-heap, exact-FIFO, dependency-free, and benchmarks equally GC-cheap.

node/sc/bridgepeer.go has the identical knownTxsCache pattern (IsScaled: true). It is the service-chain bridge peer (lower traffic) and outside the profiling above; happy to convert it here (extracting knownHashSet to a shared location) or in a follow-up — reviewer's preference.

github-actions · 2026-05-25T07:25:08Z

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

…r-free FIFO set

The per-peer knownTxsCache/knownBlocksCache/knownBidsCache were golang-lru caches created with IsScaled:true, so on large-RAM nodes calculateScale() multiplied their size by TotalPhysicalMemGB/16 (x8 on 128GB), e.g. knownTxs 32768 -> 262144 entries per peer.

golang-lru stores each entry in a container/list element keyed by a map, and the generic common.Cache wrapper boxes the key into an interface: ~3 pointer-rich live objects per entry. With one set per peer x dozens of peers, that is tens of millions of pointers the GC mark phase must traverse every cycle.

On a long-running endpoint node, live-heap profiling showed GC at ~57% of CPU (runtime.findObject ~27% flat), and the per-peer sets filled slowly over ~a day -- a steady post-restart CPU creep that only a restart reset.

These sets only need membership + oldest-first eviction (FIFO); access-recency (LRU) was never used (the caches were configured FIFO, i.e. Get == Peek).

Replace them with knownHashSet: a preallocated ring ([]common.Hash) for O(1) FIFO eviction plus a map[common.Hash]struct{} for O(1) membership. Both are pointer-free ("noscan"), so the GC traverses no pointers for these structures regardless of how many entries they hold.

Benchmarks (go test -bench BenchmarkKnown -benchmem ./node/cn):

Add (steady state):
  golang-lru      112.8 ns/op    113 B/op    3 allocs/op
  knownHashSet     48.5 ns/op      0 B/op    0 allocs/op

GC mark (live+full):
  n=1,000,000      golang-lru  11.08 ms   knownHashSet 0.80 ms   (~14x)
  knownHashSet GC time is flat in n; golang-lru scales linearly.

Footprint:
  golang-lru 3.00 live objects/entry; knownHashSet ~0.

swapscanner-ryan · 2026-05-25T07:29:03Z

I have read the CLA Document and I hereby sign the CLA

ian0371 · 2026-05-27T07:44:58Z

Thanks for the contribution!

swapscanner-ryan requested review from blukat29, ian0371 and yoomee1313 as code owners May 25, 2026 07:24

swapscanner-ryan force-pushed the known-cache-fifo-set branch from 03c142f to ef2c4e9 Compare May 25, 2026 07:27

hyunsooda approved these changes May 26, 2026

ian0371 approved these changes May 27, 2026

ian0371 merged commit 73d1f8f into kaiachain:dev May 27, 2026
9 of 10 checks passed

github-actions Bot locked and limited conversation to collaborators May 27, 2026

ian0371 merged 1 commit into kaiachain:dev from blockswords-io:known-cache-fifo-set
