Oandreeva/kvcm lru mutex fix by oandreeva-nv · Pull Request #3 · oandreeva-nv/TensorRT-LLM

oandreeva-nv · 2026-06-04T19:13:18Z

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…cess The remote-G2 ZMQ REP thread (remote_g2_zmq_rep) calls claimBlock and releaseBlock on the LRUEvictionPolicy concurrently with the main scheduler loop's getFreeBlock, claimBlock, and releaseBlock. LRUEvictionPolicy's mFreeQueues, mFreeBlockIterators, and mExpiringBlockHeap had no synchronization, allowing concurrent std::list mutations to produce torn next/prev pointers, corrupted block table entries, and ultimately CUDA illegal memory accesses in the attention kernel. Layer 1 — LRU mutex: - Add std::mutex mMutex to LRUEvictionPolicy guarding all three shared data structures. - Add lock_guard to getFreeBlock, claimBlock (via claimBlockUnlocked), releaseBlock (via releaseBlockUnlocked), getNumFreeBlocks, and refresh. - Expose getMutex(), claimBlockUnlocked, releaseBlockUnlocked so callers can combine multiple LRU operations atomically. - Make KVCacheBlock::mRefCount std::atomic<SizeType32>; fix decRefCount to use fetch_sub and check the previous value. Layer 2 — onboardBlock respects pins: - Add hasRefs() check in WindowBlockManager::onboardBlock. Skip onboard if a remote-G2 caller has pinned the secondary block so its slot address stays valid for the in-flight transfer. - Use a double-check pattern: fast check before the DMA, then atomic check+swap under the LRU mutex after the DMA, closing the race window between pinBlocksById and swapMemoryPoolBlockOffset. Atomicity fixes in pin/unpin paths: - pinBlocksById: hold LRU mutex across hasRefs() + claimBlockUnlocked + incRefCount so no window exists where mRefCount == 0 but block is off the free queue. - unpinBlocksById: hold LRU mutex across decRefCount + hasRefs() + releaseBlockUnlocked for the same reason.

… setup Set enable_listen_thread=False at both nixl_agent_config construction sites in remote_g2_raw_nixl_adapter.py. When two workers run in the same network namespace (single-host docker), NIXL's metadata stream listener collides on a fixed port causing the second worker to wedge in an accept() loop on a bad fd. Safe in our flow because metadata exchange happens via dynamo NATS RPC, not NIXL's built-in metadata fetch.

…t race Introduce LRUEvictionPolicy::claimFreeBlock() which atomically selects and removes a block from the free queue under a single mutex hold, replacing the getFreeBlock (peek) + claimBlock (remove) two-step pattern at secondary-slot acquisition sites. The two-step pattern had a race window: getFreeBlock returned the block pointer without removing it from the queue, allowing the concurrent ZMQ REP thread (remote_g2_zmq_rep) to claim the same secondary block between the two calls via findAndPinSecondaryBlockByHash. The scheduler and ZMQ DMA while ZMQ was reading from it, or swapMemoryPoolBlockOffset ran on a pinned block producing a torn slot index in the GPU block table. This caused thop.attention to dereference an invalid KV cache pointer, manifesting as a CUDA illegal memory access (FmhaAutoTuner.cu). - Add claimFreeBlock(level, wantPlaceholder) to BaseEvictionPolicy and LRUEvictionPolicy: atomically pops front of queue, clears mFreeBlockIterators and mExpiringBlockHeap under mMutex. - Replace getFreeBlock(kSecondaryLevel) + claimBlock(offloadBlock) with claimFreeBlock(kSecondaryLevel) in WindowBlockManager::getFreeBlock (offload path) and WindowBlockManager::offloadBlock. - getFreeBlock is retained for primary-level allocation where no concurrent secondary-pool accessor exists.

oandreeva-nv added 2 commits June 4, 2026 00:52

github-actions Bot assigned oandreeva-nv Jun 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Oandreeva/kvcm lru mutex fix#3

Oandreeva/kvcm lru mutex fix#3
oandreeva-nv wants to merge 3 commits into
oandreeva/remote-g2-beta-registryfrom
oandreeva/kvcm-lru-mutex-fix

oandreeva-nv commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

oandreeva-nv commented Jun 4, 2026

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant