Oandreeva/kvcm lru mutex fix#3
Open
oandreeva-nv wants to merge 3 commits into
Open
Conversation
…cess
The remote-G2 ZMQ REP thread (remote_g2_zmq_rep) calls claimBlock and
releaseBlock on the LRUEvictionPolicy concurrently with the main scheduler
loop's getFreeBlock, claimBlock, and releaseBlock. LRUEvictionPolicy's
mFreeQueues, mFreeBlockIterators, and mExpiringBlockHeap had no
synchronization, allowing concurrent std::list mutations to produce torn
next/prev pointers, corrupted block table entries, and ultimately CUDA
illegal memory accesses in the attention kernel.
Layer 1 — LRU mutex:
- Add std::mutex mMutex to LRUEvictionPolicy guarding all three shared
data structures.
- Add lock_guard to getFreeBlock, claimBlock (via claimBlockUnlocked),
releaseBlock (via releaseBlockUnlocked), getNumFreeBlocks, and refresh.
- Expose getMutex(), claimBlockUnlocked, releaseBlockUnlocked so callers
can combine multiple LRU operations atomically.
- Make KVCacheBlock::mRefCount std::atomic<SizeType32>; fix decRefCount
to use fetch_sub and check the previous value.
Layer 2 — onboardBlock respects pins:
- Add hasRefs() check in WindowBlockManager::onboardBlock. Skip onboard
if a remote-G2 caller has pinned the secondary block so its slot address
stays valid for the in-flight transfer.
- Use a double-check pattern: fast check before the DMA, then atomic
check+swap under the LRU mutex after the DMA, closing the race window
between pinBlocksById and swapMemoryPoolBlockOffset.
Atomicity fixes in pin/unpin paths:
- pinBlocksById: hold LRU mutex across hasRefs() + claimBlockUnlocked +
incRefCount so no window exists where mRefCount == 0 but block is off
the free queue.
- unpinBlocksById: hold LRU mutex across decRefCount + hasRefs() +
releaseBlockUnlocked for the same reason.
… setup Set enable_listen_thread=False at both nixl_agent_config construction sites in remote_g2_raw_nixl_adapter.py. When two workers run in the same network namespace (single-host docker), NIXL's metadata stream listener collides on a fixed port causing the second worker to wedge in an accept() loop on a bad fd. Safe in our flow because metadata exchange happens via dynamo NATS RPC, not NIXL's built-in metadata fetch.
…t race
Introduce LRUEvictionPolicy::claimFreeBlock() which atomically selects
and removes a block from the free queue under a single mutex hold,
replacing the getFreeBlock (peek) + claimBlock (remove) two-step pattern
at secondary-slot acquisition sites.
The two-step pattern had a race window: getFreeBlock returned the block
pointer without removing it from the queue, allowing the concurrent ZMQ
REP thread (remote_g2_zmq_rep) to claim the same secondary block between
the two calls via findAndPinSecondaryBlockByHash. The scheduler and ZMQ
DMA while ZMQ was reading from it, or swapMemoryPoolBlockOffset ran on a
pinned block producing a torn slot index in the GPU block table. This
caused thop.attention to dereference an invalid KV cache pointer,
manifesting as a CUDA illegal memory access (FmhaAutoTuner.cu).
- Add claimFreeBlock(level, wantPlaceholder) to BaseEvictionPolicy
and LRUEvictionPolicy: atomically pops front of queue, clears
mFreeBlockIterators and mExpiringBlockHeap under mMutex.
- Replace getFreeBlock(kSecondaryLevel) + claimBlock(offloadBlock)
with claimFreeBlock(kSecondaryLevel) in WindowBlockManager::getFreeBlock
(offload path) and WindowBlockManager::offloadBlock.
- getFreeBlock is retained for primary-level allocation where no
concurrent secondary-pool accessor exists.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
@coderabbitai summary
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.