Skip to content

Oandreeva/kvcm lru mutex fix#3

Open
oandreeva-nv wants to merge 3 commits into
oandreeva/remote-g2-beta-registryfrom
oandreeva/kvcm-lru-mutex-fix
Open

Oandreeva/kvcm lru mutex fix#3
oandreeva-nv wants to merge 3 commits into
oandreeva/remote-g2-beta-registryfrom
oandreeva/kvcm-lru-mutex-fix

Conversation

@oandreeva-nv

Copy link
Copy Markdown
Owner

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…cess

  The remote-G2 ZMQ REP thread (remote_g2_zmq_rep) calls claimBlock and
  releaseBlock on the LRUEvictionPolicy concurrently with the main scheduler
  loop's getFreeBlock, claimBlock, and releaseBlock. LRUEvictionPolicy's
  mFreeQueues, mFreeBlockIterators, and mExpiringBlockHeap had no
  synchronization, allowing concurrent std::list mutations to produce torn
  next/prev pointers, corrupted block table entries, and ultimately CUDA
  illegal memory accesses in the attention kernel.

  Layer 1 — LRU mutex:
  - Add std::mutex mMutex to LRUEvictionPolicy guarding all three shared
    data structures.
  - Add lock_guard to getFreeBlock, claimBlock (via claimBlockUnlocked),
    releaseBlock (via releaseBlockUnlocked), getNumFreeBlocks, and refresh.
  - Expose getMutex(), claimBlockUnlocked, releaseBlockUnlocked so callers
    can combine multiple LRU operations atomically.
  - Make KVCacheBlock::mRefCount std::atomic<SizeType32>; fix decRefCount
    to use fetch_sub and check the previous value.

Layer 2 — onboardBlock respects pins:
  - Add hasRefs() check in WindowBlockManager::onboardBlock. Skip onboard
    if a remote-G2 caller has pinned the secondary block so its slot address
    stays valid for the in-flight transfer.
  - Use a double-check pattern: fast check before the DMA, then atomic
    check+swap under the LRU mutex after the DMA, closing the race window
    between pinBlocksById and swapMemoryPoolBlockOffset.

  Atomicity fixes in pin/unpin paths:
  - pinBlocksById: hold LRU mutex across hasRefs() + claimBlockUnlocked +
    incRefCount so no window exists where mRefCount == 0 but block is off
    the free queue.
  - unpinBlocksById: hold LRU mutex across decRefCount + hasRefs() +
    releaseBlockUnlocked for the same reason.
… setup

  Set enable_listen_thread=False at both nixl_agent_config construction
  sites in remote_g2_raw_nixl_adapter.py. When two workers run in the
  same network namespace (single-host docker), NIXL's metadata stream
  listener collides on a fixed port causing the second worker to wedge in
  an accept() loop on a bad fd. Safe in our flow because metadata exchange
  happens via dynamo NATS RPC, not NIXL's built-in metadata fetch.
…t race

  Introduce LRUEvictionPolicy::claimFreeBlock() which atomically selects
  and removes a block from the free queue under a single mutex hold,
  replacing the getFreeBlock (peek) + claimBlock (remove) two-step pattern
  at secondary-slot acquisition sites.

  The two-step pattern had a race window: getFreeBlock returned the block
  pointer without removing it from the queue, allowing the concurrent ZMQ
  REP thread (remote_g2_zmq_rep) to claim the same secondary block between
  the two calls via findAndPinSecondaryBlockByHash. The scheduler and ZMQ
  DMA while ZMQ was reading from it, or swapMemoryPoolBlockOffset ran on a
  pinned block producing a torn slot index in the GPU block table. This
  caused thop.attention to dereference an invalid KV cache pointer,
  manifesting as a CUDA illegal memory access (FmhaAutoTuner.cu).

  - Add claimFreeBlock(level, wantPlaceholder) to BaseEvictionPolicy
    and LRUEvictionPolicy: atomically pops front of queue, clears
    mFreeBlockIterators and mExpiringBlockHeap under mMutex.
  - Replace getFreeBlock(kSecondaryLevel) + claimBlock(offloadBlock)
    with claimFreeBlock(kSecondaryLevel) in WindowBlockManager::getFreeBlock
    (offload path) and WindowBlockManager::offloadBlock.
  - getFreeBlock is retained for primary-level allocation where no
    concurrent secondary-pool accessor exists.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant