[MP] Fix GIL and CUDA driver lock order inversion deadlock in store/retrieve#2823
maobaolong wants to merge 0 commits into LMCache:dev from
Conversation
Summary of Changes (Gemini Code Assist): This pull request resolves a critical deadlock condition that arose from an inverted lock order between Python's Global Interpreter Lock (GIL) and CUDA driver locks during data transfer operations. By carefully adjusting the scope of the `transfer_lock`, the affected operations are now serialized under the same lock.
Code Review
This pull request correctly addresses a potential deadlock issue between the Python GIL and CUDA driver locks in the `store` and `retrieve` functions. By introducing a `transfer_lock`, it ensures that `get_slot_mapping_tensor` and `multi_layer_kv_transfer` are serialized, preventing lock order inversion. The fix is sound. I have one suggestion for the `retrieve` function to further improve performance by reducing the scope of the lock.
Force-pushed from e72bdcb to ba3ba51
Problem
Under concurrent store workloads, the LMCache MP server can deadlock due to a GIL ↔ CUDA driver lock order inversion between `get_slot_mapping_tensor()` and `multi_layer_kv_transfer()`:

- One thread holds `transfer_lock`, calls `multi_layer_kv_transfer` (pybind11 `gil_scoped_release`), finishes the CUDA kernel, then tries to re-acquire the GIL in `~gil_scoped_release`.
- Another thread holds the GIL, calls `get_slot_mapping_tensor` → `flatten()` → PyTorch C++ dispatch, which tries to acquire the CUDA driver lock.

This AB-BA deadlock freezes the entire ThreadPoolExecutor, causing PING heartbeat timeouts and vLLM entering degraded/unhealthy mode.
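To make the inversion concrete, here is a minimal, self-contained sketch (not LMCache code) that models the GIL and the CUDA driver lock as plain `threading.Lock` objects. The thread names, sleeps, and timeouts are illustrative assumptions; only the acquisition orders mirror the two paths described above.

```python
import threading
import time

gil = threading.Lock()               # stand-in for the Python GIL
cuda_driver_lock = threading.Lock()  # stand-in for the CUDA driver mutex

def transfer_thread():
    # Mirrors the multi_layer_kv_transfer path: the GIL was dropped via
    # gil_scoped_release, the CUDA work runs under the driver lock, and
    # ~gil_scoped_release then needs the GIL back.
    with cuda_driver_lock:
        time.sleep(0.1)                       # CUDA kernel in flight
        got_gil = gil.acquire(timeout=2.0)    # ~gil_scoped_release blocks here
        print("transfer thread re-acquired GIL:", got_gil)
        if got_gil:
            gil.release()

def slot_mapping_thread():
    # Mirrors get_slot_mapping_tensor -> flatten() -> PyTorch C++ dispatch,
    # which runs while holding the GIL and then needs the driver lock.
    with gil:
        time.sleep(0.1)
        got_driver = cuda_driver_lock.acquire(timeout=2.0)  # dispatch blocks here
        print("slot-mapping thread acquired driver lock:", got_driver)
        if got_driver:
            cuda_driver_lock.release()

t1 = threading.Thread(target=transfer_thread)
t2 = threading.Thread(target=slot_mapping_thread)
t1.start(); t2.start()
t1.join(); t2.join()
# With real (non-timeout) acquisitions both threads would wait forever:
# GIL -> CUDA in one thread, CUDA -> GIL in the other, i.e. the AB-BA inversion.
```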
Root Cause
`get_slot_mapping_tensor()` was called outside `transfer_lock`, while `multi_layer_kv_transfer()` was called inside it. Two different lock acquisition orders (GIL→CUDA vs CUDA→GIL) across threads created a classic lock order inversion.

Fix
Widen the `transfer_lock` scope in both `store()` and `retrieve()` to cover `get_slot_mapping_tensor()` alongside `multi_layer_kv_transfer()`, ensuring both operations are serialized under the same lock and eliminating the AB-BA deadlock:

- `store()`: widen `transfer_lock` from per-chunk to per-request, covering `get_slot_mapping_tensor` plus the entire transfer loop.
- `retrieve()`: move `get_slot_mapping_tensor` inside `transfer_lock`, wrapping it together with `_retrieve_loop` (see the sketch below).
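As a rough illustration of the change, the following hedged sketch contrasts the old and new lock scope on the retrieve path. Only `transfer_lock`, `get_slot_mapping_tensor`, and `multi_layer_kv_transfer` are names taken from this PR; the class, method names, stubs, and signatures are assumptions, and the real code wraps the transfer in `_retrieve_loop` rather than a single call.

```python
import threading

# Stubs standing in for the real LMCache helpers; signatures are assumptions.
def get_slot_mapping_tensor(request):
    return list(range(len(request)))       # placeholder slot mapping

def multi_layer_kv_transfer(request, slot_mapping):
    pass                                   # placeholder for the CUDA-side transfer

class RetrieveWorkerSketch:
    def __init__(self):
        self.transfer_lock = threading.Lock()  # per-device lock discussed in the PR

    def retrieve_before(self, request):
        # Before the fix: slot mapping built outside the lock (GIL -> CUDA order),
        # while the transfer ran inside it (CUDA -> GIL order via gil_scoped_release).
        slot_mapping = get_slot_mapping_tensor(request)
        with self.transfer_lock:
            multi_layer_kv_transfer(request, slot_mapping)

    def retrieve_after(self, request):
        # After the fix: both steps run under the same transfer_lock, so the two
        # acquisition orders can no longer interleave across worker threads.
        with self.transfer_lock:
            slot_mapping = get_slot_mapping_tensor(request)
            multi_layer_kv_transfer(request, slot_mapping)
```

Serializing the slot-mapping construction behind the same lock trades a little Python-level concurrency for a single, consistent acquisition order, which is the standard remedy for AB-BA deadlocks.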
Performance Impact

Negligible: `transfer_lock` is per-device and CUDA stream execution is inherently serial. The lock scope change only affects Python-level scheduling, not actual GPU parallelism.

Stack trace while deadlocked
Note
Low Risk
The provided diff contains no code changes, so there is no functional or behavioral risk to review.
Overview
No changes are included in the provided git diff (`+++ /dev/null` only), so there is nothing to review in this PR as-is.

Written by Cursor Bugbot for commit ba3ba51.