fix: replace global lock with per-device transfer_lock to prevent deadlock#2816

Merged
maobaolong merged 1 commit into LMCache:dev from maobaolong:store_deadlock
Mar 19, 2026
Conversation

@maobaolong
Collaborator

What this PR does / why we need it:

The global self.lock in MPCacheEngine was acquired inside
torch.cuda.device() context, creating a circular dependency
with the implicit CUDA driver lock (ABBA deadlock).

Replace it with a per-device transfer_lock on GPUCacheContext
so that GPU↔CPU transfers on the same
device are serialised without cross-lock contention.
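The shape of the fix can be sketched as follows. This is a minimal illustration only; the class and method names (`GPUCacheContext`, `MPCacheEngine`, `store`) follow the PR description, but the bodies are simplified stand-ins, not the actual LMCache source:

```python
import threading

class GPUCacheContext:
    """Per-device state; each device carries its own transfer lock."""
    def __init__(self, device_id: int):
        self.device_id = device_id
        # Serializes GPU<->CPU transfers on this device only.
        self.transfer_lock = threading.Lock()

class MPCacheEngine:
    def __init__(self, num_devices: int):
        # One context (and hence one lock) per device, instead of one
        # global self.lock shared by all devices.
        self.gpu_contexts = {d: GPUCacheContext(d) for d in range(num_devices)}

    def store(self, device_id: int, copy_fn):
        ctx = self.gpu_contexts[device_id]
        # Before this PR, a single global self.lock was acquired here,
        # inside a torch.cuda.device() context; combined with the CUDA
        # driver's implicit lock this created an ABBA deadlock.
        with ctx.transfer_lock:
            copy_fn()  # GPU -> CPU copy for this device only
```

Because each device's transfers only ever take that device's own lock, there is no second lock to order against the driver lock, and transfers on different devices never contend with each other.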

Special notes for your reviewers:

If applicable:

  • this PR contains user-facing changes - docs added
  • this PR contains unit tests

…dlock

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
@maobaolong maobaolong requested a review from ApostaC March 18, 2026 14:16
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. Credits must be used to enable repository-wide code reviews.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical deadlock issue in the MPCacheEngine by refactoring the locking mechanism used for GPU-CPU data transfers. By moving from a single global lock to individual locks associated with each GPU device context, the change eliminates a circular dependency with the CUDA driver's implicit lock, thereby improving concurrency and stability during data operations without introducing new contention points.

Highlights

  • Deadlock Prevention: Replaced a global lock (self.lock) with a per-device transfer_lock to prevent an ABBA deadlock scenario that occurred when the global lock was acquired within a torch.cuda.device() context, conflicting with the implicit CUDA driver lock.
  • Granular Locking: Introduced a threading.Lock named transfer_lock within each GPUCacheContext instance, ensuring that GPU↔CPU data transfers are serialized on a per-device basis without blocking transfers on other devices.
  • Lock Usage Update: Modified the store and _retrieve_loop methods in MPCacheEngine to utilize the new gpu_context.transfer_lock instead of the problematic global self.lock for all data transfer operations.
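The serialization property described in the highlights can be demonstrated with a small, self-contained sketch (the lock table and `transfer` function here are hypothetical, standing in for the per-device `transfer_lock` on each `GPUCacheContext`):

```python
import threading

# One lock per device: same-device transfers serialize against each
# other, while transfers on different devices proceed independently.
locks = {0: threading.Lock(), 1: threading.Lock()}
results = []

def transfer(device_id: int):
    # Stand-in for a GPU<->CPU copy guarded by the device's own lock.
    with locks[device_id]:
        results.append(device_id)

# Two transfers on device 0 and one on device 1, run concurrently.
threads = [threading.Thread(target=transfer, args=(d,)) for d in (0, 1, 0)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All three transfers complete; device 0's two transfers never overlapped,
# and device 1's transfer never waited on device 0's lock.
```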

@gemini-code-assist (Contributor, Bot) left a comment

Code Review

This pull request introduces a per-device transfer_lock in GPUCacheContext to serialize GPU↔CPU data transfers, replacing a previous global lock. This change aims to prevent deadlocks with the implicit CUDA driver lock and to allow concurrent transfers across different devices. The server.py file has been updated to use this new per-device lock both for storing data (GPU to CPU) and for retrieving data (CPU to GPU).

@ApostaC (Contributor) left a comment

LGTM!

@ApostaC ApostaC added the "full" (Run comprehensive tests on this PR) label Mar 19, 2026
@maobaolong maobaolong enabled auto-merge (squash) March 19, 2026 09:16
@chunxiaozheng (Collaborator) left a comment

lgtm

@maobaolong maobaolong merged commit 9b4c713 into LMCache:dev Mar 19, 2026
27 of 29 checks passed
hyunyul-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Mar 20, 2026
…dlock (LMCache#2816)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
…dlock (LMCache#2816)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 21, 2026
…dlock (LMCache#2816)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 25, 2026
…dlock (LMCache#2816)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 27, 2026
…dlock (LMCache#2816)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
…dlock (LMCache#2816)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
…dlock (LMCache#2816)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Labels

"full" (Run comprehensive tests on this PR)
