Skip to content

[MP] fix: add thread safety to Session for concurrent TP worker access#2807

Merged
chunxiaozheng merged 1 commit intoLMCache:devfrom
maobaolong:mp_session_safe
Mar 18, 2026
Merged

[MP] fix: add thread safety to Session for concurrent TP worker access#2807
chunxiaozheng merged 1 commit intoLMCache:devfrom
maobaolong:mp_session_safe

Conversation

@maobaolong
Copy link
Copy Markdown
Collaborator

Multiple TP worker threads concurrently call set_tokens() and get_hashes() on the same Session object (same request_id), causing race conditions in _compute_hash() that corrupt chunk_hashes (duplicates, wrong rolling state). This leads to ObjectKey mismatch between sync_lookup/store and retrieve, resulting in KEY_NOT_EXIST errors during unsafe_read.

Add threading.Lock to Session protecting set_tokens and get_hashes.

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Multiple TP worker threads concurrently call set_tokens() and
get_hashes() on the same Session object (same request_id), causing
race conditions in _compute_hash() that corrupt chunk_hashes
(duplicates, wrong rolling state). This leads to ObjectKey mismatch
between sync_lookup/store and retrieve, resulting in KEY_NOT_EXIST
errors during unsafe_read.

Add threading.Lock to Session protecting set_tokens and get_hashes.
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of the Session object by resolving critical race conditions that previously led to data corruption and operational failures during concurrent access from multiple worker threads. The implementation of thread-safe mechanisms ensures the integrity of token and hash management, thereby improving the overall reliability and stability of the system when processing requests in parallel.

Highlights

  • Addressed Race Conditions: Fixed race conditions in the Session object that occurred when multiple TP worker threads concurrently accessed set_tokens() and get_hashes().
  • Prevented Data Corruption: Resolved issues where concurrent access led to corruption of chunk_hashes and subsequent ObjectKey mismatches and KEY_NOT_EXIST errors.
  • Implemented Thread Safety: Introduced a threading.Lock within the Session class to protect critical sections in set_tokens and get_hashes methods.
  • Added Concurrency Tests: Included new unit tests to validate the thread safety of the Session class under concurrent get_hashes and mixed set_tokens/get_hashes operations.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly identifies and addresses a race condition in the Session class by introducing a threading.Lock to protect concurrent access to set_tokens and get_hashes. This is a crucial fix for ensuring thread safety when multiple worker threads access the same session. The added multithreading tests in test_session.py are a valuable addition for verifying this fix. However, my review identifies a related logical issue in set_tokens where the hashing state is not correctly reset when the token sequence is modified in a non-append-only manner. I have provided a suggestion to make the state management more robust, which will prevent potential data corruption.

Comment thread lmcache/v1/multiprocess/session.py
Copy link
Copy Markdown
Contributor

@ApostaC ApostaC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ApostaC
Copy link
Copy Markdown
Contributor

ApostaC commented Mar 18, 2026

ping @Oasis-Git

@maobaolong maobaolong added the full Run comprehensive tests on this PR label Mar 18, 2026
Copy link
Copy Markdown
Member

@Oasis-Git Oasis-Git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work!

Copy link
Copy Markdown
Collaborator

@chunxiaozheng chunxiaozheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@chunxiaozheng chunxiaozheng enabled auto-merge (squash) March 18, 2026 02:02
@chunxiaozheng chunxiaozheng merged commit 1b16510 into LMCache:dev Mar 18, 2026
27 of 29 checks passed
hyunyul-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Mar 20, 2026
LMCache#2807)

fix: add thread safety to Session for concurrent TP worker access

Multiple TP worker threads concurrently call set_tokens() and
get_hashes() on the same Session object (same request_id), causing
race conditions in _compute_hash() that corrupt chunk_hashes
(duplicates, wrong rolling state). This leads to ObjectKey mismatch
between sync_lookup/store and retrieve, resulting in KEY_NOT_EXIST
errors during unsafe_read.

Add threading.Lock to Session protecting set_tokens and get_hashes.
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
LMCache#2807)

fix: add thread safety to Session for concurrent TP worker access

Multiple TP worker threads concurrently call set_tokens() and
get_hashes() on the same Session object (same request_id), causing
race conditions in _compute_hash() that corrupt chunk_hashes
(duplicates, wrong rolling state). This leads to ObjectKey mismatch
between sync_lookup/store and retrieve, resulting in KEY_NOT_EXIST
errors during unsafe_read.

Add threading.Lock to Session protecting set_tokens and get_hashes.

Signed-off-by: Aaron Wu <aaron.wu@dell.com>
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 21, 2026
LMCache#2807)

fix: add thread safety to Session for concurrent TP worker access

Multiple TP worker threads concurrently call set_tokens() and
get_hashes() on the same Session object (same request_id), causing
race conditions in _compute_hash() that corrupt chunk_hashes
(duplicates, wrong rolling state). This leads to ObjectKey mismatch
between sync_lookup/store and retrieve, resulting in KEY_NOT_EXIST
errors during unsafe_read.

Add threading.Lock to Session protecting set_tokens and get_hashes.
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 25, 2026
LMCache#2807)

fix: add thread safety to Session for concurrent TP worker access

Multiple TP worker threads concurrently call set_tokens() and
get_hashes() on the same Session object (same request_id), causing
race conditions in _compute_hash() that corrupt chunk_hashes
(duplicates, wrong rolling state). This leads to ObjectKey mismatch
between sync_lookup/store and retrieve, resulting in KEY_NOT_EXIST
errors during unsafe_read.

Add threading.Lock to Session protecting set_tokens and get_hashes.
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 27, 2026
LMCache#2807)

fix: add thread safety to Session for concurrent TP worker access

Multiple TP worker threads concurrently call set_tokens() and
get_hashes() on the same Session object (same request_id), causing
race conditions in _compute_hash() that corrupt chunk_hashes
(duplicates, wrong rolling state). This leads to ObjectKey mismatch
between sync_lookup/store and retrieve, resulting in KEY_NOT_EXIST
errors during unsafe_read.

Add threading.Lock to Session protecting set_tokens and get_hashes.
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
LMCache#2807)

fix: add thread safety to Session for concurrent TP worker access

Multiple TP worker threads concurrently call set_tokens() and
get_hashes() on the same Session object (same request_id), causing
race conditions in _compute_hash() that corrupt chunk_hashes
(duplicates, wrong rolling state). This leads to ObjectKey mismatch
between sync_lookup/store and retrieve, resulting in KEY_NOT_EXIST
errors during unsafe_read.

Add threading.Lock to Session protecting set_tokens and get_hashes.
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
LMCache#2807)

fix: add thread safety to Session for concurrent TP worker access

Multiple TP worker threads concurrently call set_tokens() and
get_hashes() on the same Session object (same request_id), causing
race conditions in _compute_hash() that corrupt chunk_hashes
(duplicates, wrong rolling state). This leads to ObjectKey mismatch
between sync_lookup/store and retrieve, resulting in KEY_NOT_EXIST
errors during unsafe_read.

Add threading.Lock to Session protecting set_tokens and get_hashes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

full Run comprehensive tests on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants