
[MP][Bugfix] fixing race condition for zmq output notifier#2808

Merged
ApostaC merged 4 commits into LMCache:dev from
ApostaC:local-dev/mp-zmq-fix-face
Mar 19, 2026

Conversation

@ApostaC
Contributor

@ApostaC ApostaC commented Mar 18, 2026

What this PR does / why we need it:

Fixes a race condition in MessageQueueServer._call_blocking_handler where the output_notifier zmq PUSH socket was called from ThreadPoolExecutor callback threads. ZMQ sockets are not thread-safe, so concurrent send() calls from multiple worker threads could corrupt state.

Replaces the zmq PUSH/PULL inproc socket pair with a single os.eventfd. eventfd_write() is an atomic kernel syscall, making it inherently thread-safe for cross-thread notification. This also simplifies the drain logic — eventfd_read() atomically resets the counter in one call vs. a recv loop.

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository-wide code reviews.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical race condition within the MessageQueueServer's output notification system. By migrating from a non-thread-safe ZMQ socket mechanism to the atomic and thread-safe os.eventfd, the change ensures reliable cross-thread communication for task completion. This not only enhances stability but also streamlines the event handling process.

Highlights

  • Race Condition Fix: Addressed a race condition in MessageQueueServer._call_blocking_handler where non-thread-safe ZMQ PUSH sockets were accessed concurrently by ThreadPoolExecutor callback threads.
  • Notification Mechanism Change: Replaced the ZMQ PUSH/PULL inproc socket pair used for output notification with a single os.eventfd, leveraging its atomic and thread-safe eventfd_write() kernel syscall.
  • Simplified Drain Logic: Improved the notification drain logic by using eventfd_read() which atomically resets the counter in a single call, simplifying the previous recv loop.

@ApostaC ApostaC requested a review from maobaolong March 18, 2026 01:15
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request effectively resolves a race condition in the MessageQueueServer._call_blocking_handler by replacing the non-thread-safe ZMQ PUSH/PULL inproc socket pair with os.eventfd. This change ensures thread-safe notification for output tasks, improving the stability and correctness of the message queue. The implementation correctly integrates os.eventfd for creation, signaling, polling, and resource cleanup, aligning with the described solution and leveraging an established pattern within the codebase.

Collaborator

@maobaolong maobaolong left a comment


Thanks for this fix, LGTM. Left a nit only.

-        self.output_notifier, self.output_waiter = prepare_internal_push_pull_sockets(
-            self.ctx
-        )
+        self._output_efd = os.eventfd(0, os.EFD_NONBLOCK | os.EFD_CLOEXEC)
Collaborator


nit: Worth adding a brief comment explaining why eventfd is used instead of the previous ZMQ socket, for future readers who won't have the PR context.

@sammshen
Contributor

@cursor review


Contributor

@sammshen sammshen left a comment


LGTM

@ApostaC
Contributor Author

ApostaC commented Mar 18, 2026

@codex review

Signed-off-by: ApostaC <yihua98@uchicago.edu>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4054524b5f


Comment thread: lmcache/v1/multiprocess/mq.py (Outdated)
        if self.worker_thread.is_alive():
            self.worker_thread.join()
        self.socket.close()
        os.close(self._output_efd)

P1: Do not close the eventfd before blocking callbacks finish

If close() runs while a blocking request is still executing, the callback registered in _call_blocking_handler() can fire after os.close(self._output_efd). Because thread_pool.shutdown(wait=False) does not wait for those callbacks, the late os.eventfd_write() is aimed at a closed integer fd; once that fd number is reused, Python will write eight bytes into the new resource instead of failing. I reproduced this by closing the server, reopening the same fd with os.open(), and observing b"\x01\0..." written into the new file while the client future timed out. This regression only needs an in-flight blocking handler during shutdown, which is possible from the HTTP server lifespan shutdown path.


yoo-kumaneko pushed a commit to yoo-kumaneko/LMCache that referenced this pull request Mar 18, 2026
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
LMCache#2808
Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC ApostaC added the full Run comprehensive tests on this PR label Mar 18, 2026
@YaoJiayi YaoJiayi self-requested a review March 18, 2026 22:35
@ApostaC ApostaC enabled auto-merge (squash) March 19, 2026 18:10
@ApostaC ApostaC merged commit 126f1a1 into LMCache:dev Mar 19, 2026
25 of 28 checks passed
hyunyul-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Mar 20, 2026

* fixing race condition by using event fd

Signed-off-by: ApostaC <yihua98@uchicago.edu>

realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 21, 2026
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 25, 2026
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 27, 2026
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026

Labels

full Run comprehensive tests on this PR

4 participants