
[MP][Bugfix] fixing race condition for zmq output notifier#2808

Merged
ApostaC merged 4 commits into LMCache:dev from
ApostaC:local-dev/mp-zmq-fix-face
Mar 19, 2026

Conversation

@ApostaC
Contributor

@ApostaC ApostaC commented Mar 18, 2026

What this PR does / why we need it:

Fixes a race condition in MessageQueueServer._call_blocking_handler where the output_notifier zmq PUSH socket was called from ThreadPoolExecutor callback threads. ZMQ sockets are not thread-safe, so concurrent send() calls from multiple worker threads could corrupt state.

Replaces the zmq PUSH/PULL inproc socket pair with a single os.eventfd. eventfd_write() is an atomic kernel syscall, making it inherently thread-safe for cross-thread notification. This also simplifies the drain logic — eventfd_read() atomically resets the counter in one call vs. a recv loop.

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository-wide code reviews.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical race condition within the MessageQueueServer's output notification system. By migrating from a non-thread-safe ZMQ socket mechanism to the atomic and thread-safe os.eventfd, the change ensures reliable cross-thread communication for task completion. This not only enhances stability but also streamlines the event handling process.

Highlights

  • Race Condition Fix: Addressed a race condition in MessageQueueServer._call_blocking_handler where non-thread-safe ZMQ PUSH sockets were accessed concurrently by ThreadPoolExecutor callback threads.
  • Notification Mechanism Change: Replaced the ZMQ PUSH/PULL inproc socket pair used for output notification with a single os.eventfd, leveraging its atomic and thread-safe eventfd_write() kernel syscall.
  • Simplified Drain Logic: Improved the notification drain logic by using eventfd_read() which atomically resets the counter in a single call, simplifying the previous recv loop.

@ApostaC ApostaC requested a review from maobaolong March 18, 2026 01:15
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request effectively resolves a race condition in the MessageQueueServer._call_blocking_handler by replacing the non-thread-safe ZMQ PUSH/PULL inproc socket pair with os.eventfd. This change ensures thread-safe notification for output tasks, improving the stability and correctness of the message queue. The implementation correctly integrates os.eventfd for creation, signaling, polling, and resource cleanup, aligning with the described solution and leveraging an established pattern within the codebase.

Collaborator

@maobaolong maobaolong left a comment


Thanks for this fix, LGTM. Left a nit only.

-        self.output_notifier, self.output_waiter = prepare_internal_push_pull_sockets(
-            self.ctx
-        )
+        self._output_efd = os.eventfd(0, os.EFD_NONBLOCK | os.EFD_CLOEXEC)
Collaborator


nit: Worth adding a brief comment explaining why eventfd is used instead of the previous ZMQ socket, for future readers who won't have the PR context.

@sammshen
Contributor

@cursor review


Contributor

@sammshen sammshen left a comment


LGTM

@ApostaC
Contributor Author

ApostaC commented Mar 18, 2026

@codex review

Signed-off-by: ApostaC <yihua98@uchicago.edu>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4054524b5f


Comment thread: lmcache/v1/multiprocess/mq.py (Outdated)
        if self.worker_thread.is_alive():
            self.worker_thread.join()
        self.socket.close()
        os.close(self._output_efd)

P1: Do not close the eventfd before blocking callbacks finish

If close() runs while a blocking request is still executing, the callback registered in _call_blocking_handler() can fire after os.close(self._output_efd). Because thread_pool.shutdown(wait=False) does not wait for those callbacks, the late os.eventfd_write() is aimed at a closed integer fd; once that fd number is reused, Python will write eight bytes into the new resource instead of failing. I reproduced this by closing the server, reopening the same fd with os.open(), and observing b"\x01\0..." written into the new file while the client future timed out. This regression only needs an in-flight blocking handler during shutdown, which is possible from the HTTP server lifespan shutdown path.


yoo-kumaneko pushed a commit to yoo-kumaneko/LMCache that referenced this pull request Mar 18, 2026
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
LMCache#2808
Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC ApostaC added the full Run comprehensive tests on this PR label Mar 18, 2026
@YaoJiayi YaoJiayi self-requested a review March 18, 2026 22:35
@ApostaC ApostaC enabled auto-merge (squash) March 19, 2026 18:10
@ApostaC ApostaC merged commit 126f1a1 into LMCache:dev Mar 19, 2026
25 of 28 checks passed
hyunyul-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Mar 20, 2026

* fixing race condition by using event fd

Signed-off-by: ApostaC <yihua98@uchicago.edu>

realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 21, 2026
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 25, 2026
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 27, 2026
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026

Labels

full Run comprehensive tests on this PR

4 participants