[MP][Bugfix] Fix deadlock caused by cuda launch host func by ApostaC · Pull Request #2952 · LMCache/LMCache

ApostaC · 2026-04-03T23:38:49Z

What this PR does / why we need it:

EventBus.publish_on_stream() schedules Python callbacks on CUDA streams via
cupy_stream.launch_host_func(self.publish, event). This causes a CUDA
driver / GIL deadlock: the host-function callback holds the CUDA driver lock
and tries to acquire the GIL, while another thread holds the GIL and waits on
the CUDA driver lock.

This PR introduces a C++ EventRecorder that calls cudaLaunchHostFunc
directly. The callback runs entirely in native code (no GIL), stamps the
wall-clock timestamp, and buffers events in a thread-safe C++ vector. The
EventBus drain thread periodically pulls these buffered events into the Python
dispatch queue via drain_recorded_events().

Changes:

New csrc/event_recorder.{h,cpp}: EventRecorder singleton with
cudaLaunchHostFunc callback and drain() API
csrc/pybind.cpp: bind record_event_on_stream (GIL released) and
drain_recorded_events
event_bus.py: publish_on_stream routes through C++ when available;
_drain_all drains the C++ buffer before processing the Python queue
Fallback to the old Python path when lmcache.c_ops is unavailable
New end to end CI test to prevent deadlock happen again

Special notes for your reviewers:

record_event_on_stream releases the GIL via py::call_guard to avoid the
reverse deadlock (Python thread → CUDA driver lock)
ROCm/HIP supported via #ifdef USE_ROCM (maps to hipLaunchHostFunc)
The 4 affected callsites (MP_STORE_START/END, MP_RETRIEVE_START/END) in
server.py are unchanged — the fix is internal to EventBus

If applicable:

this PR contains user facing changes - docs added
this PR contains unit tests

Note

Medium Risk
Adds new CUDA/HIP host-callback C++ code and pybind APIs on the MP observability hot path; mistakes could cause crashes, leaks, or missing events, but behavior is gated with a Python fallback and covered by new tests.

Overview
Fixes a multiprocessing observability deadlock by switching EventBus.publish_on_stream from scheduling a Python callback on the CUDA stream to scheduling a native cudaLaunchHostFunc/hipLaunchHostFunc callback that records events without touching the GIL.

Introduces a C++ EventRecorder buffer (record_event_on_stream + drain_recorded_events) exposed via lmcache.c_ops, and updates the EventBus drain loop to pull native-recorded events into the normal Python dispatch queue (falling back to the old Python path when the native API isn’t available).

Adds CUDA unit/integration tests for the recorder + EventBus routing, updates build configuration to compile the new source, and adds a dedicated Buildkite multiprocess deadlock regression step (while skipping the new CUDA-only test on ROCm CI).

^{Reviewed by Cursor Bugbot for commit a7b6bfd. Bugbot is set up for automated code reviews on this repo. Configure here.}

Signed-off-by: ApostaC <yihua98@uchicago.edu>

gemini-code-assist

Code Review

This pull request introduces a native C++ EventRecorder to handle event logging from CUDA and HIP host callbacks, effectively avoiding GIL-related deadlocks during event publishing. The changes include the C++ singleton implementation, pybind11 bindings, and integration into the Python EventBus with corresponding tests. Feedback was provided regarding a potential memory leak in the event recording function where the return status of the host function launch is not verified, which could leave allocated memory unmanaged if the launch fails.

gemini-code-assist · 2026-04-03T23:40:58Z

+
+  auto stream = reinterpret_cast<lmcache_stream_t>(
+      static_cast<uintptr_t>(cuda_stream_ptr));
+  LMCACHE_LAUNCH_HOST_FUNC(stream, event_host_callback, event);


The return value of LMCACHE_LAUNCH_HOST_FUNC (which maps to cudaLaunchHostFunc or hipLaunchHostFunc) is not checked. If the CUDA/HIP driver fails to schedule the host function, the event pointer allocated on line 64 will be leaked because the callback that normally deletes it will never run. You should check the return status and delete the event if it's not success (0).

if (LMCACHE_LAUNCH_HOST_FUNC(stream, event_host_callback, event) != 0) { delete event; }

@ApostaC How about this suggestion?

cursor · 2026-04-03T23:59:19Z

+                        timestamp=ts,
+                        metadata=metadata,
+                    )
+                )


Global singleton drain causes cross-instance event stealing

Low Severity

The _drain_all method drains events from the process-global C++ EventRecorder singleton. This allows events published through one EventBus instance to be consumed by any other instance, breaking per-instance isolation and leading to non-deterministic event dispatch.

^{Triggered by project rule: LMCache Code Review Style Guide}

^{Reviewed by Cursor Bugbot for commit 5b90164. Configure here.}

cursor · 2026-04-03T23:59:19Z

+                        timestamp=ts,
+                        metadata=metadata,
+                    )
+                )


Global singleton EventRecorder causes cross-bus event leakage

Medium Severity

The _drain_all method in EventBus pulls events from a process-global C++ buffer. When multiple EventBus instances are active, any instance can drain all events from this shared buffer, which may lead to events being delivered to incorrect subscribers or lost entirely.

^{Triggered by project rule: LMCache Code Review Style Guide}

^{Reviewed by Cursor Bugbot for commit 5b90164. Configure here.}

Signed-off-by: ApostaC <yihua98@uchicago.edu>

ApostaC · 2026-04-04T00:18:55Z

@maobaolong I've tested it locally. This should fix the deadlock problem (at least for me)

royyhuang

LGTM!

maobaolong

@ApostaC Great to see this PR, and thanks for help to resolve the deadlock issue as quick as possible. This fix is more thorough. It's a fundamental fix, which is greater than #2823 .

And i like the work on New end to end CI test to prevent deadlock happen again.

maobaolong · 2026-04-04T02:03:17Z

@maobaolong I've tested it locally. This should fix the deadlock problem (at least for me)

And me, I've tested it in my env just now. @ApostaC

Signed-off-by: ApostaC <yihua98@uchicago.edu>

cursor · 2026-04-06T18:07:23Z

+                if isinstance(v, int):
+                    int_metadata[k] = v
+                else:
+                    str_metadata[k] = str(v)


Bool values misrouted due to int subclass check

Low Severity

In publish_on_stream, isinstance(v, int) returns True for bool values because Python's bool is a subclass of int. Any boolean metadata value silently enters int_metadata (as 1/0) and returns from _drain_all as a plain int, losing its original type. The old code path preserved all types unchanged. Adding isinstance(v, bool) before the int check (routing bools to str_metadata) would preserve round-trip fidelity for the dict[str, Any] metadata contract.

Additional Locations (1)

lmcache/v1/mp_observability/event_bus.py#L242-L244

^{Reviewed by Cursor Bugbot for commit cde5d78. Configure here.}

maobaolong

@ApostaC LGTM

…MCache#2952 Signed-off-by: baoloongmao <baoloongmao@tencent.com>

cursor · 2026-04-08T19:14:33Z

+
+    # Copy to repo root so cleanup.sh collects it as a Buildkite artifact
+    cp "$PYSPY_LOG" "${REPO_ROOT}/build_${BUILD_ID}_pyspy.log" 2>/dev/null || true
+}


Deadlock diagnostic function defined but never called

Medium Severity

The dump_stacks() helper function is defined to capture py-spy stack dumps for deadlock diagnosis, but it is never actually called anywhere in the script. When the benchmark times out (the exact scenario this test exists to catch), only the last 50 lines of logs are printed — the py-spy stack dump that would reveal the actual deadlock threads is never triggered. This defeats the purpose of installing py-spy and defining the function.

Additional Locations (1)

.buildkite/k3_tests/multiprocess/scripts/run-deadlock.sh#L124-L132

^{Reviewed by Cursor Bugbot for commit 5e4dccc. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 5 total unresolved issues (including 4 from previous reviews).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit a7b6bfd. Configure here.}

cursor · 2026-04-08T22:01:12Z

+
+  auto stream = reinterpret_cast<lmcache_stream_t>(
+      static_cast<uintptr_t>(cuda_stream_ptr));
+  LMCACHE_LAUNCH_HOST_FUNC(stream, event_host_callback, event);


Unchecked CUDA API return leaks memory on failure

Medium Severity

The return value of LMCACHE_LAUNCH_HOST_FUNC (cudaLaunchHostFunc/hipLaunchHostFunc) is not checked. If the call fails, the heap-allocated PendingEvent is never freed (the callback that would delete it never runs), and the error is silently swallowed. Other CUDA API calls in this codebase (e.g., in mem_alloc.cpp) consistently check return values and throw on failure.

^{Reviewed by Cursor Bugbot for commit a7b6bfd. Configure here.}

sammshen

LGTM!

* update csrc to support native launch host func * add deadlock ci test Signed-off-by: ApostaC <yihua98@uchicago.edu>

ApostaC added 2 commits April 3, 2026 23:36

update csrc to support native launch host func

6091000

Signed-off-by: ApostaC <yihua98@uchicago.edu>

wiring the c++ module to python

5b90164

Signed-off-by: ApostaC <yihua98@uchicago.edu>

gemini-code-assist Bot reviewed Apr 3, 2026

View reviewed changes

cursor Bot reviewed Apr 3, 2026

View reviewed changes

ApostaC added 2 commits April 4, 2026 00:02

Add end to end deadlock test

653976d

Signed-off-by: ApostaC <yihua98@uchicago.edu>

skip amd tests

57a3b48

Signed-off-by: ApostaC <yihua98@uchicago.edu>

ApostaC added the mp Buildkite trigger for multi-processing mode test label Apr 4, 2026

ApostaC requested review from maobaolong and royyhuang April 4, 2026 00:16

royyhuang approved these changes Apr 4, 2026

View reviewed changes

maobaolong reviewed Apr 4, 2026

View reviewed changes

Comment thread .buildkite/k3_tests/multiprocess/scripts/run-deadlock.sh

maobaolong reviewed Apr 4, 2026

View reviewed changes

Comment thread lmcache/v1/mp_observability/event_bus.py

maobaolong reviewed Apr 4, 2026

View reviewed changes

Comment thread csrc/pybind.cpp

maobaolong reviewed Apr 4, 2026

View reviewed changes

add deadlock ci test

cde5d78

Signed-off-by: ApostaC <yihua98@uchicago.edu>

ApostaC requested review from YaoJiayi, deng451e, hickeyma and sammshen as code owners April 6, 2026 17:59

cursor Bot reviewed Apr 6, 2026

View reviewed changes

Merge branch 'dev' into local-dev/fix-dead-lock

25829fd

maobaolong approved these changes Apr 8, 2026

View reviewed changes

maobaolong added a commit to maobaolong/LMCache that referenced this pull request Apr 8, 2026

Backport: [MP][Bugfix] Fix deadlock caused by cuda launch host func L…

b7642ed

…MCache#2952 Signed-off-by: baoloongmao <baoloongmao@tencent.com>

maobaolong mentioned this pull request Apr 8, 2026

[Bug] Cache hit is 0 when deploying GLM-5 with MP LMCache #2977

Open

ApostaC added the full Run comprehensive tests on this PR label Apr 8, 2026

ApostaC enabled auto-merge (squash) April 8, 2026 17:05

Merge branch 'dev' into local-dev/fix-dead-lock

5e4dccc

cursor Bot reviewed Apr 8, 2026

View reviewed changes

Merge branch 'dev' into local-dev/fix-dead-lock

a7b6bfd

cursor Bot reviewed Apr 8, 2026

View reviewed changes

sammshen approved these changes Apr 9, 2026

View reviewed changes

ApostaC merged commit 801b016 into LMCache:dev Apr 9, 2026
34 of 35 checks passed

Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026

[MP][Bugfix] Fix deadlock caused by cuda launch host func (LMCache#2952)

528f6fc

* update csrc to support native launch host func * add deadlock ci test Signed-off-by: ApostaC <yihua98@uchicago.edu>

Conversation

ApostaC commented Apr 3, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

maobaolong Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot Apr 3, 2026

Choose a reason for hiding this comment

Global singleton drain causes cross-instance event stealing

Uh oh!

cursor Bot Apr 3, 2026

Choose a reason for hiding this comment

Global singleton EventRecorder causes cross-bus event leakage

Uh oh!

ApostaC commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

royyhuang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

maobaolong left a comment

Choose a reason for hiding this comment

Uh oh!

maobaolong commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor Bot Apr 6, 2026

Choose a reason for hiding this comment

Bool values misrouted due to int subclass check

Uh oh!

maobaolong left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Apr 8, 2026

Choose a reason for hiding this comment

Deadlock diagnostic function defined but never called

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Apr 8, 2026

Choose a reason for hiding this comment

Unchecked CUDA API return leaks memory on failure

Uh oh!

sammshen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ApostaC commented Apr 3, 2026 •

edited by cursor Bot

Loading

ApostaC commented Apr 4, 2026 •

edited

Loading

maobaolong commented Apr 4, 2026 •

edited

Loading