Skip to content

[MP][Bugfix] Fix deadlock caused by cuda launch host func#2952

Merged
ApostaC merged 8 commits intoLMCache:devfrom
ApostaC:local-dev/fix-dead-lock
Apr 9, 2026
Merged

[MP][Bugfix] Fix deadlock caused by cuda launch host func#2952
ApostaC merged 8 commits intoLMCache:devfrom
ApostaC:local-dev/fix-dead-lock

Conversation

@ApostaC
Copy link
Copy Markdown
Contributor

@ApostaC ApostaC commented Apr 3, 2026

What this PR does / why we need it:

EventBus.publish_on_stream() schedules Python callbacks on CUDA streams via
cupy_stream.launch_host_func(self.publish, event). This causes a CUDA
driver / GIL deadlock: the host-function callback holds the CUDA driver lock
and tries to acquire the GIL, while another thread holds the GIL and waits on
the CUDA driver lock.

This PR introduces a C++ EventRecorder that calls cudaLaunchHostFunc
directly. The callback runs entirely in native code (no GIL), stamps the
wall-clock timestamp, and buffers events in a thread-safe C++ vector. The
EventBus drain thread periodically pulls these buffered events into the Python
dispatch queue via drain_recorded_events().

Changes:

  • New csrc/event_recorder.{h,cpp}: EventRecorder singleton with
    cudaLaunchHostFunc callback and drain() API
  • csrc/pybind.cpp: bind record_event_on_stream (GIL released) and
    drain_recorded_events
  • event_bus.py: publish_on_stream routes through C++ when available;
    _drain_all drains the C++ buffer before processing the Python queue
  • Fallback to the old Python path when lmcache.c_ops is unavailable
  • New end to end CI test to prevent deadlock happen again

Special notes for your reviewers:

  • record_event_on_stream releases the GIL via py::call_guard to avoid the
    reverse deadlock (Python thread → CUDA driver lock)
  • ROCm/HIP supported via #ifdef USE_ROCM (maps to hipLaunchHostFunc)
  • The 4 affected callsites (MP_STORE_START/END, MP_RETRIEVE_START/END) in
    server.py are unchanged — the fix is internal to EventBus

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Note

Medium Risk
Adds new CUDA/HIP host-callback C++ code and pybind APIs on the MP observability hot path; mistakes could cause crashes, leaks, or missing events, but behavior is gated with a Python fallback and covered by new tests.

Overview
Fixes a multiprocessing observability deadlock by switching EventBus.publish_on_stream from scheduling a Python callback on the CUDA stream to scheduling a native cudaLaunchHostFunc/hipLaunchHostFunc callback that records events without touching the GIL.

Introduces a C++ EventRecorder buffer (record_event_on_stream + drain_recorded_events) exposed via lmcache.c_ops, and updates the EventBus drain loop to pull native-recorded events into the normal Python dispatch queue (falling back to the old Python path when the native API isn’t available).

Adds CUDA unit/integration tests for the recorder + EventBus routing, updates build configuration to compile the new source, and adds a dedicated Buildkite multiprocess deadlock regression step (while skipping the new CUDA-only test on ROCm CI).

Reviewed by Cursor Bugbot for commit a7b6bfd. Bugbot is set up for automated code reviews on this repo. Configure here.

ApostaC added 2 commits April 3, 2026 23:36
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a native C++ EventRecorder to handle event logging from CUDA and HIP host callbacks, effectively avoiding GIL-related deadlocks during event publishing. The changes include the C++ singleton implementation, pybind11 bindings, and integration into the Python EventBus with corresponding tests. Feedback was provided regarding a potential memory leak in the event recording function where the return status of the host function launch is not verified, which could leave allocated memory unmanaged if the launch fails.

Comment thread csrc/event_recorder.cpp

auto stream = reinterpret_cast<lmcache_stream_t>(
static_cast<uintptr_t>(cuda_stream_ptr));
LMCACHE_LAUNCH_HOST_FUNC(stream, event_host_callback, event);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The return value of LMCACHE_LAUNCH_HOST_FUNC (which maps to cudaLaunchHostFunc or hipLaunchHostFunc) is not checked. If the CUDA/HIP driver fails to schedule the host function, the event pointer allocated on line 64 will be leaked because the callback that normally deletes it will never run. You should check the return status and delete the event if it's not success (0).

  if (LMCACHE_LAUNCH_HOST_FUNC(stream, event_host_callback, event) != 0) {
    delete event;
  }

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ApostaC How about this suggestion?

timestamp=ts,
metadata=metadata,
)
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Global singleton drain causes cross-instance event stealing

Low Severity

The _drain_all method drains events from the process-global C++ EventRecorder singleton. This allows events published through one EventBus instance to be consumed by any other instance, breaking per-instance isolation and leading to non-deterministic event dispatch.

Fix in Cursor Fix in Web

Triggered by project rule: LMCache Code Review Style Guide

Reviewed by Cursor Bugbot for commit 5b90164. Configure here.

timestamp=ts,
metadata=metadata,
)
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Global singleton EventRecorder causes cross-bus event leakage

Medium Severity

The _drain_all method in EventBus pulls events from a process-global C++ buffer. When multiple EventBus instances are active, any instance can drain all events from this shared buffer, which may lead to events being delivered to incorrect subscribers or lost entirely.

Fix in Cursor Fix in Web

Triggered by project rule: LMCache Code Review Style Guide

Reviewed by Cursor Bugbot for commit 5b90164. Configure here.

ApostaC added 2 commits April 4, 2026 00:02
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC ApostaC added the mp Buildkite trigger for multi-processing mode test label Apr 4, 2026
@ApostaC ApostaC requested review from maobaolong and royyhuang April 4, 2026 00:16
@ApostaC
Copy link
Copy Markdown
Contributor Author

ApostaC commented Apr 4, 2026

@maobaolong I've tested it locally. This should fix the deadlock problem (at least for me)

Copy link
Copy Markdown
Contributor

@royyhuang royyhuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment thread .buildkite/k3_tests/multiprocess/scripts/run-deadlock.sh
Comment thread lmcache/v1/mp_observability/event_bus.py
Comment thread csrc/pybind.cpp
Copy link
Copy Markdown
Collaborator

@maobaolong maobaolong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ApostaC Great to see this PR, and thanks for help to resolve the deadlock issue as quick as possible. This fix is more thorough. It's a fundamental fix, which is greater than #2823 .

And i like the work on New end to end CI test to prevent deadlock happen again.

@maobaolong
Copy link
Copy Markdown
Collaborator

maobaolong commented Apr 4, 2026

@maobaolong I've tested it locally. This should fix the deadlock problem (at least for me)

And me, I've tested it in my env just now. @ApostaC

Signed-off-by: ApostaC <yihua98@uchicago.edu>
if isinstance(v, int):
int_metadata[k] = v
else:
str_metadata[k] = str(v)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bool values misrouted due to int subclass check

Low Severity

In publish_on_stream, isinstance(v, int) returns True for bool values because Python's bool is a subclass of int. Any boolean metadata value silently enters int_metadata (as 1/0) and returns from _drain_all as a plain int, losing its original type. The old code path preserved all types unchanged. Adding isinstance(v, bool) before the int check (routing bools to str_metadata) would preserve round-trip fidelity for the dict[str, Any] metadata contract.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit cde5d78. Configure here.

Copy link
Copy Markdown
Collaborator

@maobaolong maobaolong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ApostaC LGTM

maobaolong added a commit to maobaolong/LMCache that referenced this pull request Apr 8, 2026
@ApostaC ApostaC added the full Run comprehensive tests on this PR label Apr 8, 2026
@ApostaC ApostaC enabled auto-merge (squash) April 8, 2026 17:05

# Copy to repo root so cleanup.sh collects it as a Buildkite artifact
cp "$PYSPY_LOG" "${REPO_ROOT}/build_${BUILD_ID}_pyspy.log" 2>/dev/null || true
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deadlock diagnostic function defined but never called

Medium Severity

The dump_stacks() helper function is defined to capture py-spy stack dumps for deadlock diagnosis, but it is never actually called anywhere in the script. When the benchmark times out (the exact scenario this test exists to catch), only the last 50 lines of logs are printed — the py-spy stack dump that would reveal the actual deadlock threads is never triggered. This defeats the purpose of installing py-spy and defining the function.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 5e4dccc. Configure here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 5 total unresolved issues (including 4 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit a7b6bfd. Configure here.

Comment thread csrc/event_recorder.cpp

auto stream = reinterpret_cast<lmcache_stream_t>(
static_cast<uintptr_t>(cuda_stream_ptr));
LMCACHE_LAUNCH_HOST_FUNC(stream, event_host_callback, event);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unchecked CUDA API return leaks memory on failure

Medium Severity

The return value of LMCACHE_LAUNCH_HOST_FUNC (cudaLaunchHostFunc/hipLaunchHostFunc) is not checked. If the call fails, the heap-allocated PendingEvent is never freed (the callback that would delete it never runs), and the error is silently swallowed. Other CUDA API calls in this codebase (e.g., in mem_alloc.cpp) consistently check return values and throw on failure.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a7b6bfd. Configure here.

Copy link
Copy Markdown
Contributor

@sammshen sammshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ApostaC ApostaC merged commit 801b016 into LMCache:dev Apr 9, 2026
34 of 35 checks passed
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026
* update csrc to support native launch host func

* add deadlock ci test

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

full Run comprehensive tests on this PR mp Buildkite trigger for multi-processing mode test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants