[MP][Bugfix] Fix deadlock caused by cuda launch host func#2952
[MP][Bugfix] Fix deadlock caused by cuda launch host func#2952ApostaC merged 8 commits intoLMCache:devfrom
Conversation
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
There was a problem hiding this comment.
Code Review
This pull request introduces a native C++ EventRecorder to handle event logging from CUDA and HIP host callbacks, effectively avoiding GIL-related deadlocks during event publishing. The changes include the C++ singleton implementation, pybind11 bindings, and integration into the Python EventBus with corresponding tests. Feedback was provided regarding a potential memory leak in the event recording function where the return status of the host function launch is not verified, which could leave allocated memory unmanaged if the launch fails.
|
|
||
| auto stream = reinterpret_cast<lmcache_stream_t>( | ||
| static_cast<uintptr_t>(cuda_stream_ptr)); | ||
| LMCACHE_LAUNCH_HOST_FUNC(stream, event_host_callback, event); |
There was a problem hiding this comment.
The return value of LMCACHE_LAUNCH_HOST_FUNC (which maps to cudaLaunchHostFunc or hipLaunchHostFunc) is not checked. If the CUDA/HIP driver fails to schedule the host function, the event pointer allocated on line 64 will be leaked because the callback that normally deletes it will never run. You should check the return status and delete the event if it's not success (0).
if (LMCACHE_LAUNCH_HOST_FUNC(stream, event_host_callback, event) != 0) {
delete event;
}| timestamp=ts, | ||
| metadata=metadata, | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Global singleton drain causes cross-instance event stealing
Low Severity
The _drain_all method drains events from the process-global C++ EventRecorder singleton. This allows events published through one EventBus instance to be consumed by any other instance, breaking per-instance isolation and leading to non-deterministic event dispatch.
Triggered by project rule: LMCache Code Review Style Guide
Reviewed by Cursor Bugbot for commit 5b90164. Configure here.
| timestamp=ts, | ||
| metadata=metadata, | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Global singleton EventRecorder causes cross-bus event leakage
Medium Severity
The _drain_all method in EventBus pulls events from a process-global C++ buffer. When multiple EventBus instances are active, any instance can drain all events from this shared buffer, which may lead to events being delivered to incorrect subscribers or lost entirely.
Triggered by project rule: LMCache Code Review Style Guide
Reviewed by Cursor Bugbot for commit 5b90164. Configure here.
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
|
@maobaolong I've tested it locally. This should fix the deadlock problem (at least for me) |
And me, I've tested it in my env just now. @ApostaC |
Signed-off-by: ApostaC <yihua98@uchicago.edu>
| if isinstance(v, int): | ||
| int_metadata[k] = v | ||
| else: | ||
| str_metadata[k] = str(v) |
There was a problem hiding this comment.
Bool values misrouted due to int subclass check
Low Severity
In publish_on_stream, isinstance(v, int) returns True for bool values because Python's bool is a subclass of int. Any boolean metadata value silently enters int_metadata (as 1/0) and returns from _drain_all as a plain int, losing its original type. The old code path preserved all types unchanged. Adding isinstance(v, bool) before the int check (routing bools to str_metadata) would preserve round-trip fidelity for the dict[str, Any] metadata contract.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit cde5d78. Configure here.
…MCache#2952 Signed-off-by: baoloongmao <baoloongmao@tencent.com>
|
|
||
| # Copy to repo root so cleanup.sh collects it as a Buildkite artifact | ||
| cp "$PYSPY_LOG" "${REPO_ROOT}/build_${BUILD_ID}_pyspy.log" 2>/dev/null || true | ||
| } |
There was a problem hiding this comment.
Deadlock diagnostic function defined but never called
Medium Severity
The dump_stacks() helper function is defined to capture py-spy stack dumps for deadlock diagnosis, but it is never actually called anywhere in the script. When the benchmark times out (the exact scenario this test exists to catch), only the last 50 lines of logs are printed — the py-spy stack dump that would reveal the actual deadlock threads is never triggered. This defeats the purpose of installing py-spy and defining the function.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 5e4dccc. Configure here.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 5 total unresolved issues (including 4 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit a7b6bfd. Configure here.
|
|
||
| auto stream = reinterpret_cast<lmcache_stream_t>( | ||
| static_cast<uintptr_t>(cuda_stream_ptr)); | ||
| LMCACHE_LAUNCH_HOST_FUNC(stream, event_host_callback, event); |
There was a problem hiding this comment.
Unchecked CUDA API return leaks memory on failure
Medium Severity
The return value of LMCACHE_LAUNCH_HOST_FUNC (cudaLaunchHostFunc/hipLaunchHostFunc) is not checked. If the call fails, the heap-allocated PendingEvent is never freed (the callback that would delete it never runs), and the error is silently swallowed. Other CUDA API calls in this codebase (e.g., in mem_alloc.cpp) consistently check return values and throw on failure.
Reviewed by Cursor Bugbot for commit a7b6bfd. Configure here.
* update csrc to support native launch host func * add deadlock ci test Signed-off-by: ApostaC <yihua98@uchicago.edu>


What this PR does / why we need it:
EventBus.publish_on_stream()schedules Python callbacks on CUDA streams viacupy_stream.launch_host_func(self.publish, event). This causes a CUDAdriver / GIL deadlock: the host-function callback holds the CUDA driver lock
and tries to acquire the GIL, while another thread holds the GIL and waits on
the CUDA driver lock.
This PR introduces a C++
EventRecorderthat callscudaLaunchHostFuncdirectly. The callback runs entirely in native code (no GIL), stamps the
wall-clock timestamp, and buffers events in a thread-safe C++ vector. The
EventBus drain thread periodically pulls these buffered events into the Python
dispatch queue via
drain_recorded_events().Changes:
csrc/event_recorder.{h,cpp}:EventRecordersingleton withcudaLaunchHostFunccallback anddrain()APIcsrc/pybind.cpp: bindrecord_event_on_stream(GIL released) anddrain_recorded_eventsevent_bus.py:publish_on_streamroutes through C++ when available;_drain_alldrains the C++ buffer before processing the Python queuelmcache.c_opsis unavailableSpecial notes for your reviewers:
record_event_on_streamreleases the GIL viapy::call_guardto avoid thereverse deadlock (Python thread → CUDA driver lock)
#ifdef USE_ROCM(maps tohipLaunchHostFunc)MP_STORE_START/END,MP_RETRIEVE_START/END) inserver.pyare unchanged — the fix is internal toEventBusIf applicable:
Note
Medium Risk
Adds new CUDA/HIP host-callback C++ code and pybind APIs on the MP observability hot path; mistakes could cause crashes, leaks, or missing events, but behavior is gated with a Python fallback and covered by new tests.
Overview
Fixes a multiprocessing observability deadlock by switching
EventBus.publish_on_streamfrom scheduling a Python callback on the CUDA stream to scheduling a nativecudaLaunchHostFunc/hipLaunchHostFunccallback that records events without touching the GIL.Introduces a C++
EventRecorderbuffer (record_event_on_stream+drain_recorded_events) exposed vialmcache.c_ops, and updates the EventBus drain loop to pull native-recorded events into the normal Python dispatch queue (falling back to the old Python path when the native API isn’t available).Adds CUDA unit/integration tests for the recorder + EventBus routing, updates build configuration to compile the new source, and adds a dedicated Buildkite multiprocess
deadlockregression step (while skipping the new CUDA-only test on ROCm CI).Reviewed by Cursor Bugbot for commit a7b6bfd. Bugbot is set up for automated code reviews on this repo. Configure here.