[MP][Bugfix] fix vllm-side lookup logical issue and cuda stream deadlock problem#2733
ApostaC merged 3 commits into LMCache:dev
Conversation
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses two critical bugs in the multi-process mode that were causing system hangs. It introduces a caching mechanism for vLLM prefetch lookup results to ensure that completed jobs are correctly recognized, preventing infinite polling. Additionally, it resolves a CUDA stream deadlock by properly managing Python's Global Interpreter Lock during GPU memory transfer operations, allowing host callbacks to execute without contention. These changes significantly improve the stability and reliability of the multi-process system.
Activity
Code Review
This pull request effectively addresses two critical bugs causing hangs in the multi-process mode, including a CUDA stream deadlock and an infinite polling issue. While these fixes enhance stability, the change for the server to return None for unknown job IDs, intended to improve API consistency, inadvertently introduces a Denial of Service vulnerability. Conflating the 'not found' state with the 'in progress' state (both returning None), especially when combined with predictable job IDs, could allow an attacker to trigger infinite polling loops in the scheduler. It is critical to address this by using unpredictable IDs and distinct return values for different job states. Additionally, there are a couple of minor suggestions to improve overall code quality and documentation clarity.
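To make that recommendation concrete, here is a minimal sketch (not part of this PR; `JobStatus`, `PrefetchJobTracker`, and all method names are hypothetical) of one way a server could keep job IDs unpredictable and report "unknown", "in progress", and "done" as distinct states:

```python
from __future__ import annotations

import enum
import uuid


class JobStatus(enum.Enum):
    UNKNOWN = "unknown"          # never issued, or already cleaned up
    IN_PROGRESS = "in_progress"
    DONE = "done"


class PrefetchJobTracker:
    """Hypothetical server-side tracker: unpredictable IDs, explicit states."""

    def __init__(self):
        self._jobs = {}  # job_id -> chunk count (None while still running)

    def start_job(self) -> str:
        # An unguessable ID instead of a small incrementing integer
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = None
        return job_id

    def finish_job(self, job_id: str, chunk_count: int) -> None:
        self._jobs[job_id] = chunk_count

    def query(self, job_id: str) -> tuple[JobStatus, int | None]:
        if job_id not in self._jobs:
            # Distinct from "in progress", so a bogus ID cannot drive endless polling
            return JobStatus.UNKNOWN, None
        chunk_count = self._jobs[job_id]
        if chunk_count is None:
            return JobStatus.IN_PROGRESS, None
        return JobStatus.DONE, chunk_count
```

The PR itself keeps the simpler contract (`None` = not ready or unknown, `int` = done) and avoids the infinite-polling case by caching finished results on the vLLM side, as described in the PR description below.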
```python
if job_id in self._finished_lookup_jobs:
    # Return cached result if the job is already finished
    return self._finished_lookup_jobs[job_id] * self.chunk_size
```
This check for a cached result can be made more concise and efficient by using `dict.get()` to avoid two separate dictionary lookups (`in` followed by `[]`). Using an assignment expression (the walrus operator `:=`, available in Python 3.8+) makes this pattern particularly clean and is generally preferred for this type of check.
```diff
-if job_id in self._finished_lookup_jobs:
-    # Return cached result if the job is already finished
-    return self._finished_lookup_jobs[job_id] * self.chunk_size
+if (cached_result := self._finished_lookup_jobs.get(job_id)) is not None:
+    # Return cached result if the job is already finished
+    return cached_result * self.chunk_size
```
```
Chunk count (int) when done, None if still in progress
or the job ID is unknown.
```
The updated docstring is slightly ambiguous. A clearer phrasing would improve readability and avoid any potential misinterpretation of when None is returned.
```diff
-Chunk count (int) when done, None if still in progress
-or the job ID is unknown.
+The number of matched chunks (int) if the prefetch job is complete.
+Returns None if the job is still in progress or the job ID is unknown.
```
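For context, a caller consuming that contract might poll roughly like the sketch below; `check_prefetch_status` and `wait_for_prefetch` are placeholder names, not the actual LMCache API:

```python
import time


def wait_for_prefetch(connector, job_id, poll_interval=0.01, timeout=5.0):
    """Hypothetical polling helper illustrating the None-vs-int return contract."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = connector.check_prefetch_status(job_id)  # placeholder method name
        if result is not None:
            return result  # int: the prefetch job is done
        # None: still in progress (or an unknown job ID), so keep polling
        time.sleep(poll_interval)
    raise TimeoutError(f"prefetch job {job_id} did not finish within {timeout}s")
```

The explicit timeout is a defensive choice here, since `None` alone cannot distinguish an in-progress job from an unknown one.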
Signed-off-by: ApostaC <yihua98@uchicago.edu>
…ock problem (LMCache#2733) * Fix the vLLM-side logical bug * [fix] deadlock problem caused by launch_host_func Signed-off-by: ApostaC <yihua98@uchicago.edu> Signed-off-by: shaoxiawjc <wjc2800@163.com>
…ock problem (LMCache#2733) * Fix the vLLM-side logical bug * [fix] deadlock problem caused by launch_host_func Signed-off-by: ApostaC <yihua98@uchicago.edu> Signed-off-by: Aaron Wu <aaron.wu@dell.com>
…ock problem (LMCache#2733) * Fix the vLLM-side logical bug * [fix] deadlock problem caused by launch_host_func Signed-off-by: ApostaC <yihua98@uchicago.edu>
What this PR does / why we need it:
Fixes two bugs in the multi-process (MP) mode that together cause hangs:
1. vLLM-side lookup result caching: After a prefetch job completes, the server removes it from its job map. Subsequent `QUERY_PREFETCH_STATUS` calls for that job ID return `None` (previously `0`), causing the scheduler to think the lookup is still in progress and poll forever. This PR (see the sketch after this list):
   - Caches finished lookup results on the vLLM side (in `_finished_lookup_jobs`) so repeated queries return the cached count instead of re-querying the server.
   - Makes the server return `None` (instead of `0`) for unknown job IDs, making the semantics consistent: `None` = not ready / unknown, `int` = done.
   - Cleans up the cached entry in `cleanup_lookup_result`.
2. CUDA stream deadlock from `launch_host_func`: Telemetry logging callbacks scheduled via `cupy_stream.launch_host_func` need the GIL, but `paged_kv_transfer` and `lmcache_memcpy_async` hold the GIL while waiting on the CUDA stream, creating a deadlock. This PR adds `py::call_guard<py::gil_scoped_release>()` to both C++ bindings so the GIL is released during execution, allowing host callbacks to acquire it. This also removes the `_can_log_store` workaround that was masking the issue.
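As a rough sketch of the caching behavior in item 1 above (the class and the `server` handle are made up for illustration; `_finished_lookup_jobs`, `cleanup_lookup_result`, and the chunk-size multiplication follow the names that appear in this PR), the vLLM-side query path could look like this:

```python
class PrefetchLookupClient:
    """Simplified sketch of the vLLM-side lookup path, not the real class."""

    def __init__(self, server, chunk_size):
        self.server = server
        self.chunk_size = chunk_size
        # Cache of finished jobs: job_id -> matched chunk count
        self._finished_lookup_jobs = {}

    def query_prefetch_status(self, job_id):
        # Serve repeated queries from the cache so a finished job is never
        # mistaken for an in-progress one after the server drops it.
        if job_id in self._finished_lookup_jobs:
            return self._finished_lookup_jobs[job_id] * self.chunk_size
        chunk_count = self.server.query_prefetch_status(job_id)  # None = not ready / unknown
        if chunk_count is None:
            return None
        self._finished_lookup_jobs[job_id] = chunk_count
        return chunk_count * self.chunk_size

    def cleanup_lookup_result(self, job_id):
        # Drop the cached entry once the scheduler no longer needs it.
        self._finished_lookup_jobs.pop(job_id, None)
```

Because the finished count is served from the local cache, the scheduler still gets a stable answer even after the server has dropped the job from its map.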
Special notes for your reviewers:

- The `_can_log_store` flag was a partial workaround that prevented store telemetry until after the first retrieve; it is no longer needed with the proper GIL fix.