[MP] feat: add LMCacheMPPollingSchedulerAdapter for synchronous lookup and prefetch in MP mode#2735
liuyumoye wants to merge 1 commit into LMCache:dev
Conversation
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new synchronous prefetch lookup capability for LMCache in multi-process mode. This enhancement provides an alternative to the existing asynchronous two-step lookup, allowing a single blocking call that waits for prefetch completion. The primary goal is to reduce scheduling latency, particularly for short requests or high cache hit rates, by streamlining prefetch result retrieval.
Highlights
Code Review
This pull request introduces a synchronous prefetch lookup mechanism (LOOKUP_AND_PREFETCH_SYNC) to LMCache in multi-process mode, designed to reduce scheduling latency by allowing the vLLM connector to block until prefetch completes. While this is a valuable addition for prefetch strategies, the server-side implementation of the synchronous polling loop has been identified as vulnerable to a Denial of Service (DoS) attack. This vulnerability arises from predictable job IDs and the lack of a timeout, which could indefinitely block a worker thread if a job ID is compromised or invalid, especially critical with the default single-worker thread pool. It is recommended to implement a timeout in the polling loop and improve the security of job identifiers to mitigate this risk.
```python
while True:
    result = self.query_prefetch_status(job_id)
    if result is not None:
        return result
    time.sleep(poll_interval)
```
The lookup_and_prefetch_sync method implements a while True loop that polls the status of a prefetch job using self.query_prefetch_status(job_id). However, query_prefetch_status returns None both when a job is still in progress and when a job_id is not found (e.g., if it was already consumed or never existed).
Because the loop lacks a timeout and does not verify the continued existence of the job, it will spin indefinitely if a job_id becomes invalid. Since job_ids are predictable incrementing counters, a malicious client could 'steal' a job_id by querying it first, causing the legitimate caller's thread to hang forever. Given the default configuration of a single worker thread (max_workers=1), this allows a single request to cause a complete Denial of Service (DoS) for all blocking operations on the server.
Recommendation:
- Implement a maximum timeout for the synchronous lookup loop.
- Modify the logic to check if the job still exists in the `_prefetch_jobs` dictionary during each iteration.
- Use unpredictable identifiers (e.g., UUIDs) for `job_id` to prevent cross-client interference.
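The three recommendations above can be combined in one place. The following is a minimal sketch, not the actual server code: `PrefetchServerSketch`, `submit_prefetch`, and the in-memory `_prefetch_jobs` dict are toy stand-ins modeled on the names quoted in this review, and the 5 s timeout is an assumption.

```python
import time
import uuid


class PrefetchServerSketch:
    """Toy stand-in for the MP server's prefetch bookkeeping."""

    def __init__(self) -> None:
        # job_id -> prefix_hits, or None while the job is still running
        self._prefetch_jobs: dict[str, int | None] = {}

    def submit_prefetch(self) -> str:
        # UUID4 job ids are not guessable, so another client cannot
        # "steal" a job by probing sequential counters.
        job_id = uuid.uuid4().hex
        self._prefetch_jobs[job_id] = None
        return job_id

    def lookup_and_prefetch_sync(
        self, job_id: str, poll_interval: float = 0.005, timeout: float = 5.0
    ) -> int:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if job_id not in self._prefetch_jobs:
                # Job vanished (consumed or never existed): stop polling.
                return 0
            result = self._prefetch_jobs[job_id]
            if result is not None:
                del self._prefetch_jobs[job_id]
                return result
            time.sleep(poll_interval)
        return 0  # timed out: fall back to no cache hit
```

With a bounded deadline, even a stolen or invalid `job_id` costs at most one `timeout` per call instead of wedging the single worker thread forever.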
```python
    [key, self.tp_size],
).result()

return (result or 0) * self.chunk_size
```
The response_class for LOOKUP_AND_PREFETCH_SYNC is defined as int in protocols/engine.py. If result can indeed be None here, the protocol definition should be updated to int | None for accuracy. If result is guaranteed to be an integer, then or 0 is redundant.
Suggested change:
```diff
- return (result or 0) * self.chunk_size
+ return result * self.chunk_size
```
```python
# Returns: int - prefix_hits (chunk count), blocks until prefetch completes
"LOOKUP_AND_PREFETCH_SYNC": ProtocolDefinition(
    payload_classes=[KeyType, int],
    response_class=int,
```
The response_class is set to int. However, in vllm_multi_process_adapter.py, the code uses (result or 0), implying that result might be None. Please ensure consistency: if the server can return None, the response_class should be int | None. If it always returns an int, the client-side or 0 can be removed.
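One way to keep the two sides consistent is sketched below with hypothetical names; `tokens_from_response` is illustrative only and not a function in this PR.

```python
from typing import Optional

# If the server may legitimately return None, declare it in the
# protocol definition so the annotation matches reality:
response_class = Optional[int]  # i.e. int | None


def tokens_from_response(result: Optional[int], chunk_size: int) -> int:
    # The `or 0` guard is only justified when response_class is
    # int | None; drop it if the protocol guarantees a plain int.
    return (result or 0) * chunk_size
```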
```python
    key: IPCCacheEngineKey,
    tp_size: int,
) -> int:
    """Synchronous lookup: submit a prefetch job, block until it completes,
    and return prefix_hits directly.
    """
    # 5 ms poll interval: each poll acquires two locks (_prefetch_job_lock
    # and _results_lock), so a smaller interval would contend with the
    # background prefetch threads and hurt throughput.
    poll_interval = 0.005
```
The poll_interval is a magic number. It would be better to define this as a named constant, possibly configurable via MPServerConfig, to improve readability and maintainability. This makes it easier to understand its purpose and adjust it if needed.
Suggested change:
```python
POLL_INTERVAL_SECONDS = 0.005
job_id = self.lookup(key, tp_size)
while True:
    result = self.query_prefetch_status(job_id)
    if result is not None:
        return result
    time.sleep(POLL_INTERVAL_SECONDS)
```
ApostaC left a comment:
Hey @liuyumoye, lookup_and_prefetch_sync will block the LMCache server's worker and delay other store and retrieve operations.
Therefore, can we introduce the periodic polling thread in the vllm_multi_process_adapter.py? This can avoid all the code changes in v1/multiprocess.
Additionally, if we are going to have a special
```python
    token_ids: list[int],
) -> int:
    """
    Synchronous lookup: submit a prefetch job, block until it completes,
    and return the number of matched tokens directly.
```
Please convert the Chinese comments and docstrings to English.
Thanks for the suggestion! The current implementation already avoids blocking the server worker.
Great point, and this is exactly what the current implementation does!
Force-pushed 29db2b1 to cbfa48a
```python
    )
    self._lookup_job_ids.pop(request_id, None)
    return 0
time.sleep(self._poll_interval)
```
No unit tests for new public class
Medium Severity
LMCacheMPPollingSchedulerAdapter is a new public class with non-trivial polling and timeout logic, but the PR includes zero tests. The project's style guide requires new features to include corresponding tests, and AGENTS.md states "All new features and bug fixes should include corresponding tests."
Triggered by project rule: LMCache Code Review Style Guide
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
```python
    job_id,
)
self._lookup_job_ids.pop(request_id, None)
self._finished_lookup_jobs.pop(job_id, None)
```
Subclass accesses parent's private members directly
Low Severity
LMCacheMPPollingSchedulerAdapter.check_lookup_result directly reads and mutates self._lookup_job_ids and self._finished_lookup_jobs, which are private (_-prefixed) attributes defined on the parent class LMCacheMPSchedulerAdapter. The project convention (AGENTS.md) states "never access private members of other classes," and a subclass is a distinct class boundary.
Triggered by project rule: LMCache Code Review Style Guide
…ch lookup in MP mode
Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>


What this PR does / why we need it:
Currently, the MP-mode prefetch lookup is asynchronous and two-step: the vllm scheduler first calls maybe_submit_lookup_request() to submit a prefetch job, then polls check_lookup_result() in subsequent scheduling rounds to retrieve the result. This introduces scheduling latency when the prefetch result is needed immediately (e.g., for short requests or when the cache hit rate is high).
This PR adds LMCacheMPPollingSchedulerAdapter, a subclass of LMCacheMPSchedulerAdapter that provides a synchronous (blocking) prefetch lookup in MP mode.
The base class check_lookup_result returns None immediately when the prefetch job is not yet complete, requiring the caller to retry across multiple scheduling steps. This new subclass overrides check_lookup_result to poll the server in a loop until the prefetch job finishes, so the caller always gets a definitive matched-token count in a single call.
Key behaviors:
Polls QUERY_PREFETCH_STATUS at a configurable interval (poll_interval, default 5ms) until the job is done.
Enforces a configurable timeout (lookup_timeout, default 5s); if exceeded, logs a warning and returns 0 matched tokens so the request falls back to normal inference without KV cache hit.
Fully backward-compatible: the base class behavior is unchanged; callers opt in by using LMCacheMPPollingSchedulerAdapter instead.
Special notes for your reviewers:
The polling loop uses time.monotonic() for timeout tracking to avoid wall-clock drift issues.
poll_interval=0.005 (5ms) and lookup_timeout=5.0 (5s) are chosen as sensible defaults but are fully configurable via constructor arguments.
On timeout, _lookup_job_ids entry is cleaned up to avoid memory leaks.
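Putting the behaviors above together, here is a minimal sketch of the polling shape described in this PR. It is a toy model, not the adapter itself: `PollingLookupSketch` and the `_job_results` dict stand in for the real class and for the server's `QUERY_PREFETCH_STATUS` round-trip.

```python
import time


class PollingLookupSketch:
    """Toy model of the adapter's blocking check_lookup_result."""

    def __init__(self, poll_interval: float = 0.005, lookup_timeout: float = 5.0):
        self._poll_interval = poll_interval
        self._lookup_timeout = lookup_timeout
        self._lookup_job_ids: dict[str, int] = {}  # request_id -> job_id
        self._job_results: dict[int, int] = {}     # stand-in for server state

    def check_lookup_result(self, request_id: str) -> int:
        job_id = self._lookup_job_ids[request_id]
        # time.monotonic() is immune to wall-clock adjustments.
        deadline = time.monotonic() + self._lookup_timeout
        while time.monotonic() < deadline:
            # Stand-in for the QUERY_PREFETCH_STATUS call to the server.
            result = self._job_results.get(job_id)
            if result is not None:
                self._lookup_job_ids.pop(request_id, None)
                return result
            time.sleep(self._poll_interval)
        # Timed out: drop the stale entry so it cannot leak, report no hit.
        self._lookup_job_ids.pop(request_id, None)
        return 0
```

Returning 0 on timeout means the request simply proceeds without a KV cache hit, which matches the fallback behavior described above.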
Note
Medium Risk
Introduces a blocking polling loop around QUERY_PREFETCH_STATUS, which can affect scheduler latency and load on the MQ server if misconfigured. Timeout/health handling reduces worst-case impact, but correctness depends on exactly-once response semantics and proper timeout tuning.
Overview
Adds LMCacheMPPollingSchedulerAdapter, an opt-in subclass of LMCacheMPSchedulerAdapter that makes MP-mode lookup synchronous: check_lookup_result now blocks and polls until the LMCache prefetch job completes, instead of returning None and requiring callers to re-check across scheduling rounds.
The polling implementation uses a configurable poll_interval and an overall lookup_timeout, cleans up per-request lookup state on timeout or an unhealthy server, and carefully reuses a single in-flight MQ future to avoid losing results under the server's exactly-once semantics.
Written by Cursor Bugbot for commit 1a226d9.