[MP] feat: add LMCacheMPPollingSchedulerAdapter for synchronous lookup and prefetch in MP mode by liuyumoye · Pull Request #2735 · LMCache/LMCache

liuyumoye · 2026-03-11T07:00:19Z

What this PR does / why we need it:
Currently, the MP-mode prefetch lookup is asynchronous and two-step: the vllm scheduler first calls maybe_submit_lookup_request() to submit a prefetch job, then polls check_lookup_result() in subsequent scheduling rounds to retrieve the result. This introduces scheduling latency when the prefetch result is needed immediately (e.g., for short requests or when the cache hit rate is high).

This PR adds LMCacheMPPollingSchedulerAdapter, a subclass of LMCacheMPSchedulerAdapter that provides a synchronous (blocking) prefetch lookup in MP mode.

The base class check_lookup_result returns None immediately when the prefetch job is not yet complete, requiring the caller to retry across multiple scheduling steps. This new subclass overrides check_lookup_result to poll the server in a loop until the prefetch job finishes, so the caller always gets a definitive matched-token count in a single call.

Key behaviors:

Polls QUERY_PREFETCH_STATUS at a configurable interval (poll_interval, default 5ms) until the job is done.
Enforces a configurable timeout (lookup_timeout, default 5s); if exceeded, logs a warning and returns 0 matched tokens so the request falls back to normal inference without KV cache hit.
Fully backward-compatible: the base class behavior is unchanged; callers opt in by using LMCacheMPPollingSchedulerAdapter instead.

Special notes for your reviewers:

The polling loop uses time.monotonic() for timeout tracking to avoid wall-clock drift issues.
poll_interval=0.005 (5ms) and lookup_timeout=5.0 (5s) are chosen as sensible defaults but are fully configurable via constructor arguments.
On timeout, _lookup_job_ids entry is cleaned up to avoid memory leaks.

If applicable:

this PR contains user facing changes - docs added
this PR contains unit tests

Note

Medium Risk
Introduces a blocking polling loop around QUERY_PREFETCH_STATUS, which can affect scheduler latency and load on the MQ server if misconfigured. Timeout/health handling reduces worst-case impact but correctness depends on exactly-once response semantics and proper timeout tuning.

Overview
Adds LMCacheMPPollingSchedulerAdapter, an opt-in subclass of LMCacheMPSchedulerAdapter that makes MP-mode lookup synchronous: check_lookup_result now blocks and polls until the LMCache prefetch job completes, instead of returning None and requiring callers to re-check across scheduling rounds.

The polling implementation uses a configurable poll_interval and overall lookup_timeout, cleans up per-request lookup state on timeout/unhealthy server, and carefully reuses a single in-flight MQ future to avoid losing results under the server’s exactly-once semantics.

^{Written by Cursor Bugbot for commit 1a226d9. This will update automatically on new commits. Configure here.}

gemini-code-assist · 2026-03-11T07:00:40Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new synchronous prefetch lookup capability for LMCache in multi-process mode. This enhancement provides an alternative to the existing asynchronous two-step lookup, allowing for a single, blocking call that waits for prefetch completion. The primary goal is to reduce scheduling latency, particularly beneficial for short requests or high cache hit rates, by streamlining the prefetch result retrieval process.

Highlights

New Synchronous Interface: A new request type, LOOKUP_AND_PREFETCH_SYNC, and its corresponding protocol definition have been introduced to enable synchronous prefetch lookups.
Blocking Prefetch Lookup Implementation: The MPCacheEngine now includes a lookup_and_prefetch_sync method that initiates a prefetch job and then actively polls its status at 5ms intervals until the prefetch operation is completed, returning the matched token count directly.
VLLM Integration: The vllm_multi_process_adapter.py file has been updated with maybe_submit_lookup_and_prefetch_request, allowing the vLLM connector to utilize the new one-shot blocking lookup mechanism.
Non-blocking Server Operation: The blocking nature of the synchronous lookup is handled on the server's dedicated BLOCKING handler thread pool, ensuring that the main asynchronous event loop remains responsive and unblocked.
Opt-in Functionality: This synchronous interface is an optional feature, configurable via lmcache.mp.sync_lookup on the vLLM connector side, preserving the existing two-step asynchronous prefetch flow as the default.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

lmcache/integration/vllm/vllm_multi_process_adapter.py
- Added maybe_submit_lookup_and_prefetch_request for a one-shot blocking lookup and prefetch.
lmcache/v1/multiprocess/protocols/base.py
- Added LOOKUP_AND_PREFETCH_SYNC to the RequestType enum.
lmcache/v1/multiprocess/protocols/engine.py
- Added LOOKUP_AND_PREFETCH_SYNC to the __all__ export list.
- Defined ProtocolDefinition for LOOKUP_AND_PREFETCH_SYNC specifying its payload, response, and handler type.
lmcache/v1/multiprocess/server.py
- Implemented lookup_and_prefetch_sync method which performs a lookup and then polls query_prefetch_status until the prefetch completes.
- Registered the lookup_and_prefetch_sync method as a handler for RequestType.LOOKUP_AND_PREFETCH_SYNC.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a synchronous prefetch lookup mechanism (LOOKUP_AND_PREFETCH_SYNC) to LMCache in multi-process mode, designed to reduce scheduling latency by allowing the vLLM connector to block until prefetch completes. While this is a valuable addition for prefetch strategies, the server-side implementation of the synchronous polling loop has been identified as vulnerable to a Denial of Service (DoS) attack. This vulnerability arises from predictable job IDs and the lack of a timeout, which could indefinitely block a worker thread if a job ID is compromised or invalid, especially critical with the default single-worker thread pool. It is recommended to implement a timeout in the polling loop and improve the security of job identifiers to mitigate this risk.

gemini-code-assist · 2026-03-11T07:03:38Z

+        while True:
+            result = self.query_prefetch_status(job_id)
+            if result is not None:
+                return result
+            time.sleep(poll_interval)


The lookup_and_prefetch_sync method implements a while True loop that polls the status of a prefetch job using self.query_prefetch_status(job_id). However, query_prefetch_status returns None both when a job is still in progress and when a job_id is not found (e.g., if it was already consumed or never existed).

Because the loop lacks a timeout and does not verify the continued existence of the job, it will spin indefinitely if a job_id becomes invalid. Since job_ids are predictable incrementing counters, a malicious client could 'steal' a job_id by querying it first, causing the legitimate caller's thread to hang forever. Given the default configuration of a single worker thread (max_workers=1), this allows a single request to cause a complete Denial of Service (DoS) for all blocking operations on the server.

Recommendation:

Implement a maximum timeout for the synchronous lookup loop.

Modify the logic to check if the job still exists in the _prefetch_jobs dictionary during each iteration.

Use unpredictable identifiers (e.g., UUIDs) for job_id to prevent cross-client interference.

gemini-code-assist · 2026-03-11T07:03:38Z

+            [key, self.tp_size],
+        ).result()
+
+        return (result or 0) * self.chunk_size


The response_class for LOOKUP_AND_PREFETCH_SYNC is defined as int in protocols/engine.py. If result can indeed be None here, the protocol definition should be updated to int | None for accuracy. If result is guaranteed to be an integer, then or 0 is redundant.

Suggested change

return (result or 0) * self.chunk_size

return result * self.chunk_size

gemini-code-assist · 2026-03-11T07:03:39Z

+        # Returns: int - prefix_hits (chunk count), blocks until prefetch completes
+        "LOOKUP_AND_PREFETCH_SYNC": ProtocolDefinition(
+            payload_classes=[KeyType, int],
+            response_class=int,


The response_class is set to int. However, in vllm_multi_process_adapter.py, the code uses (result or 0), implying that result might be None. Please ensure consistency: if the server can return None, the response_class should be int | None. If it always returns an int, the client-side or 0 can be removed.

gemini-code-assist · 2026-03-11T07:03:39Z

+        key: IPCCacheEngineKey,
+        tp_size: int,
+    ) -> int:
+        """同步版 lookup：提交 prefetch 并阻塞等待结果，直接返回 prefix_hits。


For consistency with the English PR description and the overall codebase, it would be beneficial to translate this docstring to English. This improves maintainability and readability for all contributors.

gemini-code-assist · 2026-03-11T07:03:39Z

+        """
+        # 轮询间隔 5ms：每次轮询需要抢 2 把锁（_prefetch_job_lock 和
+        # _results_lock），间隔太小会与后台 prefetch 线程产生锁竞争，影响吞吐。
+        poll_interval = 0.005


The poll_interval is a magic number. It would be better to define this as a named constant, possibly configurable via MPServerConfig, to improve readability and maintainability. This makes it easier to understand its purpose and adjust it if needed.

Suggested change

poll_interval = 0.005

POLL_INTERVAL_SECONDS = 0.005

job_id = self.lookup(key, tp_size)

while True:

result = self.query_prefetch_status(job_id)

if result is not None:

return result

time.sleep(POLL_INTERVAL_SECONDS)

ApostaC

Hey @liuyumoye , lookup_and_prefetch_sync will block the LMCache server's worker and delay the other store and retrieve operations.

Therefore, can we introduce the periodic polling thread in the vllm_multi_process_adapter.py? This can avoid all the code changes in v1/multiprocess.

ApostaC · 2026-03-12T18:25:59Z

Additionally, if we are going to have a special lmcache_mp_connector.py anyway, it might be a good idea to have the new scheduler adapter (with lookup polling) also implemented in a new class. The new class can inherit from the existing LMCacheMPSchedulerAdapter and only overrides the check_lookup_result function, IIUC.

maobaolong · 2026-03-15T02:37:55Z

+        token_ids: list[int],
+    ) -> int:
+        """
+        同步版 lookup：提交 prefetch 并阻塞等待结果，直接返回匹配的 token 数。


Convent Chinese to English

liuyumoye · 2026-03-15T05:57:11Z

Additionally, if we are going to have a special lmcache_mp_connector.py anyway, it might be a good idea to have the new scheduler adapter (with lookup polling) also implemented in a new class. The new class can inherit from the existing LMCacheMPSchedulerAdapter and only overrides the check_lookup_result function, IIUC.

Thanks for the suggestion! The current implementation already avoids blocking the server worker — check_lookup_result polls QUERY_PREFETCH_STATUS on the scheduler adapter side, so the server's worker thread remains free for other operations.

liuyumoye · 2026-03-15T06:01:25Z

Additionally, if we are going to have a special lmcache_mp_connector.py anyway, it might be a good idea to have the new scheduler adapter (with lookup polling) also implemented in a new class. The new class can inherit from the existing LMCacheMPSchedulerAdapter and only overrides the check_lookup_result function, IIUC.

Great point, and this is exactly what the current implementation does! LMCacheMPPollingSchedulerAdapter in vllm_multi_process_adapter.py inherits from LMCacheMPSchedulerAdapter and only overrides check_lookup_result to add the polling-until-done behavior. The lmcache_mp_connector.py then uses this subclass on the scheduler side. Please let me know if you'd like any further adjustments to the class structure!

cursor · 2026-04-01T07:34:27Z

+                )
+                self._lookup_job_ids.pop(request_id, None)
+                return 0
+            time.sleep(self._poll_interval)


No unit tests for new public class

Medium Severity

LMCacheMPPollingSchedulerAdapter is a new public class with non-trivial polling and timeout logic, but the PR includes zero tests. The project's style guide requires new features to include corresponding tests, and AGENTS.md states "All new features and bug fixes should include corresponding tests."

^{Triggered by project rule: LMCache Code Review Style Guide}

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

cursor · 2026-04-01T08:14:20Z

+                    job_id,
+                )
+                self._lookup_job_ids.pop(request_id, None)
+                self._finished_lookup_jobs.pop(job_id, None)


Subclass accesses parent's private members directly

Low Severity

LMCacheMPPollingSchedulerAdapter.check_lookup_result directly reads and mutates self._lookup_job_ids and self._finished_lookup_jobs, which are private (_-prefixed) attributes defined on the parent class LMCacheMPSchedulerAdapter. The project convention (AGENTS.md) states "never access private members of other classes," and a subclass is a distinct class boundary.

^{Triggered by project rule: LMCache Code Review Style Guide}

…ch lookup in MP mode Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>

gemini-code-assist Bot reviewed Mar 11, 2026

View reviewed changes

liuyumoye closed this Mar 11, 2026

liuyumoye reopened this Mar 12, 2026

liuyumoye changed the title ~~[MP] feat: add LOOKUP_AND_PREFETCH_SYNC for synchronous prefetch lookup in MP mode~~ [WIP][MP] feat: add LOOKUP_AND_PREFETCH_SYNC for synchronous prefetch lookup in MP mode Mar 12, 2026

ApostaC reviewed Mar 12, 2026

View reviewed changes

maobaolong reviewed Mar 15, 2026

View reviewed changes

liuyumoye force-pushed the dev branch from 40aa79b to 1ef306f Compare March 15, 2026 05:24

liuyumoye closed this Mar 15, 2026

liuyumoye force-pushed the dev branch from 1ef306f to dedea50 Compare March 15, 2026 05:33

liuyumoye reopened this Mar 15, 2026

liuyumoye force-pushed the dev branch from c303d5d to a34e01d Compare March 15, 2026 05:45

liuyumoye force-pushed the dev branch 2 times, most recently from 29db2b1 to cbfa48a Compare March 15, 2026 06:22

ApostaC mentioned this pull request Mar 19, 2026

[MP][Feat] Query lookup-phase status for MP mode #2818

Merged

2 tasks

liuyumoye closed this Mar 28, 2026

liuyumoye reopened this Apr 1, 2026

liuyumoye changed the title ~~[WIP][MP] feat: add LOOKUP_AND_PREFETCH_SYNC for synchronous prefetch lookup in MP mode~~ [MP] feat: add LMCacheMPPollingSchedulerAdapter for synchronous lookup and prefetch in MP mode Apr 1, 2026

cursor Bot reviewed Apr 1, 2026

View reviewed changes

Comment thread lmcache/integration/vllm/vllm_multi_process_adapter.py Outdated

Comment thread lmcache/integration/vllm/vllm_multi_process_adapter.py Outdated

liuyumoye force-pushed the dev branch from cbfa48a to c0951b9 Compare April 1, 2026 07:30

cursor Bot reviewed Apr 1, 2026

View reviewed changes

liuyumoye force-pushed the dev branch from c0951b9 to 02e9084 Compare April 1, 2026 08:07

cursor Bot reviewed Apr 1, 2026

View reviewed changes

[MP]feat: add LMCacheMPPollingSchedulerAdapter for synchronous prefet…

1a226d9

…ch lookup in MP mode Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>

liuyumoye force-pushed the dev branch from 02e9084 to 1a226d9 Compare April 2, 2026 02:38

	return (result or 0) * self.chunk_size
	return result * self.chunk_size

Conversation

liuyumoye commented Mar 11, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot commented Mar 11, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

ApostaC left a comment

Choose a reason for hiding this comment

Uh oh!

ApostaC commented Mar 12, 2026

Uh oh!

maobaolong Mar 15, 2026

Choose a reason for hiding this comment

Uh oh!

liuyumoye commented Mar 15, 2026

Uh oh!

liuyumoye commented Mar 15, 2026

Uh oh!

Uh oh!

Uh oh!

cursor Bot Apr 1, 2026

Choose a reason for hiding this comment

No unit tests for new public class

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot Apr 1, 2026

Choose a reason for hiding this comment

Subclass accesses parent's private members directly

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

liuyumoye commented Mar 11, 2026 •

edited by cursor Bot

Loading