
[MP] feat: add LMCacheMPPollingSchedulerAdapter for synchronous lookup and prefetch in MP mode#2735

Open
liuyumoye wants to merge 1 commit into LMCache:dev from liuyumoye:dev

Conversation

@liuyumoye
Contributor

@liuyumoye liuyumoye commented Mar 11, 2026

What this PR does / why we need it:
Currently, the MP-mode prefetch lookup is asynchronous and two-step: the vLLM scheduler first calls maybe_submit_lookup_request() to submit a prefetch job, then polls check_lookup_result() in subsequent scheduling rounds to retrieve the result. This adds scheduling latency when the prefetch result is needed immediately (e.g., for short requests or when the cache hit rate is high).

This PR adds LMCacheMPPollingSchedulerAdapter, a subclass of LMCacheMPSchedulerAdapter that provides a synchronous (blocking) prefetch lookup in MP mode.

The base class check_lookup_result returns None immediately when the prefetch job is not yet complete, requiring the caller to retry across multiple scheduling steps. This new subclass overrides check_lookup_result to poll the server in a loop until the prefetch job finishes, so the caller always gets a definitive matched-token count in a single call.

Key behaviors:

  • Polls QUERY_PREFETCH_STATUS at a configurable interval (poll_interval, default 5ms) until the job is done.

  • Enforces a configurable timeout (lookup_timeout, default 5s); if exceeded, logs a warning and returns 0 matched tokens so the request falls back to normal inference without KV cache hit.

  • Fully backward-compatible: the base class behavior is unchanged; callers opt in by using LMCacheMPPollingSchedulerAdapter instead.

Special notes for your reviewers:

  • The polling loop uses time.monotonic() for timeout tracking to avoid wall-clock drift issues.

  • poll_interval=0.005 (5ms) and lookup_timeout=5.0 (5s) are chosen as sensible defaults but are fully configurable via constructor arguments.

  • On timeout, _lookup_job_ids entry is cleaned up to avoid memory leaks.
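Taken together, the behaviors above amount to a bounded polling loop. A minimal sketch follows; the helper name `_query_prefetch_status` and the exact constructor signature are assumptions for illustration, not the PR's code:

```python
import logging
import time

logger = logging.getLogger(__name__)

class PollingSchedulerAdapterSketch:
    """Illustrative stand-in for LMCacheMPPollingSchedulerAdapter
    (the real class subclasses LMCacheMPSchedulerAdapter)."""

    def __init__(self, poll_interval: float = 0.005, lookup_timeout: float = 5.0):
        self._poll_interval = poll_interval      # 5 ms default
        self._lookup_timeout = lookup_timeout    # 5 s default
        self._lookup_job_ids: dict = {}          # request_id -> job_id

    def _query_prefetch_status(self, job_id):
        # Stand-in for the QUERY_PREFETCH_STATUS round trip: returns the
        # matched-token count once done, None while still in progress.
        raise NotImplementedError

    def check_lookup_result(self, request_id: str) -> int:
        job_id = self._lookup_job_ids.get(request_id)
        if job_id is None:
            return 0
        # time.monotonic() is immune to wall-clock adjustments.
        deadline = time.monotonic() + self._lookup_timeout
        while time.monotonic() < deadline:
            result = self._query_prefetch_status(job_id)
            if result is not None:
                self._lookup_job_ids.pop(request_id, None)
                return result
            time.sleep(self._poll_interval)
        # Timeout: warn, clean up state, fall back to normal inference.
        logger.warning("lookup timed out for request %s", request_id)
        self._lookup_job_ids.pop(request_id, None)
        return 0
```

Either exit path pops the per-request entry, which is what keeps `_lookup_job_ids` from leaking on timeout.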

If applicable:

  • this PR contains user-facing changes (docs added)
  • this PR contains unit tests

Note

Medium Risk
Introduces a blocking polling loop around QUERY_PREFETCH_STATUS, which can affect scheduler latency and load on the MQ server if misconfigured. Timeout/health handling reduces worst-case impact but correctness depends on exactly-once response semantics and proper timeout tuning.

Overview
Adds LMCacheMPPollingSchedulerAdapter, an opt-in subclass of LMCacheMPSchedulerAdapter that makes MP-mode lookup synchronous: check_lookup_result now blocks and polls until the LMCache prefetch job completes, instead of returning None and requiring callers to re-check across scheduling rounds.

The polling implementation uses a configurable poll_interval and overall lookup_timeout, cleans up per-request lookup state on timeout/unhealthy server, and carefully reuses a single in-flight MQ future to avoid losing results under the server’s exactly-once semantics.
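The "single in-flight MQ future" point can be illustrated with a small sketch: under exactly-once semantics, re-issuing a status query while one is already outstanding could consume a response and lose it, so the poller caches the outstanding future and reuses it. The `StatusPoller` and `submit_query` names here are hypothetical, not the PR's MQ client API:

```python
from concurrent.futures import Future

class StatusPoller:
    """Keep at most one in-flight status query per job so a server that
    answers each query exactly once never has a response dropped between
    scheduler polls. (Hypothetical names, not the PR's MQ client API.)"""

    def __init__(self, submit_query):
        self._submit_query = submit_query   # callable: job_id -> Future
        self._inflight: dict = {}

    def poll(self, job_id):
        fut = self._inflight.get(job_id)
        if fut is None:
            # Issue a new query only when no request is outstanding.
            fut = self._submit_query(job_id)
            self._inflight[job_id] = fut
        if not fut.done():
            return None   # reuse the same future on the next poll
        self._inflight.pop(job_id, None)
        return fut.result()
```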

Written by Cursor Bugbot for commit 1a226d9.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new synchronous prefetch lookup capability for LMCache in multi-process mode. This enhancement provides an alternative to the existing asynchronous two-step lookup, allowing for a single, blocking call that waits for prefetch completion. The primary goal is to reduce scheduling latency, particularly beneficial for short requests or high cache hit rates, by streamlining the prefetch result retrieval process.

Highlights

  • New Synchronous Interface: A new request type, LOOKUP_AND_PREFETCH_SYNC, and its corresponding protocol definition have been introduced to enable synchronous prefetch lookups.
  • Blocking Prefetch Lookup Implementation: The MPCacheEngine now includes a lookup_and_prefetch_sync method that initiates a prefetch job and then actively polls its status at 5ms intervals until the prefetch operation is completed, returning the matched token count directly.
  • VLLM Integration: The vllm_multi_process_adapter.py file has been updated with maybe_submit_lookup_and_prefetch_request, allowing the vLLM connector to utilize the new one-shot blocking lookup mechanism.
  • Non-blocking Server Operation: The blocking nature of the synchronous lookup is handled on the server's dedicated BLOCKING handler thread pool, ensuring that the main asynchronous event loop remains responsive and unblocked.
  • Opt-in Functionality: This synchronous interface is an optional feature, configurable via lmcache.mp.sync_lookup on the vLLM connector side, preserving the existing two-step asynchronous prefetch flow as the default.
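The server-side `lookup_and_prefetch_sync` flow described in these highlights reduces to roughly the following shape (a sketch with assumed method names, intended for the blocking handler pool rather than the event loop):

```python
import time

POLL_INTERVAL_SECONDS = 0.005  # 5 ms between status checks

def lookup_and_prefetch_sync(self, key, tp_size: int) -> int:
    """Sketch of the blocking handler described above: submit a prefetch
    job, then poll its status until it completes. Intended to run on the
    BLOCKING handler thread pool, never on the async event loop."""
    job_id = self.lookup(key, tp_size)          # kicks off the prefetch
    while True:
        result = self.query_prefetch_status(job_id)
        if result is not None:
            return result                       # matched chunk count
        time.sleep(POLL_INTERVAL_SECONDS)
```

As the review threads on server.py point out, an unbounded `while True` here is risky; a production version needs a timeout.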


Changelog
  • lmcache/integration/vllm/vllm_multi_process_adapter.py
    • Added maybe_submit_lookup_and_prefetch_request for a one-shot blocking lookup and prefetch.
  • lmcache/v1/multiprocess/protocols/base.py
    • Added LOOKUP_AND_PREFETCH_SYNC to the RequestType enum.
  • lmcache/v1/multiprocess/protocols/engine.py
    • Added LOOKUP_AND_PREFETCH_SYNC to the __all__ export list.
    • Defined ProtocolDefinition for LOOKUP_AND_PREFETCH_SYNC specifying its payload, response, and handler type.
  • lmcache/v1/multiprocess/server.py
    • Implemented lookup_and_prefetch_sync method which performs a lookup and then polls query_prefetch_status until the prefetch completes.
    • Registered the lookup_and_prefetch_sync method as a handler for RequestType.LOOKUP_AND_PREFETCH_SYNC.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a synchronous prefetch lookup mechanism (LOOKUP_AND_PREFETCH_SYNC) to LMCache in multi-process mode, designed to reduce scheduling latency by allowing the vLLM connector to block until prefetch completes. While this is a valuable addition for prefetch strategies, the server-side implementation of the synchronous polling loop has been identified as vulnerable to a Denial of Service (DoS) attack. This vulnerability arises from predictable job IDs and the lack of a timeout, which could indefinitely block a worker thread if a job ID is compromised or invalid, especially critical with the default single-worker thread pool. It is recommended to implement a timeout in the polling loop and improve the security of job identifiers to mitigate this risk.

Comment thread lmcache/v1/multiprocess/server.py Outdated
Comment on lines +700 to +704
while True:
    result = self.query_prefetch_status(job_id)
    if result is not None:
        return result
    time.sleep(poll_interval)

Severity: high (security)

The lookup_and_prefetch_sync method implements a while True loop that polls the status of a prefetch job using self.query_prefetch_status(job_id). However, query_prefetch_status returns None both when a job is still in progress and when a job_id is not found (e.g., if it was already consumed or never existed).

Because the loop lacks a timeout and does not verify the continued existence of the job, it will spin indefinitely if a job_id becomes invalid. Since job_ids are predictable incrementing counters, a malicious client could 'steal' a job_id by querying it first, causing the legitimate caller's thread to hang forever. Given the default configuration of a single worker thread (max_workers=1), this allows a single request to cause a complete Denial of Service (DoS) for all blocking operations on the server.

Recommendation:

  1. Implement a maximum timeout for the synchronous lookup loop.
  2. Modify the logic to check if the job still exists in the _prefetch_jobs dictionary during each iteration.
  3. Use unpredictable identifiers (e.g., UUIDs) for job_id to prevent cross-client interference.
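The three recommendations could be combined along the following lines. This is a hedged sketch: the `_prefetch_jobs` layout and helper names are assumptions, not the actual server code.

```python
import time
import uuid

def lookup_and_prefetch_sync_safe(self, key, tp_size,
                                  poll_interval=0.005, timeout=5.0):
    """Sketch applying all three recommendations above (helper names and
    the _prefetch_jobs layout are assumptions, not the server's code)."""
    job_id = uuid.uuid4().hex                    # (3) unguessable identifier
    self._prefetch_jobs[job_id] = self.lookup(key, tp_size)
    deadline = time.monotonic() + timeout        # (1) hard upper bound
    while time.monotonic() < deadline:
        if job_id not in self._prefetch_jobs:    # (2) job vanished: bail out
            return 0
        result = self.query_prefetch_status(job_id)
        if result is not None:
            self._prefetch_jobs.pop(job_id, None)
            return result
        time.sleep(poll_interval)
    self._prefetch_jobs.pop(job_id, None)        # no leaked state on timeout
    return 0
```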

[key, self.tp_size],
).result()

return (result or 0) * self.chunk_size

medium

The response_class for LOOKUP_AND_PREFETCH_SYNC is defined as int in protocols/engine.py. If result can indeed be None here, the protocol definition should be updated to int | None for accuracy. If result is guaranteed to be an integer, then or 0 is redundant.

Suggested change
return (result or 0) * self.chunk_size
return result * self.chunk_size

# Returns: int - prefix_hits (chunk count), blocks until prefetch completes
"LOOKUP_AND_PREFETCH_SYNC": ProtocolDefinition(
payload_classes=[KeyType, int],
response_class=int,

medium

The response_class is set to int. However, in vllm_multi_process_adapter.py, the code uses (result or 0), implying that result might be None. Please ensure consistency: if the server can return None, the response_class should be int | None. If it always returns an int, the client-side or 0 can be removed.

Comment thread lmcache/v1/multiprocess/server.py Outdated
key: IPCCacheEngineKey,
tp_size: int,
) -> int:
"""Synchronous lookup: submit the prefetch, block until the result is ready, and return prefix_hits directly.

medium

For consistency with the English PR description and the overall codebase, it would be beneficial to translate this docstring to English. This improves maintainability and readability for all contributors.

Comment thread lmcache/v1/multiprocess/server.py Outdated
"""
# Poll every 5 ms: each poll acquires two locks (_prefetch_job_lock and
# _results_lock); a smaller interval would contend with the background
# prefetch thread for the locks and hurt throughput.
poll_interval = 0.005

medium

The poll_interval is a magic number. It would be better to define this as a named constant, possibly configurable via MPServerConfig, to improve readability and maintainability. This makes it easier to understand its purpose and adjust it if needed.

Suggested change
poll_interval = 0.005
POLL_INTERVAL_SECONDS = 0.005
job_id = self.lookup(key, tp_size)
while True:
    result = self.query_prefetch_status(job_id)
    if result is not None:
        return result
    time.sleep(POLL_INTERVAL_SECONDS)

@liuyumoye liuyumoye closed this Mar 11, 2026
@liuyumoye liuyumoye reopened this Mar 12, 2026
@liuyumoye liuyumoye changed the title [MP] feat: add LOOKUP_AND_PREFETCH_SYNC for synchronous prefetch lookup in MP mode [WIP][MP] feat: add LOOKUP_AND_PREFETCH_SYNC for synchronous prefetch lookup in MP mode Mar 12, 2026

@ApostaC ApostaC left a comment


Hey @liuyumoye, lookup_and_prefetch_sync will block the LMCache server's worker and delay other store and retrieve operations.

Therefore, can we introduce the periodic polling thread in the vllm_multi_process_adapter.py? This can avoid all the code changes in v1/multiprocess.

@ApostaC
Contributor

ApostaC commented Mar 12, 2026

Additionally, if we are going to have a special lmcache_mp_connector.py anyway, it might be a good idea to have the new scheduler adapter (with lookup polling) also implemented in a new class. The new class can inherit from the existing LMCacheMPSchedulerAdapter and only overrides the check_lookup_result function, IIUC.

token_ids: list[int],
) -> int:
"""
Synchronous lookup: submit the prefetch, block until the result is ready, and return the matched token count directly.

Convert the Chinese to English

@liuyumoye
Contributor Author

Additionally, if we are going to have a special lmcache_mp_connector.py anyway, it might be a good idea to have the new scheduler adapter (with lookup polling) also implemented in a new class. The new class can inherit from the existing LMCacheMPSchedulerAdapter and only overrides the check_lookup_result function, IIUC.

Thanks for the suggestion! The current implementation already avoids blocking the server worker — check_lookup_result polls QUERY_PREFETCH_STATUS on the scheduler adapter side, so the server's worker thread remains free for other operations.

@liuyumoye
Contributor Author

Additionally, if we are going to have a special lmcache_mp_connector.py anyway, it might be a good idea to have the new scheduler adapter (with lookup polling) also implemented in a new class. The new class can inherit from the existing LMCacheMPSchedulerAdapter and only overrides the check_lookup_result function, IIUC.

Great point, and this is exactly what the current implementation does! LMCacheMPPollingSchedulerAdapter in vllm_multi_process_adapter.py inherits from LMCacheMPSchedulerAdapter and only overrides check_lookup_result to add the polling-until-done behavior. The lmcache_mp_connector.py then uses this subclass on the scheduler side. Please let me know if you'd like any further adjustments to the class structure!

@liuyumoye liuyumoye force-pushed the dev branch 2 times, most recently from 29db2b1 to cbfa48a on March 15, 2026 06:22
@liuyumoye liuyumoye closed this Mar 28, 2026
@liuyumoye liuyumoye reopened this Apr 1, 2026
@liuyumoye liuyumoye changed the title [WIP][MP] feat: add LOOKUP_AND_PREFETCH_SYNC for synchronous prefetch lookup in MP mode [MP] feat: add LMCacheMPPollingSchedulerAdapter for synchronous lookup and prefetch in MP mode Apr 1, 2026
Comment thread lmcache/integration/vllm/vllm_multi_process_adapter.py Outdated
)
self._lookup_job_ids.pop(request_id, None)
return 0
time.sleep(self._poll_interval)

No unit tests for new public class

Medium Severity

LMCacheMPPollingSchedulerAdapter is a new public class with non-trivial polling and timeout logic, but the PR includes zero tests. The project's style guide requires new features to include corresponding tests, and AGENTS.md states "All new features and bug fixes should include corresponding tests."
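For reference, tests of the kind this comment asks for could look like the following pytest-style sketch, with the server round trip replaced by a scripted test double (all names here are hypothetical, not the project's fixtures):

```python
import time

class FakePollingAdapter:
    """Stand-in for LMCacheMPPollingSchedulerAdapter with the server
    call stubbed out (hypothetical test double, not project code)."""

    def __init__(self, statuses, poll_interval=0.001, lookup_timeout=0.05):
        self._statuses = iter(statuses)   # scripted QUERY_PREFETCH_STATUS replies
        self._poll_interval = poll_interval
        self._lookup_timeout = lookup_timeout
        self._lookup_job_ids = {"req": 1}

    def check_lookup_result(self, request_id):
        deadline = time.monotonic() + self._lookup_timeout
        while time.monotonic() < deadline:
            status = next(self._statuses, None)
            if status is not None:
                self._lookup_job_ids.pop(request_id, None)
                return status
            time.sleep(self._poll_interval)
        self._lookup_job_ids.pop(request_id, None)
        return 0

def test_returns_matched_tokens_after_polling():
    # Two in-progress replies, then a definitive matched-token count.
    adapter = FakePollingAdapter([None, None, 256])
    assert adapter.check_lookup_result("req") == 256

def test_timeout_returns_zero_and_cleans_up():
    # Server never finishes: expect the 0-token fallback and no leaked state.
    adapter = FakePollingAdapter([None] * 1000)
    assert adapter.check_lookup_result("req") == 0
    assert "req" not in adapter._lookup_job_ids
```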


Triggered by project rule: LMCache Code Review Style Guide


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).


Comment thread lmcache/integration/vllm/vllm_multi_process_adapter.py
job_id,
)
self._lookup_job_ids.pop(request_id, None)
self._finished_lookup_jobs.pop(job_id, None)

Subclass accesses parent's private members directly

Low Severity

LMCacheMPPollingSchedulerAdapter.check_lookup_result directly reads and mutates self._lookup_job_ids and self._finished_lookup_jobs, which are private (_-prefixed) attributes defined on the parent class LMCacheMPSchedulerAdapter. The project convention (AGENTS.md) states "never access private members of other classes," and a subclass is a distinct class boundary.
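One way to satisfy that convention without duplicating state is for the base class to expose small protected helpers, so subclasses never reach into the dictionaries directly. A sketch with hypothetical method names:

```python
class SchedulerAdapterBaseSketch:
    """Sketch: the base class owns _lookup_job_ids and
    _finished_lookup_jobs and exposes narrow helpers for subclasses.
    (Hypothetical method names, not the PR's API.)"""

    def __init__(self):
        self._lookup_job_ids: dict = {}        # request_id -> job_id
        self._finished_lookup_jobs: dict = {}  # job_id -> result

    def _pending_job_id(self, request_id):
        return self._lookup_job_ids.get(request_id)

    def _discard_lookup_state(self, request_id, job_id):
        # Single place that knows both dicts must be cleaned together.
        self._lookup_job_ids.pop(request_id, None)
        self._finished_lookup_jobs.pop(job_id, None)

class PollingAdapterSketch(SchedulerAdapterBaseSketch):
    def abandon(self, request_id):
        # Subclass uses only the protected helpers, never the dicts.
        job_id = self._pending_job_id(request_id)
        if job_id is not None:
            self._discard_lookup_state(request_id, job_id)
```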


Triggered by project rule: LMCache Code Review Style Guide

…ch lookup in MP mode

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>