
[MP][Feat] Query lookup-phase status for MP mode#2818

Merged
ApostaC merged 3 commits into LMCache:dev from ApostaC:local-dev/mp-lookup-interface
Mar 23, 2026

Conversation


@ApostaC ApostaC commented Mar 18, 2026

What this PR does / why we need it:

Adds a new query_prefetch_lookup_hits API that lets callers check L2 prefix hits
as soon as the lookup phase finishes, without waiting for the full prefetch (load)
to complete. This enables earlier scheduling decisions in the engine when only the
hit count matters.

Changes:

  • PrefetchController: new query_lookup_result() method with separate lock and
    dict for lookup-phase results; query_prefetch_result() cleans up lookup entries
  • StorageManager: new query_prefetch_lookup_hits() combining L1 + L2 hit counts
  • MPCacheEngine: new query_prefetch_lookup_hits() handler
  • MP protocol: new QUERY_PREFETCH_LOOKUP_HITS request type; QUERY_PREFETCH_STATUS
    handler type changed from SYNC to BLOCKING
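The two-phase flow this PR exposes (lookup hits available before the load finishes) can be sketched with a toy, single-process stand-in. The class below only mimics the timing and cleanup contract described above; it is not LMCache's actual MP implementation, and all names besides query_prefetch_lookup_hits / query_prefetch_result are illustrative:

```python
import threading
import time

class ToyPrefetcher:
    """Toy stand-in: lookup hits become visible before the load completes."""

    def __init__(self):
        self._lookup_hits = {}   # request_id -> hit count (lookup phase)
        self._results = {}       # request_id -> loaded data (load phase)
        self._lock = threading.Lock()

    def submit(self, request_id, hits, load_delay):
        def run():
            with self._lock:
                self._lookup_hits[request_id] = hits  # lookup finishes fast
            time.sleep(load_delay)                    # loading takes longer
            with self._lock:
                self._results[request_id] = f"data-for-{request_id}"
        threading.Thread(target=run, daemon=True).start()

    def query_prefetch_lookup_hits(self, request_id):
        with self._lock:
            return self._lookup_hits.get(request_id)

    def query_prefetch_result(self, request_id):
        # Per the PR, consuming the final result also cleans up the lookup entry.
        with self._lock:
            result = self._results.pop(request_id, None)
            if result is not None:
                self._lookup_hits.pop(request_id, None)
            return result

pf = ToyPrefetcher()
pf.submit("req-1", hits=128, load_delay=0.2)

# The scheduler can act on the hit count long before the load completes.
hits = None
while hits is None:
    hits = pf.query_prefetch_lookup_hits("req-1")
    time.sleep(0.01)
print("hits:", hits)

# Later, consume the final result, which also frees the lookup entry.
result = None
while result is None:
    result = pf.query_prefetch_result("req-1")
    time.sleep(0.05)
print("result:", result)
```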

If applicable:

  • this PR contains user-facing changes - docs added
  • this PR contains unit tests


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the prefetching mechanism by allowing early access to lookup-phase results. The primary goal is to decouple the availability of prefix hit information from the completion of the entire prefetch operation, thereby enabling more agile and informed scheduling decisions. This change provides a more granular view into the prefetch process, allowing systems to react faster based on initial lookup outcomes rather than waiting for the full data load.

Highlights

  • New API for Lookup-Phase Status: Introduced a new query_prefetch_lookup_hits API that allows callers to check L2 prefix hits as soon as the lookup phase completes, without waiting for the full prefetch (load) to finish.
  • Enhanced Scheduling Decisions: This new capability enables earlier scheduling decisions within the engine, particularly when only the hit count is critical, improving efficiency.
  • PrefetchController Updates: The PrefetchController now includes a query_lookup_result() method with its own lock and dictionary for managing lookup-phase results. The existing query_prefetch_result() method has been updated to clean up these lookup entries.
  • StorageManager Integration: The StorageManager gained a new query_prefetch_lookup_hits() method, which combines both L1 and L2 hit counts.
  • MPCacheEngine and Protocol Changes: A new query_prefetch_lookup_hits() handler was added to MPCacheEngine, and the MP protocol now includes a QUERY_PREFETCH_LOOKUP_HITS request type. Additionally, the QUERY_PREFETCH_STATUS handler type was changed from SYNC to BLOCKING.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a query_prefetch_lookup_hits API to allow for earlier insight into prefetch status. The changes are well-contained and include corresponding updates to the multiprocess protocol and new unit tests. My review identifies two key issues in the PrefetchController: a critical memory leak vulnerability due to a fragile cleanup mechanism that relies on the caller, and a high-severity race condition that could lead to inconsistent state. The feedback provides suggestions to make the new API more robust and thread-safe.

Comment on lines +308 to +312
Note:
This function does not pop the result. The caller need to make sure to call
the query_prefetch_result after calling this function, otherwise nobody
will clean up the completed lookups dictionary, causing memory leak.
"""

Severity: critical

The note in this docstring highlights a significant design issue that can lead to memory leaks. Relying on the caller to invoke query_prefetch_result for cleanup makes the API fragile. If a client submits a request but never queries the final result (e.g., due to a crash or logic error), the request ID and its result will be permanently stored in _completed_lookups and _completed_results, causing unbounded memory growth over time.

To make the controller more robust, I strongly recommend implementing an automatic cleanup mechanism within PrefetchController itself. A good approach would be to associate a timestamp with each completed request and have a periodic task (e.g., within _prefetch_loop) that purges entries older than a certain TTL. This would ensure resources are eventually reclaimed without depending on caller behavior.
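A TTL-based purge along the lines the reviewer suggests might look like the sketch below. This is a hedged illustration, not LMCache code: the class, the ttl_s parameter, and purge_stale() are all hypothetical names, and the periodic caller (e.g. _prefetch_loop) is assumed:

```python
import threading
import time

class LookupResultStore:
    """Illustrative store that timestamps entries and purges stale ones."""

    def __init__(self, ttl_s=60.0):
        self._ttl_s = ttl_s
        self._lock = threading.Lock()
        self._completed = {}  # request_id -> (result, completion_time)

    def record(self, request_id, result, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            self._completed[request_id] = (result, now)

    def query(self, request_id):
        with self._lock:
            entry = self._completed.get(request_id)
            return entry[0] if entry else None

    def purge_stale(self, now=None):
        # Would be called periodically (e.g. from _prefetch_loop) so entries
        # are reclaimed even if the caller never queries the final result.
        now = time.monotonic() if now is None else now
        with self._lock:
            stale = [rid for rid, (_, t) in self._completed.items()
                     if now - t > self._ttl_s]
            for rid in stale:
                del self._completed[rid]
            return len(stale)

store = LookupResultStore(ttl_s=60.0)
store.record("req-1", 42, now=0.0)   # injected clock for determinism
store.record("req-2", 7, now=50.0)
purged = store.purge_stale(now=70.0)  # req-1 is 70s old (> TTL), req-2 is 20s
print(purged, store.query("req-1"), store.query("req-2"))
```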

Comment on lines +336 to +341
with self._prefetch_results_lock:
    result = self._completed_results.pop(request_id, None)
    if result is not None:
        with self._lookup_results_lock:
            self._completed_lookups.pop(request_id, None)
    return result
The reason will be displayed to describe this comment to others. Learn more.

Severity: high

There is a race condition here. The cleanup of _completed_results and _completed_lookups is not atomic because the locks are acquired separately. This can lead to incorrect behavior.

Consider this sequence:

  1. Thread 1 calls query_prefetch_result, pops from _completed_results and releases _prefetch_results_lock.
  2. Thread 2 calls query_lookup_result for the same request_id. It will successfully get the result from _completed_lookups.
  3. Thread 1 proceeds to pop from _completed_lookups.

This violates the documented contract that query_lookup_result should return None if the request has been consumed by query_prefetch_result.

To fix this, you should acquire both locks together to ensure atomicity. Since no other part of the code holds both locks, you can establish a consistent locking order (e.g., _prefetch_results_lock then _lookup_results_lock) to prevent deadlocks.

Suggested change:

- with self._prefetch_results_lock:
-     result = self._completed_results.pop(request_id, None)
-     if result is not None:
-         with self._lookup_results_lock:
-             self._completed_lookups.pop(request_id, None)
-     return result
+ with self._prefetch_results_lock, self._lookup_results_lock:
+     result = self._completed_results.pop(request_id, None)
+     if result is not None:
+         self._completed_lookups.pop(request_id, None)
+     return result
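The combined-lock cleanup the reviewer suggests can be shown in a runnable toy. The class and field names mirror the review comment, but this is a standalone demo, not LMCache's PrefetchController:

```python
import threading

class Controller:
    """Demo of atomic cleanup across two locks (consistent acquisition order)."""

    def __init__(self):
        self._prefetch_results_lock = threading.Lock()
        self._lookup_results_lock = threading.Lock()
        self._completed_results = {}
        self._completed_lookups = {}

    def query_lookup_result(self, request_id):
        with self._lookup_results_lock:
            return self._completed_lookups.get(request_id)

    def query_prefetch_result(self, request_id):
        # Acquire both locks together so popping the final result and its
        # lookup entry is atomic with respect to concurrent readers; other
        # code must use the same lock order to avoid deadlock.
        with self._prefetch_results_lock, self._lookup_results_lock:
            result = self._completed_results.pop(request_id, None)
            if result is not None:
                self._completed_lookups.pop(request_id, None)
            return result

c = Controller()
c._completed_results["r1"] = "loaded"
c._completed_lookups["r1"] = 5

result = c.query_prefetch_result("r1")
# After consumption there is no window in which query_lookup_result could
# still observe the lookup entry.
print(result, c.query_lookup_result("r1"))
```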

@YaoJiayi

@ApostaC The blend server test is failing

Signed-off-by: ApsotaC <yihua98@uchicago.edu>
@ApostaC ApostaC force-pushed the local-dev/mp-lookup-interface branch from c193fbf to f753f17 on March 19, 2026 00:26

ApostaC commented Mar 19, 2026

@YaoJiayi Should be fixed now


ApostaC commented Mar 19, 2026

@maobaolong @yoo-kumaneko PTAL, this is related to #2735 and #2769

        self._prefetch_jobs[job_id] = job
        return job_id

    def query_prefetch_lookup_hits(
Contributor

also normalize:

found_count = found_count // job.world_size

like query_prefetch_status?
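Assuming the division reflects each of the job's world_size workers reporting a hit count for the same prefix (an assumption on my part; the thread does not spell out the reason), the normalization the reviewer asks for is simply:

```python
# Hypothetical illustration of the requested normalization; variable names
# follow the review comment, not actual LMCache code.
world_size = 4
per_worker_hits = [128, 128, 128, 128]   # assumed: every rank sees the same prefix
found_count = sum(per_worker_hits)       # raw aggregate across ranks: 512
found_count = found_count // world_size  # per-request hits: 128
print(found_count)
```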

Contributor

to keep the two APIs consistent

Contributor Author

Good catch!

Collaborator

Is there some related PR about vllm side changes? @ApostaC

@sammshen sammshen left a comment

LGTM! one small nit

@maobaolong

@yoo-kumaneko Would you like to take a look at this PR? Thanks!

Signed-off-by: ApsotaC <yihua98@uchicago.edu>
@sammshen sammshen left a comment

LGTM

@ApostaC ApostaC enabled auto-merge (squash) March 23, 2026 22:37
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Mar 23, 2026
@ApostaC ApostaC merged commit 6ff8efa into LMCache:dev Mar 23, 2026
26 of 27 checks passed
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 26, 2026
* fix dco and failed tests

Signed-off-by: ApsotaC <yihua98@uchicago.edu>
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 27, 2026
* fix dco and failed tests

Signed-off-by: ApsotaC <yihua98@uchicago.edu>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
* fix dco and failed tests

Signed-off-by: ApsotaC <yihua98@uchicago.edu>