[MP][Feat] Query lookup-phase status for MP mode#2818
Conversation
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant enhancement to the prefetching mechanism by allowing early access to lookup-phase results. The primary goal is to decouple the availability of prefix-hit information from the completion of the entire prefetch operation, thereby enabling more agile and informed scheduling decisions. This change provides a more granular view into the prefetch process, allowing systems to react faster based on initial lookup outcomes rather than waiting for the full data load.
Code Review
This pull request introduces a query_prefetch_lookup_hits API to allow for earlier insight into prefetch status. The changes are well-contained and include corresponding updates to the multiprocess protocol and new unit tests. My review identifies two key issues in the PrefetchController: a critical memory leak vulnerability due to a fragile cleanup mechanism that relies on the caller, and a high-severity race condition that could lead to inconsistent state. The feedback provides suggestions to make the new API more robust and thread-safe.
```python
    Note:
        This function does not pop the result. The caller needs to make sure
        to call query_prefetch_result after calling this function; otherwise
        nobody will clean up the completed-lookups dictionary, causing a
        memory leak.
    """
```
The note in this docstring highlights a significant design issue that can lead to memory leaks. Relying on the caller to invoke query_prefetch_result for cleanup makes the API fragile. If a client submits a request but never queries the final result (e.g., due to a crash or logic error), the request ID and its result will be permanently stored in _completed_lookups and _completed_results, causing unbounded memory growth over time.
To make the controller more robust, I strongly recommend implementing an automatic cleanup mechanism within PrefetchController itself. A good approach would be to associate a timestamp with each completed request and have a periodic task (e.g., within _prefetch_loop) that purges entries older than a certain TTL. This would ensure resources are eventually reclaimed without depending on caller behavior.
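The suggested TTL-based cleanup could look roughly like the sketch below. This is illustrative only: the attribute names `_completed_lookups` and `_lookup_results_lock` follow the PR, but the timestamping scheme, the `_purge_stale_lookups` helper, and the TTL value are assumptions, not the actual `PrefetchController` implementation.

```python
import threading
import time


class PrefetchControllerSketch:
    """Hypothetical sketch of TTL-based purging for completed lookup results."""

    def __init__(self, ttl_seconds: float = 300.0):
        self._ttl = ttl_seconds
        self._lookup_results_lock = threading.Lock()
        # request_id -> (lookup result, completion timestamp)
        self._completed_lookups = {}

    def _record_lookup_result(self, request_id, result):
        # Store the result together with a monotonic timestamp so age
        # can be computed later without clock-adjustment surprises.
        with self._lookup_results_lock:
            self._completed_lookups[request_id] = (result, time.monotonic())

    def _purge_stale_lookups(self):
        # Called periodically (e.g. from _prefetch_loop) to drop entries
        # that were never consumed by query_prefetch_result.
        now = time.monotonic()
        with self._lookup_results_lock:
            stale = [
                rid
                for rid, (_, ts) in self._completed_lookups.items()
                if now - ts > self._ttl
            ]
            for rid in stale:
                del self._completed_lookups[rid]
```

With this in place, a client that crashes before querying the final result only leaks an entry until the next purge pass, bounding memory growth.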
```python
with self._prefetch_results_lock:
    result = self._completed_results.pop(request_id, None)
    if result is not None:
        with self._lookup_results_lock:
            self._completed_lookups.pop(request_id, None)
return result
```
There is a race condition here. The cleanup of _completed_results and _completed_lookups is not atomic because the locks are acquired separately. This can lead to incorrect behavior.
Consider this sequence:

1. Thread 1 calls `query_prefetch_result`, pops from `_completed_results`, and releases `_prefetch_results_lock`.
2. Thread 2 calls `query_lookup_result` for the same `request_id`. It will successfully get the result from `_completed_lookups`.
3. Thread 1 proceeds to pop from `_completed_lookups`.

This violates the documented contract that `query_lookup_result` should return `None` if the request has been consumed by `query_prefetch_result`.
To fix this, you should acquire both locks together to ensure atomicity. Since no other part of the code holds both locks, you can establish a consistent locking order (e.g., _prefetch_results_lock then _lookup_results_lock) to prevent deadlocks.
```diff
-with self._prefetch_results_lock:
-    result = self._completed_results.pop(request_id, None)
-    if result is not None:
-        with self._lookup_results_lock:
-            self._completed_lookups.pop(request_id, None)
-return result
+with self._prefetch_results_lock, self._lookup_results_lock:
+    result = self._completed_results.pop(request_id, None)
+    if result is not None:
+        self._completed_lookups.pop(request_id, None)
+return result
```
@ApostaC The blend server test is failing
Signed-off-by: ApsotaC <yihua98@uchicago.edu>
Force-pushed from c193fbf to f753f17
@YaoJiayi Should be fixed now

@maobaolong @yoo-kumaneko PTAL, this is related to #2735 and #2769
```python
        self._prefetch_jobs[job_id] = job
        return job_id

    def query_prefetch_lookup_hits(
```
Also normalize: `found_count = found_count // job.world_size`, like `query_prefetch_status`?
to keep the two APIs consistent
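The suggested normalization can be sketched as follows. This is illustrative only: the reasoning that each of the job's `world_size` workers reports hits for the same token range (so the raw sum overcounts by a factor of `world_size`) is an inference from the reviewer's comment, and the `normalize_hits` helper is hypothetical, not code from the PR.

```python
def normalize_hits(found_count: int, world_size: int) -> int:
    """Divide the aggregated hit count by the number of workers.

    Mirrors the reviewer's suggested line:
        found_count = found_count // job.world_size
    """
    # Integer division matches the suggestion; any remainder from a
    # partially reporting worker is rounded down.
    return found_count // world_size


# Example: 4 workers each reporting 2 hits for the same tokens.
assert normalize_hits(8, 4) == 2
```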
Is there a related PR for the vLLM-side changes? @ApostaC
@yoo-kumaneko Would you like to take a look at this PR? Thanks!
Signed-off-by: ApsotaC <yihua98@uchicago.edu>
* fix dco and failed tests Signed-off-by: ApsotaC <yihua98@uchicago.edu>
What this PR does / why we need it:

Adds a new `query_prefetch_lookup_hits` API that lets callers check L2 prefix hits as soon as the lookup phase finishes, without waiting for the full prefetch (load) to complete. This enables earlier scheduling decisions in the engine when only the hit count matters.

Changes:

- `PrefetchController`: new `query_lookup_result()` method with a separate lock and dict for lookup-phase results; `query_prefetch_result()` cleans up lookup entries
- `StorageManager`: new `query_prefetch_lookup_hits()` combining L1 + L2 hit counts
- `MPCacheEngine`: new `query_prefetch_lookup_hits()` handler
- New `QUERY_PREFETCH_LOOKUP_HITS` request type; `QUERY_PREFETCH_STATUS` handler type changed from SYNC to BLOCKING
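The two-phase contract this PR documents (a non-consuming lookup-phase peek, then a final query that atomically cleans up both dictionaries) can be modeled with a small self-contained mock. This is a sketch, not the real `PrefetchController`: the class name `TwoPhaseResults`, the `complete` helper, and the use of a single lock in place of the controller's two locks are all assumptions for illustration.

```python
import threading


class TwoPhaseResults:
    """Mock of the lookup-phase / prefetch-phase query contract."""

    def __init__(self):
        # One lock stands in for the controller's two locks; holding it
        # across both dicts gives the atomicity the review asks for.
        self._lock = threading.Lock()
        self._completed_lookups = {}   # request_id -> lookup-phase hit count
        self._completed_results = {}   # request_id -> final prefetch result

    def complete(self, request_id, hits, result):
        with self._lock:
            self._completed_lookups[request_id] = hits
            self._completed_results[request_id] = result

    def query_lookup_result(self, request_id):
        # Early peek: does NOT pop, so it can be called repeatedly
        # while the full prefetch is still in flight.
        with self._lock:
            return self._completed_lookups.get(request_id)

    def query_prefetch_result(self, request_id):
        # Final query: pops both dicts in one critical section, so a
        # concurrent query_lookup_result never sees a half-cleaned state.
        with self._lock:
            result = self._completed_results.pop(request_id, None)
            if result is not None:
                self._completed_lookups.pop(request_id, None)
            return result
```

After `query_prefetch_result` consumes a request, `query_lookup_result` for the same id returns `None`, matching the contract described in the review.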