[Edge Case]: Preemption Loading #2007
Conversation
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Summary of Changes

Hello @sammshen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the KV cache loading mechanism by adding robust support for preempted requests. It refactors how request state is managed when a request resumes, ensuring that the entire token sequence is considered for cache lookups, and optimizes the lookup process with a new preliminary cache check. These changes help maintain correctness and efficiency when interacting with vLLM's preemption, allowing preempted requests to recover and continue processing smoothly.
Code Review
This pull request introduces significant changes to handle request preemption, primarily by making the logic in get_num_new_matched_tokens idempotent and updating the RequestTracker state for preempted requests. The changes are generally well-structured, especially the refactoring of the lookup client to support idempotency. However, I've identified a critical issue in how the RequestTracker's token history is updated for preempted requests, which could lead to incorrect caching behavior. Additionally, there's an opportunity to improve code maintainability by refactoring some duplicated logic. My detailed feedback is in the comments below.
if preempted:
    # the block ids will change after preemption
    self.allocated_block_ids = new_block_ids
    # reset the number of saved tokens
    self.num_saved_tokens = lmcache_cached_tokens
    self.token_ids = []
else:
    self.allocated_block_ids.extend(new_block_ids)
When a request is preempted, self.token_ids is reset to an empty list. It is then extended with new_token_ids, which, based on its calculation in build_connector_meta, is only a slice of newly scheduled tokens. This results in the tracker losing the history of previously processed tokens for the preempted request, which can lead to an inconsistent state and incorrect caching behavior.
If vLLM rewinds the request upon preemption, token_ids should be truncated to the correct length, not completely reset. If there is no rewind, simply appending new_token_ids should be sufficient. The current implementation appears to be a bug. Removing the reset of token_ids would be a safer approach, although handling potential rewinds might require more context.
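For concreteness, here is a hypothetical sketch of the truncation variant described above; the helper name update_on_preemption and the num_computed_tokens argument are assumptions for illustration, not code from this PR:

def update_on_preemption(self, new_block_ids, new_token_ids,
                         lmcache_cached_tokens, num_computed_tokens):
    # Block ids are reallocated after preemption, so replace them outright.
    self.allocated_block_ids = new_block_ids
    # Restart the saved-token counter from what LMCache already holds.
    self.num_saved_tokens = lmcache_cached_tokens
    # Keep only the prefix vLLM still considers computed, then append the
    # newly scheduled tokens instead of dropping the whole history.
    self.token_ids = self.token_ids[:num_computed_tokens]
    self.token_ids.extend(new_token_ids)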
if preempted:
    # the block ids will change after preemption
    self.allocated_block_ids = new_block_ids
    # reset the number of saved tokens
    self.num_saved_tokens = lmcache_cached_tokens
else:
    self.allocated_block_ids.extend(new_block_ids)

load_spec = self.load_specs.pop(req_id, None)
lmcache_cached_tokens = 0
if load_spec is not None:
    lmcache_cached_tokens = load_spec.lmcache_cached_tokens
This block of code to get load_spec and lmcache_cached_tokens is duplicated from lines 1519-1522. To improve maintainability and reduce redundancy, consider extracting this logic into a private helper method. For example:
def _get_lmcache_tokens_from_load_spec(self, req_id: str) -> tuple[Optional[LoadSpec], int]:
    load_spec = self.load_specs.pop(req_id, None)
    lmcache_cached_tokens = 0
    if load_spec is not None:
        lmcache_cached_tokens = load_spec.lmcache_cached_tokens
    return load_spec, lmcache_cached_tokens

You could then call this helper in both places.
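For example, a hypothetical call site, assuming req_id is already in scope as in the quoted code:

# Replaces the duplicated pop/default logic at both call sites.
load_spec, lmcache_cached_tokens = self._get_lmcache_tokens_from_load_spec(req_id)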
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
class LookupClientInterface(metaclass=abc.ABCMeta):
    """Abstract interface for lookup clients."""

    def lookup_cache(self, lookup_id: str) -> Optional[int]:
The default does nothing; it simply returns None.
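A sketch of what that default could look like; the actual body in lmcache/v1/lookup_client/abstract_client.py may differ slightly:

import abc
from typing import Optional


class LookupClientInterface(metaclass=abc.ABCMeta):
    """Abstract interface for lookup clients."""

    def lookup_cache(self, lookup_id: str) -> Optional[int]:
        # Default: no previously computed lookup result is available for this
        # id, so callers fall back to a full lookup.
        return None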
| f"{self.load_specs[request.request_id].lmcache_cached_tokens} - " | ||
| f"{self.load_specs[request.request_id].vllm_cached_tokens}" | ||
| f" for request {request.request_id}" | ||
| recalc_last = ( |
The logic is the same; we also check for the prompt-hit case with recalc_last.
self._request_trackers.pop(finished_req_id, None)
self._unfinished_requests.pop(finished_req_id, None)

# We should load KV for:
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
self.allocated_block_ids = new_block_ids
# reset the number of saved tokens
self.num_saved_tokens = lmcache_cached_tokens
# we don't need to extend the token ids in the preempted case
Because the running request was preempted, there is no previous step that generated an extra token to append.
    the number of tokens that can be loaded from the
    external KV cache beyond what is already computed.
    """
# to handle preempted requests, we want `get_num_new_matched_tokens` to be
this is the main logic for preemption
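The PR's exact implementation isn't quoted here; as a hypothetical illustration of the idempotency idea (remember the first lookup result per request so that a repeated call after preemption returns the same answer without re-querying the backend), the class name, method signature, and _do_lookup helper below are assumptions:

class IdempotentLookupSketch:
    """Hypothetical sketch, not the PR's actual code."""

    def __init__(self) -> None:
        self._lookup_results: dict[str, int] = {}

    def get_num_new_matched_tokens(self, req_id: str, token_ids: list[int],
                                   num_computed_tokens: int) -> int:
        # First call for this request: query the backend and remember the result.
        if req_id not in self._lookup_results:
            self._lookup_results[req_id] = self._do_lookup(token_ids)
        # Later calls (e.g. after the request is preempted and rescheduled)
        # reuse the stored result, so repeated lookups stay consistent.
        num_hit = self._lookup_results[req_id]
        return max(num_hit - num_computed_tokens, 0)

    def _do_lookup(self, token_ids: list[int]) -> int:
        raise NotImplementedError  # backend-specific query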
# TODO: this is a dangerous reference to the request object inside vllm
if request := self._unfinished_requests.get(req_id):
-    num_current_tokens = len(request_tracker.token_ids)
+    num_current_tokens = request.num_computed_tokens
if we use len(request_tracker.token_ids), this is a serious correctness bug
for i, req_id in enumerate(cached_reqs.req_ids):
    request_tracker = self._request_trackers[req_id]
    num_new_tokens = scheduler_output.num_scheduled_tokens[req_id]
    # TODO: this is a dangerous reference to the request object inside vllm
we can think of a better way to handle this in the future
is_last_prefill = False
-if input_token_len == tracker.prompt_len:
+if input_token_len >= tracker.prompt_len:
Is this a related bug or a separate one?
This is the same bug: the input tokens may be larger than prompt_len when a preempted request is recovered.
req_id = request.request_id

# consult the cache before any processing
if cached_num_hit_toks := self.lookup_client.lookup_cache(lookup_id=req_id):
If we configure a HitLimitLookupClient instance, this would be wrong, because the two clients above do not implement the new API you added. @sammshen
Maybe #2021 can help make sure we have implemented all the methods without forgetting any.
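Independent of whatever #2021 proposes (not shown in this thread), one generic way to catch a forgotten override is to declare the method abstract so that instantiating an incomplete wrapper fails immediately; this is only a hypothetical sketch, and not the approach the PR ultimately takes (see the default-returns-None discussion below):

import abc
from typing import Optional


class LookupClientInterface(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def lookup_cache(self, lookup_id: str) -> Optional[int]:
        """Return the cached hit count for lookup_id, if any."""


class ForgetfulWrapper(LookupClientInterface):
    pass


# ForgetfulWrapper()  # raises TypeError: can't instantiate abstract class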
@sammshen You can add the following code to the tail of lmcache/v1/lookup_client/hit_limit_lookup_client.py
def lookup_cache(self, lookup_id: str) -> Optional[int]:
    return self.actual_lookup_client.lookup_cache(lookup_id)
The default implementation will cover it, right? The default implementation is to return None in lmcache/v1/lookup_client/abstract_client.py.
Thanks @maobaolong, I did not read the hit limit client carefully enough before, but it's updated now!
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
@maobaolong @chunxiaozheng feel free to take a look again
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
…LMCache into localdev/preemption-loading
@sammshen Sorry, as

def lookup_cache(self, lookup_id: str) -> Optional[int]:
    return self.actual_lookup_client.lookup_cache(lookup_id)
@sammshen I pressed the
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
The UT is very useful! Updated @maobaolong
ApostaC left a comment:
LGTM as we discussed offline.
@sammshen Do we want to create a special comprehensive test case if we know how to reproduce the preemption case?

This is a great idea!
* initial commit
* remove abstract default implementation
* change naming of preempted cached requests
* renaming
* finally working
* add lookup_cache to hit limit lookup client
* add lookup_cache to ChunkStatisticsLookupClient

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: maobaolong <baoloongmao@tencent.com>
FIX #1969 #1361
This is the minimal script I used to test the preemption case.
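The original script is not included in this thread; below is a hedged reconstruction of a minimal repro. The model name, the num_gpu_blocks_override value, and the LMCache environment settings are illustrative assumptions, not the author's actual values.

# Hedged reconstruction of a minimal preemption repro (not the author's script).
import os

# Assumed LMCache settings: small chunks, CPU offload enabled.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5"  # GB of CPU offload space

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1", kv_role="kv_both"
    ),
    # Keep the GPU KV pool tiny so running requests get preempted and have to
    # reload their prefix from LMCache when they resume.
    num_gpu_blocks_override=128,
    max_model_len=4096,
    enforce_eager=True,
)

# Several long prompts with long generations to overflow the KV pool.
prompts = ["Write a long story about a city. " * 50 for _ in range(8)]
params = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text[:80])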
Logs before the fix (we see the tokens but don't load them):
Logs after the fix: