[Bugfix] fix crash in wait_for_save when retrieve fail from lmcache_engine #2516
Conversation
Summary of Changes (gemini-code-assist): This pull request resolves a critical bug in the LMCache integration with vLLM, specifically addressing a crash that occurs when the cache retrieval mechanism fails and vLLM rolls back a request's computed tokens. The changes ensure that the internal state of the LMCache connector remains consistent with vLLM's token tracking, thereby preventing system instability and improving the robustness of the caching layer.
Code Review
This pull request addresses a critical crash that occurs when a retrieval from lmcache_engine fails, causing a desynchronization between vLLM's request state and LMCache's internal tracker. The fix correctly handles this token rollback scenario by truncating the tracker state to match vLLM's state. The implementation is robust, including a check for available token slots. I've included one suggestion to refactor the new logic for improved clarity and efficiency.
Force-pushed from 051cc4c to c831d95
…ngine Signed-off-by: liubj77 <liubj77@gmail.com>
Force-pushed from 105291c to ed36ab6
Hello @liubj77,
hi @DongDongJu, you probably entered these comments in the wrong context? This PR is not from me :) PS: my pending PRs are #2467, #2478, #2509, please also help review them. Thanks! 👍
BTW, for the retrieve-fail case... I'm really not aware of such handling in vLLM. Maybe we're not running the exact same version or cases, so I can't say more about this patch.
My bad. Sorry! I will take a look too.
@DongDongJu I discovered this issue while using eic_connector to proxy requests to our backend. The hack code above is to reproduce this scenario.
This logic has been present in vllm from version 0.12 to 0.14; I haven't checked earlier versions.
DongDongJu
left a comment
This makes sense to me.
When KV retrieval fails, vLLM can roll a request back by adjusting request.num_computed_tokens to the longest valid prefix (invalid-block handling).
Before this change, request_tracker.token_ids could remain ahead of num_computed_tokens, which matches the reported wait_for_save crash len(slot_mapping) != len(token_ids).
With this patch, the rollback case (num_current_tokens < len(token_ids)) is detected and the tracker state is truncated, so token_ids stays aligned with vLLM's rolled-back progress and the assertion no longer fires.
Nit: Should num_saved_tokens be clamped to tokens_to_keep to keep it <= len(token_ids) after truncation?
@DongDongJu I don't think these two are the same concept.
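To make the rollback/mismatch described above concrete, here is a small worked example with made-up numbers (nothing below is taken from the actual crash; the block size and token counts are purely illustrative):

```python
# Purely illustrative numbers.
num_computed_tokens = 96          # request prefilled across 6 blocks of 16
tracker_token_ids = list(range(96))

# Retrieval of the last two blocks fails; vLLM rolls the request back to the
# longest valid prefix (invalid-block handling).
num_computed_tokens = 64

# Without the patch the tracker is never told about the rollback:
print(len(tracker_token_ids) > num_computed_tokens)   # True -> token_ids is "ahead"

# With the patch the rollback is detected and the tracker is truncated:
if num_computed_tokens < len(tracker_token_ids):
    tracker_token_ids = tracker_token_ids[:num_computed_tokens]
print(len(tracker_token_ids) == num_computed_tokens)  # True -> back in sync
```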
DongDongJu
left a comment
I indicated the line I mentioned earlier.
    request.all_token_ids[:tokens_to_keep]
)
request_tracker.num_saved_tokens = min(
    request_tracker.num_saved_tokens, num_current_tokens
My concern is here.
IIUC, vLLM uses num_saved_tokens as skip_leading_tokens.
Please correct me if I'm wrong.
Can we change num_current_tokens to tokens_to_keep?
I've reviewed the logic again; as you said, using tokens_to_keep instead of num_current_tokens would be more robust, and I've updated the PR.
When retrieval fails, the computed number of tokens must be less than both the previous num_saved_tokens and last_allocated_block_ids * block_size. Therefore, the condition num_token_slots < num_current_tokens will never be triggered in this case; the check here is just a double check.
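For readers following along, here is a minimal sketch of the truncation being discussed, assembled from the diff fragment quoted above and the surrounding comments — not the PR's exact code. FakeTracker and handle_rollback are hypothetical stand-ins, and the assumption that tokens_to_keep is num_current_tokens clamped to the available token slots follows the "double check" mentioned above:

```python
from dataclasses import dataclass

@dataclass
class FakeTracker:
    # Hypothetical stand-in for the request tracker; field names follow the
    # snippets quoted earlier in this thread.
    token_ids: list
    allocated_block_ids: list
    num_saved_tokens: int = 0

def handle_rollback(tracker: FakeTracker, all_token_ids: list,
                    num_current_tokens: int, block_size: int) -> None:
    """Sketch of the truncation discussed above (not the PR's exact code)."""
    # Double check against the slots covered by the allocated blocks; per the
    # discussion, this clamp should never actually trigger in the rollback case.
    num_token_slots = len(tracker.allocated_block_ids) * block_size
    tokens_to_keep = min(num_current_tokens, num_token_slots)
    if num_current_tokens < len(tracker.token_ids):
        # vLLM rolled the request back after a failed retrieval; shrink the
        # tracker so token_ids matches vLLM's rolled-back progress.
        tracker.token_ids = list(all_token_ids[:tokens_to_keep])
        tracker.num_saved_tokens = min(tracker.num_saved_tokens, tokens_to_keep)
```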
Please check the DCO.
Force-pushed from 53ead90 to b709b6d
@DongDongJu done
I applied this change to my test in #2294; it does not fix the problem. The assertion disappeared, but the next line failed.
I believe other lines should also be modified to completely fix this issue, like this:

def update(...):
    ...
    self.allocated_block_ids.extend(new_block_ids)
    self.token_ids.extend(new_token_ids)

I added logs here and found it still mistakenly extends self.token_ids. Hope this helps!
Can you confirm that these are the same issue? This fixes the len(slot_mapping) != len(token_ids) assertion in wait_for_save.
I guess so. Your fix does get rid of that assertion failure.
Hello @ziruiliu, thanks for the call-out. Let me take a more detailed look today.
Thanks for the fix, @liubj77! 🙏 When retrieval fails for some blocks, lmcache reports invalid blocks and the vLLM scheduler rolls back request.num_computed_tokens in Scheduler._update_requests_with_invalid_blocks, but the allocated block_ids aren't evicted for the request. However, lmcache does not roll back RequestTracker.token_ids accordingly. Later, in lmcache's ReqMeta.from_request_tracker, slot_mapping is derived as:
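The derivation itself was not captured above, but conceptually each token position is mapped to a physical KV slot (block_id * block_size + offset) and the result is truncated to the tracker's token count. Below is a rough, self-contained sketch with made-up numbers — not the verbatim lmcache code — showing one way an un-rolled-back token_ids can leave the two lengths out of sync:

```python
import torch

block_size = 16
allocated_block_ids = [3, 7, 8, 12]   # made-up: 4 blocks = 64 slots
token_ids = list(range(80))           # made-up: tracker holds more token ids than the blocks cover

block_ids = torch.tensor(allocated_block_ids, dtype=torch.long)
offsets = torch.arange(block_size, dtype=torch.long)
# One physical slot per token position, laid out block by block.
slot_mapping = (block_ids.unsqueeze(1) * block_size + offsets.unsqueeze(0)).flatten()
slot_mapping = slot_mapping[: len(token_ids)]   # only 64 slots are available

print(len(slot_mapping), len(token_ids))        # 64 vs 80 -> the lengths disagree
```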
@deng451e would you like to create a PR for the fix discussed offline? |
@deng451e Modifying only
    new_token_ids,
    new_block_ids,
    preempted=preempted,
    lmcache_cached_tokens=lmcache_cached_tokens,
Maybe we should also update load_spec.lmcache_cached_tokens here? Otherwise this change could overwrite num_saved_tokens when updating the request tracker.
I think this is unnecessary. The num_saved_tokens in RequestTracker is only updated in preempted scenarios, which is not the case here.
sammshen
left a comment
LGTM! This is a great fix
DongDongJu
left a comment
Thanks for revising. Good to go.
@liubj77 lots of CI checks failed. PTAL.
Signed-off-by: liubj77 <liubj77@gmail.com>
@DongDongJu The code formatting has already been handled. Is there anything else I need to do? The workflow requires approval to continue running, and links like buildkite/k3-comprehensive-test are showing "page not found".
Please do one more pre-commit run, @liubj77.
Updating the branch to address the "not found"; the k3 tests are non-blocking for now.
Signed-off-by: liubj77 <liubj77@gmail.com>
Head branch was pushed to by a user without write access
Thanks! That solves the issues with unreliable backends for us indeed.
…ngine (LMCache#2516) * [bugfix] fix crash in wait_for_save when retrieve fail from lmcache_engine Signed-off-by: liubj77 <liubj77@gmail.com> * update Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> --------- Signed-off-by: liubj77 <liubj77@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai> Signed-off-by: shaoxiawjc <wjc2800@163.com>
…ngine (LMCache#2516) * [bugfix] fix crash in wait_for_save when retrieve fail from lmcache_engine Signed-off-by: liubj77 <liubj77@gmail.com> * update Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> --------- Signed-off-by: liubj77 <liubj77@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai> Signed-off-by: Aaron Wu <aaron.wu@dell.com>
…ngine (LMCache#2516) * [bugfix] fix crash in wait_for_save when retrieve fail from lmcache_engine Signed-off-by: liubj77 <liubj77@gmail.com> * update Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> --------- Signed-off-by: liubj77 <liubj77@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
Fixes #2294 #2327 #1732 #2204
@DongDongJu @sammshen Please take a look
What this PR does / why we need it:
In LMCacheConnectorV1Impl, if retrieval from the lmcache_engine fails, vLLM will call _handle_invalid_blocks to process the computed tokens of the request. However, this scenario is not currently handled by LMCacheConnectorV1Impl.
The crash stack is shown below:
It can be easily reproduced:
the token length of the request is greater than 1 block,
force the request to be scheduled in multiple steps,
get_num_new_matched_tokens returns a value greater than 0,
self.lmcache_engine.retrieve returns a failure in start_load_kv (hack _process_tokens_internal in LMCacheEngine.retrieve to simulate the retrieve failure).
The vLLM start command is like below:
The prompt is like below:
The same crash will appear when the above request is received.
Special notes for your reviewers:
After this change, I compared request.num_computed_tokens in self._unfinished_requests with len(self._request_trackers[req_id].token_ids); they are now exactly the same, and there was no way to guarantee that before.
I have been running this modification online for more than 4 days now. Before that, the same stress tests would crash within 3 to 4 hours.
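For anyone wanting to check the same invariant in their own deployment, here is a hedged sketch of such a debug check (a hypothetical helper, not part of this PR; it assumes _unfinished_requests and _request_trackers are both dicts keyed by request id, as the attribute names above suggest):

```python
def check_tracker_alignment(connector) -> None:
    # Debug-only invariant check: after this patch, the tracker's token_ids
    # should match vLLM's num_computed_tokens for every unfinished request.
    for req_id, request in connector._unfinished_requests.items():
        tracker = connector._request_trackers.get(req_id)
        if tracker is None:
            continue
        assert len(tracker.token_ids) == request.num_computed_tokens, (
            f"{req_id}: tracker holds {len(tracker.token_ids)} token ids, "
            f"vLLM reports {request.num_computed_tokens} computed tokens"
        )
```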
If applicable: