[Bugfix] fix crash in wait_for_save when retrieve fail from lmcache_engine #2516
Conversation
Summary of Changes (gemini-code-assist): This pull request resolves a critical bug in the LMCache integration with vLLM, specifically addressing a crash that occurs when the cache retrieval mechanism fails and vLLM rolls back a request's computed tokens. The changes ensure that the internal state of the LMCache connector remains consistent with vLLM's token tracking, thereby preventing system instability and improving the robustness of the caching layer.
Code Review
This pull request addresses a critical crash that occurs when a retrieval from lmcache_engine fails, causing a desynchronization between vLLM's request state and LMCache's internal tracker. The fix correctly handles this token rollback scenario by truncating the tracker state to match vLLM's state. The implementation is robust, including a check for available token slots. I've included one suggestion to refactor the new logic for improved clarity and efficiency.
Force-pushed from 051cc4c to c831d95
…ngine Signed-off-by: liubj77 <liubj77@gmail.com>
Force-pushed from 105291c to ed36ab6
Hello @liubj77,
hi @DongDongJu, you probably entered these comments in the wrong context? This PR is not from me :) PS: my pending PRs are #2467, #2478, #2509, please also help review them. Thanks! 👍
BTW, for the retrieve-fail case... I'm really not aware of such handling in vLLM. Maybe we're not running the exact same version or cases, so I can't say more about this patch.
My bad. Sorry! I will take a look too.
@DongDongJu I discovered this issue while using eic_connector to proxy requests to our backend. The hack code above is to reproduce this scenario.
This logic has been present in vllm from version 0.12 to 0.14; I haven't checked earlier versions.
DongDongJu
left a comment
This makes sense to me.
When KV retrieval fails, vLLM can roll a request back by adjusting request.num_computed_tokens to the longest valid prefix (invalid-block handling).
Before this change, request_tracker.token_ids could remain ahead of num_computed_tokens, which matches the reported wait_for_save crash len(slot_mapping) != len(token_ids).
With this patch, the rollback case (num_current_tokens < len(token_ids)) is detected and the tracker state is truncated, so token_ids stays aligned with vLLM's rolled-back progress and the assertion no longer fires.
Nit: Should num_saved_tokens be clamped to tokens_to_keep to keep it <= len(token_ids) after truncation?
@DongDongJu I don't think these two are the same concept.
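To make the rollback/mismatch described above concrete, here is a small worked example with made-up numbers (nothing below is taken from the actual crash; the block size and token counts are purely illustrative):

```python
# Purely illustrative numbers.
num_computed_tokens = 96          # request prefilled across 6 blocks of 16
tracker_token_ids = list(range(96))

# Retrieval of the last two blocks fails; vLLM rolls the request back to the
# longest valid prefix (invalid-block handling).
num_computed_tokens = 64

# Without the patch the tracker is never told about the rollback:
print(len(tracker_token_ids) > num_computed_tokens)   # True -> token_ids is "ahead"

# With the patch the rollback is detected and the tracker is truncated:
if num_computed_tokens < len(tracker_token_ids):
    tracker_token_ids = tracker_token_ids[:num_computed_tokens]
print(len(tracker_token_ids) == num_computed_tokens)  # True -> back in sync
```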
DongDongJu
left a comment
I indicated the line I mentioned earlier.
    request.all_token_ids[:tokens_to_keep]
)
request_tracker.num_saved_tokens = min(
    request_tracker.num_saved_tokens, num_current_tokens
My concern is here.
IIUC, vLLM uses num_saved_tokens as skip_leading_tokens.
Please correct me if I'm wrong.
Can we change num_current_tokens to tokens_to_keep?
I've reviewed the logic again; as you said, using tokens_to_keep instead of num_current_tokens would be more robust, and I've updated the PR.
When retrieval fails, the computed number of tokens must be less than both the previous num_saved_tokens and last_allocated_block_ids * block_size. Therefore, the condition num_token_slots < num_current_tokens will never be triggered in this case; the check here is just a double check.
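For readers following along, here is a minimal sketch of the truncation being discussed, assembled from the diff fragment quoted above and the surrounding comments — not the PR's exact code. FakeTracker and handle_rollback are hypothetical stand-ins, and the assumption that tokens_to_keep is num_current_tokens clamped to the available token slots follows the "double check" mentioned above:

```python
from dataclasses import dataclass

@dataclass
class FakeTracker:
    # Hypothetical stand-in for the request tracker; field names follow the
    # snippets quoted earlier in this thread.
    token_ids: list
    allocated_block_ids: list
    num_saved_tokens: int = 0

def handle_rollback(tracker: FakeTracker, all_token_ids: list,
                    num_current_tokens: int, block_size: int) -> None:
    """Sketch of the truncation discussed above (not the PR's exact code)."""
    # Double check against the slots covered by the allocated blocks; per the
    # discussion, this clamp should never actually trigger in the rollback case.
    num_token_slots = len(tracker.allocated_block_ids) * block_size
    tokens_to_keep = min(num_current_tokens, num_token_slots)
    if num_current_tokens < len(tracker.token_ids):
        # vLLM rolled the request back after a failed retrieval; shrink the
        # tracker so token_ids matches vLLM's rolled-back progress.
        tracker.token_ids = list(all_token_ids[:tokens_to_keep])
        tracker.num_saved_tokens = min(tracker.num_saved_tokens, tokens_to_keep)
```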
Please check the DCO.
Force-pushed from 53ead90 to b709b6d
@DongDongJu done
I applied this change to my test in #2294; it does not fix the problem. The assertion disappeared, but the next line failed.
I believe other lines should also be modified to completely fix this issue, like this:

def update(...):
    ...
    self.allocated_block_ids.extend(new_block_ids)
    self.token_ids.extend(new_token_ids)

I added logs here and found it still mistakenly extends self.token_ids. Hope this helps!
Can you confirm that these are the same issue? This fixes the len(slot_mapping) != len(token_ids) assertion in wait_for_save.
I guess so. Your fix does get rid of that assertion failure.
Hello @ziruiliu, thanks for the call-out. Let me take a more detailed look today.
Thanks for the fix, @liubj77! 🙏 When retrieval fails for some blocks, lmcache reports invalid blocks and the vLLM scheduler rolls back request.num_computed_tokens in Scheduler._update_requests_with_invalid_blocks, but the allocated block_ids aren't evicted for the request. However, lmcache does not roll back RequestTracker.token_ids accordingly. Later, in lmcache's ReqMeta.from_request_tracker, slot_mapping is derived as:
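The derivation itself was not captured above, but conceptually each token position is mapped to a physical KV slot (block_id * block_size + offset) and the result is truncated to the tracker's token count. Below is a rough, self-contained sketch with made-up numbers — not the verbatim lmcache code — showing one way an un-rolled-back token_ids can leave the two lengths out of sync:

```python
import torch

block_size = 16
allocated_block_ids = [3, 7, 8, 12]   # made-up: 4 blocks = 64 slots
token_ids = list(range(80))           # made-up: tracker holds more token ids than the blocks cover

block_ids = torch.tensor(allocated_block_ids, dtype=torch.long)
offsets = torch.arange(block_size, dtype=torch.long)
# One physical slot per token position, laid out block by block.
slot_mapping = (block_ids.unsqueeze(1) * block_size + offsets.unsqueeze(0)).flatten()
slot_mapping = slot_mapping[: len(token_ids)]   # only 64 slots are available

print(len(slot_mapping), len(token_ids))        # 64 vs 80 -> the lengths disagree
```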
@deng451e would you like to create a PR for the fix discussed offline? |
@deng451e Modifying only
    new_token_ids,
    new_block_ids,
    preempted=preempted,
    lmcache_cached_tokens=lmcache_cached_tokens,
Maybe we should also update load_spec.lmcache_cached_tokens here? Otherwise this change could overwrite num_saved_tokens when updating the request tracker.
I think this is unnecessary. The num_saved_tokens in RequestTracker is only updated in preempted scenarios, which is not the case here.
sammshen
left a comment
LGTM! This is a great fix
DongDongJu
left a comment
Thanks for revising. Good to go.
@liubj77 lots of CI checks failed. PTAL.
Signed-off-by: liubj77 <liubj77@gmail.com>
@DongDongJu The code formatting has already been handled. Is there anything else I need to do? The workflow requires approval to continue running, and links like buildkite/k3-comprehensive-test are showing "page not found".
Please do one more pre-commit run, @liubj77.
Updating the branch to address the "not found"; the k3 tests are non-blocking for now.
Signed-off-by: liubj77 <liubj77@gmail.com>
Head branch was pushed to by a user without write access
Thanks! That solves the issues with unreliable backends for us indeed.
…ngine (LMCache#2516) * [bugfix] fix crash in wait_for_save when retrieve fail from lmcache_engine Signed-off-by: liubj77 <liubj77@gmail.com> * update Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> --------- Signed-off-by: liubj77 <liubj77@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai> Signed-off-by: shaoxiawjc <wjc2800@163.com>
…ngine (LMCache#2516) * [bugfix] fix crash in wait_for_save when retrieve fail from lmcache_engine Signed-off-by: liubj77 <liubj77@gmail.com> * update Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> --------- Signed-off-by: liubj77 <liubj77@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai> Signed-off-by: Aaron Wu <aaron.wu@dell.com>
…ngine (LMCache#2516) * [bugfix] fix crash in wait_for_save when retrieve fail from lmcache_engine Signed-off-by: liubj77 <liubj77@gmail.com> * update Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> * format code Signed-off-by: liubj77 <liubj77@gmail.com> --------- Signed-off-by: liubj77 <liubj77@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
Fixes #2294 #2327 #1732 #2204
@DongDongJu @sammshen Please take a look
What this PR does / why we need it:
In LMCacheConnectorV1Impl, if retrieval from the lmcache_engine fails, vLLM will call _handle_invalid_blocks to process the computed tokens of the request. However, this scenario is not currently handled by LMCacheConnectorV1Impl.
The crash stack is shown below:
It can be easily reproduced:
the token length of the request is greater than 1 block,
force the request to be scheduled in multiple steps,
get_num_new_matched_tokens returns a value greater than 0,
self.lmcache_engine.retrieve returns a failure in start_load_kv (hack _process_tokens_internal in LMCacheEngine.retrieve to simulate the retrieve failure).
The vLLM start command is like below:
The prompt is like below:
The same crash will appear when the above request is received.
Special notes for your reviewers:
After this change, I compared request.num_computed_tokens in self._unfinished_requests with len(self._request_trackers[req_id].token_ids); they are now exactly the same, and there was no way to guarantee that before.
I have been running this modification online for more than 4 days now. Before that, the same stress tests would crash within 3 to 4 hours.
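For anyone wanting to check the same invariant in their own deployment, here is a hedged sketch of such a debug check (a hypothetical helper, not part of this PR; it assumes _unfinished_requests and _request_trackers are both dicts keyed by request id, as the attribute names above suggest):

```python
def check_tracker_alignment(connector) -> None:
    # Debug-only invariant check: after this patch, the tracker's token_ids
    # should match vLLM's num_computed_tokens for every unfinished request.
    for req_id, request in connector._unfinished_requests.items():
        tracker = connector._request_trackers.get(req_id)
        if tracker is None:
            continue
        assert len(tracker.token_ids) == request.num_computed_tokens, (
            f"{req_id}: tracker holds {len(tracker.token_ids)} token ids, "
            f"vLLM reports {request.num_computed_tokens} computed tokens"
        )
```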
If applicable: