[Bugfix] vLLM layerwise save StopIteration #2268
DongDongJu wants to merge 8 commits into LMCache:dev
Conversation
vLLM calls LMCache's layerwise save pump (the store_layer generator) from save_kv_layer() and then finalizes it in wait_for_save(). In practice, wait_for_save() can be invoked when there is nothing to save (e.g. empty connector metadata), or invoked multiple times, and the generator may already be exhausted. Previously, next(layerwise_storer) could raise StopIteration, which vLLM treats as fatal (PEP 479 turns it into RuntimeError in generator contexts), bringing down EngineCore.

Make worker-side layerwise saving idempotent:
- Drain layerwise storers to completion in wait_for_save() and clear them.
- Track req_ids for unpin even when connector metadata is empty.
- Ensure store_layer() yields the expected step count in freeze mode.
- Add an INFO log when ZMQOffloadServer starts to aid debugging.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
parse_cache_key() determines whether a key string includes a layer_id by checking which '@' segment is numeric. The key format now includes a dtype segment, so the layer_id index shifts by one.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
Summary of Changes (Gemini Code Assist): this pull request resolves a stability issue in the vLLM v1 integration with LMCache, where the vLLM engine could crash when an exhausted layerwise save generator raised StopIteration.
Code Review
This pull request effectively addresses a StopIteration crash in the vLLM integration by implementing a more robust handling of layerwise KV saving generators. The changes, including draining generators completely, clearing stale ones, and ensuring proper unpinning of resources, are well-thought-out and directly tackle the described race conditions. A related fix for cache key parsing is also included, improving correctness. My review includes one suggestion to refactor a loop to be more idiomatic, which improves code readability.
Use a simple for-loop to exhaust the store_layer generator in wait_for_save(). Behavior is unchanged, but readability is improved.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
Diff excerpt (store_layer()):

    # plus one extra in wait_for_save()).
    for layer_id in range(self.num_layers):
        yield
    # One extra yield for the "finalize" step.
what's the finalize step? wait_for_save?
Yes, it's the final stage of the store_layer() generator (the extra progress step after the last layer), so the pipeline can finish the last pending put/cleanup.
vLLM reaches it by exhausting the generator in wait_for_save().
But same as before, please correct me if I'm wrong.
could you leave this as a comment here?
Sure will do. Thanks!
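For context, here is a minimal, hypothetical sketch of the yield contract discussed above. The real LMCacheEngine.store_layer() performs the actual per-layer storage work between yields; this only illustrates the yield count, and num_layers is assumed to be the model's layer count.

```python
def store_layer_sketch(num_layers: int):
    """Hypothetical sketch of the layerwise save generator contract.

    Yields num_layers times (one per call from vLLM's save_kv_layer())
    plus one extra time: the "finalize" step, reached when vLLM exhausts
    the generator in wait_for_save().
    """
    for layer_id in range(num_layers):
        # A real implementation would submit layer `layer_id` here.
        yield
    # Extra yield so the pipeline can finish the last pending put/cleanup
    # before the generator is exhausted.
    yield
```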
Diff excerpt (wait_for_save()):

    for _ in layerwise_storer:
        pass

    # Prevent double-finalization if wait_for_save() is called twice.
why will wait_for_save() be called twice?
From my understanding, vLLM can invoke it across multiple forward steps, even when a later step has nothing to save (e.g., empty connector metadata), and without clearing, stale storers from a previous step could be finalized again.
Clearing makes the method idempotent and safe under such call patterns.
But let me know if I'm wrong here.
Apologies for the delay, will review this tomorrow!
sammshen
left a comment
This PR makes sense to me. Thanks for the extremely detailed write-up.
Do you think the generator code needs refactoring, or is most of the complexity unavoidable due to vLLM scheduler behavior? How did you find out about most of these unusual behaviors?
Did you run any tests?
@DongDongJu I'm still a bit confused about why this PR is needed. Maybe it would be more helpful to point out the related vLLM code as well? Do you mean that vLLM can execute multiple forward steps at the same time?
| """ | ||
| parts = key_str.strip().split("@") | ||
| if len(parts) >= 6 and parts[5].isdigit(): | ||
| if len(parts) >= 7 and parts[6].isdigit(): |
This is not the direct cause of the StopIteration crash above, but it is a correctness fix for key parsing that was required to keep layerwise mode working during testing.
Hello @YaoJiayi, happy holidays.
Sure, let me first try to clarify what I understood about layerwise mode's design.
Here is what is being solved, with more detail from the vLLM code.
vLLM v1 calls wait_for_save() at forward exit, but the LMCache save hook is conditional on the connector metadata. So on the vLLM side, the easiest fix for the metadata issue would be to change the vLLM code so that the connector metadata is set to None when there is nothing in it, or to check that the length of the metadata is not 0 before driving the save path. I thought it would be good to have a guard on the LMCache side as well, for the future.
Additionally, the reason this did not happen in the synchronous path: when use_layerwise is off, LMCache isn't driving a multi-yield generator that needs "finalization by next()", so we don't get the specific "StopIteration → EngineCore dies" failure mode. The "weird" behavior (vLLM calling wait_for_save() even on no-save steps) can still happen, but without the layerwise generator in the middle, there's nothing to accidentally advance past completion.
@ApostaC you might also want to take a look.
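A rough sketch of the vLLM-side guard idea mentioned above; the function name and the metadata shape are illustrative, not actual vLLM code:

```python
def should_call_save_hook(connector_metadata) -> bool:
    # Treat both missing and empty connector metadata as a no-save step,
    # so the layerwise generator is never driven when there is nothing to save.
    return connector_metadata is not None and len(connector_metadata) > 0
```

This PR instead adds the equivalent guard on the LMCache side, which also protects against future changes in vLLM's call pattern.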
Hello @sammshen, happy holidays. I believe vLLM's behavior is complex enough, and as I mentioned in my reply on Jiayi's thread, there are multiple cases where this can happen, and more may appear with future vLLM code changes.
Yes, as with the recent issue #2327, we can easily reproduce it with example workloads. I used multi-turn chat with 10 concurrent users.
I pulled the commits in this PR to test and confirmed the issue in #2327 still exists.
Hello @ningziwen, sorry for the misleading indication.
I don't know which patch resolved this, but the problem seems to have disappeared.
Bug fix for #2135 and #2137.
For #2135, this is motivated by @crystalxyz's fix but is a little different.
What this PR does / why we need it:
In the vLLM v1 integration, LMCache performs layerwise KV saving using a Python generator returned by LMCacheEngine.store_layer(). vLLM pumps this generator during attention execution (save_kv_layer) and finalizes it at the end of a forward step (wait_for_save).

Under some scheduling / concurrency patterns, vLLM can call wait_for_save() when there is nothing to save (e.g., empty connector metadata) or can call it multiple times. In the old implementation, LMCache's worker-side adapter could then call next() on an already exhausted store_layer generator, raising StopIteration. vLLM treats that exception as fatal and kills EngineCore (EngineDeadError). In generator contexts, PEP 479 may also convert it into a RuntimeError, making it even more disruptive.
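To make the failure mode concrete, here is a stripped-down, hypothetical sketch of the old finalize pattern; the names are illustrative, not the actual adapter code:

```python
def save_kv_layer_old(layerwise_storers):
    # Called once per attention layer: advance every pending generator one step.
    for storer in layerwise_storers:
        next(storer)

def wait_for_save_old(layerwise_storers):
    # Old finalize step: one more next() per generator. If this runs on a
    # no-save step, or runs twice, a generator may already be exhausted and
    # next() raises StopIteration, which vLLM treats as fatal.
    for storer in layerwise_storers:
        next(storer)
```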
What exactly crashed
The crash chain looked like this: vLLM calls wait_for_save() on a step with nothing left to save → the adapter calls next() on an already exhausted store_layer generator → StopIteration propagates out of the connector → vLLM treats it as fatal and EngineCore dies (EngineDeadError).
Why it did NOT happen often under low concurrency
This bug requires a specific sequence of events that is less common with low concurrency:

1. A previous forward step creates layerwise storers (store_layer generators) and advances them.
2. A later forward step has nothing to save (for example, its connector metadata is empty).
3. Even in step (2), vLLM still calls wait_for_save() at forward exit.
4. If the adapter still holds stale layerwise storers from step (1), calling next() again hits StopIteration.

With low concurrency, batches tend to be more stable: many steps still have at least one savable request, and the save/finalize pattern is more consistent. As concurrency increases, more no-save steps occur (due to scheduling, batching, prefix caching, chunked prefill, compile/warmup behavior, and request composition). This makes the required sequence more frequent and exposes the bug.
Root cause (fundamental issue)
The fundamental issue is a contract mismatch / fragility around using a generator as a multi-stage async pipeline: the adapter assumes vLLM will advance the store_layer generator exactly once per layer in save_kv_layer() and then exactly once more in wait_for_save().

This is not safe if:

- wait_for_save() is called on a step whose storers were never created or were already finalized (e.g., empty connector metadata),
- wait_for_save() is called more than once for the same storers, or
- the generator yields a different number of steps than the adapter expects.

Additionally, LMCacheEngine.store_layer() had a freeze-mode branch that yielded a different number of steps than the normal path, which could make assumptions about the expected number of next() calls even more brittle.
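The fragility is plain generator semantics, as this self-contained example shows:

```python
def pipeline(steps: int):
    # Stand-in for a multi-stage generator like store_layer().
    for _ in range(steps):
        yield

gen = pipeline(2)
next(gen)      # stage 1
next(gen)      # stage 2
# next(gen)    # one call too many would raise StopIteration here
for _ in gen:  # draining with a for-loop is safe even if already exhausted
    pass
```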
Fix / solution approach

The fix makes layerwise saving robust and idempotent (a condensed sketch follows the list):

- Drain generators to completion in wait_for_save(). A simple for-loop exhausts each store_layer generator; if there is nothing left, the generator is simply already done.
- Clear layerwise storers after finalization, so stale generators cannot be finalized again in later forward steps.
- Ensure unpin happens even when metadata is empty, by tracking req_ids as a fallback for lookup_unpin(). This avoids leaked pins when metadata is missing.
- Make the freeze-mode yield count consistent with the normal path in store_layer().
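Putting the pieces together, a condensed sketch of the approach; the attribute and method names used here (layerwise_storers, metadata_req_ids, tracked_req_ids, lookup_unpin) are placeholders following the description above, not the exact adapter code:

```python
def wait_for_save(self):
    # Finalize any pending layerwise saves. A for-loop drains each
    # store_layer() generator to completion; an already exhausted
    # generator simply yields nothing instead of raising StopIteration.
    for layerwise_storer in self.layerwise_storers:
        for _ in layerwise_storer:
            pass
    # Prevent double-finalization if wait_for_save() is called twice.
    self.layerwise_storers.clear()

    # Unpin even when connector metadata is empty: fall back to the
    # request ids tracked earlier in the step.
    req_ids = self.metadata_req_ids or self.tracked_req_ids
    self.lookup_unpin(req_ids)
```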
Operational note
The fix is not intended to hide real errors. It treats StopIteration from the layerwise
pipeline generator as normal completion, which is the correct generator semantics.
All other unexpected exceptions still surface.
Files changed (implementation)
- lmcache/integration/vllm/vllm_v1_adapter.py
- lmcache/v1/cache_engine.py
- lmcache/v1/offload_server/zmq_server.py
Related fix discovered while debugging
Separately, we also fixed lmcache/utils.py:parse_cache_key().
The cache key string format now includes an explicit dtype segment:
fmt@model@world_size@worker_id@chunk_hash@dtype[@layer_id][@tag%value...]
parse_cache_key() decides whether a key contains a layer_id by checking whether the
corresponding '@' segment is numeric. After adding the dtype segment, the layer_id
index shifts by one, so the old detection logic could mis-classify layer keys.
This is not the direct cause of the StopIteration crash above, but it is a correctness
fix for key parsing that was required to keep layer-key detection consistent with the
new format.
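For reference, a simplified, hypothetical stand-in for the layer-key check in parse_cache_key() under the new format; the example key values are made up:

```python
def has_layer_id(key_str: str) -> bool:
    """Detect an optional numeric layer_id under the new key format:

    fmt@model@world_size@worker_id@chunk_hash@dtype[@layer_id][@tag%value...]

    With the dtype segment at index 5, layer_id now sits at index 6,
    so the old parts[5].isdigit() check would mis-classify layer keys.
    """
    parts = key_str.strip().split("@")
    return len(parts) >= 7 and parts[6].isdigit()

# Made-up example keys:
print(has_layer_id("vllm@some-model@1@0@abc123@bf16@17"))  # True  (layer key)
print(has_layer_id("vllm@some-model@1@0@abc123@bf16"))     # False (chunk key)
```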