[Bugfix] vllm layerwise save stopiteration#2268

Closed
DongDongJu wants to merge 8 commits into LMCache:dev from DongDongJu:bugfix/vllm-layerwise-save-stopiteration

Conversation

@DongDongJu
Collaborator

@DongDongJu DongDongJu commented Dec 17, 2025

Bug fix for #2135 and #2137.
For #2135, motivated by @crystalxyz's fix, but a little different.

What this PR does / why we need it:
In the vLLM v1 integration, LMCache performs layerwise KV saving using a Python generator returned by LMCacheEngine.store_layer().
vLLM pumps this generator during attention execution (save_kv_layer) and finalizes it at the end of a
forward step (wait_for_save).

Under some scheduling / concurrency patterns, vLLM can call wait_for_save() when there is
nothing to save (e.g., empty connector metadata) or can call it multiple times.
In the old implementation, LMCache's worker-side adapter could call next() on an already-exhausted
store_layer generator, raising StopIteration.
vLLM treats that exception as fatal and kills EngineCore (EngineDeadError).
In generator contexts, Python may also convert it into a RuntimeError (PEP 479), making it even more disruptive.

What exactly crashed

The crash chain looked like:

  • vLLM worker forward exits
  • kv_connector.wait_for_save() is invoked
  • LMCache vLLM adapter calls next(layerwise_storer)
  • layerwise_storer is already exhausted -> StopIteration
  • vLLM propagates the exception as fatal -> EngineCore terminates
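
A minimal sketch of this failure mode in plain Python (the names and shapes are illustrative, not the actual LMCache code):

def store_layer(num_layers):
    # One step per layer; save_kv_layer() pumps one step at a time.
    for _ in range(num_layers):
        yield

storer = store_layer(2)
next(storer)  # layer 0
next(storer)  # layer 1
try:
    next(storer)  # extra call at wait_for_save() time: already exhausted
except StopIteration:
    print("StopIteration -> vLLM treats this as fatal (EngineDeadError)")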

Why it did NOT happen often under low concurrency

This bug requires a specific sequence of events that is less common with low concurrency:

  1. A previous forward step creates layerwise storers (store_layer generators) and advances
    them.

  2. A later forward step has nothing to save, for example:

    • connector metadata becomes empty (requests=[]), or
    • all requests are not savable (can_save=False), or
    • all tokens are effectively skipped (skip_leading_tokens causes a full skip), or
    • the step is a warmup/compile/graph path where save_kv_layer is not triggered.
  3. Even in step (2), vLLM still calls wait_for_save() at forward exit.

If the adapter still holds stale layerwise storers from step (1), then calling next() again
hits StopIteration.

With low concurrency, batches tend to be more stable: many steps still have at least one
savable request, and the save/finalize pattern is more consistent. As concurrency increases,
more no-save steps occur (due to scheduling, batching, prefix caching, chunked prefill,
compile/warmup behavior, and request composition). This makes the required sequence more
frequent and exposes the bug.

Root cause (fundamental issue)

The fundamental issue is a contract mismatch / fragility around using a generator as a
multi-stage async pipeline:

  • LMCache’s layerwise save is a multi-step generator.
  • vLLM calls into the adapter in patterns that vary by step.
  • The old adapter assumed that calling next() once in wait_for_save() is always valid.
    This is not safe if:
    • wait_for_save() is called when there are no new storers,
    • wait_for_save() is called multiple times,
    • the generator finishes earlier than expected for any reason.

Additionally, LMCacheEngine.store_layer() had a freeze-mode branch that yielded a different
number of steps than the normal path, which made assumptions about the expected number of
next() calls even more brittle.
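
For illustration, here is how a mismatched yield count can break a caller that expects one step per layer plus a finalize step (hypothetical shapes, not the actual engine code):

def store_layer_normal(num_layers):
    for _ in range(num_layers):
        yield  # one step per layer
    yield      # finalize step

def store_layer_freeze_old(num_layers):
    # Hypothetical old freeze branch: fewer yields than the caller expects.
    yield

# A caller that pumps once per layer plus once more at finalize time will
# run the freeze-mode generator past its end and hit StopIteration early.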

Fix / solution approach

The fix makes layerwise saving robust and idempotent.

  1. Drain generators to completion in wait_for_save()

    • Instead of calling next() exactly once, we keep calling next() until StopIteration (see the sketch after this list).
    • This guarantees all remaining pipeline stages are completed.
    • It also makes wait_for_save() safe if called multiple times: an already completed
      generator is simply already done.
  2. Clear layerwise storers after finalization

    • After draining, the adapter clears the list to avoid reusing stale generators in later
      forward steps.
  3. Ensure unpin happens even when metadata is empty

    • Under some steps, connector metadata can be empty at wait_for_save() time.
    • We track req_ids used for layerwise storing in the current forward and use them as a
      fallback for lookup_unpin(). This avoids leaked pins when metadata is missing.
  4. Make freeze-mode yield count consistent

    • In freeze mode, store_layer() still yields the expected number of steps.
    • This prevents subtle off-by-one behavior where integrations assume a specific step count.
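
A minimal sketch of items (1)-(3) above, assuming the adapter keeps its storers in a list and records req_ids when storing starts (the attribute names and the lookup_unpin() call shape here are illustrative, not the exact adapter code):

def wait_for_save(self):
    # (1) Drain every layerwise storer. A for-loop absorbs StopIteration
    # as normal generator termination, so an exhausted storer is a no-op
    # and a repeated wait_for_save() call is safe.
    for layerwise_storer in self.layerwise_storers:
        for _ in layerwise_storer:
            pass
    # (2) Clear the list so later forward steps never touch stale generators.
    self.layerwise_storers.clear()
    # (3) Fallback unpin using req_ids recorded when layerwise storing
    # started, in case connector metadata is empty by now.
    for req_id in self.current_store_req_ids:
        self.lmcache_engine.lookup_unpin(req_id)
    self.current_store_req_ids.clear()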

Operational note

The fix is not intended to hide real errors. It treats StopIteration from the layerwise
pipeline generator as normal completion, which is the correct generator semantics.
All other unexpected exceptions still surface.

Files changed (implementation)

  • lmcache/integration/vllm/vllm_v1_adapter.py

    • Drain store_layer generators in wait_for_save() and clear them.
    • Track req_ids for unpin even if connector metadata becomes empty.
  • lmcache/v1/cache_engine.py

    • In freeze mode, yield an additional finalize step to match expected step count.
  • lmcache/v1/offload_server/zmq_server.py

    • Add an INFO log on server start to aid debugging.

Related fix discovered while debugging

Separately, we also fixed lmcache/utils.py:parse_cache_key().

The cache key string format now includes an explicit dtype segment:

fmt@model@world_size@worker_id@chunk_hash@dtype[@layer_id][@tag%value...]

parse_cache_key() decides whether a key contains a layer_id by checking whether the
corresponding '@' segment is numeric. After adding the dtype segment, the layer_id
index shifts by one, so the old detection logic could mis-classify layer keys.

This is not the direct cause of the StopIteration crash above, but it is a correctness
fix for key parsing that was required to keep layer-key detection consistent with the
new format.
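
A small illustration of the index shift (the key string below is a made-up example, not a real cache key):

key_str = "vllm@meta-llama/Llama-3@1@0@abcdef@bf16@5"
parts = key_str.strip().split("@")
# parts[0..5]: fmt, model, world_size, worker_id, chunk_hash, dtype.
# Before the dtype segment existed, a numeric parts[5] meant a layer_id;
# with dtype at index 5, the layer_id check moves to parts[6].
has_layer_id = len(parts) >= 7 and parts[6].isdigit()
print(has_layer_id)  # True for this example key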

  • this PR contains user-facing changes - docs added
  • this PR contains unit tests

vLLM calls LMCache's layerwise save pump (store_layer generator) from
save_kv_layer() and then finalizes it in wait_for_save(). In practice,
wait_for_save() can be invoked when there is nothing to save (e.g. empty
connector metadata), or invoked multiple times, and the generator may
already be exhausted.

Previously, next(layerwise_storer) could raise StopIteration, which vLLM
treats as fatal (PEP 479 turns it into RuntimeError in generator contexts),
bringing down EngineCore.

Make worker-side layerwise saving idempotent:
- Drain layerwise storers to completion in wait_for_save() and clear them.
- Track req_ids for unpin even when connector metadata is empty.
- Ensure store_layer() yields the expected step count in freeze mode.
- Add an INFO log when ZMQOffloadServer starts to aid debugging.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
parse_cache_key() determines whether a key string includes a layer_id by
checking which '@' segment is numeric. The key format now includes a dtype
segment, so the layer_id index shifts by one.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
@DongDongJu DongDongJu changed the title Bugfix/vllm layerwise save stopiteration [Bugfix] vllm layerwise save stopiteration Dec 17, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @DongDongJu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a significant stability issue in the vLLM v1 integration with LMCache, where the vLLM engine could crash due to StopIteration exceptions originating from the layerwise KV saving mechanism. The changes enhance the robustness of generator handling, ensuring that KV cache finalization and unpinning occur reliably across various operational scenarios, including those with high concurrency or when no data needs saving. Additionally, it includes a necessary adjustment to cache key parsing to maintain correctness with an updated key format.

Highlights

  • Bug Fix: vLLM Engine Crash: Addresses a critical bug where vLLM's EngineCore would crash due to StopIteration exceptions from LMCache's layerwise KV saving generators. This occurred when generators were prematurely exhausted or wait_for_save() was called multiple times, especially under high concurrency.
  • Robust Generator Handling: The fix makes layerwise saving robust and idempotent by draining generators to completion in wait_for_save() and clearing them afterwards, preventing issues with stale generators or multiple finalization calls.
  • Consistent KV Cache Unpinning: Ensures KV caches are correctly unpinned even when connector metadata is empty by tracking req_ids that initiated layerwise storing, providing a fallback for lookup_unpin().
  • Freeze Mode Generator Consistency: Modified store_layer() in freeze mode to yield an additional finalize step, ensuring consistency with the expected step count for layerwise integrations like vLLM.
  • Cache Key Parsing Correction: Includes a related fix for parse_cache_key() in lmcache/utils.py to correctly identify layer_id after the introduction of a dtype segment in the cache key format.

@DongDongJu DongDongJu requested a review from ApostaC December 17, 2025 21:04

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request effectively addresses a StopIteration crash in the vLLM integration by implementing a more robust handling of layerwise KV saving generators. The changes, including draining generators completely, clearing stale ones, and ensuring proper unpinning of resources, are well-thought-out and directly tackle the described race conditions. A related fix for cache key parsing is also included, improving correctness. My review includes one suggestion to refactor a loop to be more idiomatic, which improves code readability.

Comment thread lmcache/integration/vllm/vllm_v1_adapter.py Outdated
Use a simple for-loop to exhaust the store_layer generator in wait_for_save().
Behavior is unchanged, but readability is improved.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
# plus one extra in wait_for_save()).
for layer_id in range(self.num_layers):
    yield
# One extra yield for the "finalize" step.
yield
Contributor


what's the finalize step? wait_for_save?

Collaborator Author


Yes, it's the final stage of the store_layer() generator (the extra progress after the last layer), so the pipeline can finish the last pending put/cleanup.
vLLM reaches it by exhausting the generator in wait_for_save().
But same here, please correct me if I'm wrong.

Contributor


Could you leave this as a comment here?

Collaborator Author


Sure will do. Thanks!

Comment thread lmcache/integration/vllm/vllm_v1_adapter.py
for _ in layerwise_storer:
    pass

# Prevent double-finalization if wait_for_save() is called twice.
Contributor


Why will wait_for_save() be called twice?

Collaborator Author


From my understanding, vLLM can invoke it in multiple forward steps even when a later step has nothing to save (e.g., empty connector metadata), and without clearing, stale storers from a previous step could be finalized again.
Clearing makes the method idempotent and safe under such call patterns.
But let me know if I'm wrong here.

@sammshen
Contributor

Apologies for the delay, will review this tomorrow!


@sammshen sammshen left a comment


This PR makes sense to me. Thanks for the extremely detailed write up.

Do you think the generator code needs refactoring, or is most of the complexity unavoidable due to vLLM scheduler behavior? How did you find out about most of these unusual behaviors?

Did you run any tests?


@YaoJiayi YaoJiayi left a comment


@DongDongJu I'm still a bit confused about why this PR is needed. Maybe it would be more helpful to point out the related vLLM code as well? Do you mean that vLLM can execute multiple forward steps at the same time?

Comment thread lmcache/utils.py
"""
parts = key_str.strip().split("@")
-    if len(parts) >= 6 and parts[5].isdigit():
+    if len(parts) >= 7 and parts[6].isdigit():
Collaborator


Why is this needed?

Collaborator Author


This is not the direct cause of the StopIteration crash above, but it is a correctness fix for key parsing that was required to keep the layerwise-mode tests passing.

@DongDongJu
Collaborator Author

Hello @YaoJiayi, happy holidays.

@DongDongJu I'm still a bit confused about why this PR is needed. Maybe it would be more helpful to point out the related vLLM code as well? Do you mean that vLLM can execute multiple forward steps at the same time?

Sure, let me first clarify what I understand about the layerwise mode's design.

save_kv_layer() advances a storage generator (per layer) via next().
wait_for_save() finalizes that storage generator with a last next() call (and completes the batched puts/cleanup).

So what exactly is being solved here is the following.

Let me show more detail with the vLLM code.

vLLM v1 calls kv_connector.wait_for_save() at the end of every forward step via

KVConnectorModelRunnerMixin._get_kv_connector_output(...):
    finally:
        if wait_for_save:
            kv_connector.wait_for_save()

But the LMCache save hook is conditional: maybe_transfer_kv_layer is a no-op when attn_metadata is None or not connector.has_connector_metadata(), so some steps legitimately never call save_kv_layer() (e.g., empty/unsavable connector metadata, or dummy/warmup/graph runs without attention metadata) (reference).

So, actually, for the metadata, the easiest fix is to change these lines so that the metadata becomes None if there is nothing in it, or to change these lines to check that the length of the metadata is not 0. I thought it would be good to have a guard on the LMCache side for the future too.

Additionally, the reason it did not happen in the sync path is that when use_layerwise is off, LMCache isn't driving a multi-yield generator that needs finalization by next(), so we don't get the specific "StopIteration -> EngineCore dies" failure mode. The "weird" behavior (vLLM calling wait_for_save() even on no-save steps) can still happen, but without the layerwise generator in the middle, there's nothing to accidentally advance past completion.

@ApostaC you might also need to check.

@DongDongJu
Collaborator Author

This PR makes sense to me. Thanks for the extremely detailed write up.

Do you think the generator code needs refactoring, or is most of the complexity unavoidable due to vLLM scheduler behavior? How did you find out about most of these unusual behaviors?

Hello @sammshen. Happy holidays.

I believe vLLM's behavior is complex enough, and as I mentioned in my reply on Jiayi's thread, there are multiple cases where this can happen, and more may appear with future vLLM code changes.
So it's good to have this as a guard against breakdown, because this patch's logic does not break the KV cache itself.

Did you run any tests?

Yes; like the recent issue #2327, we can easily reproduce it with example workloads. I used multi-turn chat with 10 concurrent users.

@ningziwen
Contributor


I pulled the commits in this PR to test and confirmed the issue in #2327 still exists.

@DongDongJu
Collaborator Author


Hello @ningziwen, sorry for the wrong indication.
I looked it over quickly, and this one is a different issue.
I will continue the discussion in #2327.

@DongDongJu
Collaborator Author

I don't know which patch resolved this, but the problem seems to have disappeared.

@DongDongJu DongDongJu closed this Feb 17, 2026
