[Bugfix] vllm layerwise save stopiteration#2268

Closed
DongDongJu wants to merge 8 commits into LMCache:dev from DongDongJu:bugfix/vllm-layerwise-save-stopiteration

Conversation

@DongDongJu
Collaborator

@DongDongJu DongDongJu commented Dec 17, 2025

Bug fix for #2135 and #2137.
For #2135, motivated by @crystalxyz's fix, but a little different.

What this PR does / why we need it:
In the vLLM v1 integration, LMCache performs layerwise KV saving using a Python generator returned by LMCacheEngine.store_layer().
vLLM pumps this generator during attention execution (save_kv_layer) and finalizes it at the end of a
forward step (wait_for_save).

Under some scheduling / concurrency patterns, vLLM can call wait_for_save() when there is
nothing to save (e.g., empty connector metadata) or can call it multiple times.
In the old implementation, LMCache's worker-side adapter could call next() on an already-exhausted
store_layer generator, raising StopIteration.
vLLM treats that exception as fatal and kills EngineCore (EngineDeadError).
In generator contexts, Python may also convert it into a RuntimeError (PEP 479), making it even more disruptive.

What exactly crashed

The crash chain looked like:

  • vLLM worker forward exits
  • kv_connector.wait_for_save() is invoked
  • LMCache vLLM adapter calls next(layerwise_storer)
  • layerwise_storer is already exhausted -> StopIteration
  • vLLM propagates the exception as fatal -> EngineCore terminates
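
A minimal sketch of this failure mode in plain Python (the names and shapes are illustrative, not the actual LMCache code):

def store_layer(num_layers):
    # One step per layer; save_kv_layer() pumps one step at a time.
    for _ in range(num_layers):
        yield

storer = store_layer(2)
next(storer)  # layer 0
next(storer)  # layer 1
try:
    next(storer)  # extra call at wait_for_save() time: already exhausted
except StopIteration:
    print("StopIteration -> vLLM treats this as fatal (EngineDeadError)")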

Why it did NOT happen often under low concurrency

This bug requires a specific sequence of events that is less common with low concurrency:

  1. A previous forward step creates layerwise storers (store_layer generators) and advances
    them.

  2. A later forward step has nothing to save, for example:

    • connector metadata becomes empty (requests=[]), or
    • all requests are not savable (can_save=False), or
    • all tokens are effectively skipped (skip_leading_tokens causes a full skip), or
    • the step is a warmup/compile/graph path where save_kv_layer is not triggered.
  3. Even in step (2), vLLM still calls wait_for_save() at forward exit.

If the adapter still holds stale layerwise storers from step (1), then calling next() again
hits StopIteration.

With low concurrency, batches tend to be more stable: many steps still have at least one
savable request, and the save/finalize pattern is more consistent. As concurrency increases,
more no-save steps occur (due to scheduling, batching, prefix caching, chunked prefill,
compile/warmup behavior, and request composition). This makes the required sequence more
frequent and exposes the bug.

Root cause (fundamental issue)

The fundamental issue is a contract mismatch / fragility around using a generator as a
multi-stage async pipeline:

  • LMCache’s layerwise save is a multi-step generator.
  • vLLM calls into the adapter in patterns that vary by step.
  • The old adapter assumed that calling next() once in wait_for_save() is always valid.
    This is not safe if:
    • wait_for_save() is called when there are no new storers,
    • wait_for_save() is called multiple times,
    • the generator finishes earlier than expected for any reason.

Additionally, LMCacheEngine.store_layer() had a freeze-mode branch that yielded a different
number of steps than the normal path, which made assumptions about the expected number of
next() calls even more brittle.
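
For illustration, here is how a mismatched yield count can break a caller that expects one step per layer plus a finalize step (hypothetical shapes, not the actual engine code):

def store_layer_normal(num_layers):
    for _ in range(num_layers):
        yield  # one step per layer
    yield      # finalize step

def store_layer_freeze_old(num_layers):
    # Hypothetical old freeze branch: fewer yields than the caller expects.
    yield

# A caller that pumps once per layer plus once more at finalize time will
# run the freeze-mode generator past its end and hit StopIteration early.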

Fix / solution approach

The fix makes layerwise saving robust and idempotent.

  1. Drain generators to completion in wait_for_save()

    • Instead of calling next() exactly once, we keep calling next() until StopIteration (see the sketch after this list).
    • This guarantees all remaining pipeline stages are completed.
    • It also makes wait_for_save() safe if called multiple times: an already completed
      generator is simply already done.
  2. Clear layerwise storers after finalization

    • After draining, the adapter clears the list to avoid reusing stale generators in later
      forward steps.
  3. Ensure unpin happens even when metadata is empty

    • Under some steps, connector metadata can be empty at wait_for_save() time.
    • We track req_ids used for layerwise storing in the current forward and use them as a
      fallback for lookup_unpin(). This avoids leaked pins when metadata is missing.
  4. Make freeze-mode yield count consistent

    • In freeze mode, store_layer() still yields the expected number of steps.
    • This prevents subtle off-by-one behavior where integrations assume a specific step count.
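
A minimal sketch of items (1)-(3) above, assuming the adapter keeps its storers in a list and records req_ids when storing starts (the attribute names and the lookup_unpin() call shape here are illustrative, not the exact adapter code):

def wait_for_save(self):
    # (1) Drain every layerwise storer. A for-loop absorbs StopIteration
    # as normal generator termination, so an exhausted storer is a no-op
    # and a repeated wait_for_save() call is safe.
    for layerwise_storer in self.layerwise_storers:
        for _ in layerwise_storer:
            pass
    # (2) Clear the list so later forward steps never touch stale generators.
    self.layerwise_storers.clear()
    # (3) Fallback unpin using req_ids recorded when layerwise storing
    # started, in case connector metadata is empty by now.
    for req_id in self.current_store_req_ids:
        self.lmcache_engine.lookup_unpin(req_id)
    self.current_store_req_ids.clear()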

Operational note

The fix is not intended to hide real errors. It treats StopIteration from the layerwise
pipeline generator as normal completion, which is the correct generator semantics.
All other unexpected exceptions still surface.

Files changed (implementation)

  • lmcache/integration/vllm/vllm_v1_adapter.py

    • Drain store_layer generators in wait_for_save() and clear them.
    • Track req_ids for unpin even if connector metadata becomes empty.
  • lmcache/v1/cache_engine.py

    • In freeze mode, yield an additional finalize step to match expected step count.
  • lmcache/v1/offload_server/zmq_server.py

    • Add an INFO log on server start to aid debugging.

Related fix discovered while debugging

Separately, we also fixed lmcache/utils.py:parse_cache_key().

The cache key string format now includes an explicit dtype segment:

fmt@model@world_size@worker_id@chunk_hash@dtype[@layer_id][@tag%value...]

parse_cache_key() decides whether a key contains a layer_id by checking whether the
corresponding '@' segment is numeric. After adding the dtype segment, the layer_id
index shifts by one, so the old detection logic could mis-classify layer keys.

This is not the direct cause of the StopIteration crash above, but it is a correctness
fix for key parsing that was required to keep layer-key detection consistent with the
new format.
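
A small illustration of the index shift (the key string below is a made-up example, not a real cache key):

key_str = "vllm@meta-llama/Llama-3@1@0@abcdef@bf16@5"
parts = key_str.strip().split("@")
# parts[0..5]: fmt, model, world_size, worker_id, chunk_hash, dtype.
# Before the dtype segment existed, a numeric parts[5] meant a layer_id;
# with dtype at index 5, the layer_id check moves to parts[6].
has_layer_id = len(parts) >= 7 and parts[6].isdigit()
print(has_layer_id)  # True for this example key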

  • this PR contains user-facing changes - docs added
  • this PR contains unit tests

vLLM calls LMCache's layerwise save pump (store_layer generator) from
save_kv_layer() and then finalizes it in wait_for_save(). In practice,
wait_for_save() can be invoked when there is nothing to save (e.g. empty
connector metadata), or invoked multiple times, and the generator may
already be exhausted.

Previously, next(layerwise_storer) could raise StopIteration, which vLLM
treats as fatal (PEP 479 turns it into RuntimeError in generator contexts),
bringing down EngineCore.

Make worker-side layerwise saving idempotent:
- Drain layerwise storers to completion in wait_for_save() and clear them.
- Track req_ids for unpin even when connector metadata is empty.
- Ensure store_layer() yields the expected step count in freeze mode.
- Add an INFO log when ZMQOffloadServer starts to aid debugging.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
parse_cache_key() determines whether a key string includes a layer_id by
checking which '@' segment is numeric. The key format now includes a dtype
segment, so the layer_id index shifts by one.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
@DongDongJu DongDongJu changed the title Bugfix/vllm layerwise save stopiteration [Bugfix] vllm layerwise save stopiteration Dec 17, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @DongDongJu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a significant stability issue in the vLLM v1 integration with LMCache, where the vLLM engine could crash due to StopIteration exceptions originating from the layerwise KV saving mechanism. The changes enhance the robustness of generator handling, ensuring that KV cache finalization and unpinning occur reliably across various operational scenarios, including those with high concurrency or when no data needs saving. Additionally, it includes a necessary adjustment to cache key parsing to maintain correctness with an updated key format.

Highlights

  • Bug Fix: vLLM Engine Crash: Addresses a critical bug where vLLM's EngineCore would crash due to StopIteration exceptions from LMCache's layerwise KV saving generators. This occurred when generators were prematurely exhausted or wait_for_save() was called multiple times, especially under high concurrency.
  • Robust Generator Handling: The fix makes layerwise saving robust and idempotent by draining generators to completion in wait_for_save() and clearing them afterwards, preventing issues with stale generators or multiple finalization calls.
  • Consistent KV Cache Unpinning: Ensures KV caches are correctly unpinned even when connector metadata is empty by tracking req_ids that initiated layerwise storing, providing a fallback for lookup_unpin().
  • Freeze Mode Generator Consistency: Modified store_layer() in freeze mode to yield an additional finalize step, ensuring consistency with the expected step count for layerwise integrations like vLLM.
  • Cache Key Parsing Correction: Includes a related fix for parse_cache_key() in lmcache/utils.py to correctly identify layer_id after the introduction of a dtype segment in the cache key format.

@DongDongJu DongDongJu requested a review from ApostaC December 17, 2025 21:04

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request effectively addresses a StopIteration crash in the vLLM integration by implementing a more robust handling of layerwise KV saving generators. The changes, including draining generators completely, clearing stale ones, and ensuring proper unpinning of resources, are well-thought-out and directly tackle the described race conditions. A related fix for cache key parsing is also included, improving correctness. My review includes one suggestion to refactor a loop to be more idiomatic, which improves code readability.

Comment thread lmcache/integration/vllm/vllm_v1_adapter.py Outdated
Use a simple for-loop to exhaust the store_layer generator in wait_for_save().
Behavior is unchanged, but readability is improved.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
# plus one extra in wait_for_save()).
for layer_id in range(self.num_layers):
    yield
# One extra yield for the "finalize" step.
yield
Contributor


what's the finalize step? wait_for_save?

Collaborator Author


Yes, it's the final stage of the store_layer() generator (the extra progress after the last layer), so the pipeline can finish the last pending put/cleanup.
vLLM reaches it by exhausting the generator in wait_for_save().
But same here, please correct me if I'm wrong.

Contributor


Could you leave this as a comment here?

Collaborator Author


Sure will do. Thanks!

Comment thread lmcache/integration/vllm/vllm_v1_adapter.py
for _ in layerwise_storer:
    pass

# Prevent double-finalization if wait_for_save() is called twice.
Contributor


Why will wait_for_save() be called twice?

Collaborator Author


From my understanding, vLLM can invoke it in multiple forward steps even when a later step has nothing to save (e.g., empty connector metadata), and without clearing, stale storers from a previous step could be finalized again.
Clearing makes the method idempotent and safe under such call patterns.
But let me know if I'm wrong here.

@sammshen
Contributor

Apologies for the delay, will review this tomorrow!


@sammshen sammshen left a comment


This PR makes sense to me. Thanks for the extremely detailed write up.

Do you think the generator code needs refactoring, or is most of the complexity unavoidable due to vLLM scheduler behavior? How did you find out about most of these unusual behaviors?

Did you run any tests?


@YaoJiayi YaoJiayi left a comment


@DongDongJu I'm still a bit confused about why this PR is needed. Maybe it would be more helpful to point out the related vLLM code as well? Do you mean that vLLM can execute multiple forward steps at the same time?

Comment thread lmcache/utils.py
"""
parts = key_str.strip().split("@")
-    if len(parts) >= 6 and parts[5].isdigit():
+    if len(parts) >= 7 and parts[6].isdigit():
Collaborator


Why is this needed?

Collaborator Author


This is not the direct cause of the StopIteration crash above, but it is a correctness fix for key parsing that was required to keep the layerwise-mode tests passing.

@DongDongJu
Collaborator Author

Hello @YaoJiayi, happy holidays.

@DongDongJu I'm still a bit confused about why this PR is needed. Maybe it would be more helpful to point out the related vLLM code as well? Do you mean that vLLM can execute multiple forward steps at the same time?

Sure, let me first clarify what I understand about the layerwise mode's design.

save_kv_layer() advances a storage generator (per layer) via next().
wait_for_save() finalizes that storage generator with a last next() call (and completes the batched puts/cleanup).

So what exactly is being solved here is the following.

Let me show more detail with the vLLM code.

vLLM v1 calls kv_connector.wait_for_save() at the end of every forward step via

KVConnectorModelRunnerMixin._get_kv_connector_output(...):
    finally:
        if wait_for_save:
            kv_connector.wait_for_save()

But the LMCache save hook is conditional: maybe_transfer_kv_layer is a no-op when attn_metadata is None or not connector.has_connector_metadata(), so some steps legitimately never call save_kv_layer() (e.g., empty/unsavable connector metadata, or dummy/warmup/graph runs without attention metadata) (reference).

So, actually, for the metadata, the easiest fix is to change these lines so that the metadata becomes None if there is nothing in it, or to change these lines to check that the length of the metadata is not 0. I thought it would be good to have a guard on the LMCache side for the future too.

Additionally, the reason it did not happen in the sync path is that when use_layerwise is off, LMCache isn't driving a multi-yield generator that needs finalization by next(), so we don't get the specific "StopIteration -> EngineCore dies" failure mode. The "weird" behavior (vLLM calling wait_for_save() even on no-save steps) can still happen, but without the layerwise generator in the middle, there's nothing to accidentally advance past completion.

@ApostaC you might also need to check.

@DongDongJu
Collaborator Author

This PR makes sense to me. Thanks for the extremely detailed write up.

Do you think the generator code needs refactoring, or is most of the complexity unavoidable due to vLLM scheduler behavior? How did you find out about most of these unusual behaviors?

Hello @sammshen. Happy holidays.

I believe vLLM's behavior is complex enough, and as I mentioned in my reply on Jiayi's thread, there are multiple cases where this can happen, and more may appear with future vLLM code changes.
So it's good to have this as a guard against breakdown, because this patch's logic does not break the KV cache itself.

Did you run any tests?

Yes; like the recent issue #2327, we can easily reproduce it with example workloads. I used multi-turn chat with 10 concurrent users.

@ningziwen
Contributor


I pulled the commits in this PR to test and confirmed the issue in #2327 still exists.

@DongDongJu
Collaborator Author


Hello @ningziwen, sorry for the wrong indication.
I looked it over quickly, and this one is a different issue.
I will continue the discussion in #2327.

@DongDongJu
Collaborator Author

I don't know which patch resolved this, but the problem seems to have disappeared.

@DongDongJu DongDongJu closed this Feb 17, 2026
