Skip to content

[KVConnector][Feature] Support KV connector cache reset via /reset_prefix_cache#27170

Merged
NickLucche merged 8 commits into
vllm-project:mainfrom
ptovam:reset_connector_cache
Dec 5, 2025
Merged

[KVConnector][Feature] Support KV connector cache reset via /reset_prefix_cache#27170
NickLucche merged 8 commits into
vllm-project:mainfrom
ptovam:reset_connector_cache

Conversation

@ptovam

@ptovam ptovam commented Oct 19, 2025

Copy link
Copy Markdown
Contributor

Purpose

Extends the existing /reset_prefix_cache endpoint with an optional reset_external flag.
When set, both the local (GPU) and external (connector) prefix caches are reset.

This allows clearing connector-side caches dynamically without restarting the server -useful for benchmarking and testing.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new API to reset the KV connector cache, which is a useful feature for testing and benchmarking. The implementation is well-structured, adding the reset_cache method to the base connector class and implementing it through the stack for both synchronous and asynchronous paths.

My main feedback is about a bug in the synchronous path where the boolean return value indicating the success of the cache reset is not propagated up to the LLM class. This results in LLM.reset_connector_cache having an incorrect return type hint and behavior. I've provided detailed comments and suggestions to fix this issue across the affected files. The asynchronous path used by the API server correctly handles this by design, as it doesn't check the return status.

Overall, the changes are good, and with the suggested fix, the new API will be more robust.

Comment thread vllm/entrypoints/llm.py Outdated
Comment thread vllm/v1/engine/llm_engine.py Outdated
Comment thread vllm/v1/engine/core.py Outdated
Comment thread vllm/v1/engine/core_client.py Outdated
Comment thread vllm/v1/engine/core_client.py Outdated
Comment thread vllm/v1/engine/core_client.py Outdated
@markmc

markmc commented Oct 20, 2025

Copy link
Copy Markdown
Member

Is there a compelling reason to add a new endpoint? It would seem reasonable to piggy-back on /reset_prefix_cache ?

@ptovam

ptovam commented Oct 20, 2025

Copy link
Copy Markdown
Contributor Author

Is there a compelling reason to add a new endpoint? It would seem reasonable to piggy-back on /reset_prefix_cache ?

Good point, there's no strict need for a separate endpoint.
The goal was mainly to keep the option to reset only the GPU prefix cache, or both GPU and connector caches together.
Extending /reset_prefix_cache with an optional flag would work just as well if that makes more sense.

@ptovam ptovam force-pushed the reset_connector_cache branch from 4b86b33 to ae370d8 Compare October 20, 2025 16:29
@ptovam

ptovam commented Oct 20, 2025

Copy link
Copy Markdown
Contributor Author

Is there a compelling reason to add a new endpoint? It would seem reasonable to piggy-back on /reset_prefix_cache ?

Updated to extend /reset_prefix_cache with an optional reset_external flag, replacing the separate endpoint.

@markmc markmc left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment thread vllm/entrypoints/openai/api_server.py Outdated
@ptovam ptovam force-pushed the reset_connector_cache branch from ae370d8 to dc9168f Compare October 21, 2025 06:43
@ptovam ptovam changed the title [KVConnector][Feature] Add API to reset KV connector cache [KVConnector][Feature] Support KV connector cache reset via /reset_prefix_cache Oct 21, 2025
@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 21, 2025
@DarkLight1337

Copy link
Copy Markdown
Member

cc @NickLucche

@NickLucche NickLucche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@NickLucche NickLucche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, sorry, one comment on the interface for the multiconnector case

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py Outdated
Returns:
bool: True if the cache was successfully reset, False otherwise.
"""
return False

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be optional to implement the suggestion below

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYM?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional[bool] or bool | None as return type, to allow for the all(..) check to work fine (you rule-out the connectors that do NOT implement it)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why treat connectors that haven’t implemented this as successful?
It could confuse users who aren’t aware of the implementation details.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connectors that have not implemented it will return None which is neither.
If you do any(..), one sub-connector being successful will result in multiconnector returning successful transfers.

What I would expect semantically is that multiconnector only returns true when all sub-connectors that implement the interface return True (hence None return value is needed to discern those that have implemented and yet fail).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it - I’ve pushed a commit with that change, thanks for clarifying.

@ptovam ptovam force-pushed the reset_connector_cache branch 2 times, most recently from 1dd3ae5 to 26292d1 Compare October 23, 2025 09:04
@ptovam

ptovam commented Oct 23, 2025

Copy link
Copy Markdown
Contributor Author

Once the prefix cache metrics get merged - #26245, we should reset the metric too.

Edit: Done

@ptovam ptovam force-pushed the reset_connector_cache branch from 26292d1 to e8045dc Compare October 23, 2025 11:50
@mergify mergify Bot removed the needs-rebase label Oct 30, 2025
@ptovam

ptovam commented Nov 6, 2025

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a feature to reset the external KV connector cache via the /reset_prefix_cache endpoint, controlled by a new reset_external flag. The changes are well-propagated from the API layer down to the scheduler and connectors. My review focuses on the implementation of the reset logic, where I've identified two high-severity issues related to short-circuiting evaluation that could prevent all relevant caches from being reset. I've provided suggestions to ensure the reset operations are always fully executed as intended.

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py Outdated
Comment thread vllm/v1/core/sched/scheduler.py Outdated
@ptovam

ptovam commented Nov 10, 2025

Copy link
Copy Markdown
Contributor Author

@NickLucche Ready for another review — please let me know what you think.
Thanks.

ptovam and others added 6 commits December 2, 2025 14:06
…efix_cache

Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: tovam <tovam@pliops.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Tova Movshovitz <tovam@pliops.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Tova Movshovitz <tovam@pliops.com>
@ptovam ptovam force-pushed the reset_connector_cache branch from 395dd1a to 12b5448 Compare December 2, 2025 12:51
@ptovam

ptovam commented Dec 2, 2025

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a feature to reset the external KV connector cache through the existing /reset_prefix_cache endpoint by adding a reset_external flag. The implementation is well-structured, with the new functionality correctly propagated from the API layer down to the scheduler and KV connectors. The logic for handling cache resets, including for MultiConnector and for connectors that may not implement this functionality, is sound. The code is clean and adheres to the existing patterns within the codebase. I have not identified any high or critical severity issues in these changes.

@NickLucche NickLucche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay in getting back to you, this looks well done!

Comment thread vllm/v1/core/sched/scheduler.py
Comment thread vllm/v1/core/sched/scheduler.py
@NickLucche NickLucche enabled auto-merge (squash) December 5, 2025 17:16
@NickLucche

Copy link
Copy Markdown
Member

@ptovam enabling auto-merge, let's see if we can get CI green

@NickLucche NickLucche merged commit adb3150 into vllm-project:main Dec 5, 2025
56 checks passed
@ivanium

ivanium commented Mar 30, 2026

Copy link
Copy Markdown
Collaborator

Curious should we make reset_connector=True by default? I may not have the full picture here but I think reset_prefix_cache is mainly used in RL cases, where we also need to reset connector caches.

mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
…efix_cache (vllm-project#27170)

Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: Tova Movshovitz <tovam@pliops.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
aoshen02 pushed a commit to aoshen02/vllm that referenced this pull request May 15, 2026
…nnector_cache

``Scheduler.reset_connector_cache()`` currently returns ``False`` when
no KV connector is configured, which makes
``Scheduler.reset_prefix_cache(reset_running_requests, reset_connector=True)``
return ``False`` on every engine that doesn't have a connector --
even when the local prefix cache reset itself succeeded.

A caller that interprets the return value as "did the reset I asked
for succeed?" (e.g. an RL framework that asks for a clear before a
weight update, or an ops tool that pipes the return value into an
SLO check) sees a spurious failure for the most common deployment
shape (no KV connector configured). The user-facing log line was
also a ``warning``, even though "no connector exists, so nothing to
reset" is the expected steady-state behavior of most engines.

Fix:
- Return ``True`` when ``self.connector is None``. There is no
  external KV store to invalidate, so the operation trivially
  succeeds; the local prefix cache (handled earlier in
  ``reset_prefix_cache``) decides the real return value.
- Demote the log line from ``warning`` to ``debug``: this branch is
  hit on every engine without a connector, which is the norm rather
  than a misconfiguration.

The True/False/None return contract on ``KVConnector.reset_cache``
itself (None = not implemented, False = explicit failure, anything
else = success) is unchanged; only the "no connector at all" path
moves from a sentinel failure to a sentinel success, matching how
``reset_mm_cache`` / ``reset_encoder_cache`` already behave when
their target structures are empty.

Tests: ``tests/v1/core/test_scheduler.py::test_reset_connector_cache_no_connector_is_no_op_success``
covers both the direct method call and the
``reset_prefix_cache(reset_connector=True)`` cascade end-to-end on a
no-connector scheduler.

## Duplicate-work check

- ``gh pr list --repo vllm-project/vllm --state open --search "reset_connector_cache in:body"`` -> no results
- ``gh pr list --repo vllm-project/vllm --state open --search "reset_connector reset_prefix_cache"`` -> no results
- ``gh issue list --repo vllm-project/vllm --state open --search "reset_connector_cache"`` -> no results

The only prior PR touching this method is vllm-project#27170 (which introduced
the ``reset_connector`` parameter); that PR is merged and this is
a follow-up correctness fix in the same method.

## AI assistance

This patch was authored with assistance from Claude Opus 4.7 (1M
context). Every changed line was reviewed by the submitter before
opening the PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen524 <aoshen524@gmail.com>
aoshen02 pushed a commit to aoshen02/vllm that referenced this pull request May 15, 2026
`AsyncLLM.pause_generation(clear_cache=True)` internally calls
`EngineCore._reset_caches`, which then calls
`Scheduler.reset_prefix_cache(reset_running_requests=True)` -- with
the new `reset_connector` parameter (added in vllm-project#27170) defaulting to
False. For engines with a configured external KV store connector
(e.g. `MooncakeStoreConnector`, `LMCacheConnectorV1`,
`HF3FSConnector`, ...) that means a user asking the engine to
"clear cache" after a weight update gets the local prefix cache
cleared but leaves the external store full of KV blocks hashed
against the previous policy.

For RL post-training workflows this is a silent stale-cache
correctness hole: the next request can read external KV that was
written by the old policy and serve it as a prefix hit against
the new policy.

This patch parameterizes `EngineCore._reset_caches` with
`reset_connector: bool = True` and forwards the value to
`reset_prefix_cache`. The default flips from False to True so that
`pause_generation(clear_cache=True)` -- the user-facing API whose
contract is "clear caches" -- now actually clears the external KV
store too. `Scheduler.reset_connector_cache` already gracefully
handles the no-connector case (warns + returns False), so this is
a no-op for engines that don't configure a connector. Internal
callers that want a local-only invalidation can override with
`reset_connector=False`; outside callers can still bypass the
cascade by calling `scheduler.reset_prefix_cache(reset_connector=False)`
directly.

This intentionally keeps `reset_connector` *out* of the user-facing
`AsyncLLM.pause_generation` signature: threading an opt-in flag all
the way up turns the safety net into a footgun (user forgets the
flag -> silent stale KV). The user-facing surface stays a single
`clear_cache` semantic ("clear means clear"). Future internal
callers can override via the new kwarg if they need to.

## Why this is not a duplicate

- `gh pr list --repo vllm-project/vllm --state open --search "_reset_caches reset_connector"` -> empty
- `gh pr list --repo vllm-project/vllm --state open --search "pause_generation clear_cache reset_connector"` -> empty
- `gh issue list --repo vllm-project/vllm --state open --search "pause_generation reset_connector"` -> empty

Related PRs:
- vllm-project#27170 introduced `reset_connector` on `Scheduler.reset_prefix_cache`.
- vllm-project#42693 is a parallel correctness fix for the no-connector branch of
  the same code path (returns True instead of False when no connector
  is attached) -- independent of this change.
- vllm-project#42694 implements `MooncakeStoreConnector.reset_cache`, the most
  immediate beneficiary of this cascade.

## Test plan

No new test is added: the behavior change is exercised by the existing
`tests/v1/core/test_reset_prefix_cache_e2e.py` (no-connector engine
keeps working) and by the new tests in vllm-project#42694 (Mooncake connector now
sees its `reset_cache` called via `pause_generation(clear_cache=True)`).

Commands run locally:

```bash
pre-commit run --files vllm/v1/engine/core.py
# -> ruff check, ruff format, mypy-3.10, SPDX, etc. all Passed
```

The full pytest run could not be executed in my local sandbox (the
editable `vllm` install in the venv points at a different worktree
whose precompiled CUDA flash-attention extensions don't match upstream
`main`). CI will be the source of truth.

## AI assistance

This patch was authored with assistance from Claude Opus 4.7 (1M
context). Every changed line was reviewed by the submitter before
opening the PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen524 <aoshen524@gmail.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…efix_cache (vllm-project#27170)

Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: Tova Movshovitz <tovam@pliops.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…efix_cache (vllm-project#27170)

Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: Tova Movshovitz <tovam@pliops.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
0826joyce pushed a commit to 0826joyce/vllm-serving-optimization that referenced this pull request May 19, 2026
…efix_cache (vllm-project#27170)

Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: Tova Movshovitz <tovam@pliops.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

frontend kv-connector ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants