[KVConnector][Feature] Support KV connector cache reset via /reset_prefix_cache#27170
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new API to reset the KV connector cache, which is a useful feature for testing and benchmarking. The implementation is well-structured, adding the reset_cache method to the base connector class and implementing it through the stack for both synchronous and asynchronous paths.
My main feedback is about a bug in the synchronous path where the boolean return value indicating the success of the cache reset is not propagated up to the LLM class. This results in LLM.reset_connector_cache having an incorrect return type hint and behavior. I've provided detailed comments and suggestions to fix this issue across the affected files. The asynchronous path used by the API server correctly handles this by design, as it doesn't check the return status.
Overall, the changes are good, and with the suggested fix, the new API will be more robust.
|
Is there a compelling reason to add a new endpoint? It would seem reasonable to piggy-back on |
Good point, there's no strict need for a separate endpoint. |
4b86b33 to
ae370d8
Compare
Updated to extend |
ae370d8 to
dc9168f
Compare
|
cc @NickLucche |
NickLucche
left a comment
There was a problem hiding this comment.
Actually, sorry, one comment on the interface for the multiconnector case
| Returns: | ||
| bool: True if the cache was successfully reset, False otherwise. | ||
| """ | ||
| return False |
There was a problem hiding this comment.
this could be optional to implement the suggestion below
There was a problem hiding this comment.
Optional[bool] or bool | None as return type, to allow for the all(..) check to work fine (you rule-out the connectors that do NOT implement it)
There was a problem hiding this comment.
Why treat connectors that haven’t implemented this as successful?
It could confuse users who aren’t aware of the implementation details.
There was a problem hiding this comment.
connectors that have not implemented it will return None which is neither.
If you do any(..), one sub-connector being successful will result in multiconnector returning successful transfers.
What I would expect semantically is that multiconnector only returns true when all sub-connectors that implement the interface return True (hence None return value is needed to discern those that have implemented and yet fail).
There was a problem hiding this comment.
Got it - I’ve pushed a commit with that change, thanks for clarifying.
1dd3ae5 to
26292d1
Compare
|
Once the prefix cache metrics get merged - #26245, we should reset the metric too. Edit: Done |
26292d1 to
e8045dc
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a feature to reset the external KV connector cache via the /reset_prefix_cache endpoint, controlled by a new reset_external flag. The changes are well-propagated from the API layer down to the scheduler and connectors. My review focuses on the implementation of the reset logic, where I've identified two high-severity issues related to short-circuiting evaluation that could prevent all relevant caches from being reset. I've provided suggestions to ensure the reset operations are always fully executed as intended.
|
@NickLucche Ready for another review — please let me know what you think. |
…efix_cache Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: tovam <tovam@pliops.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Tova Movshovitz <tovam@pliops.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Tova Movshovitz <tovam@pliops.com>
395dd1a to
12b5448
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a feature to reset the external KV connector cache through the existing /reset_prefix_cache endpoint by adding a reset_external flag. The implementation is well-structured, with the new functionality correctly propagated from the API layer down to the scheduler and KV connectors. The logic for handling cache resets, including for MultiConnector and for connectors that may not implement this functionality, is sound. The code is clean and adheres to the existing patterns within the codebase. I have not identified any high or critical severity issues in these changes.
NickLucche
left a comment
There was a problem hiding this comment.
Apologies for the delay in getting back to you, this looks well done!
…or that does not implement resetting Signed-off-by: tovam <tovam@pliops.com>
|
@ptovam enabling auto-merge, let's see if we can get CI green |
|
Curious should we make |
…efix_cache (vllm-project#27170) Signed-off-by: tovam <tovam@pliops.com> Signed-off-by: Tova Movshovitz <tovam@pliops.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…nnector_cache ``Scheduler.reset_connector_cache()`` currently returns ``False`` when no KV connector is configured, which makes ``Scheduler.reset_prefix_cache(reset_running_requests, reset_connector=True)`` return ``False`` on every engine that doesn't have a connector -- even when the local prefix cache reset itself succeeded. A caller that interprets the return value as "did the reset I asked for succeed?" (e.g. an RL framework that asks for a clear before a weight update, or an ops tool that pipes the return value into an SLO check) sees a spurious failure for the most common deployment shape (no KV connector configured). The user-facing log line was also a ``warning``, even though "no connector exists, so nothing to reset" is the expected steady-state behavior of most engines. Fix: - Return ``True`` when ``self.connector is None``. There is no external KV store to invalidate, so the operation trivially succeeds; the local prefix cache (handled earlier in ``reset_prefix_cache``) decides the real return value. - Demote the log line from ``warning`` to ``debug``: this branch is hit on every engine without a connector, which is the norm rather than a misconfiguration. The True/False/None return contract on ``KVConnector.reset_cache`` itself (None = not implemented, False = explicit failure, anything else = success) is unchanged; only the "no connector at all" path moves from a sentinel failure to a sentinel success, matching how ``reset_mm_cache`` / ``reset_encoder_cache`` already behave when their target structures are empty. Tests: ``tests/v1/core/test_scheduler.py::test_reset_connector_cache_no_connector_is_no_op_success`` covers both the direct method call and the ``reset_prefix_cache(reset_connector=True)`` cascade end-to-end on a no-connector scheduler. ## Duplicate-work check - ``gh pr list --repo vllm-project/vllm --state open --search "reset_connector_cache in:body"`` -> no results - ``gh pr list --repo vllm-project/vllm --state open --search "reset_connector reset_prefix_cache"`` -> no results - ``gh issue list --repo vllm-project/vllm --state open --search "reset_connector_cache"`` -> no results The only prior PR touching this method is vllm-project#27170 (which introduced the ``reset_connector`` parameter); that PR is merged and this is a follow-up correctness fix in the same method. ## AI assistance This patch was authored with assistance from Claude Opus 4.7 (1M context). Every changed line was reviewed by the submitter before opening the PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen524 <aoshen524@gmail.com>
`AsyncLLM.pause_generation(clear_cache=True)` internally calls `EngineCore._reset_caches`, which then calls `Scheduler.reset_prefix_cache(reset_running_requests=True)` -- with the new `reset_connector` parameter (added in vllm-project#27170) defaulting to False. For engines with a configured external KV store connector (e.g. `MooncakeStoreConnector`, `LMCacheConnectorV1`, `HF3FSConnector`, ...) that means a user asking the engine to "clear cache" after a weight update gets the local prefix cache cleared but leaves the external store full of KV blocks hashed against the previous policy. For RL post-training workflows this is a silent stale-cache correctness hole: the next request can read external KV that was written by the old policy and serve it as a prefix hit against the new policy. This patch parameterizes `EngineCore._reset_caches` with `reset_connector: bool = True` and forwards the value to `reset_prefix_cache`. The default flips from False to True so that `pause_generation(clear_cache=True)` -- the user-facing API whose contract is "clear caches" -- now actually clears the external KV store too. `Scheduler.reset_connector_cache` already gracefully handles the no-connector case (warns + returns False), so this is a no-op for engines that don't configure a connector. Internal callers that want a local-only invalidation can override with `reset_connector=False`; outside callers can still bypass the cascade by calling `scheduler.reset_prefix_cache(reset_connector=False)` directly. This intentionally keeps `reset_connector` *out* of the user-facing `AsyncLLM.pause_generation` signature: threading an opt-in flag all the way up turns the safety net into a footgun (user forgets the flag -> silent stale KV). The user-facing surface stays a single `clear_cache` semantic ("clear means clear"). Future internal callers can override via the new kwarg if they need to. ## Why this is not a duplicate - `gh pr list --repo vllm-project/vllm --state open --search "_reset_caches reset_connector"` -> empty - `gh pr list --repo vllm-project/vllm --state open --search "pause_generation clear_cache reset_connector"` -> empty - `gh issue list --repo vllm-project/vllm --state open --search "pause_generation reset_connector"` -> empty Related PRs: - vllm-project#27170 introduced `reset_connector` on `Scheduler.reset_prefix_cache`. - vllm-project#42693 is a parallel correctness fix for the no-connector branch of the same code path (returns True instead of False when no connector is attached) -- independent of this change. - vllm-project#42694 implements `MooncakeStoreConnector.reset_cache`, the most immediate beneficiary of this cascade. ## Test plan No new test is added: the behavior change is exercised by the existing `tests/v1/core/test_reset_prefix_cache_e2e.py` (no-connector engine keeps working) and by the new tests in vllm-project#42694 (Mooncake connector now sees its `reset_cache` called via `pause_generation(clear_cache=True)`). Commands run locally: ```bash pre-commit run --files vllm/v1/engine/core.py # -> ruff check, ruff format, mypy-3.10, SPDX, etc. all Passed ``` The full pytest run could not be executed in my local sandbox (the editable `vllm` install in the venv points at a different worktree whose precompiled CUDA flash-attention extensions don't match upstream `main`). CI will be the source of truth. ## AI assistance This patch was authored with assistance from Claude Opus 4.7 (1M context). Every changed line was reviewed by the submitter before opening the PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen524 <aoshen524@gmail.com>
…efix_cache (vllm-project#27170) Signed-off-by: tovam <tovam@pliops.com> Signed-off-by: Tova Movshovitz <tovam@pliops.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…efix_cache (vllm-project#27170) Signed-off-by: tovam <tovam@pliops.com> Signed-off-by: Tova Movshovitz <tovam@pliops.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…efix_cache (vllm-project#27170) Signed-off-by: tovam <tovam@pliops.com> Signed-off-by: Tova Movshovitz <tovam@pliops.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Purpose
Extends the existing
/reset_prefix_cacheendpoint with an optionalreset_externalflag.When set, both the local (GPU) and external (connector) prefix caches are reset.
This allows clearing connector-side caches dynamically without restarting the server -useful for benchmarking and testing.