[KV Connector] Skip stale KV xfer completion notifications in scheduler by zhewenl · Pull Request #43265 · vllm-project/vllm

zhewenl · 2026-05-21T01:20:53Z

Purpose

In P/D (prefill/decode) disaggregated setups, KV transfer completion notifications from the worker-side connector are asynchronous. When a request's lifecycle ends before the underlying KV write completes — observed under load with the Mooncake connector when request lifecycle < KV write latency — the scheduler has already removed the request from self.requests by the time the late finished_recving / finished_sending callback arrives. The assert req_id in self.requests in _update_from_kv_xfer_finished then aborts the engine process.

Same underlying issue as #37837.

This change drops late notifications for unknown request ids instead of asserting, with a logger.debug line so the drops remain observable.
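The behavior described above can be sketched with a toy model. This is a minimal illustration, not vLLM's scheduler: `ToyScheduler`, its dict-based request records, and the `freed` list are hypothetical names introduced here. Only `self.requests`, `_free_blocks`, `finished_sending`, and the debug message mirror identifiers that appear in this PR.

```python
import logging

logger = logging.getLogger("toy_scheduler")


class ToyScheduler:
    """Hypothetical stand-in for the scheduler; not vLLM's actual class."""

    def __init__(self):
        # Maps request id -> state the scheduler still tracks.
        self.requests = {}
        # Records which request ids had their blocks freed (for illustration).
        self.freed = []

    def _free_blocks(self, request):
        self.freed.append(request["id"])
        del self.requests[request["id"]]

    def update_from_kv_xfer_finished(self, finished_sending):
        """Free blocks for known requests; drop notifications for unknown ids."""
        dropped = []
        for req_id in finished_sending:
            if req_id not in self.requests:
                # Previously an assert here would have aborted the process.
                logger.debug("Dropping stale finished_sending for request %s", req_id)
                dropped.append(req_id)
                continue
            self._free_blocks(self.requests[req_id])
        return dropped


sched = ToyScheduler()
sched.requests["req-a"] = {"id": "req-a"}
# "req-gone" was already cleaned up, so its late notification is skipped.
dropped = sched.update_from_kv_xfer_finished(["req-a", "req-gone"])
```

In this sketch the late notification for `req-gone` is logged and skipped, while `req-a` is freed as usual.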

…duler

KV transfer completion notifications from the worker-side connector are asynchronous. In P/D setups, when a request's lifecycle ends before the underlying KV write completes (observed under load with the Mooncake connector when request lifecycle < KV write latency), the scheduler has already removed the request from `self.requests` by the time the late `finished_recving` / `finished_sending` callback arrives. The `assert req_id in self.requests` in `_update_from_kv_xfer_finished` then aborts the engine.

Skip such stale notifications instead of asserting, with a debug log so drops remain observable.

Same underlying issue as vllm-project#37837.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>

gemini-code-assist

Code Review

This pull request updates the scheduler to handle asynchronous KV transfer completions by replacing assertions with checks that skip stale notifications for requests already cleaned up. A reviewer suggested an improvement in the finished_sending loop to verify that a request is in a finished state before calling _free_blocks, preventing potential assertion failures for active requests.

gemini-code-assist · 2026-05-21T01:23:31Z

+            if req_id not in self.requests:
+                logger.debug("Dropping stale finished_sending for request %s", req_id)
+                continue
            self._free_blocks(self.requests[req_id])


Similar to the finished_recving loop, calling _free_blocks directly here will trigger an assertion failure inside that method if the request is not in a finished state. If a stale finished_sending notification arrives for an active request, it is safer to log and skip it rather than crashing the engine.

            if req_id not in self.requests:
                logger.debug("Dropping stale finished_sending for request %s", req_id)
                continue
            req = self.requests[req_id]
            if RequestStatus.is_finished(req.status):
                self._free_blocks(req)
            else:
                logger.debug("Dropping stale finished_sending for request %s with status %s", req_id, req.status)
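The two-stage guard suggested above (unknown id first, then finished-state check) can be tried in isolation. The sketch below is self-contained and uses hypothetical stand-ins: `ToyStatus`, `ToyRequest`, and `handle_finished_sending` are illustrative names, and `ToyStatus.is_finished` only imitates the shape of the `RequestStatus.is_finished` call quoted in the suggestion.

```python
import enum
import logging
from dataclasses import dataclass

logger = logging.getLogger("toy_scheduler")


class ToyStatus(enum.Enum):
    """Hypothetical two-state status; the real enum has more states."""

    RUNNING = enum.auto()
    FINISHED_ABORTED = enum.auto()

    @staticmethod
    def is_finished(status):
        return status is ToyStatus.FINISHED_ABORTED


@dataclass
class ToyRequest:
    request_id: str
    status: ToyStatus


def handle_finished_sending(requests, finished_sending):
    """Return ids whose blocks would be freed; skip unknown or active ones."""
    freed = []
    for req_id in finished_sending:
        req = requests.get(req_id)
        if req is None:
            logger.debug("Dropping stale finished_sending for request %s", req_id)
            continue
        if not ToyStatus.is_finished(req.status):
            logger.debug(
                "Dropping stale finished_sending for request %s with status %s",
                req_id,
                req.status,
            )
            continue
        freed.append(req_id)
    return freed


requests = {
    "done": ToyRequest("done", ToyStatus.FINISHED_ABORTED),
    "live": ToyRequest("live", ToyStatus.RUNNING),
}
freed = handle_finished_sending(requests, ["done", "live", "gone"])
```

Only the finished request is selected for freeing; the active one and the unknown one are logged at debug level and skipped.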

njhill · 2026-05-21T03:23:18Z

I'm trying to understand how this actually occurs.

In the async save/load case, the req_id should by design still be in scheduler self.requests even after the request has otherwise finished.

zhewenl · 2026-05-21T04:30:14Z

> In the async save/load case, the req_id should by design still be in scheduler self.requests even after the request has otherwise finished.

Let me clarify: I think this happened when requests were aborted.

njhill · 2026-05-21T06:01:23Z

> > In the async save/load case, the req_id should by design still be in scheduler self.requests even after the request has otherwise finished.
>
> let me clarify, I think this happened when requests are aborted

That shouldn't cause this in theory, they should remain in scheduler self.requests if there's a transfer in progress.
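The invariant described in this comment can be illustrated with a toy model of delayed free. Everything here is an assumption made for illustration: `ToyDelayedFreeScheduler`, `pending_xfer`, and the method names are hypothetical and do not reflect vLLM's actual implementation.

```python
class ToyDelayedFreeScheduler:
    """Hypothetical model of delayed free; names are illustrative only."""

    def __init__(self):
        self.requests = {}         # req_id -> status string
        self.pending_xfer = set()  # ids whose removal is deferred

    def add(self, req_id):
        self.requests[req_id] = "running"

    def finish(self, req_id, xfer_in_progress):
        self.requests[req_id] = "finished"
        if xfer_in_progress:
            # Keep the entry until the transfer completion arrives.
            self.pending_xfer.add(req_id)
        else:
            del self.requests[req_id]

    def on_xfer_finished(self, req_id):
        if req_id not in self.requests:
            raise KeyError(f"unknown request {req_id}")
        self.pending_xfer.discard(req_id)
        del self.requests[req_id]


sched = ToyDelayedFreeScheduler()
sched.add("req-a")
sched.finish("req-a", xfer_in_progress=True)
still_tracked = "req-a" in sched.requests  # entry survives the finish

sched.on_xfer_finished("req-a")  # first completion removes the entry
try:
    sched.on_xfer_finished("req-a")  # a second completion finds nothing
    second_ok = True
except KeyError:
    second_ok = False
```

Under this model a single completion per request never sees a missing id, whether the request finished normally or was aborted. A missing id would require something outside the model, such as a second completion for the same request.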

zhewenl · 2026-05-22T00:49:57Z

better fix in #43371

gemini-code-assist Bot reviewed May 21, 2026

mergify Bot added the v1, bug, and kv-connector labels May 21, 2026

zhewenl changed the title ~~[Bugfix][V1][P/D] Skip stale KV xfer completion notifications in scheduler~~ [KV Connector] Skip stale KV xfer completion notifications in scheduler May 21, 2026

Dao007forever mentioned this pull request May 21, 2026

[KV Connector] MooncakeStore: don't co-queue save with load to avoid double delayed-free #43371

Merged

zhewenl closed this May 22, 2026

zhewenl wants to merge 1 commit into vllm-project:main from zhewenl:fix/scheduler-kv-xfer-stale-assert
