Skip to content

[KV Connector] Skip stale KV xfer completion notifications in scheduler#43265

Closed
zhewenl wants to merge 1 commit into
vllm-project:mainfrom
zhewenl:fix/scheduler-kv-xfer-stale-assert
Closed

[KV Connector] Skip stale KV xfer completion notifications in scheduler#43265
zhewenl wants to merge 1 commit into
vllm-project:mainfrom
zhewenl:fix/scheduler-kv-xfer-stale-assert

Conversation

@zhewenl

@zhewenl zhewenl commented May 21, 2026

Copy link
Copy Markdown
Collaborator

Purpose

In P/D (prefill/decode) disaggregated setups, KV transfer completion notifications from the worker-side connector are asynchronous. When a request's lifecycle ends before the underlying KV write completes — observed under load with the Mooncake connector when request lifecycle < KV write latency — the scheduler has already removed the request from self.requests by the time the late finished_recving / finished_sending callback arrives. The assert req_id in self.requests in _update_from_kv_xfer_finished then aborts the engine process.

Same underlying issue as #37837.

This change drops late notifications for unknown request ids instead of asserting, with a logger.debug line so the drops remain observable.

…duler

KV transfer completion notifications from the worker-side connector are
asynchronous. In P/D setups, when a request's lifecycle ends before the
underlying KV write completes (observed under load with the Mooncake
connector when request lifecycle < KV write latency), the scheduler has
already removed the request from `self.requests` by the time the late
`finished_recving` / `finished_sending` callback arrives. The
`assert req_id in self.requests` in `_update_from_kv_xfer_finished` then
aborts the engine.

Skip such stale notifications instead of asserting, with a debug log so
drops remain observable. Same underlying issue as vllm-project#37837.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the scheduler to handle asynchronous KV transfer completions by replacing assertions with checks that skip stale notifications for requests already cleaned up. A reviewer suggested an improvement in the finished_sending loop to verify that a request is in a finished state before calling _free_blocks, preventing potential assertion failures for active requests.

Comment on lines +2158 to 2161
if req_id not in self.requests:
logger.debug("Dropping stale finished_sending for request %s", req_id)
continue
self._free_blocks(self.requests[req_id])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to the finished_recving loop, calling _free_blocks directly here will trigger an assertion failure inside that method if the request is not in a finished state. If a stale finished_sending notification arrives for an active request, it is safer to log and skip it rather than crashing the engine.

            if req_id not in self.requests:
                logger.debug("Dropping stale finished_sending for request %s", req_id)
                continue
            req = self.requests[req_id]
            if RequestStatus.is_finished(req.status):
                self._free_blocks(req)
            else:
                logger.debug("Dropping stale finished_sending for request %s with status %s", req_id, req.status)

@mergify mergify Bot added v1 bug Something isn't working kv-connector labels May 21, 2026
@njhill

njhill commented May 21, 2026

Copy link
Copy Markdown
Member

I'm trying to understand how this actually occurs.

In the async save/load case, the req_id should by design still be in scheduler self.requests even after the request has otherwise finished.

@zhewenl

zhewenl commented May 21, 2026

Copy link
Copy Markdown
Collaborator Author

In the async save/load case, the req_id should by design still be in scheduler self.requests even after the request has otherwise finished.

let me clarify, I think this happened when requests are aborted

@zhewenl zhewenl changed the title [Bugfix][V1][P/D] Skip stale KV xfer completion notifications in scheduler [KV Connector] Skip stale KV xfer completion notifications in scheduler May 21, 2026
@njhill

njhill commented May 21, 2026

Copy link
Copy Markdown
Member

In the async save/load case, the req_id should by design still be in scheduler self.requests even after the request has otherwise finished.

let me clarify, I think this happened when requests are aborted

That shouldn't cause this in theory, they should remain in scheduler self.requests if there's a transfer in progress.

@zhewenl

zhewenl commented May 22, 2026

Copy link
Copy Markdown
Collaborator Author

better fix in #43371

@zhewenl zhewenl closed this May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working kv-connector v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants