Fix chunked prefill and KV cache leaks for streaming sessions#20476

Merged
hnyls2002 merged 4 commits into sgl-project:main from
YazhiGao:fix/streaming-chunked-prefill-leak
Mar 13, 2026
Conversation

@YazhiGao
Contributor

Three fixes for streaming session KV cache management:

  1. Enforce single chunked request per prefill batch: Track new_chunked_req and has_reusing_chunked_req in PrefillAdder to prevent multiple chunked requests from being batched together. The previous has_chunked_req parameter only checked self.chunked_req from prior batches, missing within-batch duplicates. Removes the now-unnecessary has_chunked_req parameter from add_one_req.

  2. Make SessionSlot.restore_to_req non-destructive: don't clear req_pool_idx and mamba_pool_idx from the slot after restoring to the request. During chunked prefill, a request may be rejected by the scheduler (e.g. budget exhausted) and retried in the next cycle, causing match_prefix to call restore_to_req again. The destructive restore caused the second call to fall through to the inner radix cache, acquiring real tree node locks that were never properly released.

  3. Always skip inner cache_unfinished_req for streaming chunked stash: move the chunked prefix_indices save above the slot existence check so it applies uniformly to all turns, preventing redundant radix tree insertions during inter-chunk stashing.
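The guard described in fix 1 can be pictured as a small state machine in the adder. A minimal sketch, assuming simplified stand-in names (`PrefillAdder`, `add_one_req`, `would_be_chunked`) rather than the real scheduler signatures:

```python
class Req:
    """Minimal stand-in for a scheduler request (hypothetical fields)."""
    def __init__(self, name, is_chunked=0, req_pool_idx=None):
        self.name = name
        self.is_chunked = is_chunked
        self.req_pool_idx = req_pool_idx

class PrefillAdder:
    """Sketch of the single-chunked-request-per-batch guard."""
    def __init__(self):
        self.can_run_list = []
        self.new_chunked_req = None           # request newly chunked within this batch
        self.has_reusing_chunked_req = False  # chunked request reusing its req_pool_idx

    def add_one_req(self, req, would_be_chunked):
        # Refuse a second chunked request within the same batch; checking only
        # self.chunked_req from prior batches would miss within-batch duplicates.
        if would_be_chunked and (
            self.new_chunked_req is not None or self.has_reusing_chunked_req
        ):
            return False
        self.can_run_list.append(req)
        if would_be_chunked:
            req.is_chunked = 1
            self.new_chunked_req = req
        if req.req_pool_idx is not None and req.is_chunked > 0:
            self.has_reusing_chunked_req = True
        return True

adder = PrefillAdder()
assert adder.add_one_req(Req("A"), would_be_chunked=True)       # first chunked req accepted
assert not adder.add_one_req(Req("B"), would_be_chunked=True)   # second chunked req rejected
```

The key point of the design is that both reuse sources are tracked per batch, so the duplicate is caught even when the prior-batch `self.chunked_req` is empty.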



@hnyls2002 hnyls2002 self-assigned this Mar 12, 2026
Comment thread python/sglang/srt/dllm/mixin/scheduler.py
@YazhiGao YazhiGao force-pushed the fix/streaming-chunked-prefill-leak branch from aca4ea8 to 83d9e75 Compare March 12, 2026 23:37
self.can_run_list.append(req)
# Track if this batch has a reusing request with is_chunked > 0
# to prevent batching another chunked-reusing request (memory_pool assertion).
if req.req_pool_idx is not None and req.is_chunked > 0:
    self.has_reusing_chunked_req = True
Collaborator

In the new chunked prefill pipeline, when the previous request A's last chunk is added to the prefill batch, has_reusing_chunked_req is set to True. This prevents request B (the next request) from being added to the prefill batch. I suggest not changing the original pipeline (this causes a performance regression) and just relaxing the assertion first.

@hnyls2002
Collaborator

/tag-and-rerun-ci

@hnyls2002
Collaborator

hnyls2002 commented Mar 13, 2026

@cctry @YazhiGao

The req_pool_idx reuse assert added in #17850 needs to be relaxed for streaming sessions:

assert (
    sum(1 for i in reusing if reqs[i].is_chunked > 0) <= 1
), "only one chunked request may reuse req_pool_idx in a batch"

Original reuse (from #17850): A chunked request keeps its req_pool_idx across chunks instead of freeing and reallocating — preventing a data race where an async CUDA kernel reads a slot that has been freed and reassigned. The scheduler tracks exactly one self.chunked_req, so <= 1 was always satisfied.

New reuse (streaming sessions): Streaming session requests enter scheduling with a pre-existing req_pool_idx recovered from a session slot (restore_to_req). This introduces a second source of reuse — independent from the chunked prefill reuse.

Why at most 2: A batch can contain at most one request from add_chunked_req (the single self.chunked_req) and at most one new chunked request from add_one_req (new_chunked_req). When both are streaming and carry their own req_pool_idx from session slots, the count reaches 2. It cannot exceed 2 because the scheduler only tracks one chunked_req and one new_chunked_req.

Trigger scenario: Request A (streaming, is_chunked=1, last chunk, req_pool_idx=2) via add_chunked_req + Request B (streaming, new turn, req_pool_idx=3 from session slot, gets chunked, is_chunked bumped to 1 before alloc_req_slots) → sum = 2 → assert fails.

Safety: Each reusing request writes to its own independent row in req_to_token_pool.req_to_token — no slot aliasing, no data race. The invariant from #17850 (no freed-then-reallocated slot read by async CUDA) is preserved.

Proposed fix: <= 1 → <= 2. I will rewrite the chunked prefill logic in the following PRs.
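The relaxed invariant and the trigger scenario above can be sketched as follows; `check_reuse`, `reqs`, and `reusing` are illustrative names, and the real assertion lives in the scheduler's batching path:

```python
from types import SimpleNamespace

def check_reuse(reqs, reusing):
    """Relaxed invariant sketch: at most one chunked request from
    add_chunked_req plus at most one new chunked request from add_one_req
    may carry a pre-existing req_pool_idx, so the bound is 2, not 1."""
    n = sum(1 for i in reusing if reqs[i].is_chunked > 0)
    assert n <= 2, "at most two chunked requests may reuse req_pool_idx in a batch"
    return n

# Trigger scenario from the comment: request A (streaming, last chunk,
# req_pool_idx=2) plus request B (streaming new turn, req_pool_idx=3, chunked).
reqs = [
    SimpleNamespace(is_chunked=1, req_pool_idx=2),  # A via add_chunked_req
    SimpleNamespace(is_chunked=1, req_pool_idx=3),  # B via add_one_req
]
check_reuse(reqs, reusing=[0, 1])  # passes with <= 2; the old <= 1 would fail
```

Safety holds because each reusing request writes to its own row of req_to_token, so the count bound is the only thing that needs relaxing.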

@cctry
Collaborator

cctry commented Mar 13, 2026

Proposed fix: <= 1 → <= 2. I will rewrite the chunked prefill logic in the following PRs.

Makes sense to me, but I'd suggest making the assertion more specific instead of just relaxing the count.
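One hypothetical way to make the assertion more specific, in the spirit of this suggestion, is to verify that each reusing chunked request actually comes from one of the two tracked sources instead of only bounding the count; all names below are illustrative, not the real scheduler API:

```python
from types import SimpleNamespace

def check_reuse_specific(reqs, reusing, chunked_req, new_chunked_req):
    """Stricter variant sketch: every chunked request that reuses a
    req_pool_idx must be the tracked chunked_req or new_chunked_req,
    rather than merely keeping the total at <= 2."""
    allowed = {id(r) for r in (chunked_req, new_chunked_req) if r is not None}
    for i in reusing:
        req = reqs[i]
        if req.is_chunked > 0:
            assert id(req) in allowed, (
                "chunked req_pool_idx reuse must come from chunked_req "
                "or new_chunked_req"
            )

# Both reusing requests are the tracked sources -> passes.
a = SimpleNamespace(is_chunked=1, req_pool_idx=2)
b = SimpleNamespace(is_chunked=1, req_pool_idx=3)
check_reuse_specific([a, b], [0, 1], chunked_req=a, new_chunked_req=b)
```

This pins down where the reuse is allowed to originate, so an unexpected third source would fail loudly rather than slipping under a looser count bound.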

@hnyls2002 hnyls2002 merged commit b1246c5 into sgl-project:main Mar 13, 2026
134 of 155 checks passed