Fix chunked prefill and KV cache leaks for streaming sessions #20476
hnyls2002 merged 4 commits into sgl-project:main
Conversation
Three fixes for streaming session KV cache management:

1. **Enforce a single chunked request per prefill batch:** Track `new_chunked_req` and `has_reusing_chunked_req` in `PrefillAdder` to prevent multiple chunked requests from being batched together. The previous `has_chunked_req` parameter only checked `self.chunked_req` from prior batches, missing within-batch duplicates. Removes the now-unnecessary `has_chunked_req` parameter from `add_one_req`.
2. **Make `SessionSlot.restore_to_req` non-destructive:** don't clear `req_pool_idx` and `mamba_pool_idx` from the slot after restoring them to the request. During chunked prefill, a request may be rejected by the scheduler (e.g. budget exhausted) and retried in the next cycle, causing `match_prefix` to call `restore_to_req` again. The destructive restore caused the second call to fall through to the inner radix cache, acquiring real tree node locks that were never properly released.
3. **Always skip the inner `cache_unfinished_req` for the streaming chunked stash:** move the chunked `prefix_indices` save above the slot existence check so it applies uniformly to all turns, preventing redundant radix tree insertions during inter-chunk stashing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
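Fix 2 above can be illustrated with a small sketch. This is a hypothetical, simplified model, not the real `SessionSlot` (which lives in sglang's session controller and carries more state); the field names beyond `req_pool_idx` and `mamba_pool_idx` are assumptions taken from the description.

```python
# Hypothetical sketch of a non-destructive restore_to_req (fix 2).
# The real SessionSlot in sglang is more involved; this only models
# the retry behavior described in the PR.

class SessionSlot:
    def __init__(self, req_pool_idx, mamba_pool_idx=None):
        self.req_pool_idx = req_pool_idx
        self.mamba_pool_idx = mamba_pool_idx

    def restore_to_req(self, req):
        # Copy the slot's pool indices onto the request WITHOUT clearing
        # them from the slot. If the scheduler rejects the request this
        # cycle (e.g. budget exhausted), match_prefix calls restore_to_req
        # again on retry; a destructive restore would find the slot empty
        # and fall through to the inner radix cache, acquiring tree node
        # locks that are never released.
        req.req_pool_idx = self.req_pool_idx
        req.mamba_pool_idx = self.mamba_pool_idx


class Req:
    """Minimal request stub for the sketch."""
    req_pool_idx = None
    mamba_pool_idx = None
```

With this shape, a second `restore_to_req` after a scheduler rejection behaves identically to the first, so the retry never touches the inner radix cache.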
Force-pushed from aca4ea8 to 83d9e75.
```python
self.can_run_list.append(req)
# Track if this batch has a reusing request with is_chunked > 0
# to prevent batching another chunked-reusing request (memory_pool assertion).
if req.req_pool_idx is not None and req.is_chunked > 0:
```
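The tracking in the hunk above can be sketched end to end. This is a simplified, hypothetical model of `PrefillAdder` (the real class also handles token budgets and the radix cache); the `Req` stub and the early-return shape are illustrative assumptions, only the `new_chunked_req` / `has_reusing_chunked_req` names come from the PR.

```python
# Hypothetical sketch of fix 1: allow at most one chunked request per
# prefill batch, counting both a request chunked within this batch and
# a reusing one (pre-existing req_pool_idx).

class Req:
    def __init__(self, rid, is_chunked=0, req_pool_idx=None):
        self.rid = rid
        self.is_chunked = is_chunked      # > 0 while chunks remain
        self.req_pool_idx = req_pool_idx  # set => reusing a pool row


class PrefillAdder:
    def __init__(self):
        self.can_run_list = []
        self.new_chunked_req = None           # chunked within this batch
        self.has_reusing_chunked_req = False  # reusing req with chunks left

    def add_one_req(self, req):
        # Reject a second chunked request, whether the first was created
        # in this batch or entered it reusing a req_pool_idx. The old
        # has_chunked_req parameter only saw self.chunked_req from prior
        # batches, so within-batch duplicates slipped through.
        if req.is_chunked > 0 and (
            self.new_chunked_req is not None or self.has_reusing_chunked_req
        ):
            return False  # defer to the next batch
        self.can_run_list.append(req)
        if req.is_chunked > 0:
            if req.req_pool_idx is not None:
                self.has_reusing_chunked_req = True
            else:
                self.new_chunked_req = req
        return True
```

A second chunked request in the same batch is deferred, while non-chunked requests are unaffected.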
In the new chunked prefill pipeline, when the previous request A's last chunk gets added to the prefill batch, has_reusing_chunked_req is set to True. This prevents request B (the next request) from being added to the same prefill batch. I suggest not changing the original pipeline (this causes a performance regression) and just relaxing the assertion first.
/tag-and-rerun-ci
The assertion in question:

```python
assert (
    sum(1 for i in reusing if reqs[i].is_chunked > 0) <= 1
), "only one chunked request may reuse req_pool_idx in a batch"
```

**Original reuse (from #17850):** a chunked request keeps its `req_pool_idx` across chunks.

**New reuse (streaming sessions):** streaming session requests enter scheduling with a pre-existing `req_pool_idx` restored from their session slot.

**Why at most 2:** a batch can contain at most one request from each reuse path: the chunked request carried over from the previous batch, plus one streaming session request reusing its slot index.

**Trigger scenario:** request A (streaming, `is_chunked > 0`, reusing its `req_pool_idx`) lands in the same prefill batch as streaming request B, which also reuses a `req_pool_idx` with `is_chunked > 0`, so the count reaches 2.

**Safety:** each reusing request writes to its own independent row in `req_to_token_pool`, so the two reuses do not collide.

**Proposed fix:** relax the assertion from `<= 1` to `<= 2`.
Makes sense to me, but I'd suggest making the assertion more specific instead of just relaxing the count.
…oject#20476) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com>