[Fix] data race in req_to_token pool#17850
Merged
merrymercy merged 6 commits into main on Feb 2, 2026
Conversation
Collaborator (Author)
/tag-run-ci-label
So if a prefill batch contains 2 or more requests or request chunks, is the mamba state for these reqs incorrect?
Collaborator (Author)
The mamba state is correct, but full attention can be wrong.
Force-pushed from 60ab814 to 30b9b41
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request on Feb 5, 2026
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request on Feb 5, 2026
Collaborator
Hi @cctry,
Collaborator
@cctry, oh, I see. Since the kernel is launched asynchronously, even with overlap scheduling disabled, there could still be a race condition between
nvcastet added a commit to nvcastet/sglang that referenced this pull request on Feb 13, 2026

…vent cross-stream data race

In overlap scheduling (MTPv2), `process_batch_result(N-1)` runs on the default stream concurrently with `forward(N)` on the forward stream. When a request finishes, `release_kv_cache` immediately returns its `req_pool_idx` to the free list. A new request can then recycle that pool index, and `prepare_for_decode` overwrites the `req_to_token` row on the default stream while `forward(N)` still reads it, causing an "index out of bounds" assertion in IndexKernel.cu.

Fix: defer the pool-index free by one overlap iteration.

- `ReqToTokenPool.deferred_free(req)`: withholds the pool index from the free list (the slot cannot be reallocated).
- `ReqToTokenPool.flush_deferred_frees()`: moves deferred slots back to the free list once the forward that read them has completed.
- `release_kv_cache(..., defer_pool_free=True)`: used in the decode result-processing path when overlap is enabled.
- `process_batch_result_decode`: flushes deferred frees right after `copy_done.synchronize()`, which guarantees the previous forward has finished reading `req_to_token`.

This is the overlap-scheduling counterpart of PR sgl-project#17850, which fixed the same class of race for chunked prefill.
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request on Feb 14, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request on Mar 21, 2026
Rodrian7 added a commit to sgl-project/sglang-jax that referenced this pull request on Apr 30, 2026

Previously the scheduler explicitly freed a chunked request's req_pool slot in `get_next_batch_to_run` before allocating the next batch, then relied on `alloc(num_reqs)` to hand out a fresh slot to that same request on the next round. Under chunked prefill with the pool near capacity, the freed slot could be re-allocated to a different request, whose matched-prefix indices were then written to the same row of `req_to_token` while the previous batch's forward was still reading that row's host buffer. JAX provides no happens-before between host numpy mutations and an in-flight dispatch, and on the PJRT host-CPU zero-copy path the host write is the device buffer; the result is silent KV gather corruption.

Fix by stabilising slot ownership across batches:

- `ReqToTokenPool.alloc(reqs)` now treats reqs whose `req_pool_idx` is already non-None as in-place reuse and only consumes `free_slots` for the rest. Atomic on capacity miss: returns None and leaves every req field untouched.
- `ReqToTokenPool.free(req)` is req-aware; it releases the slot and clears `req.req_pool_idx` so a stale value cannot leak to a later caller after the slot has been re-handed out.
- The explicit free of the chunked request's slot in `Scheduler.get_next_batch_to_run` is removed. The req keeps its `req_pool_idx`; the next `prepare_for_extend`'s `alloc(reqs)` reuses it.
- All call sites (`ScheduleBatch.alloc_req_slots`, `release_req`, `ChunkCache` / `RadixCache` / `SWARadixCache.cache_finished_req`) are switched to the req-aware API.

Mirrors the upstream sglang fix in sgl-project/sglang#17850 and addresses the chunked-prefill ownership item in #992.

New unit tests in test/mem_cache/test_req_to_token_pool.py lock in: fresh allocation, in-place reuse, atomic capacity miss (with and without partial reuse), free-clears-req-field, alloc/free roundtrip, and end-to-end chunked-req survival across batches.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
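The req-aware pool API that the commit describes can be sketched as below. The `alloc(reqs)` / `free(req)` names come from the commit message; `Req` is a stand-in dataclass and the method bodies are assumptions, not the sglang-jax source:

```python
# Sketch of req-aware slot ownership: requests that already hold a slot
# reuse it in place, a capacity miss is atomic, and free() clears the
# request's index so a stale value cannot leak. Illustrative only.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Req:
    req_pool_idx: Optional[int] = None


class ReqToTokenPool:
    def __init__(self, size: int):
        self.free_slots = list(range(size))

    def alloc(self, reqs: list[Req]) -> Optional[list[int]]:
        # Reqs with a non-None index reuse their slot; only the rest
        # consume free_slots. On capacity miss, return None and leave
        # every req field untouched (atomicity).
        fresh = [r for r in reqs if r.req_pool_idx is None]
        if len(fresh) > len(self.free_slots):
            return None
        for r in fresh:
            r.req_pool_idx = self.free_slots.pop()
        return [r.req_pool_idx for r in reqs]

    def free(self, req: Req) -> None:
        # Release the slot and clear the field so the stale index cannot
        # be used after the slot is re-handed to another request.
        self.free_slots.append(req.req_pool_idx)
        req.req_pool_idx = None


pool = ReqToTokenPool(2)
a, b = Req(), Req()
assert pool.alloc([a, b]) is not None
assert pool.alloc([a]) == [a.req_pool_idx]  # in-place reuse, no new slot
c = Req()
assert pool.alloc([c]) is None              # capacity miss is atomic...
assert c.req_pool_idx is None               # ...and leaves c untouched
pool.free(b)
assert b.req_pool_idx is None
assert pool.alloc([c]) is not None          # b's slot is now reusable
```

The key design point mirrored here is that a chunked request keeps its `req_pool_idx` between batches, so no other request can be handed the same row while the previous forward is still reading it.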
JamesBrianD pushed a commit to sgl-project/sglang-jax that referenced this pull request on May 2, 2026
Motivation
A chunked prefill request frees its slot in `req_to_token_pool` and gets a slot allocated again when preparing for its next prefill batch. As a result, if a prefill batch contains multiple requests and `req_to_token_pool` is at capacity, the write of matched kv indices for another request can overwrite the slot of the chunked request while that slot is still being read in the forward stream.
Example
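A minimal, hypothetical repro of the race shape: a plain Python list stands in for the `req_to_token` table, and the in-flight forward is simulated by reading the row after the scheduler has already recycled it. All names here are illustrative:

```python
# Two pool slots, four token columns; free_slots models the pool's free list.
req_to_token = [[0] * 4 for _ in range(2)]
free_slots = [0, 1]

# Chunked request A takes slot 0 and writes its kv indices.
slot_a = free_slots.pop(0)
req_to_token[slot_a] = [10, 11, 12, 13]

# The scheduler frees A's slot before the forward has consumed the row...
free_slots.insert(0, slot_a)

# ...so request B recycles the very same slot and clobbers A's row.
slot_b = free_slots.pop(0)
assert slot_b == slot_a
req_to_token[slot_b] = [99, 99, 99, 99]

# When the in-flight forward finally reads A's row, it sees B's indices.
print(req_to_token[slot_a])  # → [99, 99, 99, 99]
```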
Modifications
- `alloc(reqs: list[Req])`: now takes a request list, sets `req.req_pool_idx` directly, and reuses the slot if already set. cc @hnyls2002
- `free()` with `free_mamba_cache(req, ...)` in `HybridReqToTokenPool`: only frees the mamba state, not the req slot. cc @hanming-lu @yizhang2077
- `release_kv_cache()`: now calls `free(req)` at the end; handles the early mamba-only free case.
- `free()` in `process_prefill_chunk` and `cache_finished_req`.
Accuracy Tests
Benchmarking and Profiling
Checklist