[Fix] data race in req_to_token pool#17850

Merged
merrymercy merged 6 commits into main from csy/fix_req_to_pool
Feb 2, 2026

Conversation

@cctry
Collaborator

@cctry cctry commented Jan 28, 2026

Motivation

A chunked prefill request frees its slot in req_to_token_pool and is allocated a slot again when its next prefill batch is prepared.

As a result, if a prefill batch contains multiple requests and req_to_token_pool is at capacity, the write of another request's matched KV indices can overwrite the chunked request's slot while the forward stream is still reading it.

Example

Prepare & launch prefill batch N:
    req A (first half) --> idx 1

model runner reads idx 1

Prepare batch N+1:
    req A (second half) --> idx 2
    req B --> idx 1

scheduler writes req B's matched indices to idx 1, which batch N's forward is still reading

Modifications

  1. alloc(reqs: list[Req]) - now takes a request list, sets req.req_pool_idx directly, and reuses the slot if it is already set. cc @hnyls2002
  2. Separated free() from free_mamba_cache(req, ...) in HybridReqToTokenPool - the latter only frees the mamba state, not the req slot. cc @hanming-lu @yizhang2077
  3. release_kv_cache() - now calls free(req) at the end and handles the early mamba-only free case.
  4. Removed the free() calls in process_prefill_chunk and cache_finished_req.
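The slot-reuse behaviour in point 1 can be sketched as follows. This is a minimal illustration of the idea, not the actual sglang classes; `Req` and the pool internals are simplified assumptions:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Req:
    rid: str
    req_pool_idx: Optional[int] = None


class ReqToTokenPool:
    """Minimal sketch: a chunked request keeps its slot between chunks."""

    def __init__(self, size: int):
        self.free_slots = list(range(size))

    def alloc(self, reqs: List[Req]) -> Optional[List[int]]:
        # Requests that already own a slot (e.g. a chunked prefill request
        # between chunks) reuse it; only the rest consume free slots.
        fresh = [r for r in reqs if r.req_pool_idx is None]
        if len(fresh) > len(self.free_slots):
            return None  # capacity miss: leave every request untouched
        for r in fresh:
            r.req_pool_idx = self.free_slots.pop()
        return [r.req_pool_idx for r in reqs]

    def free(self, req: Req) -> None:
        # Req-aware free: return the slot and clear the field so a stale
        # index cannot leak after the slot is handed to another request.
        self.free_slots.append(req.req_pool_idx)
        req.req_pool_idx = None
```

Because the chunked request never releases its slot mid-stream, another request in the same batch can no longer be handed the row that the forward stream is still reading.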

Accuracy Tests

Benchmarking and Profiling

Checklist


@cctry
Collaborator Author

cctry commented Jan 28, 2026

/tag-run-ci-label

@Henrry-CHEN

So if a prefill batch contains 2 or more requests or request chunks, is the mamba state for these reqs incorrect?

@cctry
Collaborator Author

cctry commented Jan 28, 2026 via email

@cctry cctry force-pushed the csy/fix_req_to_pool branch from 60ab814 to 30b9b41 Compare January 28, 2026 21:31
@merrymercy merrymercy merged commit 027f314 into main Feb 2, 2026
194 of 214 checks passed
@merrymercy merrymercy deleted the csy/fix_req_to_pool branch February 2, 2026 22:38
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 5, 2026
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
@ClawSeven
Collaborator

Hi @cctry,
I have a question regarding the described race conditions. If we disable overlap scheduling, could there still be other write race conditions?

@ClawSeven
Collaborator

ClawSeven commented Feb 10, 2026

@cctry, Oh, I see. Since the kernel is launched asynchronously, even with overlap scheduling disabled, there could still be a race condition between write_cache_indices and the backend reading the KV caches.
Thank you for the fix. The PR not only resolves the chunked prefill issue but also addresses a potential problem in dLLM.

nvcastet added a commit to nvcastet/sglang that referenced this pull request Feb 13, 2026
…vent cross-stream data race

In overlap scheduling (MTPv2), `process_batch_result(N-1)` runs on the
default stream concurrently with `forward(N)` on the forward stream.
When a request finishes, `release_kv_cache` immediately returns its
`req_pool_idx` to the free list.  A new request can then recycle that
pool index and `prepare_for_decode` overwrites the `req_to_token` row on
the default stream while `forward(N)` still reads it — causing an
"index out of bounds" assertion in IndexKernel.cu.

Fix: defer the pool-index free by one overlap iteration.

- `ReqToTokenPool.deferred_free(req)`: withholds the pool index from
  the free list (the slot cannot be reallocated).
- `ReqToTokenPool.flush_deferred_frees()`: moves deferred slots back to
  the free list once the forward that read them has completed.
- `release_kv_cache(..., defer_pool_free=True)`: used in the decode
  result-processing path when overlap is enabled.
- `process_batch_result_decode`: flushes deferred frees right after
  `copy_done.synchronize()`, which guarantees the previous forward
  has finished reading `req_to_token`.

This is the overlap-scheduling counterpart of PR sgl-project#17850, which fixed the
same class of race for chunked prefill.
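The deferred-free scheme described above can be sketched like this. Only the method names (`deferred_free`, `flush_deferred_frees`) come from the commit message; the class body is an assumed illustration:

```python
from typing import List, Optional


class DeferredFreePool:
    """Sketch: a freed slot is withheld for one overlap iteration."""

    def __init__(self, size: int):
        self.free_slots = list(range(size))
        self._deferred: List[int] = []

    def alloc(self) -> Optional[int]:
        return self.free_slots.pop() if self.free_slots else None

    def deferred_free(self, req_pool_idx: int) -> None:
        # Withhold the index from the free list: it cannot be recycled
        # while the in-flight forward may still read its req_to_token row.
        self._deferred.append(req_pool_idx)

    def flush_deferred_frees(self) -> None:
        # Called after copy_done.synchronize(): the forward that read the
        # deferred rows has finished, so the slots are safe to hand out.
        self.free_slots.extend(self._deferred)
        self._deferred.clear()
```

The key property is that between `deferred_free` and the next `flush_deferred_frees`, the slot is unallocatable, so `prepare_for_decode` on the default stream cannot overwrite a row that `forward(N)` still reads.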
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Rodrian7 added a commit to sgl-project/sglang-jax that referenced this pull request Apr 30, 2026
Previously the scheduler explicitly freed a chunked request's req_pool
slot in get_next_batch_to_run before allocating the next batch, then
relied on alloc(num_reqs) to hand out a fresh slot to that same request
on the next round. Under chunked prefill with the pool near capacity,
the freed slot could be re-allocated to a different request whose
matched-prefix indices were then written to the same row of
req_to_token while the previous batch's forward was still reading that
row's host buffer. JAX provides no happen-before between host numpy
mutations and an in-flight dispatch, and on the PJRT host-CPU zero-copy
path the host write is the device buffer; the result is silent KV
gather corruption.

Fix by stabilising slot ownership across batches:

- ReqToTokenPool.alloc(reqs) now treats reqs whose req_pool_idx is
  already non-None as in-place reuse and only consumes free_slots for
  the rest. Atomic on capacity miss: returns None and leaves every req
  field untouched.
- ReqToTokenPool.free(req) is req-aware; it releases the slot and
  clears req.req_pool_idx so a stale value cannot leak to a later
  caller after the slot has been re-handed out.
- The explicit free of the chunked request's slot in
  Scheduler.get_next_batch_to_run is removed. The req keeps its
  req_pool_idx; the next prepare_for_extend's alloc(reqs) reuses it.
- All call sites (ScheduleBatch.alloc_req_slots, release_req,
  ChunkCache / RadixCache / SWARadixCache.cache_finished_req) are
  switched to the req-aware API.

Mirrors the upstream sglang fix in
sgl-project/sglang#17850 and addresses the
chunked-prefill ownership item in
#992 .

New unit tests in test/mem_cache/test_req_to_token_pool.py lock in:
fresh allocation, in-place reuse, atomic capacity miss (with and
without partial reuse), free-clears-req-field, alloc/free roundtrip,
and end-to-end chunked-req survival across batches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
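The "atomic on capacity miss" behaviour described above can be sketched with a tiny model of the pool. All names and internals here are simplified assumptions, not the sglang-jax code:

```python
from typing import List, Optional


class ReqToTokenPool:
    """Tiny model: alloc either fully succeeds or mutates nothing."""

    def __init__(self, size: int):
        self.free_slots = list(range(size))

    def alloc(self, reqs: List[dict]) -> Optional[List[int]]:
        fresh = [r for r in reqs if r["req_pool_idx"] is None]
        # Check capacity before mutating anything, so a failed alloc
        # leaves every request field untouched.
        if len(fresh) > len(self.free_slots):
            return None
        for r in fresh:
            r["req_pool_idx"] = self.free_slots.pop()
        return [r["req_pool_idx"] for r in reqs]

    def free(self, req: dict) -> None:
        self.free_slots.append(req["req_pool_idx"])
        # Clear the field so a stale index cannot leak to a later caller
        # after the slot has been re-handed out.
        req["req_pool_idx"] = None


# Capacity-1 pool: the chunked req keeps its slot across batches, and a
# newcomer's alloc fails atomically instead of stealing that slot.
pool = ReqToTokenPool(1)
chunked = {"req_pool_idx": None}
newcomer = {"req_pool_idx": None}
assert pool.alloc([chunked]) is not None
held = chunked["req_pool_idx"]
assert pool.alloc([chunked, newcomer]) is None  # capacity miss
assert chunked["req_pool_idx"] == held          # left untouched
assert newcomer["req_pool_idx"] is None
```

This mirrors the unit-test scenarios listed in the message: in-place reuse, atomic capacity miss with partial reuse, and free clearing the request's field.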
JamesBrianD pushed a commit to sgl-project/sglang-jax that referenced this pull request May 2, 2026