Fix streaming session busy check double-counting; add compat CI tests by hnyls2002 · Pull Request #22213 · sgl-project/sglang

hnyls2002 · 2026-04-06T23:25:02Z

Summary

Fix double-counting in busy-time memory check for streaming sessions by adding is_active borrow tracker on SessionSlot
When restore_to_req borrows KV from a slot, the slot's req_pool_idx is intentionally not cleared (for chunked prefill retry idempotency). This caused session_held_tokens() and _get_total_uncached_sizes() to both count the same KV pages, triggering false positive leak assertions
is_active is set on restore_to_req, cleared on save_from_req. Token/req accounting methods skip active (borrowed-out) slots
Enable SGLANG_ENABLE_STRICT_MEM_CHECK_DURING_BUSY=2 (assert + log) in streaming session tests to validate the fix
Add mixed-chunk variant to verify no regressions under mixed chunked prefill

Following work

Retract + streaming session token leak: retract calls release_kv_cache(is_insert=False) -> save_from_req but skips overalloc tail cleanup, leaking ~14 tokens. Tracked separately.
fix: streaming session race condition + some metrics #21875: fixes other streaming session issues (abort handling, close race condition, speculative tail trimming, stale last_node on release)

Test plan

TestStreamingSession — base, busy check level=2
TestStreamingSessionMixedChunk — mixed chunk, busy check level=2

hnyls2002 · 2026-04-06T23:25:09Z

/rerun-test test/registered/sessions/test_streaming_session.py

github-actions · 2026-04-06T23:25:34Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

…-cache

hnyls2002 · 2026-04-06T23:27:31Z

/rerun-test test/registered/sessions/test_streaming_session.py

github-actions · 2026-04-06T23:28:02Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

gemini-code-assist

Code Review

This pull request enables support for multiple external corpora in ngram speculative decoding by replacing the single Suffix Automaton (SAM) with a map of SAMs. It introduces new HTTP endpoints and management logic to add, remove, and list corpora at runtime. Key feedback identifies critical thread-safety issues in the C++ implementation where staging_sam_ is accessed without proper synchronization between background loading and clearing operations. Other improvements suggested include addressing non-determinism caused by std::unordered_map iteration during result merging and refining the budget distribution logic to better utilize available tokens when the number of SAMs is high.

I am having trouble creating individual review comments. Click here to see my feedback.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (77-82)

Accessing staging_sam_ here is thread-unsafe. If clearExternalCorpus is called on the main thread while this background operation is running, staging_sam_ could be reset to nullptr between the check on line 78 and the dereference on line 81, leading to a crash.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (84-92)

Multiple unsynchronized accesses to staging_sam_ occur here before the lock is acquired on line 94. Specifically, finalize(), empty(), and reset() are called on staging_sam_ while the main thread might be concurrently resetting it in clearExternalCorpus.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (103-107)

Resetting staging_sam_ while holding mutex_ does not protect against crashes in appendExternalCorpusTokens or finishExternalCorpusLoad, as those methods access staging_sam_ without acquiring the same lock. To fix this, all accesses to staging_sam_ should be protected by the mutex, or the staging logic should be refactored to avoid shared state during the loading process.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (70-75)

The staging_sam_ member is accessed here without holding mutex_. While the comment suggests it is only accessed from a background thread, clearExternalCorpus (which runs on the main thread) also modifies staging_sam_. This creates a race condition if a load is started while a clear operation is in progress.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (215-219)

Iterating over std::unordered_map results in a non-deterministic order of SAM merging. If multiple SAMs provide different candidates for the same prefix, the final merged result may vary between runs depending on the iteration order. For consistent speculative decoding, consider using a deterministic container like std::map or sorting the keys before iteration.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (204-210)

The condition per_sam_budget == 0 is overly restrictive. If total_sam_budget is less than num_sams, all SAMs are completely bypassed even if some budget is available. It would be more efficient to distribute the available budget to a subset of SAMs rather than falling back to trie-only matching.

When a streaming session slot restores KV to an active request via restore_to_req, the slot's req_pool_idx is intentionally not cleared (for idempotent retry). This caused session_held_tokens() to count tokens that were already included in the running batch's uncached_size, triggering a spurious "Mem Leak Detected" assertion under retract mode. Add is_active flag to SessionSlot: set on restore_to_req, cleared on save_from_req. Token/req accounting methods skip active slots.

hnyls2002 · 2026-04-06T23:40:57Z

/rerun-test test/registered/sessions/test_streaming_session.py

github-actions · 2026-04-06T23:41:27Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

hnyls2002 · 2026-04-06T23:51:04Z

/rerun-test test/registered/sessions/test_streaming_session.py

github-actions · 2026-04-06T23:51:29Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

In overlap scheduling, a finished streaming-session req can still be present in the batch after its KV was transferred to a SessionSlot (req_pool_idx = None). The busy mem check counted these tokens in both uncached_size (from the batch) and session_held (from the slot). Skip reqs with req_pool_idx=None in _get_batch_uncached_size.

hnyls2002 · 2026-04-07T00:00:43Z

/rerun-test test/registered/sessions/test_streaming_session.py

github-actions · 2026-04-07T00:01:10Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

The busy-time strict mem check has a pre-existing token accounting mismatch with overlap scheduling + concurrent streaming sessions (-2 undercount). This is independent of the streaming session feature. Keep SGLANG_TEST_RETRACT to verify retract behavior; idle-time memory checks still validate no token leaks after sessions close.

hnyls2002 · 2026-04-07T00:09:22Z

/rerun-test test/registered/sessions/test_streaming_session.py

github-actions · 2026-04-07T00:09:53Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

…leak Concurrent streaming sessions under retract pressure leak ~14 tokens (idle check: available+evictable+protected+session_held < max). This is a pre-existing bug, not introduced by the compat tests. Skip these variants for now; the leak will be investigated separately.

hnyls2002 · 2026-04-07T00:17:37Z

/rerun-test test/registered/sessions/test_streaming_session.py

github-actions · 2026-04-07T00:18:03Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

…on-compat # Conflicts: # python/sglang/srt/managers/scheduler_runtime_checker_mixin.py

hnyls2002 · 2026-04-12T08:04:44Z

/rerun-test test_streaming_session.py

github-actions · 2026-04-12T08:05:16Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

hnyls2002 · 2026-04-12T08:22:58Z

/rerun-test test_streaming_session.py
(4 tries)

github-actions · 2026-04-12T08:23:26Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

hnyls2002 · 2026-04-12T08:26:55Z

/tag-and-rerun-ci

github-actions · 2026-04-12T08:30:36Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

github-actions · 2026-04-12T08:32:28Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

github-actions · 2026-04-12T08:40:24Z

✅ 1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

…sgl-project#22213)

github-actions Bot added the jit-kernel label Apr 6, 2026

add streaming session compat tests for mixed chunk, retract, no-radix…

10cd806

…-cache

hnyls2002 force-pushed the lsyin/streaming-session-compat branch from 6dddc81 to 10cd806 Compare April 6, 2026 23:27

gemini-code-assist Bot reviewed Apr 6, 2026

View reviewed changes

enable debug-level mem check (level=2) for retract variants

de4cae4

hnyls2002 assigned Qiaolin-Yu and kpham-sgl Apr 7, 2026

Merge remote-tracking branch 'origin/main' into lsyin/streaming-sessi…

96dd1a8

…on-compat # Conflicts: # python/sglang/srt/managers/scheduler_runtime_checker_mixin.py

hnyls2002 changed the title ~~Add streaming session compat CI tests~~ Fix streaming session busy check double-counting; add compat CI tests Apr 12, 2026

hnyls2002 marked this pull request as ready for review April 12, 2026 08:26

hnyls2002 requested review from Ying1123, hanming-lu, hzh0425, ispobock, merrymercy, xiezhq-hermann and yizhang2077 as code owners April 12, 2026 08:26

github-actions Bot added the run-ci label Apr 12, 2026

enable busy check (level=2) in streaming session tests

5168c79

enable busy check (level=2) in streaming session tests

c0d7fe4

hnyls2002 added 2 commits April 12, 2026 01:36

drop no-radix-cache variants; no meaningful coverage

9bc71e2

remove section banners

70f1f04

hnyls2002 merged commit f1eb4ca into main Apr 12, 2026
32 of 72 checks passed

hnyls2002 deleted the lsyin/streaming-session-compat branch April 12, 2026 08:48

hnyls2002 mentioned this pull request Apr 14, 2026

Fix streaming session busy-check double-counting via active_pool_idxs #22753

Merged

pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026

Fix streaming session busy check double-counting; add compat CI tests (…

a7f56f8

…sgl-project#22213)

hnyls2002 mentioned this pull request Apr 15, 2026

streaming session: spec v2 bonus accounting + comprehensive test matrix #22651

Merged

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Fix streaming session busy check double-counting; add compat CI tests (…

284db5f

…sgl-project#22213)

Conversation

hnyls2002 commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Following work

Test plan

Uh oh!

hnyls2002 commented Apr 6, 2026

Uh oh!

github-actions Bot commented Apr 6, 2026

Uh oh!

hnyls2002 commented Apr 6, 2026

Uh oh!

github-actions Bot commented Apr 6, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (77-82)

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (84-92)

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (103-107)

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (70-75)

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (215-219)

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (204-210)

Uh oh!

hnyls2002 commented Apr 6, 2026

Uh oh!

github-actions Bot commented Apr 6, 2026

Uh oh!

hnyls2002 commented Apr 6, 2026

Uh oh!

github-actions Bot commented Apr 6, 2026

Uh oh!

hnyls2002 commented Apr 7, 2026

Uh oh!

github-actions Bot commented Apr 7, 2026

Uh oh!

hnyls2002 commented Apr 7, 2026

Uh oh!

github-actions Bot commented Apr 7, 2026

Uh oh!

hnyls2002 commented Apr 7, 2026

Uh oh!

github-actions Bot commented Apr 7, 2026

Uh oh!

hnyls2002 commented Apr 12, 2026

Uh oh!

github-actions Bot commented Apr 12, 2026

Uh oh!

hnyls2002 commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Apr 12, 2026

Uh oh!

hnyls2002 commented Apr 12, 2026

Uh oh!

github-actions Bot commented Apr 12, 2026

Uh oh!

github-actions Bot commented Apr 12, 2026

Uh oh!

github-actions Bot commented Apr 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hnyls2002 commented Apr 6, 2026 •

edited

Loading

hnyls2002 commented Apr 12, 2026 •

edited

Loading