Fix streaming session busy check double-counting; add compat CI tests#22213
Fix streaming session busy check double-counting; add compat CI tests#22213
Conversation
|
/rerun-test test/registered/sessions/test_streaming_session.py |
|
✅ |
6dddc81 to
10cd806
Compare
|
/rerun-test test/registered/sessions/test_streaming_session.py |
|
✅ |
There was a problem hiding this comment.
Code Review
This pull request enables support for multiple external corpora in ngram speculative decoding by replacing the single Suffix Automaton (SAM) with a map of SAMs. It introduces new HTTP endpoints and management logic to add, remove, and list corpora at runtime. Key feedback identifies critical thread-safety issues in the C++ implementation where staging_sam_ is accessed without proper synchronization between background loading and clearing operations. Other improvements suggested include addressing non-determinism caused by std::unordered_map iteration during result merging and refining the budget distribution logic to better utilize available tokens when the number of SAMs is high.
I am having trouble creating individual review comments. Click here to see my feedback.
python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (77-82)
Accessing staging_sam_ here is thread-unsafe. If clearExternalCorpus is called on the main thread while this background operation is running, staging_sam_ could be reset to nullptr between the check on line 78 and the dereference on line 81, leading to a crash.
python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (84-92)
Multiple unsynchronized accesses to staging_sam_ occur here before the lock is acquired on line 94. Specifically, finalize(), empty(), and reset() are called on staging_sam_ while the main thread might be concurrently resetting it in clearExternalCorpus.
python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (103-107)
Resetting staging_sam_ while holding mutex_ does not protect against crashes in appendExternalCorpusTokens or finishExternalCorpusLoad, as those methods access staging_sam_ without acquiring the same lock. To fix this, all accesses to staging_sam_ should be protected by the mutex, or the staging logic should be refactored to avoid shared state during the loading process.
python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (70-75)
The staging_sam_ member is accessed here without holding mutex_. While the comment suggests it is only accessed from a background thread, clearExternalCorpus (which runs on the main thread) also modifies staging_sam_. This creates a race condition if a load is started while a clear operation is in progress.
python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (215-219)
Iterating over std::unordered_map results in a non-deterministic order of SAM merging. If multiple SAMs provide different candidates for the same prefix, the final merged result may vary between runs depending on the iteration order. For consistent speculative decoding, consider using a deterministic container like std::map or sorting the keys before iteration.
python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (204-210)
The condition per_sam_budget == 0 is overly restrictive. If total_sam_budget is less than num_sams, all SAMs are completely bypassed even if some budget is available. It would be more efficient to distribute the available budget to a subset of SAMs rather than falling back to trie-only matching.
When a streaming session slot restores KV to an active request via restore_to_req, the slot's req_pool_idx is intentionally not cleared (for idempotent retry). This caused session_held_tokens() to count tokens that were already included in the running batch's uncached_size, triggering a spurious "Mem Leak Detected" assertion under retract mode. Add is_active flag to SessionSlot: set on restore_to_req, cleared on save_from_req. Token/req accounting methods skip active slots.
|
/rerun-test test/registered/sessions/test_streaming_session.py |
|
✅ |
|
/rerun-test test/registered/sessions/test_streaming_session.py |
|
✅ |
In overlap scheduling, a finished streaming-session req can still be present in the batch after its KV was transferred to a SessionSlot (req_pool_idx = None). The busy mem check counted these tokens in both uncached_size (from the batch) and session_held (from the slot). Skip reqs with req_pool_idx=None in _get_batch_uncached_size.
|
/rerun-test test/registered/sessions/test_streaming_session.py |
|
✅ |
The busy-time strict mem check has a pre-existing token accounting mismatch with overlap scheduling + concurrent streaming sessions (-2 undercount). This is independent of the streaming session feature. Keep SGLANG_TEST_RETRACT to verify retract behavior; idle-time memory checks still validate no token leaks after sessions close.
|
/rerun-test test/registered/sessions/test_streaming_session.py |
|
✅ |
…leak Concurrent streaming sessions under retract pressure leak ~14 tokens (idle check: available+evictable+protected+session_held < max). This is a pre-existing bug, not introduced by the compat tests. Skip these variants for now; the leak will be investigated separately.
|
/rerun-test test/registered/sessions/test_streaming_session.py |
|
✅ |
…on-compat # Conflicts: # python/sglang/srt/managers/scheduler_runtime_checker_mixin.py
|
/rerun-test test_streaming_session.py |
|
✅ |
|
/rerun-test test_streaming_session.py |
|
✅ |
|
/tag-and-rerun-ci |
|
✅ |
|
✅ |
|
✅ |
Summary
is_activeborrow tracker onSessionSlotrestore_to_reqborrows KV from a slot, the slot'sreq_pool_idxis intentionally not cleared (for chunked prefill retry idempotency). This causedsession_held_tokens()and_get_total_uncached_sizes()to both count the same KV pages, triggering false positive leak assertionsis_activeis set onrestore_to_req, cleared onsave_from_req. Token/req accounting methods skip active (borrowed-out) slotsSGLANG_ENABLE_STRICT_MEM_CHECK_DURING_BUSY=2(assert + log) in streaming session tests to validate the fixFollowing work
release_kv_cache(is_insert=False)->save_from_reqbut skips overalloc tail cleanup, leaking ~14 tokens. Tracked separately.last_nodeon release)Test plan
TestStreamingSession— base, busy check level=2TestStreamingSessionMixedChunk— mixed chunk, busy check level=2