Skip to content

Fix streaming session busy check double-counting; add compat CI tests#22213

Merged
hnyls2002 merged 11 commits intomainfrom
lsyin/streaming-session-compat
Apr 12, 2026
Merged

Fix streaming session busy check double-counting; add compat CI tests#22213
hnyls2002 merged 11 commits intomainfrom
lsyin/streaming-session-compat

Conversation

@hnyls2002
Copy link
Copy Markdown
Collaborator

@hnyls2002 hnyls2002 commented Apr 6, 2026

Summary

  • Fix double-counting in busy-time memory check for streaming sessions by adding is_active borrow tracker on SessionSlot
  • When restore_to_req borrows KV from a slot, the slot's req_pool_idx is intentionally not cleared (for chunked prefill retry idempotency). This caused session_held_tokens() and _get_total_uncached_sizes() to both count the same KV pages, triggering false positive leak assertions
  • is_active is set on restore_to_req, cleared on save_from_req. Token/req accounting methods skip active (borrowed-out) slots
  • Enable SGLANG_ENABLE_STRICT_MEM_CHECK_DURING_BUSY=2 (assert + log) in streaming session tests to validate the fix
  • Add mixed-chunk variant to verify no regressions under mixed chunked prefill

Following work

  • Retract + streaming session token leak: retract calls release_kv_cache(is_insert=False) -> save_from_req but skips overalloc tail cleanup, leaking ~14 tokens. Tracked separately.
  • fix: streaming session race condition + some metrics #21875: fixes other streaming session issues (abort handling, close race condition, speculative tail trimming, stale last_node on release)

Test plan

  • TestStreamingSession — base, busy check level=2
  • TestStreamingSessionMixedChunk — mixed chunk, busy check level=2

@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/rerun-test test/registered/sessions/test_streaming_session.py

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 6, 2026

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

@hnyls2002 hnyls2002 force-pushed the lsyin/streaming-session-compat branch from 6dddc81 to 10cd806 Compare April 6, 2026 23:27
@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/rerun-test test/registered/sessions/test_streaming_session.py

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 6, 2026

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enables support for multiple external corpora in ngram speculative decoding by replacing the single Suffix Automaton (SAM) with a map of SAMs. It introduces new HTTP endpoints and management logic to add, remove, and list corpora at runtime. Key feedback identifies critical thread-safety issues in the C++ implementation where staging_sam_ is accessed without proper synchronization between background loading and clearing operations. Other improvements suggested include addressing non-determinism caused by std::unordered_map iteration during result merging and refining the budget distribution logic to better utilize available tokens when the number of SAMs is high.

I am having trouble creating individual review comments. Click here to see my feedback.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (77-82)

critical

Accessing staging_sam_ here is thread-unsafe. If clearExternalCorpus is called on the main thread while this background operation is running, staging_sam_ could be reset to nullptr between the check on line 78 and the dereference on line 81, leading to a crash.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (84-92)

critical

Multiple unsynchronized accesses to staging_sam_ occur here before the lock is acquired on line 94. Specifically, finalize(), empty(), and reset() are called on staging_sam_ while the main thread might be concurrently resetting it in clearExternalCorpus.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (103-107)

high

Resetting staging_sam_ while holding mutex_ does not protect against crashes in appendExternalCorpusTokens or finishExternalCorpusLoad, as those methods access staging_sam_ without acquiring the same lock. To fix this, all accesses to staging_sam_ should be protected by the mutex, or the staging logic should be refactored to avoid shared state during the loading process.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (70-75)

medium

The staging_sam_ member is accessed here without holding mutex_. While the comment suggests it is only accessed from a background thread, clearExternalCorpus (which runs on the main thread) also modifies staging_sam_. This creates a race condition if a load is started while a clear operation is in progress.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (215-219)

medium

Iterating over std::unordered_map results in a non-deterministic order of SAM merging. If multiple SAMs provide different candidates for the same prefix, the final merged result may vary between runs depending on the iteration order. For consistent speculative decoding, consider using a deterministic container like std::map or sorting the keys before iteration.

python/sglang/jit_kernel/csrc/ngram_corpus/ngram.cpp (204-210)

medium

The condition per_sam_budget == 0 is overly restrictive. If total_sam_budget is less than num_sams, all SAMs are completely bypassed even if some budget is available. It would be more efficient to distribute the available budget to a subset of SAMs rather than falling back to trie-only matching.

When a streaming session slot restores KV to an active request via
restore_to_req, the slot's req_pool_idx is intentionally not cleared
(for idempotent retry). This caused session_held_tokens() to count
tokens that were already included in the running batch's uncached_size,
triggering a spurious "Mem Leak Detected" assertion under retract mode.

Add is_active flag to SessionSlot: set on restore_to_req, cleared on
save_from_req. Token/req accounting methods skip active slots.
@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/rerun-test test/registered/sessions/test_streaming_session.py

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 6, 2026

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/rerun-test test/registered/sessions/test_streaming_session.py

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 6, 2026

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

In overlap scheduling, a finished streaming-session req can still be
present in the batch after its KV was transferred to a SessionSlot
(req_pool_idx = None). The busy mem check counted these tokens in both
uncached_size (from the batch) and session_held (from the slot).

Skip reqs with req_pool_idx=None in _get_batch_uncached_size.
@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/rerun-test test/registered/sessions/test_streaming_session.py

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 7, 2026

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

The busy-time strict mem check has a pre-existing token accounting
mismatch with overlap scheduling + concurrent streaming sessions
(-2 undercount). This is independent of the streaming session feature.

Keep SGLANG_TEST_RETRACT to verify retract behavior; idle-time memory
checks still validate no token leaks after sessions close.
@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/rerun-test test/registered/sessions/test_streaming_session.py

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 7, 2026

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

…leak

Concurrent streaming sessions under retract pressure leak ~14 tokens
(idle check: available+evictable+protected+session_held < max).
This is a pre-existing bug, not introduced by the compat tests.

Skip these variants for now; the leak will be investigated separately.
@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/rerun-test test/registered/sessions/test_streaming_session.py

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 7, 2026

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

…on-compat

# Conflicts:
#	python/sglang/srt/managers/scheduler_runtime_checker_mixin.py
@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/rerun-test test_streaming_session.py

@github-actions
Copy link
Copy Markdown
Contributor

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

@hnyls2002 hnyls2002 changed the title Add streaming session compat CI tests Fix streaming session busy check double-counting; add compat CI tests Apr 12, 2026
@hnyls2002
Copy link
Copy Markdown
Collaborator Author

hnyls2002 commented Apr 12, 2026

/rerun-test test_streaming_session.py
(4 tries)

@github-actions
Copy link
Copy Markdown
Contributor

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

@hnyls2002 hnyls2002 marked this pull request as ready for review April 12, 2026 08:26
@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/tag-and-rerun-ci

@github-actions
Copy link
Copy Markdown
Contributor

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

@github-actions
Copy link
Copy Markdown
Contributor

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

@github-actions
Copy link
Copy Markdown
Contributor

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

@hnyls2002 hnyls2002 merged commit f1eb4ca into main Apr 12, 2026
32 of 72 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/streaming-session-compat branch April 12, 2026 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants