streaming session: spec v2 bonus accounting + comprehensive test matrix#22651
Merged
Summary
Summary / capstone PR for the streaming-session correctness work. Two roles:
spec_v2 + streaming session incompatibility gate, and adds end-to-end test coverage across all meaningful configuration combinations.
Read this PR if you want the overview; follow the links to #22862 / #22897 for the root analyses.
Dependent PR chain
- `TestStreamingSessionAbortLeakRepro` inherits stdout/stderr instead of tempfile (so server logs surface in CI failure output)
- `kv_committed_len` / `kv_allocated_len` / `slot.kv_*` vocabulary
- `_free_tail` partial-page free corrupted committed pages on retract retry. #22862's body has the full analysis of why `match_prefix` is the right location for the free, why page alignment is required, and how both spec over-allocation and the retract alloc-commit gap funnel through the same boundary.
- `max_new_tokens` and SWA cursor leaks; trim the slot at finish. #22897's body includes a full appendix with the root analysis (definitions, case analysis, how the fixes compose) that also covers the bonus-accounting fix below.
Bonus accounting fix (this PR)
Spec v2 (overlap + EAGLE) bumps `kv_committed_len` post-facto in `_resolve_spec_overlap_token_ids`.
Normal decode does the +1 upfront in `prepare_for_decode` (it claims the pending-bonus slot before the forward runs). Because spec v2 does it post-facto, there is a one-token race: on the finishing round, `cache_finished_req` fires before the next iteration's resolve catches up, so `save_from_req` captures `committed = origin + finished_len - 1`. The next turn inherits one token short, and the EagleV2 inheritance test fails by 1.
The fix mirrors the normal-decode pattern:
- `EagleDraftInput.prepare_for_decode`: `r.kv_committed_len += 1` (pre-claim the bonus slot)
- `_resolve_spec_overlap_token_ids`: `req.kv_committed_len += accept_lens[i] - 1` (subtract the upfront claim)
Net per round is unchanged (`+= accept_lens`); the timing moves from post-facto to upfront, so `cache_finished_req` always sees the post-extra-round committed value.
Also removes the spec v2 + streaming-session incompatibility gate in `tokenizer_communicator_mixin.py`. The accounting bugs that motivated the gate are fixed by the chain above (#22790, #22862, #22897, #22900, and this PR); spec v2 + streaming + retract now passes under strict mem check.
Test coverage
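The one-token discrepancy that the EagleV2 inheritance test catches can be illustrated with a toy model. This is only a sketch of the two timings described in the bonus-accounting fix, not the sglang code: `ToyReq`, `post_facto_round`, `upfront_round`, and `simulate` are hypothetical names, and the "snapshot" is an assumed stand-in for whatever reads `kv_committed_len` between the prepare step and the resolve step.

```python
from dataclasses import dataclass


@dataclass
class ToyReq:
    """Hypothetical stand-in for a request; tracks only the committed KV length."""
    kv_committed_len: int = 0


def post_facto_round(req: ToyReq, accept_len: int) -> int:
    """Old timing: nothing is claimed before the forward; resolve adds accept_len."""
    snapshot = req.kv_committed_len          # value seen before resolve runs
    req.kv_committed_len += accept_len       # resolve step
    return snapshot


def upfront_round(req: ToyReq, accept_len: int) -> int:
    """New timing: pre-claim the bonus slot, then resolve adds accept_len - 1."""
    req.kv_committed_len += 1                # prepare step (pre-claim bonus slot)
    snapshot = req.kv_committed_len          # value seen before resolve runs
    req.kv_committed_len += accept_len - 1   # resolve step (subtract upfront claim)
    return snapshot


def simulate(round_fn, origin: int, accept_lens: list[int]):
    """Run every round and return (final committed length, per-round snapshots)."""
    req = ToyReq(origin)
    snapshots = [round_fn(req, n) for n in accept_lens]
    return req.kv_committed_len, snapshots


accept_lens = [3, 1, 2]                      # made-up accept lengths per round
old_final, old_seen = simulate(post_facto_round, 100, accept_lens)
new_final, new_seen = simulate(upfront_round, 100, accept_lens)
deltas = [n - o for n, o in zip(new_seen, old_seen)]
```

Under these assumptions both timings end at the same committed length, while every pre-resolve snapshot is exactly one token higher with the upfront claim, which is the gap the inheritance test observes across turns.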
What this PR adds
12 test classes across two files, pruned by strict-superset coverage: each retract variant already exercises the non-retract path, and `page=256` tests subsume `page=1` (since page-aligned free / allocator paths only activate at `page>1`).
- `TestStreamingSession`, `RetractMixedChunk`, `RetractLargePage`, `AbortLeakRepro`: `--chunked-prefill-size 512` (or tight for abort stress)
- `Eagle`, `EagleRetractLargePage`: `--disable-overlap-schedule`
- `EagleV2`, `EagleV2RetractLargePage`: `SGLANG_ENABLE_SPEC_V2=1`
- `SWA`, `SWARetractLargePage`, `SWARetractMixedChunk`, `SWAAbortLeakRepro`
All use `SGLANG_ENABLE_STRICT_MEM_CHECK_DURING_BUSY=2` so any KV accounting drift fails the test.
Per-class shared methods
Every subclass of `TestStreamingSession` inherits and runs:
- `test_kv_cache_inheritance`: verify cached_tokens carries forward correctly across turns (skipped on page > 1 classes; a constant offset doesn't fit when cached_tokens rounds to page boundaries)
- `test_leak_logprob_concurrent`: concurrent multi-session with logprob, watch for KV leak
- `test_stress_concurrent_sessions`: high-concurrency streaming + non-streaming
- `test_preabort_recovery`: pre-abort (rejected by `create_req`) doesn't corrupt session
- `test_first_mid_abort_recovery`: abort the very first request mid-decode, recovery starts fresh
- `test_nth_mid_abort_recovery`: abort an nth request mid-decode, recovery rolls back to last successful
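The inherit-and-run pattern above can be sketched with plain `unittest`. This is a minimal illustration, not the actual test file: the class names, the `page_size` and `extra_server_args` attributes, and the placeholder test bodies are all hypothetical, and no server is launched.

```python
import unittest


class ToyStreamingSessionBase(unittest.TestCase):
    """Hypothetical base class: every variant inherits these test methods."""

    page_size = 1                                      # assumed per-class knob
    extra_server_args = ("--chunked-prefill-size", "512")

    def test_kv_cache_inheritance(self):
        # Skipped on page > 1 variants, mirroring the rule stated in the list.
        if self.page_size > 1:
            self.skipTest("cached_tokens rounds to page boundaries")
        self.assertEqual(self.page_size, 1)            # placeholder check

    def test_shared_behavior(self):
        # Placeholder for a shared check that every variant runs unchanged.
        self.assertIsInstance(self.extra_server_args, tuple)


class ToyRetractLargePage(ToyStreamingSessionBase):
    """Hypothetical variant: only the configuration differs, tests are inherited."""

    page_size = 256


def run_case(case: type) -> unittest.TestResult:
    """Load and run one TestCase class, returning the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result


base_result = run_case(ToyStreamingSessionBase)
large_result = run_case(ToyRetractLargePage)
```

Keeping the configuration in class attributes means a new variant is a few lines long, and a skip inside the shared method keeps the page > 1 exception in one place instead of repeating it per subclass.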