Skip to content

Streaming session: fix retract tail leak via _free_tail#22862

Merged
hnyls2002 merged 5 commits intomainfrom
lsyin/streaming-session-tail-leak
Apr 15, 2026
Merged

Streaming session: fix retract tail leak via _free_tail#22862
hnyls2002 merged 5 commits intomainfrom
lsyin/streaming-session-tail-leak

Conversation

@hnyls2002
Copy link
Copy Markdown
Collaborator

@hnyls2002 hnyls2002 commented Apr 15, 2026

Streaming session: free orphaned KV tail via _free_tail

Streaming sessions leak KV pool indices whenever the next turn's prefix_len is less than the slot's kv_allocated_len. The existing common.py spec-tail-trim only handles the speculative-decoding source. This PR replaces it with a single _free_tail call in match_prefix that handles every source by construction, plus the page-aligned cleanup needed for PagedTokenToKVPoolAllocator.

Root cause

When a streaming session resumes, alloc_for_extend writes new pool indices into req_to_token[pool_idx, prefix_len : seq_len]. Old indices already in that range get overwritten — the allocator still tracks them as in-use, but no req_to_token entry references them. Pure leak.

The leaked range is always [prefix_len, kv_allocated_len). Two mechanisms produce a non-empty gap:

  • Over-allocation pushes allocated above committed — today this is speculative decoding (alloc num_draft+1, commit num_accepted+1); any future general over-allocation pattern produces the same gap.
  • Retract retry pulls prefix_len below committedinit_next_round_input passes fill_ids[:input_len-1] on retract retry, so len(token_ids) is one short of the saved committed length, dropping prefix_len by 1.

Old design and what it missed

release_kv_cache had a streaming-session branch that trimmed [committed, allocated) before save_from_req. Two gaps it didn't catch:

  • Logit-reserve off-by-1 leaks per turn — the trim handles committed < allocated but not prefix_len < committed. Even when the slot saves with committed == allocated = K, the next turn computes prefix_len = K-1 (logit reserve) and alloc_for_extend overwrites position K-1. Slow drip: 1 token per turn under retract or any path that triggers the reserve.
  • Couples streaming logic into common.py — violates the decorator pattern. Reviewers of the generic release path have to remember the streaming special case, and the trim duplicates work that match_prefix is better positioned to do.

New design: _free_tail in match_prefix

Runs once per turn, immediately after restore_to_req and before alloc_for_extend. Frees [prefix_len, kv_allocated_len), then clamps slot + req lengths.

Why this location is correct, not just convenient:

  • prefix_len is the exact boundary alloc_for_extend will overwrite — freeing [prefix_len, kv_allocated_len) matches the overwrite range. No leak, no over-free.
  • Single source covers every gap — spec push, retract pull, future scenarios that produce prefix_len < kv_allocated_len. No per-source branching.
  • No-op in the common case — non-spec, non-retract turns satisfy prefix_len >= kv_allocated_len and early-return. Zero overhead.
  • Decorator pattern preservedcommon.py has zero streaming-session awareness now.

Page-aligned cleanup

PagedTokenToKVPoolAllocator.free returns whole pages to the free list (free_index // page_size), so freeing a partial page would corrupt pages that still hold committed tokens. _free_tail ceil-aligns prefix_len to page_size before freeing; the unaligned tail in [prefix_len, free_start) stays attached to the slot and is freed when release_session returns the whole page.

Bonus

  • Drop the buggy prefix_len = min(committed, max(len(token_ids) - 1, 0)) to prefix_len = min(committed, len(token_ids)). The extra -1 was double-counting the already-applied 1-token logit reserve.
  • Add assert prefix_len >= slot.cache_protected_len — the append-only invariant established by Refactor streaming session abort handling #22790's finish_req pattern is now a checked property.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors KV cache management for streaming sessions by moving speculative tail trimming and bookkeeping logic from common.py into SessionAwareCache. Key changes include updating the prefix_len calculation in match_prefix and introducing a _free_tail method to handle orphaned KV indices. Feedback indicates that _free_tail lacks necessary page alignment logic for page_size > 1, which could lead to memory corruption. Additionally, it is recommended to restore the detailed docstring in release_session to maintain technical context regarding radix tree splits and prevent future regressions.

Comment thread python/sglang/srt/mem_cache/session_aware_cache.py
Comment thread python/sglang/srt/mem_cache/session_aware_cache.py
@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/rerun-test test_streaming_session.py test_streaming_session_unit.py

@github-actions
Copy link
Copy Markdown
Contributor

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py

ubuntu-latest (1 test): View workflow run

cd test/ && python3 registered/unit/mem_cache/test_streaming_session_unit.py

@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/tag-and-rerun-ci

@hnyls2002
Copy link
Copy Markdown
Collaborator Author

/rerun-test test_streaming_session.py test_streaming_session_unit.py test_session_control.py test_session_latency.py

@github-actions
Copy link
Copy Markdown
Contributor

1-gpu-h100 (3 tests): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py
cd test/ && python3 registered/sessions/test_session_control.py
cd test/ && python3 registered/sessions/test_session_latency.py

ubuntu-latest (1 test): View workflow run

cd test/ && python3 registered/unit/mem_cache/test_streaming_session_unit.py

@hnyls2002 hnyls2002 merged commit ce31934 into main Apr 15, 2026
96 of 148 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/streaming-session-tail-leak branch April 15, 2026 08:44
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant