Streaming session: fix retract tail leak via _free_tail#22862
Streaming session: fix retract tail leak via _free_tail#22862
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors KV cache management for streaming sessions by moving speculative tail trimming and bookkeeping logic from common.py into SessionAwareCache. Key changes include updating the prefix_len calculation in match_prefix and introducing a _free_tail method to handle orphaned KV indices. Feedback indicates that _free_tail lacks necessary page alignment logic for page_size > 1, which could lead to memory corruption. Additionally, it is recommended to restore the detailed docstring in release_session to maintain technical context regarding radix tree splits and prevent future regressions.
|
/rerun-test test_streaming_session.py test_streaming_session_unit.py |
|
✅ ✅ |
|
/tag-and-rerun-ci |
|
/rerun-test test_streaming_session.py test_streaming_session_unit.py test_session_control.py test_session_latency.py |
|
✅ ✅ |
Streaming session: free orphaned KV tail via
_free_tailStreaming sessions leak KV pool indices whenever the next turn's
prefix_lenis less than the slot'skv_allocated_len. The existingcommon.pyspec-tail-trim only handles the speculative-decoding source. This PR replaces it with a single_free_tailcall inmatch_prefixthat handles every source by construction, plus the page-aligned cleanup needed forPagedTokenToKVPoolAllocator.Root cause
When a streaming session resumes,
alloc_for_extendwrites new pool indices intoreq_to_token[pool_idx, prefix_len : seq_len]. Old indices already in that range get overwritten — the allocator still tracks them as in-use, but noreq_to_tokenentry references them. Pure leak.The leaked range is always
[prefix_len, kv_allocated_len). Two mechanisms produce a non-empty gap:allocatedabovecommitted— today this is speculative decoding (allocnum_draft+1, commitnum_accepted+1); any future general over-allocation pattern produces the same gap.prefix_lenbelowcommitted—init_next_round_inputpassesfill_ids[:input_len-1]on retract retry, solen(token_ids)is one short of the saved committed length, droppingprefix_lenby 1.Old design and what it missed
release_kv_cachehad a streaming-session branch that trimmed[committed, allocated)beforesave_from_req. Two gaps it didn't catch:committed < allocatedbut notprefix_len < committed. Even when the slot saves withcommitted == allocated = K, the next turn computesprefix_len = K-1(logit reserve) andalloc_for_extendoverwrites positionK-1. Slow drip: 1 token per turn under retract or any path that triggers the reserve.common.py— violates the decorator pattern. Reviewers of the generic release path have to remember the streaming special case, and the trim duplicates work thatmatch_prefixis better positioned to do.New design:
_free_tailinmatch_prefixRuns once per turn, immediately after
restore_to_reqand beforealloc_for_extend. Frees[prefix_len, kv_allocated_len), then clamps slot + req lengths.Why this location is correct, not just convenient:
prefix_lenis the exact boundaryalloc_for_extendwill overwrite — freeing[prefix_len, kv_allocated_len)matches the overwrite range. No leak, no over-free.prefix_len < kv_allocated_len. No per-source branching.prefix_len >= kv_allocated_lenand early-return. Zero overhead.common.pyhas zero streaming-session awareness now.Page-aligned cleanup
PagedTokenToKVPoolAllocator.freereturns whole pages to the free list (free_index // page_size), so freeing a partial page would corrupt pages that still hold committed tokens._free_tailceil-alignsprefix_lentopage_sizebefore freeing; the unaligned tail in[prefix_len, free_start)stays attached to the slot and is freed whenrelease_sessionreturns the whole page.Bonus
prefix_len = min(committed, max(len(token_ids) - 1, 0))toprefix_len = min(committed, len(token_ids)). The extra-1was double-counting the already-applied 1-token logit reserve.assert prefix_len >= slot.cache_protected_len— the append-only invariant established by Refactor streaming session abort handling #22790'sfinish_reqpattern is now a checked property.