fix: streaming session race condition + some metrics #21875
/tag-and-rerun-ci |
Closing a streaming session while a request is actively decoding frees KV pool indices out from under the scheduler, corrupting memory accounting and hanging the engine.
- Add `pending_close` flag to Session; `_close()` returns early when an unfinished request exists instead of calling `release_session()` on live memory.
- `maybe_reap()` polls `pending_close` sessions each second and completes the release once all requests finish.
- Scheduler rejects new requests for `pending_close` sessions (treated as "session not found").
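The deferred-close behavior described in this commit can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: `SessionController`, `Request`, `has_unfinished_request`, `accepts_request`, and `_release` are hypothetical names, and the real `release_session()` frees KV pool memory rather than setting a flag.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    finished: bool = False


@dataclass
class Session:
    session_id: str
    requests: list = field(default_factory=list)
    pending_close: bool = False
    released: bool = False  # stands in for "KV has been returned to the pool"

    def has_unfinished_request(self) -> bool:
        return any(not r.finished for r in self.requests)


class SessionController:
    """Hypothetical owner of all sessions (an assumption, not sglang's class)."""

    def __init__(self):
        self.sessions = {}

    def _close(self, session_id: str) -> bool:
        """Return True if released immediately, False if the close was deferred."""
        session = self.sessions[session_id]
        if session.has_unfinished_request():
            # A request is still decoding: do not touch its memory yet.
            session.pending_close = True
            return False
        self._release(session)
        return True

    def maybe_reap(self) -> None:
        """Meant to be polled periodically; completes deferred closes."""
        for session in list(self.sessions.values()):
            if session.pending_close and not session.has_unfinished_request():
                self._release(session)

    def accepts_request(self, session_id: str) -> bool:
        """A pending_close session is treated like a missing session."""
        session = self.sessions.get(session_id)
        return session is not None and not session.pending_close

    def _release(self, session: Session) -> None:
        session.released = True
        del self.sessions[session.session_id]
```

In this sketch the release only ever happens at a point where no request of the session is unfinished, which is the invariant the commit message describes.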
Only log "Deferring session close" on the first deferral. Subsequent polls while pending_close is already set are silent. The "Deferred close ready" log still fires when the session actually releases. Eliminates ~90 duplicate log lines per session close during long generations with TP>1.
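One way to get the "log only on the first deferral" behavior is to check the flag before setting it. The sketch below assumes a hypothetical `defer_close` helper and logger name; the actual log call sites in the PR may look different.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("streaming_session_sketch")  # hypothetical logger name


@dataclass
class Session:
    session_id: str
    pending_close: bool = False


def defer_close(session: Session) -> bool:
    """Mark the session pending_close; return True only when a line was logged."""
    first_deferral = not session.pending_close
    session.pending_close = True
    if first_deferral:
        # Later polls see pending_close already set and stay silent.
        logger.info("Deferring session close for %s", session.session_id)
    return first_deferral
```

Calling `defer_close` repeatedly on the same session logs once, no matter how many polls happen before the session is released.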
…better # Conflicts: # test/registered/sessions/test_streaming_session.py
…; reject closing session
|
/rerun-test test_streaming_session_unit.py test_streaming_session.py test_session_control.py |
|
✅ ✅ |
|
/rerun-test test_streaming_session.py test_session_control.py test_streaming_session_unit.py |
|
✅ ✅ |
|
/rerun-test test_session_latency.py test_streaming_session.py test_session_control.py test_streaming_session_unit.py |
|
✅ ✅ |
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Summary
Stabilize streaming session KV lifecycle: fix multiple memory leaks, add safety for session close during active decoding, and add Prometheus metrics.
Race conditions & lifecycle
- `open_session` waiter race: register the future before sending `OpenSessionReqInput`, clean up defensively, and tolerate late responses
- `OpenSessionReqOutput` under DP-attention: every rank opens the session, but only the designated scheduler rank returns the response
KV memory leak fixes
- … (`common.py`): trim speculative/overallocated KV tail before transferring pool ownership to `SessionSlot`, preventing stranded pages between turns
- `match_prefix`: when a request is destined for abort (e.g. input too long), skip `restore_to_req` so the aborted request gets a fresh pool slot instead of overwriting the session's accumulated KV
- `cache_finished_req`: free the transient 1-token KV from aborted requests without touching the session slot
- … (`_resolve_release_state`): re-match the tree prefix at release time and intersect with the slot's row to handle tree splits that occurred during the session's lifetime
- … `slot.cache_protected_len` so it never grows past what the session actually owns
- … `kv_committed_freed` and `kv_overallocated_freed` after `SessionAwareCache` transfers or frees KV, so busy-time memory checks don't double-count
- … `save_from_req` overwrites the slot with a smaller committed length
Includes the fix from the merged PR #22273 which added the abort-skip and abort-cleanup paths.
Observability
- `sglang:num_streaming_sessions` and `sglang:streaming_session_held_tokens` Prometheus gauges, gated behind `enable_streaming_session`
Test plan
- … (`test_streaming_session_unit.py`, 7 tests)
- … (`test_streaming_session.py`)
🤖 Generated with Claude Code