fix: streaming session race condition + some metrics#21875

Merged
hnyls2002 merged 41 commits into main from ishan/streaming-sesh-better
Apr 13, 2026

@ishandhanani
Collaborator

@ishandhanani ishandhanani commented Apr 1, 2026

Summary

Stabilize streaming session KV lifecycle: fix multiple memory leaks, add safety for session close during active decoding, and add Prometheus metrics.

Race conditions & lifecycle

  • Fix open_session waiter race: register the future before sending OpenSessionReqInput, clean up defensively, and tolerate late responses
  • Deduplicate OpenSessionReqOutput under DP-attention: every rank opens the session, but only the designated scheduler rank returns the response
  • Defer session close when a request is still decoding: mark for cleanup after the in-flight request finishes, preventing KV corruption
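
The waiter-race fix in the first bullet can be illustrated with a minimal asyncio sketch. This is not the PR's code: `SessionOpener`, `handle_response`, and the `send` callable are hypothetical names, and the real tokenizer-manager plumbing is assumed to differ. The sketch only shows the ordering (register the future, then send), the defensive cleanup, and the tolerance for late responses.

```python
import asyncio


class SessionOpener:
    """Hypothetical stand-in for the component that opens sessions."""

    def __init__(self, send):
        self._send = send    # async callable that delivers the open request
        self._waiters = {}   # session_id -> Future awaiting the response

    async def open_session(self, session_id, timeout=1.0):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        # Register the waiter BEFORE sending, so a response that arrives
        # immediately still finds a future to resolve.
        self._waiters[session_id] = fut
        try:
            await self._send(session_id)
            return await asyncio.wait_for(fut, timeout)
        finally:
            # Defensive cleanup: runs on success, timeout, and cancellation.
            self._waiters.pop(session_id, None)

    def handle_response(self, session_id, ok):
        """Resolve the waiter; return False for late or duplicate responses."""
        fut = self._waiters.get(session_id)
        if fut is None or fut.done():
            return False  # nobody is waiting any more: ignore, do not raise
        fut.set_result(ok)
        return True
```

With this ordering, a response delivered while `send` is still running resolves the already-registered future, and a response that arrives after the waiter was removed is dropped quietly.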

KV memory leak fixes

  • Overalloc tail trim (common.py): trim speculative/overallocated KV tail before transferring pool ownership to SessionSlot, preventing stranded pages between turns
  • Abort-skip in match_prefix: when a request is destined for abort (e.g. input too long), skip restore_to_req so the aborted request gets a fresh pool slot instead of overwriting the session's accumulated KV
  • Abort KV cleanup in cache_finished_req: free the transient 1-token KV from aborted requests without touching the session slot
  • Release accounting (_resolve_release_state): re-match the tree prefix at release time and intersect with the slot's row to handle tree splits that occurred during the session's lifetime
  • Owned-prefix cap: cap the release-time rematch to slot.cache_protected_len so it never grows past what the session actually owns
  • Cleanup flags sync: mark kv_committed_freed and kv_overallocated_freed after SessionAwareCache transfers or frees KV, so busy-time memory checks don't double-count
  • Session shrink fix: when a client retries with a shorter prompt (e.g. after conversation compaction), free orphaned tail pages before save_from_req overwrites the slot with a smaller committed length

Includes the fix from merged PR #22273, which added the abort-skip and abort-cleanup paths.
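
Two of the items above, the session shrink fix and the owned-prefix cap, reduce to small pieces of arithmetic. The sketch below is an illustration under stated assumptions, not the PR's implementation: the function names are hypothetical, and it assumes KV is allocated in whole pages, so a page still partly covered by the new committed length stays with the slot.

```python
def ceil_to_page(n, page_size):
    """Round a token count up to the next page boundary."""
    return -(-n // page_size) * page_size


def orphaned_tail(old_committed_len, new_committed_len, page_size):
    """Token index range [start, end) left unowned when a session shrinks.

    Assumption: pages are allocated whole, so only pages entirely past the
    new committed length can be freed.
    """
    old_end = ceil_to_page(old_committed_len, page_size)
    new_end = ceil_to_page(new_committed_len, page_size)
    if new_end >= old_end:
        return range(0)  # no shrink, or shrink within the same last page
    return range(new_end, old_end)


def capped_rematch_len(rematched_len, cache_protected_len):
    """Cap a release-time rematch to what the session actually owns."""
    return min(rematched_len, cache_protected_len)
```

For example, shrinking from 100 to 40 committed tokens with a page size of 16 leaves token indices 48 through 111 to free, while a rematch of 120 tokens against a `cache_protected_len` of 96 is capped to 96.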

Observability

  • sglang:num_streaming_sessions and sglang:streaming_session_held_tokens Prometheus gauges, gated behind enable_streaming_session

Test plan

  • Unit tests for overalloc trim, release rematch, prefix cap, abort-skip, abort accounting, session shrink, and page-aligned shrink (test_streaming_session_unit.py, 7 tests)
  • Integration test for abort-heavy streaming session workload (test_streaming_session.py)
  • Cluster validation on sa-b200 with MiniMax M2.5 FP8: leak reduced from 534K tokens to ~4K (page-alignment residual) across 5 input-length aborts

🤖 Generated with Claude Code

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@ishandhanani
Collaborator Author

/tag-and-rerun-ci

ishandhanani and others added 5 commits April 2, 2026 02:19
Closing a streaming session while a request is actively decoding
frees KV pool indices out from under the scheduler, corrupting
memory accounting and hanging the engine.

- Add `pending_close` flag to Session; `_close()` returns early
  when an unfinished request exists instead of calling
  `release_session()` on live memory.
- `maybe_reap()` polls pending_close sessions each second and
  completes the release once all requests finish.
- Scheduler rejects new requests for pending_close sessions
  (treated as "session not found").

Only log "Deferring session close" on the first deferral. Subsequent
polls while pending_close is already set are silent. The "Deferred
close ready" log still fires when the session actually releases.

Eliminates ~90 duplicate log lines per session close during long
generations with TP>1.
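
The deferred-close behaviour described in these two commit messages can be sketched as follows. This is a simplified model under assumptions, not the scheduler's code: `SessionController`, `unfinished_reqs`, `accepts_requests`, and `_release` are hypothetical names, while `pending_close`, `maybe_reap`, and the two log messages come from the commit text.

```python
import logging

logger = logging.getLogger("session_sketch")


class Session:
    def __init__(self, session_id):
        self.session_id = session_id
        self.unfinished_reqs = 0    # hypothetical counter of in-flight requests
        self.pending_close = False  # set when a close had to be deferred
        self.released = False


class SessionController:
    """Hypothetical owner of all sessions on one scheduler."""

    def __init__(self):
        self.sessions = {}

    def open(self, session_id):
        self.sessions[session_id] = Session(session_id)

    def close(self, session_id):
        """Return True if released now, False if deferred or unknown."""
        session = self.sessions.get(session_id)
        if session is None:
            return False
        if session.unfinished_reqs > 0:
            # Log only on the first deferral; repeated calls stay silent.
            if not session.pending_close:
                logger.info("Deferring session close: %s", session_id)
                session.pending_close = True
            return False
        self._release(session)
        return True

    def accepts_requests(self, session_id):
        # A pending_close session is treated like "session not found".
        session = self.sessions.get(session_id)
        return session is not None and not session.pending_close

    def maybe_reap(self):
        """Called periodically; finishes closes whose requests have drained."""
        for session in list(self.sessions.values()):
            if session.pending_close and session.unfinished_reqs == 0:
                logger.info("Deferred close ready: %s", session.session_id)
                self._release(session)

    def _release(self, session):
        session.released = True
        del self.sessions[session.session_id]
```

The point of the design is that live memory is never released from inside `close()`; the release happens only once the periodic poll sees that no request is still running.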
@hnyls2002
Collaborator

/rerun-test test_streaming_session_unit.py test_streaming_session.py test_session_control.py

@github-actions
Contributor

ubuntu-latest (1 test): View workflow run

cd test/ && python3 registered/unit/mem_cache/test_streaming_session_unit.py

1-gpu-h100 (2 tests): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py
cd test/ && python3 registered/sessions/test_session_control.py

@hnyls2002
Collaborator

/rerun-test test_streaming_session.py test_session_control.py test_streaming_session_unit.py

@github-actions
Contributor

1-gpu-h100 (2 tests): View workflow run

cd test/ && python3 registered/sessions/test_streaming_session.py
cd test/ && python3 registered/sessions/test_session_control.py

ubuntu-latest (1 test): View workflow run

cd test/ && python3 registered/unit/mem_cache/test_streaming_session_unit.py

@hnyls2002
Collaborator

/rerun-test test_session_latency.py test_streaming_session.py test_session_control.py test_streaming_session_unit.py

@github-actions
Contributor

1-gpu-h100 (3 tests): View workflow run

cd test/ && python3 registered/sessions/test_session_latency.py
cd test/ && python3 registered/sessions/test_streaming_session.py
cd test/ && python3 registered/sessions/test_session_control.py

ubuntu-latest (1 test): View workflow run

cd test/ && python3 registered/unit/mem_cache/test_streaming_session_unit.py

@hnyls2002 hnyls2002 merged commit c1ab68b into main Apr 13, 2026
145 of 167 checks passed
@hnyls2002 hnyls2002 deleted the ishan/streaming-sesh-better branch April 13, 2026 01:05
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>