Refactor streaming session abort handling#22790
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements a rollback mechanism for aborted session requests to maintain session integrity when errors occur during request processing. Changes include tracking the previous request state in the Session class and invoking a rollback in the scheduler's error handling paths. A review comment suggests refactoring the duplicated rollback logic in handle_generate_request into a helper method to improve code maintainability.
| if req.session is not None: | ||
| req.session.rollback_aborted_req(req.rid) |
There was a problem hiding this comment.
This logic for rolling back a session request is duplicated in four places within handle_generate_request. To improve code clarity and maintainability, consider extracting this logic into a helper method.
For example, you could add a method to the Scheduler class:
def _rollback_session_req(self, req: Req):
if req.session is not None:
req.session.rollback_aborted_req(req.rid)Then you could replace this block and the other three identical blocks with a call to self._rollback_session_req(req).
self._rollback_session_req(req)65b9616 to
e38ffbe
Compare
e38ffbe to
ee7f6a2
Compare
8d36186 to
ada7d06
Compare
…; e2e + unit tests
0651c59 to
665c53e
Compare
|
/rerun-test test_streaming_session.py test_streaming_session_unit.py (x3) |
|
✅ ✅ |
|
✅ ✅ |
|
✅ ✅ |
|
/rerun-test test_streaming_session.py test_streaming_session_unit.py test_session_control.py test_session_latency.py |
|
✅ ✅ |
|
/rerun-test test_session_control.py test_session_latency.py test_streaming_session.py |
Streaming session abort refactoring
Streaming sessions had no proper abort handling — abort could corrupt session state, leak KV memory, or crash the server. This PR fixes 5 independent problems and adds comprehensive test coverage.
Design
Abort = wipe KV, keep tokens. When any streaming request is aborted, the session's KV slot is completely released (tokens freed, pool slot returned, tree lock dropped). But the previous turn's token IDs stay in
req_nodes, so the next request can re-prefill from the last successful turn and continue the conversation normally.This replaces the previous behavior where abort tried to preserve partial state — which was fragile because
kv_committed_lenandkv_allocated_lencould be inflated byprepare_for_extendmid-decode, and saving that state back to the slot corrupted future turns. The new design sidesteps all partial-state reasoning: abort wipes everything, the next request rebuilds from the last known good point.Two high-level shifts:
release_session. No more ad-hoc cleanup branches.req_nodesis the single source of truth for previous token IDs. It holds the last successful turn's token IDs and is updated only on successful finish, never on abort. The next turn's input is always rebuilt fromreq_nodes, independent of whatever partial KV state the aborted request left behind. Token IDs and KV state are fully decoupled.Problems & Fixes
P1:
create_reqpopitem destroys append-only semantics;_closemisses running requestcreate_requsedreq_nodes.popitem()to get the last successful req. If the new request was aborted,req_nodeswas left empty — the next request lost all conversation history.Fix: peek (don't pop)
req_nodes. Addfinish_req()to updatereq_nodesonly on successful completion. On abort,req_nodesstays pointing at the last successful request so the next turn can still inherit from it.With this change,
req_nodestracks the last successful request, not the currently running one._closeneeds a different signal to detect in-flight requests — introduceSession._inflightas the single source of truth (also used for concurrency control in P2)._closechecks_inflightand defers cleanup via the existingclose_on_finishmechanism when a request is running.P2: No concurrency control on streaming sessions
Nothing prevented two concurrent requests on the same streaming session. Both could call
restore_to_reqand share the samereq_pool_idx, corrupting each other's KV memory.Fix: add
Session._inflightflag. Set oncreate_req, cleared byfinish_req/abort_req. Concurrent requests are immediately pre-aborted.P3: Mid-processing abort saves corrupted state
When
/abort_requestaborted a decoding request,prepare_for_extendhad already inflatedkv_committed_len. The existing code still calledsave_from_req, writing this corrupted state back to the slot. The next turn inherited KV that didn't match its input tokens.Fix: mid-processing abort = nuke (wipe fully). Call
release_sessionto wipe all KV. The next request re-prefills from scratch (token IDs stay inreq_nodesvia the finish_req pattern from P1).P4: First-request abort cleanup incomplete
When the first request on a session was aborted mid-processing, the code created a new slot and called
save_from_reqwith the aborted request's garbage state (inflated lengths, partial decode output). This slot persisted and corrupted subsequent requests.Fix: create an ephemeral
SessionSlotfrom the request's state (includinglast_nodeandcache_protected_lenfor properdec_lock_ref), then immediately nuke it viarelease_session. No slot remains.P5: Pre-aborted requests enter streaming abort path
Scheduler-level pre-aborts (e.g., input too long) kept
req.sessionset. Incache_finished_req, the streaming abort handler would nuke the session slot even though the request never touched it — and even though another request might be actively using it (from P2 before the fix).Fix: in
match_prefix, whento_finishis already set (pre-abort), detach the request from the session (abort_req()+session = None) before delegating to inner cache. The request is handled as a regular (non-streaming) request. The session slot is completely untouched.Test coverage
All tests run under
SGLANG_ENABLE_STRICT_MEM_CHECK_DURING_BUSY=2(strict memory accounting).Unit tests (
test_streaming_session_unit.py):test_preabort_detaches_session_and_preserves_slot— match_prefix detaches session, slot untouchedtest_first_mid_abort_nukes_ephemeral_slot— ephemeral slot created and nuked, no residualtest_nth_mid_abort_nukes_session_slot— existing slot nuked via release_sessionE2E tests (
test_streaming_session.py, inherited by all subclasses):test_preabort_recovery— unsupported offset triggers pre-abort; slot preserved;prompt_tokensverifiedtest_first_mid_abort_recovery— first request aborted mid-decode;prompt_tokens= fresh input onlytest_nth_mid_abort_recovery— nth request aborted mid-decode;prompt_tokens= rollback to last successful turn