fix: resume sessions on dirty shutdown instead of suspending #18922
Closed
NeloReis wants to merge 1 commit into
When the gateway exits without a clean shutdown marker (drain timeout, crash, systemd kill), sessions were hard-suspended (wiped). This forced users to restart work from scratch, wasting tokens and losing context. Change suspend_recently_active() to mark sessions as resume_pending instead of suspended. The session history is preserved and the user seamlessly continues where they left off. Stuck-loop escalation (3+ consecutive dirty restarts) still hard-suspends genuinely broken sessions as a safety net.
Contributor
This appears to be implemented on current Evidence:
Thanks for the report and patch; this was useful prior work, and current
Problem
When the gateway exits without a clean shutdown marker (drain timeout, crash, systemd kill), suspend_recently_active() hard-suspends all recently active sessions. This wipes conversation history, forcing users to restart work from scratch.
This is especially painful for Telegram/messaging users, for whom session context is expensive to rebuild (token cost, unfinished tasks lost).
Root Cause
suspend_recently_active() sets entry.suspended = True on all sessions updated within the last 120 seconds when no .clean_shutdown marker exists. The suspended flag causes get_or_create_session() to hard-reset the session on next access.
The drain-timeout path already calls mark_resume_pending() for actively running agents, but sessions that were idle (just finished a turn, waiting for the next message) are not in _running_agents and get missed, then hard-suspended by suspend_recently_active().
Fix
Change suspend_recently_active() to mark sessions as resume_pending=True with resume_reason="dirty_shutdown" instead of suspended=True. This preserves session history and lets the user seamlessly continue.
Safety preserved: The stuck-loop detector (_suspend_stuck_loop_sessions) still runs after this and hard-suspends any session that causes 3+ consecutive dirty restarts. That catches genuinely broken/stuck sessions.
Changes
gateway/session.py: suspend_recently_active() now sets resume_pending=True instead of suspended=True, and also skips already-suspended entries
gateway/run.py: Updated log messages to reflect new behavior
Testing
Verified on a live Telegram gateway: restart without clean shutdown → session marked as resume-pending → user message after restart resumes same session with full history intact.
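For readers who want to reason about the flow without a live gateway, the behavior described above can be modeled roughly as follows. This is a minimal sketch, not the code in gateway/session.py: the SessionEntry dataclass, the dirty_restarts counter, the function signatures, and the way the clean-shutdown marker is passed in as a boolean are all assumptions made for illustration. Only the flag names, the 120-second window, the "dirty_shutdown" reason, and the 3-restart threshold come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

RECENT_WINDOW_SECONDS = 120  # window stated in the PR description
STUCK_LOOP_THRESHOLD = 3     # "3+ consecutive dirty restarts" per the description


@dataclass
class SessionEntry:
    """Hypothetical stand-in for the gateway's session entry."""
    updated_at: float
    history: List[str] = field(default_factory=list)
    suspended: bool = False
    resume_pending: bool = False
    resume_reason: Optional[str] = None
    dirty_restarts: int = 0  # assumed counter; the real detector may track this differently


def suspend_recently_active(sessions: Dict[str, SessionEntry],
                            now: float, clean_shutdown: bool) -> int:
    """Mark recently active sessions as resume-pending after a dirty shutdown."""
    if clean_shutdown:
        # A clean shutdown breaks the run of consecutive dirty restarts.
        for entry in sessions.values():
            entry.dirty_restarts = 0
        return 0
    marked = 0
    for entry in sessions.values():
        if entry.suspended:
            continue  # skip already-suspended entries
        if now - entry.updated_at > RECENT_WINDOW_SECONDS:
            continue  # not recently active
        entry.resume_pending = True  # previously: entry.suspended = True
        entry.resume_reason = "dirty_shutdown"
        entry.dirty_restarts += 1
        marked += 1
    return marked


def suspend_stuck_loop_sessions(sessions: Dict[str, SessionEntry]) -> None:
    """Safety net: hard-suspend sessions seen in 3+ consecutive dirty restarts."""
    for entry in sessions.values():
        if entry.dirty_restarts >= STUCK_LOOP_THRESHOLD:
            entry.suspended = True
            entry.resume_pending = False


def get_or_create_session(sessions: Dict[str, SessionEntry],
                          key: str, now: float) -> SessionEntry:
    """Return the existing session, or a fresh one if missing or suspended."""
    entry = sessions.get(key)
    if entry is None or entry.suspended:
        entry = SessionEntry(updated_at=now)  # hard reset: history is lost
        sessions[key] = entry
    return entry


if __name__ == "__main__":
    sessions = {"chat-1": SessionEntry(updated_at=1000.0, history=["hello"])}
    suspend_recently_active(sessions, now=1060.0, clean_shutdown=False)
    suspend_stuck_loop_sessions(sessions)
    entry = get_or_create_session(sessions, "chat-1", now=1061.0)
    print(entry.resume_pending, entry.history)  # prints: True ['hello']
```

In this model a single dirty restart leaves the history intact, while a session that is recently active across three dirty restarts in a row is still hard-suspended and reset on next access.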