fix: resume sessions on dirty shutdown instead of suspending #18922

Closed
NeloReis wants to merge 1 commit into NousResearch:main from NeloReis:fix/session-resume-on-restart

NeloReis commented May 2, 2026

Problem

When the gateway exits without a clean shutdown marker (drain timeout, crash, systemd kill), suspend_recently_active() hard-suspends all recently active sessions. This wipes conversation history, forcing users to restart work from scratch.

This is especially painful for Telegram/messaging users where session context is expensive to rebuild (token cost, unfinished tasks lost).

Root Cause

suspend_recently_active() sets entry.suspended = True on all sessions updated within the last 120 seconds when no .clean_shutdown marker exists. The suspended flag causes get_or_create_session() to hard-reset the session on next access.

The drain-timeout path already calls mark_resume_pending() for actively running agents, but sessions that were idle (just finished a turn, waiting for next message) are not in _running_agents and get missed — then hard-suspended by suspend_recently_active().
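
The failure mode described above can be sketched in a few lines. This is a minimal illustration, not the gateway's actual code: `SessionEntry`, its fields other than `suspended`, the `_old` function suffixes, and the session-ID scheme are hypothetical stand-ins. Only the 120-second window and the `suspended` flag come from the description.

```python
import time
from dataclasses import dataclass, field

RECENT_WINDOW_SECONDS = 120  # window quoted in the description above


@dataclass
class SessionEntry:
    """Hypothetical stand-in for the gateway's session record."""
    session_id: str
    updated_at: float
    suspended: bool = False
    history: list = field(default_factory=list)


def suspend_recently_active_old(entries, clean_shutdown, now=None):
    """Pre-fix behavior as described: hard-suspend every recent session."""
    if clean_shutdown:
        return 0
    now = time.time() if now is None else now
    count = 0
    for entry in entries.values():
        if now - entry.updated_at <= RECENT_WINDOW_SECONDS:
            entry.suspended = True
            count += 1
    return count


def get_or_create_session_old(entries, key, now=None):
    """A suspended entry is replaced by a fresh one, dropping its history."""
    now = time.time() if now is None else now
    entry = entries.get(key)
    if entry is None or entry.suspended:
        # Assumption: the new session ID is derived from the key and time.
        entry = SessionEntry(session_id=f"{key}-{int(now)}", updated_at=now)
        entries[key] = entry
    return entry
```

An idle session that finished a turn 60 seconds before the crash falls inside the window, gets flagged, and comes back empty on the next message, which matches the reported symptom.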

Fix

Change suspend_recently_active() to mark sessions as resume_pending=True with resume_reason="dirty_shutdown" instead of suspended=True. This preserves session history and lets the user seamlessly continue.

Safety preserved: The stuck-loop detector (_suspend_stuck_loop_sessions) still runs after this and hard-suspends any session that causes 3+ consecutive dirty restarts. That catches genuinely broken/stuck sessions.
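
A rough sketch of the proposed behavior together with the safety net, under stated assumptions: `SessionEntry`, `dirty_restart_count`, and `STUCK_LOOP_THRESHOLD` are hypothetical names, and the real stuck-loop detector may track consecutive dirty restarts differently. The flag names, the `"dirty_shutdown"` reason, the skip for already-suspended entries, and the 3-restart threshold come from the description.

```python
from dataclasses import dataclass
from typing import Optional

RECENT_WINDOW_SECONDS = 120  # window from the Root Cause section
STUCK_LOOP_THRESHOLD = 3     # "3+ consecutive dirty restarts"


@dataclass
class SessionEntry:
    """Hypothetical stand-in for the gateway's session record."""
    session_id: str
    updated_at: float
    suspended: bool = False
    resume_pending: bool = False
    resume_reason: Optional[str] = None
    dirty_restart_count: int = 0  # assumption: how restarts are counted


def suspend_recently_active(entries, now):
    """Mark recent sessions resumable instead of hard-suspending them."""
    marked = 0
    for entry in entries.values():
        if entry.suspended:
            continue  # already-suspended entries are left alone
        if now - entry.updated_at > RECENT_WINDOW_SECONDS:
            continue  # not recently active
        entry.resume_pending = True
        entry.resume_reason = "dirty_shutdown"
        entry.dirty_restart_count += 1
        marked += 1
    return marked


def suspend_stuck_loop_sessions(entries):
    """Safety net: hard-suspend sessions seen in too many dirty restarts."""
    suspended = 0
    for entry in entries.values():
        if entry.suspended:
            continue
        if entry.dirty_restart_count >= STUCK_LOOP_THRESHOLD:
            entry.suspended = True
            entry.resume_pending = False
            suspended += 1
    return suspended
```

Running the detector after the resume marking means a session only loses its history once it has been implicated in repeated dirty restarts, rather than on the first one.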

Changes

  • gateway/session.py: suspend_recently_active() now sets resume_pending=True instead of suspended=True, and also skips already-suspended entries
  • gateway/run.py: Updated log messages to reflect new behavior

Testing

Verified on a live Telegram gateway: restart without clean shutdown → session marked as resume-pending → user message after restart resumes same session with full history intact.
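
The manual check above could be approximated offline with something like the following. It is only a sketch: `startup_recovery`, `simulate`, the dict-based session shape, and the idea that the marker is deleted on startup are assumptions made for illustration, and the recency window is omitted for brevity. Only the `.clean_shutdown` marker name and the `resume_pending` flag come from the description.

```python
import tempfile
from pathlib import Path

MARKER_NAME = ".clean_shutdown"  # marker file name from the description


def startup_recovery(state_dir: Path, sessions: dict) -> list:
    """Return the session keys marked resume-pending on this startup."""
    marker = state_dir / MARKER_NAME
    if marker.exists():
        marker.unlink()  # assumption: the marker is consumed once per start
        return []
    marked = []
    for key, session in sessions.items():
        if not session.get("suspended"):
            session["resume_pending"] = True
            marked.append(key)
    return marked


def simulate(clean: bool) -> dict:
    """Simulate one restart and return the session state afterwards."""
    sessions = {"telegram:42": {"history": ["turn 1"], "suspended": False}}
    with tempfile.TemporaryDirectory() as tmp:
        state_dir = Path(tmp)
        if clean:
            (state_dir / MARKER_NAME).touch()
        startup_recovery(state_dir, sessions)
    return sessions["telegram:42"]
```

Calling `simulate(clean=False)` leaves the history in place with `resume_pending` set, while `simulate(clean=True)` leaves the session untouched.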

When the gateway exits without a clean shutdown marker (drain timeout,
crash, systemd kill), sessions were hard-suspended (wiped). This forced
users to restart work from scratch, wasting tokens and losing context.

Change suspend_recently_active() to mark sessions as resume_pending
instead of suspended. The session history is preserved and the user
seamlessly continues where they left off.

Stuck-loop escalation (3+ consecutive dirty restarts) still
hard-suspends genuinely broken sessions as a safety net.
alt-glitch added the labels type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), and P1 (High: major feature broken, no workaround) on May 2, 2026
@teknium1
Contributor

This appears to be implemented on current main. Automated hermes-sweeper review found the dirty-shutdown path now preserves session history by marking sessions resumable instead of hard-suspending them.

Evidence:

  • gateway/session.py:1111: suspend_recently_active() is now documented as marking recent sessions resumable after an unexpected exit, not destroying history.
  • gateway/session.py:1136-1141: the method skips already-resume-pending entries, skips explicitly suspended entries, and sets entry.resume_pending = True with resume_reason = "restart_interrupted".
  • gateway/session.py:904-915: get_or_create_session() still lets suspended hard-reset first, but resume_pending returns the existing entry, so the transcript and session_id are preserved.
  • gateway/run.py:4704-4709: startup still invokes suspend_recently_active() when there is no clean shutdown marker, and the log now reports sessions as resumable.
  • gateway/run.py:4712-4719: stuck-loop hard-suspend escalation still runs after this path.
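
The lookup ordering described in the evidence list can be illustrated roughly as follows. This is not the code on main: `SessionEntry`, `new_entry`, and the ID counter are hypothetical, and whether the real function clears `resume_pending` on access is not stated, so the sketch leaves the flag alone.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)  # hypothetical session ID source


@dataclass
class SessionEntry:
    """Hypothetical stand-in for the gateway's session record."""
    session_id: str
    suspended: bool = False
    resume_pending: bool = False
    resume_reason: Optional[str] = None
    transcript: list = field(default_factory=list)


def new_entry() -> SessionEntry:
    return SessionEntry(session_id=f"session-{next(_ids)}")


def get_or_create_session(entries: dict, key: str) -> SessionEntry:
    entry = entries.get(key)
    if entry is None:
        entry = entries[key] = new_entry()
        return entry
    if entry.suspended:
        # Checked first: a suspended entry is hard-reset to a fresh session.
        entry = entries[key] = new_entry()
        return entry
    # resume_pending entries (and ordinary ones) are returned as-is,
    # so the transcript and session_id survive the restart.
    return entry
```

Because the `suspended` check comes first, a session escalated by the stuck-loop detector is still reset even if it was marked resume-pending earlier in the same startup.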

Thanks for the report and patch; this was useful prior work, and current main now has the behavior the PR requested.

teknium1 closed this on Jun 11, 2026
teknium1 added the sweeper:implemented-on-main label (Sweeper: behavior already present on current main) on Jun 11, 2026