fix(gateway): auto-resume sessions after drain-timeout restart (#11852) by teknium1 · Pull Request #12301 · NousResearch/hermes-agent

teknium1 · 2026-04-18T21:54:57Z

Summary

Sessions interrupted by a drain-timeout gateway restart now auto-resume on the same session_key instead of getting silently converted into a fresh session with a contradictory reset notice.

Implements the spec in #11852 (BrennerSpear) with the approved correction (reuse existing .restart_failure_counts stuck-loop counter from #7536 rather than adding a parallel counter).

Root cause: drain-timeout restart skipped .clean_shutdown → startup called suspend_recently_active() → get_or_create_session() saw suspended=True → spawned a new session_id with auto_reset_reason="suspended" — contradicting the banner's "send any message after restart to resume" promise.

Changes

gateway/session.py: SessionEntry gains resume_pending / resume_reason / last_resume_marked_at fields (with to_dict/from_dict). New SessionStore.mark_resume_pending() / clear_resume_pending(). get_or_create_session() returns the existing entry when resume_pending=True (suspended still wins). suspend_recently_active() skips resume_pending entries.
gateway/run.py: drain-timeout branch in _stop_impl() marks active sessions resume_pending (reason restart_timeout vs shutdown_timeout) before _interrupt_running_agents(). _run_agent() injects a reason-aware restart-resume system note that subsumes the tool-tail auto-continue note from #9934 (feat: auto-continue interrupted agent work after gateway restart (#4493)). Successful-turn cleanup clears resume_pending alongside _clear_restart_failure_count(). Shutdown banner softened to "I'll try to resume where you left off", which is honest about stuck-loop escalation.
tests/gateway/test_restart_resume_pending.py: 29 new tests.
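
The session-store changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/session.py: the dict-backed store, the uuid-based session_id, the field types, and every method signature are assumptions. Only the names resume_pending, resume_reason, last_resume_marked_at, suspended, mark_resume_pending, clear_resume_pending, get_or_create_session, and suspend_recently_active come from the PR description. Idle/daily reset policy is left out entirely.

```python
import time
import uuid
from dataclasses import asdict, dataclass, field, fields
from typing import Dict, Optional


@dataclass
class SessionEntry:
    session_key: str
    # Assumption: ids are random hex strings; the real format is not stated.
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    suspended: bool = False
    resume_pending: bool = False
    resume_reason: Optional[str] = None
    last_resume_marked_at: Optional[float] = None

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "SessionEntry":
        # Ignore unknown keys and default missing ones, so records written
        # before the resume_* fields existed still load.
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in names})


class SessionStore:
    def __init__(self) -> None:
        self._entries: Dict[str, SessionEntry] = {}

    def mark_resume_pending(self, session_key: str, reason: str) -> bool:
        entry = self._entries.get(session_key)
        if entry is None:
            return False
        entry.resume_pending = True
        entry.resume_reason = reason
        entry.last_resume_marked_at = time.time()
        return True

    def clear_resume_pending(self, session_key: str) -> None:
        entry = self._entries.get(session_key)
        if entry is not None:
            entry.resume_pending = False
            entry.resume_reason = None

    def suspend_recently_active(self) -> None:
        # Startup-wide suspension skips entries that are waiting to resume.
        for entry in self._entries.values():
            if not entry.resume_pending:
                entry.suspended = True

    def get_or_create_session(self, session_key: str) -> SessionEntry:
        entry = self._entries.get(session_key)
        if entry is not None and not entry.suspended:
            # Includes resume_pending entries: the session_id is preserved.
            return entry
        # No entry yet, or suspended (which wins over resume_pending).
        fresh = SessionEntry(session_key=session_key)
        self._entries[session_key] = fresh
        return fresh
```

In this sketch the precedence rule lives in a single place: get_or_create_session() checks suspended first, so an entry that is both suspended and resume_pending still gets a fresh session_id.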

Invariants preserved

Repeated interrupted restarts still escalate to suspended=True via the existing .restart_failure_counts counter (threshold 3) — no parallel counter added.
/stop still hard-suspends.
Clean-drain shutdowns still write .clean_shutdown and run no suspension on next start.
Idle/daily session_reset policy unchanged.
The tool-tail auto-continue note from #9934 (feat: auto-continue interrupted agent work after gateway restart (#4493)) still fires for non-resume-pending interrupted sessions (crashes, SIGTERM without drain, etc.).
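
The escalation invariant can be illustrated with a small sketch. It assumes an in-memory dict stands in for the .restart_failure_counts counter and that sessions are plain dicts; the function names and the exact comparison against the threshold are hypothetical, and only the threshold value of 3 and the "suspended overrides resume_pending" rule come from the PR.

```python
from typing import Dict

RESTART_FAILURE_THRESHOLD = 3  # value stated in the PR


def record_restart_failure(counts: Dict[str, int], session: dict) -> None:
    """Count one interrupted restart; escalate to suspended at the threshold."""
    key = session["session_key"]
    counts[key] = counts.get(key, 0) + 1
    if counts[key] >= RESTART_FAILURE_THRESHOLD:
        session["suspended"] = True


def record_successful_turn(counts: Dict[str, int], session: dict) -> None:
    """A completed turn clears both the counter and the resume flag."""
    counts.pop(session["session_key"], None)
    session["resume_pending"] = False


def will_auto_resume(session: dict) -> bool:
    # suspended wins over resume_pending.
    return session["resume_pending"] and not session["suspended"]
```

Because the counter is the only escalation path in this sketch, a session that keeps getting interrupted eventually stops auto-resuming without needing a second counter on the session record.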

Validation

| Scenario | Before | After |
| --- | --- | --- |
| Drain-timeout restart, same `session_key` next message | Fresh `session_id` + "Session automatically reset. Use /resume..." | Same `session_id`, transcript reloaded, reason-aware restart-resume system note |
| Interrupted transcript NOT ending on `tool` role | No resume hint to the model | Reason-aware system note still fires (resume_pending metadata-driven) |
| `/stop` → suspend | New `session_id` + suspended notice | Unchanged |
| 3× consecutive restart-interrupt on same session | Stuck-loop counter flips suspended=True, fresh session | Unchanged (suspended overrides resume_pending) |
| Clean drain completes in time | No marking, `.clean_shutdown` written | Unchanged |
| Successful resumed turn | n/a | Clears `resume_pending` + stuck-loop counter |
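
The system-note behavior in the table can be read as a small decision tree, sketched below. The function name, its parameters, and the note wording are hypothetical; the PR only states that the restart-resume note is reason-aware (restart_timeout vs shutdown_timeout), that it subsumes the tool-tail note, and that the tool-tail note still applies to interrupted sessions that are not resume_pending.

```python
from typing import List, Optional

# Placeholder wording; the real note text is not given in the PR.
TOOL_TAIL_NOTE = "The previous turn ended on a tool result. Continue the work."


def pick_system_note(
    resume_pending: bool,
    resume_reason: Optional[str],
    transcript: List[dict],
) -> Optional[str]:
    if resume_pending:
        # Metadata-driven: fires regardless of how the transcript ends,
        # and replaces the tool-tail note when both would apply.
        cause = "restart" if resume_reason == "restart_timeout" else "shutdown"
        return (
            f"Your previous turn was interrupted by a gateway {cause}. "
            "Resume where you left off."
        )
    if transcript and transcript[-1].get("role") == "tool":
        # Non-resume-pending interruption (crash, SIGTERM without drain).
        return TOOL_TAIL_NOTE
    return None
```

Checking resume_pending first is what makes the second table row work: an interrupted transcript that does not end on a `tool` message still produces a note, because the decision no longer depends on the transcript tail.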

Test runs (targeted):

tests/gateway/test_restart_resume_pending.py — 29 passed
tests/gateway/test_restart_drain.py test_gateway_shutdown.py test_clean_shutdown_marker.py test_auto_continue.py test_stuck_loop.py test_restart_notification.py test_session.py — 141 passed
All 8 session-related test files — 139 passed
Full tests/gateway/ — 3286 passed, 7 pre-existing unrelated failures (signal phone redaction, matrix E2EE olm module, telegram approval buttons — all exist on origin/main without these changes)

Credit

Spec authored by @BrennerSpear in #11852. This PR implements that spec.

Closes #11852 (spec → implementation).

The shutdown banner promised "send any message after restart to resume where you left off" but the code did the opposite: a drain-timeout restart skipped the .clean_shutdown marker, which made the next startup call suspend_recently_active(), which marked the session suspended, which made get_or_create_session() spawn a fresh session_id with a 'Session automatically reset. Use /resume...' notice, contradicting the banner.

Introduce a resume_pending state on SessionEntry that is distinct from suspended. Drain-timeout shutdown flags active sessions resume_pending instead of letting startup-wide suspension destroy them. The next message on the same session_key preserves the session_id, reloads the transcript, and the agent receives a reason-aware restart-resume system note that subsumes the existing tool-tail auto-continue note (PR #9934).

Terminal escalation still flows through the existing .restart_failure_counts stuck-loop counter (PR #7536, threshold 3); there is no parallel counter on SessionEntry. suspended still wins over resume_pending in get_or_create_session() so genuinely stuck sessions converge to a clean slate.

Spec: PR #11852 (BrennerSpear). Implementation follows the spec with the approved correction (reuse .restart_failure_counts rather than adding a resume_attempts field).

Changes:
- gateway/session.py: SessionEntry.resume_pending/resume_reason/last_resume_marked_at + to_dict/from_dict; SessionStore.mark_resume_pending()/clear_resume_pending(); get_or_create_session() returns existing entry when resume_pending (suspended still wins); suspend_recently_active() skips resume_pending entries.
- gateway/run.py: _stop_impl() drain-timeout branch marks active sessions resume_pending before _interrupt_running_agents(); _run_agent() injects reason-aware restart-resume system note that subsumes the tool-tail case; successful-turn cleanup also clears resume_pending next to _clear_restart_failure_count(); _notify_active_sessions_of_shutdown() softens the restart banner to 'I'll try to resume where you left off' (honest about stuck-loop escalation).
- tests/gateway/test_restart_resume_pending.py: 29 new tests covering SessionEntry roundtrip, mark/clear helpers, get_or_create_session precedence (suspended > resume_pending), suspend_recently_active skip, drain-timeout mark reason (restart vs shutdown), system-note injection decision tree (including tool-tail subsumption), banner wording, and stuck-loop escalation override.


…n timeout (#12332)

Follow-up to #12301. The drain-timeout branch of _stop_impl() was iterating the drain-start snapshot (active_agents) when marking sessions resume_pending. That snapshot can include sessions that finished gracefully during the drain window. Marking them would give their next turn a stray 'your previous turn was interrupted by a gateway restart' system note even though the prior turn actually completed cleanly.

Iterate self._running_agents at timeout time instead, mirroring _interrupt_running_agents() exactly:
- only sessions still blocking the shutdown get marked
- pending sentinels (AIAgent construction not yet complete) are skipped

Changes:
- gateway/run.py: swap active_agents.keys() for filtered self._running_agents.items() iteration in the drain-timeout mark loop.
- tests/gateway/test_restart_resume_pending.py: two regression tests (finisher-during-drain not marked, pending sentinel not marked).
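
The difference between the drain-start snapshot and the live mapping can be sketched as below. The PENDING_SENTINEL object and the sessions_to_mark() helper are hypothetical stand-ins: the PR says only that pending sentinels are skipped and that the loop iterates self._running_agents.items() at timeout time, not how the sentinel is represented.

```python
from typing import Any, Dict, List

# Hypothetical marker for "AIAgent construction not yet complete".
PENDING_SENTINEL = object()


def sessions_to_mark(running_agents: Dict[str, Any]) -> List[str]:
    """Return session keys still blocking shutdown at drain-timeout time.

    Reads the live mapping rather than a snapshot taken when the drain
    started, so sessions that finished during the drain window are absent,
    and entries still holding the pending sentinel are skipped.
    """
    return [
        session_key
        for session_key, agent in running_agents.items()
        if agent is not PENDING_SENTINEL
    ]
```

If a drain-start snapshot held sessions a, b, and c, and by timeout a had finished while c was still a pending sentinel, this helper would return only b. Iterating the snapshot's keys would have marked all three.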
