spec: automatic session resume after gateway restart #11852
BrennerSpear wants to merge 4 commits into
Green light — go ahead and implement this. Diagnosis matches current main exactly (forced-interrupt → One design correction before you start: the spec's Option B adds Concretely that means:
Also confirming two of your open-question recommendations are the right calls:
Ping me when you have a draft PR up and we'll get it reviewed.
    session_key: str,
    *,
    reason: str = "restart_timeout",
    increment_attempts: bool = True,
resume_attempts counts restart events, not failed recoveries — causing premature escalation
increment_attempts: bool = True is called inside mark_resume_pending(), which fires on every interrupted restart. If the gateway cycles 3 times in a deploy loop without the user ever sending a message, the session hits the escalation threshold and becomes suspended before a single recovery was ever attempted.
The counter should track failed recovery turns (a resumed turn that was interrupted again), not restart-mark calls. Increment it in clear_resume_pending(failed=True) or in the escalation path after a resumed turn crashes — not in mark_resume_pending().
Section 4 says "first/second/third interrupted restart" which implies counting restarts, but that creates the deploy-loop problem. Worth clarifying which semantic is intended before implementation.
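To make the two counting semantics concrete, here is a minimal sketch. It assumes a hypothetical in-memory `RecoveryCounter` (not part of the spec or the codebase) in which marking a session on restart never increments the count and only a resumed turn that fails does. The threshold of 3 is taken from the "third interrupted restart" wording; everything else is illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RecoveryCounter:
    """Hypothetical helper: counts failed recovery turns per session_key."""

    threshold: int = 3
    failed_recoveries: Dict[str, int] = field(default_factory=dict)

    def on_restart_mark(self, session_key: str) -> None:
        # Marking a session resumable is not a failure, so the count is untouched.
        self.failed_recoveries.setdefault(session_key, 0)

    def on_resumed_turn_failed(self, session_key: str) -> bool:
        # A resumed turn was interrupted again: this is the event that counts.
        count = self.failed_recoveries.get(session_key, 0) + 1
        self.failed_recoveries[session_key] = count
        return count >= self.threshold  # True means "escalate to suspended"

    def on_resumed_turn_succeeded(self, session_key: str) -> None:
        self.failed_recoveries.pop(session_key, None)


counter = RecoveryCounter()
for _ in range(5):  # deploy loop: five restarts, the user never replies
    counter.on_restart_mark("chat:42")
deploy_loop_count = counter.failed_recoveries["chat:42"]
```

Under this semantic a deploy loop alone cannot push a session over the threshold; only repeated failed resumes can.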
Extend `get_or_create_session()` logic:

- if `entry.suspended` → current reset behavior stays
- if `entry.resume_pending` → **return the existing entry** and clear or downgrade the resume marker once the resume turn begins successfully
Contradiction between section 2 and Open Question 1 on when to clear resume_pending
Section 2 (this line) says "clear or downgrade the resume marker once the resume turn begins successfully." Open Question 1 (line 538) recommends "clear after successful completion."
These conflict in a meaningful way: if cleared on start and the resumed turn crashes mid-run, the session exits resume_pending without recording a failure. On the next restart, mark_resume_pending() would set it again but resume_attempts has no record of the prior failed resume — breaking the escalation counter.
Pick one semantic and remove the ambiguity. "Clear on completion" is safer; just make sure the in-progress guard (mentioned in Open Question 1) prevents double-resume if the session is accessed concurrently during the running turn.
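A rough sketch of the "clear on completion" semantic with an in-progress guard follows. The `Entry` class and the `resume_in_progress` flag are assumptions made for illustration; the spec does not name such a field, and a real implementation would need whatever locking the session store already uses.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for SessionEntry with only the fields needed here."""

    resume_pending: bool = False
    resume_in_progress: bool = False  # assumed guard flag, not in the spec's field list


def begin_resumed_turn(entry: Entry) -> bool:
    """Return True if this caller may run the resumed turn."""
    if not entry.resume_pending or entry.resume_in_progress:
        return False  # nothing to resume, or another turn is already resuming
    entry.resume_in_progress = True  # resume_pending stays set until completion
    return True


def finish_resumed_turn(entry: Entry, succeeded: bool) -> None:
    entry.resume_in_progress = False
    if succeeded:
        entry.resume_pending = False  # clear only on successful completion


entry = Entry(resume_pending=True)
first = begin_resumed_turn(entry)
second = begin_resumed_turn(entry)  # concurrent access while the turn is running
finish_resumed_turn(entry, succeeded=False)
still_pending_after_crash = entry.resume_pending
```

Because the marker survives a failed turn, the next restart can still see that a recovery was attempted and did not finish.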
-> next message in same thread/topic
-> ACTIVE (same session_id, transcript reloaded)

RESUME_PENDING
State machine is missing the RESUME_PENDING → gateway restarts again transition
The state machine shows RESUME_PENDING going to ACTIVE (user messages) or to SUSPENDED (threshold exceeded), but has no transition for "gateway restarts again while still in RESUME_PENDING." This is the exact scenario in a deploy loop — the user hasn't messaged yet between restarts.
If each restart calls mark_resume_pending() with increment_attempts=True, the session silently escalates to SUSPENDED before the user gets a chance to reply. The state machine should make this explicit — either as a valid self-loop (RESUME_PENDING → RESUME_PENDING + attempt++) or as a special transition with a clear policy on whether successive restarts-without-user-message count toward the threshold.
#### If the system escalates to forced clean slate
Only after repeated failure should the user see something like:

- `This session was interrupted repeatedly during restart recovery, so Hermes started a fresh session to avoid getting stuck. Use /resume if you want the old transcript.`
/resume discoverability after escalation to a new session_id is unspecified
This copy tells the user to /resume after escalation, but when suspended=True causes get_or_create_session() to mint a new session_id, the new session entry has no reference back to the old one. The spec doesn't confirm that /resume can discover the previous session_id in this case.
If /resume today works by listing recent sessions for the user to pick from, it may already handle this. But if it relies on the current session knowing its predecessor, it won't. The spec should either:
- explicitly verify the current /resume flow handles cross-session-id lookup, or
- add a task to store a parent_session_id field on the new session entry when escalation creates it.
Mentioning /resume in user-facing copy that doesn't actually work is worse than omitting it.
Reserve it for narrower cases, such as:

- startup crash recovery when explicit `resume_pending` metadata is absent,
Hard kills (SIGKILL/OOM) still hit the original bad UX — worth calling out explicitly
Section 5 lists "startup crash recovery when explicit resume_pending metadata is absent" as a remaining use case for suspend_recently_active(). But this is the same broken path the spec is trying to fix: the user saw the optimistic banner, the process died without writing any markers, and on next startup the session gets reset.
This is a known limitation, but the spec should say so explicitly rather than burying it in the suspend_recently_active() residual use-cases list. One mitigation worth naming: write resume_pending markers at the start of drain (not only after drain timeout), so that even a SIGKILL during drain leaves markers on disk. Whether that's in scope for v1 or a follow-up should be a stated decision.
    timeout = self._restart_drain_timeout
    active_agents, timed_out = await self._drain_active_agents(timeout)
    if timed_out:
        timed_out_session_keys = set(self._running_agents.keys())
Use active_agents.keys() instead of self._running_agents.keys() here.
active_agents is the authoritative snapshot returned by _drain_active_agents() — it contains exactly the sessions that were still running at timeout. Re-reading self._running_agents is normally equivalent, but it introduces a subtle TOCTOU hazard: if any agent completes (and removes itself from _running_agents) in the window between _drain_active_agents() returning and this line executing, those sessions would be missed. The return value is the right source of truth.
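The hazard can be reproduced in isolation. The snippet below uses plain dictionaries as hypothetical stand-ins for `self._running_agents` (the live registry) and for the `active_agents` snapshot; it is not the gateway's actual data structure.

```python
from typing import Dict, Set

# Hypothetical stand-ins: the live registry and the snapshot returned by the drain.
running_agents: Dict[str, object] = {"chat:1": object(), "chat:2": object()}
active_agents: Dict[str, object] = dict(running_agents)  # snapshot taken at timeout

# An agent finishes and removes itself after the drain has already returned.
running_agents.pop("chat:2")

from_live_registry: Set[str] = set(running_agents.keys())
from_snapshot: Set[str] = set(active_agents.keys())

# Sessions that re-reading the live registry would fail to mark.
missed = from_snapshot - from_live_registry
```

Here `missed` contains the session that was running at timeout but had left the live registry by the time it was re-read.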
    try:
        self.session_store.mark_resume_pending(
            session_key,
            reason="restart_timeout" if self._restart_requested else "shutdown_timeout",
Semantic mismatch: when _restart_requested=False (clean shutdown with drain timeout), sessions are marked resume_pending=True with reason="shutdown_timeout" — but the shutdown banner on line ~1608 tells the user only "Your current task will be interrupted" with no recovery promise, and _prepend_restart_recovery_note() always says "interrupted by a gateway restart" regardless of the reason.
Result: a user who was told their task is simply interrupted will get a recovery note on next startup that wrongly blames a restart. Either skip mark_resume_pending for non-restart shutdowns, or thread resume_reason through to the system note copy.
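One way to thread the reason through, sketched under assumptions: `build_recovery_note` and the note wording are hypothetical, and only the two reason strings come from the diff above.

```python
from typing import Optional

# Wording below is illustrative only; the real copy lives in the gateway.
_CAUSES = {
    "restart_timeout": "was interrupted by a gateway restart",
    "shutdown_timeout": "was interrupted by a gateway shutdown",
}


def build_recovery_note(resume_reason: Optional[str]) -> str:
    """Return a system note whose stated cause matches the recorded resume_reason."""
    cause = _CAUSES.get(resume_reason or "", "was interrupted")
    return (
        f"[System note: your previous turn {cause}. "
        "Finish the interrupted work before handling the new message.]"
    )
```

With a mapping like this, a session marked with "shutdown_timeout" never receives a note that blames a restart, and an unknown or missing reason falls back to neutral wording.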
    elif entry.resume_pending:
        entry.updated_at = now
        self._save()
        return entry
This early-return bypasses _should_reset() entirely, including idle-timeout and daily-reset policies. last_resume_marked_at is stored but never consulted here.
If the user never returns to the interrupted thread, the session stays permanently stuck in resume_pending=True and will never idle-reset. Consider adding a recovery-window guard — e.g., if _now() - entry.last_resume_marked_at > recovery_window_seconds, fall through to the normal _should_reset() path instead of returning early.
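A sketch of the suggested guard, assuming `last_resume_marked_at` is stored as epoch seconds and that `recovery_window_seconds` is a new, hypothetical setting (neither is confirmed by the diff):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical stand-in for SessionEntry."""

    resume_pending: bool = False
    last_resume_marked_at: Optional[float] = None  # assumed epoch seconds


def resume_still_fresh(
    entry: Entry, now: float, recovery_window_seconds: float = 3600.0
) -> bool:
    """True if the resume_pending early return should apply.

    When this returns False the caller would fall through to the normal
    reset checks instead of returning the entry unconditionally.
    """
    if not entry.resume_pending or entry.last_resume_marked_at is None:
        return False
    return (now - entry.last_resume_marked_at) <= recovery_window_seconds
```

The one-hour default is arbitrary; the point is only that a stale marker stops shielding the session from idle-timeout and daily-reset policies.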
    entry.resume_reason = reason
    entry.last_resume_marked_at = _now()
    if increment_attempts:
        entry.resume_attempts += 1
resume_attempts is incremented here but never checked against a threshold anywhere in the codebase. The spec promises "third interrupted restart → convert to suspended=True" as the escalation path for poisoned sessions, but without a threshold check this counter is purely decorative.
The only escalation that exists is the existing stuck-loop mechanism in _suspend_stuck_loop_sessions() (which clears resume_pending when it fires), but that fires on restart-count watermarks, not attempt-count. A session that hangs silently (never triggering stuck-loop detection) would accumulate resume_attempts forever without ever escalating.
… (#12301)

The shutdown banner promised "send any message after restart to resume where you left off" but the code did the opposite: a drain-timeout restart skipped the .clean_shutdown marker, which made the next startup call suspend_recently_active(), which marked the session suspended, which made get_or_create_session() spawn a fresh session_id with a 'Session automatically reset. Use /resume...' notice, contradicting the banner.

Introduce a resume_pending state on SessionEntry that is distinct from suspended. Drain-timeout shutdown flags active sessions resume_pending instead of letting startup-wide suspension destroy them. The next message on the same session_key preserves the session_id, reloads the transcript, and the agent receives a reason-aware restart-resume system note that subsumes the existing tool-tail auto-continue note (PR #9934).

Terminal escalation still flows through the existing .restart_failure_counts stuck-loop counter (PR #7536, threshold 3); there is no parallel counter on SessionEntry. suspended still wins over resume_pending in get_or_create_session() so genuinely stuck sessions converge to a clean slate.

Spec: PR #11852 (BrennerSpear). Implementation follows the spec with the approved correction (reuse .restart_failure_counts rather than adding a resume_attempts field).

Changes:
- gateway/session.py: SessionEntry.resume_pending/resume_reason/last_resume_marked_at + to_dict/from_dict; SessionStore.mark_resume_pending()/clear_resume_pending(); get_or_create_session() returns existing entry when resume_pending (suspended still wins); suspend_recently_active() skips resume_pending entries.
- gateway/run.py: _stop_impl() drain-timeout branch marks active sessions resume_pending before _interrupt_running_agents(); _run_agent() injects reason-aware restart-resume system note that subsumes the tool-tail case; successful-turn cleanup also clears resume_pending next to _clear_restart_failure_count(); _notify_active_sessions_of_shutdown() softens the restart banner to 'I'll try to resume where you left off' (honest about stuck-loop escalation).
- tests/gateway/test_restart_resume_pending.py: 29 new tests covering SessionEntry roundtrip, mark/clear helpers, get_or_create_session precedence (suspended > resume_pending), suspend_recently_active skip, drain-timeout mark reason (restart vs shutdown), system-note injection decision tree (including tool-tail subsumption), banner wording, and stuck-loop escalation override.
PR Spec: Automatic Session Resume After Gateway Restart
Date: 2026-04-17
TL;DR
Hermes should treat restart interruption as a resumable session state, not as a session reset.
Today, when gateway shutdown cannot drain active work within agent.restart_drain_timeout, startup falls back to suspend_recently_active(). That marks recently-active sessions as suspended, and the next message in the same thread causes SessionStore.get_or_create_session() to create a new session ID with auto_reset_reason="suspended". The user sees:
That is exactly the wrong behavior for the common case of same thread, same user, same restart, still wants same task.
Recommendation: introduce a distinct persisted state like resume_pending / interrupted_by_restart and keep the existing session_id on the next message with the same session_key. Reuse the existing transcript reload and auto-continue logic in gateway/run.py instead of creating a new session. Escalation should reuse the existing .restart_failure_counts / stuck-loop detection path rather than adding a parallel counter on SessionEntry.
Problem statement
Current user experience
Current behavior is incoherent:
- … session_key lands in a fresh session.
- … /resume manually.
This is a terrible experience because:
Root cause in current code
Relevant current behavior:
- gateway/run.py: _notify_active_sessions_of_shutdown() … Send any message after restart to resume where it left off.
- gateway/run.py: … .clean_shutdown marker so next startup treats the prior run as unsafe
- gateway/run.py: … self.session_store.suspend_recently_active() when .clean_shutdown is absent
- gateway/session.py: get_or_create_session() checks entry.suspended … auto_reset_reason="suspended"
- gateway/run.py: … Session automatically reset (previous session was stopped or interrupted)...
- gateway/run.py: … tool message, Hermes prepends a system note telling the model to finish the interrupted work
Important observation: Hermes already has part of the resume mechanism. The main thing preventing automatic resume is that forced restart currently turns the session into a fresh session instead of preserving the old one.
Product goal
When a user restarts Hermes and then sends the next message that resolves to the same session_key (same chat/thread/topic lane), Hermes should, by default:
- … session_id,
- … /resume unless recovery has actually failed.
Desired UX
For the normal case:
- … session_key
For the pathological case:
Non-goals
This PR should not try to:
- … session_reset policy semantics.
This is specifically about same-lane restart continuity after gateway interruption.
Design principles
Recommendation
Introduce a resumable restart-interruption state
Add a persisted session state that is distinct from suspended.
New state
Recommended SessionEntry fields in gateway/session.py:
Meaning of states
- suspended=True
- resume_pending=True
This is the key architectural distinction missing today.
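As an illustration of the recommended fields, here is a minimal sketch. `SessionEntrySketch` is a hypothetical subset of the real SessionEntry; the field names resume_pending, resume_reason and last_resume_marked_at come from this spec, while the types and the epoch-seconds timestamp are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class SessionEntrySketch:
    """Hypothetical subset of SessionEntry; the real class has more fields."""

    session_key: str
    session_id: str
    suspended: bool = False
    resume_pending: bool = False
    resume_reason: Optional[str] = None
    last_resume_marked_at: Optional[float] = None  # assumed epoch seconds

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SessionEntrySketch":
        # Older persisted entries will not have the new keys, so default them.
        return cls(
            session_key=data["session_key"],
            session_id=data["session_id"],
            suspended=data.get("suspended", False),
            resume_pending=data.get("resume_pending", False),
            resume_reason=data.get("resume_reason"),
            last_resume_marked_at=data.get("last_resume_marked_at"),
        )
```

Defaulting the new keys in from_dict() keeps entries written before this change loadable without a migration step.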
High-level behavior change
Current behavior
Proposed behavior
Escalation path
Detailed design
1) Persist resume_pending on interrupted restart
Where to mark it
During shutdown in gateway/run.py, after drain timeout is detected and the gateway force-interrupts active agents, mark the active session keys as resume_pending=True.
This should happen instead of relying on the startup-wide "recently active means suspend" fallback for these sessions.
Why here
At shutdown time, the gateway knows:
- … /new.
That is the correct moment to record resumable interruption state.
Proposed helper
Add a method to SessionStore in gateway/session.py:
Responsibilities:
- resume_pending=True
- resume_reason
- last_resume_marked_at
- sessions.json
2) Do not auto-reset resume_pending sessions on next access
Current bad behavior
SessionStore.get_or_create_session() currently treats suspended as "auto-reset on next access".
Proposed behavior
Extend get_or_create_session() logic:
- entry.suspended → current reset behavior stays
- entry.resume_pending → return the existing entry while the recovery window is still fresh, and only clear the marker after a successful turn completes
Pseudo-shape:
This is the core functional fix.
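The precedence described above can be sketched roughly as follows. This is a simplified stand-in, not the real SessionStore method: the dictionary store, the `Entry` class and the uuid-based session IDs are assumptions, and the idle-timeout and daily-reset checks are omitted.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for SessionEntry."""

    session_id: str
    suspended: bool = False
    resume_pending: bool = False
    auto_reset_reason: Optional[str] = None


def get_or_create_session(store: Dict[str, Entry], session_key: str) -> Entry:
    entry = store.get(session_key)
    if entry is None:
        entry = store[session_key] = Entry(session_id=uuid.uuid4().hex)
    elif entry.suspended:
        # suspended wins: mint a fresh session_id (existing reset behavior).
        entry = store[session_key] = Entry(
            session_id=uuid.uuid4().hex, auto_reset_reason="suspended"
        )
    elif entry.resume_pending:
        # Keep the same session_id; the marker is cleared later, on success.
        pass
    return entry
```

Checking suspended before resume_pending is what lets a genuinely stuck session still converge to a clean slate even if both flags are set.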
3) Reuse existing transcript reload and auto-continue logic
This PR should explicitly lean on behavior that already exists.
Existing asset
gateway/run.py already prepends an interruption note when the loaded history ends in a tool result:
- … tool, Hermes tells the model to finish processing interrupted tool results before addressing the new user message.
Extend the system note behavior
If session_entry.resume_pending is set, prepend a stronger note such as:
This should work whether the transcript ended with:
- … tool message,
Why this is enough for v1
We do not need a brand-new recovery engine for the first version.
Preserving the same session ID plus transcript reload plus better interruption note gets the common case back to a sane product experience.
4) Keep stuck-loop protection, but reuse the existing restart-failure mechanism
We should not regress the original safety intent behind the stuck-loop work.
Proposed rule
- … session_key → still auto-resume
- … session_key → let the existing stuck-loop path suspend it
This keeps safety without making the default path destructive.
Recommended implementation
Reuse the existing gateway-level .restart_failure_counts file and _suspend_stuck_loop_sessions() flow:
- … mark_resume_pending(...) for the interrupted session_keys
- … session_key
- … .restart_failure_counts
- _suspend_stuck_loop_sessions() flips the session to suspended=True once the existing threshold is exceeded
Do not add or maintain a parallel resume_attempts counter on SessionEntry.
5) Narrow the role of suspend_recently_active()
suspend_recently_active() is too blunt as the generic fallback for restart interruption.
Current role
It treats "recently active at startup after unclean shutdown" as a reason to force clean-slate behavior.
Proposed role after this PR
Reserve it for narrower cases, such as:
- `resume_pending` metadata is absent,
Normal interrupted-restart recovery with explicit `resume_pending` metadata should not be suspended by this helper.
Important outcome
This means restart interruption should no longer immediately flow through the same code path as 'known stuck session'.
That separation is the real product fix.
6) Fix user-facing messaging
Current messaging is misleading
Shutdown banner
Current wording:
This is too absolute.
Reset notice
Current wording points users to `session_reset` config even when the real cause is startup suspension after interrupted restart.
Proposed messaging
Shutdown banner
For resumable restart:
- `Gateway restarting — I'll try to resume this session after restart. Send a message in this chat to continue.`
- `Gateway restarting — I'll try to resume this session after restart. Send a message in this thread/topic to continue.`
If the system escalates to forced clean slate
Only after repeated failure should the user see something like:
`This session was interrupted repeatedly during restart recovery, so Hermes started a fresh session to avoid getting stuck. Use /resume if you want the old transcript.`
Message principle
Only mention `/resume` when Hermes has actually decided not to auto-resume.
State machine
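
One way to read the transitions described in the sections above is as a small lookup table. This is a sketch under stated assumptions: the `State` and `Event` names are invented for illustration and do not correspond to identifiers in the codebase, where the state is carried by the `resume_pending` and `suspended` flags.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    RESUME_PENDING = "resume_pending"
    SUSPENDED = "suspended"


class Event(Enum):
    INTERRUPTED_RESTART = "interrupted_restart"
    RESUMED_TURN_COMPLETED = "resumed_turn_completed"
    FAILURE_THRESHOLD_EXCEEDED = "failure_threshold_exceeded"
    NEXT_ACCESS = "next_access"


TRANSITIONS = {
    # A restart that interrupts a turn marks the session for resume.
    (State.ACTIVE, Event.INTERRUPTED_RESTART): State.RESUME_PENDING,
    # Another interrupted restart keeps the marker; escalation is separate.
    (State.RESUME_PENDING, Event.INTERRUPTED_RESTART): State.RESUME_PENDING,
    # The marker clears only after a successful resumed turn completes.
    (State.RESUME_PENDING, Event.RESUMED_TURN_COMPLETED): State.ACTIVE,
    # The existing stuck-loop path suspends the session past the threshold.
    (State.RESUME_PENDING, Event.FAILURE_THRESHOLD_EXCEEDED): State.SUSPENDED,
    # A suspended session is reset to a fresh session on next access.
    (State.SUSPENDED, Event.NEXT_ACCESS): State.ACTIVE,
}


def next_state(state: State, event: Event) -> State:
    """Apply one event; unknown (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note that plain access to a `RESUME_PENDING` session is deliberately absent from the table: returning the existing entry does not change state.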
File-level implementation plan
Primary files
`gateway/session.py`
Add persisted session fields and APIs:
- `SessionEntry` fields:
  - `resume_pending`
  - `resume_reason`
  - `last_resume_marked_at`
- update `to_dict()` / `from_dict()`
- add `mark_resume_pending(...)`
- add `clear_resume_pending(...)`
- update `get_or_create_session()` so `resume_pending` returns existing session instead of resetting
`gateway/run.py`
Update gateway behavior:
- mark interrupted sessions `resume_pending=True`
- strengthen the interruption note based on `session_entry.resume_pending`
- clear `resume_pending` only after a successful resumed turn completes
`tests/gateway/`
Add or update tests for:
- escalation through `.restart_failure_counts` / `_suspend_stuck_loop_sessions()`
- `/resume` guidance only appears when clean-slate fallback actually occurs
Test plan
Unit tests
`tests/gateway/test_restart_resume_pending.py` (new)
Suggested cases:
1. mark resume pending persists state
   - call `mark_resume_pending()`
2. `resume_pending` does not create new session id
   - entry has `resume_pending=True`
   - call `get_or_create_session()`
   - assert `session_id` is unchanged
3. suspended still creates new session id
4. clear resume pending after success
   - assert `resume_pending=False`

`tests/gateway/test_restart_recovery_flow.py` (new)
Suggested cases:
1. interrupted restart on same session key resumes existing session
   - next message on the same `session_key` loads the same transcript
2. tool-tail transcript triggers auto-continue note on resumed session
   - transcript tail is `tool`
3. repeated restart failures escalate to suspended
4. clean restart remains unchanged
   - `.clean_shutdown` path still preserves session as before

Message-copy tests
Add assertions for the updated shutdown banner and fallback text.
Backward compatibility
This change should be backward-compatible with existing stored sessions.
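
To illustrate why defaulted fields keep older stored entries loadable, here is a minimal sketch. It is not the real `SessionEntry`: the class below carries only the fields needed for the example, and treating `last_resume_marked_at` as a float timestamp is an assumption.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class SessionEntry:
    session_key: str
    session_id: str
    suspended: bool = False
    # New fields: every one has a default, so old records still load.
    resume_pending: bool = False
    resume_reason: Optional[str] = None
    last_resume_marked_at: Optional[float] = None  # assumed epoch seconds

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SessionEntry":
        return cls(
            session_key=data["session_key"],
            session_id=data["session_id"],
            suspended=data.get("suspended", False),
            # Missing keys fall back to "not resumable", never to resumable.
            resume_pending=data.get("resume_pending", False),
            resume_reason=data.get("resume_reason"),
            last_resume_marked_at=data.get("last_resume_marked_at"),
        )
```

A record written before this change deserializes with `resume_pending=False`, and an existing `suspended=True` value passes through untouched.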
Migration behavior
- existing `sessions.json` entries simply deserialize with default values for new fields
- `suspended=True` entries should keep current semantics
Important safety note
Do not silently reinterpret existing `suspended=True` as resumable. That would change meaning for users who explicitly relied on the current clean-slate escape hatch.
Risks
1. Resume loop risk
If resume is attempted too aggressively, a truly poisoned session could keep re-entering the same bad state.
Mitigation: keep thresholded escalation in the existing `.restart_failure_counts` / `_suspend_stuck_loop_sessions()` flow.
2. Partial transcript ambiguity
If a turn was interrupted mid-assistant generation, the last messages may not be perfectly shaped.
Mitigation: keep the recovery note explicit and rely on existing transcript loading behavior. Tests should cover common partial-tail shapes.
3. Messaging confusion during rollout
If message copy changes before semantics change, UX could still be misleading.
Mitigation: land copy changes in the same PR as behavior changes.
Open questions
- `resume_pending` should clear only after a successful completed turn, not at turn start.
- Resume only within the same `session_key`. Do not cross lanes.
- `suspend_recently_active()` should remain only as a narrower crash-recovery fallback and should not suspend explicit `resume_pending` sessions.
Recommended implementation order
1. `SessionEntry` resume-pending fields + serialization
2. `SessionStore.mark_resume_pending()` / `clear_resume_pending()`
3. `get_or_create_session()` to preserve same session for `resume_pending`
4. `gateway/run.py`
5. `.restart_failure_counts` / `_suspend_stuck_loop_sessions()` escalation
Success criteria
This PR is successful if all of the following are true:
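- The next message after an interrupted restart resumes the existing session for the same `session_key`.
- The resumed session keeps the same `session_id` and transcript history.
- Users are not pointed to `/resume` for the normal restart-recovery case.
Strong opinion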
Hermes should default to continuity after restart and only fall back to clean-slate reset when recovery is actually failing.
Right now the system is optimized for protecting itself from stuck loops at the cost of making ordinary restart recovery feel broken. That tradeoff is backwards for user experience.
The correct product stance is:
That gets the common case right without giving up the escape hatch.