spec: automatic session resume after gateway restart by BrennerSpear · Pull Request #11852 · NousResearch/hermes-agent

BrennerSpear · 2026-04-17T23:19:32Z

PR Spec: Automatic Session Resume After Gateway Restart

Date: 2026-04-17

Intent: Fix the terrible current UX where Hermes says an interrupted task will resume after restart, but a forced/interrupted restart often converts that thread into a fresh session and tells the user to /resume manually.

TL;DR

Hermes should treat restart interruption as a resumable session state, not as a session reset.

Today, when gateway shutdown cannot drain active work within agent.restart_drain_timeout, startup falls back to suspend_recently_active(). That marks recently-active sessions as suspended, and the next message in the same thread causes SessionStore.get_or_create_session() to create a new session ID with auto_reset_reason="suspended". The user sees:

shutdown banner: "Send any message after restart to resume where it left off."
then next-turn banner: "Session automatically reset (previous session was stopped or interrupted). Use /resume..."

That is exactly the wrong behavior for the common case: same thread, same user, same restart, and the user still wants the same task.

Recommendation: introduce a distinct persisted state like resume_pending / interrupted_by_restart and keep the existing session_id on the next message with the same session_key. Reuse the existing transcript reload and auto-continue logic in gateway/run.py instead of creating a new session. Escalation should reuse the existing .restart_failure_counts / stuck-loop detection path rather than adding a parallel counter on SessionEntry.

Problem statement

Current user experience

Current behavior is incoherent:

Hermes sends an optimistic restart notice.
Gateway restart times out draining active agents.
Startup suspends recently-active sessions.
The user's next message on the same session_key lands in a fresh session.
Hermes tells the user to browse /resume manually.

This is a terrible experience because:

the product promises one behavior and delivers the opposite,
the user is forced to understand internal session mechanics,
the resume path is thread-local and obvious to the system but not automatic,
the current fallback destroys continuity even though transcript history still exists.

Root cause in current code

Relevant current behavior:

shutdown banner text in gateway/run.py
- _notify_active_sessions_of_shutdown()
- says: Send any message after restart to resume where it left off.
forced-interrupt path in gateway/run.py
- if drain times out, gateway interrupts active agents
- skips .clean_shutdown marker so next startup treats the prior run as unsafe
startup recovery in gateway/run.py
- calls self.session_store.suspend_recently_active() when .clean_shutdown is absent
session reset behavior in gateway/session.py
- get_or_create_session() checks entry.suspended
- suspended sessions are turned into a new session ID with auto_reset_reason="suspended"
reset notice in gateway/run.py
- emits: Session automatically reset (previous session was stopped or interrupted)...
transcript continuation logic already exists in gateway/run.py
- if loaded history ends with a tool message, Hermes prepends a system note telling the model to finish the interrupted work

Important observation: Hermes already has part of the resume mechanism. The main thing preventing automatic resume is that forced restart currently turns the session into a fresh session instead of preserving the old one.

Product goal

When a user restarts Hermes and then sends the next message that resolves to the same session_key (same chat/thread/topic lane), Hermes should, by default:

preserve the same conversation lane,
preserve the same session_id,
reload the same transcript,
inform the model that the previous turn was interrupted by restart,
continue/resume automatically,
avoid making the user manually browse /resume unless recovery has actually failed.

Desired UX

For the normal case:

user starts a long task in thread X
gateway restarts
user returns to the same lane and says anything
Hermes continues from the interrupted session on that same session_key

For the pathological case:

same session repeatedly hangs across multiple restarts
Hermes eventually abandons auto-resume for that session and gives the user a clean slate

Non-goals

This PR should not try to:

implement fully autonomous resume with no user follow-up message,
merge different threads/chats/topics into one session,
auto-resume across a different thread than the one that was interrupted,
invent a generic distributed job recovery layer,
remove existing stuck-loop safety mechanisms entirely,
change normal idle/daily session_reset policy semantics.

This is specifically about same-lane restart continuity after gateway interruption.

Design principles

Restart interruption is not the same as intentional reset.
Same thread should keep same session unless proven unsafe.
Safety escalation should be progressive, not immediate.
User-visible messages must describe the actual recovery semantics.
Reuse existing transcript + auto-continue machinery instead of inventing new prompt plumbing.

Recommendation

Introduce a resumable restart-interruption state

Add a persisted session state that is distinct from suspended.

New state

Recommended SessionEntry fields in gateway/session.py:

resume_pending: bool = False
resume_reason: Optional[str] = None  # e.g. "restart_timeout", "crash_recovery"
last_resume_marked_at: Optional[datetime] = None

Meaning of states

suspended=True
- do not resume automatically
- next access should create a fresh session
- used for known-poisoned sessions / explicit stuck-loop protection
resume_pending=True
- user should stay on the same session
- next access should preserve the existing session ID
- used when a restart interrupted in-flight work but we still expect same-thread continuation to succeed

This is the key architectural distinction missing today.
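The field list and the two state meanings above can be sketched as a small dataclass. This is a minimal illustration, not the real SessionEntry: the class name SessionEntrySketch and the trimmed field set are assumptions, and only the three resume fields and suspended come from this spec.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class SessionEntrySketch:
    """Hypothetical, trimmed-down stand-in for SessionEntry."""

    session_key: str
    session_id: str
    suspended: bool = False
    resume_pending: bool = False
    resume_reason: Optional[str] = None  # e.g. "restart_timeout"
    last_resume_marked_at: Optional[datetime] = None

    def to_dict(self) -> Dict[str, Any]:
        marked = self.last_resume_marked_at
        return {
            "session_key": self.session_key,
            "session_id": self.session_id,
            "suspended": self.suspended,
            "resume_pending": self.resume_pending,
            "resume_reason": self.resume_reason,
            # datetimes are stored as ISO 8601 strings so the dict is JSON-safe
            "last_resume_marked_at": marked.isoformat() if marked else None,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SessionEntrySketch":
        marked = data.get("last_resume_marked_at")
        return cls(
            session_key=data["session_key"],
            session_id=data["session_id"],
            # .get() with defaults lets entries written before this change load
            suspended=data.get("suspended", False),
            resume_pending=data.get("resume_pending", False),
            resume_reason=data.get("resume_reason"),
            last_resume_marked_at=datetime.fromisoformat(marked) if marked else None,
        )
```

Loading through .get() with defaults means an entry that predates the new fields deserializes as resume_pending=False, which matches the backward-compatibility intent of this spec.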

High-level behavior change

Current behavior

restart timed out
  -> skip .clean_shutdown
  -> startup: suspend_recently_active()
  -> session.suspended = True
  -> next message => new session_id
  -> user gets reset notice + /resume guidance

Proposed behavior

restart timed out
  -> mark active session(s) as resume_pending=True
  -> next startup preserves mapping
  -> next message on the same `session_key` returns existing session entry
  -> transcript reloads from same session_id
  -> model gets interruption note
  -> Hermes continues automatically

Escalation path

restart timed out repeatedly for same session
  -> increment existing .restart_failure_counts counter
  -> _suspend_stuck_loop_sessions() suspends once threshold is exceeded
  -> only then force fresh-session fallback

Detailed design

1) Persist `resume_pending` on interrupted restart

Where to mark it

During shutdown in gateway/run.py, after drain timeout is detected and the gateway force-interrupts active agents, mark the active session keys as resume_pending=True.

This should happen instead of relying on the startup-wide "recently active means suspend" fallback for these sessions.

Why here

At shutdown time, the gateway knows:

which sessions were actually running,
that the interruption came from restart/shutdown,
that this is not an idle/daily reset,
that the user did not ask for /new.

That is the correct moment to record resumable interruption state.

Proposed helper

Add a method to SessionStore in gateway/session.py:

def mark_resume_pending(
    self,
    session_key: str,
    *,
    reason: str = "restart_timeout",
) -> bool:
    ...

Responsibilities:

set resume_pending=True
set resume_reason
set last_resume_marked_at
persist metadata in sessions.json
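A rough sketch of those responsibilities, assuming a store that keeps entries as plain dicts in a single JSON file. SessionStoreSketch, add(), get() and demo() are hypothetical names used only for illustration; the real SessionStore internals are not shown in this spec.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Optional


class SessionStoreSketch:
    """Hypothetical file-backed store; entries are plain dicts for brevity."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._entries: Dict[str, dict] = {}
        if path.exists():
            self._entries = json.loads(path.read_text(encoding="utf-8"))

    def _save(self) -> None:
        self._path.write_text(json.dumps(self._entries, indent=2), encoding="utf-8")

    def add(self, session_key: str, session_id: str) -> None:
        self._entries[session_key] = {"session_id": session_id, "resume_pending": False}
        self._save()

    def get(self, session_key: str) -> Optional[dict]:
        return self._entries.get(session_key)

    def mark_resume_pending(self, session_key: str, *, reason: str = "restart_timeout") -> bool:
        entry = self._entries.get(session_key)
        if entry is None:
            return False  # unknown key: nothing to mark
        entry["resume_pending"] = True
        entry["resume_reason"] = reason
        entry["last_resume_marked_at"] = datetime.now(timezone.utc).isoformat()
        self._save()  # persist immediately so the marker survives the restart
        return True


def demo() -> dict:
    """Mark a session, then reload the store from disk as a restart would."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sessions.json"
        store = SessionStoreSketch(path)
        store.add("discord:thread-1", "sess-1")
        store.mark_resume_pending("discord:thread-1")
        return SessionStoreSketch(path).get("discord:thread-1")
```

The boolean return value lets the shutdown path log which session keys could not be marked without raising during teardown.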

2) Do not auto-reset `resume_pending` sessions on next access

Current bad behavior

SessionStore.get_or_create_session() currently treats suspended as "auto-reset on next access".

Proposed behavior

Extend get_or_create_session() logic:

if entry.suspended → current reset behavior stays
if entry.resume_pending → return the existing entry while the recovery window is still fresh, and only clear the marker after a successful turn completes

Pseudo-shape:

if entry.suspended:
    reset_reason = "suspended"
elif entry.resume_pending:
    entry.updated_at = now
    self._save()
    return entry
else:
    reset_reason = self._should_reset(entry, source)

This is the core functional fix.
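The pseudo-shape does not show what "while the recovery window is still fresh" means in practice. One possible reading, sketched as a pure function: RECOVERY_WINDOW, EntrySketch and reset_reason_for() are hypothetical names, the one-hour value is an arbitrary placeholder, and policy_reset_reason stands in for whatever self._should_reset(entry, source) would return.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

RECOVERY_WINDOW = timedelta(hours=1)  # hypothetical value, not from the spec


@dataclass
class EntrySketch:
    suspended: bool = False
    resume_pending: bool = False
    last_resume_marked_at: Optional[datetime] = None


def reset_reason_for(
    entry: EntrySketch,
    now: datetime,
    policy_reset_reason: Optional[str],
) -> Optional[str]:
    """Return a reset reason, or None to keep the existing session_id."""
    if entry.suspended:
        # suspended always wins: known-poisoned sessions get a clean slate
        return "suspended"
    if entry.resume_pending:
        marked = entry.last_resume_marked_at
        if marked is not None and now - marked <= RECOVERY_WINDOW:
            return None  # fresh marker: preserve the session
    # stale or unmarked: fall through to the normal idle/daily policy
    return policy_reset_reason
```

With this shape a session whose user never returns does not stay pinned in resume_pending forever, because a stale marker falls back to the normal reset policy.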

3) Reuse existing transcript reload and auto-continue logic

This PR should explicitly lean on behavior that already exists.

Existing asset

gateway/run.py already prepends an interruption note when the loaded history ends in a tool result:

if transcript ends with role tool, Hermes tells the model to finish processing interrupted tool results before addressing the new user message.

Extend the system note behavior

If session_entry.resume_pending is set, prepend a stronger note such as:

[System note: Your previous turn in this same session was interrupted by a gateway restart. Continue from the existing transcript. If there are unfinished tool results, process them first, summarize what was accomplished, then answer the user's new message.]

This should work whether the transcript ended with:

a tool message,
an interrupted assistant turn,
or a partially completed tool-heavy exchange.
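The selection between the stronger restart note and the existing tool-tail note could look roughly like the sketch below. pick_interruption_note() is a hypothetical helper, and TOOL_TAIL_NOTE is placeholder wording, since the text of the existing note is not quoted in this spec; RESTART_NOTE is the wording proposed above.

```python
from typing import Dict, List, Optional

# Wording proposed in this spec.
RESTART_NOTE = (
    "[System note: Your previous turn in this same session was interrupted by a "
    "gateway restart. Continue from the existing transcript. If there are "
    "unfinished tool results, process them first, summarize what was "
    "accomplished, then answer the user's new message.]"
)

# Placeholder only: the existing tool-tail wording is not reproduced in the spec.
TOOL_TAIL_NOTE = "[System note: finish processing the interrupted tool results first.]"


def pick_interruption_note(
    history: List[Dict[str, str]],
    resume_pending: bool,
) -> Optional[str]:
    """Choose which note, if any, to prepend to the resumed turn."""
    if resume_pending:
        # The restart note covers every transcript tail shape, including tool tails.
        return RESTART_NOTE
    if history and history[-1].get("role") == "tool":
        return TOOL_TAIL_NOTE
    return None
```

Checking resume_pending first means a resumed session gets one note rather than two, even when its transcript also ends with a tool message.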

Why this is enough for v1

We do not need a brand-new recovery engine for the first version.

Preserving the same session ID plus transcript reload plus better interruption note gets the common case back to a sane product experience.

4) Keep stuck-loop protection, but reuse the existing restart-failure mechanism

We should not regress the original safety intent behind the stuck-loop work.

Proposed rule

first interrupted restart for a session → auto-resume
second interrupted restart for the same session_key → still auto-resume
third interrupted restart for the same session_key → let the existing stuck-loop path suspend it

This keeps safety without making the default path destructive.

Recommended implementation

Reuse the existing gateway-level .restart_failure_counts file and _suspend_stuck_loop_sessions() flow:

shutdown-time drain timeout still calls mark_resume_pending(...) for the interrupted session_keys
successful turn completion clears the restart-failure count for that session_key
repeated interrupted restarts are counted in .restart_failure_counts
_suspend_stuck_loop_sessions() flips the session to suspended=True once the existing threshold is exceeded

Do not add or maintain a parallel resume_attempts counter on SessionEntry.
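To make the first/second/third rule concrete, here is an in-memory sketch of a per-session_key counter with a threshold. It is not the real .restart_failure_counts implementation: the class, its method names and the dict-based storage are assumptions, and treating a count of 3 as the suspension point is one reading of the proposed rule.

```python
from typing import Dict, List

THRESHOLD = 3  # assumption: the third interrupted restart triggers suspension


class RestartFailureCountsSketch:
    """Hypothetical in-memory stand-in for the .restart_failure_counts file."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = {}

    def record_interrupted_restart(self, session_key: str) -> int:
        """Count one more interrupted restart for this session_key."""
        self._counts[session_key] = self._counts.get(session_key, 0) + 1
        return self._counts[session_key]

    def clear(self, session_key: str) -> None:
        """Called after a successful turn completes on this session_key."""
        self._counts.pop(session_key, None)

    def keys_to_suspend(self) -> List[str]:
        """Session keys that have reached the threshold."""
        return sorted(k for k, n in self._counts.items() if n >= THRESHOLD)
```

Because a successful turn clears the count, only consecutive interrupted restarts on the same session_key accumulate toward suspension.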

5) Narrow the role of `suspend_recently_active()`

suspend_recently_active() is too blunt as the generic fallback for restart interruption.

Current role

It treats "recently active at startup after unclean shutdown" as a reason to force clean-slate behavior.

Proposed role after this PR

Reserve it for narrower cases, such as:

startup crash recovery when explicit resume_pending metadata is absent,
legacy upgrade path / backward compatibility,
emergency fallback for clearly unsafe sessions.

Normal interrupted-restart recovery with explicit resume_pending metadata should not be suspended by this helper.

Important outcome

This means restart interruption should no longer immediately flow through the same code path as 'known stuck session'.

That separation is the real product fix.

6) Fix user-facing messaging

Current messaging is misleading

Shutdown banner

Current wording:

Send any message after restart to resume where it left off.

This is too absolute.

Reset notice

Current wording points users to session_reset config even when the real cause is startup suspension after interrupted restart.

Proposed messaging

Shutdown banner

For resumable restart:

non-threaded chat:
- Gateway restarting — I'll try to resume this session after restart. Send a message in this chat to continue.
threaded/topic chat:
- Gateway restarting — I'll try to resume this session after restart. Send a message in this thread/topic to continue.

If the system escalates to forced clean slate

Only after repeated failure should the user see something like:

This session was interrupted repeatedly during restart recovery, so Hermes started a fresh session to avoid getting stuck. Use /resume if you want the old transcript.

Message principle

Only mention /resume when Hermes has actually decided not to auto-resume.

State machine

ACTIVE
  -> clean restart/shutdown drains successfully
     -> ACTIVE (same session preserved)

ACTIVE
  -> restart/crash interrupts in-flight work
     -> RESUME_PENDING

RESUME_PENDING
  -> next message in same thread/topic
     -> ACTIVE (same session_id, transcript reloaded)

RESUME_PENDING
  -> repeated interrupted restarts exceed threshold
     -> SUSPENDED

SUSPENDED
  -> next message
     -> NEW_SESSION (fresh session_id, old transcript still available via /resume)
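The diagram can also be written as a transition table, which makes it easy to see which (state, event) pairs are defined. The State and Event enums and next_state() are hypothetical names; the table encodes only the transitions listed above and raises for anything else.

```python
from enum import Enum, auto
from typing import Dict, Tuple


class State(Enum):
    ACTIVE = auto()
    RESUME_PENDING = auto()
    SUSPENDED = auto()
    NEW_SESSION = auto()


class Event(Enum):
    CLEAN_RESTART = auto()
    INTERRUPTED_RESTART = auto()
    USER_MESSAGE = auto()
    THRESHOLD_EXCEEDED = auto()


# Only the transitions drawn in the state machine are listed here.
TRANSITIONS: Dict[Tuple[State, Event], State] = {
    (State.ACTIVE, Event.CLEAN_RESTART): State.ACTIVE,
    (State.ACTIVE, Event.INTERRUPTED_RESTART): State.RESUME_PENDING,
    (State.RESUME_PENDING, Event.USER_MESSAGE): State.ACTIVE,
    (State.RESUME_PENDING, Event.THRESHOLD_EXCEEDED): State.SUSPENDED,
    (State.SUSPENDED, Event.USER_MESSAGE): State.NEW_SESSION,
}


def next_state(state: State, event: Event) -> State:
    """Look up the next state; unlisted pairs are treated as errors."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"no transition listed for {state.name} on {event.name}"
        ) from None
```

Raising on unlisted pairs is a deliberate choice for the sketch: it surfaces any case the diagram does not cover instead of silently picking a behavior.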

File-level implementation plan

Primary files

`gateway/session.py`

Add persisted session fields and APIs:

new SessionEntry fields:
- resume_pending
- resume_reason
- last_resume_marked_at
serialization/deserialization support in to_dict() / from_dict()
helper methods:
- mark_resume_pending(...)
- clear_resume_pending(...)
update get_or_create_session() so resume_pending returns existing session instead of resetting

`gateway/run.py`

Update gateway behavior:

after drain timeout, mark active sessions resume_pending=True
on resumed turn, inject restart-interruption system note when session_entry.resume_pending
clear resume_pending only after a successful resumed turn completes
update shutdown banner wording to promise attempted recovery, not guaranteed recovery
stop routing normal interrupted restart recovery through immediate clean-slate semantics

`tests/gateway/`

Add or update tests for:

same-session resume after interrupted restart
transcript preserved across restart timeout
tool-result auto-continue still works on resumed session
repeated recovery failure escalates through .restart_failure_counts / _suspend_stuck_loop_sessions()
shutdown banner wording is no longer misleading
/resume guidance only appears when clean-slate fallback actually occurs

Test plan

Unit tests

`tests/gateway/test_restart_resume_pending.py` (new)

Suggested cases:

mark resume pending persists state
- create session
- call mark_resume_pending()
- reload store
- assert flags persisted
resume_pending does not create new session id
- create session
- mark resume_pending=True
- call get_or_create_session()
- assert returned session_id is unchanged
suspended still creates new session id
- existing current behavior regression guard
clear resume pending after success
- mark resumable
- simulate successful turn completion
- assert resume_pending=False

`tests/gateway/test_restart_recovery_flow.py` (new)

Suggested cases:

interrupted restart on same session key resumes existing session
- transcript exists under original session id
- next message on the same session_key loads the same transcript
- no auto-reset notice
tool-tail transcript triggers auto-continue note on resumed session
- transcript ends with tool
- resumed run prepends correct system note
repeated restart failures escalate to suspended
- simulate threshold crossings
- assert fresh-session fallback only after threshold
clean restart remains unchanged
- .clean_shutdown path still preserves session as before

Message-copy tests

Add assertions for the updated shutdown banner and fallback text.

Backward compatibility

This change should be backward-compatible with existing stored sessions.

Migration behavior

old sessions.json entries simply deserialize with default values for new fields
no database migration should be required if session metadata remains file-backed
existing suspended=True entries should keep current semantics

Important safety note

Do not silently reinterpret existing suspended=True as resumable. That would change meaning for users who explicitly relied on the current clean-slate escape hatch.

Risks

1. Resume loop risk

If resume is attempted too aggressively, a truly poisoned session could keep re-entering the same bad state.

Mitigation: keep thresholded escalation in the existing .restart_failure_counts / _suspend_stuck_loop_sessions() flow.

2. Partial transcript ambiguity

If a turn was interrupted mid-assistant generation, the last messages may not be perfectly shaped.

Mitigation: keep the recovery note explicit and rely on existing transcript loading behavior. Tests should cover common partial-tail shapes.

3. Messaging confusion during rollout

If message copy changes before semantics change, UX could still be misleading.

Mitigation: land copy changes in the same PR as behavior changes.

Open questions

resume_pending should clear only after a successful completed turn, not at turn start.
Recovery should only be attempted on the same session_key. Do not cross lanes.
suspend_recently_active() should remain only as a narrower crash-recovery fallback and should not suspend explicit resume_pending sessions.
The recovery note should stay an internal system note in v1, shown to the user only when debugging is enabled.

Recommended implementation order

Add SessionEntry resume-pending fields + serialization
Add SessionStore.mark_resume_pending() / clear_resume_pending()
Change get_or_create_session() to preserve same session for resume_pending
Mark active sessions resume-pending on drain-timeout shutdown
Inject restart-resume system note in gateway/run.py
Clear pending state after successful completion
Reuse existing .restart_failure_counts / _suspend_stuck_loop_sessions() escalation
Update shutdown/fallback copy
Add regression tests

Success criteria

This PR is successful if all of the following are true:

A long-running task interrupted by gateway restart in a Discord/Telegram lane resumes automatically on the next message with the same session_key.
The resumed thread keeps the same session_id and transcript history.
Hermes no longer tells the user to /resume for the normal restart-recovery case.
Existing clean restart behavior is preserved.
Truly stuck sessions still have a safety escape hatch after repeated failures.
User-facing restart copy accurately describes attempted recovery rather than promising impossible behavior.

Strong opinion

Hermes should default to continuity after restart and only fall back to clean-slate reset when recovery is actually failing.

Right now the system is optimized for protecting itself from stuck loops at the cost of making ordinary restart recovery feel broken. That tradeoff is backwards for user experience.

The correct product stance is:

same session_key + immediate post-restart message = continue automatically
repeated failed recovery = fresh session as safety fallback

That gets the common case right without giving up the escape hatch.

teknium1 · 2026-04-18T02:01:43Z

Green light — go ahead and implement this. Diagnosis matches current main exactly (forced-interrupt → suspend_recently_active() → next msg gets new session_id + misleading reset banner), and the design reuses existing transcript-continuation machinery cleanly.

One design correction before you start: the spec's Option B adds resume_attempts to SessionEntry, but we already ship a stuck-loop counter from PR #7536 (.restart_failure_counts JSON file, 3-restart threshold, cleared on successful turn completion). Please use that existing mechanism for escalation rather than adding a parallel field on SessionEntry. The resume_pending state itself on SessionEntry is correct and new; just don't duplicate the counter.

Concretely that means:

Add resume_pending / resume_reason / last_resume_marked_at fields → yes
Skip resume_attempts field → the .restart_failure_counts file already tracks this per session_key
Escalation threshold already exists in _suspend_stuck_loop_sessions() — your change should make suspend_recently_active() narrower so it doesn't fire on the normal interrupted-restart path, and let the existing counter+threshold handle repeated failures

Also confirming two of your open-question recommendations are the right calls:

Clear resume_pending on successful turn completion, not turn start
Same session_key only, no cross-thread recovery

Ping me when you have a draft PR up and we'll get it reviewed.

BrennerSpear · 2026-04-18T05:17:18Z

+    session_key: str,
+    *,
+    reason: str = "restart_timeout",
+    increment_attempts: bool = True,


resume_attempts counts restart events, not failed recoveries — causing premature escalation

increment_attempts defaults to True in mark_resume_pending(), which fires on every interrupted restart. If the gateway cycles 3 times in a deploy loop without the user ever sending a message, the session hits the escalation threshold and becomes suspended before a single recovery has been attempted.

The counter should track failed recovery turns (a resumed turn that was interrupted again), not restart-mark calls. Increment it in clear_resume_pending(failed=True) or in the escalation path after a resumed turn crashes — not in mark_resume_pending().

Section 4 says "first/second/third interrupted restart" which implies counting restarts, but that creates the deploy-loop problem. Worth clarifying which semantic is intended before implementation.

BrennerSpear · 2026-04-18T05:17:25Z

+Extend `get_or_create_session()` logic:
+
+- if `entry.suspended` → current reset behavior stays
+- if `entry.resume_pending` → **return the existing entry** and clear or downgrade the resume marker once the resume turn begins successfully


Contradiction between section 2 and Open Question 1 on when to clear resume_pending

Section 2 (this line) says "clear or downgrade the resume marker once the resume turn begins successfully." Open Question 1 (line 538) recommends "clear after successful completion."

These conflict in a meaningful way: if cleared on start and the resumed turn crashes mid-run, the session exits resume_pending without recording a failure. On the next restart, mark_resume_pending() would set it again but resume_attempts has no record of the prior failed resume — breaking the escalation counter.

Pick one semantic and remove the ambiguity. "Clear on completion" is safer; just make sure the in-progress guard (mentioned in Open Question 1) prevents double-resume if the session is accessed concurrently during the running turn.

BrennerSpear · 2026-04-18T05:17:32Z

+  -> next message in same thread/topic
+     -> ACTIVE (same session_id, transcript reloaded)
+
+RESUME_PENDING


State machine is missing the RESUME_PENDING → gateway restarts again transition

The state machine shows RESUME_PENDING going to ACTIVE (user messages) or to SUSPENDED (threshold exceeded), but has no transition for "gateway restarts again while still in RESUME_PENDING." This is the exact scenario in a deploy loop — the user hasn't messaged yet between restarts.

If each restart calls mark_resume_pending() with increment_attempts=True, the session silently escalates to SUSPENDED before the user gets a chance to reply. The state machine should make this explicit — either as a valid self-loop (RESUME_PENDING → RESUME_PENDING + attempt++) or as a special transition with a clear policy on whether successive restarts-without-user-message count toward the threshold.

BrennerSpear · 2026-04-18T05:17:41Z

+#### If the system escalates to forced clean slate
+Only after repeated failure should the user see something like:
+
+- `This session was interrupted repeatedly during restart recovery, so Hermes started a fresh session to avoid getting stuck. Use /resume if you want the old transcript.`


/resume discoverability after escalation to a new session_id is unspecified

This copy tells the user to /resume after escalation, but when suspended=True causes get_or_create_session() to mint a new session_id, the new session entry has no reference back to the old one. The spec doesn't confirm that /resume can discover the previous session_id in this case.

If /resume today works by listing recent sessions for the user to pick from, it may already handle this. But if it relies on the current session knowing its predecessor, it won't. The spec should either:

explicitly verify the current /resume flow handles cross-session-id lookup, or

add a task to store a parent_session_id field on the new session entry when escalation creates it.

Mentioning /resume in user-facing copy that doesn't actually work is worse than omitting it.

BrennerSpear · 2026-04-18T05:17:50Z

+
+Reserve it for narrower cases, such as:
+
+- startup crash recovery when explicit `resume_pending` metadata is absent,


Hard kills (SIGKILL/OOM) still hit the original bad UX — worth calling out explicitly

Section 5 lists "startup crash recovery when explicit resume_pending metadata is absent" as a remaining use case for suspend_recently_active(). But this is the same broken path the spec is trying to fix: the user was sent the optimistic banner, the process died without writing any markers, and on next startup the session gets reset.

This is a known limitation, but the spec should say so explicitly rather than burying it in the suspend_recently_active() residual use-cases list. One mitigation worth naming: write resume_pending markers at the start of drain (not only after drain timeout), so that even a SIGKILL during drain leaves markers on disk. Whether that's in scope for v1 or a follow-up should be a stated decision.

BrennerSpear · 2026-04-18T05:31:10Z

            timeout = self._restart_drain_timeout
            active_agents, timed_out = await self._drain_active_agents(timeout)
            if timed_out:
+                timed_out_session_keys = set(self._running_agents.keys())


Use active_agents.keys() instead of self._running_agents.keys() here.

active_agents is the authoritative snapshot returned by _drain_active_agents() — it contains exactly the sessions that were still running at timeout. Re-reading self._running_agents is normally equivalent, but it introduces a subtle TOCTOU hazard: if any agent completes (and removes itself from _running_agents) in the window between _drain_active_agents() returning and this line executing, those sessions would be missed. The return value is the right source of truth.
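A toy illustration of the hazard, using plain dicts instead of the real gateway objects. drain_snapshot() is a hypothetical stand-in for _drain_active_agents() that assumes the return value is a copy of what was still running at timeout.

```python
from typing import Dict, Set, Tuple


def drain_snapshot(running_agents: Dict[str, object]) -> Tuple[Dict[str, object], bool]:
    """Hypothetical stand-in for _drain_active_agents(): copy what is still running."""
    active = dict(running_agents)
    return active, bool(active)


running: Dict[str, object] = {"chat:1": object(), "chat:2": object()}
active_agents, timed_out = drain_snapshot(running)

# An agent finishes and deregisters itself after the drain call has returned.
del running["chat:2"]

keys_from_snapshot: Set[str] = set(active_agents)  # still contains both keys
keys_from_live_dict: Set[str] = set(running)       # has lost "chat:2"
```

Marking from the snapshot keeps "chat:2" in the set of sessions to flag, while re-reading the live dict would skip it.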

BrennerSpear · 2026-04-18T05:31:17Z

+                        try:
+                            self.session_store.mark_resume_pending(
+                                session_key,
+                                reason="restart_timeout" if self._restart_requested else "shutdown_timeout",


Semantic mismatch: when _restart_requested=False (clean shutdown with drain timeout), sessions are marked resume_pending=True with reason="shutdown_timeout" — but the shutdown banner on line ~1608 tells the user only "Your current task will be interrupted" with no recovery promise, and _prepend_restart_recovery_note() always says "interrupted by a gateway restart" regardless of the reason.

Result: a user who was told their task is simply interrupted will get a recovery note on next startup that wrongly blames a restart. Either skip mark_resume_pending for non-restart shutdowns, or thread resume_reason through to the system note copy.
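If the second option is taken, threading resume_reason into the note could look like the sketch below. restart_recovery_note() and the cause phrases are hypothetical; only the reason strings "restart_timeout" and "shutdown_timeout" come from the diff under review.

```python
from typing import Optional

# Hypothetical mapping from the stored resume_reason to user-neutral wording.
_CAUSES = {
    "restart_timeout": "a gateway restart",
    "shutdown_timeout": "a gateway shutdown",
}


def restart_recovery_note(resume_reason: Optional[str]) -> str:
    """Build the system note with a cause that matches the stored reason."""
    # Unknown or missing reasons fall back to neutral wording.
    cause = _CAUSES.get(resume_reason or "", "a gateway interruption")
    return (
        "[System note: Your previous turn in this same session was interrupted "
        f"by {cause}. Continue from the existing transcript. If there are "
        "unfinished tool results, process them first, summarize what was "
        "accomplished, then answer the user's new message.]"
    )
```

The neutral fallback avoids blaming a restart when the stored reason is missing or is a value the note builder does not recognize.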

BrennerSpear · 2026-04-18T05:31:23Z

+                elif entry.resume_pending:
+                    entry.updated_at = now
+                    self._save()
+                    return entry


This early-return bypasses _should_reset() entirely, including idle-timeout and daily-reset policies. last_resume_marked_at is stored but never consulted here.

If the user never returns to the interrupted thread, the session stays permanently stuck in resume_pending=True and will never idle-reset. Consider adding a recovery-window guard — e.g., if _now() - entry.last_resume_marked_at > recovery_window_seconds, fall through to the normal _should_reset() path instead of returning early.

BrennerSpear · 2026-04-18T05:31:30Z

+            entry.resume_reason = reason
+            entry.last_resume_marked_at = _now()
+            if increment_attempts:
+                entry.resume_attempts += 1


resume_attempts is incremented here but never checked against a threshold anywhere in the codebase. The spec promises "third interrupted restart → convert to suspended=True" as the escalation path for poisoned sessions, but without a threshold check this counter is purely decorative.

The only escalation that exists is the existing stuck-loop mechanism in _suspend_stuck_loop_sessions() (which clears resume_pending when it fires), but that fires on restart-count watermarks, not attempt-count. A session that hangs silently (never triggering stuck-loop detection) would accumulate resume_attempts forever without ever escalating.

The shutdown banner promised "send any message after restart to resume where you left off" but the code did the opposite: a drain-timeout restart skipped the .clean_shutdown marker, which made the next startup call suspend_recently_active(), which marked the session suspended, which made get_or_create_session() spawn a fresh session_id with a 'Session automatically reset. Use /resume...' notice, contradicting the banner.

Introduce a resume_pending state on SessionEntry that is distinct from suspended. Drain-timeout shutdown flags active sessions resume_pending instead of letting startup-wide suspension destroy them. The next message on the same session_key preserves the session_id, reloads the transcript, and the agent receives a reason-aware restart-resume system note that subsumes the existing tool-tail auto-continue note (PR #9934).

Terminal escalation still flows through the existing .restart_failure_counts stuck-loop counter (PR #7536, threshold 3), with no parallel counter on SessionEntry. suspended still wins over resume_pending in get_or_create_session() so genuinely stuck sessions converge to a clean slate.

Spec: PR #11852 (BrennerSpear). Implementation follows the spec with the approved correction (reuse .restart_failure_counts rather than adding a resume_attempts field).

Changes:

- gateway/session.py: SessionEntry.resume_pending/resume_reason/last_resume_marked_at + to_dict/from_dict; SessionStore.mark_resume_pending()/clear_resume_pending(); get_or_create_session() returns existing entry when resume_pending (suspended still wins); suspend_recently_active() skips resume_pending entries.
- gateway/run.py: _stop_impl() drain-timeout branch marks active sessions resume_pending before _interrupt_running_agents(); _run_agent() injects reason-aware restart-resume system note that subsumes the tool-tail case; successful-turn cleanup also clears resume_pending next to _clear_restart_failure_count(); _notify_active_sessions_of_shutdown() softens the restart banner to 'I'll try to resume where you left off' (honest about stuck-loop escalation).
- tests/gateway/test_restart_resume_pending.py: 29 new tests covering SessionEntry roundtrip, mark/clear helpers, get_or_create_session precedence (suspended > resume_pending), suspend_recently_active skip, drain-timeout mark reason (restart vs shutdown), system-note injection decision tree (including tool-tail subsumption), banner wording, and stuck-loop escalation override.

… (#12301)

The shutdown banner promised "send any message after restart to resume where you left off" but the code did the opposite: a drain-timeout restart skipped the .clean_shutdown marker, which made the next startup call suspend_recently_active(), which marked the session suspended, which made get_or_create_session() spawn a fresh session_id with a 'Session automatically reset. Use /resume...' notice, contradicting the banner.

Introduce a resume_pending state on SessionEntry that is distinct from suspended. Drain-timeout shutdown flags active sessions resume_pending instead of letting startup-wide suspension destroy them. The next message on the same session_key preserves the session_id, reloads the transcript, and the agent receives a reason-aware restart-resume system note that subsumes the existing tool-tail auto-continue note (PR #9934).

Terminal escalation still flows through the existing .restart_failure_counts stuck-loop counter (PR #7536, threshold 3); there is no parallel counter on SessionEntry. suspended still wins over resume_pending in get_or_create_session() so genuinely stuck sessions converge to a clean slate.

Spec: PR #11852 (BrennerSpear). Implementation follows the spec with the approved correction (reuse .restart_failure_counts rather than adding a resume_attempts field).

Changes:

- gateway/session.py: SessionEntry.resume_pending/resume_reason/last_resume_marked_at + to_dict/from_dict; SessionStore.mark_resume_pending()/clear_resume_pending(); get_or_create_session() returns existing entry when resume_pending (suspended still wins); suspend_recently_active() skips resume_pending entries.
- gateway/run.py: _stop_impl() drain-timeout branch marks active sessions resume_pending before _interrupt_running_agents(); _run_agent() injects reason-aware restart-resume system note that subsumes the tool-tail case; successful-turn cleanup also clears resume_pending next to _clear_restart_failure_count(); _notify_active_sessions_of_shutdown() softens the restart banner to 'I'll try to resume where you left off' (honest about stuck-loop escalation).
- tests/gateway/test_restart_resume_pending.py: 29 new tests covering SessionEntry roundtrip, mark/clear helpers, get_or_create_session precedence (suspended > resume_pending), suspend_recently_active skip, drain-timeout mark reason (restart vs shutdown), system-note injection decision tree (including tool-tail subsumption), banner wording, and stuck-loop escalation override.
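
To make the state precedence in this commit message concrete, here is a minimal, self-contained sketch. It is not the code from gateway/session.py: the class and method names mirror the ones the message mentions, but the field types, the in-memory dict, the `max_age_seconds` parameter, the `_new_entry` helper, and the `"restart_timeout"` reason string are all assumptions made for illustration.

```python
from __future__ import annotations

import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class SessionEntry:
    """Hypothetical stand-in for the real SessionEntry."""
    session_key: str
    session_id: str
    updated_at: float = field(default_factory=time.time)
    suspended: bool = False
    # Fields named in the commit message; their types are assumed here.
    resume_pending: bool = False
    resume_reason: Optional[str] = None
    last_resume_marked_at: Optional[float] = None

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "SessionEntry":
        # Entries persisted before this change lack the resume_* keys,
        # so the dataclass defaults fill them in.
        known = {k: v for k, v in data.items() if k in cls.__dataclass_fields__}
        return cls(**known)


class SessionStore:
    """In-memory sketch; the real store persists entries to disk."""

    def __init__(self) -> None:
        self._entries: Dict[str, SessionEntry] = {}

    def _new_entry(self, session_key: str) -> SessionEntry:
        entry = SessionEntry(session_key=session_key, session_id=uuid.uuid4().hex)
        self._entries[session_key] = entry
        return entry

    def mark_resume_pending(self, session_key: str, reason: str) -> bool:
        entry = self._entries.get(session_key)
        if entry is None:
            return False
        entry.resume_pending = True
        entry.resume_reason = reason
        entry.last_resume_marked_at = time.time()
        return True

    def clear_resume_pending(self, session_key: str) -> None:
        entry = self._entries.get(session_key)
        if entry is not None:
            entry.resume_pending = False
            entry.resume_reason = None

    def suspend_recently_active(self, max_age_seconds: float = 120.0) -> int:
        """Suspend recently active sessions, skipping resume_pending ones."""
        now = time.time()
        count = 0
        for entry in self._entries.values():
            if entry.resume_pending or entry.suspended:
                continue
            if now - entry.updated_at <= max_age_seconds:
                entry.suspended = True
                count += 1
        return count

    def get_or_create_session(self, session_key: str) -> SessionEntry:
        entry = self._entries.get(session_key)
        if entry is None:
            return self._new_entry(session_key)
        if entry.suspended:
            # suspended wins over resume_pending: converge to a clean slate.
            return self._new_entry(session_key)
        # Normal and resume_pending entries both keep their session_id.
        return entry
```

In this sketch a session that was marked resume_pending during a drain-timeout shutdown survives the startup-wide suspend_recently_active() pass and keeps its session_id on the next message, while a session that is both suspended and resume_pending still gets a fresh session_id.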


docs: add restart auto-resume PR spec

e2be46b

fix(gateway): auto-resume interrupted restart sessions

5619fa0

BrennerSpear added 2 commits April 18, 2026 01:23

fix(gateway): handle auto-resume edge cases

75c1d43

fix(gateway): reuse stuck-loop recovery path

4861e04

teknium1 mentioned this pull request Apr 18, 2026

fix(gateway): auto-resume sessions after drain-timeout restart (#11852) #12301

Merged

teknium1 closed this in #12301 Apr 19, 2026

This was referenced Apr 27, 2026

fix(gateway): write clean-shutdown marker before drain to preserve session context #11099

Closed

fix(gateway): preserve session continuity across planned restarts #11806

Closed

feat(gateway): resume interrupted sessions after restart #5226

Closed

juanfradb mentioned this pull request May 2, 2026

[codex] Allow gateway to preserve suspended sessions #18851

Open

Qwinty mentioned this pull request May 18, 2026

fix(gateway): premark active sessions before drain #27831

Closed

BrennerSpear wants to merge 4 commits into NousResearch:main from BrennerSpear:docs/auto-resume-after-restart-pr-spec


		Reserve it for narrower cases, such as:

		- startup crash recovery when explicit `resume_pending` metadata is absent,


PR Spec: Automatic Session Resume After Gateway Restart

TL;DR

Problem statement

Current user experience

Root cause in current code

Product goal

Desired UX

Non-goals

Design principles

Recommendation

Introduce a resumable restart-interruption state

New state

Meaning of states

High-level behavior change

Current behavior

Proposed behavior

Escalation path

Detailed design

1) Persist resume_pending on interrupted restart

Where to mark it

Why here

Proposed helper

2) Do not auto-reset resume_pending sessions on next access

Current bad behavior

Proposed behavior

3) Reuse existing transcript reload and auto-continue logic

Existing asset

Extend the system note behavior

Why this is enough for v1

4) Keep stuck-loop protection, but reuse the existing restart-failure mechanism

Proposed rule

Recommended implementation

5) Narrow the role of suspend_recently_active()

Current role

Proposed role after this PR

Important outcome

6) Fix user-facing messaging

Current messaging is misleading

Shutdown banner

Reset notice

Proposed messaging

Shutdown banner

If the system escalates to forced clean slate

Message principle

State machine

File-level implementation plan

Primary files

gateway/session.py

gateway/run.py

tests/gateway/

Test plan

Unit tests

tests/gateway/test_restart_resume_pending.py (new)

tests/gateway/test_restart_recovery_flow.py (new)

Message-copy tests

Backward compatibility

Migration behavior

Important safety note

Risks

1. Resume loop risk

2. Partial transcript ambiguity

3. Messaging confusion during rollout

Open questions

Recommended implementation order

Success criteria

Strong opinion

