
spec: automatic session resume after gateway restart #11852

Closed
BrennerSpear wants to merge 4 commits into NousResearch:main from BrennerSpear:docs/auto-resume-after-restart-pr-spec

BrennerSpear commented Apr 17, 2026
Contributor

PR Spec: Automatic Session Resume After Gateway Restart

Date: 2026-04-17

Intent: Fix the terrible current UX where Hermes says an interrupted task will resume after restart, but a forced/interrupted restart often converts that thread into a fresh session and tells the user to /resume manually.

TL;DR

Hermes should treat restart interruption as a resumable session state, not as a session reset.

Today, when gateway shutdown cannot drain active work within agent.restart_drain_timeout, startup falls back to suspend_recently_active(). That marks recently-active sessions as suspended, and the next message in the same thread causes SessionStore.get_or_create_session() to create a new session ID with auto_reset_reason="suspended". The user sees:

  • shutdown banner: "Send any message after restart to resume where it left off."
  • then next-turn banner: "Session automatically reset (previous session was stopped or interrupted). Use /resume..."

That is exactly the wrong behavior for the common case: same thread, same user, same restart, and the user still wants the same task.

Recommendation: introduce a distinct persisted state like resume_pending / interrupted_by_restart and keep the existing session_id on the next message with the same session_key. Reuse the existing transcript reload and auto-continue logic in gateway/run.py instead of creating a new session. Escalation should reuse the existing .restart_failure_counts / stuck-loop detection path rather than adding a parallel counter on SessionEntry.


Problem statement

Current user experience

Current behavior is incoherent:

  1. Hermes sends an optimistic restart notice.
  2. Gateway restart times out draining active agents.
  3. Startup suspends recently-active sessions.
  4. The user's next message on the same session_key lands in a fresh session.
  5. Hermes tells the user to browse /resume manually.

This is a terrible experience because:

  • the product promises one behavior and delivers the opposite,
  • the user is forced to understand internal session mechanics,
  • the resume path is thread-local and obvious to the system but not automatic,
  • the current fallback destroys continuity even though transcript history still exists.

Root cause in current code

Relevant current behavior:

  • shutdown banner text in gateway/run.py
    • _notify_active_sessions_of_shutdown()
    • says: Send any message after restart to resume where it left off.
  • forced-interrupt path in gateway/run.py
    • if drain times out, gateway interrupts active agents
    • skips .clean_shutdown marker so next startup treats the prior run as unsafe
  • startup recovery in gateway/run.py
    • calls self.session_store.suspend_recently_active() when .clean_shutdown is absent
  • session reset behavior in gateway/session.py
    • get_or_create_session() checks entry.suspended
    • suspended sessions are turned into a new session ID with auto_reset_reason="suspended"
  • reset notice in gateway/run.py
    • emits: Session automatically reset (previous session was stopped or interrupted)...
  • transcript continuation logic already exists in gateway/run.py
    • if loaded history ends with a tool message, Hermes prepends a system note telling the model to finish the interrupted work

Important observation: Hermes already has part of the resume mechanism. The main thing preventing automatic resume is that forced restart currently turns the session into a fresh session instead of preserving the old one.


Product goal

When a user restarts Hermes and then sends the next message that resolves to the same session_key (same chat/thread/topic lane), Hermes should, by default:

  1. preserve the same conversation lane,
  2. preserve the same session_id,
  3. reload the same transcript,
  4. inform the model that the previous turn was interrupted by restart,
  5. continue/resume automatically,
  6. avoid making the user manually browse /resume unless recovery has actually failed.

Desired UX

For the normal case:

  • user starts a long task in thread X
  • gateway restarts
  • user returns to the same lane and says anything
  • Hermes continues from the interrupted session on that same session_key

For the pathological case:

  • same session repeatedly hangs across multiple restarts
  • Hermes eventually abandons auto-resume for that session and gives the user a clean slate

Non-goals

This PR should not try to:

  • implement fully autonomous resume with no user follow-up message,
  • merge different threads/chats/topics into one session,
  • auto-resume across a different thread than the one that was interrupted,
  • invent a generic distributed job recovery layer,
  • remove existing stuck-loop safety mechanisms entirely,
  • change normal idle/daily session_reset policy semantics.

This is specifically about same-lane restart continuity after gateway interruption.


Design principles

  1. Restart interruption is not the same as intentional reset.
  2. Same thread should keep same session unless proven unsafe.
  3. Safety escalation should be progressive, not immediate.
  4. User-visible messages must describe the actual recovery semantics.
  5. Reuse existing transcript + auto-continue machinery instead of inventing new prompt plumbing.

Recommendation

Introduce a resumable restart-interruption state

Add a persisted session state that is distinct from suspended.

New state

Recommended SessionEntry fields in gateway/session.py:

resume_pending: bool = False
resume_reason: Optional[str] = None  # e.g. "restart_timeout", "crash_recovery"
last_resume_marked_at: Optional[datetime] = None

Meaning of states

  • suspended=True
    • do not resume automatically
    • next access should create a fresh session
    • used for known-poisoned sessions / explicit stuck-loop protection
  • resume_pending=True
    • user should stay on the same session
    • next access should preserve the existing session ID
    • used when a restart interrupted in-flight work but we still expect same-thread continuation to succeed

This is the key architectural distinction missing today.
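
The field list and state meanings above can be sketched as a small dataclass. This is a minimal illustration, not the real SessionEntry: the class name SessionEntrySketch, the session_key/session_id fields, and the ISO-8601 timestamp encoding are assumptions made for the example.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class SessionEntrySketch:
    """Hypothetical stand-in for SessionEntry; only the fields discussed here."""

    session_key: str
    session_id: str
    suspended: bool = False            # next access creates a fresh session
    resume_pending: bool = False       # next access keeps the same session_id
    resume_reason: Optional[str] = None
    last_resume_marked_at: Optional[datetime] = None

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        if self.last_resume_marked_at is not None:
            # Assumed encoding: ISO-8601 string so the entry is JSON-safe.
            data["last_resume_marked_at"] = self.last_resume_marked_at.isoformat()
        return data

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "SessionEntrySketch":
        marked = data.get("last_resume_marked_at")
        return cls(
            session_key=data["session_key"],
            session_id=data["session_id"],
            suspended=data.get("suspended", False),
            # Entries written before this change lack the keys below,
            # so they fall back to the non-resumable defaults.
            resume_pending=data.get("resume_pending", False),
            resume_reason=data.get("resume_reason"),
            last_resume_marked_at=datetime.fromisoformat(marked) if marked else None,
        )
```

Reading the new keys with defaults is what lets older stored entries load unchanged.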


High-level behavior change

Current behavior

restart timed out
  -> skip .clean_shutdown
  -> startup: suspend_recently_active()
  -> session.suspended = True
  -> next message => new session_id
  -> user gets reset notice + /resume guidance

Proposed behavior

restart timed out
  -> mark active session(s) as resume_pending=True
  -> next startup preserves mapping
  -> next message on the same `session_key` returns existing session entry
  -> transcript reloads from same session_id
  -> model gets interruption note
  -> Hermes continues automatically

Escalation path

restart timed out repeatedly for same session
  -> increment existing .restart_failure_counts counter
  -> _suspend_stuck_loop_sessions() suspends once threshold is exceeded
  -> only then force fresh-session fallback

Detailed design

1) Persist resume_pending on interrupted restart

Where to mark it

During shutdown in gateway/run.py, after drain timeout is detected and the gateway force-interrupts active agents, mark the active session keys as resume_pending=True.

This should happen instead of relying on the startup-wide "recently active means suspend" fallback for these sessions.

Why here

At shutdown time, the gateway knows:

  • which sessions were actually running,
  • that the interruption came from restart/shutdown,
  • that this is not an idle/daily reset,
  • that the user did not ask for /new.

That is the correct moment to record resumable interruption state.

Proposed helper

Add a method to SessionStore in gateway/session.py:

def mark_resume_pending(
    self,
    session_key: str,
    *,
    reason: str = "restart_timeout",
) -> bool:
    ...

Responsibilities:

  • set resume_pending=True
  • set resume_reason
  • set last_resume_marked_at
  • persist metadata in sessions.json
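
A rough sketch of those responsibilities, assuming entries are plain dicts keyed by session_key and that sessions.json is a single JSON object. SessionStoreSketch, its _save() helper, and the meaning of the bool return (False for an unknown key) are hypothetical; the real SessionStore may differ.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


class SessionStoreSketch:
    """Hypothetical, simplified store used only to illustrate the helper."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.entries: dict[str, dict] = {}
        if path.exists():
            self.entries = json.loads(path.read_text())

    def _save(self) -> None:
        self.path.write_text(json.dumps(self.entries, indent=2))

    def mark_resume_pending(
        self,
        session_key: str,
        *,
        reason: str = "restart_timeout",
    ) -> bool:
        entry = self.entries.get(session_key)
        if entry is None:
            return False  # assumed meaning: nothing to mark
        entry["resume_pending"] = True
        entry["resume_reason"] = reason
        entry["last_resume_marked_at"] = datetime.now(timezone.utc).isoformat()
        self._save()  # persist so the marker survives the restart
        return True


def demo() -> dict:
    """Mark a session, then reload the file as a fresh process would."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sessions.json"
        store = SessionStoreSketch(path)
        store.entries["discord:thread-x"] = {"session_id": "s1"}
        store.mark_resume_pending("discord:thread-x")
        return SessionStoreSketch(path).entries["discord:thread-x"]
```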

2) Do not auto-reset resume_pending sessions on next access

Current bad behavior

SessionStore.get_or_create_session() currently treats suspended as "auto-reset on next access".

Proposed behavior

Extend get_or_create_session() logic:

  • if entry.suspended → current reset behavior stays
  • if entry.resume_pending → return the existing entry while the recovery window is still fresh, and only clear the marker after a successful turn completes

Pseudo-shape:

if entry.suspended:
    reset_reason = "suspended"
elif entry.resume_pending:
    entry.updated_at = now
    self._save()
    return entry
else:
    reset_reason = self._should_reset(entry, source)

This is the core functional fix.
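
The precedence can also be written as a standalone decision function, which makes the "recovery window" condition explicit. This is a sketch under assumptions: the Entry class, the next_access_action name, and the 24-hour RECOVERY_WINDOW value are invented for illustration, and the spec does not fix a window length.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed value; the spec does not say how long a marker stays fresh.
RECOVERY_WINDOW = timedelta(hours=24)


@dataclass
class Entry:
    suspended: bool = False
    resume_pending: bool = False
    last_resume_marked_at: Optional[datetime] = None


def next_access_action(entry: Entry, now: datetime) -> str:
    """Return 'reset', 'resume', or 'policy' for the next message on a session_key."""
    if entry.suspended:
        return "reset"  # suspended always wins over resume_pending
    if entry.resume_pending:
        marked = entry.last_resume_marked_at
        if marked is not None and now - marked <= RECOVERY_WINDOW:
            return "resume"  # keep the existing session_id
    # Stale or unstamped markers fall through to the normal reset policy.
    return "policy"
```

Here 'policy' stands for whatever _should_reset() would decide, so a session that is never revisited can still idle-reset.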


3) Reuse existing transcript reload and auto-continue logic

This PR should explicitly lean on behavior that already exists.

Existing asset

gateway/run.py already prepends an interruption note when the loaded history ends in a tool result:

  • if transcript ends with role tool, Hermes tells the model to finish processing interrupted tool results before addressing the new user message.

Extend the system note behavior

If session_entry.resume_pending is set, prepend a stronger note such as:

[System note: Your previous turn in this same session was interrupted by a gateway restart. Continue from the existing transcript. If there are unfinished tool results, process them first, summarize what was accomplished, then answer the user's new message.]

This should work whether the transcript ended with:

  • a tool message,
  • an interrupted assistant turn,
  • or a partially completed tool-heavy exchange.

Why this is enough for v1

We do not need a brand-new recovery engine for the first version.

Preserving the same session ID, reloading the transcript, and adding a better interruption note gets the common case back to a sane product experience.
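
One way the note selection could look, as a sketch only. The restart note text is the wording proposed above; TOOL_TAIL_NOTE is placeholder wording because the existing note in gateway/run.py is not quoted in this spec, and prepending to the user message (rather than injecting a separate message) is an assumption.

```python
from typing import Optional

RESTART_NOTE = (
    "[System note: Your previous turn in this same session was interrupted by a "
    "gateway restart. Continue from the existing transcript. If there are "
    "unfinished tool results, process them first, summarize what was "
    "accomplished, then answer the user's new message.]"
)
# Placeholder wording for the existing tool-tail note (assumption).
TOOL_TAIL_NOTE = "[System note: Finish processing the interrupted tool results first.]"


def recovery_note(history: list[dict], resume_pending: bool) -> Optional[str]:
    """Pick at most one note; the restart note covers the tool-tail case too."""
    if resume_pending:
        return RESTART_NOTE
    if history and history[-1].get("role") == "tool":
        return TOOL_TAIL_NOTE
    return None


def with_recovery_note(user_message: str, history: list[dict], resume_pending: bool) -> str:
    note = recovery_note(history, resume_pending)
    return f"{note}\n\n{user_message}" if note else user_message
```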


4) Keep stuck-loop protection, but reuse the existing restart-failure mechanism

We should not regress the original safety intent behind the stuck-loop work.

Proposed rule

  • first interrupted restart for a session → auto-resume
  • second interrupted restart for the same session_key → still auto-resume
  • third interrupted restart for the same session_key → let the existing stuck-loop path suspend it

This keeps safety without making the default path destructive.

Recommended implementation

Reuse the existing gateway-level .restart_failure_counts file and _suspend_stuck_loop_sessions() flow:

  • shutdown-time drain timeout still calls mark_resume_pending(...) for the interrupted session_keys
  • successful turn completion clears the restart-failure count for that session_key
  • repeated interrupted restarts are counted in .restart_failure_counts
  • _suspend_stuck_loop_sessions() flips the session to suspended=True once the existing threshold is exceeded

Do not add or maintain a parallel resume_attempts counter on SessionEntry.
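
To make the counting rule concrete, here is a small sketch of a per-session_key counter backed by a JSON file. It is not the shipped implementation: the RestartFailureCounts class, its method names, and the file layout (a flat mapping of session_key to count) are assumptions. The threshold of 3 follows the first/second/third rule above.

```python
import json
import tempfile
from pathlib import Path

THRESHOLD = 3  # third interrupted restart escalates, per the proposed rule


class RestartFailureCounts:
    """Hypothetical wrapper around a JSON file mapping session_key -> count."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> dict[str, int]:
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def _store(self, counts: dict[str, int]) -> None:
        self.path.write_text(json.dumps(counts))

    def record_interrupted_restart(self, session_key: str) -> int:
        counts = self._load()
        counts[session_key] = counts.get(session_key, 0) + 1
        self._store(counts)
        return counts[session_key]

    def clear(self, session_key: str) -> None:
        """Called after a successful turn completes for this session_key."""
        counts = self._load()
        if counts.pop(session_key, None) is not None:
            self._store(counts)

    def should_suspend(self, session_key: str) -> bool:
        return self._load().get(session_key, 0) >= THRESHOLD


def demo() -> list[bool]:
    with tempfile.TemporaryDirectory() as tmp:
        counts = RestartFailureCounts(Path(tmp) / ".restart_failure_counts")
        decisions = []
        for _ in range(3):  # three interrupted restarts in a row
            counts.record_interrupted_restart("discord:thread-x")
            decisions.append(counts.should_suspend("discord:thread-x"))
        counts.clear("discord:thread-x")  # a successful turn resets the count
        decisions.append(counts.should_suspend("discord:thread-x"))
        return decisions
```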


5) Narrow the role of suspend_recently_active()

suspend_recently_active() is too blunt as the generic fallback for restart interruption.

Current role

It treats "recently active at startup after unclean shutdown" as a reason to force clean-slate behavior.

Proposed role after this PR

Reserve it for narrower cases, such as:

  • startup crash recovery when explicit resume_pending metadata is absent,
  • legacy upgrade path / backward compatibility,
  • emergency fallback for clearly unsafe sessions.

Normal interrupted-restart recovery with explicit resume_pending metadata should not be suspended by this helper.

Important outcome

This means restart interruption should no longer immediately flow through the same code path as 'known stuck session'.

That separation is the real product fix.


6) Fix user-facing messaging

Current messaging is misleading

Shutdown banner

Current wording:

Send any message after restart to resume where it left off.

This is too absolute.

Reset notice

Current wording points users to session_reset config even when the real cause is startup suspension after interrupted restart.

Proposed messaging

Shutdown banner

For resumable restart:

  • non-threaded chat:
    • Gateway restarting — I'll try to resume this session after restart. Send a message in this chat to continue.
  • threaded/topic chat:
    • Gateway restarting — I'll try to resume this session after restart. Send a message in this thread/topic to continue.

If the system escalates to forced clean slate

Only after repeated failure should the user see something like:

  • This session was interrupted repeatedly during restart recovery, so Hermes started a fresh session to avoid getting stuck. Use /resume if you want the old transcript.

Message principle

Only mention /resume when Hermes has actually decided not to auto-resume.


State machine

ACTIVE
  -> clean restart/shutdown drains successfully
     -> ACTIVE (same session preserved)

ACTIVE
  -> restart/crash interrupts in-flight work
     -> RESUME_PENDING

RESUME_PENDING
  -> next message in same thread/topic
     -> ACTIVE (same session_id, transcript reloaded)

RESUME_PENDING
  -> repeated interrupted restarts exceed threshold
     -> SUSPENDED

SUSPENDED
  -> next message
     -> NEW_SESSION (fresh session_id, old transcript still available via /resume)
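
The diagram above maps directly onto a transition table. The sketch below encodes only the transitions drawn in the diagram; the State and Event enums and the step()/is_defined() helpers are names invented for this example. Anything not drawn, such as a second interruption while already RESUME_PENDING, is reported as undefined rather than guessed.

```python
from enum import Enum, auto


class State(Enum):
    ACTIVE = auto()
    RESUME_PENDING = auto()
    SUSPENDED = auto()
    NEW_SESSION = auto()


class Event(Enum):
    CLEAN_RESTART = auto()       # drain completed successfully
    INTERRUPTED = auto()         # restart/crash interrupted in-flight work
    NEXT_MESSAGE = auto()        # next message on the same session_key
    THRESHOLD_EXCEEDED = auto()  # repeated interrupted restarts


TRANSITIONS = {
    (State.ACTIVE, Event.CLEAN_RESTART): State.ACTIVE,
    (State.ACTIVE, Event.INTERRUPTED): State.RESUME_PENDING,
    (State.RESUME_PENDING, Event.NEXT_MESSAGE): State.ACTIVE,
    (State.RESUME_PENDING, Event.THRESHOLD_EXCEEDED): State.SUSPENDED,
    (State.SUSPENDED, Event.NEXT_MESSAGE): State.NEW_SESSION,
}


def is_defined(state: State, event: Event) -> bool:
    return (state, event) in TRANSITIONS


def step(state: State, event: Event) -> State:
    if not is_defined(state, event):
        raise ValueError(f"no transition from {state.name} on {event.name}")
    return TRANSITIONS[(state, event)]
```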

File-level implementation plan

Primary files

gateway/session.py

Add persisted session fields and APIs:

  • new SessionEntry fields:
    • resume_pending
    • resume_reason
    • last_resume_marked_at
  • serialization/deserialization support in to_dict() / from_dict()
  • helper methods:
    • mark_resume_pending(...)
    • clear_resume_pending(...)
  • update get_or_create_session() so resume_pending returns existing session instead of resetting

gateway/run.py

Update gateway behavior:

  • after drain timeout, mark active sessions resume_pending=True
  • on resumed turn, inject restart-interruption system note when session_entry.resume_pending
  • clear resume_pending only after a successful resumed turn completes
  • update shutdown banner wording to promise attempted recovery, not guaranteed recovery
  • stop routing normal interrupted restart recovery through immediate clean-slate semantics

tests/gateway/

Add or update tests for:

  • same-session resume after interrupted restart
  • transcript preserved across restart timeout
  • tool-result auto-continue still works on resumed session
  • repeated recovery failure escalates through .restart_failure_counts / _suspend_stuck_loop_sessions()
  • shutdown banner wording is no longer misleading
  • /resume guidance only appears when clean-slate fallback actually occurs

Test plan

Unit tests

tests/gateway/test_restart_resume_pending.py (new)

Suggested cases:

  1. mark resume pending persists state

    • create session
    • call mark_resume_pending()
    • reload store
    • assert flags persisted
  2. resume_pending does not create new session id

    • create session
    • mark resume_pending=True
    • call get_or_create_session()
    • assert returned session_id is unchanged
  3. suspended still creates new session id

    • existing current behavior regression guard
  4. clear resume pending after success

    • mark resumable
    • simulate successful turn completion (not turn start)
    • assert resume_pending=False

tests/gateway/test_restart_recovery_flow.py (new)

Suggested cases:

  1. interrupted restart on same session key resumes existing session

    • transcript exists under original session id
    • next message on the same session_key loads the same transcript
    • no auto-reset notice
  2. tool-tail transcript triggers auto-continue note on resumed session

    • transcript ends with tool
    • resumed run prepends correct system note
  3. repeated restart failures escalate to suspended

    • simulate threshold crossings
    • assert fresh-session fallback only after threshold
  4. clean restart remains unchanged

    • .clean_shutdown path still preserves session as before

Message-copy tests

Add assertions for the updated shutdown banner and fallback text.


Backward compatibility

This change should be backward-compatible with existing stored sessions.

Migration behavior

  • old sessions.json entries simply deserialize with default values for new fields
  • no database migration should be required if session metadata remains file-backed
  • existing suspended=True entries should keep current semantics

Important safety note

Do not silently reinterpret existing suspended=True as resumable. That would change meaning for users who explicitly relied on the current clean-slate escape hatch.


Risks

1. Resume loop risk

If resume is attempted too aggressively, a truly poisoned session could keep re-entering the same bad state.

Mitigation: keep thresholded escalation in the existing .restart_failure_counts / _suspend_stuck_loop_sessions() flow.

2. Partial transcript ambiguity

If a turn was interrupted mid-assistant generation, the last messages may not be perfectly shaped.

Mitigation: keep the recovery note explicit and rely on existing transcript loading behavior. Tests should cover common partial-tail shapes.

3. Messaging confusion during rollout

If message copy changes before semantics change, UX could still be misleading.

Mitigation: land copy changes in the same PR as behavior changes.


Open questions

  1. resume_pending should clear only after a successful completed turn, not at turn start.
  2. Recovery should only be attempted on the same session_key. Do not cross lanes.
  3. suspend_recently_active() should remain only as a narrower crash-recovery fallback and should not suspend explicit resume_pending sessions.
  4. User-facing recovery should stay as an internal system note in v1 unless debugging is enabled.

Recommended implementation order

  1. Add SessionEntry resume-pending fields + serialization
  2. Add SessionStore.mark_resume_pending() / clear_resume_pending()
  3. Change get_or_create_session() to preserve same session for resume_pending
  4. Mark active sessions resume-pending on drain-timeout shutdown
  5. Inject restart-resume system note in gateway/run.py
  6. Clear pending state after successful completion
  7. Reuse existing .restart_failure_counts / _suspend_stuck_loop_sessions() escalation
  8. Update shutdown/fallback copy
  9. Add regression tests

Success criteria

This PR is successful if all of the following are true:

  1. A long-running task interrupted by gateway restart in a Discord/Telegram lane resumes automatically on the next message with the same session_key.
  2. The resumed thread keeps the same session_id and transcript history.
  3. Hermes no longer tells the user to /resume for the normal restart-recovery case.
  4. Existing clean restart behavior is preserved.
  5. Truly stuck sessions still have a safety escape hatch after repeated failures.
  6. User-facing restart copy accurately describes attempted recovery rather than promising impossible behavior.

Strong opinion

Hermes should default to continuity after restart and only fall back to clean-slate reset when recovery is actually failing.

Right now the system is optimized for protecting itself from stuck loops at the cost of making ordinary restart recovery feel broken. That tradeoff is backwards for user experience.

The correct product stance is:

  • same session_key + immediate post-restart message = continue automatically
  • repeated failed recovery = fresh session as safety fallback

That gets the common case right without giving up the escape hatch.

@teknium1

Contributor

Green light — go ahead and implement this. Diagnosis matches current main exactly (forced-interrupt → suspend_recently_active() → next msg gets new session_id + misleading reset banner), and the design reuses existing transcript-continuation machinery cleanly.

One design correction before you start: the spec's Option B adds resume_attempts to SessionEntry, but we already ship a stuck-loop counter from PR #7536 (.restart_failure_counts JSON file, 3-restart threshold, cleared on successful turn completion). Please use that existing mechanism for escalation rather than adding a parallel field on SessionEntry. The resume_pending state itself on SessionEntry is correct and new; just don't duplicate the counter.

Concretely that means:

  • Add resume_pending / resume_reason / last_resume_marked_at fields → yes
  • Skip resume_attempts field → the .restart_failure_counts file already tracks this per session_key
  • Escalation threshold already exists in _suspend_stuck_loop_sessions() — your change should make suspend_recently_active() narrower so it doesn't fire on the normal interrupted-restart path, and let the existing counter+threshold handle repeated failures

Also confirming two of your open-question recommendations are the right calls:

  • Clear resume_pending on successful turn completion, not turn start
  • Same session_key only, no cross-thread recovery

Ping me when you have a draft PR up and we'll get it reviewed.

session_key: str,
*,
reason: str = "restart_timeout",
increment_attempts: bool = True,

Contributor Author

resume_attempts counts restart events, not failed recoveries — causing premature escalation

increment_attempts: bool = True is called inside mark_resume_pending(), which fires on every interrupted restart. If the gateway cycles 3 times in a deploy loop without the user ever sending a message, the session hits the escalation threshold and becomes suspended before a single recovery was ever attempted.

The counter should track failed recovery turns (a resumed turn that was interrupted again), not restart-mark calls. Increment it in clear_resume_pending(failed=True) or in the escalation path after a resumed turn crashes — not in mark_resume_pending().

Section 4 says "first/second/third interrupted restart" which implies counting restarts, but that creates the deploy-loop problem. Worth clarifying which semantic is intended before implementation.

Extend `get_or_create_session()` logic:

- if `entry.suspended` → current reset behavior stays
- if `entry.resume_pending` → **return the existing entry** and clear or downgrade the resume marker once the resume turn begins successfully

Contributor Author

Contradiction between section 2 and Open Question 1 on when to clear resume_pending

Section 2 (this line) says "clear or downgrade the resume marker once the resume turn begins successfully." Open Question 1 (line 538) recommends "clear after successful completion."

These conflict in a meaningful way: if cleared on start and the resumed turn crashes mid-run, the session exits resume_pending without recording a failure. On the next restart, mark_resume_pending() would set it again but resume_attempts has no record of the prior failed resume — breaking the escalation counter.

Pick one semantic and remove the ambiguity. "Clear on completion" is safer; just make sure the in-progress guard (mentioned in Open Question 1) prevents double-resume if the session is accessed concurrently during the running turn.

-> next message in same thread/topic
-> ACTIVE (same session_id, transcript reloaded)

RESUME_PENDING

Contributor Author

State machine is missing the RESUME_PENDING → gateway restarts again transition

The state machine shows RESUME_PENDING going to ACTIVE (user messages) or to SUSPENDED (threshold exceeded), but has no transition for "gateway restarts again while still in RESUME_PENDING." This is the exact scenario in a deploy loop — the user hasn't messaged yet between restarts.

If each restart calls mark_resume_pending() with increment_attempts=True, the session silently escalates to SUSPENDED before the user gets a chance to reply. The state machine should make this explicit — either as a valid self-loop (RESUME_PENDING → RESUME_PENDING + attempt++) or as a special transition with a clear policy on whether successive restarts-without-user-message count toward the threshold.

#### If the system escalates to forced clean slate
Only after repeated failure should the user see something like:

- `This session was interrupted repeatedly during restart recovery, so Hermes started a fresh session to avoid getting stuck. Use /resume if you want the old transcript.`

Contributor Author

/resume discoverability after escalation to a new session_id is unspecified

This copy tells the user to /resume after escalation, but when suspended=True causes get_or_create_session() to mint a new session_id, the new session entry has no reference back to the old one. The spec doesn't confirm that /resume can discover the previous session_id in this case.

If /resume today works by listing recent sessions for the user to pick from, it may already handle this. But if it relies on the current session knowing its predecessor, it won't. The spec should either:

  • explicitly verify the current /resume flow handles cross-session-id lookup, or
  • add a task to store a parent_session_id field on the new session entry when escalation creates it.

Mentioning /resume in user-facing copy that doesn't actually work is worse than omitting it.


Reserve it for narrower cases, such as:

- startup crash recovery when explicit `resume_pending` metadata is absent,

Contributor Author

Hard kills (SIGKILL/OOM) still hit the original bad UX — worth calling out explicitly

Section 5 lists "startup crash recovery when explicit resume_pending metadata is absent" as a remaining use case for suspend_recently_active(). But this is the same broken path the spec is trying to fix: the user was sent the optimistic banner, the process died without writing any markers, and on next startup the session gets reset.

This is a known limitation, but the spec should say so explicitly rather than burying it in the suspend_recently_active() residual use-cases list. One mitigation worth naming: write resume_pending markers at the start of drain (not only after drain timeout), so that even a SIGKILL during drain leaves markers on disk. Whether that's in scope for v1 or a follow-up should be a stated decision.

Comment thread gateway/run.py Outdated
timeout = self._restart_drain_timeout
active_agents, timed_out = await self._drain_active_agents(timeout)
if timed_out:
    timed_out_session_keys = set(self._running_agents.keys())

Contributor Author

Use active_agents.keys() instead of self._running_agents.keys() here.

active_agents is the authoritative snapshot returned by _drain_active_agents() — it contains exactly the sessions that were still running at timeout. Re-reading self._running_agents is normally equivalent, but it introduces a subtle TOCTOU hazard: if any agent completes (and removes itself from _running_agents) in the window between _drain_active_agents() returning and this line executing, those sessions would be missed. The return value is the right source of truth.
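
The hazard can be reproduced in miniature with asyncio. This is an illustrative sketch, not the gateway code: drain_active_agents() here is a hypothetical stand-in that returns a snapshot of still-running tasks, and the agents simply sleep and then deregister themselves. The delays are chosen so that agent "a" finishes after the drain returns but before the keys are read.

```python
import asyncio


async def drain_active_agents(running: dict[str, asyncio.Task], timeout: float):
    """Hypothetical stand-in: wait up to `timeout`, then snapshot what is still running."""
    if running:
        await asyncio.wait(running.values(), timeout=timeout)
    active = {key: task for key, task in running.items() if not task.done()}
    return active, bool(active)


async def demo() -> tuple[set[str], set[str]]:
    running: dict[str, asyncio.Task] = {}

    async def agent(key: str, delay: float) -> None:
        await asyncio.sleep(delay)
        running.pop(key, None)  # agents remove themselves when they finish

    running["a"] = asyncio.create_task(agent("a", 0.2))
    running["b"] = asyncio.create_task(agent("b", 10))

    active_agents, timed_out = await drain_active_agents(running, timeout=0.01)
    await asyncio.sleep(0.4)  # the window: "a" finishes and deregisters here

    snapshot_keys = set(active_agents.keys())  # stable: taken at timeout
    reread_keys = set(running.keys())          # racy: "a" is already gone
    running["b"].cancel()
    return snapshot_keys, reread_keys
```

Running asyncio.run(demo()) shows the snapshot still naming both sessions while the live dict has already lost one.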

Comment thread gateway/run.py
try:
    self.session_store.mark_resume_pending(
        session_key,
        reason="restart_timeout" if self._restart_requested else "shutdown_timeout",

Contributor Author

Semantic mismatch: when _restart_requested=False (clean shutdown with drain timeout), sessions are marked resume_pending=True with reason="shutdown_timeout" — but the shutdown banner on line ~1608 tells the user only "Your current task will be interrupted" with no recovery promise, and _prepend_restart_recovery_note() always says "interrupted by a gateway restart" regardless of the reason.

Result: a user who was told their task is simply interrupted will get a recovery note on next startup that wrongly blames a restart. Either skip mark_resume_pending for non-restart shutdowns, or thread resume_reason through to the system note copy.

Comment thread gateway/session.py Outdated
elif entry.resume_pending:
    entry.updated_at = now
    self._save()
    return entry

Contributor Author

This early-return bypasses _should_reset() entirely, including idle-timeout and daily-reset policies. last_resume_marked_at is stored but never consulted here.

If the user never returns to the interrupted thread, the session stays permanently stuck in resume_pending=True and will never idle-reset. Consider adding a recovery-window guard — e.g., if _now() - entry.last_resume_marked_at > recovery_window_seconds, fall through to the normal _should_reset() path instead of returning early.

Comment thread gateway/session.py Outdated
entry.resume_reason = reason
entry.last_resume_marked_at = _now()
if increment_attempts:
    entry.resume_attempts += 1

Contributor Author

resume_attempts is incremented here but never checked against a threshold anywhere in the codebase. The spec promises "third interrupted restart → convert to suspended=True" as the escalation path for poisoned sessions, but without a threshold check this counter is purely decorative.

The only escalation that exists is the existing stuck-loop mechanism in _suspend_stuck_loop_sessions() (which clears resume_pending when it fires), but that fires on restart-count watermarks, not attempt-count. A session that hangs silently (never triggering stuck-loop detection) would accumulate resume_attempts forever without ever escalating.

teknium1 added a commit that referenced this pull request Apr 18, 2026
The shutdown banner promised "send any message after restart to resume
where you left off" but the code did the opposite: a drain-timeout
restart skipped the .clean_shutdown marker, which made the next startup
call suspend_recently_active(), which marked the session suspended,
which made get_or_create_session() spawn a fresh session_id with a
'Session automatically reset. Use /resume...' notice — contradicting
the banner.

Introduce a resume_pending state on SessionEntry that is distinct from
suspended. Drain-timeout shutdown flags active sessions resume_pending
instead of letting startup-wide suspension destroy them. The next
message on the same session_key preserves the session_id, reloads the
transcript, and the agent receives a reason-aware restart-resume
system note that subsumes the existing tool-tail auto-continue note
(PR #9934).

Terminal escalation still flows through the existing
.restart_failure_counts stuck-loop counter (PR #7536, threshold 3) —
no parallel counter on SessionEntry. suspended still wins over
resume_pending in get_or_create_session() so genuinely stuck sessions
converge to a clean slate.

Spec: PR #11852 (BrennerSpear). Implementation follows the spec with
the approved correction (reuse .restart_failure_counts rather than
adding a resume_attempts field).

Changes:
- gateway/session.py: SessionEntry.resume_pending/resume_reason/
  last_resume_marked_at + to_dict/from_dict; SessionStore
  .mark_resume_pending()/clear_resume_pending(); get_or_create_session()
  returns existing entry when resume_pending (suspended still wins);
  suspend_recently_active() skips resume_pending entries.
- gateway/run.py: _stop_impl() drain-timeout branch marks active
  sessions resume_pending before _interrupt_running_agents();
  _run_agent() injects reason-aware restart-resume system note that
  subsumes the tool-tail case; successful-turn cleanup also clears
  resume_pending next to _clear_restart_failure_count();
  _notify_active_sessions_of_shutdown() softens the restart banner to
  'I'll try to resume where you left off' (honest about stuck-loop
  escalation).
- tests/gateway/test_restart_resume_pending.py: 29 new tests covering
  SessionEntry roundtrip, mark/clear helpers, get_or_create_session
  precedence (suspended > resume_pending), suspend_recently_active
  skip, drain-timeout mark reason (restart vs shutdown), system-note
  injection decision tree (including tool-tail subsumption), banner
  wording, and stuck-loop escalation override.
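
As an illustration of the to_dict/from_dict roundtrip listed above, here is a minimal sketch. It assumes a standalone `ResumeState` class (hypothetical; the real fields live on SessionEntry) and assumes last_resume_marked_at is persisted as an ISO 8601 string, which the commit message does not specify.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ResumeState:
    """Hypothetical holder for the three new fields named in the commit message."""
    resume_pending: bool = False
    resume_reason: Optional[str] = None
    last_resume_marked_at: Optional[datetime] = None

    def to_dict(self) -> dict:
        marked = self.last_resume_marked_at
        return {
            "resume_pending": self.resume_pending,
            "resume_reason": self.resume_reason,
            # Assumption: timestamps are stored as ISO 8601 strings.
            "last_resume_marked_at": marked.isoformat() if marked else None,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "ResumeState":
        # Use defaults for missing keys so entries persisted before the
        # fields existed still load as "not pending".
        raw = data.get("last_resume_marked_at")
        return cls(
            resume_pending=bool(data.get("resume_pending", False)),
            resume_reason=data.get("resume_reason"),
            last_resume_marked_at=datetime.fromisoformat(raw) if raw else None,
        )
```

Defaulting missing keys in from_dict matters for upgrades: a session file written by an older gateway has none of these keys and should deserialize without error.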
teknium1 added a commit that referenced this pull request Apr 19, 2026
… (#12301)

ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…esearch#11852) (NousResearch#12301)

aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
…esearch#11852) (NousResearch#12301)

02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…esearch#11852) (NousResearch#12301)

gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…esearch#11852) (NousResearch#12301)

Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…esearch#11852) (NousResearch#12301)


2 participants