
fix: break stuck session resume loops after repeated restarts (#7536) #9941

Merged
teknium1 merged 1 commit into main from hermes/hermes-5f9c44ae on Apr 15, 2026


@teknium1

Contributor

Summary

Fixes #7536 — when a session gets stuck and the user keeps restarting the gateway, the same session history reloads and traps the agent in the same stuck state. The user's only escape was manually deleting the session DB.

The loop

  1. Agent enters stuck state (hung terminal, runaway tool loop)
  2. User restarts gateway
  3. Session history loads with the stuck-causing context
  4. User sends any message → agent gets stuck again
  5. Back to step 2

The fix

Track restart-failure counts per session in .restart_failure_counts (a simple JSON file). On each shutdown with active agents, increment the counter. On startup, if any session hits 3 consecutive restarts while active, auto-suspend it.

The counter resets when a session completes a turn successfully (response delivered), so planned restarts (/restart, hermes update) that happen to interrupt a session won't accumulate false counts — as long as the session works on the next attempt, the counter resets.
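The counter rules above can be sketched as three pure functions over a dict. This is a minimal illustration, not the code from this PR: the function names, the session keys "a" and "b", and the in-memory dict are all assumptions made for the example; only the threshold of 3 and the increment/drop/reset behavior come from the description.

```python
from typing import Dict, Iterable, List

THRESHOLD = 3  # value taken from the PR description


def bump_on_shutdown(counts: Dict[str, int], active: Iterable[str]) -> Dict[str, int]:
    """Return new counts: active sessions get +1, inactive sessions are dropped."""
    return {key: counts.get(key, 0) + 1 for key in active}


def clear_on_success(counts: Dict[str, int], key: str) -> Dict[str, int]:
    """Return new counts without `key`, whose turn completed successfully."""
    return {k: v for k, v in counts.items() if k != key}


def stuck_sessions(counts: Dict[str, int], threshold: int = THRESHOLD) -> List[str]:
    """Sessions that reached the threshold and would be suspended on startup."""
    return sorted(k for k, v in counts.items() if v >= threshold)


# Hypothetical scenario: "a" is stuck across three restarts,
# while "b" recovers after a planned restart interrupts it.
counts: Dict[str, int] = {}
counts = bump_on_shutdown(counts, {"a", "b"})  # restart 1: both active
counts = clear_on_success(counts, "b")         # "b" delivers a response
counts = bump_on_shutdown(counts, {"a"})       # restart 2: only "a" active
counts = bump_on_shutdown(counts, {"a", "b"})  # restart 3: "b" interrupted again

print(sorted(counts.items()))  # → [('a', 3), ('b', 1)]
print(stuck_sessions(counts))  # → ['a']
```

Session "b" was interrupted twice but never accumulates, because its successful turn in between removed its entry. Only "a", which never completed a turn, reaches the threshold.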

How it works

| Event | Action |
|---|---|
| Shutdown with active agents | _increment_restart_failure_counts(): bumps counter for active sessions, drops inactive ones |
| Startup | _suspend_stuck_loop_sessions(): suspends sessions at threshold (3), clears the file |
| Successful response delivered | _clear_restart_failure_count(): removes session from counter file |
| Session not active during shutdown | Counter entry removed (loop was broken) |

Design decisions

  • No SessionEntry schema changes — pure file-based tracking
  • No database migration — the JSON file is ephemeral and self-cleaning
  • Threshold of 3 — tolerates 1-2 restarts during normal operation (updates, config changes) without false-suspending
  • Counter drops inactive sessions — if the session wasn't active during a restart, it wasn't stuck

Test plan

  • 9 new stuck-loop tests
  • All 28 gateway lifecycle tests pass

When a session gets stuck (hung terminal, runaway tool loop) and the
user restarts the gateway, the same session history loads and puts the
agent right back in the stuck state. The user is trapped in a loop:
restart → stuck → restart → stuck.

Fix: track restart-failure counts per session using a simple JSON file
(.restart_failure_counts). On each shutdown with active agents, the
counter increments for those sessions. On startup, if any session has
been active across 3+ consecutive restarts, it's auto-suspended —
giving the user a clean slate on their next message.

The counter resets to 0 when a session completes a turn successfully
(response delivered), so normal sessions that happen to be active
during planned restarts (/restart, hermes update) won't accumulate
false counts.

Implementation:
- _increment_restart_failure_counts(): called during stop() when
  agents are active. Writes {session_key: count} to JSON file.
  Sessions NOT active are dropped (loop broken).
- _suspend_stuck_loop_sessions(): called on startup. Reads the file,
  suspends sessions at threshold (3), clears the file.
- _clear_restart_failure_count(): called after successful response
  delivery. Removes the session from the counter file.

No SessionEntry schema changes. No database migration. Pure file-based
tracking that naturally cleans up.

Test plan:
- 9 new stuck-loop tests (increment, accumulate, threshold, clear,
  suspend, file cleanup, edge cases)
- All 28 gateway lifecycle tests pass (restart drain + auto-continue
  + stuck loop)

Development

Successfully merging this pull request may close these issues.

[Gateway] Stuck session resumes on restart — creates unrecoverable loop
