
fix: notify active sessions on gateway shutdown + update health check #9850

Merged: teknium1 merged 1 commit into main from hermes/hermes-993ae0d6 on Apr 14, 2026
@teknium1 (Contributor)

Summary

Addresses three gateway lifecycle stability issues surfaced by user reports of agents being killed mid-work with no notification.

Changes

1. Notify active sessions before shutdown (new)
When the gateway receives SIGTERM or /restart, _notify_active_sessions_of_shutdown() sends a message to every chat with an active agent BEFORE the drain starts (while adapters are still connected):

  • Shutdown: ⚠️ Gateway shutting down — Your current task will be interrupted.
  • Restart: ⚠️ Gateway restarting — Your current task will be interrupted. Use /retry after restart to continue.

Deduplicates per-chat (multiple users in a group get one notification). Best-effort — send failures are logged and swallowed so they never block shutdown.
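The notification pass described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `(chat_id, user_id)` session shape, the `send` callable, and the function name `notify_active_sessions` are hypothetical, and asyncio is assumed. Only the message texts, the per-chat dedup, and the log-and-swallow behaviour come from the description.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.shutdown")

# Message texts as quoted in the PR description.
SHUTDOWN_MSG = "⚠️ Gateway shutting down — Your current task will be interrupted."
RESTART_MSG = (
    "⚠️ Gateway restarting — Your current task will be interrupted. "
    "Use /retry after restart to continue."
)


async def notify_active_sessions(active_sessions, send, restarting=False):
    """Send one notice per chat and never raise.

    active_sessions: iterable of (chat_id, user_id) pairs (hypothetical shape).
    send: async callable (chat_id, text) provided by the caller (hypothetical).
    Returns the chat ids that were notified successfully, in order.
    """
    text = RESTART_MSG if restarting else SHUTDOWN_MSG
    seen = set()
    notified = []
    for chat_id, _user_id in active_sessions:
        if chat_id in seen:
            continue  # several users in one group chat share a single notice
        seen.add(chat_id)
        try:
            await send(chat_id, text)
            notified.append(chat_id)
        except Exception:
            # Best-effort: log and move on so shutdown is never blocked.
            logger.warning("shutdown notice to %s failed", chat_id, exc_info=True)
    return notified
```

Calling this before the drain begins matters because the adapters that `send` relies on are still connected at that point.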

2. Skip .clean_shutdown marker when drain timed out
Previously, graceful SIGTERM always wrote .clean_shutdown, even when agents were force-interrupted after the drain timeout. The next startup would skip session suspension, leaving interrupted sessions in a broken state (trailing tool response, no final assistant message → stuck session on resume). Now the marker is only written if the drain completed cleanly. Interrupted sessions get properly suspended on next startup.

This also helps with #7536 (stuck session resume loops) — sessions interrupted during shutdown will now be auto-suspended instead of resuming into a broken state.
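The marker logic can be illustrated with a small sketch. The function names, the `state_dir` parameter, and the choice to consume the marker on startup are assumptions made for illustration; only the `.clean_shutdown` file name and the "write it only after a clean drain" rule come from the description.

```python
from pathlib import Path

MARKER_NAME = ".clean_shutdown"


def finish_shutdown(state_dir, drain_timed_out):
    """Write the marker only when the drain completed without a timeout.

    Returns True if the marker was written.
    """
    marker = Path(state_dir) / MARKER_NAME
    if drain_timed_out:
        # Agents were force-interrupted, so this was not a clean shutdown.
        # Removing any stale marker is an assumption of this sketch.
        marker.unlink(missing_ok=True)
        return False
    marker.touch()
    return True


def should_suspend_sessions_on_startup(state_dir):
    """Suspend sessions unless the previous shutdown left a marker.

    Consuming (deleting) the marker here is an assumption of this sketch.
    """
    marker = Path(state_dir) / MARKER_NAME
    if marker.exists():
        marker.unlink()
        return False
    return True
```

With this shape, a timed-out drain leaves no marker, so the next startup takes the suspension path instead of resuming sessions that end in a trailing tool response.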

3. Post-restart health check for hermes update (#6631)
cmd_update() now verifies the gateway service actually survived after systemctl restart:

  • Sleep 3s → systemctl is-active check
  • If dead: retry once (transient startup failures often resolve)
  • If still dead: print actionable diagnostics (journalctl command + manual restart hint)

Previously, systemctl restart returning 0 was taken as success even if the service crashed immediately — leaving the gateway silently dead for days.
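A rough sketch of the check-retry-diagnose flow described above, under stated assumptions: the function names, the `unit` parameter, and the injectable `run`/`sleep` hooks are hypothetical, and "retry once" is interpreted here as issuing one more `systemctl restart`. The actual `cmd_update()` may differ, for example in how it targets user versus system services.

```python
import subprocess
import time


def service_is_active(unit, run=subprocess.run):
    """Return True if `systemctl is-active` exits with status 0 for the unit."""
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def verify_restart(unit, run=subprocess.run, sleep=time.sleep, settle_seconds=3):
    """Check the service after a restart, retrying the restart once if it died."""
    sleep(settle_seconds)  # give the service time to crash if it is going to
    if service_is_active(unit, run):
        return True

    # One retry: transient startup failures often resolve on a second attempt.
    run(["systemctl", "restart", unit], check=False)
    sleep(settle_seconds)
    if service_is_active(unit, run):
        return True

    # Still dead: print something the user can act on.
    print(f"{unit} is not running after restart.")
    print(f"  Inspect logs:     journalctl -u {unit} -n 50 --no-pager")
    print(f"  Restart manually: systemctl restart {unit}")
    return False
```

Checking `is-active` after a short delay is what catches the case where `systemctl restart` returns 0 but the process exits moments later.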

Also closes #8104 — already fixed on main (/restart handler correctly detects systemd via INVOCATION_ID and uses via_service=True).
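For context, systemd exports `INVOCATION_ID` into the environment of the services it starts, so its presence is a cheap signal that the process is service-managed. A minimal sketch of that detection follows; the helper name and the `environ` parameter are hypothetical, and how the real /restart handler passes the result on as `via_service=True` is not shown in this PR.

```python
import os


def running_under_systemd(environ=None):
    """Return True when INVOCATION_ID is set to a non-empty value."""
    env = os.environ if environ is None else environ
    return bool(env.get("INVOCATION_ID"))
```

A restart handler could then pass `via_service=running_under_systemd()` to whatever performs the restart, letting systemd bring the process back instead of re-executing it directly.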

Related issues

Test plan

  • 6 new tests for shutdown notifications (active session notification, restart vs shutdown messaging, per-chat dedup, sentinel filtering, no-agents skip, send failure resilience)
  • All existing restart drain tests pass (7)
  • All existing update gateway restart tests pass (34)

Three fixes for gateway lifecycle stability:

1. Notify active sessions before shutdown (#new)
   When the gateway receives SIGTERM or /restart, it now sends a
   notification to every chat with an active agent BEFORE starting
   the drain. Users see:
   - Shutdown: 'Gateway shutting down — your task will be interrupted.'
   - Restart: 'Gateway restarting — use /retry after restart to continue.'
   Deduplicates per-chat so group sessions with multiple users get
   one notification. Best-effort: send failures are logged and swallowed.

2. Skip .clean_shutdown marker when drain timed out
   Previously, a graceful SIGTERM always wrote .clean_shutdown, even if
   agents were force-interrupted when the drain timed out. This meant
   the next startup skipped session suspension, leaving interrupted
   sessions in a broken state (trailing tool response, no final message).
   Now the marker is only written if the drain completed without timeout,
   so interrupted sessions get properly suspended on next startup.

3. Post-restart health check for hermes update (#6631)
   cmd_update() now verifies the gateway actually survived after
   systemctl restart (sleep 3s + is-active check). If the service
   crashed immediately, it retries once. If still dead, prints
   actionable diagnostics (journalctl command, manual restart hint).

Also closes #8104 — already fixed on main (the /restart handler
correctly detects systemd via INVOCATION_ID and uses via_service=True).

Test plan:
- 6 new tests for shutdown notifications (dedup, restart vs shutdown
  messaging, sentinel filtering, send failure resilience)
- Existing restart drain + update tests pass (47 total with the 6 new tests)

teknium1 merged commit fa8c448 into main on Apr 14, 2026
6 of 7 checks passed
teknium1 deleted the hermes/hermes-993ae0d6 branch on April 14, 2026 21:22