fix: notify active sessions on gateway shutdown + update health check #9850
Merged
Three fixes for gateway lifecycle stability:

1. Notify active sessions before shutdown (#new)

When the gateway receives SIGTERM or /restart, it now sends a notification to every chat with an active agent BEFORE starting the drain. Users see:
- Shutdown: 'Gateway shutting down — your task will be interrupted.'
- Restart: 'Gateway restarting — use /retry after restart to continue.'

Deduplicates per-chat so group sessions with multiple users get one notification. Best-effort: send failures are logged and swallowed.

2. Skip .clean_shutdown marker when drain timed out

Previously, a graceful SIGTERM always wrote .clean_shutdown, even if agents were force-interrupted when the drain timed out. This meant the next startup skipped session suspension, leaving interrupted sessions in a broken state (trailing tool response, no final message). Now the marker is only written if the drain completed without timeout, so interrupted sessions get properly suspended on next startup.

3. Post-restart health check for hermes update (#6631)

cmd_update() now verifies the gateway actually survived after systemctl restart (sleep 3s + is-active check). If the service crashed immediately, it retries once. If still dead, prints actionable diagnostics (journalctl command, manual restart hint).

Also closes #8104, which was already fixed on main (the /restart handler correctly detects systemd via INVOCATION_ID and uses via_service=True).

Test plan:
- 6 new tests for shutdown notifications (dedup, restart vs shutdown messaging, sentinel filtering, send failure resilience)
- Existing restart drain + update tests pass (47 total)
Summary
Addresses three gateway lifecycle stability issues from user reports of agents being killed mid-work with no notification.
Changes
1. Notify active sessions before shutdown (new)
When the gateway receives SIGTERM or `/restart`, `_notify_active_sessions_of_shutdown()` sends a message to every chat with an active agent BEFORE the drain starts (while adapters are still connected):
- Shutdown: ⚠️ Gateway shutting down — Your current task will be interrupted.
- Restart: ⚠️ Gateway restarting — Your current task will be interrupted. Use /retry after restart to continue.

Deduplicates per-chat (multiple users in a group get one notification). Best-effort: send failures are logged and swallowed so they never block shutdown.
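The per-chat dedup and best-effort send described above can be sketched roughly as follows. This is an illustration, not the PR's code: `notify_active_sessions`, the `(chat_id, user_id)` session shape, and the `send` callback are hypothetical stand-ins for whatever `_notify_active_sessions_of_shutdown()` and the adapters actually use. Only the two message strings come from the description.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.shutdown")

# Message texts as quoted in the PR description.
SHUTDOWN_MSG = "⚠️ Gateway shutting down — Your current task will be interrupted."
RESTART_MSG = (
    "⚠️ Gateway restarting — Your current task will be interrupted. "
    "Use /retry after restart to continue."
)


async def notify_active_sessions(sessions, send, restarting=False):
    """Send one best-effort notice per chat; return the chat ids that were reached.

    `sessions` is assumed to be an iterable of (chat_id, user_id) pairs and
    `send` an async callable taking (chat_id, text). Both are hypothetical.
    """
    message = RESTART_MSG if restarting else SHUTDOWN_MSG
    seen = set()
    notified = []
    for chat_id, _user_id in sessions:
        if chat_id in seen:
            continue  # several users in one group chat: notify the chat once
        seen.add(chat_id)
        try:
            await send(chat_id, message)
            notified.append(chat_id)
        except Exception:
            # Best-effort: log and move on so a failed send never blocks shutdown.
            logger.warning("shutdown notice to %s failed", chat_id, exc_info=True)
    return notified
```

Calling this before the drain begins, while adapters are still connected, matches the ordering the description asks for. A failed send is logged and the remaining chats are still notified.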
2. Skip .clean_shutdown marker when drain timed out
Previously, graceful SIGTERM always wrote `.clean_shutdown`, even when agents were force-interrupted after the drain timeout. The next startup would skip session suspension, leaving interrupted sessions in a broken state (trailing tool response, no final assistant message → stuck session on resume).
Now the marker is only written if the drain completed cleanly. Interrupted sessions get properly suspended on next startup.
This also helps with #7536 (stuck session resume loops): sessions interrupted during shutdown will now be auto-suspended instead of resuming into a broken state.
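A minimal sketch of the marker logic, under stated assumptions: the function names, the `state_dir` argument, and the idea that startup deletes the marker after reading it are all hypothetical. Only the `.clean_shutdown` file name and the "write only after a clean drain" rule come from the description.

```python
from pathlib import Path

MARKER_NAME = ".clean_shutdown"


def finish_shutdown(state_dir: Path, drain_timed_out: bool) -> bool:
    """Write the marker only after a clean drain; return True if it was written."""
    if drain_timed_out:
        # Agents were force-interrupted: leave no marker, so the next
        # startup goes through session suspension.
        return False
    (state_dir / MARKER_NAME).touch()
    return True


def should_suspend_sessions_on_startup(state_dir: Path) -> bool:
    """Startup side: suspend sessions unless the previous shutdown was clean.

    Assumption: the marker is consumed (deleted) once it has been read.
    """
    marker = state_dir / MARKER_NAME
    if marker.exists():
        marker.unlink()
        return False
    return True
```

The point of the design is that the marker now records "the drain finished", not merely "SIGTERM was handled", so a timed-out drain falls through to the suspension path.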
3. Post-restart health check for `hermes update` (#6631)
`cmd_update()` now verifies the gateway service actually survived after `systemctl restart`: it sleeps 3s and runs a `systemctl is-active` check, retries once if the service crashed immediately, and prints actionable diagnostics (journalctl command, manual restart hint) if it is still dead.
Previously, `systemctl restart` returning 0 was taken as success even if the service crashed immediately, leaving the gateway silently dead for days.
Also closes #8104, which was already fixed on main (the `/restart` handler correctly detects systemd via `INVOCATION_ID` and uses `via_service=True`).
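The restart-then-verify flow for `hermes update` might look something like the sketch below. It is not the actual `cmd_update()` code: `restart_and_verify`, the injectable `run`/`sleep` parameters, and the exact diagnostic wording are hypothetical, and it assumes "retries once" means issuing a second `systemctl restart`. A user-level service would additionally need `--user` on each command.

```python
import subprocess
import time


def service_is_active(unit, run=subprocess.run):
    """Return True when `systemctl is-active --quiet <unit>` exits with 0."""
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def restart_and_verify(unit, settle_seconds=3.0, run=subprocess.run, sleep=time.sleep):
    """Restart the unit, wait, then check it is still up; retry the restart once."""
    for _attempt in range(2):
        # A zero exit from `restart` alone does not prove the service stayed up.
        run(["systemctl", "restart", unit], check=False)
        sleep(settle_seconds)  # give an immediately-crashing service time to die
        if service_is_active(unit, run=run):
            return True
    # Still dead after the retry: point the user at the next steps.
    print(f"{unit} is not running after restart.")
    print(f"  Inspect logs:    journalctl -u {unit} -n 50 --no-pager")
    print(f"  Restart by hand: systemctl restart {unit}")
    return False
```

Passing `run` and `sleep` as parameters is only there so the flow can be exercised in tests without touching systemd or waiting three seconds.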
Test plan