fix: gateway auto-recovers from unexpected SIGTERM via systemd (#5646)#9875
Merged
Conversation
Root cause: when the gateway received SIGTERM (from hermes update, external kill, WSL2 runtime, etc.), it exited with status 0. systemd's Restart=on-failure only restarts on non-zero exit, so the gateway stayed dead permanently. Users had to manually restart. Fix 1: Signal-initiated shutdown exits non-zero When SIGTERM/SIGINT is received and no restart was requested (via /restart, /update, or SIGUSR1), start_gateway() returns False which causes sys.exit(1). systemd sees a failure exit and auto-restarts after RestartSec=30. This is safe because systemctl stop tracks its own stop-requested state independently of exit code — Restart= never fires for a deliberate stop, regardless of exit code. Also logs 'Received SIGTERM/SIGINT — initiating shutdown' so the cause of unexpected shutdowns is visible in agent.log. Fix 2: PID file ownership guard remove_pid_file() now checks that the PID file belongs to the current process before removing it. During --replace handoffs, the old process's atexit handler could fire AFTER the new process wrote its PID file, deleting the new record. This left the gateway running but invisible to get_running_pid(), causing 'Another gateway already running' errors on next restart. Test plan: - All restart drain tests pass (13) - All gateway service tests pass (84) - All update gateway restart tests pass (34)
teknium1
added a commit
that referenced
this pull request
Apr 14, 2026
Add dangerous command patterns that require approval when the agent tries to run gateway lifecycle commands via the terminal tool: - hermes gateway stop/restart — kills all running agents mid-work - hermes update — pulls code and restarts the gateway - systemctl restart/stop (with optional flags like --user) These patterns fire the approval prompt so the user must explicitly approve before the agent can kill its own gateway process. In YOLO mode, the commands run without approval (by design — YOLO means the user accepts all risks). Also fixes the existing systemctl pattern to handle flags between the command and action (e.g. 'systemctl --user restart' was previously undetected because the regex expected the action immediately after 'systemctl'). Root cause: issue #6666 reported agents running 'hermes gateway restart' via terminal, killing the gateway process mid-agent-loop. The user sees the agent suddenly stop responding with no explanation. Combined with the SIGTERM auto-recovery from PR #9875, the gateway now both prevents accidental self-destruction AND recovers if it happens anyway. Test plan: - Updated test_systemctl_restart_not_flagged → test_systemctl_restart_flagged - All 119 approval tests pass - E2E verified: hermes gateway restart, hermes update, systemctl --user restart all detected; hermes gateway status, systemctl status remain safe
teknium1
added a commit
that referenced
this pull request
Apr 14, 2026
Add dangerous command patterns that require approval when the agent tries to run gateway lifecycle commands via the terminal tool: - hermes gateway stop/restart — kills all running agents mid-work - hermes update — pulls code and restarts the gateway - systemctl restart/stop (with optional flags like --user) These patterns fire the approval prompt so the user must explicitly approve before the agent can kill its own gateway process. In YOLO mode, the commands run without approval (by design — YOLO means the user accepts all risks). Also fixes the existing systemctl pattern to handle flags between the command and action (e.g. 'systemctl --user restart' was previously undetected because the regex expected the action immediately after 'systemctl'). Root cause: issue #6666 reported agents running 'hermes gateway restart' via terminal, killing the gateway process mid-agent-loop. The user sees the agent suddenly stop responding with no explanation. Combined with the SIGTERM auto-recovery from PR #9875, the gateway now both prevents accidental self-destruction AND recovers if it happens anyway. Test plan: - Updated test_systemctl_restart_not_flagged → test_systemctl_restart_flagged - All 119 approval tests pass - E2E verified: hermes gateway restart, hermes update, systemctl --user restart all detected; hermes gateway status, systemctl status remain safe
teknium1
added a commit
that referenced
this pull request
Apr 15, 2026
The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance. Added: 'Your API balance or quota is exhausted. Check your provider dashboard.' The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (#5220) where 402 killed the entire gateway was from an older version — on current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow. Combined with PR #9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable.
teknium1
added a commit
that referenced
this pull request
Apr 15, 2026
) * fix: hermes gateway restart waits for service to come back up (#8260) Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds. Now the SIGUSR1 path blocks with progress feedback: Phase 1 — wait for old process to die: ⏳ User service draining active work... Polls os.kill(pid, 0) until ProcessLookupError (up to 90s) Phase 2 — wait for new process to become active: ⏳ Waiting for hermes-gateway to restart... Polls systemctl is-active + verifies new PID (up to 60s) Success: ✓ User service restarted (PID 12345) Timeout: ⚠ User service did not become active within 60s. Check status: hermes gateway status Check logs: journalctl --user -u hermes-gateway --since '2 min ago' The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous. Test plan: - Updated test to verify wait-for-restart behavior - All 118 gateway CLI tests pass * fix: add 402 billing error hint to gateway error handler (#5220) The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance. Added: 'Your API balance or quota is exhausted. Check your provider dashboard.' The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (#5220) where 402 killed the entire gateway was from an older version — on current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow. Combined with PR #9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable.
lauchiwa
pushed a commit
to lauchiwa/hermes-agent
that referenced
this pull request
Apr 17, 2026
…h#5220) (NousResearch#10057) * fix: hermes gateway restart waits for service to come back up (NousResearch#8260) Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds. Now the SIGUSR1 path blocks with progress feedback: Phase 1 — wait for old process to die: ⏳ User service draining active work... Polls os.kill(pid, 0) until ProcessLookupError (up to 90s) Phase 2 — wait for new process to become active: ⏳ Waiting for hermes-gateway to restart... Polls systemctl is-active + verifies new PID (up to 60s) Success: ✓ User service restarted (PID 12345) Timeout: ⚠ User service did not become active within 60s. Check status: hermes gateway status Check logs: journalctl --user -u hermes-gateway --since '2 min ago' The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous. Test plan: - Updated test to verify wait-for-restart behavior - All 118 gateway CLI tests pass * fix: add 402 billing error hint to gateway error handler (NousResearch#5220) The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance. Added: 'Your API balance or quota is exhausted. Check your provider dashboard.' The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (NousResearch#5220) where 402 killed the entire gateway was from an older version — on current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable. (cherry picked from commit ca0ae56)
ulasbilgen
pushed a commit
to ulasbilgen/hermes-adhd-agent
that referenced
this pull request
May 1, 2026
…arch#6666) Add dangerous command patterns that require approval when the agent tries to run gateway lifecycle commands via the terminal tool: - hermes gateway stop/restart — kills all running agents mid-work - hermes update — pulls code and restarts the gateway - systemctl restart/stop (with optional flags like --user) These patterns fire the approval prompt so the user must explicitly approve before the agent can kill its own gateway process. In YOLO mode, the commands run without approval (by design — YOLO means the user accepts all risks). Also fixes the existing systemctl pattern to handle flags between the command and action (e.g. 'systemctl --user restart' was previously undetected because the regex expected the action immediately after 'systemctl'). Root cause: issue NousResearch#6666 reported agents running 'hermes gateway restart' via terminal, killing the gateway process mid-agent-loop. The user sees the agent suddenly stop responding with no explanation. Combined with the SIGTERM auto-recovery from PR NousResearch#9875, the gateway now both prevents accidental self-destruction AND recovers if it happens anyway. Test plan: - Updated test_systemctl_restart_not_flagged → test_systemctl_restart_flagged - All 119 approval tests pass - E2E verified: hermes gateway restart, hermes update, systemctl --user restart all detected; hermes gateway status, systemctl status remain safe
ulasbilgen
pushed a commit
to ulasbilgen/hermes-adhd-agent
that referenced
this pull request
May 1, 2026
…h#5220) (NousResearch#10057) * fix: hermes gateway restart waits for service to come back up (NousResearch#8260) Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds. Now the SIGUSR1 path blocks with progress feedback: Phase 1 — wait for old process to die: ⏳ User service draining active work... Polls os.kill(pid, 0) until ProcessLookupError (up to 90s) Phase 2 — wait for new process to become active: ⏳ Waiting for hermes-gateway to restart... Polls systemctl is-active + verifies new PID (up to 60s) Success: ✓ User service restarted (PID 12345) Timeout: ⚠ User service did not become active within 60s. Check status: hermes gateway status Check logs: journalctl --user -u hermes-gateway --since '2 min ago' The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous. Test plan: - Updated test to verify wait-for-restart behavior - All 118 gateway CLI tests pass * fix: add 402 billing error hint to gateway error handler (NousResearch#5220) The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance. Added: 'Your API balance or quota is exhausted. Check your provider dashboard.' The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (NousResearch#5220) where 402 killed the entire gateway was from an older version — on current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable.
aj-nt
pushed a commit
to aj-nt/hermes-agent
that referenced
this pull request
May 1, 2026
…arch#6666) Add dangerous command patterns that require approval when the agent tries to run gateway lifecycle commands via the terminal tool: - hermes gateway stop/restart — kills all running agents mid-work - hermes update — pulls code and restarts the gateway - systemctl restart/stop (with optional flags like --user) These patterns fire the approval prompt so the user must explicitly approve before the agent can kill its own gateway process. In YOLO mode, the commands run without approval (by design — YOLO means the user accepts all risks). Also fixes the existing systemctl pattern to handle flags between the command and action (e.g. 'systemctl --user restart' was previously undetected because the regex expected the action immediately after 'systemctl'). Root cause: issue NousResearch#6666 reported agents running 'hermes gateway restart' via terminal, killing the gateway process mid-agent-loop. The user sees the agent suddenly stop responding with no explanation. Combined with the SIGTERM auto-recovery from PR NousResearch#9875, the gateway now both prevents accidental self-destruction AND recovers if it happens anyway. Test plan: - Updated test_systemctl_restart_not_flagged → test_systemctl_restart_flagged - All 119 approval tests pass - E2E verified: hermes gateway restart, hermes update, systemctl --user restart all detected; hermes gateway status, systemctl status remain safe
aj-nt
pushed a commit
to aj-nt/hermes-agent
that referenced
this pull request
May 1, 2026
…h#5220) (NousResearch#10057) * fix: hermes gateway restart waits for service to come back up (NousResearch#8260) Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds. Now the SIGUSR1 path blocks with progress feedback: Phase 1 — wait for old process to die: ⏳ User service draining active work... Polls os.kill(pid, 0) until ProcessLookupError (up to 90s) Phase 2 — wait for new process to become active: ⏳ Waiting for hermes-gateway to restart... Polls systemctl is-active + verifies new PID (up to 60s) Success: ✓ User service restarted (PID 12345) Timeout: ⚠ User service did not become active within 60s. Check status: hermes gateway status Check logs: journalctl --user -u hermes-gateway --since '2 min ago' The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous. Test plan: - Updated test to verify wait-for-restart behavior - All 118 gateway CLI tests pass * fix: add 402 billing error hint to gateway error handler (NousResearch#5220) The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance. Added: 'Your API balance or quota is exhausted. Check your provider dashboard.' The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (NousResearch#5220) where 402 killed the entire gateway was from an older version — on current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable.
02356abc
pushed a commit
to 02356abc/hermes-agent
that referenced
this pull request
May 14, 2026
…arch#6666) Add dangerous command patterns that require approval when the agent tries to run gateway lifecycle commands via the terminal tool: - hermes gateway stop/restart — kills all running agents mid-work - hermes update — pulls code and restarts the gateway - systemctl restart/stop (with optional flags like --user) These patterns fire the approval prompt so the user must explicitly approve before the agent can kill its own gateway process. In YOLO mode, the commands run without approval (by design — YOLO means the user accepts all risks). Also fixes the existing systemctl pattern to handle flags between the command and action (e.g. 'systemctl --user restart' was previously undetected because the regex expected the action immediately after 'systemctl'). Root cause: issue NousResearch#6666 reported agents running 'hermes gateway restart' via terminal, killing the gateway process mid-agent-loop. The user sees the agent suddenly stop responding with no explanation. Combined with the SIGTERM auto-recovery from PR NousResearch#9875, the gateway now both prevents accidental self-destruction AND recovers if it happens anyway. Test plan: - Updated test_systemctl_restart_not_flagged → test_systemctl_restart_flagged - All 119 approval tests pass - E2E verified: hermes gateway restart, hermes update, systemctl --user restart all detected; hermes gateway status, systemctl status remain safe
02356abc
pushed a commit
to 02356abc/hermes-agent
that referenced
this pull request
May 14, 2026
…h#5220) (NousResearch#10057) * fix: hermes gateway restart waits for service to come back up (NousResearch#8260) Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds. Now the SIGUSR1 path blocks with progress feedback: Phase 1 — wait for old process to die: ⏳ User service draining active work... Polls os.kill(pid, 0) until ProcessLookupError (up to 90s) Phase 2 — wait for new process to become active: ⏳ Waiting for hermes-gateway to restart... Polls systemctl is-active + verifies new PID (up to 60s) Success: ✓ User service restarted (PID 12345) Timeout: ⚠ User service did not become active within 60s. Check status: hermes gateway status Check logs: journalctl --user -u hermes-gateway --since '2 min ago' The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous. Test plan: - Updated test to verify wait-for-restart behavior - All 118 gateway CLI tests pass * fix: add 402 billing error hint to gateway error handler (NousResearch#5220) The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance. Added: 'Your API balance or quota is exhausted. Check your provider dashboard.' The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (NousResearch#5220) where 402 killed the entire gateway was from an older version — on current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable.
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
…arch#6666) Add dangerous command patterns that require approval when the agent tries to run gateway lifecycle commands via the terminal tool: - hermes gateway stop/restart — kills all running agents mid-work - hermes update — pulls code and restarts the gateway - systemctl restart/stop (with optional flags like --user) These patterns fire the approval prompt so the user must explicitly approve before the agent can kill its own gateway process. In YOLO mode, the commands run without approval (by design — YOLO means the user accepts all risks). Also fixes the existing systemctl pattern to handle flags between the command and action (e.g. 'systemctl --user restart' was previously undetected because the regex expected the action immediately after 'systemctl'). Root cause: issue NousResearch#6666 reported agents running 'hermes gateway restart' via terminal, killing the gateway process mid-agent-loop. The user sees the agent suddenly stop responding with no explanation. Combined with the SIGTERM auto-recovery from PR NousResearch#9875, the gateway now both prevents accidental self-destruction AND recovers if it happens anyway. Test plan: - Updated test_systemctl_restart_not_flagged → test_systemctl_restart_flagged - All 119 approval tests pass - E2E verified: hermes gateway restart, hermes update, systemctl --user restart all detected; hermes gateway status, systemctl status remain safe
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
…h#5220) (NousResearch#10057) * fix: hermes gateway restart waits for service to come back up (NousResearch#8260) Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds. Now the SIGUSR1 path blocks with progress feedback: Phase 1 — wait for old process to die: ⏳ User service draining active work... Polls os.kill(pid, 0) until ProcessLookupError (up to 90s) Phase 2 — wait for new process to become active: ⏳ Waiting for hermes-gateway to restart... Polls systemctl is-active + verifies new PID (up to 60s) Success: ✓ User service restarted (PID 12345) Timeout: ⚠ User service did not become active within 60s. Check status: hermes gateway status Check logs: journalctl --user -u hermes-gateway --since '2 min ago' The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous. Test plan: - Updated test to verify wait-for-restart behavior - All 118 gateway CLI tests pass * fix: add 402 billing error hint to gateway error handler (NousResearch#5220) The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance. Added: 'Your API balance or quota is exhausted. Check your provider dashboard.' The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (NousResearch#5220) where 402 killed the entire gateway was from an older version — on current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable.
Egavasyug
pushed a commit
to Egavasyug/hermes-agent
that referenced
this pull request
Jun 10, 2026
…arch#6666) Add dangerous command patterns that require approval when the agent tries to run gateway lifecycle commands via the terminal tool: - hermes gateway stop/restart — kills all running agents mid-work - hermes update — pulls code and restarts the gateway - systemctl restart/stop (with optional flags like --user) These patterns fire the approval prompt so the user must explicitly approve before the agent can kill its own gateway process. In YOLO mode, the commands run without approval (by design — YOLO means the user accepts all risks). Also fixes the existing systemctl pattern to handle flags between the command and action (e.g. 'systemctl --user restart' was previously undetected because the regex expected the action immediately after 'systemctl'). Root cause: issue NousResearch#6666 reported agents running 'hermes gateway restart' via terminal, killing the gateway process mid-agent-loop. The user sees the agent suddenly stop responding with no explanation. Combined with the SIGTERM auto-recovery from PR NousResearch#9875, the gateway now both prevents accidental self-destruction AND recovers if it happens anyway. Test plan: - Updated test_systemctl_restart_not_flagged → test_systemctl_restart_flagged - All 119 approval tests pass - E2E verified: hermes gateway restart, hermes update, systemctl --user restart all detected; hermes gateway status, systemctl status remain safe
Egavasyug
pushed a commit
to Egavasyug/hermes-agent
that referenced
this pull request
Jun 10, 2026
…h#5220) (NousResearch#10057) * fix: hermes gateway restart waits for service to come back up (NousResearch#8260) Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds. Now the SIGUSR1 path blocks with progress feedback: Phase 1 — wait for old process to die: ⏳ User service draining active work... Polls os.kill(pid, 0) until ProcessLookupError (up to 90s) Phase 2 — wait for new process to become active: ⏳ Waiting for hermes-gateway to restart... Polls systemctl is-active + verifies new PID (up to 60s) Success: ✓ User service restarted (PID 12345) Timeout: ⚠ User service did not become active within 60s. Check status: hermes gateway status Check logs: journalctl --user -u hermes-gateway --since '2 min ago' The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous. Test plan: - Updated test to verify wait-for-restart behavior - All 118 gateway CLI tests pass * fix: add 402 billing error hint to gateway error handler (NousResearch#5220) The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance. Added: 'Your API balance or quota is exhausted. Check your provider dashboard.' The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (NousResearch#5220) where 402 killed the entire gateway was from an older version — on current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow. Combined with PR NousResearch#9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #5646 — gateway exits cleanly (status=0) after receiving SIGTERM, and systemd's
Restart=on-failurenever revives it because 0 = success. The gateway stays dead permanently until manually restarted.This was the root cause of the recurring "gateway dies mid-agent-run" reports from WSL2 and systemd users. When
hermes updateor any external process sends SIGTERM to the gateway, it handled the signal gracefully and exited 0. systemd saw a clean exit and did nothing.Changes
1. Signal-initiated shutdown exits non-zero
When SIGTERM/SIGINT triggers shutdown and no restart was requested (via
/restart,/update, or SIGUSR1),start_gateway()returnsFalse→sys.exit(1). systemd'sRestart=on-failuresees the non-zero exit and auto-restarts afterRestartSec=30.This is safe because
systemctl stoptracks its own stop-requested state independently of exit code —Restart=never fires for a deliberate stop, regardless of what the process returns.Also adds a log line:
Received SIGTERM/SIGINT — initiating shutdownso the cause of unexpected shutdowns is visible in agent.log (previously it was just "Stopping gateway..." with no indication of why).2. PID file ownership guard
remove_pid_file()now verifies the PID file belongs to the current process before deleting it. During--replacehandoffs, the old process'satexithandler could fire AFTER the new process had already written its PID file — deleting the new record. This left the gateway running but invisible toget_running_pid(), causing:hermes gateway statusshowing "not running" when it wasHow this resolves the reported scenarios
systemctl stop hermes-gateway/restartfrom TelegramTest plan