fix(gateway): force-kill unresponsive gateway during systemd restart (#12438) #17675

Closed
cxgreat2014 wants to merge 1 commit into NousResearch:main from cxgreat2014:fix/gateway-stale-pid-12438

Conversation

@cxgreat2014

What does this PR do?

When the gateway is crashed or unresponsive (hung event loop, crash-loop, SIGKILL-resilient PID), running hermes gateway restart silently fails:

  1. systemd_restart() sends SIGUSR1 to the stuck process — the handler never runs because the event loop is dead
  2. The 90s drain timeout expires with just a warning message
  3. Then systemctl start is called — but systemd still sees the hung process as alive, so it's a no-op
  4. The user gets no recovery path; CLI restart remains stuck forever
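Step 1 can be reproduced in isolation. The sketch below assumes the handler is registered through asyncio's `loop.add_signal_handler`, which this PR does not confirm for the gateway; `demo_blocked_loop` and `on_sigusr1` are hypothetical names. It illustrates that a handler registered this way only runs once the event loop gets control back, so a loop that never unblocks never sees the signal. Unix only, and it must run in the main thread.

```python
import asyncio
import os
import signal
import time


async def demo_blocked_loop() -> tuple[bool, bool]:
    """Return (handled_while_blocked, handled_after_unblocking)."""
    loop = asyncio.get_running_loop()
    seen = {"handled": False}

    def on_sigusr1() -> None:
        # Hypothetical stand-in for the gateway's drain handler.
        seen["handled"] = True

    loop.add_signal_handler(signal.SIGUSR1, on_sigusr1)
    try:
        # Signal ourselves, then block the loop with a synchronous sleep.
        os.kill(os.getpid(), signal.SIGUSR1)
        time.sleep(0.2)  # simulates a hung event loop
        handled_while_blocked = seen["handled"]

        # Yield to the loop so it can dispatch the queued handler.
        await asyncio.sleep(0.05)
        handled_after_unblocking = seen["handled"]
    finally:
        loop.remove_signal_handler(signal.SIGUSR1)
    return handled_while_blocked, handled_after_unblocking


if __name__ == "__main__":
    print(asyncio.run(demo_blocked_loop()))
```

In the failure described above the loop never yields, so the equivalent of the second check never becomes true and the drain wait can only time out.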

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Related Issue

Fixes #12438

Changes Made

hermes_cli/gateway.py: systemd_restart()

  1. Force-kill on drain timeout: After the 90s drain loop times out (process still alive), call terminate_pid(pid, force=True) to SIGKILL the stuck gateway.
  2. Use systemctl restart: Changed from systemctl start to systemctl restart so systemd explicitly relaunches the service regardless of prior active state.
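A minimal sketch of the control flow these two changes describe, not the actual `systemd_restart()` body. Only `terminate_pid(pid, force=True)` and the 90s timeout come from this PR. The function name, the injected callables (`is_alive`, `send_drain_signal`, `run_systemctl`), the `unit` parameter, and the polling interval are assumptions made so the example is self-contained.

```python
import time
from typing import Callable, Sequence


def restart_with_force_kill(
    pid: int,
    *,
    unit: str,  # hypothetical: the systemd unit name is not given in the PR
    is_alive: Callable[[int], bool],
    send_drain_signal: Callable[[int], None],
    terminate_pid: Callable[..., None],
    run_systemctl: Callable[[Sequence[str]], None],
    drain_timeout: float = 90.0,
    poll_interval: float = 0.5,
) -> bool:
    """Drain gracefully, force-kill on timeout, then restart the unit.

    Returns True when the force-kill path was taken.
    """
    send_drain_signal(pid)  # SIGUSR1 in the PR description

    # Wait for the process to exit on its own, up to the drain timeout.
    deadline = time.monotonic() + drain_timeout
    while is_alive(pid) and time.monotonic() < deadline:
        time.sleep(poll_interval)

    forced = is_alive(pid)
    if forced:
        # Change 1: SIGKILL the stuck process instead of only warning.
        terminate_pid(pid, force=True)

    # Change 2: "restart" relaunches the unit even if systemd still
    # considers it active, where "start" would be a no-op.
    run_systemctl(["restart", unit])
    return forced
```

Passing the process and systemd operations in as callables keeps the sketch testable without a real service; the real function presumably calls its helpers directly.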

tests/hermes_cli/test_gateway_service.py

  • New test: test_systemd_restart_force_kills_unresponsive_gateway — verifies that when the drain loop times out, terminate_pid(pid, force=True) is called and systemctl restart is used.
  • Updated existing tests: Mock assertions updated from "start" to "restart".
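The shape of the new assertion can be sketched with `unittest.mock`. This is an illustration only: `restart_stub` is a hypothetical stand-in for the post-timeout part of `systemd_restart()`, and the real test in `tests/hermes_cli/test_gateway_service.py` patches the project's own helpers, which are not shown in this PR description.

```python
from unittest import mock


def restart_stub(pid, terminate_pid, systemctl, still_alive):
    """Hypothetical stand-in for systemd_restart() after the drain timeout."""
    if still_alive(pid):
        terminate_pid(pid, force=True)
    systemctl("restart")


def test_force_kills_unresponsive_gateway():
    terminate_pid = mock.Mock()
    systemctl = mock.Mock()

    # The gateway is still alive after the drain loop timed out.
    restart_stub(4242, terminate_pid, systemctl, still_alive=lambda _pid: True)

    terminate_pid.assert_called_once_with(4242, force=True)
    systemctl.assert_called_once_with("restart")
    # Guard against a regression to the old no-op "start" call.
    assert mock.call("start") not in systemctl.call_args_list


def test_skips_force_kill_when_gateway_exited():
    terminate_pid = mock.Mock()
    systemctl = mock.Mock()

    restart_stub(4242, terminate_pid, systemctl, still_alive=lambda _pid: False)

    terminate_pid.assert_not_called()
    systemctl.assert_called_once_with("restart")
```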

How to Test

uv run python -m pytest tests/hermes_cli/test_gateway_service.py -xvs -k "test_systemd_restart_force_kills"

Expected: 1 passed, confirming force-kill and systemctl restart are invoked.

Checklist

  • I have read the contributing guidelines
  • I have followed the coding style of the project
  • I have added or updated tests where applicable
  • I have updated relevant documentation
  • I have checked my code and corrected any misspellings
  • My changes generate no new warnings or errors
  • All new and existing tests pass
  • This PR passes linting, formatting, and type-checking CI (where applicable)
  • I have performed a self-review of my own code

…ousResearch#12438)

When the gateway is in a crashed/unresponsive state (hung event loop,
crash-loop), 'hermes gateway restart' sends SIGUSR1 but the signal
handler never executes. The 90s drain timeout expires with a warning,
but then 'systemctl start' is a no-op because systemd still sees the
hung process as alive — leaving the gateway permanently stuck.

Fix:
1. After the 90s drain timeout, force-kill (SIGKILL) the stuck
   gateway process via terminate_pid(pid, force=True)
2. Use 'systemctl restart' instead of 'systemctl start' so systemd
   explicitly relaunches the service regardless of prior state

Adds a regression test: test_systemd_restart_force_kills_unresponsive_gateway
verifies terminate_pid(force=True) is called and systemctl restart
is used. Existing tests updated to match 'systemctl restart'.

Fixes NousResearch#12438
@alt-glitch added the type/bug, P2, comp/cli, and comp/gateway labels Apr 30, 2026
@alt-glitch

Collaborator

Likely duplicate of #17588 — same force-kill fix for unresponsive gateway during systemd restart. Fixes #12438.

@cxgreat2014

Author

Not a duplicate: #17588 was the original version of this same PR, which carried in 4 extra files from other branches (gateway/run.py, hermes_cli/model_switch.py, and their tests). I noticed the clutter, closed it, rebased down to only my two-file change, and reopened it as #17675. #17588 has already been closed. Compare the closed #17588 with this PR to see the cleaned diff.

@teknium1

Contributor

Automated hermes-sweeper review: this appears to be implemented on current main via the later merged gateway restart work.

Evidence:

  • hermes_cli/gateway.py:2850 now resolves the live gateway PID or systemd MainPID, then attempts the graceful SIGUSR1 drain path.
  • hermes_cli/gateway.py:2880 reports the timeout fallback as a forced service restart.
  • hermes_cli/gateway.py:2890 now calls systemctl restart for the fallback path, so the old systemctl start no-op failure mode described here is gone.
  • tests/hermes_cli/test_gateway_service.py:914 covers the systemd restart handoff and asserts the restart path is used.
  • The merged follow-up PR #20949, "fix(gateway): wait for systemd restart readiness + harden Discord slash-command sync", cherry-picked the related restart-readiness work onto main; the mainline commit is d797755a1c17566b0aef4d77548a4b460142d26a, shipped in v2026.5.7.

Thanks for the cleanup on #17675 versus #17588; the current mainline fix took a slightly different route than the direct terminate_pid(pid, force=True) call, but it resolves the stuck systemctl start restart path this PR targeted.
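The "systemd MainPID" lookup mentioned in the evidence list could look roughly like the sketch below. How `hermes_cli/gateway.py` actually resolves it is not shown in this thread; the function names and the `user` flag are hypothetical. The `systemctl show --property MainPID --value` invocation is standard systemd, and a reported value of 0 means the unit has no main process.

```python
import subprocess
from typing import Optional


def parse_main_pid(output: str) -> Optional[int]:
    """Parse `systemctl show --property MainPID` output into a PID.

    Accepts both the bare value (with --value) and the "MainPID=123" form.
    Returns None when no main process is reported.
    """
    text = output.strip()
    if text.startswith("MainPID="):
        text = text[len("MainPID="):]
    if not text.isdigit():
        return None
    pid = int(text)
    return pid or None  # systemd reports 0 when there is no main process


def systemd_main_pid(unit: str, user: bool = False) -> Optional[int]:
    """Ask systemd for a unit's main PID; None if unavailable."""
    cmd = ["systemctl"]
    if user:
        cmd.append("--user")
    cmd += ["show", "--property", "MainPID", "--value", unit]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return None  # systemctl is not installed on this host
    if result.returncode != 0:
        return None
    return parse_main_pid(result.stdout)
```

Falling back to MainPID is useful when the CLI's own PID record is stale, since systemd tracks the process it actually launched.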

@teknium1 closed this Jun 10, 2026
@teknium1 added the sweeper:implemented-on-main label Jun 10, 2026

Labels

  • comp/cli (CLI entry point, hermes_cli/, setup wizard)
  • comp/gateway (Gateway runner, session dispatch, delivery)
  • P2 (Medium: degraded but workaround exists)
  • sweeper:implemented-on-main (Sweeper: behavior already present on current main)
  • type/bug (Something isn't working)

Development

Successfully merging this pull request may close these issues.

hermes gateway restart fails when gateway is in crashed/unresponsive state

3 participants