fix(gateway): force-kill unresponsive gateway during systemd restart (#12438) #17588

Closed
cxgreat2014 wants to merge 4 commits into NousResearch:main from cxgreat2014:fix/stale-pid-stuck-gateway-force-kill

@cxgreat2014

What does this PR do?

When the gateway is crashed or unresponsive (hung event loop, crash-loop, SIGKILL-resilient PID), running hermes gateway restart silently fails:

  1. systemd_restart() sends SIGUSR1 to the stuck process — the handler never runs because the event loop is dead
  2. The 90s drain timeout expires with just a warning message
  3. Then systemctl start is called — but systemd still sees the hung process as alive, so it's a no-op
  4. The user gets no recovery path; CLI restart remains stuck forever

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Related Issue

Fixes #12438

Changes Made

hermes_cli/gateway.py: systemd_restart()

  1. Force-kill on drain timeout: After the 90s drain loop times out (process still alive), call terminate_pid(pid, force=True) to SIGKILL the stuck gateway.
  2. Use systemctl restart: Changed from systemctl start to systemctl restart so systemd explicitly relaunches the service regardless of prior active state.
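The two changes above can be sketched as a single control flow. This is a minimal illustration, not the code in hermes_cli/gateway.py: the function name `restart_with_force_kill`, the injected callables (`request_drain`, `is_alive`, `run_systemctl`), and the unit name `hermes-gateway.service` are assumptions made so the example is self-contained. Only `terminate_pid(pid, force=True)`, the 90s timeout, and `systemctl restart` come from the PR description.

```python
import time
from typing import Callable, List


def restart_with_force_kill(
    pid: int,
    request_drain: Callable[[int], None],      # hypothetical: sends SIGUSR1 to the gateway
    is_alive: Callable[[int], bool],           # hypothetical: liveness probe for the PID
    terminate_pid: Callable[..., None],        # called as terminate_pid(pid, force=True)
    run_systemctl: Callable[[List[str]], None],  # hypothetical wrapper around systemctl
    drain_timeout: float = 90.0,
    poll_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Ask the gateway to drain, then force-kill it if it is still alive.

    Returns True when the process had to be force-killed.
    """
    request_drain(pid)

    # Drain loop: poll until the process exits or the timeout expires.
    deadline = clock() + drain_timeout
    while clock() < deadline:
        if not is_alive(pid):
            break
        sleep(poll_interval)

    # Change 1: a process that survived the drain window is force-killed.
    forced = is_alive(pid)
    if forced:
        terminate_pid(pid, force=True)

    # Change 2: "restart" relaunches the unit even if systemd still
    # considered it active; "start" would have been a no-op in that case.
    run_systemctl(["restart", "hermes-gateway.service"])  # unit name is an assumption
    return forced
```

Injecting the clock and sleep functions keeps the 90s wait out of unit tests, which is one way to exercise the timeout branch without slowing the suite down.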

tests/hermes_cli/test_gateway_service.py

  • New test: test_systemd_restart_force_kills_unresponsive_gateway — verifies that when the drain loop times out, terminate_pid(pid, force=True) is called and systemctl restart is used.
  • Updated existing tests: Mock assertions updated from "start" to "restart".

How to Test

uv run python -m pytest tests/hermes_cli/test_gateway_service.py -xvs -k "test_systemd_restart_force_kills"

Expected: 1 passed, confirming force-kill and systemctl restart are invoked.

Checklist

  • I have read the contributing guidelines
  • I have followed the coding style of the project
  • I have added or updated tests where applicable
  • I have updated relevant documentation
  • I have checked my code and corrected any misspellings
  • My changes generate no new warnings or errors
  • All new and existing tests pass
  • This PR has linting, formatting, and type-checking CI passes (where applicable)
  • I have performed a self-review of my own code

… Python process

The detached /restart mechanism spawned a bash shell that polled the
old gateway's PID with `kill -0` and then ran `hermes gateway restart`.
This had two race conditions in container environments:

1. Zombie PID: `kill -0` on a zombie (Z) returns 0, so the bash
   wrapper could loop indefinitely until the zombie was reaped by init.

2. Cmdline matching: the bash command `hermes gateway restart`
   contained the string "hermes gateway", which matched
   `find_gateway_pids()`'s `_scan_gateway_pids` patterns.  This
   caused the bash wrapper itself to be sent SIGTERM during the
   restart flow, which propagated to the child gateway process.
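The zombie race in item 1 can be reproduced on Linux with a short script. This is an illustrative sketch, not code from the repository: `pid_exists` and `zombie_demo` are hypothetical names, and `os.kill(pid, 0)` stands in for the shell's `kill -0`.

```python
import os


def pid_exists(pid: int) -> bool:
    """Python equivalent of `kill -0 <pid>`: probe the PID without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def zombie_demo() -> tuple:
    """Fork a child that exits at once, then probe it before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # Wait until the child has exited, but leave it unreaped (WNOWAIT),
    # so it stays in the process table as a zombie.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    seen_while_zombie = pid_exists(pid)

    # Reap the child; only now does the PID disappear.
    os.waitpid(pid, 0)
    seen_after_reap = pid_exists(pid)
    return seen_while_zombie, seen_after_reap
```

A poller built on this probe keeps seeing the old gateway for as long as nothing reaps it, which matches the indefinite loop described above.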

Replace the bash wrapper with a minimal Python process that uses
fcntl.flock(LOCK_EX) on the existing gateway.lock file.  The kernel
releases flock locks atomically when the owning process dies —
regardless of zombie state.  After the lock is released, the watcher
tries LOCK_EX|LOCK_NB to check whether another gateway already claimed
it (meaning someone else restarted), skipping if so.
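A minimal sketch of the lock-based wait described above, under stated assumptions: the helper names `lock_is_free` and `wait_then_check` are hypothetical, and releasing the lock between the blocking wait and the non-blocking probe is one reading of the commit message, not necessarily what the watcher in gateway/run.py does.

```python
import fcntl
import os


def lock_is_free(fd: int) -> bool:
    """Non-blocking probe: True if nobody else holds the flock right now."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return False  # another process (e.g. a new gateway) already holds it
    fcntl.flock(fd, fcntl.LOCK_UN)
    return True


def wait_then_check(lock_path: str) -> bool:
    """Block until the current holder releases lock_path.

    Returns True if the lock is still unclaimed afterwards, meaning the
    caller should go ahead and restart the gateway.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Blocks while the old gateway holds the lock. The kernel drops the
        # lock when the holder's file descriptors are closed, which happens
        # at process exit even if the process lingers as a zombie.
        fcntl.flock(fd, fcntl.LOCK_EX)
        fcntl.flock(fd, fcntl.LOCK_UN)
        # If someone else grabbed the lock in the meantime, skip the restart.
        return lock_is_free(fd)
    finally:
        os.close(fd)
```

Because the wait is on a kernel lock and not on a PID, it does not depend on who reaps the old process or on how any command line looks to a process scanner.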

Changes:
- gateway/run.py: _launch_detached_restart_command() now spawns
  python3 -c '<watcher>' instead of bash -lc '<shell command>'
When a prior failed model switch wrote `provider: custom` to
config.yaml, `list_authenticated_providers()` would use the literal
string `'custom'` as the provider slug instead of the canonical
`custom:<name>` format. This caused every subsequent session to
fail with `Unknown provider 'custom'`.

Fix: add `current_provider != custom` guard to the base-URL
matching branch so the stale literal value doesn't propagate.
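The guard can be illustrated with a small stand-in. Everything here is an assumption made for the example: the function `resolve_provider_slug`, the shape of the provider entries, and the name normalisation inside `custom_provider_slug` are hypothetical. Only the `custom:<name>` slug format and the `current_provider != "custom"` comparison come from the commit message.

```python
from typing import Dict, List, Optional


def custom_provider_slug(name: str) -> str:
    """Build the canonical custom:<name> slug (normalisation is an assumption)."""
    return "custom:" + name.strip().lower().replace(" ", "-")


def resolve_provider_slug(
    current_provider: Optional[str],
    current_base_url: str,
    custom_providers: List[Dict[str, str]],
) -> Optional[str]:
    """Pick the slug for the configured custom endpoint, if any matches."""
    for entry in custom_providers:
        if entry["base_url"] != current_base_url:
            continue
        # Guard: a bare "custom" left behind by a failed model switch is
        # stale, so it must not be reused as the slug.
        if current_provider and current_provider != "custom":
            return current_provider
        return custom_provider_slug(entry["name"])
    return None
```

With a stale `provider: custom` in the config, this sketch falls through to the canonical slug and does not hand the literal string back to the caller.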

Tests:
- 4 new tests covering the bug scenario
- `custom_provider_slug()` format validation
…ousResearch#12438)

When the gateway is in a crashed/unresponsive state (hung event loop,
crash-loop), 'hermes gateway restart' sends SIGUSR1 but the signal
handler never executes. The 90s drain timeout expires with a warning,
but then 'systemctl start' is a no-op because systemd still sees the
hung process as alive — leaving the gateway permanently stuck.

Fix:
1. After the 90s drain timeout, force-kill (SIGKILL) the stuck
   gateway process via terminate_pid(pid, force=True)
2. Use 'systemctl restart' instead of 'systemctl start' so systemd
   explicitly relaunches the service regardless of prior state

Adds a regression test: test_systemd_restart_force_kills_unresponsive_gateway
verifies terminate_pid(force=True) is called and systemctl restart
is used. Existing tests updated to match 'systemctl restart'.

Fixes NousResearch#12438
@alt-glitch added the type/bug, P2, comp/gateway, and comp/cli labels on Apr 29, 2026
@cxgreat2014 deleted the fix/stale-pid-stuck-gateway-force-kill branch on April 30, 2026 00:29

Development

Successfully merging this pull request may close these issues.

hermes gateway restart fails when gateway is in crashed/unresponsive state
