fix(gateway): force-kill unresponsive gateway during systemd restart (#12438) #17588
Closed
cxgreat2014 wants to merge 4 commits into
… Python process

The detached /restart mechanism spawned a bash shell that polled the old gateway's PID with `kill -0` and then ran `hermes gateway restart`. This had two race conditions in container environments:

1. Zombie PID: `kill -0` on a zombie (Z) returns 0, so the bash wrapper could loop indefinitely until the zombie was reaped by init.
2. Cmdline matching: the bash command `hermes gateway restart` contained the string "hermes gateway", which matched `find_gateway_pids()`'s `_scan_gateway_pids` patterns. This caused the bash wrapper itself to be sent SIGTERM during the restart flow, which propagated to the child gateway process.

Replace the bash wrapper with a minimal Python process that uses fcntl.flock(LOCK_EX) on the existing gateway.lock file. The kernel releases flock locks atomically when the owning process dies, regardless of zombie state. After the lock is released, the watcher tries LOCK_EX|LOCK_NB to check whether another gateway already claimed it (meaning someone else restarted), skipping if so.

Changes:
- gateway/run.py: _launch_detached_restart_command() now spawns python3 -c '<watcher>' instead of bash -lc '<shell command>'
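The lock-based wait described above could look roughly like the following. This is a minimal sketch, not the watcher from the commit: the function names `wait_for_lock_release` and `watch_and_restart`, the release-then-probe ordering, and passing the restart command as a list are all assumptions made for illustration.

```python
import fcntl
import os
import subprocess


def wait_for_lock_release(lock_path: str) -> bool:
    """Block until the current holder drops its flock on lock_path.

    Returns True if the lock is still free afterwards, or False if another
    process (for example a gateway started by someone else) claimed it first.
    Hypothetical helper: the name and return convention are assumptions.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Blocks while the old gateway holds the lock. The kernel drops the
        # lock when the holder's descriptors close at exit, so an unreaped
        # zombie does not keep this call waiting.
        fcntl.flock(fd, fcntl.LOCK_EX)
        fcntl.flock(fd, fcntl.LOCK_UN)
        try:
            # Non-blocking probe: fails at once if someone else now holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        return True
    finally:
        # Closing the descriptor releases any lock this watcher still holds,
        # so the restarted gateway can acquire it.
        os.close(fd)


def watch_and_restart(lock_path: str, restart_cmd: list[str]) -> bool:
    """Run restart_cmd once the lock is free; skip if it was claimed."""
    if not wait_for_lock_release(lock_path):
        return False
    subprocess.run(restart_cmd, check=False)
    return True
```

A caller might invoke `watch_and_restart(path_to_gateway_lock, ["hermes", "gateway", "restart"])` from a detached process. Because the watcher never inspects PIDs or command lines, neither the zombie case nor the cmdline-matching case applies to it.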
When a prior failed model switch wrote `provider: custom` to config.yaml, `list_authenticated_providers()` would use the literal string `'custom'` as the provider slug instead of the canonical `custom:<name>` format. This caused every subsequent session to fail with `Unknown provider 'custom'`.

Fix: add `current_provider != custom` guard to the base-URL matching branch so the stale literal value doesn't propagate.

Tests:
- 4 new tests covering the bug scenario
- `custom_provider_slug()` format validation
…ousResearch#12438)

When the gateway is in a crashed/unresponsive state (hung event loop, crash-loop), 'hermes gateway restart' sends SIGUSR1 but the signal handler never executes. The 90s drain timeout expires with a warning, but then 'systemctl start' is a no-op because systemd still sees the hung process as alive, leaving the gateway permanently stuck.

Fix:
1. After the 90s drain timeout, force-kill (SIGKILL) the stuck gateway process via terminate_pid(pid, force=True)
2. Use 'systemctl restart' instead of 'systemctl start' so systemd explicitly relaunches the service regardless of prior state

Adds a regression test: test_systemd_restart_force_kills_unresponsive_gateway verifies terminate_pid(force=True) is called and systemctl restart is used. Existing tests updated to match 'systemctl restart'.

Fixes NousResearch#12438
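The two-step fix can be sketched as below. This is an illustration under stated assumptions rather than the code in `systemd_restart()`: `restart_with_force_kill` and `pid_alive` are hypothetical names, `os.kill(pid, SIGKILL)` stands in for the project's `terminate_pid(pid, force=True)`, and the restart command is passed in because the exact unit name and systemctl flags are not given here.

```python
import os
import signal
import subprocess
import time


def pid_alive(pid: int) -> bool:
    """Best-effort liveness probe (hypothetical helper).

    Signal 0 also succeeds for zombies, so treat this as a rough check only.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def restart_with_force_kill(pid, restart_cmd, drain_timeout=90.0,
                            poll_interval=0.5, kill=os.kill,
                            alive=pid_alive, run=subprocess.run):
    """Ask the gateway to drain, SIGKILL it if it never exits, then restart.

    Returns True if a force-kill was needed. kill, alive and run are
    injectable so the flow can be exercised without real processes.
    """
    kill(pid, signal.SIGUSR1)  # graceful drain request
    deadline = time.monotonic() + drain_timeout
    while time.monotonic() < deadline and alive(pid):
        time.sleep(poll_interval)

    forced = False
    if alive(pid):
        # The handler never ran (hung event loop), so stop waiting.
        kill(pid, signal.SIGKILL)
        forced = True

    # "restart" rather than "start": systemd relaunches the unit even if it
    # still considers the old process active.
    run(restart_cmd, check=False)
    return forced
```

With stubs in place of `kill`, `alive`, and `run`, the same function covers both the healthy path (no SIGKILL) and the hung path (SIGKILL followed by the restart command), which mirrors what the regression test is described as verifying.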
What does this PR do?
When the gateway is crashed or unresponsive (hung event loop, crash-loop, SIGKILL-resilient PID), running `hermes gateway restart` silently fails:
- `systemd_restart()` sends SIGUSR1 to the stuck process; the handler never runs because the event loop is dead
- `systemctl start` is called, but systemd still sees the hung process as alive, so it's a no-op

Type of Change
Related Issue
Fixes #12438
Changes Made
`hermes_cli/gateway.py`: `systemd_restart()`
- Force-kill: after the 90s drain timeout, calls `terminate_pid(pid, force=True)` to SIGKILL the stuck gateway.
- `systemctl restart`: changed from `systemctl start` to `systemctl restart` so systemd explicitly relaunches the service regardless of prior active state.

`tests/hermes_cli/test_gateway_service.py`
- `test_systemd_restart_force_kills_unresponsive_gateway`: verifies that when the drain loop times out, `terminate_pid(pid, force=True)` is called and `systemctl restart` is used.
- Existing tests updated from `"start"` to `"restart"`.

How to Test
`uv run python -m pytest tests/hermes_cli/test_gateway_service.py -xvs -k "test_systemd_restart_force_kills"`

Expected: 1 passed, confirming force-kill and systemctl restart are invoked.
Checklist