fix(gateway): force-kill unresponsive gateway during systemd restart (#12438) #17675
Closed
cxgreat2014 wants to merge 1 commit into
…ousResearch#12438)

When the gateway is in a crashed/unresponsive state (hung event loop, crash-loop), 'hermes gateway restart' sends SIGUSR1 but the signal handler never executes. The 90s drain timeout expires with a warning, but then 'systemctl start' is a no-op because systemd still sees the hung process as alive, leaving the gateway permanently stuck.

Fix:
1. After the 90s drain timeout, force-kill (SIGKILL) the stuck gateway process via terminate_pid(pid, force=True)
2. Use 'systemctl restart' instead of 'systemctl start' so systemd explicitly relaunches the service regardless of prior state

Adds a regression test: test_systemd_restart_force_kills_unresponsive_gateway verifies terminate_pid(force=True) is called and systemctl restart is used. Existing tests updated to match 'systemctl restart'.

Fixes NousResearch#12438
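The two-step fix described in the commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in hermes_cli/gateway.py: the function name `restart_with_force_kill`, the helper `pid_alive`, the injectable callables, and the unit name used in the usage note are all hypothetical, and the real `terminate_pid(pid, force=True)` helper is approximated here by sending SIGKILL directly.

```python
import os
import signal
import subprocess
import time
from typing import Callable, Sequence


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def restart_with_force_kill(
    pid: int,
    service: str,
    drain_timeout: float = 90.0,
    poll_interval: float = 0.5,
    send_signal: Callable[[int, int], None] = os.kill,
    is_alive: Callable[[int], bool] = pid_alive,
    run: Callable[[Sequence[str]], object] = lambda cmd: subprocess.run(cmd, check=False),
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Ask the process to drain, force-kill it on timeout, then restart the unit.

    Returns True if SIGKILL had to be sent. All names are hypothetical.
    """
    send_signal(pid, signal.SIGUSR1)  # request a graceful drain
    deadline = clock() + drain_timeout
    forced = False
    while is_alive(pid):
        if clock() >= deadline:
            # Drain timed out: the handler never ran, so kill unconditionally.
            try:
                send_signal(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process exited between the check and the kill
            forced = True
            break
        sleep(poll_interval)
    # 'restart' relaunches the unit whether or not systemd considers it active;
    # 'start' would be a no-op for a unit systemd still sees as running.
    run(["systemctl", "restart", service])
    return forced
```

Injecting the signal, liveness, clock, and command-runner callables keeps the sketch testable without a real process or systemd. A call might look like `restart_with_force_kill(pid, "hermes-gateway.service")`, where the unit name is an assumption.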
Not a duplicate: #17588 was the original version of this same PR, which had 4 extra files carried in from other branches (gateway/run.py, hermes_cli/model_switch.py, and their tests). I noticed the clutter, closed it, rebased so the branch contains only my 2-file change, and reopened it as #17675. #17588 has already been closed; compare it with this PR to see the cleaned diff.
Automated hermes-sweeper review: this appears to be implemented on current Evidence:
Thanks for the cleanup on #17675 versus #17588; the current mainline fix took a slightly different route than the direct
What does this PR do?
When the gateway is crashed or unresponsive (hung event loop, crash-loop, SIGKILL-resilient PID), running `hermes gateway restart` silently fails:

- `systemd_restart()` sends SIGUSR1 to the stuck process, but the handler never runs because the event loop is dead
- `systemctl start` is called, but systemd still sees the hung process as alive, so it's a no-op

Type of Change
Related Issue
Fixes #12438
Changes Made
`hermes_cli/gateway.py`, `systemd_restart()`:

- Force-kill: after the drain timeout, calls `terminate_pid(pid, force=True)` to SIGKILL the stuck gateway.
- `systemctl restart`: changed from `systemctl start` to `systemctl restart` so systemd explicitly relaunches the service regardless of prior active state.

`tests/hermes_cli/test_gateway_service.py`:

- `test_systemd_restart_force_kills_unresponsive_gateway`: verifies that when the drain loop times out, `terminate_pid(pid, force=True)` is called and `systemctl restart` is used.
- Existing tests updated to match the new command, from `"start"` to `"restart"`.

How to Test
uv run python -m pytest tests/hermes_cli/test_gateway_service.py -xvs -k "test_systemd_restart_force_kills"

Expected: 1 passed, confirming force-kill and systemctl restart are invoked.
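For readers without the repository at hand, the kind of assertion the regression test makes can be illustrated with `unittest.mock`. This is a hedged sketch only: `restart_after_drain` is a hypothetical stand-in for the post-drain portion of `systemd_restart()`, and `run_systemctl` is an assumed helper name; the real test in tests/hermes_cli/test_gateway_service.py may be structured differently.

```python
from unittest.mock import MagicMock


def restart_after_drain(pid, drained, terminate_pid, run_systemctl):
    """Hypothetical stand-in for what happens once the drain loop ends."""
    if not drained:
        # Drain loop timed out: force-kill the stuck process.
        terminate_pid(pid, force=True)
    # Always use 'restart' so systemd relaunches the unit.
    run_systemctl("restart")


def test_force_kill_on_drain_timeout():
    terminate_pid = MagicMock()
    run_systemctl = MagicMock()

    restart_after_drain(4321, drained=False,
                        terminate_pid=terminate_pid,
                        run_systemctl=run_systemctl)

    # The force-kill must have happened exactly once, with force=True.
    terminate_pid.assert_called_once_with(4321, force=True)
    # The service manager must have been asked to restart, not start.
    run_systemctl.assert_called_once_with("restart")
```

Passing the collaborators in as mocks keeps the check independent of systemd and of any real gateway process.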
Checklist