fix(gateway): force-kill unresponsive gateway during systemd restart (#12438) by cxgreat2014 · Pull Request #17588 · NousResearch/hermes-agent

cxgreat2014 · 2026-04-29T18:11:51Z

What does this PR do?

When the gateway has crashed or is unresponsive (hung event loop, crash-loop, SIGKILL-resilient PID), running `hermes gateway restart` silently fails:

1. `systemd_restart()` sends SIGUSR1 to the stuck process, but the handler never runs because the event loop is dead.
2. The 90s drain timeout expires with only a warning message.
3. `systemctl start` is then called, but systemd still sees the hung process as alive, so the call is a no-op.
4. The user gets no recovery path; the CLI restart stays stuck forever.
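To illustrate the first step of that failure, here is a minimal sketch of why a signal handler registered on an event loop cannot fire while the loop is blocked. It assumes an asyncio-style loop and a handler registered with `loop.add_signal_handler`; the PR does not show how the gateway registers its SIGUSR1 handler, so `demo_blocked_loop` is purely illustrative and Unix-only.

```python
import asyncio
import os
import signal
import time


async def demo_blocked_loop() -> tuple[bool, bool]:
    """Return (handler ran while loop was blocked, handler ran after yielding)."""
    loop = asyncio.get_running_loop()
    fired: list[bool] = []

    # The callback is scheduled by the event loop, not run inside the
    # OS-level signal handler, so it needs the loop to be turning.
    loop.add_signal_handler(signal.SIGUSR1, lambda: fired.append(True))

    os.kill(os.getpid(), signal.SIGUSR1)

    # Simulate a hung event loop: a blocking call that never yields.
    time.sleep(0.2)
    ran_while_blocked = bool(fired)

    # Yield control so the loop can finally dispatch the pending callback.
    await asyncio.sleep(0.1)
    ran_after_yield = bool(fired)

    loop.remove_signal_handler(signal.SIGUSR1)
    return ran_while_blocked, ran_after_yield


if __name__ == "__main__":
    print(asyncio.run(demo_blocked_loop()))
```

A loop that never yields again behaves like the blocked phase indefinitely, which is consistent with the drain request going unanswered until the timeout.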

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Related Issue

Fixes #12438

Changes Made

`hermes_cli/gateway.py` — `systemd_restart()`

- Force-kill on drain timeout: After the 90s drain loop times out (process still alive), call `terminate_pid(pid, force=True)` to SIGKILL the stuck gateway.
- Use `systemctl restart`: Changed from `systemctl start` to `systemctl restart` so systemd explicitly relaunches the service regardless of prior active state.
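The two changes above can be sketched as a drain loop with a force-kill fallback. This is not the code from `hermes_cli/gateway.py`: the function name `restart_gateway`, the injected callables, the poll interval, and the unit name `hermes-gateway.service` are all assumptions made for illustration. Only the 90s timeout, the `terminate_pid(pid, force=True)` call shape, and the switch to `systemctl restart` come from the PR text.

```python
import time
from typing import Callable, Sequence


def restart_gateway(
    pid: int,
    request_drain: Callable[[int], None],       # e.g. sends SIGUSR1 (assumed)
    is_alive: Callable[[int], bool],            # hypothetical liveness probe
    terminate_pid: Callable[..., None],         # called as terminate_pid(pid, force=True)
    run: Callable[[Sequence[str]], object],     # e.g. subprocess.run (assumed)
    service: str = "hermes-gateway.service",    # hypothetical unit name
    drain_timeout: float = 90.0,
    poll_interval: float = 0.5,
) -> bool:
    """Ask the gateway to drain, force-kill it on timeout, then restart.

    Returns True if the process had to be force-killed.
    """
    request_drain(pid)

    # Wait for the old process to exit on its own, up to the drain timeout.
    deadline = time.monotonic() + drain_timeout
    while is_alive(pid) and time.monotonic() < deadline:
        time.sleep(poll_interval)

    # Still alive after the timeout: the graceful path failed, so escalate.
    forced = is_alive(pid)
    if forced:
        terminate_pid(pid, force=True)

    # "restart" relaunches the unit even if systemd still considers it active.
    run(["systemctl", "restart", service])
    return forced
```

Injecting the callables keeps the sketch testable without a real systemd or a real hung process; a regression test can pass an `is_alive` that always returns True and a zero timeout.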

`tests/hermes_cli/test_gateway_service.py`

- New test: `test_systemd_restart_force_kills_unresponsive_gateway` verifies that when the drain loop times out, `terminate_pid(pid, force=True)` is called and `systemctl restart` is used.
- Updated existing tests: Mock assertions updated from "start" to "restart".

How to Test

uv run python -m pytest tests/hermes_cli/test_gateway_service.py -xvs -k "test_systemd_restart_force_kills"

Expected: 1 passed, confirming force-kill and systemctl restart are invoked.

Checklist

I have read the contributing guidelines
I have followed the coding style of the project
I have added or updated tests where applicable
I have updated relevant documentation
I have checked my code and corrected any misspellings
My changes generate no new warnings or errors
All new and existing tests pass
This PR has linting, formatting, and type-checking CI passes (where applicable)
I have performed a self-review of my own code

… Python process

The detached /restart mechanism spawned a bash shell that polled the old gateway's PID with `kill -0` and then ran `hermes gateway restart`. This had two race conditions in container environments:

1. Zombie PID: `kill -0` on a zombie (Z) returns 0, so the bash wrapper could loop indefinitely until the zombie was reaped by init.
2. Cmdline matching: the bash command `hermes gateway restart` contained the string "hermes gateway", which matched `find_gateway_pids()`'s `_scan_gateway_pids` patterns. This caused the bash wrapper itself to be sent SIGTERM during the restart flow, which propagated to the child gateway process.

Replace the bash wrapper with a minimal Python process that uses `fcntl.flock(LOCK_EX)` on the existing gateway.lock file. The kernel releases flock locks atomically when the owning process dies, regardless of zombie state. After the lock is released, the watcher tries `LOCK_EX|LOCK_NB` to check whether another gateway already claimed it (meaning someone else restarted), skipping if so.

Changes:
- gateway/run.py: `_launch_detached_restart_command()` now spawns `python3 -c '<watcher>'` instead of `bash -lc '<shell command>'`
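The flock-based watcher described in this commit message could look roughly like the sketch below. It is one reading of the description, not the watcher shipped in `gateway/run.py`: the function names `lock_is_free` and `wait_for_old_gateway` are hypothetical, and the exact sequencing of the blocking and non-blocking lock attempts in the real code is an assumption.

```python
import fcntl
import os
import tempfile


def lock_is_free(lock_path: str) -> bool:
    """Non-blocking probe: True if nobody currently holds the lock."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # another process (e.g. a new gateway) holds it
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)


def wait_for_old_gateway(lock_path: str) -> bool:
    """Block until the current lock holder exits.

    Returns True if the lock is still unclaimed afterwards, meaning this
    watcher should perform the restart; False if someone else already did.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Blocks while the old gateway holds the lock. The kernel drops the
        # lock when the holder's descriptors close, including on process death.
        fcntl.flock(fd, fcntl.LOCK_EX)
        fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
    return lock_is_free(lock_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "gateway.lock")
        print(wait_for_old_gateway(path))
```

Unlike polling with `kill -0`, waiting on the lock does not depend on the PID disappearing from the process table, and the watcher's command line no longer needs to contain the string "hermes gateway".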

When a prior failed model switch wrote `provider: custom` to config.yaml, `list_authenticated_providers()` would use the literal string `'custom'` as the provider slug instead of the canonical `custom:<name>` format. This caused every subsequent session to fail with `Unknown provider 'custom'`.

Fix: add a `current_provider != custom` guard to the base-URL matching branch so the stale literal value doesn't propagate.

Tests:
- 4 new tests covering the bug scenario
- `custom_provider_slug()` format validation

…ousResearch#12438)

When the gateway is in a crashed/unresponsive state (hung event loop, crash-loop), 'hermes gateway restart' sends SIGUSR1 but the signal handler never executes. The 90s drain timeout expires with a warning, but then 'systemctl start' is a no-op because systemd still sees the hung process as alive, leaving the gateway permanently stuck.

Fix:
1. After the 90s drain timeout, force-kill (SIGKILL) the stuck gateway process via terminate_pid(pid, force=True)
2. Use 'systemctl restart' instead of 'systemctl start' so systemd explicitly relaunches the service regardless of prior state

Adds a regression test: test_systemd_restart_force_kills_unresponsive_gateway verifies terminate_pid(force=True) is called and systemctl restart is used. Existing tests updated to match 'systemctl restart'.

Fixes NousResearch#12438

cxgreat2014 added 4 commits April 27, 2026 15:44

test: add tests for gateway restart fcntl file-lock watcher

aa5e5a9

alt-glitch added the type/bug, P2, comp/gateway, and comp/cli labels Apr 29, 2026

cxgreat2014 closed this Apr 30, 2026

cxgreat2014 deleted the fix/stale-pid-stuck-gateway-force-kill branch April 30, 2026 00:29

alt-glitch mentioned this pull request Apr 30, 2026

fix(gateway): force-kill unresponsive gateway during systemd restart (#12438) #17675

Closed

13 tasks

cxgreat2014 wants to merge 4 commits into NousResearch:main from cxgreat2014:fix/stale-pid-stuck-gateway-force-kill
