Issue: hermes gateway restart fails when gateway is in a crashed/unresponsive state
Repo: https://github.com/NousResearch/hermes-agent
Problem
hermes gateway restart relies on sending SIGUSR1 to the gateway process, which triggers an asyncio signal handler that calls runner.request_restart(detached=False, via_service=True). This initiates a graceful drain — the gateway waits for active agents to finish (up to 90 seconds) before exiting.
This design fails completely when the gateway is in a crashed or unresponsive state:
- The asyncio event loop is not running normally (crashed, frozen, or in a crash-loop)
- The SIGUSR1 handler never executes: asyncio dispatches signal handlers as callbacks on the event loop, so a dead or blocked loop can neither run the handler nor create tasks for it
- The graceful restart never happens, and launchd/systemd is never triggered to restart the service
- The user is left with a dead gateway and no way to recover from the CLI
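The failure mode in the list above can be reproduced outside the gateway. The following is a minimal sketch, assuming the handler is registered with loop.add_signal_handler (the issue only says "asyncio signal handler"); the demo function and the time.sleep call that stands in for a frozen gateway are illustrative and not taken from the hermes code.

```python
import asyncio
import os
import signal
import time


async def demo(block_seconds: float = 0.2) -> dict:
    """Send SIGUSR1 to this process while its event loop is blocked."""
    loop = asyncio.get_running_loop()
    handled_at = []

    # asyncio runs this callback from the event loop, not at signal delivery time.
    loop.add_signal_handler(
        signal.SIGUSR1, lambda: handled_at.append(time.monotonic())
    )
    try:
        sent_at = time.monotonic()
        os.kill(os.getpid(), signal.SIGUSR1)

        # Stand-in for a frozen gateway: a blocking call that never yields to the loop.
        time.sleep(block_seconds)
        ran_while_blocked = bool(handled_at)

        # Yield control; only now can the loop dispatch the pending handler.
        await asyncio.sleep(0.05)
    finally:
        loop.remove_signal_handler(signal.SIGUSR1)

    return {
        "ran_while_blocked": ran_while_blocked,
        "ran_after_unblock": bool(handled_at),
        "delay": (handled_at[0] - sent_at) if handled_at else None,
    }


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

If the loop never unblocks, the handler never runs at all, which matches the behaviour described here. This sketch is Unix-only, since SIGUSR1 and add_signal_handler are not available on Windows.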
Evidence
The gateway process was observed crash-looping multiple times. During this time, hermes gateway restart would have been ineffective because SIGUSR1 would not have been processed.
The only recovery mechanism that worked was kill -HUP (or kill -TERM), which terminated the process even though its event loop was not responding.
Expected Behavior
When a user runs hermes gateway restart, they expect the gateway to restart, even if it is unresponsive. Today the command waits on a graceful drain that never starts and then fails silently.
Proposed Solutions
Option 1: Add hermes gateway force-restart (recommended)
Bypass drain entirely: send SIGTERM directly, then launchctl kickstart / systemctl restart.
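A rough sketch of what Option 1 might look like, not the project's implementation. The function names, the dry_run flag, and the service names used in the usage lines are hypothetical; it also assumes a per-user launchd agent (gui/<uid>/<label>) and a systemd user unit, which the issue does not confirm.

```python
import os
import signal
import subprocess
import sys


def service_restart_command(platform: str, service: str, uid: int) -> list[str]:
    """Build the service-manager command that restarts the gateway."""
    if platform == "darwin":
        # -k kills a running instance before starting the service again.
        return ["launchctl", "kickstart", "-k", f"gui/{uid}/{service}"]
    # Assumption: a systemd user unit; a system-wide unit would drop --user.
    return ["systemctl", "--user", "restart", service]


def force_restart(pid: int, service: str, dry_run: bool = False) -> list[str]:
    """Skip the SIGUSR1 drain: SIGTERM the gateway, then restart the service."""
    cmd = service_restart_command(sys.platform, service, os.getuid())
    if dry_run:
        return cmd  # report what would run without touching any process
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # already gone, for example between crash-loop iterations
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical pid and unit name, printed only.
    print(force_restart(12345, "hermes-gateway.service", dry_run=True))
```

Because active agents are not drained on this path, it probably belongs behind an explicit subcommand rather than replacing the default restart.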
Option 2: Make restart auto-fallback to force-kill
After sending SIGUSR1, if the gateway doesn't exit within 30s, fall back to SIGTERM plus a service restart.
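One way the fallback in Option 2 could be structured, as a sketch under stated assumptions: all helper names and the term_wait parameter are hypothetical, only the 30s grace period comes from the proposal, and the follow-up service restart is left out for brevity.

```python
import os
import signal
import time
from typing import Callable, Optional


def pid_alive(pid: int) -> bool:
    """Return True if a process with this pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def send(pid: int, sig: int) -> bool:
    """Send a signal; return False if the process is already gone."""
    try:
        os.kill(pid, sig)
        return True
    except ProcessLookupError:
        return False


def wait_for_exit(alive: Callable[[], bool], timeout: float, poll: float = 0.1) -> bool:
    """Poll until the process exits or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not alive():
            return True
        time.sleep(poll)
    return not alive()


def restart_with_fallback(
    pid: int,
    grace: float = 30.0,
    term_wait: float = 5.0,
    alive: Optional[Callable[[], bool]] = None,
) -> str:
    """Try the graceful SIGUSR1 path first, then escalate to SIGTERM."""
    alive = alive or (lambda: pid_alive(pid))
    if not send(pid, signal.SIGUSR1):
        return "not-running"
    if wait_for_exit(alive, grace):
        return "graceful"
    send(pid, signal.SIGTERM)  # the handler never ran or the drain stalled
    return "forced" if wait_for_exit(alive, term_wait) else "stuck"
```

The returned string lets the CLI tell the user which path was taken, so a forced restart is no longer silent. A "stuck" result would be the point at which to consider SIGKILL, which is outside what the issue proposes.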
Option 3: Self-healing watchdog
A built-in watchdog that detects an unresponsive event loop and restarts the gateway itself.
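A minimal sketch of the watchdog idea, assuming a heartbeat design: a coroutine on the event loop refreshes a timestamp, and a plain thread checks how stale it is. The LoopWatchdog class, its parameters, and the demo are hypothetical; in a real gateway the on_stall callback might send SIGTERM to the process so that launchd/systemd restarts it.

```python
import asyncio
import threading
import time
from typing import Callable


class LoopWatchdog:
    """Flags an event loop whose heartbeat has gone stale."""

    def __init__(self, max_stall: float, on_stall: Callable[[float], None],
                 interval: float = 0.05) -> None:
        self.max_stall = max_stall
        self.on_stall = on_stall
        self.interval = interval
        self._last_beat = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._watch, daemon=True)

    async def heartbeat(self) -> None:
        """Runs on the event loop; stops updating when the loop is blocked."""
        while not self._stop.is_set():
            self._last_beat = time.monotonic()
            await asyncio.sleep(self.interval)

    def _watch(self) -> None:
        """Runs on a plain thread, so it keeps working when the loop does not."""
        while not self._stop.wait(self.interval):
            stalled_for = time.monotonic() - self._last_beat
            if stalled_for > self.max_stall:
                self.on_stall(stalled_for)
                return

    def start(self) -> None:
        self._last_beat = time.monotonic()
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


async def demo() -> tuple:
    stalls: list = []
    dog = LoopWatchdog(max_stall=0.3, on_stall=stalls.append)
    beat = asyncio.create_task(dog.heartbeat())
    dog.start()
    await asyncio.sleep(0.1)       # healthy phase: the heartbeat keeps up
    seen_while_healthy = list(stalls)
    time.sleep(1.0)                # frozen phase: a blocking call starves the loop
    dog.stop()
    beat.cancel()
    return seen_while_healthy, stalls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The check runs on a separate thread because anything scheduled on the loop itself would freeze along with it. A thread-based watchdog still cannot help if the whole process is wedged, for example by a blocking call that holds the GIL, so this complements Options 1 and 2 rather than replacing them.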
Code References
- Signal handler: gateway/run.py:10322-10324
- macOS restart: hermes_cli/gateway.py:1904
- systemd restart: hermes_cli/gateway.py:1502
Severity
Medium — complete service outage with no CLI recovery path.