Bug Description
The /restart slash command in the gateway hardcodes detached=True, via_service=False when calling request_restart(). When the gateway is managed by systemd (as a system-level or user-level service), this can leave the gateway process dead with no automatic recovery.
Hermes Version: 0.8.0 (2026.4.8)
Environment: Linux, systemd system-level service (hermes-gateway.service), Tencent Cloud
Root Cause Analysis
In gateway/run.py, the /restart handler at line ~3947:
async def _handle_restart_command(self, event: MessageEvent) -> str:
    """Handle /restart command - drain active work, then restart the gateway."""
    # ...
    self.request_restart(detached=True, via_service=False)  # ← hardcoded
This hardcoded path triggers the following chain:
- request_restart(detached=True, via_service=False) → sets _restart_detached=True, _restart_via_service=False
- stop(detached_restart=True, service_restart=False) is called
- In _stop_impl() (line ~1887), _restart_detached=True triggers _launch_detached_restart_command(), which spawns a hermes gateway restart process in a new session via setsid/Popen(start_new_session=True)
- The detached child process waits for the parent PID to exit, then runs hermes gateway restart
- hermes gateway restart regenerates the systemd service file via hermes gateway install --system, then runs systemctl restart hermes-gateway
- But here's the problem: by the time the child runs systemctl restart, systemd has already detected that the original process exited (exit code 0, not the special exit code 75), so it may hit StartLimitBurst, or the service file regeneration may strip custom environment overrides
- The key issue: since via_service=False, the gateway exits with code 0 instead of 75 (GATEWAY_SERVICE_RESTART_EXIT_CODE). systemd sees a clean exit, not a restart request. With Restart=always, it may try to restart, but the detached child is simultaneously trying to regenerate the service file, causing a race condition
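To make the detached path in the chain above concrete, here is a minimal sketch of a "wait for the parent PID to exit, then run a command" launcher. It is not the Hermes implementation: the helper names build_detached_restart_argv and launch_detached are hypothetical, and the only detail taken from this report is the use of Popen(start_new_session=True).

```python
import subprocess
import sys

# Script run by the detached child. Hypothetical illustration only.
_WAIT_AND_RUN = """
import os, subprocess, sys, time
pid = int(sys.argv[1])
while True:
    try:
        os.kill(pid, 0)          # signal 0 only probes whether the PID exists
    except ProcessLookupError:
        break                    # parent is gone, proceed with the restart
    except PermissionError:
        pass                     # PID exists but belongs to another user
    time.sleep(0.2)
sys.exit(subprocess.call(sys.argv[2:]))
"""


def build_detached_restart_argv(parent_pid, restart_cmd):
    """Build the argv for a child that waits for parent_pid, then runs restart_cmd."""
    return [sys.executable, "-c", _WAIT_AND_RUN, str(parent_pid), *restart_cmd]


def launch_detached(parent_pid, restart_cmd):
    """Start the waiter in its own session so it outlives the parent's terminal/session."""
    return subprocess.Popen(
        build_detached_restart_argv(parent_pid, restart_cmd),
        start_new_session=True,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

A call such as launch_detached(os.getpid(), ["hermes", "gateway", "restart"]) would mirror the flow described above. Note that a new session does not change the process's cgroup, so under systemd the child is still tracked as part of the service unit.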
Why via_service=False is wrong here: The codebase already has a well-designed service-restart exit path. When via_service=True, the gateway sets _exit_code = GATEWAY_SERVICE_RESTART_EXIT_CODE (75), which tells systemd to restart the process cleanly via the existing service configuration — no service file regeneration, no PID races, no environment override loss.
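The exit-code selection described here can be reduced to a small pure function. This is a sketch based only on the snippets quoted in this report; select_exit_code is a hypothetical name, and the fallback of 0 reflects the clean exit described above rather than a line copied from the codebase.

```python
GATEWAY_SERVICE_RESTART_EXIT_CODE = 75  # EX_TEMPFAIL, as defined in gateway/restart.py


def select_exit_code(restart_requested: bool, via_service: bool) -> int:
    """Return the process exit code for a given restart request.

    Only a restart requested through the service path yields 75;
    every other combination exits cleanly with 0.
    """
    if restart_requested and via_service:
        return GATEWAY_SERVICE_RESTART_EXIT_CODE
    return 0
```

With the current hardcoded arguments the second branch is always taken, which is why systemd never sees the restart exit code.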
Steps to Reproduce
- Install gateway as systemd service: hermes gateway install --system
- Start: systemctl start hermes-gateway
- Send /restart to the bot via Telegram
- Observe: gateway process exits, may or may not restart depending on timing
- If StartLimitBurst is exceeded (default 5 in 600s), gateway stays dead
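To check the outcome of the steps above without reading logs by hand, the unit's state can be queried with systemctl show. The sketch below assumes the unit is named hermes-gateway as in this report; the function names are hypothetical. For a user-level service, --user would need to be added to the systemctl arguments.

```python
import subprocess


def parse_systemctl_show(text: str) -> dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show`."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not KEY=VALUE
            props[key] = value
    return props


def gateway_unit_state(unit: str = "hermes-gateway") -> dict[str, str]:
    """Return selected properties of the unit as a dict of strings."""
    out = subprocess.run(
        ["systemctl", "show", unit,
         "--property=ActiveState,Result,ExecMainStatus,NRestarts"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_systemctl_show(out)
```

If the start limit was exceeded, Result would be expected to read start-limit-hit and ActiveState to read failed. ExecMainStatus shows the exit status of the last main process, which should be 0 rather than 75 if this bug is in effect.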
Expected Behavior
The /restart command should detect whether the gateway is running under systemd and use the service-managed restart path:
- If systemd: exit with code 75 → systemd handles the restart cleanly
- If not systemd: fall back to the detached spawn approach
Actual Behavior
Always uses the detached spawn path (detached=True, via_service=False), which regenerates service files, creates PID races, and can exhaust systemd's start limit.
Relevant Code
gateway/run.py line ~3947:
self.request_restart(detached=True, via_service=False)
gateway/run.py line ~1385 (request_restart):
def request_restart(self, *, detached: bool = False, via_service: bool = False) -> bool:
gateway/run.py line ~1939 (exit path selection):
if self._restart_requested and self._restart_via_service:
    self._exit_code = GATEWAY_SERVICE_RESTART_EXIT_CODE
gateway/restart.py:
GATEWAY_SERVICE_RESTART_EXIT_CODE = 75 # EX_TEMPFAIL
Proposed Fix
The /restart handler should auto-detect the systemd environment and pass via_service=True when appropriate:
async def _handle_restart_command(self, event: MessageEvent) -> str:
    """Handle /restart command - drain active work, then restart the gateway."""
    if self._restart_requested or self._draining:
        count = self._running_agent_count()
        if count:
            return f"⏳ Draining {count} active agent(s) before restart..."
        return "⏳ Gateway restart already in progress..."
    active_agents = self._running_agent_count()
    # Detect if running under systemd and use service-managed restart path
    via_service = _is_systemd_managed()
    self.request_restart(detached=not via_service, via_service=via_service)
    if active_agents:
        return f"⏳ Draining {active_agents} active agent(s) before restart..."
    return "♻ Restarting gateway..."
A helper to detect systemd could check:
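- os.environ.get("INVOCATION_ID") (set by systemd for processes it starts as part of a unit)
- Or os.path.exists("/run/systemd/system") (this only shows that the host was booted with systemd, not that this process is a service)
- Or check whether PID 1 is systemd (likewise a host-level check)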
This ensures the gateway exits with code 75 when under systemd, letting the service manager handle restart cleanly without file regeneration or PID races.
Related Issues
- /restart command; implementation exists but has this bug
Environment
- OS: Linux (Tencent Cloud, systemd system-level service)
- Python: 3.x
- Hermes: 0.8.0 (2026.4.8)
- Service type: Type=simple, Restart=always, RestartSec=2, KillMode=mixed
- Messaging Platform: Telegram (but affects all platforms)