Bug Description
sudo hermes gateway restart --system always reports failure ("System service did not become active within 60s" twice, after ~245s total) even though the gateway actually restarts successfully and ends up healthy. The wrapper's runtime-status detection silently reads from /root/.hermes/gateway_state.json instead of the configured user's HERMES_HOME, so it can never observe gateway_state == "running".
The result is disruptive in practice: when the wrapper's first 60s wait expires, it falls back to a forced systemctl restart, which SIGTERMs the in-progress new gateway (exit code 1 in journald) before spawning a fresh one. The user sees repeated banner prints, status=1/FAILURE lines, and the CLI claiming the restart failed.
Steps to Reproduce
- Install the gateway as a system-level service running as a non-root user:
sudo hermes gateway install --system --run-as-user $USER
sudo hermes gateway start --system
- Verify the gateway is healthy: run hermes gateway status and check that ~/.hermes/gateway_state.json contains "gateway_state":"running".
- Run sudo hermes gateway restart --system.
Expected Behavior
The wrapper should detect that the new gateway instance has reached gateway_state == "running" (which happens within a few seconds after the SIGUSR1-triggered respawn) and print ✓ System service restarted (PID …). The whole call should complete in under 10 seconds — and indeed it does when HERMES_HOME is propagated:
$ sudo HERMES_HOME=/home/$USER/.hermes hermes gateway restart --system
↻ Updated gateway system service definition to match the current Hermes install
⏳ System service restarting gracefully (PID 1019162)...
⏳ System service process started (PID 1019800); waiting for gateway runtime...
✓ System service restarted (PID 1019800)
# 6 seconds total
Actual Behavior
$ sudo hermes gateway restart --system
⏳ System service restarting gracefully (PID 1018411)...
⏳ System service process started (PID 1018713); waiting for gateway runtime...
⚠ System service did not become active within 60s.
⚠ Graceful restart did not complete within 185s; forcing a service restart...
⏳ System service process started (PID 1019162); waiting for gateway runtime...
⚠ System service did not become active within 60s.
# ~245 seconds total
journald during the failed wait shows the wrapper's forced fallback SIGTERMing the in-progress new instance:
python[1018713]: WARNING gateway.run: Shutdown diagnostic — other hermes processes running:
root ... sudo /home/$USER/.local/bin/hermes gateway restart --system
root ... systemctl restart hermes-gateway
systemd[1]: hermes-gateway.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: Started hermes-gateway.service - Hermes Agent Gateway
After all that, the final instance is healthy: ActiveState=active, ExecMainStatus=0, NRestarts=0.
Affected Component
- Gateway (Telegram/Discord/Slack/WhatsApp)
- CLI (interactive chat): the bug is specifically in the hermes gateway restart --system wrapper
Messaging Platform (if gateway-related)
N/A — bug is in the CLI wrapper for --system scope, independent of platform.
Debug Report
Report https://paste.rs/THGF9
agent.log https://paste.rs/ScXFe
gateway.log https://paste.rs/gdzv6
Operating System
Ubuntu 24.04.4 LTS (kernel 6.17.0-1010-azure)
Python Version
3.11.15
Hermes Version
0.13.0 (2026.5.7)
Additional Logs / Traceback (optional)
Direct probe confirming the path resolution problem under sudo:
$ sudo HOME=/root python3 -c "
import sys
sys.path.insert(0, '/home/$USER/.hermes/hermes-agent')
from gateway.status import _get_pid_path, get_running_pid, read_runtime_status
print('PID path:', _get_pid_path())
print('Running PID:', get_running_pid())
print('Runtime status:', read_runtime_status())
"
PID path: /root/.hermes/gateway.pid
Running PID: None
Runtime status: None
strace on the wrapper confirms it sends exactly one SIGUSR1 to the correct old PID (no signal storm); the apparent flap loop in journald is caused entirely by the wait-detection failure, which triggers the forced systemctl restart fallback and its SIGTERM.
Root Cause Analysis (optional)
In hermes_cli/gateway.py::_wait_for_systemd_service_restart (~line 605), the success check is:
runtime_state = _gateway_runtime_status_for_pid(new_pid)
gateway_state = (runtime_state or {}).get("gateway_state")
if gateway_state == "running":
    print(f"✓ {scope_label} service restarted (PID {new_pid})")
    return True
_gateway_runtime_status_for_pid (hermes_cli/gateway.py:564) calls _read_gateway_runtime_status() → read_runtime_status() (gateway/status.py:439) → _get_runtime_status_path() (gateway/status.py:58), which resolves relative to HERMES_HOME, defaulting to $HOME/.hermes.
When running under sudo (required by _require_root_for_system_service for the --system path), $HOME=/root, so the wrapper reads /root/.hermes/gateway_state.json, which doesn't exist for a system service running as a non-root user. The function returns None, the wait loop never sees gateway_state == "running", and it always hits the 60s timeout — even when the gateway is healthy on the very first poll.
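The resolution behaviour described above can be illustrated with a small stand-in. This is a sketch only: resolve_runtime_status_path is a hypothetical name, not the real helper in gateway/status.py, and it assumes the lookup is simply "HERMES_HOME if set, otherwise $HOME/.hermes". The user name alice is a placeholder.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional


def resolve_runtime_status_path(env: Optional[Mapping[str, str]] = None) -> PurePosixPath:
    """Hypothetical stand-in for the path resolution described in this report."""
    env = os.environ if env is None else env
    hermes_home = env.get("HERMES_HOME")
    if hermes_home:
        base = PurePosixPath(hermes_home)
    else:
        # No HERMES_HOME: fall back to $HOME/.hermes (assumed default).
        base = PurePosixPath(env.get("HOME", "/")) / ".hermes"
    return base / "gateway_state.json"


# Plain sudo: HOME is /root and HERMES_HOME is unset, so root's directory is used.
print(resolve_runtime_status_path({"HOME": "/root"}))

# HERMES_HOME propagated: the service user's state file is used instead.
print(resolve_runtime_status_path({"HOME": "/root", "HERMES_HOME": "/home/alice/.hermes"}))
```

The first call yields /root/.hermes/gateway_state.json, matching the /root/.hermes paths printed by the direct probe in the logs section.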
The wrapper's preflight pid = get_running_pid() or _systemd_main_pid(system=system) (line 2268) accidentally papers over the same root-HOME problem because the systemd MainPID fallback works without HERMES_HOME. The wait loop has no equivalent fallback for the runtime-status file.
The installed unit already has Environment="HERMES_HOME=/home/<user>/.hermes", so the correct value is recoverable from the unit definition.
Proposed Fix (optional)
For the --system scope, derive HERMES_HOME from the installed unit file before the wait loop runs. Either:
- Parse Environment=HERMES_HOME=… from /etc/systemd/system/hermes-gateway.service and set os.environ["HERMES_HOME"] for the duration of the CLI call, or
- Pass an explicit hermes_home argument through systemd_restart → _wait_for_systemd_service_restart → _gateway_runtime_status_for_pid, and have those helpers read from that path instead of relying on $HOME.
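A rough sketch of the first option, extracting HERMES_HOME from the unit file text. hermes_home_from_unit_text is a hypothetical helper, not an existing Hermes function. It uses shlex as an approximation of systemd's quoting rules and does not look at drop-in overrides or EnvironmentFile=.

```python
import shlex
from typing import Optional


def hermes_home_from_unit_text(unit_text: str) -> Optional[str]:
    """Return the last HERMES_HOME set by Environment= lines, or None."""
    found: Optional[str] = None
    for raw_line in unit_text.splitlines():
        line = raw_line.strip()
        if not line.startswith("Environment="):
            continue  # also skips EnvironmentFile= and comments
        value = line[len("Environment="):]
        if not value.strip():
            # In systemd, an empty assignment resets the environment list.
            found = None
            continue
        try:
            # One line may carry several space-separated, optionally quoted assignments.
            assignments = shlex.split(value)
        except ValueError:
            continue  # unbalanced quotes: ignore this line
        for assignment in assignments:
            name, sep, val = assignment.partition("=")
            if sep and name == "HERMES_HOME":
                found = val  # later assignments override earlier ones
    return found


sample = '[Service]\nEnvironment="HERMES_HOME=/home/alice/.hermes"\n'
print(hermes_home_from_unit_text(sample))
```

The wrapper could then read /etc/systemd/system/hermes-gateway.service, call a helper like this, and set os.environ["HERMES_HOME"] only when a value is found, leaving the current behaviour untouched otherwise.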
The same flaw likely affects any other --system code path that calls read_runtime_status() or get_running_pid() from inside a sudo'd CLI (e.g. hermes gateway status --system, the start-limit recovery paths, _recover_pending_systemd_restart).
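For the second option, a minimal sketch of helpers that take the home directory explicitly instead of deriving it from the environment. read_runtime_status_from and wait_for_running are hypothetical names, not existing Hermes functions. The sketch only assumes the gateway_state.json file name and the "gateway_state" key shown in this report, and it leaves out the PID matching that the real wait loop performs.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


def read_runtime_status_from(hermes_home: str) -> Optional[dict]:
    """Read gateway_state.json under an explicit home; None if missing or invalid."""
    path = Path(hermes_home) / "gateway_state.json"
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    return data if isinstance(data, dict) else None


def wait_for_running(
    hermes_home: str,
    timeout: float = 60.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the state file until gateway_state is "running" or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = read_runtime_status_from(hermes_home) or {}
        if status.get("gateway_state") == "running":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Because the path is a parameter, the same helpers would behave identically for user-scope and --system calls, whatever $HOME is under sudo.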
Are you willing to submit a PR for this?