[Bug]: gateway restart --system always reports failure (60s timeout × 2) — wrapper reads runtime status from root's HERMES_HOME #22035

@Zxilly

Bug Description

sudo hermes gateway restart --system always reports failure ("System service did not become active within 60s" twice, after ~245s total) even though the gateway actually restarts successfully and ends up healthy. The wrapper's runtime-status detection silently reads from /root/.hermes/gateway_state.json instead of the configured user's HERMES_HOME, so it can never observe gateway_state == "running".

The end state is broken in practice: when the wrapper's first 60s wait expires it falls back to a forced systemctl restart, which SIGTERMs the in-progress new gateway (exit code 1 in journald) before spawning a fresh one. The user sees repeated banner-prints, status=1/FAILURE lines, and the CLI claiming the restart failed.

Steps to Reproduce

  1. Install the gateway as a system-level service running as a non-root user:
    sudo hermes gateway install --system --run-as-user $USER
    sudo hermes gateway start --system
    
  2. Verify the gateway is healthy: hermes gateway status and check that ~/.hermes/gateway_state.json has "gateway_state":"running".
  3. Run sudo hermes gateway restart --system.

Expected Behavior

The wrapper should detect that the new gateway instance has reached gateway_state == "running" (which happens within a few seconds after the SIGUSR1-triggered respawn) and print ✓ System service restarted (PID …). The whole call should complete in under 10 seconds — and indeed it does when HERMES_HOME is propagated:

$ sudo HERMES_HOME=/home/$USER/.hermes hermes gateway restart --system
↻ Updated gateway system service definition to match the current Hermes install
⏳ System service restarting gracefully (PID 1019162)...
⏳ System service process started (PID 1019800); waiting for gateway runtime...
✓ System service restarted (PID 1019800)
# 6 seconds total

Actual Behavior

$ sudo hermes gateway restart --system
⏳ System service restarting gracefully (PID 1018411)...
⏳ System service process started (PID 1018713); waiting for gateway runtime...
⚠ System service did not become active within 60s.
⚠ Graceful restart did not complete within 185s; forcing a service restart...
⏳ System service process started (PID 1019162); waiting for gateway runtime...
⚠ System service did not become active within 60s.
# ~245 seconds total

journald during the failed wait shows the wrapper's forced fallback SIGTERMing the in-progress new instance:

python[1018713]: WARNING gateway.run: Shutdown diagnostic — other hermes processes running:
                 root  ... sudo /home/$USER/.local/bin/hermes gateway restart --system
                 root  ... systemctl restart hermes-gateway
systemd[1]: hermes-gateway.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: Started hermes-gateway.service - Hermes Agent Gateway

After all that, the final instance is healthy: ActiveState=active, ExecMainStatus=0, NRestarts=0.

Affected Component

  • Gateway (Telegram/Discord/Slack/WhatsApp)
  • CLI (interactive chat) — bug is specifically in the hermes gateway restart --system wrapper

Messaging Platform (if gateway-related)

N/A — bug is in the CLI wrapper for --system scope, independent of platform.

Debug Report

Report       https://paste.rs/THGF9
agent.log    https://paste.rs/ScXFe
gateway.log  https://paste.rs/gdzv6

Operating System

Ubuntu 24.04.4 LTS (kernel 6.17.0-1010-azure)

Python Version

3.11.15

Hermes Version

0.13.0 (2026.5.7)

Additional Logs / Traceback (optional)

Direct probe confirming the path resolution problem under sudo:

$ sudo HOME=/root python3 -c "
import sys
sys.path.insert(0, '/home/$USER/.hermes/hermes-agent')
from gateway.status import _get_pid_path, get_running_pid, read_runtime_status
print('PID path:', _get_pid_path())
print('Running PID:', get_running_pid())
print('Runtime status:', read_runtime_status())
"
PID path: /root/.hermes/gateway.pid
Running PID: None
Runtime status: None

strace on the wrapper confirms it sends exactly one SIGUSR1 to the correct old PID (no signal storm). The apparent flap loop in journald is caused entirely by the wait-detection failure, which triggers the forced systemctl restart fallback and its SIGTERM.

Root Cause Analysis (optional)

In hermes_cli/gateway.py::_wait_for_systemd_service_restart (~line 605), the success check is:

runtime_state = _gateway_runtime_status_for_pid(new_pid)
gateway_state = (runtime_state or {}).get("gateway_state")
if gateway_state == "running":
    print(f"✓ {scope_label} service restarted (PID {new_pid})")
    return True

_gateway_runtime_status_for_pid (hermes_cli/gateway.py:564) calls _read_gateway_runtime_status() → read_runtime_status() (gateway/status.py:439) → _get_runtime_status_path() (gateway/status.py:58), which resolves relative to HERMES_HOME, defaulting to $HOME/.hermes.

When running under sudo (required by _require_root_for_system_service for the --system path), $HOME=/root, so the wrapper reads /root/.hermes/gateway_state.json, which doesn't exist for a system service running as a non-root user. The function returns None, the wait loop never sees gateway_state == "running", and it always hits the 60s timeout — even when the gateway is healthy on the very first poll.
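The mismatch can be illustrated with a minimal sketch. It is not the actual Hermes code: resolve_hermes_home and runtime_status_path are hypothetical stand-ins for the lookup described above, and the user name "alice" is a placeholder.

```python
from pathlib import PurePosixPath


def resolve_hermes_home(env: dict) -> PurePosixPath:
    """Hypothetical stand-in: prefer HERMES_HOME, else fall back to $HOME/.hermes."""
    explicit = env.get("HERMES_HOME")
    if explicit:
        return PurePosixPath(explicit)
    return PurePosixPath(env.get("HOME", "/")) / ".hermes"


def runtime_status_path(env: dict) -> PurePosixPath:
    """Path of the runtime-status file under the resolved home."""
    return resolve_hermes_home(env) / "gateway_state.json"


# Environment of the service process (HERMES_HOME comes from the unit file).
service_env = {"HOME": "/home/alice", "HERMES_HOME": "/home/alice/.hermes"}
# Environment of the CLI under sudo, as in the probe above: HOME=/root, no HERMES_HOME.
sudo_env = {"HOME": "/root"}

print(runtime_status_path(service_env))  # /home/alice/.hermes/gateway_state.json
print(runtime_status_path(sudo_env))     # /root/.hermes/gateway_state.json
```

The gateway writes to the first path while the sudo'd wrapper polls the second, so the wrapper never sees the state file.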

The wrapper's preflight pid = get_running_pid() or _systemd_main_pid(system=system) (line 2268) accidentally papers over the same root-HOME problem because the systemd MainPID fallback works without HERMES_HOME. The wait loop has no equivalent fallback for the runtime-status file.

The installed unit already has Environment="HERMES_HOME=/home/<user>/.hermes", so the correct value is recoverable from the unit definition.
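A rough sketch of recovering that value from the unit text follows. hermes_home_from_unit is a hypothetical helper, not an existing Hermes function, and shlex only approximates systemd's quoting rules, so this is an illustration rather than a complete parser.

```python
import shlex
from typing import Optional


def hermes_home_from_unit(unit_text: str) -> Optional[str]:
    """Return the last HERMES_HOME value found on Environment= lines, or None."""
    found = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line.startswith("Environment="):
            continue
        # One Environment= line may hold several optionally quoted assignments.
        for assignment in shlex.split(line[len("Environment="):]):
            key, sep, value = assignment.partition("=")
            if sep and key == "HERMES_HOME":
                found = value  # later assignments override earlier ones
    return found


# Sample unit text; the user name and ExecStart are placeholders.
SAMPLE_UNIT = """\
[Service]
User=alice
Environment="HERMES_HOME=/home/alice/.hermes"
ExecStart=/usr/bin/true
"""

print(hermes_home_from_unit(SAMPLE_UNIT))  # /home/alice/.hermes
```

Querying systemd directly (for example with systemctl show -p Environment) may be more robust than parsing the file, because it also reflects drop-in overrides.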

Proposed Fix (optional)

For the --system scope, derive HERMES_HOME from the installed unit file before the wait loop runs. Either:

  1. Parse Environment=HERMES_HOME=… from /etc/systemd/system/hermes-gateway.service and set os.environ["HERMES_HOME"] for the duration of the CLI call, or
  2. Pass an explicit hermes_home argument through systemd_restart → _wait_for_systemd_service_restart → _gateway_runtime_status_for_pid, and have those helpers read from that path instead of relying on $HOME.
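Option 2 could look roughly like the sketch below. read_runtime_status_at and wait_until_running are hypothetical names, not existing Hermes helpers. The sketch assumes gateway_state.json is a JSON object with a "gateway_state" key, as shown in the reproduction steps, and it omits the PID matching that _gateway_runtime_status_for_pid performs.

```python
import json
import time
from pathlib import Path
from typing import Optional


def read_runtime_status_at(hermes_home) -> Optional[dict]:
    """Read gateway_state.json from an explicit home instead of $HOME."""
    path = Path(hermes_home) / "gateway_state.json"
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file: treat as "no status yet".
        return None
    return data if isinstance(data, dict) else None


def wait_until_running(hermes_home, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll the status file until gateway_state is 'running' or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = read_runtime_status_at(hermes_home) or {}
        if status.get("gateway_state") == "running":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Because the status is checked before the deadline, an already-healthy gateway is reported on the first poll, which matches the expected behavior described above.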

The same flaw likely affects any other --system code path that calls read_runtime_status() or get_running_pid() from inside a sudo'd CLI (e.g. hermes gateway status --system, the start-limit recovery paths, _recover_pending_systemd_restart).
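If option 1 is chosen, one way to cover these other call sites is a scoped environment override. This is a sketch only: hermes_home_override is a hypothetical helper, and it only helps if the status helpers read HERMES_HOME at call time rather than caching it at import.

```python
import os
from contextlib import contextmanager


@contextmanager
def hermes_home_override(path: str):
    """Temporarily set HERMES_HOME, restoring the previous value on exit."""
    previous = os.environ.get("HERMES_HOME")
    os.environ["HERMES_HOME"] = path
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop("HERMES_HOME", None)
        else:
            os.environ["HERMES_HOME"] = previous


# Usage: wrap any --system code path that reads runtime status.
with hermes_home_override("/home/alice/.hermes"):  # placeholder path
    print(os.environ["HERMES_HOME"])
```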

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

Metadata

Labels

  • P2: Medium, degraded but workaround exists
  • comp/gateway: Gateway runner, session dispatch, delivery
  • type/bug: Something isn't working
