
[Bug]: /restart slash command hardcodes detached=True, via_service=False — kills systemd-managed gateway permanently #8104

@Typeve


Bug Description

The /restart slash command in the gateway hardcodes detached=True, via_service=False when calling request_restart(). When the gateway is managed by systemd (as a system-level or user-level service), the gateway process can exit and stay down with no automatic recovery.

Hermes Version: 0.8.0 (2026.4.8)
Environment: Linux, systemd system-level service (hermes-gateway.service), Tencent Cloud

Root Cause Analysis

In gateway/run.py, the /restart handler at line ~3947:

async def _handle_restart_command(self, event: MessageEvent) -> str:
    """Handle /restart command - drain active work, then restart the gateway."""
    # ...
    self.request_restart(detached=True, via_service=False)  # ← hardcoded

This hardcoded path triggers the following chain:

  1. request_restart(detached=True, via_service=False) → sets _restart_detached=True, _restart_via_service=False
  2. stop(detached_restart=True, service_restart=False) is called
  3. In _stop_impl() (line ~1887), _restart_detached=True triggers _launch_detached_restart_command() — which spawns a hermes gateway restart process in a new session via setsid/Popen(start_new_session=True)
  4. The detached child process waits for the parent PID to exit, then runs hermes gateway restart
  5. hermes gateway restart regenerates the systemd service file via hermes gateway install --system, then runs systemctl restart hermes-gateway
  6. The problem: because via_service=False, the gateway exits with code 0 rather than 75 (GATEWAY_SERVICE_RESTART_EXIT_CODE), so systemd records an ordinary clean exit, not a restart request
  7. With Restart=always, systemd may still start the service again on its own while the detached child is regenerating the service file and calling systemctl restart. The two restart paths race: the repeated starts can hit StartLimitBurst, and the regenerated service file can strip custom environment overrides
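The detached path in steps 3 and 4 can be illustrated with a minimal sketch. It is not the code in gateway/run.py; launch_detached, pid_alive, and wait_for_exit are hypothetical names, and the sketch only assumes that the child is started in a new session and polls until the parent PID is gone.

```python
import os
import subprocess
import time


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_exit(pid: int, timeout: float = 30.0, poll: float = 0.1) -> bool:
    """Poll until the process is gone; return False if it outlives the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(poll)
    return not pid_alive(pid)


def launch_detached(argv: list) -> subprocess.Popen:
    """Start argv in its own session so it survives the parent's exit (hypothetical helper)."""
    return subprocess.Popen(
        argv,
        start_new_session=True,  # the child calls setsid() before exec
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

Because the child lives in its own session, nothing ties its systemctl restart call to the moment systemd itself reacts to the parent's exit, which is where the race in step 7 comes from.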

Why via_service=False is wrong here: The codebase already has a well-designed service-restart exit path. When via_service=True, the gateway sets _exit_code = GATEWAY_SERVICE_RESTART_EXIT_CODE (75), which tells systemd to restart the process cleanly via the existing service configuration — no service file regeneration, no PID races, no environment override loss.

Steps to Reproduce

  1. Install gateway as systemd service: hermes gateway install --system
  2. Start: systemctl start hermes-gateway
  3. Send /restart to the bot via Telegram
  4. Observe: gateway process exits, may or may not restart depending on timing
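  5. If StartLimitBurst is exceeded within StartLimitIntervalSec (reported here as 5 starts in 600s; systemd's own built-in default is 5 starts in 10s), the gateway stays dead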

Expected Behavior

The /restart command should detect whether the gateway is running under systemd and use the service-managed restart path:

  • If systemd: exit with code 75 → systemd handles the restart cleanly
  • If not systemd: fall back to the detached spawn approach

Actual Behavior

Always uses the detached spawn path (detached=True, via_service=False), which regenerates service files, creates PID races, and can exhaust systemd's start limit.

Relevant Code

gateway/run.py line ~3947:

self.request_restart(detached=True, via_service=False)

gateway/run.py line ~1385 (request_restart):

def request_restart(self, *, detached: bool = False, via_service: bool = False) -> bool:

gateway/run.py line ~1939 (exit path selection):

if self._restart_requested and self._restart_via_service:
    self._exit_code = GATEWAY_SERVICE_RESTART_EXIT_CODE

gateway/restart.py:

GATEWAY_SERVICE_RESTART_EXIT_CODE = 75  # EX_TEMPFAIL
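
Taken together, the snippets above imply an exit-code decision along the following lines. This is a sketch only: select_exit_code is a hypothetical standalone function, not something that exists in gateway/run.py, and the fallback value of 0 is taken from the behavior described in this report.

```python
GATEWAY_SERVICE_RESTART_EXIT_CODE = 75  # EX_TEMPFAIL, mirrors gateway/restart.py


def select_exit_code(restart_requested: bool, via_service: bool) -> int:
    """Hypothetical pure-function version of the exit path selection."""
    if restart_requested and via_service:
        # Service-managed restart: signal the service manager with code 75.
        return GATEWAY_SERVICE_RESTART_EXIT_CODE
    # Detached restart or plain shutdown: ordinary clean exit.
    return 0
```

With the values /restart passes today (restart_requested=True, via_service=False) this returns 0, which is the clean exit described in the root cause analysis.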

Proposed Fix

The /restart handler should auto-detect the systemd environment and pass via_service=True when appropriate:

async def _handle_restart_command(self, event: MessageEvent) -> str:
    """Handle /restart command - drain active work, then restart the gateway."""
    if self._restart_requested or self._draining:
        count = self._running_agent_count()
        if count:
            return f"⏳ Draining {count} active agent(s) before restart..."
        return "⏳ Gateway restart already in progress..."

    active_agents = self._running_agent_count()

    # Detect if running under systemd and use service-managed restart path
    via_service = _is_systemd_managed()
    self.request_restart(detached=not via_service, via_service=via_service)

    if active_agents:
        return f"⏳ Draining {active_agents} active agent(s) before restart..."
    return "♻ Restarting gateway..."

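A helper to detect systemd could check one of:

  • os.environ.get("INVOCATION_ID") (set by systemd for each service invocation, and inherited by child processes)
  • Or os.path.exists("/run/systemd/system") (shows only that systemd is the init system, not that it manages this process)
  • Or whether the parent process (os.getppid()) is PID 1, i.e. systemd, for a system-level service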

This ensures the gateway exits with code 75 when under systemd, letting the service manager handle restart cleanly without file regeneration or PID races.

Environment

  • OS: Linux (Tencent Cloud, systemd system-level service)
  • Python: 3.x
  • Hermes: 0.8.0 (2026.4.8)
  • Service type: Type=simple, Restart=always, RestartSec=2, KillMode=mixed
  • Messaging Platform: Telegram (but affects all platforms)
