hermes gateway restart returns before restart completes #8260

@alcorzheng

Bug Description

hermes gateway restart returns immediately after requesting a restart, but the gateway process then exits and is not running again by the time the command returns. In practice, the user has to run hermes gateway start manually after every restart call.

Steps to Reproduce

  1. hermes gateway start — gateway is running
  2. hermes gateway restart — returns '✓ User service restart requested' immediately
  3. hermes gateway status — shows '✗ User gateway service is stopped'
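The steps above can be scripted for repeated checks. This is a minimal sketch, not part of the hermes codebase: `run_cli` and `reproduce` are hypothetical helper names, and the check assumes the status output contains the phrase "is stopped" as in step 3. Because the gateway drains before exiting, a status call made immediately after the restart may still catch the old process running, so a short delay may be needed in practice.

```python
import subprocess
from typing import Callable, List


def run_cli(args: List[str]) -> str:
    """Run a command and return its combined stdout and stderr as text."""
    proc = subprocess.run(args, capture_output=True, text=True)
    return proc.stdout + proc.stderr


def reproduce(run: Callable[[List[str]], str] = run_cli) -> bool:
    """Return True if the gateway is reported as stopped after a restart.

    `run` is injectable so the sequence can be exercised without hermes installed.
    """
    run(["hermes", "gateway", "start"])
    run(["hermes", "gateway", "restart"])
    status = run(["hermes", "gateway", "status"])
    # Assumption: the status text matches the message quoted in step 3.
    return "is stopped" in status
```

Calling `reproduce()` on an affected machine would be expected to return True once the old process has exited.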

Expected Behavior

hermes gateway restart should block until the gateway is back up and running, similar to systemctl restart.

Root Cause

In hermes_cli/gateway.py, the systemd_restart() function (line 1097):

def systemd_restart(system: bool = False):
    ...
    pid = get_running_pid()
    if pid is not None and _request_gateway_self_restart(pid):
        print(f'✓ {_service_scope_label(system).capitalize()} service restart requested')
        return   # <-- returns immediately after SIGUSR1
    subprocess.run(_systemctl_cmd(system) + ['reload-or-restart', ...])

It sends SIGUSR1 to the gateway process (graceful drain-and-restart signal), prints the success message, then returns immediately without waiting for the gateway to actually restart.

The gateway then drains, shuts down, and exits with code 75. Because the systemd unit has Restart=on-failure and RestartSec=30, it does eventually restart, but only after a 30-second delay, by which point the restart command has already returned.
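To confirm the restart policy on a given machine, the unit's properties can be read back from systemd. The sketch below is an illustration under assumptions: `parse_show_output` and `restart_policy` are hypothetical helpers, the unit name must be supplied by the caller, and systemd is assumed to report the RestartSec= setting under the property name RestartUSec.

```python
import subprocess
from typing import Dict, List


def parse_show_output(text: str) -> Dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            props[key.strip()] = value.strip()
    return props


def restart_policy(unit: str, user: bool = True) -> Dict[str, str]:
    """Query the Restart and RestartUSec properties of a unit.

    `unit` is whatever name the gateway service is installed under.
    """
    cmd: List[str] = ["systemctl"]
    if user:
        cmd.append("--user")  # the issue concerns the user-scoped service
    cmd += ["show", unit, "--property=Restart,RestartUSec"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_show_output(result.stdout)
```

If the returned dict shows an on-failure policy with a 30-second delay, the gap between the restart command returning and the gateway coming back is explained by the unit configuration rather than by a crash.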

Suggested Fix

systemd_restart() should wait for the service to become active again after requesting the restart, e.g.:

# After sending SIGUSR1, poll the unit state until it is active or the timeout expires.
# Fragment intended for the body of systemd_restart(), where subprocess is already in use.
import time

deadline = time.monotonic() + 60  # monotonic clock is unaffected by wall-clock changes
while time.monotonic() < deadline:
    result = subprocess.run(
        _systemctl_cmd(system) + ['is-active', get_service_name()],
        capture_output=True, text=True
    )
    if result.stdout.strip() == 'active':
        print(f'✓ {_service_scope_label(system).capitalize()} service restarted')
        return
    time.sleep(1)
print('✗ Service restart timed out')
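One caveat with polling is-active right after sending SIGUSR1: the old process is still draining at that point, so the unit may still report active and the loop could return before any restart has happened. A possible refinement, sketched here under assumptions, is to also require that the running PID differs from the one that was signalled. `wait_for_restart` is a hypothetical helper; `get_pid` could be backed by get_running_pid() and `is_active` by the is-active check, but both are passed in as callables so the logic can be tested in isolation.

```python
import time
from typing import Callable, Optional


def wait_for_restart(
    old_pid: int,
    get_pid: Callable[[], Optional[int]],
    is_active: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once a new gateway process is running and the unit is active.

    Returns False if that does not happen before `timeout` seconds elapse.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        pid = get_pid()
        # Ignore the old, still-draining process: only a different PID counts.
        if pid is not None and pid != old_pid and is_active():
            return True
        sleep(interval)
    return False
```

With RestartSec=30, the default 60-second timeout leaves room for the drain plus the systemd delay. Whether that margin is enough depends on how long a drain can take.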
