hermes update: gateway restart doesn't verify service survived — can leave gateway dead #6631

@witt3rd

Bug

hermes update auto-restarts the gateway service after pulling new code (main.py ~L3800), but only checks that systemctl restart returned 0 — it does not verify the service actually stayed running afterward.

If the new process exits right after starting (e.g. transient import error, stale module cache) and systemd records a clean exit (exit code 0, or termination by a signal such as SIGTERM that systemd counts as clean), Restart=on-failure does not trigger a retry. The gateway then stays dead, silently, until the user notices.

Reproduction

Observed on Linux (systemd system-level service). After hermes update on Apr 7, the gateway restarted but died in 26ms. systemd journal:

Apr 07 05:25:15 roger systemd[1]: Started Hermes Agent Gateway
Apr 07 05:25:15 roger systemd[1]: hermes-gateway.service: Deactivated successfully.

Duration: 26ms. Service stayed dead for 2 days until manually restarted.

Root cause

In hermes_cli/main.py around line 3800:

restart = subprocess.run(
    scope_cmd + ["restart", svc_name],
    capture_output=True, text=True, timeout=15,
)
if restart.returncode == 0:
    restarted_services.append(svc_name)

systemctl restart returns 0 even if the service exits immediately after starting. A zero exit code confirms only that systemd ran the restart job, not that the process survived.
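
One way to observe this gap is to ask systemd for the unit's state after the restart rather than trusting the exit code. The sketch below parses the KEY=VALUE output of `systemctl show` for the properties ActiveState, SubState, Result and ExecMainStatus. The helper names (parse_show_output, looks_alive, query_unit) are hypothetical and not part of Hermes; scope_cmd and svc_name are assumed to be the same values used in main.py, and the "running" sub-state check assumes the gateway is a long-running service.

```python
import subprocess


def parse_show_output(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines as printed by `systemctl show`."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no "="
            props[key.strip()] = value.strip()
    return props


def looks_alive(props: dict[str, str]) -> bool:
    """True only if systemd reports the unit as active and running."""
    return (
        props.get("ActiveState") == "active"
        and props.get("SubState") == "running"
    )


def query_unit(scope_cmd: list[str], svc_name: str) -> dict[str, str]:
    """Ask systemd for the unit's current state.

    scope_cmd is assumed to be e.g. ["systemctl"] or ["systemctl", "--user"].
    """
    result = subprocess.run(
        scope_cmd + [
            "show", svc_name,
            "--property=ActiveState,SubState,Result,ExecMainStatus",
        ],
        capture_output=True, text=True, timeout=5,
    )
    return parse_show_output(result.stdout)
```

A unit that exited cleanly right after starting would be expected to report ActiveState=inactive here, even though the preceding restart command returned 0.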

Suggested fix

Add a post-restart health check with a brief sleep + is-active verification, and retry once on failure:

if restart.returncode == 0:
    import time
    time.sleep(2)
    verify = subprocess.run(
        scope_cmd + ["is-active", svc_name],
        capture_output=True, text=True, timeout=5,
    )
    if verify.stdout.strip() == "active":
        restarted_services.append(svc_name)
    else:
        print(f"  ⚠ {svc_name} restarted but died immediately — retrying...")
        retry = subprocess.run(
            scope_cmd + ["restart", svc_name],
            capture_output=True, text=True, timeout=15,
        )
        time.sleep(2)
        verify2 = subprocess.run(
            scope_cmd + ["is-active", svc_name],
            capture_output=True, text=True, timeout=5,
        )
        if verify2.stdout.strip() == "active":
            restarted_services.append(svc_name)
        else:
            print(f"  ⚠ {svc_name} failed to stay running after update")
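
The same idea can be folded into a small helper so that the restart and the verification are not duplicated. This is only a sketch under assumptions: restart_and_verify is a hypothetical name, the attempts and settle_seconds defaults mirror the one retry and the 2-second sleep above, and the run and sleep parameters exist solely so the function can be exercised without systemd.

```python
import subprocess
import time


def restart_and_verify(scope_cmd, svc_name, attempts=2, settle_seconds=2.0,
                       run=subprocess.run, sleep=time.sleep):
    """Restart a unit and confirm it is still active after a short delay.

    Returns True if `is-active` prints "active" within `attempts` tries,
    False otherwise.
    """
    for _ in range(attempts):
        restart = run(
            scope_cmd + ["restart", svc_name],
            capture_output=True, text=True, timeout=15,
        )
        if restart.returncode != 0:
            continue  # restart command itself failed; retry if attempts remain
        sleep(settle_seconds)  # give a short-lived process time to exit
        verify = run(
            scope_cmd + ["is-active", svc_name],
            capture_output=True, text=True, timeout=5,
        )
        if verify.stdout.strip() == "active":
            return True
    return False
```

The caller would then append svc_name to restarted_services only when the helper returns True, and print the warning otherwise. A fixed sleep is a heuristic: a process that dies after settle_seconds would still be reported as healthy.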

Environment

  • Hermes v0.8.0
  • Linux (Arch), systemd system-level service
  • Restart=on-failure in unit file (correct default, but insufficient for this edge case)
