hermes update: gateway restart doesn't verify service survived — can leave gateway dead

## Bug

`hermes update` auto-restarts the gateway service after pulling new code (`hermes_cli/main.py`, around line 3800), but only checks that `systemctl restart` returned 0. It does not verify that the service actually stayed running afterward.

If the new process exits shortly after startup (e.g. transient import error, stale module cache) and systemd records the exit as clean (exit code 0 or SIGTERM), `Restart=on-failure` does not trigger a retry, because it only restarts after unclean exits. The gateway then stays dead silently until the user notices.

## Reproduction

Observed on Linux (systemd system-level service). After `hermes update` on Apr 7, the gateway restarted but died in 26ms. systemd journal:

```
Apr 07 05:25:15 roger systemd[1]: Started Hermes Agent Gateway
Apr 07 05:25:15 roger systemd[1]: hermes-gateway.service: Deactivated successfully.
```

Duration: 26ms. Service stayed dead for 2 days until manually restarted.
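
To check for the same failure mode on another machine, the unit's state can be read with `systemctl show`. The sketch below parses its `KEY=VALUE` output and flags a unit that is inactive with a clean result, which is the case `Restart=on-failure` ignores. The helper names (`parse_show_output`, `died_cleanly`, `show_unit`) are hypothetical and not part of Hermes, and the sample text is illustrative rather than captured from the affected host.

```python
import subprocess


def parse_show_output(text: str) -> dict[str, str]:
    """Parse `systemctl show` KEY=VALUE lines into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def died_cleanly(props: dict[str, str]) -> bool:
    """True when the unit is stopped and its last result was clean.

    Note: this is also true for a unit that was never started.
    """
    return props.get("ActiveState") == "inactive" and props.get("Result") == "success"


def show_unit(scope_cmd: list[str], svc_name: str) -> dict[str, str]:
    """Query systemd for the properties used above (requires systemctl)."""
    out = subprocess.run(
        scope_cmd + ["show", svc_name,
                     "--property=ActiveState,SubState,Result,ExecMainStatus"],
        capture_output=True, text=True, timeout=5,
    )
    return parse_show_output(out.stdout)


# Illustrative sample, not real output from the affected host.
SAMPLE = "ActiveState=inactive\nSubState=dead\nResult=success\nExecMainStatus=0\n"
```

On a live system, `died_cleanly(show_unit(scope_cmd, svc_name))` returning `True` right after an update would point to this bug rather than an ordinary crash.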

## Root cause

In `hermes_cli/main.py` around line 3800:

```python
restart = subprocess.run(
    scope_cmd + ["restart", svc_name],
    capture_output=True, text=True, timeout=15,
)
if restart.returncode == 0:
    restarted_services.append(svc_name)
```

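`systemctl restart` returns 0 even if the service exits immediately after starting. A zero return code only confirms that systemd completed the start job, not that the process stayed running. For a `Type=simple` unit, for example, the start job completes as soon as the main process has been spawned.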

## Suggested fix

Add a post-restart health check with a brief sleep + `is-active` verification, and retry once on failure:

```python
import time  # belongs with the module-level imports in main.py

if restart.returncode == 0:
    # A zero return code does not prove the unit survived, so check
    # is-active after a short delay and retry the restart once if it died.
    for attempt in range(2):
        time.sleep(2)
        verify = subprocess.run(
            scope_cmd + ["is-active", svc_name],
            capture_output=True, text=True, timeout=5,
        )
        if verify.stdout.strip() == "active":
            restarted_services.append(svc_name)
            break
        if attempt == 0:
            print(f"  ⚠ {svc_name} restarted but died immediately, retrying...")
            subprocess.run(
                scope_cmd + ["restart", svc_name],
                capture_output=True, text=True, timeout=15,
            )
    else:
        # Runs only when neither attempt left the unit active.
        print(f"  ⚠ {svc_name} failed to stay running after update")
```
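
The same check could also be factored into a small helper so it can be unit-tested without systemd. This is only a sketch: `restart_and_verify` and its `retries`, `settle`, `run` and `sleep` parameters are hypothetical names, not existing Hermes code. The `run` and `sleep` arguments exist solely so tests can substitute fakes.

```python
import subprocess
import time


def restart_and_verify(scope_cmd, svc_name, retries=1, settle=2.0,
                       run=subprocess.run, sleep=time.sleep):
    """Restart a unit and confirm it is still active after `settle` seconds.

    Returns True once `is-active` reports "active". Returns False if the
    restart command itself fails, or if the unit is still not active after
    the initial attempt plus `retries` retries.
    """
    for _ in range(retries + 1):
        restart = run(
            scope_cmd + ["restart", svc_name],
            capture_output=True, text=True, timeout=15,
        )
        if restart.returncode != 0:
            return False  # systemctl rejected the restart; nothing to verify
        sleep(settle)  # give a short-lived process time to exit
        verify = run(
            scope_cmd + ["is-active", svc_name],
            capture_output=True, text=True, timeout=5,
        )
        if verify.stdout.strip() == "active":
            return True
    return False
```

The update path would then append `svc_name` to `restarted_services` only when `restart_and_verify(scope_cmd, svc_name)` returns `True`, and print the warning otherwise.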

## Environment

- Hermes v0.8.0
- Linux (Arch), systemd system-level service
- `Restart=on-failure` in unit file (correct default, but insufficient for this edge case)

