hermes gateway restart fails when gateway is in crashed/unresponsive state

# Issue: `hermes gateway restart` fails when gateway is in a crashed/unresponsive state

**Repo:** https://github.com/NousResearch/hermes-agent

## Problem

`hermes gateway restart` relies on sending `SIGUSR1` to the gateway process, which triggers an asyncio signal handler that calls `runner.request_restart(detached=False, via_service=True)`. This initiates a **graceful drain** — the gateway waits for active agents to finish (up to 90 seconds) before exiting.
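For readers unfamiliar with the pattern, a graceful drain can be pictured roughly as below. This is a minimal sketch and not the gateway's actual code: `drain` and `demo_drain` are hypothetical names, and only the 90-second limit is taken from this issue.

```python
import asyncio


async def drain(active: set[asyncio.Task], timeout: float = 90.0) -> tuple[int, int]:
    """Wait up to `timeout` seconds for active tasks, then cancel the rest.

    Returns (finished, cancelled). Hypothetical helper for illustration only.
    """
    if not active:
        return 0, 0  # asyncio.wait() rejects an empty set
    done, pending = await asyncio.wait(active, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let the cancellations settle before the process exits.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


def demo_drain() -> tuple[int, int]:
    async def main() -> tuple[int, int]:
        quick = asyncio.create_task(asyncio.sleep(0.01))  # finishes in time
        slow = asyncio.create_task(asyncio.sleep(10))     # outlives the drain
        return await drain({quick, slow}, timeout=0.2)

    return asyncio.run(main())
```

Everything in this sketch runs on the event loop, which is why the whole path depends on the loop being healthy.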

**This design fails completely when the gateway is in a crashed or unresponsive state:**

1. The asyncio event loop is not running normally (crashed, frozen, or stuck in a crash loop).
2. The `SIGUSR1` handler never executes, because asyncio signal handlers run as callbacks on the event loop, and a blocked or stopped loop never dispatches them.
3. The graceful restart never starts, so the process never exits and `launchd`/`systemd` is never prompted to restart the service.
4. The user is left with a dead gateway and no way to recover from the CLI.
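The failure mode in step 2 can be reproduced in isolation. The sketch below assumes the handler is registered with `loop.add_signal_handler` (the issue only says "asyncio signal handler"); `demo_blocked_loop` is a hypothetical name, and `time.sleep` stands in for whatever is actually blocking the gateway's loop. Unix only.

```python
import asyncio
import os
import signal
import time


def demo_blocked_loop() -> tuple[bool, bool]:
    """Return (handled_while_blocked, handled_after_loop_resumed)."""
    events: list[str] = []

    async def main() -> tuple[bool, bool]:
        loop = asyncio.get_running_loop()
        loop.add_signal_handler(signal.SIGUSR1, lambda: events.append("handled"))

        os.kill(os.getpid(), signal.SIGUSR1)  # signal is delivered immediately
        time.sleep(0.2)                       # but the loop is blocked here
        handled_while_blocked = bool(events)

        await asyncio.sleep(0.05)             # loop runs again, callback fires
        handled_after = bool(events)

        loop.remove_signal_handler(signal.SIGUSR1)
        return handled_while_blocked, handled_after

    return asyncio.run(main())
```

The signal reaches the process right away, but the callback only runs once the loop gets control back. If the loop never recovers, the callback never runs.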

## Evidence

The gateway process was observed crash-looping multiple times. During this time, `hermes gateway restart` would have been ineffective because SIGUSR1 would not have been processed.

The only recovery mechanism that worked was `kill -HUP` (or `kill -TERM`), which forces the process to exit regardless of its internal state.

## Expected Behavior

When a user runs `hermes gateway restart`, they expect the gateway to restart — not to gracefully drain and then fail silently.

## Proposed Solutions

### Option 1: Add `hermes gateway force-restart` (recommended)

Bypass the drain entirely: send `SIGTERM` directly, then run `launchctl kickstart` (macOS) or `systemctl restart` (systemd).
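A rough sketch of what such a command could do is shown below. All function names are hypothetical, and the escalation to `SIGKILL` is an addition not proposed in this issue. The service label, the `gui/<uid>` launchd domain, and the use of a systemd user unit are assumptions about how the gateway is installed.

```python
import os
import signal
import time


def pid_alive(pid: int) -> bool:
    """Probe for a process with signal 0, which checks existence only."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def force_stop(pid: int, term_timeout: float = 5.0, poll: float = 0.1) -> str:
    """Send SIGTERM; escalate to SIGKILL if the process outlives term_timeout."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "not-running"
    deadline = time.monotonic() + term_timeout
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return "SIGTERM"
        time.sleep(poll)
    try:
        os.kill(pid, signal.SIGKILL)  # assumption: last resort, not in the issue
    except ProcessLookupError:
        return "SIGTERM"
    return "SIGKILL"


def service_restart_command(system: str, service: str, uid: int) -> list[str]:
    """Build the service-manager command; `service` is a placeholder name."""
    if system == "Darwin":
        # -k kills any running instance before starting it again
        return ["launchctl", "kickstart", "-k", f"gui/{uid}/{service}"]
    return ["systemctl", "--user", "restart", service]
```

A caller would run `force_stop(pid)` and then pass the result of `service_restart_command(platform.system(), ...)` to `subprocess.run`. The PID probe is meant for a gateway that is not a child of the CLI process, since an unreaped child would still appear alive.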

### Option 2: Make `restart` auto-fallback to force-kill

After sending `SIGUSR1`, if the gateway has not exited within 30 seconds, fall back to `SIGTERM` followed by a service restart.
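The fallback logic might look something like this. It is a sketch under assumptions: `restart_with_fallback` is a hypothetical name, and the signal sender and liveness check are passed in as callables so the flow can be exercised without a real gateway process.

```python
import signal
import time
from typing import Callable


def restart_with_fallback(
    send_signal: Callable[[int], None],
    is_alive: Callable[[], bool],
    graceful_timeout: float = 30.0,
    poll: float = 0.5,
) -> str:
    """Try a graceful restart first, then force the process down.

    Returns "graceful" if the process exited on its own after SIGUSR1,
    or "forced" if SIGTERM had to be sent after the timeout.
    """
    send_signal(signal.SIGUSR1)
    deadline = time.monotonic() + graceful_timeout
    while time.monotonic() < deadline:
        if not is_alive():
            return "graceful"
        time.sleep(poll)
    send_signal(signal.SIGTERM)  # gateway ignored SIGUSR1, so stop it directly
    return "forced"
```

On a "forced" result the caller would still need to issue the service restart. Since the drain described under Problem can take up to 90 seconds, a 30-second timeout would also cut short a healthy drain, so the value is left as a parameter here.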

### Option 3: Self-healing watchdog

Add a built-in watchdog that detects an unresponsive event loop and restarts the gateway on its own.
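One possible shape for such a watchdog is a heartbeat coroutine on the loop plus a plain thread that checks how stale the heartbeat is. The sketch below is illustrative only: `LoopWatchdog` and `demo_watchdog` are hypothetical names, and the thresholds are arbitrary. In a real gateway, `on_stall` might call `os._exit(1)` so that the service manager restarts the process, assuming the service is configured to restart on exit.

```python
import asyncio
import threading
import time
from typing import Callable


class LoopWatchdog:
    """Calls on_stall once if the event loop stops updating its heartbeat."""

    def __init__(self, max_stall: float, on_stall: Callable[[], None], poll: float = 0.05):
        self.max_stall = max_stall
        self.on_stall = on_stall
        self.poll = poll
        self._last_beat = time.monotonic()
        self._stop = threading.Event()

    async def heartbeat(self) -> None:
        # Runs on the event loop; stops updating when the loop is blocked.
        while not self._stop.is_set():
            self._last_beat = time.monotonic()
            await asyncio.sleep(self.poll)

    def _watch(self) -> None:
        # Runs in a separate thread, so it keeps working when the loop hangs.
        while not self._stop.wait(self.poll):
            if time.monotonic() - self._last_beat > self.max_stall:
                self.on_stall()
                return

    def start(self) -> None:
        self._last_beat = time.monotonic()
        threading.Thread(target=self._watch, daemon=True).start()

    def stop(self) -> None:
        self._stop.set()


def demo_watchdog() -> tuple[bool, bool]:
    """Return (quiet_while_healthy, fired_while_blocked)."""
    stalls: list[str] = []

    async def main() -> tuple[bool, bool]:
        wd = LoopWatchdog(max_stall=0.3, on_stall=lambda: stalls.append("stalled"))
        beat = asyncio.create_task(wd.heartbeat())
        wd.start()
        await asyncio.sleep(0.1)      # loop is healthy, heartbeat keeps ticking
        quiet_while_healthy = not stalls
        time.sleep(1.0)               # simulate a frozen loop
        wd.stop()
        beat.cancel()
        return quiet_while_healthy, bool(stalls)

    return asyncio.run(main())
```

Because the check runs outside the event loop, it does not share the weakness of the `SIGUSR1` path.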

## Code References

- Signal handler: `gateway/run.py:10322-10324`
- macOS restart: `hermes_cli/gateway.py:1904`
- systemd restart: `hermes_cli/gateway.py:1502`

## Severity

Medium — complete service outage with no CLI recovery path.


hermes gateway restart fails when gateway is in crashed/unresponsive state #12438
