hermes gateway restart fails when gateway is in crashed/unresponsive state #12438

@shengbai4-hub

Issue: hermes gateway restart fails when gateway is in a crashed/unresponsive state

Repo: https://github.com/NousResearch/hermes-agent

Problem

hermes gateway restart relies on sending SIGUSR1 to the gateway process, which triggers an asyncio signal handler that calls runner.request_restart(detached=False, via_service=True). This initiates a graceful drain — the gateway waits for active agents to finish (up to 90 seconds) before exiting.

This design fails completely when the gateway is in a crashed or unresponsive state:

  1. The asyncio event loop is not running normally (crashed, frozen, or in a crash-loop)
  2. The SIGUSR1 handler never executes: an asyncio signal handler runs as a callback on the event loop, so a dead or blocked loop never dispatches it
  3. The graceful restart never happens, so the process never exits and launchd/systemd is never triggered to restart the service
  4. The user is left with a dead gateway and no way to recover from the CLI
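
The failure mode in the list above can be reproduced outside the gateway. The following is a minimal sketch, assuming the gateway registers its handler with `loop.add_signal_handler()` (the issue only says "asyncio signal handler", so that is an assumption). `blocked_loop_demo` is a hypothetical name, and `time.sleep()` stands in for whatever freezes the real loop. POSIX only.

```python
import asyncio
import os
import signal
import time
from typing import List, Tuple


def blocked_loop_demo(block_seconds: float = 0.2) -> Tuple[bool, bool]:
    """Return (handled_while_blocked, handled_after_yield)."""
    handled: List[bool] = []

    async def main() -> Tuple[bool, bool]:
        loop = asyncio.get_running_loop()
        # The callback is queued on the event loop; it does not run
        # directly inside the OS-level signal handler.
        loop.add_signal_handler(signal.SIGUSR1, lambda: handled.append(True))

        # Deliver SIGUSR1 to ourselves, then block the loop synchronously.
        os.kill(os.getpid(), signal.SIGUSR1)
        time.sleep(block_seconds)  # stand-in for a frozen event loop
        while_blocked = bool(handled)

        # Yield to the loop so it can finally dispatch the callback.
        await asyncio.sleep(0.05)
        loop.remove_signal_handler(signal.SIGUSR1)
        return while_blocked, bool(handled)

    return asyncio.run(main())


if __name__ == "__main__":
    print(blocked_loop_demo())
```

While the loop is blocked the callback stays pending; it only runs once the coroutine yields. A loop that never yields again would therefore never act on SIGUSR1.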

Evidence

The gateway process was observed crash-looping multiple times. During this time, hermes gateway restart would have been ineffective because SIGUSR1 would not have been processed.

The only recovery mechanism that worked was kill -HUP (or kill -TERM), which forces the process to exit regardless of its internal state.

Expected Behavior

When a user runs hermes gateway restart, they expect the gateway to restart — not to gracefully drain and then fail silently.

Proposed Solutions

Option 1: Add hermes gateway force-restart (recommended)

Bypass drain entirely: send SIGTERM directly, then launchctl kickstart / systemctl restart.
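
A minimal sketch of what Option 1 could look like, not the project's implementation. The function names, the service label, and the choice of a per-user service (`gui/<uid>/...` on macOS, `systemctl --user` on Linux) are all assumptions for illustration; the real unit names live in `hermes_cli/gateway.py`.

```python
import os
import signal
import subprocess
import sys
from typing import List


def service_restart_command(platform: str, service: str, uid: int) -> List[str]:
    """Build the service-manager restart command (hypothetical helper)."""
    if platform == "darwin":
        # `kickstart -k` kills any running instance before starting it again.
        return ["launchctl", "kickstart", "-k", f"gui/{uid}/{service}"]
    if platform.startswith("linux"):
        return ["systemctl", "--user", "restart", service]
    raise ValueError(f"unsupported platform: {platform}")


def force_restart(pid: int, service: str) -> None:
    """Skip the drain: SIGTERM the gateway, then ask the service manager to restart it."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the process is already gone; still restart the service
    cmd = service_restart_command(sys.platform, service, os.getuid())
    subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the side effects makes the platform branching easy to unit-test without touching a real service.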

Option 2: Make restart auto-fallback to force-kill

After sending SIGUSR1, if the gateway has not exited within 30s, fall back to SIGTERM + service restart.
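
The fallback in Option 2 could be sketched roughly as follows. `restart_with_fallback` and `pid_alive` are hypothetical names, the 30-second default mirrors the figure in this issue, and the subsequent service-manager restart is left out so the sketch stays focused on the signal escalation.

```python
import os
import signal
import time
from typing import Callable


def pid_alive(pid: int) -> bool:
    """Best-effort liveness check using signal 0 (sends nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def restart_with_fallback(
    pid: int,
    grace_seconds: float = 30.0,
    poll_interval: float = 0.5,
    is_alive: Callable[[int], bool] = pid_alive,
) -> str:
    """Try a graceful SIGUSR1 restart first, escalate to SIGTERM on timeout."""
    try:
        os.kill(pid, signal.SIGUSR1)
    except ProcessLookupError:
        return "already-exited"

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not is_alive(pid):
            return "graceful"
        time.sleep(poll_interval)

    if not is_alive(pid):
        return "graceful"
    os.kill(pid, signal.SIGTERM)  # the gateway ignored or never processed SIGUSR1
    return "forced"
```

The return value lets the CLI tell the user which path was taken, so a forced restart is not silent. Note that the existing drain can take up to 90 seconds, so a 30-second grace period would cut some healthy drains short; that trade-off is a design decision for the maintainers.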

Option 3: Self-healing watchdog

Built-in watchdog that detects unresponsive event loop and self-restarts.
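
One way Option 3 might work is a heartbeat coroutine plus a monitor thread that lives outside the event loop. This is a sketch under stated assumptions: `LoopWatchdog` and its parameters are hypothetical, and in a real gateway `on_stall` would presumably terminate the process (for example with `os._exit(1)`) so that launchd/systemd restarts it.

```python
import asyncio
import threading
import time
from typing import Callable, Optional


class LoopWatchdog:
    """Calls on_stall once if the event loop stops producing heartbeats."""

    def __init__(self, max_stall: float, on_stall: Callable[[], None],
                 check_interval: float = 1.0) -> None:
        self._max_stall = max_stall
        self._on_stall = on_stall
        self._check_interval = check_interval
        self._last_beat = time.monotonic()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    async def heartbeat(self, interval: float) -> None:
        # Runs on the event loop; stops updating when the loop is stuck.
        while True:
            self._last_beat = time.monotonic()
            await asyncio.sleep(interval)

    def _watch(self) -> None:
        # Runs in a separate thread, so it keeps working when the loop is frozen.
        while not self._stop.wait(self._check_interval):
            if time.monotonic() - self._last_beat > self._max_stall:
                self._on_stall()
                return

    def start(self) -> None:
        self._thread = threading.Thread(target=self._watch, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
```

Because the monitor runs in its own thread, it does not depend on the loop it is supervising. It would not help if the whole process is deadlocked or already dead; that case still needs the service manager or a forced restart from the CLI.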

Code References

  • Signal handler: gateway/run.py:10322-10324
  • macOS restart: hermes_cli/gateway.py:1904
  • systemd restart: hermes_cli/gateway.py:1502

Severity

Medium — complete service outage with no CLI recovery path.

Labels

  • P1: High, major feature broken, no workaround
  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • comp/gateway: Gateway runner, session dispatch, delivery
  • type/bug: Something isn't working
