/restart bricks a launchd-managed gateway on macOS — exits 0, KeepAlive.SuccessfulExit=false won't revive it #43475

@ahmadalzaro1

Summary

On a launchd-managed gateway (macOS), the /restart slash command (and hermes gateway restart, which takes the same path) stops the gateway but never relaunches it. The gateway exits 0, and because the generated plist sets KeepAlive.SuccessfulExit=false, launchd treats the clean exit as success and does not revive it. The agent is silently unreachable until someone runs launchctl kickstart manually.

Environment

  • Hermes v0.16.0 (2026.6.5), upstream 49dd776d
  • macOS 26.5.1 (launchd), gateway run as a per-user LaunchAgent (KeepAlive.SuccessfulExit=false, RunAtLoad=true)
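
For readers unfamiliar with the launchd side, here is a minimal sketch of what a job definition with these two keys looks like, built with Python's plistlib. The label suffix and the ProgramArguments value are placeholders I made up for illustration; they are not copied from the plist Hermes generates.

```python
import plistlib

# Illustrative job definition. Only RunAtLoad and KeepAlive.SuccessfulExit
# come from the issue; the label suffix and program path are hypothetical.
job = {
    "Label": "ai.hermes.gateway-example",
    "ProgramArguments": ["/path/to/hermes-gateway"],  # placeholder command
    "RunAtLoad": True,
    # SuccessfulExit=false: launchd restarts the job only when it exits
    # with a non-zero status. A clean exit 0 leaves the job stopped.
    "KeepAlive": {"SuccessfulExit": False},
}

plist_xml = plistlib.dumps(job).decode("utf-8")
print(plist_xml)
```

With this setting, any restart mechanism that relies on launchd has to end the process with a non-zero status.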

Reproduction

  1. Profile gateway managed by launchd (the default macOS install).
  2. In Discord (or any platform): send /restart.
  3. Gateway stops; nothing relaunches it. /research-ops, /restart, etc. get no reply. launchctl list shows pid = -.
  4. Recovery: launchctl kickstart -k gui/$(id -u)/ai.hermes.gateway-<profile>.
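
The recovery command in step 4 can be scripted. The sketch below only assembles the same launchctl kickstart invocation for a given profile; the helper names are mine, and the label format is taken from the command above.

```python
import os
import subprocess
from typing import List, Optional


def kickstart_argv(profile: str, uid: Optional[int] = None) -> List[str]:
    """Build the argv for `launchctl kickstart -k gui/<uid>/<label>`."""
    if uid is None:
        uid = os.getuid()  # same value as $(id -u) in the shell command
    label = f"ai.hermes.gateway-{profile}"
    return ["launchctl", "kickstart", "-k", f"gui/{uid}/{label}"]


def kickstart(profile: str) -> int:
    """Run the kickstart and return launchctl's exit status (macOS only)."""
    return subprocess.run(kickstart_argv(profile), check=False).returncode


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe anywhere.
    print(" ".join(kickstart_argv("example", uid=501)))
```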

Log evidence (profile gateway.log)

[Discord] slash '/restart' invoked by user=…
gateway.run: Stopping gateway for restart...
gateway.run: Gateway stopped (total teardown 0.08s)
          ← nothing. dead until a manual kickstart 12 min later.

Contrast a signal shutdown the same day, which exits non-zero and is revived:

Exiting with code 1 (signal-initiated shutdown without restart request)
   so systemd Restart=on-failure can revive the gateway.

The /restart path logs no "Exiting with code …" line — it returns success.

Root cause (gateway/run.py, start_gateway() exit logic)

The exit decision after a planned restart:

if _signal_initiated_shutdown and not runner._restart_requested:
    ... return False            # → sys.exit(1)   (signal path: revived)
if runner._restart_via_service:
    raise SystemExit(75)        # systemd path: revived
return True                     # ← /restart on launchd lands HERE → exit 0

A slash-command /restart sets _restart_requested=True but not _signal_initiated_shutdown and not _restart_via_service (that flag is for systemd installs). It therefore falls through to return True, so main() sees success and the process exits 0.
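
To make the fall-through concrete, here is a small standalone model of the exit decision quoted above. It is a simplification written for this issue, not the real start_gateway(): the three flags are passed in as plain booleans and the function returns the exit status the process would end with.

```python
def current_exit_code(signal_initiated_shutdown: bool,
                      restart_requested: bool,
                      restart_via_service: bool) -> int:
    """Model of the quoted exit logic; returns the process exit status."""
    if signal_initiated_shutdown and not restart_requested:
        return 1   # return False -> sys.exit(1)
    if restart_via_service:
        return 75  # raise SystemExit(75)
    return 0       # return True -> clean exit


# Signal shutdown without a restart request: non-zero, so it is revived.
print(current_exit_code(True, False, False))
# Planned restart on a systemd install: non-zero, so it is revived.
print(current_exit_code(False, True, True))
# Slash-command /restart on launchd: clean exit, launchd leaves it stopped.
print(current_exit_code(False, True, False))
```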

The restart dispatch itself (gateway/run.py, the _restart_via_service branch) already documents the gap:

if self._restart_requested and self._restart_via_service:
    self._launch_systemd_restart_shortcut()
    # ... launchd's KeepAlive.SuccessfulExit=false needs a non-zero exit to [revive]

…but there is no launchd-equivalent branch for the non-systemd case, so launchd-managed /restart exits clean and launchd (correctly, per SuccessfulExit=false) declines to revive.

Impact

  • Severity: high. A documented operator command (/restart) takes the agent offline until someone kickstarts it manually, with no error surfaced to the user. Reproduced repeatedly. The agent simply stops responding.

Suggested fix (any one)

  1. On launchd-managed gateways, a planned /restart should exit non-zero (e.g. reuse SystemExit(75)) so KeepAlive.SuccessfulExit=false relaunches it.
  2. …or route the launchd /restart through the detached respawn watcher (launch_detached_profile_gateway_restart) the way hermes update does.
  3. …or launchctl kickstart -k <label> the job directly (the verb that works on macOS 15+, unlike bootout/bootstrap which return exit 5).

Detection of "managed by launchd" already exists in hermes_cli/gateway.py; the exit/respawn path just needs a launchd arm equivalent to the systemd one.
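
As a rough sketch of option 1, the exit decision could gain a launchd arm next to the systemd one. This is a standalone model, not a patch against gateway/run.py: managed_by_launchd is a hypothetical flag standing in for whatever the existing detection in hermes_cli/gateway.py reports, and the function returns an exit status instead of raising SystemExit.

```python
def exit_code_with_launchd_arm(signal_initiated_shutdown: bool,
                               restart_requested: bool,
                               restart_via_service: bool,
                               managed_by_launchd: bool) -> int:
    """Model of the exit decision with an added launchd branch."""
    if signal_initiated_shutdown and not restart_requested:
        return 1   # signal path, unchanged
    if restart_via_service:
        return 75  # systemd path, unchanged
    if restart_requested and managed_by_launchd:
        # Hypothetical new arm: exit non-zero so that
        # KeepAlive.SuccessfulExit=false relaunches the job.
        return 75
    return 0       # ordinary clean stop stays a clean exit


# /restart on a launchd-managed gateway now exits non-zero.
print(exit_code_with_launchd_arm(False, True, False, True))
# A plain stop with no restart request still exits 0 and stays stopped.
print(exit_code_with_launchd_arm(False, False, False, True))
```

Keeping the new branch conditional on restart_requested matters: a deliberate stop should still exit 0, otherwise launchd would revive a gateway the operator meant to shut down.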

Metadata

    Labels

    P2 (Medium: degraded but workaround exists)
    area/config (Config system, migrations, profiles)
    comp/gateway (Gateway runner, session dispatch, delivery)
    type/bug (Something isn't working)
