[Bug]: /restart does not relaunch the gateway under macOS launchd #29180

@zhonghui5207

Bug Description

On macOS, the gateway's /restart command (and any other code path that asks the gateway to relaunch via the service manager) does not actually trigger a launchd-driven restart. The gateway exits with code 0, launchd's KeepAlive { SuccessfulExit: false } policy treats that as "stopped successfully", and the gateway stays down until the user manually re-bootstraps it.

Same code path works correctly on Linux/systemd.

Steps to Reproduce

  1. Install the gateway as a launchd service (the standard macOS deployment via hermes gateway install).
  2. Confirm it's running: launchctl list ai.hermes.gateway shows a PID.
  3. Send /restart to the bot (or trigger any code path that calls _handle_restart_command).
  4. The gateway gracefully drains and exits.
  5. Wait — and observe that launchd does not relaunch it. launchctl list ai.hermes.gateway still references the previous (now-dead) PID and no new process spawns. Telegram / Discord / Feishu adapters all stay disconnected.

Expected vs Actual

Expected: After /restart, the gateway exits and launchd brings it right back up — same behaviour as systemd on Linux.

Actual: The gateway exits cleanly (code 0) and stays down. launchctl list shows the stale PID and no relaunch happens.

Operating System

macOS 15.4 (Darwin 25.4.0)

Python Version

3.11.14

Hermes Version

Working off main (HEAD 6a6766fb8).

Additional Logs / Traceback (optional)

~/.hermes/logs/gateway-exit-diag.log for a working systemd-style restart shows:

{"tag": "asyncio.run.SystemExit", "code": 75}
{"tag": "gateway.start", "pid": <new>}

For the failing macOS launchd /restart, the SystemExit-75 line is missing entirely: the gateway falls through to return True → sys.exit(0), and the next gateway.start entry only shows up much later, when the user manually runs launchctl kickstart -k.
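The pattern described above can be checked mechanically. The following is a minimal sketch, assuming the diagnostic log holds one JSON object per line with the "tag" and "code" fields shown in the excerpt; the function name restarts_missing_exit_75 is hypothetical and not part of Hermes.

```python
import json


def restarts_missing_exit_75(lines):
    """Return indices of gateway.start entries that were not preceded by a
    SystemExit(75) entry since the previous gateway.start.

    The first gateway.start in the input is never flagged, because there is
    no earlier run in the log to compare against.
    """
    missing = []
    saw_exit_75 = False
    seen_first_start = False
    for index, raw in enumerate(lines):
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        tag = entry.get("tag")
        if tag == "asyncio.run.SystemExit" and entry.get("code") == 75:
            saw_exit_75 = True
        elif tag == "gateway.start":
            if seen_first_start and not saw_exit_75:
                missing.append(index)
            seen_first_start = True
            saw_exit_75 = False
    return missing
```

Feeding it the lines of ~/.hermes/logs/gateway-exit-diag.log (for example via open(path).readlines()) should flag the gateway.start entries that followed a manual launchctl kickstart -k rather than a service-manager restart.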

Root Cause Analysis

In gateway/run.py (~line 9720), the gateway decides between two restart strategies:

_under_service = bool(os.environ.get("INVOCATION_ID"))  # systemd sets this
_in_container = os.path.exists("/.dockerenv") or os.path.exists("/run/.containerenv")
if _under_service or _in_container:
    self.request_restart(detached=False, via_service=True)
else:
    self.request_restart(detached=True, via_service=False)

INVOCATION_ID is set only by systemd. macOS launchd uses a different convention — it injects XPC_SERVICE_NAME and XPC_FLAGS into the environment of managed jobs but does not set INVOCATION_ID.

So under launchd, _under_service is False, the code takes the detached-subprocess branch, and request_restart(via_service=False) flows through to the exit path:

# gateway/run.py ~line 18162
if runner._restart_via_service:
    raise SystemExit(75)
return True

Because _restart_via_service=False, the SystemExit(75) branch is skipped and the function falls through to return True → sys.exit(0). launchd's KeepAlive { SuccessfulExit: false } policy then refuses to relaunch a "successful" exit.
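To make the exit-code dependency concrete, here is a simplified model of the KeepAlive decision. It is a sketch only: the plist is a trimmed-down, hypothetical job definition (only the label and the KeepAlive policy come from this report), the helper launchd_would_relaunch is invented for illustration, and real launchd behaviour also involves other KeepAlive keys and throttling that are ignored here.

```python
import plistlib

# Hypothetical, trimmed-down job definition; only the label and the
# KeepAlive policy are taken from this report.
JOB_PLIST = plistlib.dumps({
    "Label": "ai.hermes.gateway",
    "KeepAlive": {"SuccessfulExit": False},
})


def launchd_would_relaunch(plist_bytes, exit_code):
    """Simplified model of the KeepAlive / SuccessfulExit decision."""
    job = plistlib.loads(plist_bytes)
    keep_alive = job.get("KeepAlive", False)
    if isinstance(keep_alive, bool):
        # Plain boolean: relaunch unconditionally or never.
        return keep_alive
    successful_exit = keep_alive.get("SuccessfulExit")
    if successful_exit is None:
        raise ValueError("this sketch only models the SuccessfulExit key")
    exited_cleanly = exit_code == 0
    # SuccessfulExit=True restarts after clean exits; False is the inverse.
    return exited_cleanly if successful_exit else not exited_cleanly
```

Under this model, launchd_would_relaunch(JOB_PLIST, 0) is False and launchd_would_relaunch(JOB_PLIST, 75) is True, which matches the difference between the sys.exit(0) path and the SystemExit(75) path.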

The detached-subprocess fallback (the branch the code does take) doesn't actually start a replacement process under launchd either, because launchd reparents the spawned subprocess and tears it down when the parent exits — same mechanism the _under_service block already documents for systemd KillMode=mixed.

Proposed Fix

Extend the probe to recognise launchd:

_under_service = bool(
    os.environ.get("INVOCATION_ID")        # systemd (Linux) sets this
    or os.environ.get("XPC_SERVICE_NAME")  # launchd (macOS) sets this
)

XPC_SERVICE_NAME is set by launchd for every managed job (LimitLoadToSessionType does not affect this). I've verified it is present in the live gateway process on macOS 15.4. The variable is launchd-specific so it can't false-positive on a Linux box.
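A regression test for the probe could look roughly like the sketch below. The helper running_under_service_manager and its launchd_label parameter are hypothetical; the real code inlines the check in gateway/run.py. The optional label comparison is an assumption on my part: shells started from a macOS terminal reportedly can inherit an XPC_SERVICE_NAME value such as "0", so matching against the expected job label (ai.hermes.gateway) may be a stricter variant worth considering.

```python
def running_under_service_manager(environ, launchd_label=None):
    """Return True if the environment looks like a managed service.

    environ is any mapping of environment variables (e.g. os.environ).
    launchd_label is optional; when given, XPC_SERVICE_NAME must equal it.
    """
    if environ.get("INVOCATION_ID"):  # systemd (Linux)
        return True
    xpc_name = environ.get("XPC_SERVICE_NAME", "")  # launchd (macOS)
    if launchd_label is not None:
        return xpc_name == launchd_label
    return bool(xpc_name)


def test_probe_recognises_systemd_and_launchd():
    assert running_under_service_manager({"INVOCATION_ID": "abc123"})
    assert running_under_service_manager(
        {"XPC_SERVICE_NAME": "ai.hermes.gateway"}
    )
    assert not running_under_service_manager({})
    # Stricter variant: an inherited value does not count as "managed".
    assert not running_under_service_manager(
        {"XPC_SERVICE_NAME": "0"}, launchd_label="ai.hermes.gateway"
    )
```

Passing a plain dict instead of os.environ keeps the test independent of the machine it runs on.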

PR with the fix and a regression test: see linked PR.

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR.

Labels

  • P2: Medium, degraded but workaround exists
  • comp/gateway: Gateway runner, session dispatch, delivery
  • type/bug: Something isn't working
