fix(gateway): detect macOS launchd in service-restart path #29181

Open
zhonghui5207 wants to merge 1 commit into NousResearch:main from zhonghui5207:fix/gateway-detect-launchd-for-service-restart

@zhonghui5207

What does this PR do?

Fixes /restart (and other _handle_restart_command code paths) so that they actually trigger a launchd-driven relaunch on macOS, instead of exiting cleanly and leaving the gateway down.

Bug

After running /restart on a macOS host where the gateway is managed by launchd, the gateway drains and exits with code 0. Because the launchd plist uses KeepAlive { SuccessfulExit: false }, launchd treats the clean exit as "stopped on purpose" and refuses to relaunch. The user has to manually run launchctl kickstart -k to bring the gateway back up. The same flow works correctly on Linux/systemd.
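For readers unfamiliar with the KeepAlive policy mentioned above, here is a minimal sketch that renders such a job definition with Python's plistlib. Only the Label and the KeepAlive dictionary come from this PR; the ProgramArguments value is a placeholder assumption, not the command that hermes gateway install actually writes.

```python
import plistlib

# Hypothetical job definition: only Label and KeepAlive mirror the PR text.
job = {
    "Label": "ai.hermes.gateway",
    # Assumed command line, for illustration only.
    "ProgramArguments": ["/usr/local/bin/hermes", "gateway", "run"],
    # Relaunch the job only when it exits with a non-zero status.
    "KeepAlive": {"SuccessfulExit": False},
}

# plistlib.dumps returns bytes in XML plist format by default.
plist_xml = plistlib.dumps(job).decode("utf-8")
print(plist_xml)
```

With SuccessfulExit set to false, an exit status of 0 is the one case in which launchd leaves the job stopped.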

Root cause

The gateway picks its restart strategy in gateway/run.py (~line 9720):

_under_service = bool(os.environ.get("INVOCATION_ID"))  # systemd sets this
_in_container = os.path.exists("/.dockerenv") or os.path.exists("/run/.containerenv")
if _under_service or _in_container:
    self.request_restart(detached=False, via_service=True)
else:
    self.request_restart(detached=True, via_service=False)

INVOCATION_ID is set only by systemd. macOS launchd uses XPC_SERVICE_NAME / XPC_FLAGS instead — never INVOCATION_ID. So on macOS, _under_service is always False, the detached-subprocess branch is taken, and via_service=False flows to the exit path:

# gateway/run.py ~line 18162
if runner._restart_via_service:
    raise SystemExit(75)
return True

SystemExit(75) is the contract launchd recognises ("unsuccessful exit" → KeepAlive relaunches). Because the branch is skipped, the function returns True → sys.exit(0), which launchd interprets as a successful, intentional stop. The gateway stays down.
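The chain described above can be modelled in a few lines. This is a simplified sketch, not the code in gateway/run.py: old_exit_code and launchd_relaunches are hypothetical helpers, and the environment is passed in as a plain mapping so the behaviour can be checked on any platform.

```python
from typing import Mapping

EXIT_RESTART_VIA_SERVICE = 75  # non-zero exit used on the service-restart path


def old_exit_code(env: Mapping[str, str], in_container: bool = False) -> int:
    """Exit code produced by /restart with the systemd-only probe."""
    under_service = bool(env.get("INVOCATION_ID"))  # set by systemd only
    via_service = under_service or in_container
    return EXIT_RESTART_VIA_SERVICE if via_service else 0


def launchd_relaunches(exit_code: int) -> bool:
    """KeepAlive {SuccessfulExit: false}: relaunch only after a non-zero exit."""
    return exit_code != 0


# Environment of a launchd-managed job: XPC_SERVICE_NAME set, no INVOCATION_ID.
launchd_env = {"XPC_SERVICE_NAME": "ai.hermes.gateway"}
code = old_exit_code(launchd_env)
print(code, launchd_relaunches(code))  # prints: 0 False
```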

Fix

Extend the probe to recognise launchd as a service manager:

_under_service = bool(
    os.environ.get("INVOCATION_ID")        # systemd (Linux) sets this
    or os.environ.get("XPC_SERVICE_NAME")  # launchd (macOS) sets this
)

XPC_SERVICE_NAME is injected by launchd for every managed job and is launchd-specific (no false positives on Linux). With this change, /restart on macOS takes the via_service=True branch and exits with code 75, which launchd recognises and relaunches.
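A standalone sketch of the extended probe, written as a pure function so it can be exercised without touching os.environ. The function name under_service_manager is hypothetical; the PR itself changes the inline expression shown above.

```python
from typing import Mapping


def under_service_manager(env: Mapping[str, str]) -> bool:
    """True when either systemd's or launchd's marker variable is non-empty."""
    return bool(
        env.get("INVOCATION_ID")        # systemd (Linux)
        or env.get("XPC_SERVICE_NAME")  # launchd (macOS)
    )


print(under_service_manager({"INVOCATION_ID": "abc123"}))                # True
print(under_service_manager({"XPC_SERVICE_NAME": "ai.hermes.gateway"}))  # True
print(under_service_manager({}))                                         # False
print(under_service_manager({"XPC_SERVICE_NAME": ""}))                   # False
```

Note that the check is a truthiness test: any non-empty value counts. If XPC_SERVICE_NAME happens to be present in an interactive shell, a gateway started by hand from that shell would also take the service-restart path, so it may be worth checking echo $XPC_SERVICE_NAME on the target host.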

Related Issue

Fixes #29180

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/run.py — extend the _under_service probe to also recognise launchd (XPC_SERVICE_NAME)
  • tests/gateway/test_restart_notification.py
    • New test_restart_command_uses_service_restart_under_launchd asserts that XPC_SERVICE_NAME triggers via_service=True
    • Existing test_restart_command_uses_detached_without_systemd now also clears XPC_SERVICE_NAME so it asserts the genuine "no service manager" case
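The two test cases listed above follow a common pattern: set or clear the marker variables, then assert which branch is chosen. The sketch below illustrates that pattern with unittest.mock.patch.dict; pick_via_service is a hypothetical stand-in for the probe, and the real tests in tests/gateway/test_restart_notification.py exercise _handle_restart_command instead.

```python
import os
from unittest.mock import patch


def pick_via_service() -> bool:
    """Stand-in for the probe inside _handle_restart_command (hypothetical)."""
    return bool(
        os.environ.get("INVOCATION_ID") or os.environ.get("XPC_SERVICE_NAME")
    )


def test_service_restart_under_launchd() -> None:
    # patch.dict restores os.environ to its original state on exit.
    with patch.dict(os.environ, {"XPC_SERVICE_NAME": "ai.hermes.gateway"}):
        os.environ.pop("INVOCATION_ID", None)
        assert pick_via_service() is True


def test_detached_without_service_manager() -> None:
    # Clear both markers so the test covers the genuine "no service manager" case,
    # even when the test runner itself was started by systemd or launchd.
    with patch.dict(os.environ):
        os.environ.pop("INVOCATION_ID", None)
        os.environ.pop("XPC_SERVICE_NAME", None)
        assert pick_via_service() is False
```

Clearing XPC_SERVICE_NAME in the negative test matters because the variable may already be set in the environment the tests run in.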

How to Test

Automated:

./venv/bin/python -m pytest tests/gateway/test_restart_notification.py -v

26 tests pass on macOS 15.4 / Python 3.11.14.

Manual (macOS):

  1. Install the gateway as a launchd service (hermes gateway install).
  2. Confirm it's up: launchctl list ai.hermes.gateway shows a PID.
  3. Send /restart to the bot.
  4. After the gateway drains and exits, observe launchctl list ai.hermes.gateway showing a new PID within ~1 second (instead of staying on the stale one).
  5. ~/.hermes/logs/gateway-exit-diag.log should now contain a SystemExit code=75 entry followed immediately by a new gateway.start entry.
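Step 5 can be checked mechanically. The sketch below assumes only what the step states, namely that one log line contains the text "SystemExit code=75" and the next contains "gateway.start"; the rest of the line format is unknown here, and the sample lines are made up for illustration.

```python
from typing import Iterable


def relaunched_after_restart(lines: Iterable[str]) -> bool:
    """True if a 'SystemExit code=75' line is immediately followed by 'gateway.start'."""
    previous_was_exit = False
    for line in lines:
        if previous_was_exit and "gateway.start" in line:
            return True
        previous_was_exit = "SystemExit code=75" in line
    return False


# Hypothetical log excerpts; the real entries may carry timestamps or other fields.
good = ["drain complete", "SystemExit code=75", "gateway.start pid=4242"]
bad = ["drain complete", "SystemExit code=0"]
print(relaunched_after_restart(good), relaunched_after_restart(bad))  # True False
```

To run it against the real file, pass the lines of ~/.hermes/logs/gateway-exit-diag.log, for example via pathlib.Path(...).expanduser().read_text().splitlines().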

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(gateway): …)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix
  • I've run pytest tests/gateway/test_restart_notification.py -v and all 26 tests pass
  • I've added a regression test for the launchd case
  • I've tested on my platform: macOS 15.4 (Darwin 25.4.0)

Documentation & Housekeeping

  • N/A — no docs/config changes needed for this bug fix
  • N/A — no cli-config.yaml.example keys touched
  • N/A — no architecture/workflow changes
  • Cross-platform impact considered: the new env-var check is launchd-specific; Linux/systemd behaviour is preserved (test test_restart_command_uses_service_restart_under_systemd still passes)
  • N/A — no tool descriptions/schemas changed

The /restart command uses an environment-variable probe to decide
between two restart strategies:

- service-restart path: exit with code 75 so a service manager
  (systemd / launchd) relaunches us
- detached-subprocess path: spawn a new gateway via setsid + bash

The probe only checked for systemd's INVOCATION_ID env var, so on
macOS launchd it always picked the detached path. The gateway then
exited with code 0, which launchd's KeepAlive { SuccessfulExit: false }
policy interprets as "stopped successfully — do not relaunch", leaving
the gateway down until manually bootstrapped.

Extend the probe to also recognise launchd by checking the
XPC_SERVICE_NAME env var (launchd injects this for managed jobs;
INVOCATION_ID is systemd-specific).

Tests:
- New test_restart_command_uses_service_restart_under_launchd verifies
  that XPC_SERVICE_NAME triggers via_service=True.
- The existing detached-without-systemd test now also clears
  XPC_SERVICE_NAME so it asserts the true "no service manager" case.
@alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) on May 20, 2026
@alt-glitch

Collaborator

Duplicate of #19940 — same XPC_SERVICE_NAME launchd detection fix. See also #24898 and #24954 (previously flagged as dups of #19940).

Development

Successfully merging this pull request may close these issues.

[Bug]: /restart does not relaunch the gateway under macOS launchd

2 participants