fix(gateway): detect launchd in the /restart service-manager probe#43888

Open
chazmaniandinkle wants to merge 1 commit into
NousResearch:mainfrom
chazmaniandinkle:fix/restart-launchd-service-detection

Conversation

@chazmaniandinkle

Fixes #43475.

On a launchd-managed gateway (macOS), /restart stops the gateway and never relaunches it. The handler's service-manager probe checks only INVOCATION_ID (systemd) and container markers, so under launchd it takes the detached path and exits 0. The generated plist uses KeepAlive.SuccessfulExit=false, which treats a clean exit as a deliberate stop, so the gateway stays silently dead until a manual launchctl kickstart.

The fix detects launchd via XPC_SERVICE_NAME, which launchd sets to the job label. Two details distinguish it from the earlier attempts:

  1. XPC_SERVICE_NAME=0 must not count as launchd. Interactive macOS shells inherit XPC_SERVICE_NAME=0, a truthy string. The bare bool() probe in #19940/#33393 would route an unsupervised interactive gateway to the service path, where it exits 75 and nothing revives it. This probe excludes "" and "0" (verified: launchd jobs see the real label via ps eww; interactive shells see 0).
  2. Route via via_service=True rather than forcing a non-zero exit on the detached path (the #43498/#43596 approach). The detached path spawns a relaunch helper; exiting non-zero there means the helper and launchd both respawn the gateway, and two instances then race for the same bot tokens, which Telegram and Discord reject (one connection per bot). The service path spawns no helper, so launchd is the single respawner.
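
A minimal sketch of the probe described in the two points above, for illustration only. The function names `launchd_job_label` and `via_service` are hypothetical, the truthiness check on INVOCATION_ID is an assumption, and the container-marker check in the real handler is omitted because this page does not show it.

```python
import os
from typing import Mapping, Optional


def launchd_job_label(env: Mapping[str, str]) -> Optional[str]:
    """Return the launchd job label, or None when not running under launchd."""
    # Default to "0" so a missing variable is treated like an interactive shell.
    name = env.get("XPC_SERVICE_NAME", "0")
    # "" and "0" are the two values that must not count as a launchd job.
    return name if name not in ("", "0") else None


def via_service(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical probe: True when a service manager is expected to respawn us."""
    # systemd sets INVOCATION_ID for units it starts (assumed truthiness check).
    if env.get("INVOCATION_ID"):
        return True
    # The real handler also checks container markers; that part is not sketched here.
    return launchd_job_label(env) is not None
```

Passing the environment as a mapping keeps the sketch testable without touching the real process environment; calling `via_service()` with no argument reads `os.environ`.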

This also targets the current handler location in gateway/slash_commands.py; the older probe PRs predate the move out of gateway/run.py.

Tests: tests/gateway/test_restart_service_detection.py pins all four routings (launchd job label, interactive =0, no service env, systemd INVOCATION_ID), built on the existing restart test helpers.
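
The four routings could be pinned with a table-driven check along these lines. This is a sketch, not the contents of tests/gateway/test_restart_service_detection.py: `probe_via_service` is a hypothetical stand-in for the handler's probe, and the existing restart test helpers are not used here.

```python
import os
from typing import List
from unittest import mock


def probe_via_service() -> bool:
    """Hypothetical stand-in for the handler's probe (not the project's code)."""
    if os.environ.get("INVOCATION_ID"):
        return True
    return os.environ.get("XPC_SERVICE_NAME", "0") not in ("", "0")


# (description, environment, expected via_service)
CASES = [
    ("launchd job label", {"XPC_SERVICE_NAME": "ai.hermes.gateway"}, True),
    ("interactive =0", {"XPC_SERVICE_NAME": "0"}, False),
    ("no service env", {}, False),
    ("systemd INVOCATION_ID", {"INVOCATION_ID": "0123abcd"}, True),
]


def run_cases() -> List[str]:
    """Return the descriptions of any cases whose routing is wrong."""
    failures = []
    for name, env, expected in CASES:
        # clear=True empties os.environ for the duration of the block, so the
        # host's own XPC_SERVICE_NAME or INVOCATION_ID cannot leak into the case.
        with mock.patch.dict(os.environ, env, clear=True):
            if probe_via_service() is not expected:
                failures.append(name)
    return failures
```

Clearing the environment matters on a macOS development machine, where the test runner itself typically inherits XPC_SERVICE_NAME=0.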

Production validation: running on two launchd-managed gateways since 2026-06-10; an earlier version of the same probe ran from 2026-06-03 (noted on #33393).

…esearch#43475)

On a launchd-managed gateway (macOS), /restart stopped the gateway but
never relaunched it: the handler's service detection checks only
INVOCATION_ID (systemd) and container markers, so under launchd it takes
the detached path and exits 0 — which KeepAlive.SuccessfulExit=false
treats as a deliberate stop. The gateway stays silently dead until a
manual launchctl kickstart.

Detect launchd via XPC_SERVICE_NAME, which launchd sets to the job label
for processes it spawns. The probe deliberately excludes the literal
"0": interactive macOS shells inherit XPC_SERVICE_NAME=0 (a truthy
string), and routing an unsupervised interactive gateway to the service
path would make it exit non-zero with nothing to revive it.

Routing through via_service=True (rather than forcing a non-zero exit
on the detached path) matters: the detached path also spawns a helper
that relaunches the gateway, so exiting non-zero there would have BOTH
the helper and launchd respawn it — two gateways racing for the same
bot tokens. The service path spawns no helper; launchd is the single
respawner.

Fixes NousResearch#43475. Supersedes the run.py-era probes in NousResearch#19940/NousResearch#33393 (the
handler has since moved to gateway/slash_commands.py) and avoids the
double-spawn risk in the exit-code-site approaches (NousResearch#43498, NousResearch#43596).
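
The double-spawn argument in the commit message can be modelled as a count of the parties that may try to relaunch the gateway after it exits. This is a toy model under stated assumptions: `RestartPlan` and `potential_respawners` are hypothetical names, the exit code 75 comes from the PR description, and the exit code 1 for the rejected approach is a placeholder. It does not model whether the detached helper survives under launchd, which this page does not spell out.

```python
from dataclasses import dataclass

EXIT_OK = 0
EXIT_RESTART = 75  # exit code of the service path, per the PR description


@dataclass(frozen=True)
class RestartPlan:
    exit_code: int
    spawns_helper: bool


def launchd_revives(exit_code: int) -> bool:
    # KeepAlive.SuccessfulExit=false: launchd relaunches only after a non-zero exit.
    return exit_code != 0


def potential_respawners(plan: RestartPlan, under_launchd: bool) -> int:
    """Count the parties that may start a new gateway after this exit."""
    count = 1 if plan.spawns_helper else 0
    if under_launchd and launchd_revives(plan.exit_code):
        count += 1
    return count


SERVICE_PATH = RestartPlan(exit_code=EXIT_RESTART, spawns_helper=False)
DETACHED_PATH = RestartPlan(exit_code=EXIT_OK, spawns_helper=True)
# The exit-code-site approach: keep the helper, but exit non-zero (1 is a placeholder).
DETACHED_NONZERO = RestartPlan(exit_code=1, spawns_helper=True)
```

Under launchd, `SERVICE_PATH` yields one potential respawner and `DETACHED_NONZERO` yields two. Outside launchd, `SERVICE_PATH` yields none, which is why an interactive shell with XPC_SERVICE_NAME=0 must not be routed there.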

@alt-glitch added the type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), and P2 (Medium: degraded but workaround exists) labels on Jun 11, 2026
@liuhao1024

Contributor

Verification: LGTM — correct launchd detection with proper edge-case handling.

Reviewed the diff and tests. Key observations:

  1. XPC_SERVICE_NAME check correctly treats "0" (interactive macOS shell inheritance) as non-service, while catching real launchd job labels like "ai.hermes.gateway"
  2. os.environ.get("XPC_SERVICE_NAME", "0") not in ("", "0") — the default "0" ensures missing env var falls through to detached path (correct)
  3. All 4 test cases cover: launchd (service path), interactive shell ("0" → detached), no env (detached), systemd (INVOCATION_ID → service path)
  4. Existing behavior preserved: systemd detection via INVOCATION_ID unchanged

No issues found.
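
Observations 1 and 2 can be checked by comparing a bare truthiness probe with the guarded expression quoted in the review. The sketch below is illustrative: `bare_probe` approximates the earlier approach as this PR describes it, and both function names are hypothetical.

```python
from typing import Mapping


def bare_probe(env: Mapping[str, str]) -> bool:
    # Approximation of the earlier approach: any non-empty value counts as launchd.
    return bool(env.get("XPC_SERVICE_NAME"))


def guarded_probe(env: Mapping[str, str]) -> bool:
    # Expression quoted in the review: the default "0" makes a missing
    # variable fall through to the detached path.
    return env.get("XPC_SERVICE_NAME", "0") not in ("", "0")


interactive = {"XPC_SERVICE_NAME": "0"}
launchd_job = {"XPC_SERVICE_NAME": "ai.hermes.gateway"}
unset: Mapping[str, str] = {}

# The two probes agree on a real job label and on a missing variable,
# and differ only for the interactive-shell value "0".
disagreements = [
    env for env in (interactive, launchd_job, unset)
    if bare_probe(env) != guarded_probe(env)
]
```

The only entry in `disagreements` is the interactive case, because `bool("0")` is `True` in Python.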

@tonydwb tonydwb left a comment

Code Review Summary

Verdict: Approved

✅ Looks Good

  • Well-scoped fix: Adds launchd detection to the /restart service-manager probe, so a launchd-managed gateway takes the service path.
  • Small change: 50 additions, 5 deletions - minimal and targeted.
  • No security concerns: Local service detection logic, no external input or secrets.

Reviewed by Hermes Agent

Labels

comp/gateway (Gateway runner, session dispatch, delivery), P2 (Medium: degraded but workaround exists), type/bug (Something isn't working)

Development

Successfully merging this pull request may close these issues.

/restart bricks a launchd-managed gateway on macOS — exits 0, KeepAlive.SuccessfulExit=false won't revive it

4 participants