Skip to content

fix(gateway): kickstart launchd after SIGUSR1 self-restart on macOS#24993

Closed
konsisumer wants to merge 1 commit into
NousResearch:mainfrom
konsisumer:fix/launchd-restart-no-relaunch-after-sigusr1
Closed

fix(gateway): kickstart launchd after SIGUSR1 self-restart on macOS#24993
konsisumer wants to merge 1 commit into
NousResearch:mainfrom
konsisumer:fix/launchd-restart-no-relaunch-after-sigusr1

Conversation

@konsisumer

Copy link
Copy Markdown
Contributor

Fix the macOS launchd_restart() path that left the gateway permanently dead when hermes update ran inside the gateway process tree.

What changed and why

  • hermes_cli/gateway.pylaunchd_restart() no longer returns immediately after sending SIGUSR1. Both the SIGUSR1 path (when the gateway is an ancestor of the current process) and the SIGTERM path (when it is not) now fall through to _wait_for_gateway_exit and then launchctl kickstart -k. This mirrors the systemd restart path and matches what the issue author proposed.
  • The bug: when an agent invoked hermes update via its terminal tool, the gateway was an ancestor of the update process, so _request_gateway_self_restart succeeded and the function returned. The gateway drained and exited with code 75, but macOS launchd was observed to leave the domain in "on-demand-only mode" with no follow-up WILL_SPAWN, so the service stayed dead until manual hermes gateway start. Issuing launchctl kickstart -k unconditionally — which talks to launchd over XPC and works even after our ancestor process exits — restores the relaunch.
  • tests/hermes_cli/test_gateway_service.py — the existing test_launchd_restart_self_requests_graceful_restart_without_kickstart test was asserting precisely the buggy behavior (that launchctl should not run). It is replaced with test_launchd_restart_self_request_waits_then_kickstarts, which asserts the corrected sequence: SIGUSR1 request → _wait_for_gateway_exit with the drain timeout → launchctl kickstart -k <target>. The complementary test_launchd_restart_drains_running_gateway_before_kickstart (Path B, gateway not an ancestor) is unchanged and still passes.

How to test

  • pytest tests/hermes_cli/test_gateway_service.py -k launchd_restart -q — all 3 launchd_restart cases pass.
  • pytest tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_update_gateway_restart.py -q — 158 tests pass. 3 pre-existing systemd test failures on this Darwin host (UserSystemdUnavailableError: User D-Bus session is not available) are unrelated to this change and also fail on main.
  • Manual repro on a macOS host with the gateway running under launchd: from a Telegram conversation, ask the agent to run hermes update via the terminal tool. Before the fix, the gateway would exit with code 75 and not come back. After the fix, launchctl kickstart -k runs after the drain and the new gateway process is spawned within seconds.

What platforms tested on

  • macOS on darwin-arm64 (local) — unit tests; manual launchd repro is described above but requires a real LaunchAgent-managed gateway to verify.
  • Linux paths (systemd_restart) are untouched and were not exercised.

Fixes #11932

When `hermes update` runs inside the gateway process tree (e.g. invoked
by the agent via its terminal tool), `launchd_restart()` was taking the
SIGUSR1 path and returning immediately. The gateway then drained and
exited with code 75, but launchd was observed to leave the domain in
"on-demand-only mode" and never issue WILL_SPAWN, leaving the service
permanently dead until manual `hermes gateway start`.

Both branches now converge on `_wait_for_gateway_exit` + `launchctl
kickstart -k`, mirroring the systemd restart path. The `launchctl`
client talks to launchd over XPC, so the kickstart still works even
after the gateway (our ancestor) exits and we are reparented.

Fixes NousResearch#11932
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/gateway Gateway runner, session dispatch, delivery labels May 13, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Fix PR for #11932. Supersedes the earlier closed attempt #11934. Related: #12374 (orphan process cleanup), #10317 (SIGTERM race), #12438 (SIGUSR1 fails on crashed gateway).

@konsisumer

Copy link
Copy Markdown
Contributor Author

Thanks for the context, @alt-glitch. I've reviewed each cross-reference:

@konsisumer

Copy link
Copy Markdown
Contributor Author

Closing — this PR removes the immediate-return-after-SIGUSR1 in launchd_restart() but tests for that behavior still exist in files outside this PR's scope. A maintainer would need to either restore the removed code or remove the obsolete tests; the bot can't do either within scope. If the removal is still wanted, please reopen with the test cleanup attached.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

macOS: launchd_restart() returns early after SIGUSR1, leaving gateway permanently dead

2 participants