fix(gateway): remove early return in launchd_restart() for SIGUSR1 path#11934

Closed
Ding-tech777 wants to merge 1 commit into
NousResearch:mainfrom
Ding-tech777:fix/launchd-restart-early-return

@Ding-tech777

Summary

Fixes #11932

Small fix: remove the early return after SIGUSR1 in launchd_restart() and change the second if pid is not None to elif, so that both code paths converge on _wait_for_gateway_exit() followed by launchctl kickstart -k.

Before (buggy)

pid = get_running_pid()
if pid is not None and _request_gateway_self_restart(pid):  # SIGUSR1 path
    print("✓ Service restart requested")
    return                          # ← returns immediately, no kickstart
if pid is not None:                 # SIGTERM path
    terminate_pid(pid, force=False)
    # ... wait + kickstart
subprocess.run(["launchctl", "kickstart", "-k", target])

When the gateway PID is an ancestor of the calling process, SIGUSR1 is sent and launchd_restart() returns immediately. The gateway exits with code 75, but nothing calls launchctl kickstart, so launchd does not restart it. The service stays dead until someone runs hermes gateway start manually.
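
To make the "ancestor" condition concrete, here is a minimal sketch of an ancestry check over a pid-to-parent-pid map. The helper name is_ancestor, the parent_of mapping, and the example pids are all hypothetical; this is not necessarily how the project decides which signal to send.

```python
def is_ancestor(candidate_pid, start_pid, parent_of):
    """Return True if candidate_pid appears on start_pid's chain of parents.

    parent_of is a hypothetical {pid: parent_pid} snapshot of the process table.
    """
    seen = set()
    current = parent_of.get(start_pid)
    while current is not None and current not in seen:
        if current == candidate_pid:
            return True
        seen.add(current)  # guard against a malformed, cyclic map
        current = parent_of.get(current)
    return False


# Hypothetical tree: launchd (1) -> gateway (400) -> agent shell (812) -> CLI (950)
# and an unrelated process (700) started directly by launchd.
parents = {950: 812, 812: 400, 400: 1, 700: 1}
```

With this map, a CLI running as pid 950 sees the gateway (400) as an ancestor, which corresponds to the SIGUSR1 case described above, while pid 700 does not.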

After (fixed)

pid = get_running_pid()
if pid is not None and _request_gateway_self_restart(pid):  # SIGUSR1 path
    print("✓ Service restart requested")                   # no return — falls through
elif pid is not None:                 # SIGTERM path (only if SIGUSR1 was NOT sent)
    terminate_pid(pid, force=False)
    # ... wait
subprocess.run(["launchctl", "kickstart", "-k", target])    # always executed

Both paths now reach kickstart -k, ensuring the gateway is always restarted.
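
The difference between the two snippets can be traced with a small model that records steps instead of sending signals. This is only a sketch: restart_trace and its parameters are hypothetical, and it assumes (as the summary states) that both paths wait for the gateway to exit before kickstart.

```python
def restart_trace(pid, self_restart_ok, early_return):
    """Return the ordered steps a launchd_restart()-like flow would take.

    early_return=True models the "before" snippet, False the "after" snippet.
    self_restart_ok stands in for _request_gateway_self_restart(pid) succeeding.
    """
    steps = []
    if pid is not None and self_restart_ok:
        steps.append("SIGUSR1")
        if early_return:
            return steps  # old behaviour: stop here, kickstart is never reached
    elif pid is not None:
        steps.append("SIGTERM")
    if pid is not None:
        steps.append("wait_for_exit")  # assumed to apply to both signal paths
    steps.append("kickstart")
    return steps


if __name__ == "__main__":
    print("before:", restart_trace(123, True, early_return=True))
    print("after: ", restart_trace(123, True, early_return=False))
```

In this model only the SIGUSR1 path changes between the two variants; the SIGTERM path and the no-running-gateway path produce the same steps either way.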

Impact

  • SIGUSR1 path (ancestor process): Now correctly waits for exit and kickstarts. Previously broken — gateway died permanently.
  • SIGTERM path (non-ancestor): Unchanged behavior — already worked correctly.
  • Normal /update command: Unaffected — uses setsid to detach from gateway process tree, takes SIGTERM path.
  • Linux: Unaffected — uses systemd_restart(), separate code path.

Testing

  • Syntax verified via py_compile.
  • Local deployment tested: hermes update via agent terminal tool now correctly recovers gateway.
  • Existing tests unaffected (launchd is macOS-only, not covered by CI on Linux).
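
For reference, a py_compile check like the one mentioned above can be scripted as follows. The helper name syntax_ok is hypothetical, and a check of this kind only confirms that the file compiles; it says nothing about runtime behavior.

```python
import py_compile


def syntax_ok(path):
    """Return True if the file at path compiles, False on a syntax error."""
    try:
        # doraise=True turns compile errors into PyCompileError
        # instead of printing them to stderr.
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False
```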

Related

When launchd_restart() sends SIGUSR1 (gateway is ancestor of current
process), it previously returned immediately without waiting for exit
or issuing kickstart. The gateway would exit with code 75 but launchd
would not restart it, leaving the service permanently dead.

Remove the early return and change the second pid check to elif so
both SIGUSR1 and SIGTERM paths converge on _wait_for_gateway_exit()
+ launchctl kickstart -k.

Refs: NousResearch#11932
@teknium1

Contributor

Thanks @growing-future-coder! Holding off on this one — the current early return after a successful SIGUSR1 restart is intentional: self-restart already rebooted the gateway, and falling through to launchctl kickstart -k would stop the freshly-started process and kick it again, potentially racing with whatever the just-restarted gateway is doing. Your underlying observation about #11932 (launchd not auto-reviving after exit 75) is real and worth investigating, but we need to fix it at the plist level (KeepAlive semantics) rather than by adding a double-restart. Closing for now — feel free to reopen with a plist-level fix if you can reproduce the exit-75 scenario.
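
As one possible reading of the plist-level suggestion above, launchd's KeepAlive key accepts a dictionary, and SuccessfulExit set to false asks launchd to restart the job whenever it exits with a non-zero status (which would include 75). The sketch below builds such a job definition with Python's plistlib. The function name, label, and program path are hypothetical, and whether the project's plist already sets this, or whether this is the fix the maintainers have in mind, is not established in this thread.

```python
import plistlib


def build_plist(label, program_arguments):
    """Return launchd job XML that restarts the job after non-zero exits."""
    job = {
        "Label": label,
        "ProgramArguments": list(program_arguments),
        "RunAtLoad": True,
        # Restart only when the process exits unsuccessfully (non-zero status).
        "KeepAlive": {"SuccessfulExit": False},
    }
    return plistlib.dumps(job)


if __name__ == "__main__":
    # Hypothetical label and path, for illustration only.
    xml = build_plist("com.example.gateway", ["/path/to/gateway"])
    print(xml.decode("utf-8"))
```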

@teknium1 closed this on Apr 23, 2026
@alt-glitch added labels on Apr 23, 2026: type/bug (Something isn't working), P1 (High: major feature broken, no workaround), comp/cli (CLI entry point, hermes_cli/, setup wizard), comp/gateway (Gateway runner, session dispatch, delivery)

Development

Successfully merging this pull request may close these issues.

macOS: launchd_restart() returns early after SIGUSR1, leaving gateway permanently dead
