fix(gateway): kickstart launchd after graceful restart #34366

Open
Jadoking wants to merge 4 commits into NousResearch:main from Jadoking:fix/launchd-kickstart-after-graceful-drain

@Jadoking commented May 29, 2026


Summary

  • Builds on the launchd service-manager detection work in fix(gateway): restart through launchd supervision #34230.
  • Keeps the graceful SIGUSR1 drain path, but does not stop after the old gateway exits.
  • Explicitly runs launchctl kickstart -k after graceful drain because macOS launchd can leave the job loaded but state = not running after exit 75 / EX_TEMPFAIL.
  • Adds regression coverage that the launchd restart path bootstraps if needed, performs graceful drain, and then kickstarts the job.
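
The graceful drain step can be pictured roughly as follows. This is a minimal sketch, not the PR's `_graceful_restart_via_sigusr1`, whose body is not shown here; `drain_via_sigusr1`, `pid_alive`, and the injectable `send`/`alive`/`sleep`/`clock` parameters are hypothetical names chosen for illustration.

```python
import os
import signal
import time


def pid_alive(pid):
    """Best-effort liveness probe: signal 0 checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def drain_via_sigusr1(pid, timeout, send=os.kill, alive=pid_alive,
                      poll_interval=0.2, sleep=time.sleep, clock=time.monotonic):
    """Send SIGUSR1, then wait up to `timeout` seconds for the process to exit.

    Returns True if the process is gone, False if it is still alive at the deadline.
    """
    try:
        send(pid, signal.SIGUSR1)
    except ProcessLookupError:
        return True  # nothing left to drain
    deadline = clock() + timeout
    while clock() < deadline:
        if not alive(pid):
            return True
        sleep(poll_interval)
    return not alive(pid)
```

A True result only says the old process exited. As the summary notes, launchd may still leave the job loaded but not running afterwards, which is why the restart path follows the drain with launchctl kickstart -k.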

Local verification on affected machine

Before the extra kickstart, this machine reproduced the failure after hermes gateway restart:

  • launchctl print gui/501/ai.hermes.gateway showed state = not running.
  • last exit code = 75: EX_TEMPFAIL.
  • No gateway PID was present even though the service was loaded.

After this change:

  • hermes gateway restart exits successfully.
  • PID changes from the old gateway to a new gateway process.
  • launchctl print gui/501/ai.hermes.gateway shows state = running.
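
The before/after check above can be scripted by pulling the two fields out of the launchctl print text. The sketch below assumes the lines look like the ones quoted in this description (`state = ...` and `last exit code = ...`); that output is meant for humans and its layout may change, so treat this as a diagnostic aid only. `JobStatus` and `parse_job_status` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class JobStatus(NamedTuple):
    state: Optional[str]
    last_exit_code: Optional[int]


# Assumed line shapes, taken from the lines quoted in the PR description.
_STATE_RE = re.compile(r"^\s*state = (.+?)\s*$")
_EXIT_RE = re.compile(r"^\s*last exit code = (\d+)")


def parse_job_status(print_output):
    """Return the first `state` and `last exit code` values found in the text."""
    state = None
    last_exit = None
    for line in print_output.splitlines():
        if state is None:
            match = _STATE_RE.match(line)
            if match:
                state = match.group(1)
                continue
        if last_exit is None:
            match = _EXIT_RE.match(line)
            if match:
                last_exit = int(match.group(1))
    return JobStatus(state, last_exit)
```

The input text would come from capturing the standard output of launchctl print gui/501/ai.hermes.gateway, for example with subprocess.run(..., capture_output=True, text=True).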

Test Plan

  • venv/bin/python -m py_compile hermes_cli/gateway.py gateway/run.py
  • venv/bin/python -m pytest tests/gateway/test_restart_notification.py tests/hermes_cli/test_gateway.py -q -o 'addopts='

chrometh and others added 3 commits May 28, 2026 21:26
Detect launchd-managed gateway processes during /restart and route restarts through the service manager instead of detached helpers. Bootstrap unloaded launchd jobs before self-signalling so macOS restarts do not strand the gateway.
@alt-glitch added labels on May 29, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), comp/cli (CLI entry point, hermes_cli/, setup wizard)
@liuhao1024

Contributor

I found one issue that looks worth fixing before merge.

hermes_cli/gateway.py, launchd_restart(): missing launchctl kickstart call after graceful drain

The test test_launchd_restart_bootstraps_unloaded_job_before_self_signal expects three recorded calls in this order, two subprocess.run invocations with the SIGUSR1 drain between them:

assert calls[0][1][:2] == ["launchctl", "bootstrap"]   # 1. bootstrap the service
assert calls[1] == ("sigusr1_restart", 4242, 35.0)      # 2. graceful drain via SIGUSR1
assert calls[2][1] == ["launchctl", "kickstart", "-k", "gui/501/ai.hermes.gateway"]  # 3. kickstart

But the production code in launchd_restart() only has:

if _graceful_restart_via_sigusr1(pid, drain_timeout + 5):
    # ... comment about kickstarting ...
    print("↻ Graceful drain complete; kickstarting launchd job")

There is no actual subprocess.run(["launchctl", "kickstart", "-k", ...]) call after the graceful drain succeeds. The comment says "explicitly kickstart the loaded job" and the print message says "kickstarting launchd job", but the code doesn't do it.

The test will fail at assert calls[2][0] == "run" because calls will only have 2 entries (bootstrap + sigusr1), not 3.
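
To make the expected call sequence concrete, here is a small sketch in which the kickstart sits after the if/else, so both the drained and the timed-out branch reach it, together with recording fakes that use the tuple shapes the quoted assertions imply. `restart_after_drain` and `record_calls` are hypothetical names, and the injected `sigusr1_restart` and `run` callables stand in for the real helper and subprocess.run.

```python
def restart_after_drain(pid, drain_timeout, target, sigusr1_restart, run):
    """Drain gracefully if possible, then always kickstart the launchd job."""
    if sigusr1_restart(pid, drain_timeout + 5):
        print("Graceful drain complete; kickstarting launchd job")
    else:
        print("Graceful drain timed out; forcing launchd kickstart")
    # Placed after the if/else on purpose: both branches fall through to it.
    run(["launchctl", "kickstart", "-k", target])


def record_calls(drain_succeeds):
    """Run the restart with fakes and return the recorded call log."""
    calls = []

    def fake_sigusr1(pid, timeout):
        calls.append(("sigusr1_restart", pid, timeout))
        return drain_succeeds

    def fake_run(argv):
        calls.append(("run", list(argv)))

    restart_after_drain(4242, 30.0, "gui/501/ai.hermes.gateway",
                        fake_sigusr1, fake_run)
    return calls
```

With this shape, record_calls(True) and record_calls(False) produce the same two entries: the drain attempt followed by the kickstart.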

Suggested fix: After the graceful drain print, add:

subprocess.run(
    ["launchctl", "kickstart", "-k", get_launchd_label()],
    check=True,
    timeout=30,
)

This also applies to the else (timeout) branch — if the drain times out, the gateway is still alive and launchd won't restart it. A kickstart there would force the restart.

@Jadoking

Author

Thanks — addressed in 4908cd7.

The production path already fell through to a shared launchctl kickstart -k <target> after the graceful SIGUSR1 block, which is why the existing test passed locally. But I agree the code was too easy to misread because the print/comment lived in the if body while the actual kickstart happened later after the branch.

Changes made:

  • factored the force restart into _launchd_kickstart(target, timeout=...) so the post-drain relaunch is explicit at the call site
  • kept the timeout branch falling through to the same forced kickstart path
  • changed the unloaded-job recovery branch to use kickstart -k as well
  • added coverage for the graceful-drain-timeout path
  • added coverage for unloaded job recovery: initial kickstart fails with launchd's unloaded code, then bootstrap, then kickstart -k
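
The unloaded-job recovery order described in the last bullet (kickstart fails, bootstrap, kickstart -k again) might look roughly like the sketch below. It is not the code from 4908cd7. The helper names, the injectable `run` parameter, and in particular `ASSUMED_UNLOADED_CODES` are assumptions: the PR does not state which exit status launchctl returns for an unloaded job, so 113 is a placeholder.

```python
import subprocess

# Assumption: placeholder exit status for "job not loaded"; not taken from the PR.
ASSUMED_UNLOADED_CODES = frozenset({113})


def _run_checked(argv, timeout):
    """Run a command, raising CalledProcessError on a non-zero exit."""
    subprocess.run(list(argv), check=True, timeout=timeout)


def launchd_kickstart(target, timeout=30.0, run=_run_checked):
    """Force-restart the job; -k kills a running instance before starting it."""
    run(["launchctl", "kickstart", "-k", target], timeout)


def kickstart_with_recovery(target, domain, plist_path, timeout=30.0,
                            run=_run_checked,
                            unloaded_codes=ASSUMED_UNLOADED_CODES):
    """Kickstart; if the job is unloaded, bootstrap it and kickstart again."""
    try:
        launchd_kickstart(target, timeout, run)
        return "kickstarted"
    except subprocess.CalledProcessError as exc:
        if exc.returncode not in unloaded_codes:
            raise  # unrelated failure: surface it to the caller
    run(["launchctl", "bootstrap", domain, plist_path], timeout)
    launchd_kickstart(target, timeout, run)
    return "bootstrapped-then-kickstarted"
```

Passing `run` as a parameter keeps the sequence testable without touching launchd: a fake can raise CalledProcessError on the first call and record the rest.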

Verification:

  • venv/bin/python -m py_compile hermes_cli/gateway.py gateway/run.py
  • venv/bin/python -m pytest tests/gateway/test_restart_notification.py tests/hermes_cli/test_gateway.py -q -o 'addopts='
  • Result: 64 passed in 0.85s

4 participants