Skip to content

fix(gateway): wire launchd_restart to drain-aware SIGUSR1 path (#27745)#27781

Closed
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/launchd-restart-drain-aware-27745
Closed

fix(gateway): wire launchd_restart to drain-aware SIGUSR1 path (#27745)#27781
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/launchd-restart-drain-aware-27745

Conversation

@briandevans

@briandevans briandevans commented May 18, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

hermes gateway restart on macOS (launchd) bypassed its own SIGUSR1 drain handler when invoked from a fresh shell — operator-requested restarts went straight to SIGTERM and killed in-flight agent runs. launchd_restart (hermes_cli/gateway.py:3021) gated its only SIGUSR1 call on _request_gateway_self_restart, which is itself guarded by _is_pid_ancestor_of_current_process. That ancestor guard is correct for the in-process self-restart case (gateway/run.py handlers asking their own gateway ancestor to restart) but returns False whenever the calling CLI process is a sibling under launchd — which is exactly the fresh-shell case. The fallback was terminate_pid(pid, force=False) (SIGTERM, no drain wait). This PR switches launchd_restart to the unconditional drain-aware helper _graceful_restart_via_sigusr1(...), matching what systemd_restart (line 2585) and the macOS hermes update flow already use. On successful drain, runs launchctl kickstart (without -k) to bypass launchd's throttle while preserving the drained in-flight runs.

Related Issue

Fixes #27745

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • hermes_cli/gateway.pylaunchd_restart now calls _graceful_restart_via_sigusr1(pid, drain_timeout + 5) unconditionally (same budget as systemd_restart). On success, launchctl kickstart without -k bypasses throttle but preserves drained runs. On failure, the original SIGTERM + kickstart -k fallback is kept intact.
  • tests/hermes_cli/test_gateway_service.py — new test_launchd_restart_drains_then_kickstarts_without_force (mocks a non-ancestor PID, verifies SIGUSR1 fires and unforced launchctl kickstart follows), and rewrote test_launchd_restart_falls_back_to_sigterm_when_drain_fails to mock the drain failing and verify the SIGTERM + kickstart -k path is preserved.

How to Test

  1. uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/hermes_cli/test_gateway_service.py -v -k launchd — 13 passed.
  2. Regression guard: before the fix, _request_gateway_self_restart was mocked to False in the drain test, proving the SIGTERM path. After the fix that mock is unreachable; the new test exercises the SIGUSR1 path explicitly and asserts no SIGTERM call.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run focused tests for the touched code and all pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15.x (launchd is the affected platform)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Sibling Audit

Sibling cleanup: the legacy _request_gateway_self_restart helper is now unused outside its own definition. Happy to drop it (and the underlying _is_pid_ancestor_of_current_process) in this PR or a follow-up if preferred.

Related

Copilot AI review requested due to automatic review settings May 18, 2026 05:38

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates launchd_restart() to prefer a drain-aware SIGUSR1 restart path (with an immediate unforced launchctl kickstart) and adjusts tests to cover the new drain-first behavior and SIGTERM fallback.

Changes:

  • Replace the launchd “self-restart request” path with _graceful_restart_via_sigusr1(...) drain behavior.
  • On successful drain, run launchctl kickstart (without -k) to avoid killing in-flight work.
  • Update/rename gateway service tests to validate the new drain/kickstart flow and fallback behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
tests/hermes_cli/test_gateway_service.py Updates tests to validate SIGUSR1 drain + unforced kickstart, and fallback-to-SIGTERM when drain fails.
hermes_cli/gateway.py Changes launchd restart logic to use SIGUSR1 drain path and unforced kickstart to respawn immediately.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread hermes_cli/gateway.py
@@ -3026,8 +3026,19 @@ def launchd_restart():

try:
pid = get_running_pid()
if pid is not None and _request_gateway_self_restart(pid):
print("✓ Service restart requested")
if pid is not None and _graceful_restart_via_sigusr1(pid, drain_timeout + 5):
Comment thread hermes_cli/gateway.py
Comment on lines +3036 to 3042
subprocess.run(
["launchctl", "kickstart", target],
check=False,
timeout=30,
)
print("✓ Service restarted (drained)")
return
Comment thread hermes_cli/gateway.py
Comment on lines +3036 to +3040
subprocess.run(
["launchctl", "kickstart", target],
check=False,
timeout=30,
)
Comment thread hermes_cli/gateway.py
@@ -3026,8 +3026,19 @@ def launchd_restart():

try:
pid = get_running_pid()
if pid is not None and _request_gateway_self_restart(pid):
print("✓ Service restart requested")
if pid is not None and _graceful_restart_via_sigusr1(pid, drain_timeout + 5):
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery comp/cli CLI entry point, hermes_cli/, setup wizard labels May 18, 2026
@briandevans briandevans force-pushed the fix/launchd-restart-drain-aware-27745 branch 2 times, most recently from fdc7649 to 7fd7be5 Compare May 24, 2026 01:16
…esearch#27745)

`hermes gateway restart` on macOS (launchd) called the ancestor-guarded
`_request_gateway_self_restart`, which returns False when the calling
shell isn't a descendant of the gateway.  Operators running the command
from a fresh terminal therefore fell straight through to SIGTERM, no
SIGUSR1 was ever sent, and the gateway's drain handler never ran —
in-flight agent runs were SIGKILL'd after the short terminate timeout.

systemd_restart and the macOS `hermes update` flow already use the
unconditional `_graceful_restart_via_sigusr1` helper.  This change brings
launchd_restart in line: send SIGUSR1, wait up to drain_timeout + 5 for
the planned exit-75, then `launchctl kickstart` (without -k) to respawn
immediately instead of waiting for KeepAlive's throttle.  The SIGTERM
+ kickstart -k fallback is preserved for the case where the drain
doesn't complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@briandevans briandevans force-pushed the fix/launchd-restart-drain-aware-27745 branch from 7fd7be5 to 521a5a9 Compare May 26, 2026 07:10
@briandevans

Copy link
Copy Markdown
Contributor Author

Closing to focus the queue on security/file-safety work where civilian merges are landing. Happy to reopen if maintainers want this picked up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

macOS launchd: hermes gateway restart never invokes _graceful_restart_via_sigusr1, always takes non-drain SIGTERM path from fresh shells

3 participants