
fix(gateway): use drain-aware SIGUSR1 in launchd_restart #35178

Open
liuhao1024 wants to merge 1 commit into NousResearch:main from liuhao1024:fix/launchd-drain-aware-restart

Conversation

@liuhao1024

Contributor

What does this PR do?

Fixes launchd_restart() to use the drain-aware _graceful_restart_via_sigusr1() helper instead of the ancestor-guarded _request_gateway_self_restart(). Previously, running hermes gateway restart from a fresh shell always fell through to SIGTERM because the shell is not a descendant of the gateway process, bypassing the drain path that cleanly finishes in-flight agent runs.
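To illustrate why the ancestor guard rejects a fresh shell, here is a minimal sketch of an ancestor check over a toy process table. The function name `is_ancestor`, the `parent_of` lookup, and the PIDs are all hypothetical; this is not the code in `_is_pid_ancestor_of_current_process()`, only the general shape of such a check.

```python
def is_ancestor(target_pid, start_pid, parent_of):
    """Return True if target_pid appears in start_pid's chain of parents.

    parent_of is a hypothetical lookup (pid -> parent pid) standing in for
    whatever process-table query a real helper would use.
    """
    seen = set()
    pid = parent_of(start_pid)
    while pid and pid not in seen:
        if pid == target_pid:
            return True
        seen.add(pid)
        pid = parent_of(pid)
    return False


# Toy process table: launchd (1) owns the gateway (100); an agent run (200)
# is a child of the gateway; a fresh terminal shell (300) and the CLI it
# starts (301) hang off launchd directly, not off the gateway.
PARENTS = {1: 0, 100: 1, 200: 100, 300: 1, 301: 300}
parent_of = PARENTS.get

from_agent = is_ancestor(100, 200, parent_of)  # CLI started inside the gateway
from_shell = is_ancestor(100, 301, parent_of)  # CLI started from a fresh shell
```

In this toy table the gateway is an ancestor of the agent run but not of the shell-launched CLI, which is the situation that made the old code path fall through to SIGTERM.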

Related Issue

Fixes #27745

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • hermes_cli/gateway.py: Replace _request_gateway_self_restart(pid) with _graceful_restart_via_sigusr1(pid, drain_timeout) in launchd_restart(), and update the success message to reflect the drain-aware behavior.
  • tests/hermes_cli/test_gateway_service.py: Update two launchd restart tests to mock _graceful_restart_via_sigusr1 instead of _request_gateway_self_restart, and adjust the assertions for the new function signature and output message.
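
The test change described above follows a common mocking pattern, sketched below with `unittest.mock`. The `gateway` namespace, the simplified `launchd_restart(pid, drain_timeout)` signature, and the values 4321 and 30.0 are assumptions made for illustration; the real tests in tests/hermes_cli/test_gateway_service.py may be structured differently.

```python
import types
from unittest import mock

# Hypothetical stand-in for the hermes_cli.gateway module.
gateway = types.SimpleNamespace()
gateway._graceful_restart_via_sigusr1 = lambda pid, drain_timeout: False


def launchd_restart(pid, drain_timeout):
    """Simplified stand-in: delegate to the drain-aware helper."""
    return gateway._graceful_restart_via_sigusr1(pid, drain_timeout)


# Patch the helper, call the function under test, then check the arguments.
with mock.patch.object(
    gateway, "_graceful_restart_via_sigusr1", return_value=True
) as helper:
    result = launchd_restart(4321, 30.0)

helper.assert_called_once_with(4321, 30.0)
```

Asserting on both arguments is what catches a signature mismatch such as a missing drain_timeout.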

How to Test

  1. Run pytest tests/hermes_cli/test_gateway_service.py -k "launchd_restart" -v — all 3 launchd restart tests should pass.
  2. On macOS with a running Hermes gateway managed by launchd: hermes gateway restart from a fresh terminal. The gateway should log a drain-aware restart (SIGUSR1) rather than an immediate SIGTERM.
  3. Verify that in-flight agent runs complete before the gateway exits (check gateway.log for drain completion messages).

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/hermes_cli/test_gateway_service.py -q and all launchd tests pass (the systemd tests already fail on macOS; this is pre-existing)
  • I've added tests for my changes (updated existing tests to match new code path)
  • I've tested on my platform: macOS

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide: the SIGUSR1 path is guarded by hasattr(signal, 'SIGUSR1'), so it returns False on Windows
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A
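
The cross-platform guard mentioned in the checklist can be sketched as follows. `send_drain_signal` and its injectable `kill` parameter are hypothetical names introduced for this example; the actual helper in hermes_cli/gateway.py also waits for the drain, which is omitted here.

```python
import os
import signal


def send_drain_signal(pid, kill=os.kill):
    """Send SIGUSR1 to pid if the platform defines it.

    Returns True if the signal was sent, False if SIGUSR1 is unavailable
    (for example on Windows) or the process could not be signalled.
    """
    if not hasattr(signal, "SIGUSR1"):
        return False
    try:
        kill(pid, signal.SIGUSR1)
    except (ProcessLookupError, PermissionError):
        return False
    return True
```

A caller that gets False back can fall back to its existing non-drain restart path, so the guard degrades to the old behavior instead of raising.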

Code Intelligence

  • Analyzed: launchd_restart() in hermes_cli/gateway.py (callers: 2 — gateway restart CLI and hermes update flow)
  • Analyzed: _graceful_restart_via_sigusr1() (callers: 2 — systemd_restart and now launchd_restart)
  • Blast radius: LOW — single function call change, same pattern already used by systemd_restart
  • Related patterns: systemd_restart() at line 2645 uses the same _graceful_restart_via_sigusr1(pid, drain_timeout + 5) pattern. The +5 buffer is not needed for launchd because launchctl kickstart handles the relaunch after exit.

launchd_restart() called _request_gateway_self_restart() which is
guarded by _is_pid_ancestor_of_current_process(). When invoked from a
fresh shell (not a gateway descendant), the ancestor check fails and
the code falls through to SIGTERM — bypassing the drain-aware SIGUSR1
path that cleanly finishes in-flight agent runs.

Use _graceful_restart_via_sigusr1() instead, matching the systemd path.
SIGUSR1 triggers gateway/run.py's request_restart(via_service=True)
which drains agent runs then exits with code 75. launchd's
KeepAlive.SuccessfulExit=false relaunches the process automatically.

Fixes NousResearch#27745
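
As a rough illustration of the flow in the commit message (SIGUSR1, then drain, then exit with code 75), here is a minimal sketch. `ToyGateway`, its `pending` list, and `install_handler` are hypothetical; only the `request_restart(via_service=True)` call and the exit code 75 come from the description above, and the real logic in gateway/run.py is not shown here.

```python
import signal
import threading

EXIT_RESTART = 75  # exit code the commit message says the gateway uses


class ToyGateway:
    """Hypothetical stand-in for the gateway runner."""

    def __init__(self):
        self._restart = threading.Event()
        self.via_service = False
        self.pending = []    # callables that each finish one in-flight run
        self.finished = []

    def request_restart(self, via_service=False):
        # Only record the request; the main loop does the draining.
        self.via_service = via_service
        self._restart.set()

    def install_handler(self):
        # SIGUSR1 does not exist on Windows, so guard the registration.
        if hasattr(signal, "SIGUSR1"):
            signal.signal(
                signal.SIGUSR1,
                lambda signum, frame: self.request_restart(via_service=True),
            )

    def run(self):
        self._restart.wait()
        while self.pending:  # drain in-flight runs before exiting
            self.finished.append(self.pending.pop(0)())
        return EXIT_RESTART


gw = ToyGateway()
gw.pending = [lambda: "run-a", lambda: "run-b"]
gw.request_restart(via_service=True)  # what the SIGUSR1 handler would do
exit_code = gw.run()
```

A real process would pass the returned code to sys.exit(); because 75 is nonzero, a launchd job configured with KeepAlive.SuccessfulExit=false treats the exit as unsuccessful and relaunches it.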
@alt-glitch added the labels type/bug, P2, comp/gateway, and comp/cli on May 30, 2026
@alt-glitch

Collaborator

Duplicate of closed #27781 — same fix wiring launchd_restart() to _graceful_restart_via_sigusr1() for #27745. #27781 was closed without merge; this PR re-implements the same approach.

Development

Successfully merging this pull request may close these issues.

macOS launchd: hermes gateway restart never invokes _graceful_restart_via_sigusr1, always takes non-drain SIGTERM path from fresh shells

2 participants