fix(macos): skip redundant kickstart -k after SIGTERM to prevent restart race on launchd #10317
Open
AJV20 wants to merge 6 commits into
685f181 to 7f746e3
- Add ThrottleInterval=5s to launchd plist so rapid-restart cycles (from the update race condition) resolve within 5 seconds instead of the launchd default of 10s
- Fix launchd_restart(): after SIGTERM causes the gateway to exit, launchd KeepAlive already starts a new instance; skip the redundant kickstart -k call, which was killing that freshly started instance and triggering an unnecessary throttle delay
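A minimal sketch of the plist keys this commit describes, using the standard-library `plistlib`. The function name `build_gateway_plist`, the label, and the program arguments are hypothetical placeholders; only `KeepAlive`/`SuccessfulExit` and `ThrottleInterval` come from the commit message, and the real `generate_launchd_plist()` may be structured differently.

```python
import plistlib


def build_gateway_plist(label, program_args):
    """Return launchd job XML (bytes) with KeepAlive and a 5 s throttle.

    `label` and `program_args` are illustrative inputs, not the project's
    actual values.
    """
    job = {
        "Label": label,
        "ProgramArguments": list(program_args),
        # Relaunch only when the job exits with a non-zero status.
        "KeepAlive": {"SuccessfulExit": False},
        # Minimum seconds between launches; launchd's default is 10.
        "ThrottleInterval": 5,
    }
    return plistlib.dumps(job)


if __name__ == "__main__":
    xml = build_gateway_plist("com.example.gateway", ["/usr/bin/true"])
    print(xml.decode("utf-8"))
```

Serializing through `plistlib` rather than a hand-written XML template keeps the boolean and integer types correct, which matters because launchd reads `ThrottleInterval` as an integer.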
/proc/<pid>/stat is Linux-only; on macOS it always returns None, leaving start_time: null in gateway_state.json. This breaks session identity validation and causes token counts to stay at 0 in /status output. Fall back to `ps -p <pid> -o lstart=` on platforms where /proc is absent, parsing the human-readable date via email.utils.parsedate.
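The fallback described in this commit could look roughly like the sketch below. The helper names `parse_lstart` and `process_start_time` are hypothetical, and the sketch assumes `ps` prints `lstart` in the C/English locale (for example `Mon May  4 10:15:30 2026`); other locales may produce month names that `email.utils.parsedate` does not recognize.

```python
import subprocess
import time
from email.utils import parsedate


def parse_lstart(text):
    """Convert `ps -o lstart=` output to a local-time epoch, or None."""
    parsed = parsedate(text.strip())
    if parsed is None:
        return None
    # parsedate returns a 9-tuple suitable for time.mktime (local time).
    return time.mktime(parsed)


def process_start_time(pid):
    """Hypothetical helper: ask ps for the start time of `pid`."""
    try:
        out = subprocess.run(
            ["ps", "-p", str(pid), "-o", "lstart="],
            capture_output=True,
            text=True,
            timeout=5,
            check=False,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return None
    # Empty output (no such pid) parses to None.
    return parse_lstart(out)
```

Returning `None` on any failure mirrors the behaviour the commit message attributes to the `/proc` path, so callers would not need a separate error branch.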
…x empty reload error message
- list_resources/list_prompts: servers that don't implement optional MCP capabilities return -32601 (Method not found). Treat this as DEBUG not ERROR to eliminate false-alarm log noise on every gateway startup.
- MCP reload: use repr(e) so empty exception messages don't produce a blank "❌ MCP reload failed:" line in the UI.
- register_mcp_servers: clear stale thread interrupt flag before MCP discovery so reused executor threads from prior agent sessions don't cancel the discovery coroutine (fixes CancelledError on reconnect).
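The first two points of this commit can be illustrated with a small sketch. `RpcError` is a hypothetical stand-in for whatever exception type the MCP client raises, and `log_level_for` and `reload_failure_message` are illustrative names, not functions from `tools/mcp_tool.py`.

```python
import logging

METHOD_NOT_FOUND = -32601  # JSON-RPC 2.0 "Method not found"


class RpcError(Exception):
    """Hypothetical exception carrying a JSON-RPC error code."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self.code = code


def log_level_for(exc):
    """DEBUG for an unimplemented optional capability, ERROR otherwise."""
    if getattr(exc, "code", None) == METHOD_NOT_FOUND:
        return logging.DEBUG
    return logging.ERROR


def reload_failure_message(exc):
    """Use repr() so an exception with no message still names its type."""
    return f"❌ MCP reload failed: {exc!r}"
```

With `str(e)`, an exception raised without arguments renders as an empty string; `repr(e)` always includes at least the class name, which is why the UI line can no longer come out blank.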
d74069e to afde7ab
This was referenced May 1, 2026
This was referenced May 13, 2026
…fter-update # Conflicts: # gateway/run.py # gateway/status.py # tools/mcp_tool.py
Author
Updated this PR branch in
What changed:
Verification:
Current pushed head:
…fter-update # Conflicts: # hermes_cli/gateway.py
Problem
On macOS, running `hermes update` from a standalone terminal leaves the gateway unresponsive for ~10 seconds after the update completes (or longer under load).

Root cause
`launchd_restart()` sends SIGTERM to the old gateway. The gateway exits with code 1 (non-zero), which immediately triggers launchd's `KeepAlive` (`SuccessfulExit=false`), and a new gateway instance starts. Then `launchd_restart()` calls `launchctl kickstart -k`, which kills that freshly started instance within milliseconds of it starting. launchd sees a job that exited almost immediately, applies its default `ThrottleInterval` (10 s), and delays the final restart.

The `kickstart -k` call is only safe when KeepAlive is not configured (i.e., the service won't auto-restart on its own). With `KeepAlive.SuccessfulExit=false`, SIGTERM → exit 1 → launchd restart is already a complete, race-free cycle.

This affects all macOS users who:
- (… `hermes gateway start`)
- run `hermes update` from a standalone terminal (the common case)

The `_request_gateway_self_restart` shortcut (SIGUSR1, no kickstart) only fires when the gateway is an ancestor of the calling process, which is not the case for a normal terminal `hermes update`.

Fix
`hermes_cli/gateway.py`:
- `launchd_restart()`: after SIGTERM causes the gateway to exit cleanly, return early instead of calling `kickstart -k`. launchd's KeepAlive is already restarting the service; the extra kickstart only interferes.
- `generate_launchd_plist()`: add `ThrottleInterval=5` so that if the race does occur (drain timeout path), launchd resolves it in 5 s instead of the 10 s default.

Testing
Also verified `hermes gateway stop` still works correctly: it uses `launchctl bootout`, which fully unloads the service, so KeepAlive never fires for an intentional stop.
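The control flow of the fix can be sketched as follows. This is not the code in `hermes_cli/gateway.py`: `restart_gateway`, `wait_for_exit`, and `pid_alive` are hypothetical names, the 10-second timeout is an arbitrary illustration, and the `kickstart` argument stands in for whatever wraps `launchctl kickstart -k`.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this pid currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def wait_for_exit(pid, timeout=10.0, poll=0.1):
    """Poll until the process is gone or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(poll)
    return not pid_alive(pid)


def restart_gateway(pid, kickstart, *, send_signal=os.kill, exited=wait_for_exit):
    """SIGTERM the gateway; only fall back to kickstart if it did not exit."""
    send_signal(pid, signal.SIGTERM)
    if exited(pid):
        # KeepAlive (SuccessfulExit=false) is already relaunching the job,
        # so a kickstart here would kill the fresh instance.
        return "keepalive"
    kickstart()  # drain-timeout path: force the restart
    return "kickstart"
```

Passing `send_signal`, `exited`, and `kickstart` as parameters is a choice made for this sketch so the branch logic can be exercised without a real launchd job.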