fix(macos): skip redundant kickstart -k after SIGTERM to prevent restart race on launchd #10317

Open: AJV20 wants to merge 6 commits into NousResearch:main from AJV20:fix/launchd-restart-after-update

AJV20 commented Apr 15, 2026

Problem

On macOS, running hermes update from a standalone terminal leaves the gateway unresponsive for ~10 seconds after the update completes (or longer under load).

Root cause

launchd_restart() sends SIGTERM to the old gateway. The gateway exits with code 1 (non-zero), which immediately triggers launchd's KeepAlive (SuccessfulExit=false) — a new gateway instance starts. Then launchd_restart() calls launchctl kickstart -k, which kills that freshly-started instance within milliseconds of it starting. launchd sees a job that exited almost immediately, applies its default ThrottleInterval (10 s), and delays the final restart.

The kickstart -k call is only safe when KeepAlive is not configured (i.e., the service won't auto-restart on its own). With KeepAlive.SuccessfulExit=false, SIGTERM → exit 1 → launchd restart is already a complete, race-free cycle.

This affects all macOS users who:

  • Have the gateway installed as a launchd service (hermes gateway start)
  • Run hermes update from a standalone terminal (the common case)

The _request_gateway_self_restart shortcut (SIGUSR1, no kickstart) only fires when the gateway is an ancestor of the calling process — which is not the case for a normal terminal hermes update.
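To illustrate why the SIGUSR1 shortcut does not apply here, the ancestor test can be sketched as a walk up the parent-PID chain. This is a minimal sketch, not the code in this repository: `is_ancestor` and the injected `parent_of` lookup are hypothetical names chosen for the example.

```python
from typing import Callable, Optional


def is_ancestor(
    candidate_pid: int,
    start_pid: int,
    parent_of: Callable[[int], Optional[int]],
) -> bool:
    """Return True if candidate_pid appears in start_pid's parent chain.

    parent_of is a hypothetical lookup (pid -> parent pid, or None if
    unknown); it is injected so the walk can be exercised without
    touching real processes.
    """
    seen = set()  # guards against a malformed, cyclic parent table
    pid = parent_of(start_pid)
    while pid is not None and pid > 0 and pid not in seen:
        if pid == candidate_pid:
            return True
        seen.add(pid)
        pid = parent_of(pid)
    return False


# A standalone terminal: shell (200) -> hermes update (300); gateway is 500.
parents = {300: 200, 200: 100, 100: 1, 1: 0, 500: 1}
print(is_ancestor(500, 300, parents.get))  # gateway is not an ancestor
print(is_ancestor(200, 300, parents.get))  # the shell is
```

With a standalone terminal the gateway PID never shows up in the chain, so the update falls through to launchd_restart().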

Fix

hermes_cli/gateway.py, launchd_restart(): once SIGTERM has caused the gateway to exit, return early instead of calling kickstart -k. launchd's KeepAlive is already restarting the service; the extra kickstart only interferes.
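The control flow of the early return can be sketched roughly as follows. This is a sketch under stated assumptions rather than the actual diff: `restart_via_sigterm`, `pid_alive`, the `drain_timeout` default, and the injected `kickstart` callable are all hypothetical names, and falling back to kickstart only on the drain-timeout path is inferred from the description above.

```python
import os
import signal
import time
from typing import Callable


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def restart_via_sigterm(
    pid: int,
    kickstart: Callable[[], None],
    drain_timeout: float = 30.0,  # assumed value, for illustration only
    send_signal: Callable[[int, int], None] = os.kill,
    is_alive: Callable[[int], bool] = pid_alive,
    poll_interval: float = 0.1,
) -> str:
    """Ask the gateway to exit, then let launchd's KeepAlive restart it.

    Returns "keepalive" if the process exited before the timeout (no
    kickstart issued), or "kickstart" if the fallback had to be used.
    """
    send_signal(pid, signal.SIGTERM)
    deadline = time.monotonic() + drain_timeout
    while time.monotonic() < deadline:
        if not is_alive(pid):
            # Non-zero exit plus KeepAlive.SuccessfulExit=false means
            # launchd is already relaunching; kickstart -k would kill
            # the new instance.
            return "keepalive"
        time.sleep(poll_interval)
    kickstart()  # hypothetical wrapper around `launchctl kickstart -k`
    return "kickstart"
```

Injecting `send_signal`, `is_alive`, and `kickstart` keeps the decision logic testable without a real launchd job.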

generate_launchd_plist(): add ThrottleInterval=5 so that if the race does occur (drain timeout path), launchd resolves it in 5 s instead of the 10 s default.
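For reference, a plist carrying both keys can be produced with the standard-library plistlib. This is only an illustrative sketch: `build_gateway_plist`, the label, and the program arguments are placeholders, not what generate_launchd_plist() actually emits.

```python
import plistlib


def build_gateway_plist(
    label: str,
    program_args: list[str],
    throttle_seconds: int = 5,
) -> bytes:
    """Serialize a launchd job definition as XML plist bytes.

    Only KeepAlive and ThrottleInterval come from the PR description;
    the label and arguments are supplied by the caller.
    """
    job = {
        "Label": label,
        "ProgramArguments": program_args,
        # Restart only when the job exits with a non-zero status.
        "KeepAlive": {"SuccessfulExit": False},
        # Minimum seconds between launches; launchd defaults to 10.
        "ThrottleInterval": throttle_seconds,
    }
    return plistlib.dumps(job)


# Placeholder label and command, for illustration only.
xml = build_gateway_plist("com.example.gateway", ["/usr/bin/true"])
print(xml.decode("utf-8"))
```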

Testing

# Verify gateway comes back immediately after update on macOS
hermes update
# Before fix: gateway offline for ~10s
# After fix: gateway back within 1-2s of the update completing

Also verified hermes gateway stop still works correctly — it uses launchctl bootout which fully unloads the service, so KeepAlive never fires for an intentional stop.

AJV20 force-pushed the fix/launchd-restart-after-update branch from 685f181 to 7f746e3 on April 15, 2026 14:12
AJV20 added 3 commits April 17, 2026 20:04
- Add ThrottleInterval=5s to launchd plist so rapid-restart cycles
  (from the update race condition) resolve within 5 seconds instead of
  the launchd default of 10s
- Fix launchd_restart(): after SIGTERM causes the gateway to exit,
  launchd KeepAlive already starts a new instance; skip the redundant
  kickstart -k call which was killing that freshly-started instance and
  triggering an unnecessary throttle delay
/proc/<pid>/stat is Linux-only; on macOS it always returns None, leaving
start_time: null in gateway_state.json. This breaks session identity
validation and causes token counts to stay at 0 in /status output.

Fall back to `ps -p <pid> -o lstart=` on platforms where /proc is
absent, parsing the human-readable date via email.utils.parsedate.
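The fallback described in this commit message could look roughly like the sketch below. It assumes the start time is stored as a local-time epoch float and that `ps` prints `lstart` in the default C-locale layout (for example `Wed Apr 15 14:12:03 2026`); `parse_lstart` and `process_start_time` are hypothetical names, not the functions in gateway/status.py.

```python
import subprocess
import time
from email.utils import parsedate
from typing import Optional


def parse_lstart(text: str) -> Optional[float]:
    """Convert `ps -o lstart=` output to an epoch timestamp.

    email.utils.parsedate is lenient enough to accept the ctime-style
    layout; it returns None when the text cannot be parsed.
    """
    parsed = parsedate(text.strip())
    if parsed is None:
        return None
    return time.mktime(parsed)  # lstart is reported in local time


def process_start_time(pid: int) -> Optional[float]:
    """Best-effort start time for platforms without /proc."""
    try:
        result = subprocess.run(
            ["ps", "-p", str(pid), "-o", "lstart="],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        return None  # `ps` is unavailable
    return parse_lstart(result.stdout)
```

An unknown PID produces empty output from `ps`, which `parse_lstart` maps to None.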
…x empty reload error message

- list_resources/list_prompts: servers that don't implement optional MCP
  capabilities return -32601 (Method not found). Treat this as DEBUG not
  ERROR to eliminate false-alarm log noise on every gateway startup.
- MCP reload: use repr(e) so empty exception messages don't produce a
  blank "❌ MCP reload failed:" line in the UI.
- register_mcp_servers: clear stale thread interrupt flag before MCP
  discovery so reused executor threads from prior agent sessions don't
  cancel the discovery coroutine (fixes CancelledError on reconnect).
…fter-update

# Conflicts:
#	gateway/run.py
#	gateway/status.py
#	tools/mcp_tool.py
AJV20 commented May 22, 2026

Updated this PR branch in ef2cb9ca6 after the latest main merge.

What changed:

  • Merged current origin/main into fix/launchd-restart-after-update.
  • Resolved conflicts where needed by preserving this PR's scoped behavior while keeping current upstream changes.

Verification:

  • git diff --check; python3.11 -m py_compile gateway/run.py gateway/status.py hermes_cli/gateway.py tools/mcp_tool.py

Current pushed head: ef2cb9ca6e08

Labels

  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • comp/gateway: Gateway runner, session dispatch, delivery
  • P2: Medium (degraded but workaround exists)
  • type/bug: Something isn't working
