fix(gateway/macos): kill orphan gateway processes on launchd restart; add macOS cmdline fallback #12374
Open
Wonham wants to merge 1 commit into
… add macOS cmdline fallback

launchd_restart() only killed the PID tracked in gateway.pid. When that file was stale (pointing to a dead PID), the Python kill step was skipped entirely. launchctl kickstart -k then only killed the launchd-tracked process, leaving any manually-started or previously-orphaned gateway alive. That surviving process held the platform token lock (e.g. Weixin), causing the new gateway to fail with "Weixin bot token already in use".

Fix 1 (hermes_cli/gateway.py): Add _kill_orphan_gateway_pids(), which calls the existing find_gateway_pids() to locate all gateway processes not tracked by launchd, then SIGTERMs them (with a SIGKILL escalation after 1 s). Called from launchd_restart() before kickstart -k and also on the SIGUSR1 self-restart path.

Fix 2 (gateway/status.py): _read_process_cmdline() fell through to returning None on macOS because /proc/<pid>/cmdline does not exist on Darwin. This made _looks_like_gateway_process() always return False, disabling the cmdline-based staleness check in acquire_scoped_lock() on macOS. Add a ps(1) fallback so cmdline resolution works on macOS/BSD.

Relates to NousResearch#5109 (launchd can start a second gateway on macOS).
Relates to NousResearch#11718 (--replace race condition: multiple instances).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
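The ps(1) fallback described in Fix 2 could look roughly like the sketch below. It is an illustration only, not the code in this PR: the function name `read_process_cmdline`, the 5-second timeout, and the exact `ps` flags (`-ww -p <pid> -o command=`) are assumptions chosen for the example.

```python
import subprocess
from pathlib import Path
from typing import Optional


def read_process_cmdline(pid: int) -> Optional[str]:
    """Return the command line of `pid`, or None if it cannot be determined.

    Hypothetical helper for illustration; not the function from gateway/status.py.
    """
    # Linux: /proc/<pid>/cmdline holds the arguments separated by NUL bytes.
    try:
        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
        if raw:
            text = raw.replace(b"\x00", b" ").decode("utf-8", errors="replace")
            return text.strip()
    except OSError:
        pass  # no /proc (macOS/BSD), process gone, or not readable

    # macOS/BSD fallback: ask ps(1) for the full command line.
    # "-ww" disables column truncation; "command=" suppresses the header row.
    try:
        result = subprocess.run(
            ["ps", "-ww", "-p", str(pid), "-o", "command="],
            capture_output=True,
            text=True,
            timeout=5,  # assumed value, so a hung ps cannot block the caller
            check=False,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # ps missing, or it timed out
    if result.returncode != 0:
        return None  # ps found no such process
    return result.stdout.strip() or None
```

A caller can then compare the returned string against the expected gateway command line before trusting a PID read from a lock file, which is the kind of identity check that os.kill(pid, 0) alone cannot provide.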
This was referenced Apr 24, 2026
This was referenced May 13, 2026
Problem

On macOS, after hermes gateway restart, platform channels (e.g. Weixin/WeChat) stop responding. Logs show the new gateway failing with "Weixin bot token already in use".

Root Cause
1. launchd_restart() leaves orphan gateway processes alive

launchd_restart() only kills the PID recorded in gateway.pid. When that file is stale (points to a dead PID), the Python kill step is skipped entirely. launchctl kickstart -k then kills only the single launchd-tracked process; any manually-started or previously-orphaned gateway process survives and continues holding the platform token lock (e.g. weixin-bot-token). The new gateway fails to acquire that lock and the platform doesn't connect. find_gateway_pids() already exists and can locate all gateway processes via ps, but launchd_restart() never calls it.

2. _read_process_cmdline() always returns None on macOS

The function reads /proc/<pid>/cmdline, which is Linux-only. On macOS/Darwin this always fails silently, making _looks_like_gateway_process() always return False. This disables the cmdline-based process identity check in acquire_scoped_lock() on macOS: the only remaining check is os.kill(pid, 0), which can't distinguish a recycled PID from a live gateway.

Fix
hermes_cli/gateway.py

Add _kill_orphan_gateway_pids(): calls find_gateway_pids(exclude_pids=service_pids) to locate all gateway processes not tracked by launchd, SIGTERMs them, then escalates to SIGKILL after 1 s. Called from launchd_restart() before kickstart -k and also on the SIGUSR1 self-restart path.

gateway/status.py

Add a ps(1) fallback to _read_process_cmdline() so cmdline resolution works on macOS/BSD when /proc is unavailable.

Related Issues
gateway run --replace race condition: multiple instances run simultaneously

Testing
Reproduced on macOS 15.4 (Darwin 25.4.0) with launchd-managed gateway:
hermes gateway restart while a manually-started gateway holds the Weixin lock
kickstart -k, new gateway acquires lock successfully
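For reference, the SIGTERM-then-SIGKILL escalation described under Fix can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's _kill_orphan_gateway_pids(): the names `terminate_pids` and `_is_alive`, the polling interval, and the return value are all hypothetical, and the PID list is assumed to come from something like find_gateway_pids().

```python
import os
import signal
import time
from typing import Iterable, List


def _is_alive(pid: int) -> bool:
    """Probe a PID with signal 0 (no signal is delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate_pids(pids: Iterable[int], grace_seconds: float = 1.0) -> List[int]:
    """SIGTERM every PID, wait up to `grace_seconds`, then SIGKILL survivors.

    Returns the PIDs that had to be SIGKILLed. Hypothetical helper.
    """
    pending: List[int] = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
            pending.append(pid)
        except (ProcessLookupError, PermissionError):
            continue  # already gone, or not ours to signal

    # Poll until every process has exited or the grace period runs out.
    deadline = time.monotonic() + grace_seconds
    while pending and time.monotonic() < deadline:
        pending = [pid for pid in pending if _is_alive(pid)]
        if pending:
            time.sleep(0.05)

    killed: List[int] = []
    for pid in pending:
        try:
            os.kill(pid, signal.SIGKILL)
            killed.append(pid)
        except (ProcessLookupError, PermissionError):
            pass  # exited between the last probe and the kill
    return killed
```

One caveat with this probe: os.kill(pid, 0) also succeeds for a zombie child that the caller has not yet reaped, so the sketch suits orphaned processes (reaped by launchd or init) better than direct children of the calling process.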