fix(gateway/macos): kill orphan gateway processes on launchd restart; add macOS cmdline fallback #12374

Open
Wonham wants to merge 1 commit into NousResearch:main from Wonham:fix/gateway-orphan-processes-and-macos-lock-stale


@Wonham Wonham commented Apr 19, 2026


Problem

On macOS, after hermes gateway restart, platform channels (e.g. Weixin/WeChat) stop responding. Logs show:

ERROR gateway.platforms.base: [Weixin] Weixin bot token already in use (PID XXXXX). Stop the other gateway first.
WARNING gateway.run: ✗ weixin failed to connect

Root Cause

1. launchd_restart() leaves orphan gateway processes alive

launchd_restart() only kills the PID recorded in gateway.pid. When that file is stale (points to a dead PID), the Python kill step is skipped entirely. launchctl kickstart -k then only kills the single launchd-tracked process — any manually-started or previously-orphaned gateway process survives and continues holding the platform token lock (e.g. weixin-bot-token). The new gateway fails to acquire that lock and the platform doesn't connect.

find_gateway_pids() already exists and can locate all gateway processes via ps, but launchd_restart() never calls it.
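To illustrate the kind of lookup involved, here is a minimal sketch of filtering `ps` output down to candidate gateway PIDs while excluding the ones the service manager already tracks. It is not the project's `find_gateway_pids()`: the function name `filter_gateway_pids`, the `needles` parameter, and the sample process lines are all hypothetical, and the input is assumed to look like the output of `ps -ax -o pid=,args=`.

```python
from typing import Iterable, List, Set


def filter_gateway_pids(
    ps_output: str,
    needles: Iterable[str],
    exclude_pids: Set[int],
) -> List[int]:
    """Return PIDs whose command line contains every needle.

    Hypothetical helper: expects one "<pid> <command line>" entry per line.
    """
    needles = tuple(needles)
    pids: List[int] = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        pid, command = int(parts[0]), parts[1]
        if pid in exclude_pids:
            continue  # already tracked, e.g. by launchd
        if all(needle in command for needle in needles):
            pids.append(pid)
    return pids


# Made-up sample output; real command lines will differ.
SAMPLE = """\
  101 /usr/bin/python3 hermes gateway run
  202 vim notes.txt
  303 /usr/bin/python3 hermes gateway run
"""

if __name__ == "__main__":
    print(filter_gateway_pids(SAMPLE, ["hermes", "gateway"], exclude_pids={101}))
```

With PID 101 treated as the launchd-tracked process, only 303 remains as an orphan candidate.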

2. _read_process_cmdline() always returns None on macOS

The function reads /proc/<pid>/cmdline, which is Linux-only. On macOS/Darwin, this always fails silently, making _looks_like_gateway_process() always return False. This disables the cmdline-based process identity check in acquire_scoped_lock() on macOS — the only remaining check is os.kill(pid, 0), which can't distinguish a recycled PID from a live gateway.
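A rough sketch of why the liveness probe alone is not enough: `os.kill(pid, 0)` only reports that some process owns the PID, so an identity check on the command line is needed to rule out a recycled PID. The names `pid_is_alive`, `lock_is_stale`, and the `markers` tuple are hypothetical, and the choice to treat an unreadable command line as "lock still held" is an assumption made for this example, not the behavior of `acquire_scoped_lock()`.

```python
import os
from typing import Callable, Optional, Tuple


def pid_is_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_is_stale(
    pid: int,
    read_cmdline: Callable[[int], Optional[str]],
    is_alive: Callable[[int], bool] = pid_is_alive,
    markers: Tuple[str, ...] = ("hermes", "gateway"),  # hypothetical markers
) -> bool:
    """Decide whether a lock recorded for `pid` can be reclaimed."""
    if not is_alive(pid):
        return True  # the holder is gone
    cmdline = read_cmdline(pid)
    if cmdline is None:
        # Identity cannot be verified. Assumption: stay conservative and
        # treat the lock as held rather than steal it.
        return False
    # The PID is alive but runs something else: it was recycled.
    return not all(marker in cmdline for marker in markers)
```

If `read_cmdline` always returns `None`, as described above for macOS, the second half of the check never runs and only the liveness probe decides the outcome.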

Fix

hermes_cli/gateway.py

Add _kill_orphan_gateway_pids(): calls find_gateway_pids(exclude_pids=service_pids) to locate all gateway processes not tracked by launchd, SIGTERMs them, then escalates to SIGKILL after 1 s. Called from launchd_restart() before kickstart -k and also on the SIGUSR1 self-restart path.
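The SIGTERM-then-SIGKILL escalation described above could look roughly like the sketch below. This is not the code in the PR: `terminate_then_kill` is a hypothetical name, the `kill` and `sleep` parameters exist only so the logic can be exercised without real processes, and the PID list is assumed to come from something like `find_gateway_pids(exclude_pids=service_pids)`.

```python
import os
import signal
import time
from typing import Callable, Iterable, List


def terminate_then_kill(
    pids: Iterable[int],
    grace_seconds: float = 1.0,
    kill: Callable[[int, int], None] = os.kill,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Send SIGTERM to each PID, wait, then SIGKILL any survivors.

    Returns the PIDs that accepted the initial SIGTERM.
    """
    signalled: List[int] = []
    for pid in pids:
        try:
            kill(pid, signal.SIGTERM)
        except (ProcessLookupError, PermissionError):
            continue  # already gone, or not ours to signal
        signalled.append(pid)

    if not signalled:
        return signalled

    sleep(grace_seconds)  # give the processes a chance to exit cleanly

    for pid in signalled:
        try:
            kill(pid, 0)  # signal 0 only probes for existence
            kill(pid, signal.SIGKILL)  # still alive: escalate
        except (ProcessLookupError, PermissionError):
            continue  # exited during the grace period
    return signalled
```

Probing with signal 0 before escalating keeps SIGKILL away from processes that already shut down on SIGTERM, although a small window for PID reuse during the grace period remains.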

gateway/status.py

Add a ps(1) fallback to _read_process_cmdline() so cmdline resolution works on macOS/BSD when /proc is unavailable.
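A minimal sketch of such a fallback, assuming the function tries `/proc` first and shells out to `ps` only when that fails. The name `read_process_cmdline` and the exact `ps` invocation are choices made for this example and may differ from the actual patch; `-o args=` is the POSIX format specifier for the full command line with the header suppressed, and `-ww` asks for unlimited output width.

```python
import os
import subprocess
from typing import Optional


def read_process_cmdline(pid: int) -> Optional[str]:
    """Return the command line for `pid`, or None if it cannot be determined."""
    # Linux fast path: /proc/<pid>/cmdline holds NUL-separated arguments.
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as fh:
            raw = fh.read()
        if raw:
            return raw.replace(b"\x00", b" ").decode("utf-8", errors="ignore").strip()
    except OSError:
        pass  # no /proc (macOS/BSD) or the process is gone; try ps below

    # Fallback for systems without /proc.
    try:
        result = subprocess.run(
            ["ps", "-ww", "-p", str(pid), "-o", "args="],
            capture_output=True,
            text=True,
            timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # ps missing or timed out
    if result.returncode != 0:
        return None  # ps found no such process
    return result.stdout.strip() or None


if __name__ == "__main__":
    print(read_process_cmdline(os.getpid()))
```

Spawning `ps` costs more than reading a file, so the sketch keeps `/proc` as the first attempt and only pays that cost where `/proc` is unavailable.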

Related Issues

Testing

Reproduced on macOS 15.4 (Darwin 25.4.0) with launchd-managed gateway:

  1. Run hermes gateway restart while a manually-started gateway holds the Weixin lock
  2. Before fix: new gateway logs "Weixin bot token already in use" and Weixin goes silent
  3. After fix: orphan is killed before kickstart -k, new gateway acquires lock successfully

… add macOS cmdline fallback

launchd_restart() only killed the PID tracked in gateway.pid.  When that
file was stale (pointing to a dead PID), the Python kill step was skipped
entirely.  launchctl kickstart -k then only killed the launchd-tracked
process, leaving any manually-started or previously-orphaned gateway alive.
That surviving process held the platform token lock (e.g. Weixin), causing
the new gateway to fail with "Weixin bot token already in use".

Fix 1 — hermes_cli/gateway.py:
  Add _kill_orphan_gateway_pids() which calls the existing find_gateway_pids()
  to locate all gateway processes not tracked by launchd, then SIGTERMs them
  (with a SIGKILL escalation after 1 s).  Called from launchd_restart() before
  kickstart -k and also on the SIGUSR1 self-restart path.

Fix 2 — gateway/status.py:
  _read_process_cmdline() fell through to returning None on macOS because
  /proc/<pid>/cmdline does not exist on Darwin.  This made
  _looks_like_gateway_process() always return False, disabling the
  cmdline-based staleness check in acquire_scoped_lock() on macOS.  Add a
  ps(1) fallback so cmdline resolution works on macOS/BSD.

Relates to NousResearch#5109 (launchd can start a second gateway on macOS).
Relates to NousResearch#11718 (--replace race condition: multiple instances).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Labels

comp/cli (CLI entry point, hermes_cli/, setup wizard)
comp/gateway (Gateway runner, session dispatch, delivery)
P2 (Medium: degraded but workaround exists)
type/bug (Something isn't working)
