Summary
On macOS (and Windows), gateway/status.py::acquire_scoped_lock() can refuse to start the gateway forever after an unclean shutdown, because its PID-reuse guard silently degrades to a bare os.kill(pid, 0) check. Once that PID gets recycled by any other process, the lock is treated as "still held" until a human deletes the file.
Symptom in the wild (macOS 26, launchd-managed gateway):
ERROR gateway.platforms.base: [Telegram] Telegram bot token already in use (PID 450). Stop the other gateway first.
ERROR gateway.run: Gateway hit a non-retryable startup conflict: telegram: Telegram bot token already in use (PID 450). Stop the other gateway first.
…where PID 450 was actually /usr/libexec/intelligentroutingd, recycled long after the original gateway had died. The lock file at ~/.local/state/hermes/gateway-locks/telegram-bot-token-<hash>.lock looked like:
{"pid": 450, "kind": "hermes-gateway", "start_time": null, "scope": "telegram-bot-token", ...}
KeepAlive then loops forever: launchd restarts the gateway, gateway sees the "live" lock, exits with the non-retryable error, repeat. Telegram bot stays unreachable until manual rm of the lockfile.
Root cause
In gateway/status.py:
1. _get_process_start_time(pid) only reads /proc/<pid>/stat (Linux-only). On macOS/Windows it always returns None.
2. Because of (1), every lockfile written on macOS has "start_time": null.
3. The PID-reuse guard inside acquire_scoped_lock() requires both the stored start_time AND the live one to be non-null:
    if (
        existing.get("start_time") is not None
        and current_start is not None
        and current_start != existing.get("start_time")
    ):
        stale = True
   On macOS both are None, so the guard is silently skipped.
4. The fallback "is the process stopped (Ctrl+Z)?" check also reads /proc/<pid>/status, which is Linux-only, so it's a no-op on macOS too.
5. Net effect: as soon as the recorded PID is reused by anything alive, os.kill(pid, 0) succeeds and the lock is treated as held, permanently.
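To make the degradation concrete, here is a minimal sketch of the decision described in the list above. lock_is_stale is a hypothetical condensation written for this report, not the actual code in gateway/status.py, and the liveness result is passed in as a boolean rather than obtained from os.kill(pid, 0):

```python
from typing import Optional


def lock_is_stale(existing: dict, current_start: Optional[float], alive: bool) -> bool:
    """Hypothetical condensation of the scoped-lock staleness decision."""
    if not alive:
        # Nothing answers to the recorded PID, so the lock is clearly stale.
        return True
    stored = existing.get("start_time")
    # The PID-reuse guard only fires when both values are known and differ.
    if stored is not None and current_start is not None and current_start != stored:
        return True
    # Otherwise a live PID is taken as proof that the lock is still held.
    return False


# Linux: start times are populated, so a recycled PID is detected.
print(lock_is_stale({"pid": 450, "start_time": 1111.0}, 2222.0, alive=True))  # True

# macOS/Windows: both sides are None, so only liveness is consulted.
print(lock_is_stale({"pid": 450, "start_time": None}, None, alive=True))  # False
```

The second call is the PID 450 case from the log: an unrelated live process keeps the lock "held".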
For comparison, the runtime lock path uses _looks_like_gateway_process(pid) (which reads cmdline patterns) as a defense — but _read_process_cmdline() is /proc/<pid>/cmdline-only, also Linux-only. The scoped lock path doesn't even call that helper.
What made the lock stale in the first place was an unclean shutdown — the gateway logged Gateway drain timed out after 180.0s with 1 active agent(s); interrupting remaining work on the way down, so release_scoped_lock() never ran. (Likely launchd SIGKILL after its grace window.) That's the trigger; the bug above is what makes it stick forever.
Reproduction
On macOS:
# 1. Start the gateway, get its PID
launchctl print gui/$(id -u)/ai.hermes.gateway | grep pid
# 2. SIGKILL it so locks aren't released
kill -9 <pid>
# 3. Inspect the leftover lock — note start_time: null
cat ~/.local/state/hermes/gateway-locks/telegram-bot-token-*.lock
# 4. Wait for the macOS PID space to wrap (or just spawn enough processes to reach <pid>)
# On a busy laptop this happens within minutes.
# 5. Try to start the gateway again — fails with "already in use".
hermes gateway run --replace
Proposed fix
Three layers, in order of impact:
1. Cross-platform _get_process_start_time. Replace the /proc-only reader with psutil.Process(pid).create_time(). psutil is already widely available; if adding it as a hard dep is undesirable, fall back to sysctl KERN_PROC_PID on Darwin (subprocess to /usr/sbin/sysctl -n kern.proc.pid.<pid> works without new deps) and GetProcessTimes on Windows. Once start_time is populated on every OS, the existing guard at lines 513–518 of gateway/status.py does its job.
2. Identity check inside the scoped-lock staleness path. Even without start_time, _looks_like_gateway_process(pid) (line 139) plus a cross-platform cmdline reader (psutil.Process(pid).cmdline() or ps -o command= -p <pid>) would catch this case. The runtime lock path already uses the equivalent idea via _record_looks_like_gateway. Add the same to acquire_scoped_lock():
    if not stale and existing.get("kind") == _GATEWAY_KIND \
            and not _looks_like_gateway_process(existing_pid):
        stale = True
3. Cleaner shutdown. Make sure release_scoped_lock() runs even when the agent drain times out — either bump the drain timeout's hard kill so the release path always fires, or register an atexit/signal-handler that releases scoped locks unconditionally. Reduces how often stale lockfiles appear in the first place.
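As an illustration of layer 1 without new dependencies, the sketch below reads a process start time through ps -o lstart=, which is available on both macOS and Linux. This swaps in ps for the psutil and sysctl options named above. The function name get_process_start_time_portable is hypothetical, and the sketch returns None on Windows, where ps is absent and GetProcessTimes would still be needed:

```python
import os
import subprocess
from datetime import datetime
from typing import Optional


def get_process_start_time_portable(pid: int) -> Optional[float]:
    """Best-effort start time of `pid` as a Unix timestamp, or None if unknown."""
    env = dict(os.environ, LC_ALL="C")  # force English day and month names
    try:
        result = subprocess.run(
            ["ps", "-o", "lstart=", "-p", str(pid)],
            capture_output=True,
            text=True,
            timeout=5,
            env=env,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # ps is missing (e.g. Windows) or did not finish in time
    # Expected shape: "Tue Mar  5 09:12:33 2024" (five whitespace-separated fields).
    fields = result.stdout.split()
    if len(fields) != 5:
        return None  # no such process, or output in an unexpected shape
    try:
        started = datetime.strptime(" ".join(fields), "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None
    # ps prints local time; a naive datetime is interpreted as local time here.
    return started.timestamp()


print(get_process_start_time_portable(os.getpid()))
```

The value only has one-second resolution, so it should not be compared against start times recorded from a different source (for example psutil floats). Whichever source is chosen needs to be used both when writing the lockfile and when checking it.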
Optional UX improvement: a hermes gateway unlock (or hermes doctor --fix-locks) command so end users don't need to know where ~/.local/state/hermes/gateway-locks/ lives.
Workaround
Until the fix lands, wrap the gateway launch with a prestart hook that scans $HERMES_GATEWAY_LOCK_DIR (or ~/.local/state/hermes/gateway-locks/) and removes any *.lock whose recorded PID is either dead or alive-but-not-a-gateway. On a launchd-managed setup, point the LaunchAgent's ProgramArguments at a small wrapper script that runs the cleanup, then execs the real gateway.
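A minimal sketch of such a wrapper, for POSIX hosts only (on Windows, os.kill(pid, 0) terminates the target instead of probing it). Everything here is an assumption made for illustration: the helper names, the GATEWAY_MARKERS heuristic for recognising a gateway command line, and the choice to keep a lock whenever the owning process cannot be identified:

```python
import json
import os
import subprocess
from pathlib import Path
from typing import List, Optional

DEFAULT_LOCK_DIR = Path.home() / ".local" / "state" / "hermes" / "gateway-locks"
# Assumption: both words appear in the command line of a real gateway process.
GATEWAY_MARKERS = ("hermes", "gateway")


def pid_alive(pid: int) -> bool:
    """Probe `pid` with signal 0 (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def command_line(pid: int) -> Optional[str]:
    """Return the command line of `pid` via ps, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["ps", "-o", "command=", "-p", str(pid)],
            capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() or None


def should_remove(record: dict) -> bool:
    pid = int(record["pid"])
    if not pid_alive(pid):
        return True
    cmd = command_line(pid)
    if cmd is None:
        return False  # identity unknown: be conservative and keep the lock
    return not all(marker in cmd.lower() for marker in GATEWAY_MARKERS)


def clean_stale_locks(lock_dir: Path) -> List[Path]:
    """Delete *.lock files whose PID is dead or alive-but-not-a-gateway."""
    removed = []
    for path in sorted(lock_dir.glob("*.lock")):
        try:
            stale = should_remove(json.loads(path.read_text()))
        except (OSError, ValueError, KeyError, TypeError, OverflowError):
            continue  # unreadable or malformed lockfile: leave it alone
        if stale:
            path.unlink(missing_ok=True)
            removed.append(path)
    return removed


def main(argv: List[str]) -> None:
    lock_dir = Path(os.environ.get("HERMES_GATEWAY_LOCK_DIR") or DEFAULT_LOCK_DIR)
    for path in clean_stale_locks(lock_dir):
        print(f"removed stale lock: {path}")
    if argv:
        os.execvp(argv[0], argv)  # replace this process with the real gateway
```

To use it as the LaunchAgent entry point, call main(sys.argv[1:]) under the usual __main__ guard and pass the real gateway command as the wrapper's arguments.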
(Reporting this from a macOS deployment where the bot was silently down for hours before the stale-lock root cause was identified.)
Environment
- macOS 26 (arm64)
- Gateway managed by launchd via ~/Library/LaunchAgents/ai.hermes.gateway.plist
- Telegram platform (likely affects every scoped-lock-using platform on macOS/Windows)