gateway: scoped lock PID-reuse guard is a no-op on macOS/Windows — stale lockfiles permanently block startup #18778

@helmut-hoffer-von-ankershoffen

Description

Summary

On macOS (and Windows), gateway/status.py::acquire_scoped_lock() can refuse to start the gateway forever after an unclean shutdown, because its PID-reuse guard silently degrades to a bare os.kill(pid, 0) check. Once that PID gets recycled by any other process, the lock is treated as "still held" until a human deletes the file.

Symptom in the wild (macOS 26, launchd-managed gateway):

ERROR gateway.platforms.base: [Telegram] Telegram bot token already in use (PID 450). Stop the other gateway first.
ERROR gateway.run: Gateway hit a non-retryable startup conflict: telegram: Telegram bot token already in use (PID 450). Stop the other gateway first.

…where PID 450 was actually /usr/libexec/intelligentroutingd, recycled long after the original gateway had died. The lock file at ~/.local/state/hermes/gateway-locks/telegram-bot-token-<hash>.lock looked like:

{"pid": 450, "kind": "hermes-gateway", "start_time": null, "scope": "telegram-bot-token", ...}

KeepAlive then loops forever: launchd restarts the gateway, the gateway sees the "live" lock and exits with the non-retryable error, and the cycle repeats. The Telegram bot stays unreachable until someone manually removes the lockfile.

Root cause

In gateway/status.py:

  1. _get_process_start_time(pid) only reads /proc/<pid>/stat (Linux-only). On macOS/Windows it always returns None.
  2. Because of (1), every lockfile written on macOS has "start_time": null.
  3. The PID-reuse guard inside acquire_scoped_lock() requires both the stored start_time AND the live one to be non-null:
    if (
        existing.get("start_time") is not None
        and current_start is not None
        and current_start != existing.get("start_time")
    ):
        stale = True
    On macOS both are None, so the guard is silently skipped.
  4. The fallback "is the process stopped (Ctrl+Z)?" check also reads /proc/<pid>/status — Linux-only, so it's a no-op on macOS too.
  5. Net effect: as soon as the recorded PID is reused by anything alive, os.kill(pid, 0) succeeds and the lock is treated as held — permanently.
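The effect of step 3 can be reproduced in isolation. The sketch below is a hypothetical reduction of the staleness decision: the function name is_stale and its pid_alive parameter are illustrative and not taken from gateway/status.py; only the three-clause condition is copied from the snippet above.

```python
from typing import Optional


def is_stale(existing: dict, current_start: Optional[int], pid_alive: bool) -> bool:
    """Hypothetical reduction of the scoped-lock staleness decision."""
    if not pid_alive:
        # The os.kill(pid, 0) probe failed: nobody holds the lock.
        return True
    if (
        existing.get("start_time") is not None
        and current_start is not None
        and current_start != existing.get("start_time")
    ):
        # Same PID, different start time: the PID was recycled.
        return True
    return False


# Linux: both start times are known, so a recycled PID is detected.
linux_result = is_stale({"pid": 450, "start_time": 1000}, current_start=2000, pid_alive=True)

# macOS/Windows: both start times are None, so the guard never fires.
macos_result = is_stale({"pid": 450, "start_time": None}, current_start=None, pid_alive=True)

print(linux_result, macos_result)  # True False
```

With a null start_time on both sides, the only remaining signal is whether some process currently owns the PID, which is exactly the failure described in step 5.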

For comparison, the runtime lock path uses _looks_like_gateway_process(pid) (which reads cmdline patterns) as a defense — but _read_process_cmdline() is /proc/<pid>/cmdline-only, also Linux-only. The scoped lock path doesn't even call that helper.

What made the lock stale in the first place was an unclean shutdown: on the way down the gateway logged "Gateway drain timed out after 180.0s with 1 active agent(s); interrupting remaining work", and release_scoped_lock() never ran (likely because launchd sent SIGKILL after its grace window). That is the trigger; the bug above is what makes it stick forever.

Reproduction

On macOS:

# 1. Start the gateway, get its PID
launchctl print gui/$(id -u)/ai.hermes.gateway | grep pid

# 2. SIGKILL it so locks aren't released
kill -9 <pid>

# 3. Inspect the leftover lock — note start_time: null
cat ~/.local/state/hermes/gateway-locks/telegram-bot-token-*.lock

# 4. Wait for the macOS PID space to wrap (or just spawn enough processes to reach <pid>)
#    On a busy laptop this can happen within minutes.

# 5. Try to start the gateway again — fails with "already in use".
hermes gateway run --replace

Proposed fix

Three layers, in order of impact:

1. Cross-platform _get_process_start_time. Replace the /proc-only reader with psutil.Process(pid).create_time(). psutil supports Linux, macOS, and Windows; if adding it as a hard dependency is undesirable, fall back to sysctl KERN_PROC_PID on Darwin (subprocess to /usr/sbin/sysctl -n kern.proc.pid.<pid> works without new deps) and GetProcessTimes on Windows. Once start_time is populated on every OS, the existing guard at lines 513–518 of gateway/status.py does its job.
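A minimal sketch of such a reader, under stated assumptions: it prefers psutil when importable and otherwise shells out to ps -o lstart= on POSIX systems (a swapped-in alternative to the sysctl route mentioned above, not the one proposed). The function names, the string token format, and the source prefix are all hypothetical. The prefix exists so that a token written by one source is never compared against a token from another, which would otherwise make a live lock look stale.

```python
import os
import subprocess
from typing import Optional

try:
    import psutil  # optional; assumed not to be a hard dependency
except ImportError:
    psutil = None


def get_process_start_time(pid: int) -> Optional[str]:
    """Return an opaque, source-prefixed start-time token for pid, or None."""
    if psutil is not None:
        try:
            # Whole seconds, to tolerate small floating-point jitter.
            return f"psutil:{int(psutil.Process(pid).create_time())}"
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            return None
    if os.name == "posix":
        try:
            out = subprocess.run(
                ["ps", "-o", "lstart=", "-p", str(pid)],
                capture_output=True, text=True, timeout=5, check=False,
            ).stdout.strip()
        except (OSError, subprocess.SubprocessError):
            return None
        return f"ps:{out}" if out else None
    return None


def start_times_differ(stored: Optional[str], current: Optional[str]) -> bool:
    """True only when both tokens come from the same source and disagree."""
    if stored is None or current is None:
        return False
    if stored.split(":", 1)[0] != current.split(":", 1)[0]:
        return False  # different sources are not comparable
    return stored != current
```

Lockfiles written by the current Linux reader would carry yet another format, so a real change would need a migration rule for those records; this sketch does not attempt one.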

2. Identity check inside the scoped-lock staleness path. Even without start_time, _looks_like_gateway_process(pid) (line 139) plus a cross-platform cmdline reader (psutil.Process(pid).cmdline() or ps -o command= -p <pid>) would catch this case. The runtime lock path already uses the equivalent idea via _record_looks_like_gateway. Add the same to acquire_scoped_lock():

if (
    not stale
    and existing.get("kind") == _GATEWAY_KIND
    and not _looks_like_gateway_process(existing_pid)
):
    stale = True
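The check above only helps if the command line can be read on every OS. A hedged sketch of a reader that tries /proc first and then falls back to ps -o command= -p <pid> follows. The function names and the marker strings are hypothetical; the real patterns used by _looks_like_gateway_process are not shown in this issue. An unreadable command line is deliberately treated as "could be a gateway" so that a failed lookup never causes a live lock to be discarded.

```python
import subprocess
import sys
from typing import Optional, Sequence


def read_process_cmdline(pid: int) -> Optional[str]:
    """Best-effort command line for pid; None when it cannot be read."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as fh:  # Linux
            raw = fh.read()
        if raw:
            return raw.replace(b"\x00", b" ").decode("utf-8", "replace").strip()
    except OSError:
        pass
    if sys.platform == "win32":
        return None  # no ps here; psutil would be needed instead
    try:
        out = subprocess.run(
            ["ps", "-o", "command=", "-p", str(pid)],
            capture_output=True, text=True, timeout=5, check=False,
        ).stdout.strip()
    except (OSError, subprocess.SubprocessError):
        return None
    return out or None


def looks_like_gateway(
    cmdline: Optional[str],
    markers: Sequence[str] = ("hermes", "gateway"),  # hypothetical patterns
) -> bool:
    """Conservative match: unknown command lines count as a gateway."""
    if cmdline is None:
        return True
    lowered = cmdline.lower()
    return all(marker in lowered for marker in markers)
```

Applied to the case in this report, a command line of /usr/libexec/intelligentroutingd fails the match, so the lock would be marked stale.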

3. Cleaner shutdown. Make sure release_scoped_lock() runs even when the agent drain times out: either adjust the drain timeout relative to the hard-kill deadline so the release path always fires, or register an atexit/signal handler that releases scoped locks unconditionally. Neither helps once SIGKILL has been delivered, since it cannot be caught, but both reduce how often stale lockfiles appear in the first place.
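The atexit/signal-handler variant could look roughly like the sketch below. Everything here is a stand-in: track_lock, release_all_scoped_locks, and install_release_hooks are hypothetical names, and the body simply unlinks tracked files where real code would call release_scoped_lock() for each held scope.

```python
import atexit
import os
import signal
import sys

_held_lock_paths = set()  # hypothetical registry of lockfiles this process owns


def track_lock(path) -> None:
    """Remember a lockfile so it can be released on exit."""
    _held_lock_paths.add(str(path))


def release_all_scoped_locks() -> list:
    """Idempotently remove every tracked lockfile; return what was released."""
    released = []
    while _held_lock_paths:
        path = _held_lock_paths.pop()
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass  # already gone, nothing to do
        released.append(path)
    return released


def _on_signal(signum, frame) -> None:
    release_all_scoped_locks()
    sys.exit(128 + signum)  # conventional exit status for death by signal


def install_release_hooks() -> None:
    """Call once from the main thread during startup."""
    atexit.register(release_all_scoped_locks)
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, _on_signal)
```

Because the release function empties the registry as it goes, running it from both the signal handler and atexit is harmless.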

Optional UX improvement: a hermes gateway unlock (or hermes doctor --fix-locks) command so end users don't need to know where ~/.local/state/hermes/gateway-locks/ lives.

Workaround

Until the fix lands, wrap the gateway launch with a prestart hook that scans $HERMES_GATEWAY_LOCK_DIR (or ~/.local/state/hermes/gateway-locks/) and removes any *.lock whose recorded PID is either dead or alive-but-not-a-gateway. On a launchd-managed setup, point the LaunchAgent's ProgramArguments at a small wrapper script that runs the cleanup, then execs the real gateway.
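A minimal sketch of such a prestart cleanup, assuming a POSIX host (os.kill(pid, 0) is not a safe liveness probe on Windows) and assuming the lockfile is JSON with a "pid" field as in the example record above. The function names and the "hermes"/"gateway" command-line markers are hypothetical. Unreadable lockfiles and unreadable command lines are left alone rather than guessed at.

```python
import json
import os
import subprocess
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """POSIX-only liveness probe using signal 0."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def is_gateway(pid: int) -> bool:
    """Hypothetical identity check via ps; unknown counts as a gateway."""
    try:
        out = subprocess.run(
            ["ps", "-o", "command=", "-p", str(pid)],
            capture_output=True, text=True, timeout=5, check=False,
        ).stdout.strip().lower()
    except (OSError, subprocess.SubprocessError):
        return True
    if not out:
        return True
    return "hermes" in out and "gateway" in out


def clean_stale_locks(lock_dir: Path, is_gateway=is_gateway) -> list:
    """Remove *.lock files whose PID is dead or is not a gateway."""
    removed = []
    for lock in sorted(lock_dir.glob("*.lock")):
        try:
            pid = int(json.loads(lock.read_text())["pid"])
        except (OSError, ValueError, KeyError, TypeError):
            continue  # unreadable record: leave it for a human
        if pid <= 0:
            continue
        if not pid_alive(pid) or not is_gateway(pid):
            lock.unlink(missing_ok=True)
            removed.append(lock.name)
    return removed


def main() -> None:
    default_dir = Path.home() / ".local/state/hermes/gateway-locks"
    lock_dir = Path(os.environ.get("HERMES_GATEWAY_LOCK_DIR", default_dir))
    clean_stale_locks(lock_dir)
    # Hand over to the real gateway; adjust the command to the deployment.
    os.execvp("hermes", ["hermes", "gateway", "run"])
```

The wrapper referenced by ProgramArguments would call main(); the exec keeps the gateway as the process launchd supervises.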

(Reporting this from a macOS deployment where the bot was silently down for hours before the stale-lock root cause was identified.)

Environment

  • macOS 26 (arm64)
  • Gateway managed by launchd via ~/Library/LaunchAgents/ai.hermes.gateway.plist
  • Telegram platform (likely affects every scoped-lock-using platform on macOS/Windows)

Labels

  • P2 (Medium: degraded but workaround exists)
  • comp/gateway (Gateway runner, session dispatch, delivery)
  • sweeper:implemented-on-main (Sweeper: behavior already present on current main)
  • type/bug (Something isn't working)
