gateway: scoped lock PID-reuse guard is a no-op on macOS/Windows; stale lockfiles permanently block startup (#18778)

## Summary

On macOS (and Windows), `gateway/status.py::acquire_scoped_lock()` can refuse to start the gateway forever after an unclean shutdown, because its PID-reuse guard silently degrades to a bare `os.kill(pid, 0)` check. Once that PID gets recycled by any other process, the lock is treated as "still held" until a human deletes the file.

Symptom in the wild (macOS 26, launchd-managed gateway):

```
ERROR gateway.platforms.base: [Telegram] Telegram bot token already in use (PID 450). Stop the other gateway first.
ERROR gateway.run: Gateway hit a non-retryable startup conflict: telegram: Telegram bot token already in use (PID 450). Stop the other gateway first.
```

…where PID 450 was actually `/usr/libexec/intelligentroutingd`, recycled long after the original gateway had died. The lock file at `~/.local/state/hermes/gateway-locks/telegram-bot-token-<hash>.lock` looked like:

```json
{"pid": 450, "kind": "hermes-gateway", "start_time": null, "scope": "telegram-bot-token", ...}
```

KeepAlive then loops forever: launchd restarts the gateway, the gateway sees the "live" lock and exits with the non-retryable error, and the cycle repeats. The Telegram bot stays unreachable until someone manually removes the lockfile with `rm`.

## Root cause

In `gateway/status.py`:

1. `_get_process_start_time(pid)` only reads `/proc/<pid>/stat` (Linux-only). On macOS/Windows it always returns `None`.
2. Because of (1), every lockfile written on macOS has `"start_time": null`.
3. The PID-reuse guard inside `acquire_scoped_lock()` requires both the stored `start_time` and the live one to be non-null:
   ```python
   if (
       existing.get("start_time") is not None
       and current_start is not None
       and current_start != existing.get("start_time")
   ):
       stale = True
   ```
   On macOS both are `None`, so the guard is silently skipped.
4. The fallback "is the process stopped (Ctrl+Z)?" check also reads `/proc/<pid>/status` — Linux-only, so it's a no-op on macOS too.
5. Net effect: as soon as the recorded PID is reused by anything alive, `os.kill(pid, 0)` succeeds and the lock is treated as held — permanently.
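
The chain in steps 1 to 5 can be illustrated with a small standalone model of the staleness decision. This is only a sketch, not the code in `gateway/status.py`: `lock_is_stale` and its parameters are hypothetical names, and only the three-part condition is copied from the guard quoted in step 3.

```python
from typing import Optional


def lock_is_stale(
    existing: dict,
    pid_alive: bool,
    current_start: Optional[int],
) -> bool:
    """Model of the staleness decision (hypothetical helper, not the real one)."""
    if not pid_alive:
        # os.kill(pid, 0) failed: the recorded process is gone.
        return True
    # PID-reuse guard, same condition as quoted in step 3.
    if (
        existing.get("start_time") is not None
        and current_start is not None
        and current_start != existing.get("start_time")
    ):
        return True
    return False


# Linux: both start times are known, so a recycled PID is detected.
linux_lock = {"pid": 450, "start_time": 1000}
print(lock_is_stale(linux_lock, pid_alive=True, current_start=2000))  # True

# macOS/Windows: start_time was written as null and cannot be read back,
# so any live process that holds PID 450 keeps the lock "held".
mac_lock = {"pid": 450, "start_time": None}
print(lock_is_stale(mac_lock, pid_alive=True, current_start=None))  # False
```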

For comparison, the **runtime** lock path uses `_looks_like_gateway_process(pid)` (which reads cmdline patterns) as a defense — but `_read_process_cmdline()` is `/proc/<pid>/cmdline`-only, also Linux-only. The scoped lock path doesn't even call that helper.

What made the lock stale in the first place was an unclean shutdown — the gateway logged `Gateway drain timed out after 180.0s with 1 active agent(s); interrupting remaining work` on the way down, so `release_scoped_lock()` never ran. (Likely launchd SIGKILL after its grace window.) That's the trigger; the bug above is what makes it stick forever.

## Reproduction

On macOS:

```bash
# 1. Start the gateway, get its PID
launchctl print gui/$(id -u)/ai.hermes.gateway | grep pid

# 2. SIGKILL it so locks aren't released
kill -9 <pid>

# 3. Inspect the leftover lock — note start_time: null
cat ~/.local/state/hermes/gateway-locks/telegram-bot-token-*.lock

# 4. Wait for the macOS PID space to wrap (or just spawn enough processes to reach <pid>)
#    On a busy laptop this happens within minutes.

# 5. Try to start the gateway again — fails with "already in use".
hermes gateway run --replace
```

## Proposed fix

Three layers, in order of impact:

**1. Cross-platform `_get_process_start_time`.** Replace the `/proc`-only reader with `psutil.Process(pid).create_time()`. `psutil` is already widely available; if adding it as a hard dep is undesirable, fall back to `sysctl KERN_PROC_PID` on Darwin (subprocess to `/usr/sbin/sysctl -n kern.proc.pid.<pid>` works without new deps) and `GetProcessTimes` on Windows. Once start_time is populated on every OS, the existing guard at lines 513–518 of `gateway/status.py` does its job.
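
A minimal sketch of what layer 1 could look like, assuming `psutil` stays optional. The function name `get_process_start_time` is illustrative, and for the no-dependency fallback this sketch swaps in `ps -o lstart=` rather than the `sysctl` or `GetProcessTimes` routes named above, so it does not cover Windows without `psutil`. Note that the three sources return values in different units (epoch seconds versus clock ticks since boot), so a stored value is only comparable with a live value read from the same source.

```python
import os
import subprocess
import sys
import time
from typing import Optional


def get_process_start_time(pid: int) -> Optional[float]:
    """Return a start-time marker for pid, or None if it cannot be read."""
    # Option 1: psutil, if installed (epoch seconds on every OS).
    try:
        import psutil
    except ImportError:
        psutil = None
    if psutil is not None:
        try:
            return float(psutil.Process(pid).create_time())
        except psutil.Error:
            return None

    # Option 2: Linux /proc reader. Field 22 of /proc/<pid>/stat is the
    # start time in clock ticks since boot. The comm field may contain
    # spaces, so split after the last ")" (that leaves field 3 at index 0).
    if sys.platform.startswith("linux"):
        try:
            with open(f"/proc/{pid}/stat", "r", errors="replace") as fh:
                fields = fh.read().rsplit(")", 1)[1].split()
            return float(fields[19])
        except (OSError, IndexError, ValueError):
            return None

    # Option 3: POSIX ps (macOS, BSD). Not available on Windows.
    try:
        out = subprocess.run(
            ["ps", "-o", "lstart=", "-p", str(pid)],
            capture_output=True, text=True, timeout=5,
            env={**os.environ, "LC_ALL": "C"},  # force English day/month names
        ).stdout.strip()
        return time.mktime(time.strptime(out, "%a %b %d %H:%M:%S %Y"))
    except (OSError, ValueError, subprocess.SubprocessError):
        return None
```

Returning `None` on any failure keeps the existing guard's behaviour unchanged for processes whose start time is unreadable.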

**2. Identity check inside the scoped-lock staleness path.** Even without start_time, `_looks_like_gateway_process(pid)` (line 139) plus a cross-platform cmdline reader (`psutil.Process(pid).cmdline()` or `ps -o command= -p <pid>`) would catch this case. The runtime lock path already uses the equivalent idea via `_record_looks_like_gateway`. Add the same to `acquire_scoped_lock()`:
   ```python
   if not stale and existing.get("kind") == _GATEWAY_KIND \
           and not _looks_like_gateway_process(existing_pid):
       stale = True
   ```
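
For the cross-platform cmdline reader, something along these lines might work. It is a sketch under assumptions: `read_process_cmdline` and `cmdline_looks_like_gateway` are illustrative names, and `GATEWAY_PATTERNS` is a made-up placeholder for whatever patterns `_looks_like_gateway_process` actually matches.

```python
import subprocess
from typing import Optional, Sequence

# Hypothetical patterns; the real list lives in gateway/status.py.
GATEWAY_PATTERNS: Sequence[str] = ("hermes gateway", "hermes-gateway")


def read_process_cmdline(pid: int) -> Optional[str]:
    """Return the command line of pid, or None if it cannot be read."""
    try:
        import psutil
    except ImportError:
        psutil = None
    if psutil is not None:
        try:
            return " ".join(psutil.Process(pid).cmdline()) or None
        except psutil.Error:
            return None
    try:
        # POSIX ps; "command=" prints the full command with no header line.
        out = subprocess.run(
            ["ps", "-o", "command=", "-p", str(pid)],
            capture_output=True, text=True, timeout=5,
        ).stdout.strip()
    except (OSError, subprocess.SubprocessError):
        return None
    return out or None


def cmdline_looks_like_gateway(
    cmdline: Optional[str],
    patterns: Sequence[str] = GATEWAY_PATTERNS,
) -> bool:
    """True if the command line contains any of the gateway patterns."""
    if not cmdline:
        return False
    return any(p in cmdline for p in patterns)
```

With this, the recycled PID from the report would be rejected: `cmdline_looks_like_gateway("/usr/libexec/intelligentroutingd")` is `False`. Treating an unreadable cmdline as "not a gateway" is a design choice that favours recovering from stale locks; the opposite default would favour never stealing a live lock.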

**3. Cleaner shutdown.** Make sure `release_scoped_lock()` runs even when the agent drain times out: either move the release ahead of any hard kill that follows the drain timeout so the release path always fires, or register an `atexit` hook or signal handler that releases scoped locks unconditionally. This reduces how often stale lockfiles appear in the first place. It cannot help when the process receives SIGKILL, which is why layers 1 and 2 are still needed.
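
One possible shape for the `atexit`/signal-handler variant is sketched below. `install_lock_release` is a hypothetical helper, and `release_fn` stands in for whatever call releases all scoped locks; the real signature of `release_scoped_lock()` is not assumed here.

```python
import atexit
import signal
import threading
from typing import Callable, Iterable


def install_lock_release(
    release_fn: Callable[[], None],
    signals: Iterable[int] = (signal.SIGTERM,),
) -> Callable[[], None]:
    """Run release_fn at most once, on normal exit or on a handled signal."""
    done = threading.Event()

    def release_once() -> None:
        if done.is_set():
            return
        done.set()
        release_fn()

    def on_signal(signum: int, frame: object) -> None:
        release_once()
        # Turn the signal into a normal interpreter exit.
        raise SystemExit(128 + signum)

    atexit.register(release_once)
    for sig in signals:
        # signal.signal() must be called from the main thread.
        signal.signal(sig, on_signal)
    return release_once
```

The returned callable can also be invoked directly from the drain-timeout path, so the locks are released before any hard kill is attempted.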

Optional UX improvement: a `hermes gateway unlock` (or `hermes doctor --fix-locks`) command so end users don't need to know where `~/.local/state/hermes/gateway-locks/` lives.

## Workaround

Until the fix lands, wrap the gateway launch with a prestart hook that scans `$HERMES_GATEWAY_LOCK_DIR` (or `~/.local/state/hermes/gateway-locks/`) and removes any `*.lock` whose recorded PID is either dead or alive-but-not-a-gateway. On a launchd-managed setup, point the LaunchAgent's `ProgramArguments` at a small wrapper script that runs the cleanup, then `exec`s the real gateway.
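
The cleanup step of such a wrapper could be sketched in Python as follows. All names here are illustrative, `pid_alive` is POSIX-only (signal 0 has a different meaning on Windows), and the `is_gateway` predicate is left to the caller, for example a check of the `ps -o command= -p <pid>` output.

```python
import json
import os
from pathlib import Path
from typing import Callable, List

DEFAULT_LOCK_DIR = Path.home() / ".local/state/hermes/gateway-locks"


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without signalling."""
    if pid <= 0:
        return False  # 0 and negative values address process groups
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def clean_stale_locks(
    lock_dir: Path,
    is_alive: Callable[[int], bool],
    is_gateway: Callable[[int], bool],
) -> List[Path]:
    """Delete *.lock files whose recorded PID is dead or is not a gateway."""
    removed = []
    for lock in sorted(lock_dir.glob("*.lock")):
        try:
            pid = int(json.loads(lock.read_text())["pid"])
        except (OSError, ValueError, KeyError, TypeError):
            continue  # unreadable or malformed: leave it alone
        if is_alive(pid) and is_gateway(pid):
            continue  # a real gateway still holds this lock
        lock.unlink(missing_ok=True)
        removed.append(lock)
    return removed
```

A wrapper would call `clean_stale_locks(Path(os.environ.get("HERMES_GATEWAY_LOCK_DIR", DEFAULT_LOCK_DIR)), pid_alive, is_gateway)` and then hand over to the real gateway with `os.execvp`.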

(Reporting this from a macOS deployment where the bot was silently down for hours before the stale-lock root cause was identified.)

## Environment

- macOS 26 (arm64)
- Gateway managed by launchd via `~/Library/LaunchAgents/ai.hermes.gateway.plist`
- Telegram platform (likely affects every scoped-lock-using platform on macOS/Windows)

