# Issue: `start_gateway` should verify lock-holder PID is alive before treating stale lock as "another instance" (#28561)

## Bug Report

**Component:** `gateway/status.py`, `gateway/run.py`
**Severity:** High — causes complete gateway startup failure after crash, requiring manual file cleanup
**Platform:** Primarily Windows (reproducible on any platform where `acquire_gateway_runtime_lock` returns `False` for a stale lock)

---

## Summary

When the gateway process crashes or is force-killed, the runtime lock file can be left in a state where `acquire_gateway_runtime_lock()` returns `False` on the next startup. The startup logic in `gateway/run.py` then exits with:

```
ERROR: Gateway runtime lock is already held by another instance. Exiting.
```

**However**, the existing `get_running_pid()` function already contains robust stale-PID detection (it checks `os.kill(pid, 0)`, process start time, and cmdline matching, and auto-cleans PID files). The startup flow calls `get_running_pid()` only *before* `acquire_gateway_runtime_lock()`, never after the lock attempt fails, so this liveness check is never applied to whoever holds the lock, and the stale-lock cleanup logic is bypassed.

---

## Reproduction Steps

1. Start gateway normally: `hermes gateway run`
2. Force-kill the gateway process (e.g., `taskkill /F /PID <pid>` on Windows, or `kill -9` on POSIX)
3. Attempt to restart gateway: `hermes gateway run`
4. **Observed:** Gateway immediately exits with "runtime lock is already held by another instance"
5. **Workaround:** Manually delete `~/.hermes/gateway.lock` or `~/.hermes/gateway.pid`; the restart then succeeds
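
To narrow down which platforms are affected, the POSIX half of step 2 can be checked in isolation. The script below is a minimal sketch that does not touch any gateway code: `lock_survives_sigkill` is a hypothetical helper that has a child process take an `fcntl.flock`, kills it with `SIGKILL`, and then reports whether the lock still blocks the parent. It says nothing about `msvcrt.locking` on Windows.

```python
import fcntl
import signal
import subprocess
import sys

# Child program: take an exclusive flock, report readiness, then hang.
CHILD = """
import fcntl, sys, time
handle = open(sys.argv[1], "w")
fcntl.flock(handle, fcntl.LOCK_EX)
print("locked", flush=True)
time.sleep(60)
"""


def lock_survives_sigkill(lock_path: str) -> bool:
    """Return True if an flock held by a SIGKILLed child still blocks us (POSIX only)."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD, lock_path],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        if child.stdout.readline().strip() != "locked":
            raise RuntimeError("child never acquired the lock")
    finally:
        # Simulate the force-kill from step 2 and reap the child.
        child.send_signal(signal.SIGKILL)
        child.wait()
        child.stdout.close()

    with open(lock_path, "w") as handle:
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True  # lock is still held even though the holder is gone
        return False  # kernel released the lock when the holder died
```

If this returns `False` on a given POSIX system, the OS-level lock is probably not what blocks the restart there, and the lock or PID record is the more likely culprit.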

---

## Root Cause Analysis

### Current startup flow (simplified)

```python
# gateway/run.py ~L15330
current_pid = get_running_pid()          # checks PID file + lock validity
if current_pid is not None and current_pid != os.getpid():
    logger.error("Another gateway instance started during our startup. Exiting.")
    return False

if not acquire_gateway_runtime_lock():   # ← ONLY checks file lock; NO PID validation
    logger.error("Gateway runtime lock is already held by another instance. Exiting.")
    return False
```

### The gap

`acquire_gateway_runtime_lock()` calls `_try_acquire_file_lock()`, which attempts to grab the OS-level file lock. If the lock is still held (e.g., a Windows `msvcrt.locking` lock may not be released right after `taskkill /F`), it returns `False` immediately. **It never asks: "who holds this lock, and are they still alive?"**

Meanwhile, `get_running_pid()` already does exactly this validation:

```python
# gateway/status.py ~L802
for record in (primary_record, fallback_record):
    pid = _pid_from_record(record)
    if pid is None:
        continue
    try:
        os.kill(pid, 0)  # existence check
    except ProcessLookupError:
        continue  # process is dead → stale
    # ... also checks start_time and cmdline
```

But `run.py` calls `get_running_pid()` **before** `acquire_gateway_runtime_lock()`, and only for the "another instance started during our startup" branch. If `get_running_pid()` returns `None` (because the PID file was already cleaned), but the **lock file itself** is still locked by a dead process, the code proceeds to `acquire_gateway_runtime_lock()` → `False` → exit.
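
The liveness question itself is small enough to sketch on its own. `pid_is_alive` below is a hypothetical helper, not a function from `gateway/status.py`, and it is POSIX-only: on Windows, Python's `os.kill(pid, 0)` is not a harmless probe (signal value `0` is `signal.CTRL_C_EVENT` there), so a Windows check would need a different mechanism. The real `get_running_pid()` may already guard against this elsewhere.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0 performs error checking only; nothing is delivered
    except ProcessLookupError:
        return False  # no such process: the record is stale
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

An existence check alone cannot rule out PID reuse, which is presumably why `get_running_pid()` also compares start time and cmdline.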

---

## Proposed Fix

### Option A (Recommended): Reuse `get_running_pid()` as a lock-validity gate

In `gateway/run.py`, before calling `acquire_gateway_runtime_lock()`, read the PID recorded in the lock file and validate it with the same logic `get_running_pid()` uses. If the recorded PID is dead, **forcibly break the stale lock** by closing and reopening the lock file, or at minimum tell the user to rerun with `--replace`.
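
The gate in Option A comes down to a three-way decision about the lock holder. A minimal sketch follows, assuming the caller has already read the PID from the lock file and supplies its own liveness check. `LockHolder` and `classify_lock_holder` are illustrative names, not existing gateway APIs.

```python
from enum import Enum
from typing import Callable, Optional


class LockHolder(Enum):
    ALIVE = "alive"      # a live process owns the lock: exit as today
    DEAD = "dead"        # the recorded owner is gone: safe to break the lock
    UNKNOWN = "unknown"  # no readable record: stay conservative


def classify_lock_holder(
    recorded_pid: Optional[int],
    pid_is_alive: Callable[[int], bool],
) -> LockHolder:
    """Decide whether the process named in the lock record still exists."""
    if recorded_pid is None:
        return LockHolder.UNKNOWN
    return LockHolder.ALIVE if pid_is_alive(recorded_pid) else LockHolder.DEAD
```

Under this sketch, startup would break the lock only for `LockHolder.DEAD`. Treating `UNKNOWN` as "do not break" keeps a corrupt record from evicting a live instance.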

### Option B: Make `acquire_gateway_runtime_lock()` smarter

Add a `cleanup_stale: bool = True` parameter to `acquire_gateway_runtime_lock()`. When the initial lock attempt fails:

1. Read the PID record from the lock file (`_read_gateway_lock_record()`).
2. Check whether the recorded PID is dead: `os.kill(pid, 0)` raises `ProcessLookupError`. A `PermissionError` means the process exists, so catching a bare `OSError` would be too broad.
3. If it is dead, close the current handle, truncate/reopen the lock file, and retry the lock acquisition.
4. Log a warning: `Recovered stale runtime lock from dead process PID {pid}`

This mirrors the pattern already used in `acquire_scoped_lock()`, which **does** replace stale records:

```python
# test_status.py references this behavior:
# test_acquire_scoped_lock_replaces_stale_record
# test_acquire_scoped_lock_recovers_empty_lock_file
# test_acquire_scoped_lock_recovers_corrupt_lock_file
```

### Option C: Startup script auto-detect

In the `hermes gateway run` CLI, add a pre-flight check: if `acquire_gateway_runtime_lock()` fails, call `get_running_pid()`. If it returns `None`, print a helpful error:

```
Gateway lock file appears stale (no running process holds it).
Run `hermes gateway run --replace` to force-start, or manually remove:
  <lock_path>
```
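
A minimal sketch of how the CLI could choose between the two messages, assuming it already has the result of `get_running_pid()` and the lock path at hand. `describe_lock_failure` is a hypothetical helper, and the "(PID ...)" suffix on the first message is an illustrative addition rather than existing output.

```python
from typing import Optional


def describe_lock_failure(running_pid: Optional[int], lock_path: str) -> str:
    """Build the error text shown after a failed runtime-lock acquisition."""
    if running_pid is not None:
        # A live gateway really does hold the lock.
        return (
            "Gateway runtime lock is already held by another instance "
            f"(PID {running_pid}). Exiting."
        )
    # No live process was found, so the lock is most likely stale.
    return (
        "Gateway lock file appears stale (no running process holds it).\n"
        "Run `hermes gateway run --replace` to force-start, or manually remove:\n"
        f"  {lock_path}"
    )
```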

---

## Related Code

| File | Lines | Description |
|------|-------|-------------|
| `gateway/run.py` | 15330-15350 | Startup lock acquisition + PID file race logic |
| `gateway/status.py` | 313-331 | `acquire_gateway_runtime_lock()` — only checks file lock |
| `gateway/status.py` | 348-368 | `is_gateway_runtime_lock_active()` — lock existence check |
| `gateway/status.py` | 802-852 | `get_running_pid()` — **already has stale-PID cleanup** |
| `tests/gateway/test_status.py` | 55-76 | Test: `test_get_running_pid_cleans_stale_record_from_dead_process` |
| `tests/gateway/test_status.py` | 421-466 | Tests for `acquire_scoped_lock` stale-lock recovery |

---

## Environment

- **OS:** Windows 10/11 (also reproducible on Linux if the lock mechanism doesn't auto-release)
- **Hermes version:** v0.5.25+
- **Python:** 3.11+
- **Lock mechanism:** `msvcrt.locking` on Windows, `fcntl.flock` on POSIX

---

## Impact

This issue causes **complete service unavailability** after any ungraceful gateway shutdown (crash, `kill -9`, Windows force-kill, power loss). Users without knowledge of the internal lock file location cannot recover without manual intervention. It also breaks automated restart loops (systemd `Restart=always`, scheduled health-check restarts, etc.).

---

## Workaround (for users hitting this now)

```bash
# Remove stale lock and PID files (POSIX shells); -f ignores files that are already gone
rm -f ~/.hermes/gateway.lock ~/.hermes/gateway.pid

# Or on Windows (run in cmd.exe, not bash):
del %USERPROFILE%\.hermes\gateway.lock %USERPROFILE%\.hermes\gateway.pid

# Then restart
hermes gateway run
```

