Issue: start_gateway should verify lock-holder PID is alive before treating stale lock as "another instance"
Bug Report
Component: gateway/status.py, gateway/run.py
Severity: High — causes complete gateway startup failure after crash, requiring manual file cleanup
Platform: Primarily Windows (reproducible on any platform where acquire_gateway_runtime_lock returns False for a stale lock)
Summary
When the gateway process crashes or is force-killed, the runtime lock file may remain "locked", so acquire_gateway_runtime_lock() returns False on the next startup. The startup logic in gateway/run.py then exits with:
ERROR: Gateway runtime lock is already held by another instance. Exiting.
However, the existing get_running_pid() function already contains robust stale-PID detection (it checks os.kill(pid, 0), process start time, and cmdline matching, and auto-cleans PID files). The startup flow calls get_running_pid() only to detect a live competing instance. It never applies that validation to the holder of the runtime lock, so when acquire_gateway_runtime_lock() fails, the stale-lock cleanup logic is completely bypassed.
Reproduction Steps
- Start gateway normally: hermes gateway run
- Force-kill the gateway process (e.g., taskkill /F /PID <pid> on Windows, or kill -9 on POSIX)
- Attempt to restart gateway: hermes gateway run
- Observed: Gateway immediately exits with "runtime lock is already held by another instance"
- Workaround: Manually delete ~/.hermes/gateway.lock or ~/.hermes/gateway.pid, then restart succeeds
Root Cause Analysis
Current startup flow (simplified)
# gateway/run.py ~L15330
current_pid = get_running_pid()  # checks PID file + lock validity
if current_pid is not None and current_pid != os.getpid():
    logger.error("Another gateway instance started during our startup. Exiting.")
    return False
if not acquire_gateway_runtime_lock():  # ← ONLY checks file lock; NO PID validation
    logger.error("Gateway runtime lock is already held by another instance. Exiting.")
    return False
The gap
acquire_gateway_runtime_lock() calls _try_acquire_file_lock(), which attempts to grab the OS-level file lock. If the lock is still held (e.g., Windows msvcrt.locking may not auto-release after taskkill /F), it returns False immediately. It never asks: "who holds this lock, and are they still alive?"
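To make the gap concrete, here is a minimal sketch of a non-blocking lock attempt. It is not the actual _try_acquire_file_lock() implementation: try_lock is a hypothetical name, and the only detail taken from this report is the mechanism (msvcrt.locking on Windows, fcntl.flock on POSIX). The point is that a failed attempt yields a bare yes/no answer and nothing about the holder.

```python
import os
import sys
import tempfile


def try_lock(path):
    """Return an open handle that holds the lock, or None if it is held elsewhere.

    Hypothetical helper for illustration; not the gateway's real function.
    """
    handle = open(path, "a+")
    try:
        if sys.platform == "win32":
            import msvcrt
            handle.seek(0)
            # Lock one byte at offset 0 without blocking.
            msvcrt.locking(handle.fileno(), msvcrt.LK_NBLCK, 1)
        else:
            import fcntl
            # Exclusive, non-blocking advisory lock on the whole file.
            fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None  # held by someone, but we learn nothing about who
    return handle


lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")
first = try_lock(lock_path)   # first attempt takes the lock
second = try_lock(lock_path)  # second attempt fails while `first` is open
first.close()                 # closing the handle releases the lock
third = try_lock(lock_path)   # the lock can be taken again
```

Because the failure path returns only None, any "is the holder still alive?" decision has to come from a separate record, which is what the options below rely on.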
Meanwhile, get_running_pid() already does exactly this validation:
# gateway/status.py ~L802
for record in (primary_record, fallback_record):
    pid = _pid_from_record(record)
    if pid is None:
        continue
    try:
        os.kill(pid, 0)  # existence check
    except ProcessLookupError:
        continue  # process is dead → stale
    # ... also checks start_time and cmdline
But run.py calls get_running_pid() before acquire_gateway_runtime_lock(), and only for the "another instance started during our startup" branch. If get_running_pid() returns None (because the PID file was already cleaned), but the lock file itself is still locked by a dead process, the code proceeds to acquire_gateway_runtime_lock() → False → exit.
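The existence check that get_running_pid() performs can be isolated as a small probe. The sketch below is an assumption-laden illustration, not code from gateway/status.py: pid_is_alive is a hypothetical name, and it omits the start_time and cmdline checks that guard against PID reuse. It is deliberately POSIX-only, because the Python documentation for os.kill states that on Windows any signal value other than CTRL_C_EVENT and CTRL_BREAK_EVENT terminates the target process.

```python
import os
import sys


def pid_is_alive(pid):
    """Hypothetical POSIX-only liveness probe based on signal 0."""
    if sys.platform == "win32":
        # Assumption: a different probe is needed on Windows (see lead-in).
        raise NotImplementedError("signal 0 is not a safe probe on Windows")
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one PID
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False  # no such process: the record is stale
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


own_process_alive = pid_is_alive(os.getpid())
```

Treating PermissionError as "alive" is the conservative choice here: it avoids breaking a lock that a live process owned by another user may still hold.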
Proposed Fix
Option A (Recommended): Reuse get_running_pid() as a lock-validity gate
In gateway/run.py, before calling acquire_gateway_runtime_lock(), attempt to read the PID from the lock file and validate it with get_running_pid()'s logic. If the recorded PID is dead, forcibly break the stale lock by closing/reopening the lock file (or, failing that, document that the user should run with --replace).
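Reading the PID from the lock file has to tolerate missing, empty, and corrupt files. The sketch below shows one defensive way to do that. The on-disk format is an assumption: this report does not specify it, so the example supposes a JSON object with an integer "pid" field, and read_lock_pid is a hypothetical name rather than an existing gateway function.

```python
import json
import os
import tempfile


def read_lock_pid(path):
    """Return the PID stored in a lock record, or None if it cannot be read.

    Assumes (hypothetically) a JSON object with an integer "pid" field.
    """
    try:
        with open(path, "r", encoding="utf-8") as fh:
            record = json.load(fh)
    except (OSError, ValueError):
        return None  # missing, unreadable, empty, or corrupt file
    pid = record.get("pid") if isinstance(record, dict) else None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(pid, int) and not isinstance(pid, bool) and pid > 0:
        return pid
    return None


demo_dir = tempfile.mkdtemp()
good = os.path.join(demo_dir, "good.lock")
with open(good, "w", encoding="utf-8") as fh:
    json.dump({"pid": 4242}, fh)
empty = os.path.join(demo_dir, "empty.lock")
open(empty, "w").close()
missing = os.path.join(demo_dir, "missing.lock")
```

Returning None for every unreadable case lets the caller decide separately whether an unknown holder should block startup or be treated as stale.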
Option B: Make acquire_gateway_runtime_lock() smarter
Add a cleanup_stale: bool = True parameter to acquire_gateway_runtime_lock(). When the initial lock attempt fails:
- Read the PID record from the lock file (_read_gateway_lock_record())
- Check whether the recorded PID is dead (os.kill(pid, 0) raises ProcessLookupError; a PermissionError is also an OSError but means the process exists, so it must not be treated as dead)
- If it is dead, close the current handle, truncate/reopen the lock file, and retry the lock acquisition
- Log a warning: Recovered stale runtime lock from dead process PID {pid}
This mirrors the pattern already used in acquire_scoped_lock(), which does replace stale records:
# test_status.py references this behavior:
# test_acquire_scoped_lock_replaces_stale_record
# test_acquire_scoped_lock_recovers_empty_lock_file
# test_acquire_scoped_lock_recovers_corrupt_lock_file
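The Option B control flow can be sketched independently of the real locking code by passing each step in as a callable. Everything here is illustrative: acquire_with_stale_recovery and its parameters are hypothetical names, the real function would call the gateway's own helpers, and refusing to recover when the holder PID is unknown is a conservative assumption rather than documented behavior.

```python
from typing import Callable, Optional


def acquire_with_stale_recovery(
    try_acquire: Callable[[], bool],
    read_holder_pid: Callable[[], Optional[int]],
    pid_is_alive: Callable[[int], bool],
    reset_lock: Callable[[], None],
    warn: Callable[[str], None] = print,
    cleanup_stale: bool = True,
) -> bool:
    """Try the lock; if it fails and the recorded holder is dead, reset and retry once."""
    if try_acquire():
        return True
    if not cleanup_stale:
        return False
    pid = read_holder_pid()
    if pid is None or pid_is_alive(pid):
        return False  # unknown or live holder: never steal the lock
    reset_lock()  # e.g. close the handle and truncate/reopen the lock file
    if try_acquire():
        warn(f"Recovered stale runtime lock from dead process PID {pid}")
        return True
    return False


# Fake lock that stays "held" until reset, standing in for a stale OS lock.
state = {"held": True}
messages = []


def fake_try_acquire():
    if state["held"]:
        return False
    state["held"] = True
    return True


def fake_reset():
    state["held"] = False


recovered = acquire_with_stale_recovery(
    fake_try_acquire, lambda: 4242, lambda pid: False, fake_reset, messages.append
)
```

Injecting the steps keeps the decision logic testable without killing real processes, in the same spirit as the acquire_scoped_lock tests referenced above.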
Option C: Startup script auto-detect
In hermes gateway run CLI, add a pre-flight check: if acquire_gateway_runtime_lock() fails, call get_running_pid(). If get_running_pid() returns None, print a helpful error:
Gateway lock file appears stale (no running process holds it).
Run `hermes gateway run --replace` to force-start, or manually remove:
<lock_path>
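A minimal sketch of the Option C decision, assuming the CLI already knows the result of get_running_pid() and the lock path at the point where acquisition fails. lock_failure_message is a hypothetical helper; the message texts are the ones quoted in this report, with the PID suffix added for illustration only.

```python
from pathlib import Path
from typing import Optional


def lock_failure_message(running_pid: Optional[int], lock_path: Path) -> str:
    """Pick the error text to show after the runtime lock could not be acquired."""
    if running_pid is not None:
        # A live gateway was found: the existing error is accurate.
        return (
            "Gateway runtime lock is already held by another instance "
            f"(PID {running_pid}). Exiting."
        )
    # No live gateway was found: the lock is most likely stale.
    return (
        "Gateway lock file appears stale (no running process holds it).\n"
        "Run `hermes gateway run --replace` to force-start, or manually remove:\n"
        f"  {lock_path}"
    )


stale_text = lock_failure_message(None, Path("gateway.lock"))
live_text = lock_failure_message(4242, Path("gateway.lock"))
```

This option changes no locking behavior; it only replaces a misleading error with an actionable one, so it could ship alongside Option A or B.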
Related Code
| File | Lines | Description |
|---|---|---|
| gateway/run.py | 15330-15350 | Startup lock acquisition + PID file race logic |
| gateway/status.py | 313-331 | acquire_gateway_runtime_lock(): only checks file lock |
| gateway/status.py | 348-368 | is_gateway_runtime_lock_active(): lock existence check |
| gateway/status.py | 802-852 | get_running_pid(): already has stale-PID cleanup |
| tests/gateway/test_status.py | 55-76 | Test: test_get_running_pid_cleans_stale_record_from_dead_process |
| tests/gateway/test_status.py | 421-466 | Tests for acquire_scoped_lock stale-lock recovery |
Environment
- OS: Windows 10/11 (also reproducible on Linux if lock mechanism doesn't auto-release)
- Hermes version: v0.5.25+
- Python: 3.11+
- Lock mechanism: msvcrt.locking on Windows, fcntl.flock on POSIX
Impact
This issue causes complete service unavailability after any ungraceful gateway shutdown (crash, kill -9, Windows force-kill, power loss). Users without knowledge of the internal lock file location cannot recover without manual intervention. It also breaks automated restart loops (systemd Restart=always, scheduled health-check restarts, etc.).
Workaround (for users hitting this now)
# Remove stale lock and PID files
rm ~/.hermes/gateway.lock ~/.hermes/gateway.pid
# Or on Windows:
del %USERPROFILE%\.hermes\gateway.lock %USERPROFILE%\.hermes\gateway.pid
# Then restart
hermes gateway run