# Issue: start_gateway should verify lock-holder PID is alive before treating stale lock as "another instance" #28561

@goddog2024

Bug Report

Component: gateway/status.py, gateway/run.py
Severity: High — causes complete gateway startup failure after crash, requiring manual file cleanup
Platform: Primarily Windows (reproducible on any platform where acquire_gateway_runtime_lock returns False for a stale lock)


Summary

When the gateway process crashes or is force-killed, the runtime lock file may remain "locked" in a way that acquire_gateway_runtime_lock() returns False on the next startup. The startup logic in gateway/run.py then exits with:

ERROR: Gateway runtime lock is already held by another instance. Exiting.

However, the existing get_running_pid() function already contains robust stale-PID detection: it checks os.kill(pid, 0), process start time, and cmdline matching, and auto-cleans stale PID files. The startup flow calls get_running_pid() only to detect a competing live instance and never uses it to validate the holder of the runtime lock, so when acquire_gateway_runtime_lock() fails this stale-detection logic is completely bypassed.


Reproduction Steps

  1. Start gateway normally: hermes gateway run
  2. Force-kill the gateway process (e.g., taskkill /F /PID <pid> on Windows, or kill -9 on POSIX)
  3. Attempt to restart gateway: hermes gateway run
  4. Observed: Gateway immediately exits with "runtime lock is already held by another instance"
  5. Workaround: Manually delete ~/.hermes/gateway.lock or ~/.hermes/gateway.pid, then restart succeeds

Root Cause Analysis

Current startup flow (simplified)

# gateway/run.py ~L15330
current_pid = get_running_pid()          # checks PID file + lock validity
if current_pid is not None and current_pid != os.getpid():
    logger.error("Another gateway instance started during our startup. Exiting.")
    return False

if not acquire_gateway_runtime_lock():   # ← ONLY checks file lock; NO PID validation
    logger.error("Gateway runtime lock is already held by another instance. Exiting.")
    return False

The gap

acquire_gateway_runtime_lock() calls _try_acquire_file_lock(), which attempts to grab the OS-level file lock. If the lock is still held (e.g., Windows msvcrt.locking may not auto-release after taskkill /F), it returns False immediately. It never asks: "who holds this lock, and are they still alive?"

Meanwhile, get_running_pid() already does exactly this validation:

# gateway/status.py ~L802
for record in (primary_record, fallback_record):
    pid = _pid_from_record(record)
    if pid is None:
        continue
    try:
        os.kill(pid, 0)  # existence check
    except ProcessLookupError:
        continue  # process is dead → stale
    # ... also checks start_time and cmdline
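
The existence check in the excerpt above can be tried in isolation. The following is a minimal sketch, not the code in gateway/status.py: pid_is_alive and spawn_and_reap are hypothetical names, and the start-time and cmdline checks are left out. It is POSIX-only. On Windows, CPython documents that os.kill with any signal other than CTRL_C_EVENT or CTRL_BREAK_EVENT terminates the target through TerminateProcess, so a Windows code path would need a different probe.

```python
import os
import subprocess
import sys


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs error checking only; nothing is delivered
    except ProcessLookupError:
        return False  # no such process: the record is stale
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def spawn_and_reap() -> int:
    """Start a short-lived child, wait for it to exit, and return its dead PID."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    return child.pid
```

A PID can be reused by an unrelated process after the original exits, which is presumably why get_running_pid() also compares start time and cmdline rather than relying on existence alone.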

But run.py calls get_running_pid() only before acquire_gateway_runtime_lock(), and only for the "another instance started during our startup" branch. If get_running_pid() returns None (because the PID file was already cleaned) but the lock file itself is still locked by a dead process, the code proceeds to acquire_gateway_runtime_lock() → False → exit.
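
To make the failing branch explicit, the simplified flow can be modeled as a pure function of its two inputs. This is an illustrative sketch only: Outcome and startup_outcome are hypothetical names, and the arguments stand in for the results of get_running_pid() and acquire_gateway_runtime_lock().

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    STARTED = "started"
    EXIT_OTHER_INSTANCE = "another instance started during our startup"
    EXIT_LOCK_HELD = "runtime lock is already held by another instance"


def startup_outcome(running_pid: Optional[int], own_pid: int, lock_acquired: bool) -> Outcome:
    """Model of the simplified run.py flow shown earlier in this report."""
    if running_pid is not None and running_pid != own_pid:
        return Outcome.EXIT_OTHER_INSTANCE
    if not lock_acquired:
        # Reached even when running_pid is None, i.e. no live holder was found.
        return Outcome.EXIT_LOCK_HELD
    return Outcome.STARTED
```

The stale-lock case corresponds to startup_outcome(None, own_pid, False): no live gateway is detected, yet the function still reports that the lock is held by another instance.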


Proposed Fix

Option A (Recommended): Reuse get_running_pid() as a lock-validity gate

In gateway/run.py, before calling acquire_gateway_runtime_lock(), read the PID from the lock file and validate it with the same logic get_running_pid() uses. If the recorded PID is dead, forcibly break the stale lock by closing and reopening the lock file (or, at minimum, document that the user should run with --replace).
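
A rough sketch of the gate in Option A follows. It assumes the lock file holds a JSON object with an integer "pid" field; the real record format is not shown in this report. read_lock_pid and lock_is_stale are hypothetical helpers, and the liveness check is injected so that the existing get_running_pid() logic could be plugged in.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def read_lock_pid(lock_path: Path) -> Optional[int]:
    """Return the PID recorded in the lock file, or None if missing or unreadable.

    ASSUMPTION: the record is a JSON object with an integer "pid" field.
    """
    try:
        record = json.loads(lock_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    pid = record.get("pid") if isinstance(record, dict) else None
    if isinstance(pid, int) and not isinstance(pid, bool) and pid > 0:
        return pid
    return None


def lock_is_stale(lock_path: Path, is_alive: Callable[[int], bool]) -> bool:
    """True when the lock file names a holder that is no longer running."""
    pid = read_lock_pid(lock_path)
    return pid is not None and not is_alive(pid)
```

In this sketch an empty or corrupt record is conservatively reported as not stale. Whether such files should instead be recovered, as the acquire_scoped_lock() tests suggest, is a separate design decision.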

Option B: Make acquire_gateway_runtime_lock() smarter

Add a cleanup_stale: bool = True parameter to acquire_gateway_runtime_lock(). When the initial lock attempt fails:

  1. Read the PID record from the lock file (_read_gateway_lock_record())
  2. Check whether the recorded PID is dead (os.kill(pid, 0) raises ProcessLookupError; a PermissionError means the process exists but belongs to another user)
  3. If it is dead, close the current handle, truncate/reopen the lock file, and retry the lock acquisition
  4. Log a warning: Recovered stale runtime lock from dead process PID {pid}

This mirrors the pattern already used in acquire_scoped_lock(), which does replace stale records:

# test_status.py references this behavior:
# test_acquire_scoped_lock_replaces_stale_record
# test_acquire_scoped_lock_recovers_empty_lock_file
# test_acquire_scoped_lock_recovers_corrupt_lock_file
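
The retry flow in Option B could look roughly like the sketch below. It is not the actual acquire_gateway_runtime_lock(). The four callables are hypothetical stand-ins for _try_acquire_file_lock(), _read_gateway_lock_record(), the PID liveness check, and the truncate/reopen step, and the logger name is an assumption.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("gateway.status")


def acquire_with_stale_recovery(
    try_acquire: Callable[[], bool],
    read_holder_pid: Callable[[], Optional[int]],
    is_alive: Callable[[int], bool],
    reset_lock_file: Callable[[], None],
    cleanup_stale: bool = True,
) -> bool:
    """Acquire the lock, retrying once if the recorded holder is dead."""
    if try_acquire():
        return True
    if not cleanup_stale:
        return False
    pid = read_holder_pid()
    if pid is None or is_alive(pid):
        return False  # unknown holder or a live holder: do not touch the lock
    reset_lock_file()  # close the handle and truncate/reopen the lock file
    if try_acquire():
        logger.warning("Recovered stale runtime lock from dead process PID %s", pid)
        return True
    return False
```

Retrying exactly once keeps the function from looping if another process takes the lock between the reset and the second attempt.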

Option C: Startup script auto-detect

In the hermes gateway run CLI, add a pre-flight check: if acquire_gateway_runtime_lock() fails, call get_running_pid(). If get_running_pid() returns None, print a helpful error:

Gateway lock file appears stale (no running process holds it).
Run `hermes gateway run --replace` to force-start, or manually remove:
  <lock_path>
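
One way to choose between the two messages is sketched below. lock_failure_message is a hypothetical helper; running_pid stands for the value returned by get_running_pid() after the lock attempt has failed.

```python
from pathlib import Path
from typing import Optional


def lock_failure_message(running_pid: Optional[int], lock_path: Path) -> str:
    """Pick the error text to show after the runtime lock could not be acquired."""
    if running_pid is not None:
        # A live gateway was found, so the existing message is accurate.
        return "Gateway runtime lock is already held by another instance. Exiting."
    return (
        "Gateway lock file appears stale (no running process holds it).\n"
        "Run `hermes gateway run --replace` to force-start, or manually remove:\n"
        f"  {lock_path}"
    )
```

Keeping the existing wording for the live-instance case means only the stale case changes what users see.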

Related Code

| File | Lines | Description |
| --- | --- | --- |
| gateway/run.py | 15330-15350 | Startup lock acquisition + PID file race logic |
| gateway/status.py | 313-331 | acquire_gateway_runtime_lock(): only checks file lock |
| gateway/status.py | 348-368 | is_gateway_runtime_lock_active(): lock existence check |
| gateway/status.py | 802-852 | get_running_pid(): already has stale-PID cleanup |
| tests/gateway/test_status.py | 55-76 | Test: test_get_running_pid_cleans_stale_record_from_dead_process |
| tests/gateway/test_status.py | 421-466 | Tests for acquire_scoped_lock stale-lock recovery |

Environment

  • OS: Windows 10/11 (also reproducible on Linux if lock mechanism doesn't auto-release)
  • Hermes version: v0.5.25+
  • Python: 3.11+
  • Lock mechanism: msvcrt.locking on Windows, fcntl.flock on POSIX
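
For reference, a non-blocking try-lock built on the two mechanisms listed above might look like the following. This is a generic sketch, not the project's _try_acquire_file_lock(): try_lock is a hypothetical name, and the choice to lock a single byte at offset 0 on Windows is an assumption.

```python
import os
import sys

if sys.platform == "win32":
    import msvcrt
else:
    import fcntl


def try_lock(fd: int) -> bool:
    """Try to take an exclusive, non-blocking lock on an open file descriptor."""
    try:
        if sys.platform == "win32":
            os.lseek(fd, 0, os.SEEK_SET)  # msvcrt.locking starts at the current position
            msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)  # lock one byte, fail immediately if held
        else:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        return False  # someone else holds the lock
    return True
```

Both calls only report that the lock is unavailable. Neither says which process holds it, which is why the holder's PID has to come from a separate record.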

Impact

This issue causes complete service unavailability after any ungraceful gateway shutdown (crash, kill -9, Windows force-kill, power loss). Users without knowledge of the internal lock file location cannot recover without manual intervention. It also breaks automated restart loops (systemd Restart=always, scheduled health-check restarts, etc.).


Workaround (for users hitting this now)

# Remove stale lock and PID files
rm ~/.hermes/gateway.lock ~/.hermes/gateway.pid

# Or on Windows:
del %USERPROFILE%\.hermes\gateway.lock %USERPROFILE%\.hermes\gateway.pid

# Then restart
hermes gateway run
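
The same cleanup can be done from one cross-platform script. This is a sketch assuming the file names used in the commands above; remove_stale_gateway_files is a hypothetical helper and should only be run when no gateway process is alive.

```python
from pathlib import Path
from typing import List


def remove_stale_gateway_files(hermes_dir: Path) -> List[Path]:
    """Delete gateway.lock and gateway.pid under hermes_dir; return what was removed."""
    removed = []
    for name in ("gateway.lock", "gateway.pid"):
        target = hermes_dir / name
        if target.exists():
            target.unlink()
            removed.append(target)
    return removed
```

Calling remove_stale_gateway_files(Path.home() / ".hermes") targets the same files as the shell commands on both POSIX and Windows.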

Metadata

Labels: P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), type/bug (Something isn't working)
