Bug Description
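Hermes Gateway PID file race: problem and proposed fix
Symptoms
Hermes Gateway repeatedly reports this error on startup:
ERROR gateway.run: PID file race lost to another gateway instance. Exiting.
The logs show the Gateway trying to start several times and exiting each time after losing the PID file race, even though no Gateway process is actually running.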
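Cause
Root cause: stale PID files are not cleaned up automatically
Hermes Gateway uses a PID file (~/.hermes/gateway.pid) to detect whether a Gateway instance is already running and to prevent multiple instances from running at the same time. The PID file is JSON and records the process PID, start time (start_time), command-line arguments, and related metadata.
During a normal lifecycle:
- On startup, the Gateway writes the PID file.
- On a clean exit, the Gateway deletes the PID file via remove_pid_file().
- On an abnormal exit (for example SIGKILL or a crash), the PID file is not deleted.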
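The lifecycle above can be reproduced outside Hermes. The following is a minimal sketch, not Hermes code: the child script, the `demo.pid` file name, and the use of `atexit` for cleanup are all assumptions made for illustration. It shows that a cleanup hook registered in the process never runs when the process is killed with SIGKILL, so the PID file stays on disk.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path

# Hypothetical child: writes its PID file, registers cleanup, then idles.
CHILD = """
import atexit, os, sys, time
path = sys.argv[1]
with open(path, "w") as f:
    f.write(str(os.getpid()))
atexit.register(os.remove, path)  # runs on a clean exit only
time.sleep(60)
"""


def pid_file_survives_sigkill() -> bool:
    """Return True if the PID file is still on disk after the child is killed."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "demo.pid"
        child = subprocess.Popen([sys.executable, "-c", CHILD, str(pid_path)])
        # Wait (up to 10 s) until the child has written its PID file.
        deadline = time.monotonic() + 10
        while time.monotonic() < deadline:
            if pid_path.exists() and pid_path.read_text():
                break
            time.sleep(0.05)
        child.kill()  # SIGKILL on POSIX: atexit handlers never run
        child.wait()
        return pid_path.exists()


if __name__ == "__main__":
    print("PID file left behind:", pid_file_survives_sigkill())
```

Any cleanup that lives inside the process has the same limitation, which is why the stale file has to be detected by the next process that starts.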
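Failure scenario:
- The Gateway exits abnormally (kill -9, terminal disconnect, OOM, and so on), so the PID file is left behind.
- A new Gateway calls write_pid_file() on startup, which tries to create the PID file atomically with O_CREAT | O_EXCL.
- Because of the O_EXCL flag, the call raises FileExistsError when the file already exists.
- The original write_pid_file() does not check whether the process recorded in the PID file is still alive; it simply propagates FileExistsError to the caller.
- Result: even though the old process died long ago, the new Gateway exits because of the stale PID file.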
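The scenario above can be demonstrated in isolation. This is a sketch under stated assumptions: `dead_pid`, `exclusive_create`, and `stale_file_blocks_start` are hypothetical helper names, and the one-field JSON record stands in for the real Hermes record. It shows that O_CREAT | O_EXCL fails on any existing file, whether or not the recorded process is still alive.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def dead_pid() -> int:
    """Return the PID of a process that has already exited and been reaped."""
    proc = subprocess.Popen([sys.executable, "-c", "pass"])
    proc.wait()
    return proc.pid


def exclusive_create(path: Path, payload: str) -> None:
    """Create `path` atomically; raises FileExistsError if it already exists."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(payload)


def stale_file_blocks_start() -> bool:
    """Return True if a PID file naming a dead process still blocks creation."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "gateway.pid"
        # Simulate a leftover file from a crashed instance.
        path.write_text(json.dumps({"pid": dead_pid()}), encoding="utf-8")
        try:
            exclusive_create(path, json.dumps({"pid": os.getpid()}))
        except FileExistsError:
            return True
        return False


if __name__ == "__main__":
    print("stale file blocks startup:", stale_file_blocks_start())
```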
Original write_pid_file() code:

    def write_pid_file() -> None:
        """Write the current process PID and metadata to the gateway PID file.

        Uses atomic O_CREAT | O_EXCL creation so that concurrent --replace
        invocations race: exactly one process wins and the rest get
        FileExistsError.
        """
        path = _get_pid_path()
        path.parent.mkdir(parents=True, exist_ok=True)
        record = json.dumps(_build_pid_record())
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise  # Let caller decide: another gateway is racing us
        ...
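Proposed fix
In write_pid_file(), detect and clean up a stale PID file before the O_EXCL write:
- Check whether the PID file exists.
- Read the PID recorded in it.
- Use os.kill(pid, 0) to test whether that process is alive.
- If the process is dead (ProcessLookupError), delete the stale PID file.
- Then attempt the O_EXCL write as usual.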
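The liveness probe in the steps above can be sketched as a standalone helper. `pid_is_alive` is a hypothetical name, not a Hermes function, and the sketch is POSIX-only. Signal 0 performs the existence and permission checks without delivering a signal. Note that PermissionError means a process with that PID exists but belongs to another user; whether a caller should treat that as "alive" or as "PID reused by someone else" is a policy choice, and this sketch reports it as alive.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Best-effort check that some process currently has this PID (POSIX)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


if __name__ == "__main__":
    print(pid_is_alive(os.getpid()))
```

A PID-only probe cannot tell whether the PID has been reused by an unrelated process, which is why a record that also stores start_time and the command line allows a stricter check.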
Patched code

    def write_pid_file() -> None:
        """Write the current process PID and metadata to the gateway PID file.

        Uses atomic O_CREAT | O_EXCL creation so that concurrent --replace
        invocations race: exactly one process wins and the rest get
        FileExistsError.

        Before attempting O_EXCL, cleans up stale PID files where the recorded
        process has exited (ProcessLookupError) - this prevents a stale PID file
        from blocking a new gateway from starting.
        """
        path = _get_pid_path()
        path.parent.mkdir(parents=True, exist_ok=True)
        record = json.dumps(_build_pid_record())

        # Pre-check: if PID file exists but the recorded process is dead,
        # remove it first so the O_EXCL write won't fail on a stale file.
        if path.exists():
            existing = _read_pid_record(path)
            if existing:
                try:
                    pid = int(existing["pid"])
                    os.kill(pid, 0)  # signal 0 only probes; nothing is delivered
                except (ProcessLookupError, PermissionError, KeyError, ValueError, TypeError):
                    # Process is gone or the record is unusable - remove stale PID file
                    try:
                        path.unlink(missing_ok=True)
                    except OSError:
                        pass

        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise  # Let caller decide: another gateway is racing us

        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                f.write(record)
        except Exception:
            # Do not leave a half-written PID file behind.
            try:
                path.unlink(missing_ok=True)
            except OSError:
                pass
            raise
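Files changed
hermes-agent/gateway/status.py: write_pid_file() function
Effect of the fix
- After the Gateway exits abnormally, the next startup removes the stale PID file automatically and no longer fails with this error.
- The race behavior of concurrent gateway run --replace invocations is unchanged (genuinely competing processes are still arbitrated by O_EXCL).
- get_running_pid() already has full liveness detection (PID + start_time + cmdline), so the fix does not change existing behavior.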
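To illustrate why a PID + start_time comparison is stricter than a PID-only probe, here is a Linux-only sketch. It is not the Hermes implementation: `read_start_ticks` and `record_matches_process` are hypothetical names, and storing start_time as the kernel's clock-tick value from /proc/<pid>/stat is an assumption; the actual format Hermes records is not shown in this report.

```python
import os
from pathlib import Path
from typing import Optional


def read_start_ticks(pid: int) -> Optional[int]:
    """Return the process start time in clock ticks since boot, or None (Linux only)."""
    try:
        stat = Path(f"/proc/{pid}/stat").read_text(encoding="utf-8", errors="replace")
    except OSError:
        return None
    # Field 2 (comm) may contain spaces or parentheses, so split after the last ')'.
    fields = stat.rsplit(")", 1)[-1].split()
    try:
        return int(fields[19])  # field 22 (starttime) of the full stat line
    except (IndexError, ValueError):
        return None


def record_matches_process(record: dict) -> bool:
    """True only if the recorded PID is alive and has the recorded start time."""
    try:
        pid = int(record["pid"])
        recorded = int(record["start_time"])
    except (KeyError, TypeError, ValueError):
        return False
    current = read_start_ticks(pid)
    return current is not None and current == recorded


if __name__ == "__main__":
    me = {"pid": os.getpid(), "start_time": read_start_ticks(os.getpid())}
    print(record_matches_process(me))
```

If the PID has been recycled by an unrelated process, its start time differs from the recorded one, so the record is treated as stale even though the PID itself is alive.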
Steps to Reproduce
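Let Hermes auto-update and restart the gateway. During this step Hermes runs a kill command directly; after the next start the gateway receives no messages, and restarting the gateway again does not help.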
Expected Behavior
PID in use
Actual Behavior
...
Affected Component
Gateway (Telegram/Discord/Slack/WhatsApp)
Messaging Platform (if gateway-related)
No response
Debug Report
ERROR gateway.run: PID file race lost to another gateway instance. Exiting.
Operating System
debian
Python Version
No response
Hermes Version
v0.10.0
Additional Logs / Traceback (optional)
Root Cause Analysis (optional)
No response
Proposed Fix (optional)
Same as the original code, fix plan, and patched code given under Bug Description above.
Are you willing to submit a PR for this?