Skip to content

fix(gateway): prevent --replace race condition causing multiple instances#11734

Closed
opriz wants to merge 2 commits into
NousResearch:mainfrom
opriz:fix/gateway-replace-race-condition
Closed

fix(gateway): prevent --replace race condition causing multiple instances#11734
opriz wants to merge 2 commits into
NousResearch:mainfrom
opriz:fix/gateway-replace-race-condition

Conversation

@opriz

@opriz opriz commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #11718.

When starting the gateway with --replace, concurrent invocations could leave multiple instances running simultaneously. This triggered Telegram (and other platform) polling conflicts, causing the bot to become unresponsive.

Root Cause

write_pid_file() in gateway/status.py used a plain file overwrite. Two racing --replace processes could both proceed through the termination-wait logic, then the second process would silently overwrite the first process's PID record, leaving both instances alive.

Changes

  • gateway/status.py: write_pid_file() now uses atomic O_CREAT | O_EXCL file creation. If the PID file already exists, it raises FileExistsError, ensuring exactly one process wins the race.
  • gateway/run.py: Before writing the PID file, performs a defensive re-check via get_running_pid(). Also catches FileExistsError from write_pid_file(). In both cases, stops the runner and returns False so the process exits cleanly (non-zero exit code for systemd auto-restart compatibility).

Reproduction

Run two gateway --replace invocations concurrently (e.g. via systemd restart overlap or manual double-start):

# Terminal 1
hermes gateway run --replace &

# Terminal 2 (within ~1s)
hermes gateway run --replace &

Before the fix: both processes write to gateway.pid and start polling Telegram, causing:

Conflict: terminated by other getUpdates request; make sure that only one bot instance is running

After the fix: the second process detects the race, logs an error, and exits before starting any platform adapters.

Test Plan

Existing tests

pytest tests/gateway/test_status.py -q                    # 14 passed
pytest tests/gateway/test_runner_startup_failures.py -q   # included in 75 passed
pytest tests/hermes_cli/test_update_gateway_restart.py -q # 34 passed
pytest tests/gateway/test_telegram_conflict.py -q         # 6 passed

Concurrent race simulation

A standalone script was used to verify the atomicity under real OS-level concurrency:

import multiprocessing
import os
import tempfile
from pathlib import Path
from gateway import status

def racer(pid_file, queue):
    status._get_pid_path = lambda: Path(pid_file)
    try:
        status.write_pid_file()
        queue.put(("ok", os.getpid()))
    except FileExistsError:
        queue.put(("race_lost", os.getpid()))

pid_file = Path(tempfile.mkdtemp()) / "gateway.pid"
q = multiprocessing.Queue()
p1 = multiprocessing.Process(target=racer, args=(str(pid_file), q))
p2 = multiprocessing.Process(target=racer, args=(str(pid_file), q))
p1.start(); p2.start()
p1.join(); p2.join()

results = [q.get() for _ in range(2)]
assert len([r for r in results if r[0] == "ok"]) == 1
assert len([r for r in results if r[0] == "race_lost"]) == 1

Result: exactly 1 process succeeds, the other receives FileExistsError.

Backwards Compatibility

  • write_pid_file() signature unchanged; callers that don't handle FileExistsError will see the exception propagate (which is the desired behavior for race detection).
  • remove_pid_file() already guards against deleting another process's PID file, so old-process atexit handlers won't clobber the new PID record.

@opriz opriz force-pushed the fix/gateway-replace-race-condition branch 2 times, most recently from 08ae922 to d98f6f1 Compare April 17, 2026 18:30
@trevorgordon981

Copy link
Copy Markdown

LGTM. The O_CREAT | O_EXCL approach for PID file creation is the textbook way to implement a filesystem-level mutex, and the defensive get_running_pid() re-check before the write catches the window between the old process exiting and the PID file write. The cleanup path in write_pid_file (unlink on write failure after fd open) prevents leaked temp PID files.

Two small observations (not blocking):

  1. The return False contract from the startup function is a nice touch for systemd compatibility, since the caller can map it to a non-zero exit code that systemd's restart policy can act on.

  2. Worth confirming that the --replace path's existing "terminate old + wait + remove old PID file" sequence still runs before write_pid_file(). If the old process crashed without its atexit handler firing, the stale PID file would cause FileExistsError even for a single --replace invocation. The defensive get_running_pid() check handles this only if it validates liveness (e.g., os.kill(pid, 0)) rather than just reading the file.

opriz added 2 commits April 18, 2026 13:47
…nces

When starting the gateway with --replace, concurrent invocations could
leave multiple instances running simultaneously. This happened because
write_pid_file() used a plain overwrite, so the second racer would
silently replace the first process's PID record.

Changes:
- gateway/status.py: write_pid_file() now uses atomic O_CREAT|O_EXCL
  creation. If the file already exists, it raises FileExistsError,
  allowing exactly one process to win the race.
- gateway/run.py: before writing the PID file, re-check get_running_pid()
  and catch FileExistsError from write_pid_file(). In both cases, stop
  the runner and return False so the process exits cleanly.

Fixes NousResearch#11718
If the old process crashed without firing its atexit handler,
remove_pid_file() is a no-op.  Force-unlink the stale gateway.pid
so write_pid_file() (O_CREAT|O_EXCL) does not hit FileExistsError.
@opriz opriz force-pushed the fix/gateway-replace-race-condition branch from d98f6f1 to 730611c Compare April 18, 2026 05:56
@opriz

opriz commented Apr 18, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the review!
Re: point 2 — good catch. remove_pid_file() is a no-op when the PID doesn't match the current process, so I added a force-unlink of the stale gateway.pid right after the --replace wait/kill sequence. This covers the case where the old process crashed without firing its atexit handler.

LGTM. The O_CREAT | O_EXCL approach for PID file creation is the textbook way to implement a filesystem-level mutex, and the defensive get_running_pid() re-check before the write catches the window between the old process exiting and the PID file write. The cleanup path in write_pid_file (unlink on write failure after fd open) prevents leaked temp PID files.

Two small observations (not blocking):

  1. The return False contract from the startup function is a nice touch for systemd compatibility, since the caller can map it to a non-zero exit code that systemd's restart policy can act on.
  2. Worth confirming that the --replace path's existing "terminate old + wait + remove old PID file" sequence still runs before write_pid_file(). If the old process crashed without its atexit handler firing, the stale PID file would cause FileExistsError even for a single --replace invocation. The defensive get_running_pid() check handles this only if it validates liveness (e.g., os.kill(pid, 0)) rather than just reading the file.

@teknium1

Copy link
Copy Markdown
Contributor

Merged via PR #13388 with your commits cherry-picked and authorship preserved in git log. A follow-up commit on top hoists the PID-claim to before runner.start() so the loser of the race exits before ever touching platform adapters — closes the window completely. Thanks for the fix!

#13388

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gateway run --replace race condition: multiple instances run simultaneously

3 participants