Skip to content

fix(gateway): close --replace race completely (salvage #11734)#13388

Merged
teknium1 merged 3 commits into
mainfrom
hermes/hermes-96491119
Apr 21, 2026
Merged

fix(gateway): close --replace race completely (salvage #11734)#13388
teknium1 merged 3 commits into
mainfrom
hermes/hermes-96491119

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Salvages #11734 from @opriz with a follow-up hardening commit.

Summary

Two concurrent hermes gateway run --replace invocations no longer leave multiple Telegram-polling gateways alive. The O_CREAT|O_EXCL PID file from #11734 guarantees exactly one winner, and hoisting the PID-claim before runner.start() means the loser exits before touching any platform adapter.

Fixes #11718.

Changes

  • gateway/status.py (@opriz, cab9c66): write_pid_file() uses atomic O_CREAT | O_EXCL open — racers get FileExistsError instead of silently clobbering each other.
  • gateway/run.py (@opriz, cab9c66): catch FileExistsError / defensive get_running_pid() re-check; return False so systemd Restart works.
  • gateway/run.py (@opriz, 730611c): force-unlink stale PID file after takeover for the old-process-crashed case.
  • gateway/run.py (follow-up): hoist PID-claim to BEFORE runner.start(). The prior placement still allowed the loser to briefly open Telegram polling / Discord sockets before detecting the race; hoisting ensures only the winner ever brings up adapters.
  • tests/gateway/test_status.py (follow-up): regression test — second write_pid_file() raises FileExistsError rather than clobbering.
  • tests/gateway/test_runner_startup_failures.py (follow-up): two --replace tests now use stateful get_running_pid / remove_pid_file mocks, reflecting real post-kill state (None after removal). Static lambda: 42 no longer models the system accurately now that the re-check runs post-kill.

Validation

Result
tests/gateway/test_status.py 27 passed (+1 new race test)
tests/gateway/test_runner_startup_failures.py 36 passed
tests/hermes_cli/test_update_gateway_restart.py 34 passed
tests/gateway/test_telegram_conflict.py 6 passed
E2E: 5-way fork race on write_pid_file 1 winner, 4 FileExistsError

Credit to @opriz — both commits cherry-picked with authorship preserved. AUTHOR_MAP entry included.

opriz and others added 3 commits April 21, 2026 00:32
…nces

When starting the gateway with --replace, concurrent invocations could
leave multiple instances running simultaneously. This happened because
write_pid_file() used a plain overwrite, so the second racer would
silently replace the first process's PID record.

Changes:
- gateway/status.py: write_pid_file() now uses atomic O_CREAT|O_EXCL
  creation. If the file already exists, it raises FileExistsError,
  allowing exactly one process to win the race.
- gateway/run.py: before writing the PID file, re-check get_running_pid()
  and catch FileExistsError from write_pid_file(). In both cases, stop
  the runner and return False so the process exits cleanly.

Fixes #11718
If the old process crashed without firing its atexit handler,
remove_pid_file() is a no-op.  Force-unlink the stale gateway.pid
so write_pid_file() (O_CREAT|O_EXCL) does not hit FileExistsError.
…adapter startup

Follow-up on top of opriz's atomic PID file fix. The prior change caught
the race AFTER runner.start(), so the loser still opened Telegram polling
and Discord gateway sockets before detecting the conflict and exiting.

Hoist the PID-claim block to BEFORE runner.start(). Now the loser of the
O_CREAT|O_EXCL race returns from start_gateway() without ever bringing up
any platform adapter — no Telegram conflict, no Discord duplicate session.

Also add regression tests:
- test_write_pid_file_is_atomic_against_concurrent_writers: second
  write_pid_file() raises FileExistsError rather than clobbering.
- Two existing replace-path tests updated to stateful mocks since the
  real post-kill state (get_running_pid None after remove_pid_file)
  is now exercised by the hoisted re-check.
@teknium1 teknium1 merged commit ce9c91c into main Apr 21, 2026
6 of 7 checks passed
@teknium1 teknium1 deleted the hermes/hermes-96491119 branch April 21, 2026 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gateway run --replace race condition: multiple instances run simultaneously

2 participants