fix(gateway): close --replace race completely (salvage #11734) #13388
Merged
…nces

When starting the gateway with --replace, concurrent invocations could leave multiple instances running simultaneously. This happened because write_pid_file() used a plain overwrite, so the second racer would silently replace the first process's PID record.

Changes:
- gateway/status.py: write_pid_file() now uses atomic O_CREAT|O_EXCL creation. If the file already exists, it raises FileExistsError, allowing exactly one process to win the race.
- gateway/run.py: before writing the PID file, re-check get_running_pid() and catch FileExistsError from write_pid_file(). In both cases, stop the runner and return False so the process exits cleanly.

Fixes #11718
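The exclusive-create behaviour described in this commit can be sketched as follows. This is a minimal illustration, not the code in gateway/status.py: the `pid_path` and `pid` parameters and the `claim()` and `demo()` helpers are assumptions introduced here, and the real `write_pid_file()` signature may differ.

```python
import os
import tempfile
from pathlib import Path


def write_pid_file(pid_path: Path, pid: int) -> None:
    """Create pid_path exclusively; raises FileExistsError if it already exists.

    Hypothetical signature for illustration only.
    """
    # O_CREAT | O_EXCL makes "does it exist?" and "create it" one atomic step,
    # so two racers can never both succeed.
    fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(pid))


def claim(pid_path: Path, pid: int) -> bool:
    """Return True if this caller created the PID file, False if it lost."""
    try:
        write_pid_file(pid_path, pid)
    except FileExistsError:
        return False
    return True


def demo() -> tuple:
    """Two sequential claims on the same path: the first wins, the second loses."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        first = claim(pid_path, 111)
        second = claim(pid_path, 222)
        return first, second, pid_path.read_text()
```

With a plain overwrite, the second call would have replaced the first PID record; here the file still holds the first writer's PID after the second attempt.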
If the old process crashed without firing its atexit handler, remove_pid_file() is a no-op. Force-unlink the stale gateway.pid so write_pid_file() (O_CREAT|O_EXCL) does not hit FileExistsError.
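A small sketch of why the force-unlink matters once creation is exclusive. The helper names (`exclusive_create`, `claim_after_takeover`, `demo`) and the PID values are hypothetical; the sketch assumes the old process is already gone by the time the stale file is removed.

```python
import os
import tempfile
from pathlib import Path


def exclusive_create(pid_path: Path, pid: int) -> None:
    """Stand-in for an O_CREAT|O_EXCL write_pid_file(); hypothetical signature."""
    fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(pid))


def claim_after_takeover(pid_path: Path, pid: int) -> None:
    """Remove a possibly stale PID file, then claim it exclusively."""
    # Assumption: the previous owner has already been stopped or has crashed.
    try:
        pid_path.unlink()
    except FileNotFoundError:
        pass  # Normal case: the old process cleaned up after itself.
    exclusive_create(pid_path, pid)


def demo() -> tuple:
    """Show that a stale file blocks the claim until it is unlinked."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        pid_path.write_text("99999")  # Left behind by a crashed process.
        try:
            exclusive_create(pid_path, 1234)
            blocked = False
        except FileExistsError:
            blocked = True
        claim_after_takeover(pid_path, 1234)
        return blocked, pid_path.read_text()
```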
…adapter startup

Follow-up on top of opriz's atomic PID file fix. The prior change caught the race AFTER runner.start(), so the loser still opened Telegram polling and Discord gateway sockets before detecting the conflict and exiting.

Hoist the PID-claim block to BEFORE runner.start(). Now the loser of the O_CREAT|O_EXCL race returns from start_gateway() without ever bringing up any platform adapter: no Telegram conflict, no Discord duplicate session.

Also add regression tests:
- test_write_pid_file_is_atomic_against_concurrent_writers: second write_pid_file() raises FileExistsError rather than clobbering.
- Two existing replace-path tests updated to stateful mocks since the real post-kill state (get_running_pid None after remove_pid_file) is now exercised by the hoisted re-check.
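The ordering change can be illustrated with a synchronous sketch. `FakeRunner`, the `start_gateway()` signature, and `demo()` below are hypothetical stand-ins (the real function may well be async and take different arguments); the point is only that the PID claim happens before anything that would open platform connections.

```python
import os
import tempfile
from pathlib import Path


class FakeRunner:
    """Stand-in for the gateway runner; records whether adapters were started."""

    def __init__(self) -> None:
        self.started = False

    def start(self) -> None:
        # In the real gateway this is where platform adapters would come up.
        self.started = True


def start_gateway(runner: FakeRunner, pid_path: Path) -> bool:
    """Claim the PID file first; only the winner ever calls runner.start()."""
    try:
        fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        # Lost the race: return before touching any adapter.
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    runner.start()
    return True


def demo() -> tuple:
    """Run a winner and a loser against the same PID path."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        winner, loser = FakeRunner(), FakeRunner()
        won = start_gateway(winner, pid_path)
        lost = start_gateway(loser, pid_path)
        return won, winner.started, lost, loser.started
```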
This was referenced Apr 21, 2026
Salvages #11734 from @opriz with a follow-up hardening commit.
Summary
Two concurrent `hermes gateway run --replace` invocations no longer leave multiple Telegram-polling gateways alive. The O_CREAT|O_EXCL PID file from #11734 guarantees exactly one winner, and hoisting the PID-claim before `runner.start()` means the loser exits before touching any platform adapter.

Fixes #11718.
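The "exactly one winner" property can be checked with a small thread-based race. This is a sketch under stated assumptions, not the PR's regression test: `try_claim`, `race`, and `demo` are hypothetical names, and threads stand in for the separate processes that race in practice.

```python
import os
import tempfile
import threading
from pathlib import Path
from typing import List


def try_claim(pid_path: Path, label: str) -> bool:
    """Attempt an exclusive create; True for the single winner, False otherwise."""
    try:
        fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(label)
    return True


def race(pid_path: Path, contenders: int = 8) -> List[bool]:
    """Release all contenders at once and collect who managed to claim the file."""
    barrier = threading.Barrier(contenders)
    results = [False] * contenders

    def worker(index: int) -> None:
        barrier.wait()  # Line everyone up so the attempts overlap.
        results[index] = try_claim(pid_path, f"contender-{index}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(contenders)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def demo() -> int:
    """Return the number of winners in one race."""
    with tempfile.TemporaryDirectory() as tmp:
        return sum(race(Path(tmp) / "gateway.pid"))
```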
Changes
- `gateway/status.py` (@opriz, cab9c66): `write_pid_file()` uses atomic `O_CREAT | O_EXCL` open, so racers get `FileExistsError` instead of silently clobbering each other.
- `gateway/run.py` (@opriz, cab9c66): catch `FileExistsError` / defensive `get_running_pid()` re-check; return False so systemd Restart works.
- `gateway/run.py` (@opriz, 730611c): force-unlink stale PID file after takeover for the old-process-crashed case.
- `gateway/run.py` (follow-up): hoist PID-claim to BEFORE `runner.start()`. The prior placement still allowed the loser to briefly open Telegram polling / Discord sockets before detecting the race; hoisting ensures only the winner ever brings up adapters.
- `tests/gateway/test_status.py` (follow-up): regression test in which the second `write_pid_file()` raises `FileExistsError` rather than clobbering.
- `tests/gateway/test_runner_startup_failures.py` (follow-up): two `--replace` tests now use stateful `get_running_pid` / `remove_pid_file` mocks, reflecting real post-kill state (`None` after removal). Static `lambda: 42` no longer models the system accurately now that the re-check runs post-kill.

Validation
- `tests/gateway/test_status.py`
- `tests/gateway/test_runner_startup_failures.py`
- `tests/hermes_cli/test_update_gateway_restart.py`
- `tests/gateway/test_telegram_conflict.py`
- `write_pid_file`, `FileExistsError`

Credit to @opriz: both commits cherry-picked with authorship preserved. AUTHOR_MAP entry included.
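The move from a static `lambda: 42` to stateful mocks in the `--replace` tests can be sketched like this. `FakePidState` and `recheck_after_takeover` are hypothetical illustrations, not the fixtures or code paths in the PR; they only model "the PID is gone once the file has been removed".

```python
from typing import Callable, Optional


class FakePidState:
    """Stateful mock: get_running_pid() reflects earlier remove_pid_file() calls."""

    def __init__(self, pid: Optional[int]) -> None:
        self.pid = pid

    def get_running_pid(self) -> Optional[int]:
        return self.pid

    def remove_pid_file(self) -> None:
        # After removal there is no recorded running gateway any more.
        self.pid = None


def recheck_after_takeover(
    get_running_pid: Callable[[], Optional[int]],
    remove_pid_file: Callable[[], None],
) -> bool:
    """Hypothetical stand-in for the replace path: True if the slot is free afterwards."""
    if get_running_pid() is not None:
        remove_pid_file()  # Models "stop the old gateway and clean up its PID file".
    return get_running_pid() is None


state = FakePidState(42)
stateful_ok = recheck_after_takeover(state.get_running_pid, state.remove_pid_file)
# A static mock keeps reporting PID 42, so the re-check never sees a free slot.
static_ok = recheck_after_takeover(lambda: 42, lambda: None)
```

In this sketch `stateful_ok` is True and `static_ok` is False, which mirrors why a static mock stops being a faithful model once a re-check runs after the kill.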