fix(gateway): prevent --replace race condition causing multiple instances#11734
fix(gateway): prevent --replace race condition causing multiple instances#11734opriz wants to merge 2 commits into
Conversation
08ae922 to
d98f6f1
Compare
|
LGTM. The Two small observations (not blocking):
|
…nces When starting the gateway with --replace, concurrent invocations could leave multiple instances running simultaneously. This happened because write_pid_file() used a plain overwrite, so the second racer would silently replace the first process's PID record. Changes: - gateway/status.py: write_pid_file() now uses atomic O_CREAT|O_EXCL creation. If the file already exists, it raises FileExistsError, allowing exactly one process to win the race. - gateway/run.py: before writing the PID file, re-check get_running_pid() and catch FileExistsError from write_pid_file(). In both cases, stop the runner and return False so the process exits cleanly. Fixes NousResearch#11718
If the old process crashed without firing its atexit handler, remove_pid_file() is a no-op. Force-unlink the stale gateway.pid so write_pid_file() (O_CREAT|O_EXCL) does not hit FileExistsError.
d98f6f1 to
730611c
Compare
|
Thanks for the review!
|
Summary
Fixes #11718.
When starting the gateway with
--replace, concurrent invocations could leave multiple instances running simultaneously. This triggered Telegram (and other platform) polling conflicts, causing the bot to become unresponsive.Root Cause
write_pid_file()ingateway/status.pyused a plain file overwrite. Two racing--replaceprocesses could both proceed through the termination-wait logic, then the second process would silently overwrite the first process's PID record, leaving both instances alive.Changes
gateway/status.py:write_pid_file()now uses atomicO_CREAT | O_EXCLfile creation. If the PID file already exists, it raisesFileExistsError, ensuring exactly one process wins the race.gateway/run.py: Before writing the PID file, performs a defensive re-check viaget_running_pid(). Also catchesFileExistsErrorfromwrite_pid_file(). In both cases, stops the runner and returnsFalseso the process exits cleanly (non-zero exit code for systemd auto-restart compatibility).Reproduction
Run two
gateway --replaceinvocations concurrently (e.g. via systemd restart overlap or manual double-start):Before the fix: both processes write to
gateway.pidand start polling Telegram, causing:After the fix: the second process detects the race, logs an error, and exits before starting any platform adapters.
Test Plan
Existing tests
Concurrent race simulation
A standalone script was used to verify the atomicity under real OS-level concurrency:
Result: exactly 1 process succeeds, the other receives
FileExistsError.Backwards Compatibility
write_pid_file()signature unchanged; callers that don't handleFileExistsErrorwill see the exception propagate (which is the desired behavior for race detection).remove_pid_file()already guards against deleting another process's PID file, so old-process atexit handlers won't clobber the new PID record.