fix(gateway): prevent channel double-start after SIGUSR1 config reload#20930
fix(gateway): prevent channel double-start after SIGUSR1 config reload#20930widingmarcus-cyber wants to merge 1 commit intoopenclaw:mainfrom
Conversation
openclaw#20893) When config.patch/config.apply writes a config file, two independent restart paths race: 1. The config file watcher (chokidar, ~300ms debounce) detects the change and either hot-reloads channels or requests a full restart. 2. The WS handler schedules a SIGUSR1 restart (~2s delay). For channel-only changes (e.g. channels.telegram), the watcher hot-reloads the channel (stop + start), then the scheduled SIGUSR1 fires anyway, causing a full gateway restart that starts channels again — resulting in duplicate polling loops and permanent 409 conflicts for Telegram's getUpdates API. Fix with two complementary guards: - scheduleGatewaySigusr1Restart now snapshots restartCycleToken at schedule time and skips the emit if a restart was already consumed (by the config watcher) before the timer fires. - The config watcher's hot-reload path now calls cancelScheduledGatewaySigusr1Restart() after successfully applying a hot-reload, preventing the redundant scheduled restart entirely. - Export cancelScheduledGatewaySigusr1Restart() from restart.ts for use by the config reloader. Adds 3 regression tests covering cancellation, no-op cancel, and the config-watcher-fires-first race condition.
nikolasdehor
left a comment
There was a problem hiding this comment.
Solid fix. The two-pronged approach is the right call here:
-
Cycle token snapshot — snapshotting
restartCycleTokenat schedule time and comparing againstconsumedRestartTokenwhen the timer fires correctly handles the race where the config watcher triggers a full restart before the scheduled SIGUSR1 fires. This is the key guard that prevents the double-start. -
Explicit cancellation — calling
cancelScheduledGatewaySigusr1Restart()after a successful hot-reload provides defense-in-depth for the more common case (channel-only config changes that don't need a full restart at all).
A few things I verified in the diff:
scheduledRestartTimeris properly cleaned up in__testing.resetSigusr1State(), so test isolation is correct.- The
clearTimeoutcall inscheduleGatewaySigusr1Restartbefore setting a new timer prevents stacking ifconfig.patchis called multiple times in quick succession. - The cancellation in
config-reload.tsis placed afterawait opts.onHotReload(plan, nextConfig)succeeds, so it only cancels when hot-reload actually completed — if hot-reload throws, the scheduled restart still fires as a safety net. Good. - All 3 regression tests cover the right scenarios: explicit cancel, no-op cancel, and the config-watcher-fires-first race.
This is directly related to #19968 (SIGUSR1 config reload) which I filed — glad to see the race condition getting properly addressed. The Telegram 409 Conflict from duplicate polling loops was a real pain point.
LGTM.
|
Closing — upstream refactored restart coalescing with cooldown logic and pending timer management, superseding this simpler approach. |
Summary
Fixes #20893
When
config.patchorconfig.applywrites a config file, two independent restart paths race:scheduleGatewaySigusr1Restart)For channel-only changes (e.g.
channels.telegram), the watcher hot-reloads the channel (stop + start), then the scheduled SIGUSR1 fires anyway, causing a full gateway restart that starts channels again — resulting in duplicate polling loops and permanent409 Conflicterrors from Telegram'sgetUpdatesAPI.Root Cause
The config watcher and the explicit
scheduleGatewaySigusr1Restartfrom the WS handler operate independently with no coordination, so both fire for the same config write.Fix
Two complementary guards:
Cycle token snapshot —
scheduleGatewaySigusr1Restartnow snapshotsrestartCycleTokenat schedule time. When the timer fires, it checks if a restart was already consumed (e.g. by the config watcher) and skips the emit if so.Hot-reload cancellation — After the config watcher successfully applies a hot-reload, it calls
cancelScheduledGatewaySigusr1Restart()to cancel the pending scheduled restart entirely.Changes
src/infra/restart.ts: Track scheduled restart timer; addcancelScheduledGatewaySigusr1Restart(); snapshot cycle token to skip stale restartssrc/gateway/config-reload.ts: Cancel scheduled restart after successful hot-reloadsrc/infra/infra-runtime.test.ts: 3 regression tests (cancel works, no-op cancel, config-watcher-fires-first race)Test Results
All 386 gateway tests pass. All 18 restart tests pass (including 3 new regression tests).
Greptile Summary
Fixes race condition where config file changes trigger both hot-reload (via chokidar watcher) and scheduled SIGUSR1 restart (via WS handler), causing duplicate channel starts and
409 Conflicterrors from Telegram.The fix implements two complementary guards:
scheduleGatewaySigusr1Restartnow capturesrestartCycleTokenat schedule time and skips the restart ifconsumedRestartTokenhas advanced (indicating another restart already fired)cancelScheduledGatewaySigusr1Restart()after successful hot-reload to preemptively cancel the pending timerChanges are well-tested with 3 new regression tests covering cancellation, no-op cancellation, and the config-watcher-fires-first race. The logic correctly handles edge cases including previous restart cycles and multiple schedule attempts.
Confidence Score: 5/5
Last reviewed commit: 95e492e