Skip to content

fix(gateway): prevent channel double-start after SIGUSR1 config reload#20930

Closed
widingmarcus-cyber wants to merge 1 commit intoopenclaw:mainfrom
widingmarcus-cyber:fix/telegram-double-start-on-sigusr1
Closed

fix(gateway): prevent channel double-start after SIGUSR1 config reload#20930
widingmarcus-cyber wants to merge 1 commit intoopenclaw:mainfrom
widingmarcus-cyber:fix/telegram-double-start-on-sigusr1

Conversation

@widingmarcus-cyber
Copy link
Copy Markdown
Contributor

@widingmarcus-cyber widingmarcus-cyber commented Feb 19, 2026

Summary

Fixes #20893

When config.patch or config.apply writes a config file, two independent restart paths race:

  1. Config file watcher (chokidar, ~300ms debounce) detects the change and either hot-reloads channels or requests a full restart
  2. WS handler schedules a SIGUSR1 restart (~2s delay via scheduleGatewaySigusr1Restart)

For channel-only changes (e.g. channels.telegram), the watcher hot-reloads the channel (stop + start), then the scheduled SIGUSR1 fires anyway, causing a full gateway restart that starts channels again — resulting in duplicate polling loops and permanent 409 Conflict errors from Telegram's getUpdates API.

Root Cause

[reload] config change detected; evaluating reload (...channels.telegram...)
[gateway/channels] restarting telegram channel    ← hot-reload starts new instances
[gateway] signal SIGUSR1 received                 ← scheduled restart starts them AGAIN

The config watcher and the explicit scheduleGatewaySigusr1Restart from the WS handler operate independently with no coordination, so both fire for the same config write.

Fix

Two complementary guards:

  1. Cycle token snapshotscheduleGatewaySigusr1Restart now snapshots restartCycleToken at schedule time. When the timer fires, it checks if a restart was already consumed (e.g. by the config watcher) and skips the emit if so.

  2. Hot-reload cancellation — After the config watcher successfully applies a hot-reload, it calls cancelScheduledGatewaySigusr1Restart() to cancel the pending scheduled restart entirely.

Changes

  • src/infra/restart.ts: Track scheduled restart timer; add cancelScheduledGatewaySigusr1Restart(); snapshot cycle token to skip stale restarts
  • src/gateway/config-reload.ts: Cancel scheduled restart after successful hot-reload
  • src/infra/infra-runtime.test.ts: 3 regression tests (cancel works, no-op cancel, config-watcher-fires-first race)

Test Results

All 386 gateway tests pass. All 18 restart tests pass (including 3 new regression tests).

Greptile Summary

Fixes race condition where config file changes trigger both hot-reload (via chokidar watcher) and scheduled SIGUSR1 restart (via WS handler), causing duplicate channel starts and 409 Conflict errors from Telegram.

The fix implements two complementary guards:

  • Cycle token snapshot: scheduleGatewaySigusr1Restart now captures restartCycleToken at schedule time and skips the restart if consumedRestartToken has advanced (indicating another restart already fired)
  • Explicit cancellation: Config watcher calls cancelScheduledGatewaySigusr1Restart() after successful hot-reload to preemptively cancel the pending timer

Changes are well-tested with 3 new regression tests covering cancellation, no-op cancellation, and the config-watcher-fires-first race. The logic correctly handles edge cases including previous restart cycles and multiple schedule attempts.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix implements a robust two-pronged solution to a well-documented race condition. The cycle token snapshot logic correctly handles all edge cases (previous restarts, multiple schedules, race scenarios). The explicit cancellation provides defense-in-depth. All 3 new regression tests pass alongside 383 existing gateway tests. The changes are minimal, focused, and preserve backward compatibility.
  • No files require special attention

Last reviewed commit: 95e492e

openclaw#20893)

When config.patch/config.apply writes a config file, two independent
restart paths race:

1. The config file watcher (chokidar, ~300ms debounce) detects the
   change and either hot-reloads channels or requests a full restart.
2. The WS handler schedules a SIGUSR1 restart (~2s delay).

For channel-only changes (e.g. channels.telegram), the watcher
hot-reloads the channel (stop + start), then the scheduled SIGUSR1
fires anyway, causing a full gateway restart that starts channels
again — resulting in duplicate polling loops and permanent 409
conflicts for Telegram's getUpdates API.

Fix with two complementary guards:

- scheduleGatewaySigusr1Restart now snapshots restartCycleToken at
  schedule time and skips the emit if a restart was already consumed
  (by the config watcher) before the timer fires.

- The config watcher's hot-reload path now calls
  cancelScheduledGatewaySigusr1Restart() after successfully applying
  a hot-reload, preventing the redundant scheduled restart entirely.

- Export cancelScheduledGatewaySigusr1Restart() from restart.ts for
  use by the config reloader.

Adds 3 regression tests covering cancellation, no-op cancel, and
the config-watcher-fires-first race condition.
Copy link
Copy Markdown

@nikolasdehor nikolasdehor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid fix. The two-pronged approach is the right call here:

  1. Cycle token snapshot — snapshotting restartCycleToken at schedule time and comparing against consumedRestartToken when the timer fires correctly handles the race where the config watcher triggers a full restart before the scheduled SIGUSR1 fires. This is the key guard that prevents the double-start.

  2. Explicit cancellation — calling cancelScheduledGatewaySigusr1Restart() after a successful hot-reload provides defense-in-depth for the more common case (channel-only config changes that don't need a full restart at all).

A few things I verified in the diff:

  • scheduledRestartTimer is properly cleaned up in __testing.resetSigusr1State(), so test isolation is correct.
  • The clearTimeout call in scheduleGatewaySigusr1Restart before setting a new timer prevents stacking if config.patch is called multiple times in quick succession.
  • The cancellation in config-reload.ts is placed after await opts.onHotReload(plan, nextConfig) succeeds, so it only cancels when hot-reload actually completed — if hot-reload throws, the scheduled restart still fires as a safety net. Good.
  • All 3 regression tests cover the right scenarios: explicit cancel, no-op cancel, and the config-watcher-fires-first race.

This is directly related to #19968 (SIGUSR1 config reload) which I filed — glad to see the race condition getting properly addressed. The Telegram 409 Conflict from duplicate polling loops was a real pain point.

LGTM.

@widingmarcus-cyber
Copy link
Copy Markdown
Contributor Author

Closing — upstream refactored restart coalescing with cooldown logic and pending timer management, superseding this simpler approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Telegram providers start twice after SIGUSR1 restart, causing permanent getUpdates 409 conflicts

2 participants