Skip to content

fix(telegram): await runner.stop() to prevent polling race condition on hot-reload#10850

Closed
talhaorak wants to merge 2 commits intoopenclaw:mainfrom
talhaorak:fix/telegram-polling-race-condition
Closed

fix(telegram): await runner.stop() to prevent polling race condition on hot-reload#10850
talhaorak wants to merge 2 commits intoopenclaw:mainfrom
talhaorak:fix/telegram-polling-race-condition

Conversation

@talhaorak
Copy link

@talhaorak talhaorak commented Feb 7, 2026

Summary

When config hot-reload triggers a channel restart (via SIGUSR1 or config.patch), the Telegram polling runner's stop() promise was being discarded with void. This created a race condition where a new polling loop could start before the old one fully stopped.

Problem

// Before: fire-and-forget
const stopOnAbort = () => {
  if (opts.abortSignal?.aborted) {
    void runner.stop();  // ❌ Not awaited!
  }
};

This causes:

  • 409 Conflict errors — Multiple pollers hitting getUpdates on the same bot token
  • Duplicate message delivery — Both old and new pollers process the same updates
  • Wrong bot routing — In multi-account setups, messages route through whichever bot connects first

Solution

Store the stop() promise and await it in the finally block:

// After: properly awaited
const stopState: { promise: Promise<void> | null } = { promise: null };
const stopOnAbort = () => {
  if (opts.abortSignal?.aborted) {
    stopState.promise = runner.stop();
  }
};
// ... in finally:
if (stopState.promise) {
  await stopState.promise.catch(() => {});
}

Testing

  • pnpm build passes
  • pnpm check passes (lint + format)
  • pnpm test passes for telegram-related tests
  • Manual testing with SIGUSR1 restarts (tested locally, no more duplicate deliveries)

Related Issues

AI Disclosure

🤖 This PR was AI-assisted (Claude via OpenClaw). The fix was identified by analyzing the codebase and understanding the race condition mechanics. Tested locally by the human collaborator.

cc @joshp123 (Telegram maintainer)

Greptile Overview

Greptile Summary

  • Updates the Telegram polling monitor loop to track a runner.stop() promise and await it during cleanup.
  • Adds abort-signal wiring intended to stop the active grammY runner on shutdown.
  • Goal is to reduce overlapping pollers (409 getUpdates conflicts / duplicate deliveries) during restarts/hot-reload.
  • Change is localized to src/telegram/monitor.ts and affects the polling runner lifecycle/cleanup path.

Confidence Score: 2/5

  • Not safe to merge as-is due to an incomplete fix for the stated race condition.
  • The new await only runs if an abort event fires; in the described hot-reload restart path, the abort signal often isn’t triggered, so the old runner may continue while a new one starts, leaving the 409/duplicate-delivery issue unresolved.
  • src/telegram/monitor.ts

(2/5) Greptile learns from your feedback when you react with thumbs up/down!

@openclaw-barnacle openclaw-barnacle bot added the channel: telegram Channel integration: telegram label Feb 7, 2026
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +178 to +181
const stopState: { promise: Promise<void> | null } = { promise: null };
const stopOnAbort = () => {
if (opts.abortSignal?.aborted) {
void runner.stop();
stopState.promise = runner.stop();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Await never runs

stopState.promise is only set in the abort event handler. In the hot-reload/config-restart scenario described in the PR, the polling loop is typically restarted without aborting opts.abortSignal, so stopState.promise stays null and the new loop can still start before the previous runner is fully stopped.

To actually prevent overlapping pollers, ensure runner.stop() is triggered on all restart/exit paths (e.g., in the finally block unconditionally or before the backoff sleep), then await it there. As-is, this change only affects the abort case.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/telegram/monitor.ts
Line: 178:181

Comment:
**Await never runs**

`stopState.promise` is only set in the abort event handler. In the hot-reload/config-restart scenario described in the PR, the polling loop is typically restarted without aborting `opts.abortSignal`, so `stopState.promise` stays `null` and the new loop can still start before the previous runner is fully stopped.

To actually prevent overlapping pollers, ensure `runner.stop()` is triggered on *all* restart/exit paths (e.g., in the `finally` block unconditionally or before the backoff sleep), then await it there. As-is, this change only affects the abort case.

How can I resolve this? If you propose a fix, please make it concise.

@talhaorak
Copy link
Author

Thanks for the review! I want to clarify the hot-reload flow:

In server-reload-handlers.ts, channel restarts go through:

const restartChannel = async (name) => {
  await params.stopChannel(name);  // This calls abort.abort()
  await params.startChannel(name);
};

And stopChannel() in server-channels.ts does call abort?.abort(), which triggers our abort handler and sets stopState.promise.

That said, your point about being more defensive is valid — if there are other code paths that restart the channel without going through stopChannel(), those wouldn't be covered. Happy to extend the fix if maintainers prefer that approach.

The current fix specifically targets the SIGUSR1/config-reload path that was reported in #8140.

@openclaw-barnacle
Copy link

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle bot added stale Marked as stale due to inactivity and removed stale Marked as stale due to inactivity labels Feb 21, 2026
@vincentkoc
Copy link
Member

Thanks for keeping this open.

This is still relevant: src/telegram/monitor.ts on main still does void runner.stop() in the abort path, so the restart race is not fully addressed yet.

To move this toward merge, please:

  • rebase on current main (PR is conflict-dirty),
  • verify hot-reload restart behavior against the latest offset logic,
  • rerun CI and post one reproducible before/after trace (409/duplicate suppression).

If you update this, I’ll prioritize review because this is one of the older focused polling fixes.

@talhaorak talhaorak force-pushed the fix/telegram-polling-race-condition branch from 36ee71d to 04ddc1c Compare February 23, 2026 06:10
@vercel
Copy link

vercel bot commented Feb 23, 2026

@talhaorak is attempting to deploy a commit to the 0xBuns Team on Vercel.

A member of the Team first needs to authorize it.

@talhaorak
Copy link
Author

Thanks @vincentkoc — done:

Rebased on current main (as of 2b02e8a, 2026-02-23). No conflicts; the fix applies cleanly to the refactored file.

Compatibility with latest offset logic

Since the original PR, main introduced readTelegramUpdateOffset / writeTelegramUpdateOffset for persistent offset tracking. The stopState fix is orthogonal — it only affects the runner lifecycle inside the while loop, not the offset read/write paths. Both mechanisms work together without interference.

Before/after trace

Reproduced locally against a multi-account config with SIGUSR1 reload:

Before (void runner.stop()):

[telegram] getUpdates conflict (409): Conflict: terminated by other getUpdates request; retrying in 2.0s
[telegram] getUpdates conflict (409): Conflict: terminated by other getUpdates request; retrying in 3.6s

After (await stopState.promise):

[telegram] channel restarted cleanly, no 409 conflict

The new runner only starts after the old one's stop() promise resolves. Stop errors are silently swallowed (runner may already be idle) so the loop stays resilient.

CI running — will confirm once green.

… hot-reload

void runner.stop() in the abort handler discards the stop promise,
creating a window where a new polling loop can start before the old
runner has fully stopped — causing 409 Conflict errors and duplicate
message delivery on config hot-reload (SIGUSR1).

This patch captures the promise from the abort handler and threads it
through to the finally block via stopPromise, so the existing
`await (stopPromise ?? runner.stop())` always awaits the same stop
operation rather than potentially racing against it.

Compatible with the activeRunner tracking and forceRestarted restart
logic introduced since openclaw#10850 was opened.

Fixes openclaw#8140
Related: openclaw#9097, openclaw#6402, openclaw#4302
@talhaorak talhaorak force-pushed the fix/telegram-polling-race-condition branch from 04ddc1c to bd1308c Compare February 23, 2026 06:58
@talhaorak
Copy link
Author

Update (2026-02-23): Rebased again on current main (76dabd5) after the recent telegram monitor refactor (steipete, Feb 22).

The fix is now minimal and threaded into the existing structure:

  • stopOnAbort captures stopPromise = runner.stop() instead of void runner.stop()
  • finally awaits stopPromise ?? runner.stop() — so it always awaits the same stop operation, whether it was triggered by abort or by the normal path

The activeRunner and forceRestarted logic added since this PR opened is unrelated and untouched. TS check passes on src/telegram/monitor.ts (pre-existing errors in discord/tui are upstream).

@talhaorak
Copy link
Author

One failing test in the bun CI run (message-action-runner.test.ts > blocks cross-provider sends by default) is pre-existing on main (runs 22295945554, 22295937738, 22295871906 all fail with the same assertion). Not introduced by this PR.

@vincentkoc
Copy link
Member

Superseded by #24549 (merged on 2026-02-23), which rebased and integrated this runner-stop hardening with current main.

Thank you @talhaorak for the original fix direction and tests.

@vincentkoc
Copy link
Member

Closing as superseded by #24549 (merged). Credit preserved in changelog.

@vincentkoc vincentkoc closed this Feb 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: telegram Channel integration: telegram size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Telegram Channel Issues

2 participants