fix(gateway): stagger Telegram channel startup to prevent polling stalls #78437

Open
glasswings-lang wants to merge 2 commits into openclaw:main from glasswings-lang:fix/sequential-channel-startup

Conversation


@glasswings-lang glasswings-lang commented May 6, 2026

Summary

Channel startup uses runTasksWithConcurrency with limit: 4, but each task's inner async function returns as soon as the provider promise is handed off (without awaiting it), so the concurrency limiter effectively serializes microsecond-scale handoffs while all providers boot in parallel. Combined with schedulePrimaryModelPrewarm firing concurrently with startChannels(), simultaneous getMe calls and provider-runtime imports starve the Node event loop.

On hosts with more than one Telegram account, this manifests as getUpdates polling stalls that last several minutes before the watchdog forces restart cycles.
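The handoff problem described above can be reproduced in isolation. The sketch below is not the gateway code: `runWithLimit` is a hypothetical stand-in for `runTasksWithConcurrency` (whose real signature is not shown in this PR), and `bootProvider` simulates a provider boot with a 20 ms timer. It only illustrates why a limiter cannot bound work that its tasks hand off without awaiting.

```typescript
// Hypothetical stand-in for the gateway's limiter; the real
// runTasksWithConcurrency implementation is not shown in this PR.
async function runWithLimit(tasks: Array<() => Promise<void>>, limit: number): Promise<void> {
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const task = tasks[next++];
      await task(); // the slot is held only until the task's own promise settles
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, () => worker()));
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Returns the peak number of simulated provider boots running at the same time.
async function measurePeakBoots(awaitProvider: boolean, accounts: number, limit: number): Promise<number> {
  let active = 0;
  let peak = 0;
  const pending: Promise<void>[] = [];
  const bootProvider = async (): Promise<void> => {
    active += 1;
    peak = Math.max(peak, active);
    await sleep(20); // simulated getMe call and runtime import work
    active -= 1;
  };
  const tasks = Array.from({ length: accounts }, () => async (): Promise<void> => {
    const provider = bootProvider();
    pending.push(provider); // handed off, as the PR describes
    if (awaitProvider) await provider; // only this branch keeps the limiter slot busy
  });
  await runWithLimit(tasks, limit);
  await Promise.all(pending); // let handed-off boots finish before reporting
  return peak;
}
```

With five simulated accounts and a limit of 4, `measurePeakBoots(false, 5, 4)` reports a peak of 5 because every handoff returns immediately, while `measurePeakBoots(true, 5, 4)` stays at 4. Lowering the limit to 1 without awaiting still yields a peak of 5, which is consistent with the PR pairing the lower limit with an explicit stagger.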

Changes

  • CHANNEL_STARTUP_CONCURRENCY: 4 → 1
  • New CHANNEL_STARTUP_STAGGER_MS = 3_000 with await sleepWithAbort() between task completions, so the limiter actually spaces account boots out
  • Move schedulePrimaryModelPrewarm() to run AFTER startChannels() awaits, so prewarm's provider-runtime imports don't compete for CPU with channel handshakes
  • Update server-channels.test.ts fanout assertions to match the new contract (concurrency 1, per-account stagger calls verified)
  • Add user-facing CHANGELOG entry under ### Fixes
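A minimal sketch of the staggered startup and prewarm ordering listed above, under stated assumptions: `sleepWithAbortSketch`, `startChannelsStaggered`, and `startGatewaySketch` are hypothetical names written for illustration, and the real `sleepWithAbort`, `startChannels`, and `schedulePrimaryModelPrewarm` are only named in this PR, not shown. The sleep after each handoff follows the review's description of the diff.

```typescript
// Hypothetical abortable sleep; the real sleepWithAbort helper is only named in the PR.
function sleepWithAbortSketch(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      clearTimeout(timer); // stop waiting as soon as shutdown is requested
      reject(new Error("aborted"));
    }
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Hands off one account at a time and waits staggerMs after each handoff.
async function startChannelsStaggered(
  accountIds: string[],
  handOff: (accountId: string) => void,
  staggerMs: number,
  signal?: AbortSignal,
): Promise<void> {
  for (const accountId of accountIds) {
    handOff(accountId); // the provider keeps booting in the background
    await sleepWithAbortSketch(staggerMs, signal);
  }
}

// Runs prewarm only after every channel handoff (and its stagger) has completed.
async function startGatewaySketch(
  accountIds: string[],
  handOff: (accountId: string) => void,
  prewarm: () => void,
  staggerMs: number,
  signal?: AbortSignal,
): Promise<void> {
  await startChannelsStaggered(accountIds, handOff, staggerMs, signal);
  prewarm();
}
```

Passing an `AbortSignal` lets a shutdown during the boot window cancel the remaining waits instead of holding the process for the full stagger.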

Real behavior proof

Behavior or issue addressed: multi-Telegram-account gateway startup wedge — polling stalls and getMe cascades during simultaneous bot startup, observed on Windows 11 + Node 24. Subset of the chronic #73323 pattern (the startup-cascade slice of it).

Real environment tested: Windows 11 Home 26200, Node 24.14.0, glasswings-lang/openclaw fork at branch local/all-fixes (this PR's commits merged in). Five Telegram bot accounts: @EthelredBot, @MariaClone_bot, @runningkittenBot, @Rozaya1Bot, @HavenismBot. Single-host setup with the user's normal ~/.openclaw/openclaw.json.

Exact steps or command run after this patch: stopped the running gateway process and the registered Windows scheduled-task instance, removed the stale lock at %TEMP%\openclaw\gateway.<hash>.lock, then restarted with stdout redirected to a file:

cd "C:/git-src/openclaw"
export OPENCLAW_TELEGRAM_FORCE_IPV4=1
node dist/index.js gateway --port 18789 > gateway-startup.log 2>&1

After-fix evidence: copied live output from the gateway log on the affected host, ANSI stripped:

2026-05-06T12:32:12.314-08:00 [gateway] http server listening (2 plugins: memory-core, telegram; 9.4s)
2026-05-06T12:32:12.837-08:00 [gateway] starting channels and sidecars...
2026-05-06T12:32:14.280-08:00 [telegram] [aethelred] starting provider (@EthelredBot)
2026-05-06T12:32:17.078-08:00 [telegram] [maria]     starting provider (@MariaClone_bot)
2026-05-06T12:32:20.065-08:00 [telegram] [pet]       starting provider (@runningkittenBot)
2026-05-06T12:32:23.101-08:00 [telegram] [rozaya]    starting provider (@Rozaya1Bot)
2026-05-06T12:32:26.081-08:00 [telegram] [tensor]    starting provider (@HavenismBot)
2026-05-06T12:32:28.443-08:00 [gateway] ready

Observed result after fix: bot startups landed at 12:32:14, 12:32:17, 12:32:20, 12:32:23, 12:32:26, with inter-bot deltas of 2.798s, 2.987s, 3.036s, 2.980s, confirming the 3-second CHANNEL_STARTUP_STAGGER_MS spacing. All five accounts booted serially under the new concurrency-1 contract; the gateway reached ready about 14 seconds after the first provider start (15.6 seconds after the "starting channels and sidecars..." line); no polling stalls were observed during the boot window. Compare against the pre-fix #73323 reports where the same five-bot setup wedged for 4+ minutes during simultaneous startup. Bot tokens redacted; bot handles shown are public usernames.
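The spacing figures can be rechecked directly from the log timestamps. The snippet below is only a convenience for reviewers; the timestamps are copied from the log excerpt above, and the helper name `deltasMs` is made up for this example.

```typescript
// Provider start timestamps copied from the gateway log excerpt in this PR.
const providerStarts: string[] = [
  "2026-05-06T12:32:14.280-08:00",
  "2026-05-06T12:32:17.078-08:00",
  "2026-05-06T12:32:20.065-08:00",
  "2026-05-06T12:32:23.101-08:00",
  "2026-05-06T12:32:26.081-08:00",
];
const channelsStarting = "2026-05-06T12:32:12.837-08:00";
const gatewayReady = "2026-05-06T12:32:28.443-08:00";

// Millisecond gaps between consecutive timestamps.
function deltasMs(timestamps: string[]): number[] {
  const ms = timestamps.map((t) => Date.parse(t));
  return ms.slice(1).map((t, i) => t - ms[i]);
}

const deltas = deltasMs(providerStarts); // [2798, 2987, 3036, 2980]
const readyAfterFirstProviderMs = Date.parse(gatewayReady) - Date.parse(providerStarts[0]); // 14163
const readyAfterChannelStartMs = Date.parse(gatewayReady) - Date.parse(channelsStarting); // 15606
```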

What was not tested: the deeper chronic post-startup undici dispatcher state degradation from #73323 — that affects long-running gateway processes regardless of these mitigations and likely needs Node profiling to root-cause. This PR addresses the startup-cascade subset only.

AI-assisted PR

Authored with Claude Code (Anthropic) in collaboration with the human contributor. The contributor verified the fix on their own multi-Telegram Windows setup (proof above), reviewed the diff, and confirms understanding of the concurrency/stagger semantics.

Test plan

  • Build passes locally (pnpm build)
  • Updated server-channels.test.ts assertions reflect concurrency-1 + stagger contract
  • Manual: 5-bot Windows 11 + Node 24 reproduction shows clean staggered startup with no polling stalls during the boot window (log excerpt above)
  • Maintainer review of stagger constant choice (3s default — happy to make it configurable)

🤖 Generated with Claude Code

Channel startup uses runTasksWithConcurrency with limit=4, but each task's
inner async function returns as soon as the provider promise is handed
off (without awaiting it), so the concurrency limiter effectively serializes
microsecond-scale handoffs while all providers boot in parallel. Combined
with schedulePrimaryModelPrewarm firing concurrently with startChannels(),
simultaneous getMe calls and provider-runtime imports starve the Node event
loop. Telegram getUpdates stalls for several minutes before the watchdog
forces restart cycles.

- CHANNEL_STARTUP_CONCURRENCY: 4 -> 1
- Add CHANNEL_STARTUP_STAGGER_MS=3000 with sleep between task completions,
  so the limiter actually spaces account boots out
- Move schedulePrimaryModelPrewarm() to run AFTER startChannels() awaits

Repro: Windows 11, Node 24, 5 Telegram accounts. Before: 4+ min polling
stalls, ~2.5 min reply latency. After: 5 accounts boot ~15s spaced 3s
apart, no stalls observed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 6, 2026
Contributor

clawsweeper Bot commented May 6, 2026

Codex review: needs changes before merge.

Summary
The PR serializes Gateway channel/account startup with a 3-second stagger, moves primary model prewarm after channel startup, updates gateway fanout tests, and adds a changelog entry.

Reproducibility: unclear for a live reproduction of the full Windows polling-stall symptom; I did not reproduce that environment. Source inspection does show current main fans out startup handoffs, and the PR body supplies after-fix Windows logs for the staggered startup slice.

Real behavior proof
Sufficient (logs): The PR body includes copied live Windows gateway logs showing the after-fix staggered multi-Telegram startup and readiness result.

Next step before merge
A narrow automated repair can address the fake-timer startup hang, while final merge still needs maintainer policy review for the global stagger.

Security
Cleared: The diff changes Gateway startup timing/order plus tests and changelog only, with no dependency, workflow, secret-handling, package-resolution, or supply-chain surface.

Review findings

  • [P2] Make the startup stagger timer-safe — src/gateway/server-channels.ts:624-627
Review details

Best possible solution:

Make the stagger timer-safe and narrowly validated, then have Gateway maintainers decide whether a fixed global startup delay is the right shared policy.

Do we have a high-confidence way to reproduce the issue?

Unclear for a live reproduction of the full Windows polling-stall symptom; I did not reproduce that environment. Source inspection does show current main fans out startup handoffs, and the PR body supplies after-fix Windows logs for the staggered startup slice.

Is this the best way to solve the issue?

No, not yet. The approach is plausible, but the implementation needs a timer-safe seam or broader test updates and maintainer agreement on the global 3-second Gateway startup policy.

Full review comments:

  • [P2] Make the startup stagger timer-safe — src/gateway/server-channels.ts:624-627
    Adding an awaited sleepWithAbort(3_000, ...) after every handoff means fake-timer tests that await manager.startChannels() before advancing timers can hang. This suite enables fake timers globally, and many unchanged tests still await startup directly, so the patch needs a bypass/test seam or broader test updates beyond the fanout cases.
    Confidence: 0.94
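One possible shape for the test seam this finding asks for is an injectable sleep function. This is a sketch under assumptions, not the repository's API: `SleepFn`, `StaggerOptions`, and `handOffWithStagger` are hypothetical names, and the actual `startChannels` signature is not shown in this PR.

```typescript
// Hypothetical seam: callers may inject their own sleep so tests never wait on a timer.
type SleepFn = (ms: number, signal?: AbortSignal) => Promise<void>;

const realSleep: SleepFn = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

interface StaggerOptions {
  staggerMs: number;
  sleep?: SleepFn; // defaults to a real timer-based sleep
  signal?: AbortSignal;
}

async function handOffWithStagger(
  accountIds: string[],
  handOff: (accountId: string) => void,
  opts: StaggerOptions,
): Promise<void> {
  const sleep = opts.sleep ?? realSleep;
  for (const accountId of accountIds) {
    handOff(accountId);
    // A stagger of 0 skips the wait entirely, so no timer is ever scheduled.
    if (opts.staggerMs > 0) await sleep(opts.staggerMs, opts.signal);
  }
}

// Test-side usage: record the requested delays and resolve immediately.
const requestedDelays: number[] = [];
const recordingSleep: SleepFn = async (ms) => {
  requestedDelays.push(ms);
};
```

With a seam like this, unchanged tests that await startup before advancing fake timers could pass `staggerMs: 0` or `recordingSleep`, while the fanout tests could still assert one stagger call per account.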

Overall correctness: patch is incorrect
Overall confidence: 0.91

Acceptance criteria:

  • pnpm test src/gateway/server-channels.test.ts
  • pnpm test src/gateway/server-startup-post-attach.test.ts
  • pnpm build

What I checked:

  • Current main startup fanout: Current main sets CHANNEL_STARTUP_CONCURRENCY = 4, so channel/account startup can fan out before provider handoff completes. (src/gateway/server-channels.ts:40, 97b07eaeaf38)
  • Current main provider handoff: The current task stores the tracked provider promise and returns to the concurrency runner without awaiting provider lifetime, matching the PR's described source behavior. (src/gateway/server-channels.ts:619, 97b07eaeaf38)
  • PR adds awaited stagger: The PR adds await sleepWithAbort(CHANNEL_STARTUP_STAGGER_MS, abort.signal) after each account handoff in the shared startup path. (src/gateway/server-channels.ts:624, 7b2249cec3ee)
  • Fake timer suite exposure: server-channels.test.ts enables fake timers in beforeEach, and many unchanged tests still await manager.startChannels() before advancing timers. (src/gateway/server-channels.test.ts:183, 97b07eaeaf38)
  • Existing hang candidate: The first unchanged test awaits manager.startChannels() before vi.advanceTimersByTimeAsync, which would wait on the new 3-second fake-timer sleep. (src/gateway/server-channels.test.ts:204, 97b07eaeaf38)
  • Real behavior proof supplied: The PR body includes copied Windows gateway log output showing five Telegram provider startups at roughly 3-second intervals and the gateway reaching ready after the staggered boot window. (7b2249cec3ee)

Likely related people:

  • steipete: Local blame on the current startup fanout, handoff, prewarm ordering, and fake-timer test setup points to Peter Steinberger, and the prior ClawSweeper review also tied recent gateway startup work to this area. (role: recent maintainer and startup-path owner; confidence: high; commits: 167e43345acc, bf2711b40ef7, 11f0244cf431; files: src/gateway/server-channels.ts, src/gateway/server-channels.test.ts, src/gateway/server-startup-post-attach.ts)
  • kevinslin: The previous ClawSweeper review identified recent channel restart lifecycle work by Kevin Lin in the same server-channels surface; local shallow history was insufficient to independently expand that trail. (role: recent adjacent lifecycle maintainer; confidence: medium; commits: 5a8ccb6fe0ef; files: src/gateway/server-channels.ts, src/gateway/server-channels.test.ts)

Remaining risk / open question:

Codex review notes: model gpt-5.5, reasoning high; reviewed against 97b07eaeaf38.

…ger contract

The previous assertions in `limits whole-channel account startup fanout to four`
and `limits channel plugin startup fanout to four` expected `maxActive == 4`
under the old `CHANNEL_STARTUP_CONCURRENCY = 4` contract. With this PR's
serialization (concurrency 1, 3s handoff stagger), only one preflight runs at
a time, so the tests are renamed and rewritten to walk through the staggered
handoffs and verify single-active behavior plus per-account stagger calls.

Also adds the user-facing CHANGELOG entry the bot requested.

Addresses ClawSweeper P2 + P3 review findings on PR openclaw#78437.
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 6, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026

Labels

gateway Gateway runtime proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. size: S

Projects

None yet

1 participant