fix(gateway): stagger Telegram channel startup to prevent polling stalls#78437
glasswings-lang wants to merge 2 commits into openclaw:main
Conversation
Channel startup uses `runTasksWithConcurrency` with `limit=4`, but each task's inner async function returns as soon as the provider promise is handed off (without awaiting it), so the concurrency limiter effectively serializes microsecond-scale handoffs while all providers boot in parallel. Combined with `schedulePrimaryModelPrewarm` firing concurrently with `startChannels()`, simultaneous `getMe` calls and provider-runtime imports starve the Node event loop. Telegram `getUpdates` stalls for several minutes before the watchdog forces restart cycles.

- `CHANNEL_STARTUP_CONCURRENCY`: 4 -> 1
- Add `CHANNEL_STARTUP_STAGGER_MS=3000` with a sleep between task completions, so the limiter actually spaces account boots out
- Move `schedulePrimaryModelPrewarm()` to run AFTER `startChannels()` awaits

Repro: Windows 11, Node 24, 5 Telegram accounts. Before: 4+ min polling stalls, ~2.5 min reply latency. After: 5 accounts boot in ~15s, spaced 3s apart; no stalls observed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
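The handoff bug described above can be sketched as follows. This is a minimal reconstruction, not openclaw's actual code: the limiter here is a generic worker pool, and `bootProvider` and the counters are hypothetical stand-ins for the provider boot work.

```typescript
// Generic worker-pool limiter: it can only throttle work the task itself awaits.
async function runTasksWithConcurrency(
  tasks: Array<() => Promise<void>>,
  limit: number,
): Promise<void> {
  const queue = [...tasks];
  const workers = Array.from({ length: limit }, async () => {
    while (queue.length > 0) {
      await queue.shift()!();
    }
  });
  await Promise.all(workers);
}

let active = 0;
let maxActive = 0;
const bootsInFlight: Promise<void>[] = [];

// Hypothetical stand-in for a provider boot that takes real time.
async function bootProvider(): Promise<void> {
  active += 1;
  maxActive = Math.max(maxActive, active);
  await new Promise((resolve) => setTimeout(resolve, 20));
  active -= 1;
}

// The buggy shape: each task resolves as soon as the boot promise is handed
// off, so the limiter sees microsecond-scale tasks and every boot runs in
// parallel despite limit = 4.
const buggyTasks = Array.from({ length: 5 }, () => async () => {
  bootsInFlight.push(bootProvider()); // not awaited!
});

await runTasksWithConcurrency(buggyTasks, 4);
await Promise.all(bootsInFlight);
// maxActive reaches 5 here: the limit of 4 never constrained the boots.
```

Awaiting the boot inside the task body (rather than just pushing its promise) would restore the intended throttling; this PR goes further and serializes the boots outright.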
Codex review: needs changes before merge.

Summary: Reproducibility is unclear for a live reproduction of the full Windows polling-stall symptom; I did not reproduce that environment. Source inspection does show that current main fans out startup handoffs, and the PR body supplies after-fix Windows logs for the staggered startup slice.

Real behavior proof
Next step before merge
Security Review findings
Review details

Best possible solution: Make the stagger timer-safe and narrowly validated, then have Gateway maintainers decide whether a fixed global startup delay is the right shared policy.

Do we have a high-confidence way to reproduce the issue? Unclear for a live reproduction of the full Windows polling-stall symptom; I did not reproduce that environment. Source inspection does show that current main fans out startup handoffs, and the PR body supplies after-fix Windows logs for the staggered startup slice.

Is this the best way to solve the issue? No, not yet. The approach is plausible, but the implementation needs a timer-safe seam or broader test updates, and maintainer agreement on the global 3-second Gateway startup policy.

Full review comments:
Overall correctness: patch is incorrect

Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 97b07eaeaf38.
…ger contract

The previous assertions in `limits whole-channel account startup fanout to four` and `limits channel plugin startup fanout to four` expected `maxActive == 4` under the old `CHANNEL_STARTUP_CONCURRENCY = 4` contract. With this PR's serialization (concurrency 1, 3s handoff stagger), only one preflight runs at a time, so the tests are renamed and rewritten to walk through the staggered handoffs and verify single-active behavior plus per-account stagger calls. Also adds the user-facing CHANGELOG entry the bot requested.

Addresses ClawSweeper P2 + P3 review findings on PR openclaw#78437.
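A minimal sketch of the rewritten tests' shape (the names below are hypothetical stand-ins; the real suite lives in `server-channels.test.ts`): instead of asserting `maxActive == 4`, the test injects a recording fake sleep and asserts that boots never overlap and that one stagger call is made per account.

```typescript
// Hypothetical stand-ins for the real test harness in server-channels.test.ts.
let active = 0;
let maxActive = 0;
const staggerCalls: number[] = [];

// Recording fake sleep: resolves immediately, so the test needs no real timers.
const fakeSleep = async (ms: number): Promise<void> => {
  staggerCalls.push(ms);
};

async function bootAccount(_id: string): Promise<void> {
  active += 1;
  maxActive = Math.max(maxActive, active);
  await Promise.resolve(); // stand-in for real async boot work
  active -= 1;
}

// Serial startup under the new contract: boot fully, then stagger, per account.
async function startChannelsUnderTest(accountIds: string[]): Promise<void> {
  for (const id of accountIds) {
    await bootAccount(id);
    await fakeSleep(3_000);
  }
}

await startChannelsUnderTest(["a", "b", "c", "d", "e"]);
// Under the old contract the suite asserted maxActive == 4; the rewritten
// tests expect maxActive == 1 and one 3000ms stagger call per account.
```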
Summary
Channel startup uses `runTasksWithConcurrency` with `limit: 4`, but each task's inner async function returns as soon as the provider promise is handed off (without awaiting it), so the concurrency limiter effectively serializes microsecond-scale handoffs while all providers boot in parallel. Combined with `schedulePrimaryModelPrewarm` firing concurrently with `startChannels()`, simultaneous `getMe` calls and provider-runtime imports starve the Node event loop.

On hosts with N>1 Telegram accounts, this manifests as `getUpdates` polling stalls lasting several minutes before the watchdog forces restart cycles.

Changes
- `CHANNEL_STARTUP_CONCURRENCY`: 4 → 1
- Add `CHANNEL_STARTUP_STAGGER_MS = 3_000` with `await sleepWithAbort()` between task completions, so the limiter actually spaces account boots out
- Move `schedulePrimaryModelPrewarm()` to run AFTER `startChannels()` awaits, so prewarm's provider-runtime imports don't compete for CPU with channel handshakes
- Update `server-channels.test.ts` fanout assertions to match the new contract (concurrency 1, per-account stagger calls verified)

### Fixes

Real behavior proof
Behavior or issue addressed: multi-Telegram-account gateway startup wedge — polling stalls and `getMe` cascades during simultaneous bot startup, observed on Windows 11 + Node 24. Subset of the chronic #73323 pattern (the startup-cascade slice of it).

Real environment tested: Windows 11 Home 26200, Node 24.14.0, glasswings-lang/openclaw fork at branch `local/all-fixes` (this PR's commits merged in). Five Telegram bot accounts: @EthelredBot, @MariaClone_bot, @runningkittenBot, @Rozaya1Bot, @HavenismBot. Single-host setup with the user's normal `~/.openclaw/openclaw.json`.

Exact steps or command run after this patch: stopped the running gateway process and the registered Windows scheduled-task instance, removed the stale lock at `%TEMP%\openclaw\gateway.<hash>.lock`, then restarted with stdout redirected to a file.

After-fix evidence: copied live output from the gateway log on the affected host, ANSI stripped.
Observed result after fix: bot startups landed at 12:32:14, 12:32:17, 12:32:20, 12:32:23, 12:32:26 — inter-bot deltas of 2.798s, 2.987s, 3.036s, 2.980s, confirming the 3-second `CHANNEL_STARTUP_STAGGER_MS` spacing. All five accounts booted serially under the new concurrency-1 contract; the gateway reached `ready` 14 seconds after channel start kicked off; no polling stalls were observed during the boot window. Compare against the pre-fix #73323 reports, where the same five-bot setup wedged for 4+ minutes during simultaneous startup. Bot tokens redacted; bot handles shown are public usernames.

What was not tested: the deeper chronic post-startup undici dispatcher state degradation from #73323, which affects long-running gateway processes regardless of these mitigations and likely needs Node profiling to root-cause. This PR addresses the startup-cascade subset only.
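The serial-plus-stagger contract those timestamps reflect can be sketched as follows. This is a hedged reconstruction, not openclaw's actual implementation: `CHANNEL_STARTUP_STAGGER_MS` and `sleepWithAbort` are names taken from the PR body, while `Account`, `bootAccount`, and the parameterization are illustrative.

```typescript
// Illustrative types; the real account type lives in the gateway codebase.
type Account = { id: string };

const CHANNEL_STARTUP_CONCURRENCY = 1; // was 4; 1 collapses the limiter to a serial loop
const CHANNEL_STARTUP_STAGGER_MS = 3_000;

// Abortable sleep so a gateway shutdown isn't blocked by a pending stagger.
function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(resolve, ms);
    signal?.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        reject(new Error("startup aborted"));
      },
      { once: true },
    );
  });
}

async function startChannels(
  accounts: Account[],
  bootAccount: (a: Account) => Promise<void>,
  signal?: AbortSignal,
  staggerMs: number = CHANNEL_STARTUP_STAGGER_MS,
): Promise<void> {
  for (const account of accounts) {
    await bootAccount(account); // await the *full* boot, not just the promise handoff
    await sleepWithAbort(staggerMs, signal); // space the next boot out
  }
}
```

With this shape, `schedulePrimaryModelPrewarm()` would be invoked only after `await startChannels(...)` resolves, so prewarm's provider-runtime imports no longer compete with channel handshakes.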
AI-assisted PR
Authored with Claude Code (Anthropic) in collaboration with the human contributor. The contributor verified the fix on their own multi-Telegram Windows setup (proof above), reviewed the diff, and confirms understanding of the concurrency/stagger semantics.
Test plan

- `pnpm build`
- `server-channels.test.ts` assertions reflect the concurrency-1 + stagger contract

🤖 Generated with Claude Code