fix(gateway): stagger Telegram channel startup to prevent polling stalls by glasswings-lang · Pull Request #78437 · openclaw/openclaw

glasswings-lang · 2026-05-06T11:18:45Z

Summary

Channel startup uses runTasksWithConcurrency with limit: 4, but each task's inner async function returns as soon as the provider promise is handed off (without awaiting it), so the concurrency limiter effectively serializes microsecond-scale handoffs while all providers boot in parallel. Combined with schedulePrimaryModelPrewarm firing concurrently with startChannels(), simultaneous getMe calls and provider-runtime imports starve the Node event loop.

On hosts with more than one Telegram account, this shows up as getUpdates polling that stalls for several minutes until the watchdog forces restart cycles.
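The handoff problem described in the summary can be reproduced in isolation. The sketch below is illustrative only: runWithLimit is a hypothetical stand-in for runTasksWithConcurrency (whose real signature is not shown in this PR), and bootProvider simulates a provider boot with a 20 ms timer. It measures how many simulated boots are in flight at once when the task awaits the boot versus when it only hands the promise off.

```typescript
// Hypothetical stand-in for runTasksWithConcurrency; the real helper's
// signature is not shown in this PR.
async function runWithLimit(
  tasks: Array<() => Promise<void>>,
  limit: number,
): Promise<void> {
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const task = tasks[next++];
      await task();
    }
  };
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    () => worker(),
  );
  await Promise.all(workers);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Returns the highest number of simulated provider boots in flight at once.
async function measurePeakBoots(
  awaitBoot: boolean,
  accounts = 5,
  limit = 4,
): Promise<number> {
  let active = 0;
  let peak = 0;
  const tracked: Promise<void>[] = [];

  // Simulated boot: stands in for getMe plus provider-runtime import.
  const bootProvider = async (): Promise<void> => {
    active += 1;
    peak = Math.max(peak, active);
    await sleep(20);
    active -= 1;
  };

  const tasks = Array.from(
    { length: accounts },
    () => async (): Promise<void> => {
      const running = bootProvider();
      tracked.push(running); // hand the promise off for tracking
      if (awaitBoot) await running; // without this, the task returns at once
    },
  );

  await runWithLimit(tasks, limit);
  await Promise.all(tracked); // let the simulated boots finish
  return peak;
}
```

With five accounts and a limit of 4, measurePeakBoots(false) resolves to 5 (every boot is in flight at once, so the limit has no effect), while measurePeakBoots(true) resolves to 4.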

Changes

CHANNEL_STARTUP_CONCURRENCY: 4 → 1
New CHANNEL_STARTUP_STAGGER_MS = 3_000 with await sleepWithAbort() between task completions, so the limiter actually spaces account boots out
Move schedulePrimaryModelPrewarm() to run AFTER startChannels() awaits, so prewarm's provider-runtime imports don't compete for CPU with channel handshakes
Update server-channels.test.ts fanout assertions to match the new contract (concurrency 1, per-account stagger calls verified)
Add user-facing CHANGELOG entry under ### Fixes
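A minimal sketch of the first three changes, under stated assumptions: the sleepWithAbort body here is a hypothetical reimplementation (the PR's helper may differ), and startChannelsStaggered, startGateway, handOff, and prewarm are illustrative names that are not taken from the diff. The sketch sleeps after every handoff, as the review describes; whether the real patch skips the final sleep is not shown in this page.

```typescript
const CHANNEL_STARTUP_STAGGER_MS = 3_000;

// Hypothetical abortable sleep; rejects if the signal fires before the timer.
function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      clearTimeout(timer);
      reject(new Error("aborted"));
    }
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Concurrency 1: hand off one account at a time, then wait out the stagger.
async function startChannelsStaggered(
  accounts: string[],
  handOff: (account: string) => void,
  staggerMs: number = CHANNEL_STARTUP_STAGGER_MS,
  signal?: AbortSignal,
): Promise<void> {
  for (const account of accounts) {
    handOff(account); // provider promise is tracked elsewhere, not awaited
    await sleepWithAbort(staggerMs, signal); // the sleep provides the spacing
  }
}

// Prewarm is scheduled only after the channel handoffs have completed.
async function startGateway(
  accounts: string[],
  handOff: (account: string) => void,
  prewarm: () => void,
  staggerMs: number = CHANNEL_STARTUP_STAGGER_MS,
  signal?: AbortSignal,
): Promise<void> {
  await startChannelsStaggered(accounts, handOff, staggerMs, signal);
  prewarm();
}
```

Because the handoff itself still returns immediately, the spacing between account boots in this sketch comes entirely from the awaited sleep, which is also why the sleep has to be abortable on shutdown.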

Real behavior proof

Behavior or issue addressed: multi-Telegram-account gateway startup wedge — polling stalls and getMe cascades during simultaneous bot startup, observed on Windows 11 + Node 24. Subset of the chronic #73323 pattern (the startup-cascade slice of it).

Real environment tested: Windows 11 Home 26200, Node 24.14.0, glasswings-lang/openclaw fork at branch local/all-fixes (this PR's commits merged in). Five Telegram bot accounts: @EthelredBot, @MariaClone_bot, @runningkittenBot, @Rozaya1Bot, @HavenismBot. Single-host setup with the user's normal ~/.openclaw/openclaw.json.

Exact steps or command run after this patch: stopped the running gateway process and the registered Windows scheduled-task instance, removed the stale lock at %TEMP%\openclaw\gateway.<hash>.lock, then restarted with stdout redirected to a file:

cd "C:/git-src/openclaw"
export OPENCLAW_TELEGRAM_FORCE_IPV4=1
node dist/index.js gateway --port 18789 > gateway-startup.log 2>&1

After-fix evidence: copied live output from the gateway log on the affected host, ANSI stripped:

2026-05-06T12:32:12.314-08:00 [gateway] http server listening (2 plugins: memory-core, telegram; 9.4s)
2026-05-06T12:32:12.837-08:00 [gateway] starting channels and sidecars...
2026-05-06T12:32:14.280-08:00 [telegram] [aethelred] starting provider (@EthelredBot)
2026-05-06T12:32:17.078-08:00 [telegram] [maria]     starting provider (@MariaClone_bot)
2026-05-06T12:32:20.065-08:00 [telegram] [pet]       starting provider (@runningkittenBot)
2026-05-06T12:32:23.101-08:00 [telegram] [rozaya]    starting provider (@Rozaya1Bot)
2026-05-06T12:32:26.081-08:00 [telegram] [tensor]    starting provider (@HavenismBot)
2026-05-06T12:32:28.443-08:00 [gateway] ready

Observed result after fix: bot startups landed at 12:32:14, 12:32:17, 12:32:20, 12:32:23, 12:32:26, with inter-bot deltas of 2.798s, 2.987s, 3.036s, 2.980s, consistent with the 3-second CHANNEL_STARTUP_STAGGER_MS spacing. All five accounts booted serially under the new concurrency-1 contract; the gateway reached ready about 14 seconds after the first provider start (15.6 seconds after the "starting channels and sidecars" line); no polling stalls were observed during the boot window. Compare against the pre-fix #73323 reports where the same five-bot setup wedged for 4+ minutes during simultaneous startup. Bot tokens redacted; bot handles shown are public usernames.
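The deltas can be rechecked directly from the log timestamps. The snippet below is a small verification sketch; the timestamps are copied from the log excerpt, and deltasMs is an illustrative helper name, not something from the PR.

```typescript
// Timestamps copied from the "starting provider" lines in the log excerpt.
const providerStarts: string[] = [
  "2026-05-06T12:32:14.280-08:00",
  "2026-05-06T12:32:17.078-08:00",
  "2026-05-06T12:32:20.065-08:00",
  "2026-05-06T12:32:23.101-08:00",
  "2026-05-06T12:32:26.081-08:00",
];
const channelsStarting = "2026-05-06T12:32:12.837-08:00";
const gatewayReady = "2026-05-06T12:32:28.443-08:00";

// Differences between consecutive timestamps, in milliseconds.
function deltasMs(timestamps: string[]): number[] {
  const ms = timestamps.map((t) => Date.parse(t));
  return ms.slice(1).map((t, i) => t - ms[i]);
}

const staggerDeltas = deltasMs(providerStarts);
const readyAfterFirstStart = Date.parse(gatewayReady) - Date.parse(providerStarts[0]);
const readyAfterChannelsLine = Date.parse(gatewayReady) - Date.parse(channelsStarting);

console.log(staggerDeltas); // [2798, 2987, 3036, 2980]
console.log(readyAfterFirstStart); // 14163
console.log(readyAfterChannelsLine); // 15606
```

The time to ready depends on the reference line: roughly 14.2 s from the first provider start, and roughly 15.6 s from the "starting channels and sidecars" line.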

What was not tested: the deeper chronic post-startup undici dispatcher state degradation from #73323 — that affects long-running gateway processes regardless of these mitigations and likely needs Node profiling to root-cause. This PR addresses the startup-cascade subset only.

AI-assisted PR

Authored with Claude Code (Anthropic) in collaboration with the human contributor. The contributor verified the fix on their own multi-Telegram Windows setup (proof above), reviewed the diff, and confirms understanding of the concurrency/stagger semantics.

Test plan

Build passes locally (pnpm build)
Updated server-channels.test.ts assertions reflect concurrency-1 + stagger contract
Manual: 5-bot Windows 11 + Node 24 reproduction shows clean staggered startup with no polling stalls during the boot window (log excerpt above)
Maintainer review of stagger constant choice (3s default — happy to make it configurable)
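If maintainers prefer a configurable stagger, one possible shape is a small resolver that falls back to the 3-second default on missing or invalid input. This is a sketch only: resolveStaggerMs is a hypothetical name, and where the raw value would come from (config file, environment variable, CLI flag) is deliberately left open because the PR does not define one.

```typescript
const DEFAULT_CHANNEL_STARTUP_STAGGER_MS = 3_000;

// Hypothetical resolver: accepts a raw config value and returns a
// non-negative whole number of milliseconds, or the fallback if invalid.
function resolveStaggerMs(
  raw: string | number | undefined,
  fallback: number = DEFAULT_CHANNEL_STARTUP_STAGGER_MS,
): number {
  if (raw === undefined) return fallback;
  if (typeof raw === "string" && raw.trim() === "") return fallback;
  const value = typeof raw === "number" ? raw : Number(raw);
  if (!Number.isFinite(value) || value < 0) return fallback;
  return Math.floor(value);
}

console.log(resolveStaggerMs(undefined)); // 3000
console.log(resolveStaggerMs("1500")); // 1500
console.log(resolveStaggerMs("0")); // 0 (stagger disabled)
console.log(resolveStaggerMs("abc")); // 3000
```

Treating 0 as a valid value would give single-account hosts a way to opt out of the delay entirely.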

🤖 Generated with Claude Code

Channel startup uses runTasksWithConcurrency with limit=4, but each task's inner async function returns as soon as the provider promise is handed off (without awaiting it), so the concurrency limiter effectively serializes microsecond-scale handoffs while all providers boot in parallel. Combined with schedulePrimaryModelPrewarm firing concurrently with startChannels(), simultaneous getMe calls and provider-runtime imports starve the Node event loop. Telegram getUpdates stalls for several minutes before the watchdog forces restart cycles.

- CHANNEL_STARTUP_CONCURRENCY: 4 -> 1
- Add CHANNEL_STARTUP_STAGGER_MS=3000 with sleep between task completions, so the limiter actually spaces account boots out
- Move schedulePrimaryModelPrewarm() to run AFTER startChannels() awaits

Repro: Windows 11, Node 24, 5 Telegram accounts.
Before: 4+ min polling stalls, ~2.5 min reply latency.
After: 5 accounts boot ~15s spaced 3s apart, no stalls observed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clawsweeper · 2026-05-06T11:21:52Z

Codex review: needs changes before merge.

Summary
The PR serializes Gateway channel/account startup with a 3-second stagger, moves primary model prewarm after channel startup, updates gateway fanout tests, and adds a changelog entry.

Reproducibility: unclear for a live reproduction of the full Windows polling-stall symptom; I did not reproduce that environment. Source inspection does show current main fans out startup handoffs, and the PR body supplies after-fix Windows logs for the staggered startup slice.

Real behavior proof
Sufficient (logs): The PR body includes copied live Windows gateway logs showing the after-fix staggered multi-Telegram startup and readiness result.

Next step before merge
A narrow automated repair can address the fake-timer startup hang, while final merge still needs maintainer policy review for the global stagger.

Security
Cleared: The diff changes Gateway startup timing/order plus tests and changelog only, with no dependency, workflow, secret-handling, package-resolution, or supply-chain surface.

Review findings

[P2] Make the startup stagger timer-safe — src/gateway/server-channels.ts:624-627

Review details

Best possible solution:

Make the stagger timer-safe and narrowly validated, then have Gateway maintainers decide whether a fixed global startup delay is the right shared policy.

Do we have a high-confidence way to reproduce the issue?

Unclear for a live reproduction of the full Windows polling-stall symptom; I did not reproduce that environment. Source inspection does show current main fans out startup handoffs, and the PR body supplies after-fix Windows logs for the staggered startup slice.

Is this the best way to solve the issue?

No, not yet. The approach is plausible, but the implementation needs a timer-safe seam or broader test updates and maintainer agreement on the global 3-second Gateway startup policy.

Full review comments:

[P2] Make the startup stagger timer-safe — src/gateway/server-channels.ts:624-627
Adding an awaited sleepWithAbort(3_000, ...) after every handoff means fake-timer tests that await manager.startChannels() before advancing timers can hang. This suite enables fake timers globally, and many unchanged tests still await startup directly, so the patch needs a bypass/test seam or broader test updates beyond the fanout cases.
Confidence: 0.94
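One way to provide the bypass/test seam this finding asks for is to inject the sleep function, so tests that await startup directly can pass a fake that resolves immediately. The sketch below is an assumption about shape, not the PR's code: startChannelsWithSeam, StartOptions, and the sleep option are hypothetical names, and the default sleep ignores abort signals for brevity.

```typescript
type Sleep = (ms: number) => Promise<void>;

// Default sleep backed by a real timer (abort handling omitted for brevity).
const timerSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

interface StartOptions {
  staggerMs?: number; // 0 disables the stagger entirely
  sleep?: Sleep; // test seam: inject a fake to avoid real or fake timers
}

// Hypothetical startup loop with an injectable stagger.
async function startChannelsWithSeam(
  accounts: string[],
  handOff: (account: string) => void,
  options: StartOptions = {},
): Promise<void> {
  const staggerMs = options.staggerMs ?? 3_000;
  const sleep = options.sleep ?? timerSleep;
  for (const account of accounts) {
    handOff(account);
    if (staggerMs > 0) await sleep(staggerMs);
  }
}

// Example test-side usage: record the requested delays and resolve at once.
async function exampleTest(): Promise<number[]> {
  const requested: number[] = [];
  const fakeSleep: Sleep = async (ms) => {
    requested.push(ms);
  };
  await startChannelsWithSeam(["a", "b"], () => {}, { sleep: fakeSleep });
  return requested; // [3000, 3000]
}
```

With a seam like this, unchanged tests that await startup before advancing timers would not block on the stagger, while the fanout tests can still assert that a 3-second delay was requested per account.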

Overall correctness: patch is incorrect
Overall confidence: 0.91

Acceptance criteria:

pnpm test src/gateway/server-channels.test.ts
pnpm test src/gateway/server-startup-post-attach.test.ts
pnpm build

What I checked:

Current main startup fanout: Current main sets CHANNEL_STARTUP_CONCURRENCY = 4, so channel/account startup can fan out before provider handoff completes. (src/gateway/server-channels.ts:40, 97b07eaeaf38)
Current main provider handoff: The current task stores the tracked provider promise and returns to the concurrency runner without awaiting provider lifetime, matching the PR's described source behavior. (src/gateway/server-channels.ts:619, 97b07eaeaf38)
PR adds awaited stagger: The PR adds await sleepWithAbort(CHANNEL_STARTUP_STAGGER_MS, abort.signal) after each account handoff in the shared startup path. (src/gateway/server-channels.ts:624, 7b2249cec3ee)
Fake timer suite exposure: server-channels.test.ts enables fake timers in beforeEach, and many unchanged tests still await manager.startChannels() before advancing timers. (src/gateway/server-channels.test.ts:183, 97b07eaeaf38)
Existing hang candidate: The first unchanged test awaits manager.startChannels() before vi.advanceTimersByTimeAsync, which would wait on the new 3-second fake-timer sleep. (src/gateway/server-channels.test.ts:204, 97b07eaeaf38)
Real behavior proof supplied: The PR body includes copied Windows gateway log output showing five Telegram provider startups at roughly 3-second intervals and the gateway reaching ready after the staggered boot window. (7b2249cec3ee)

Likely related people:

steipete: Local blame on the current startup fanout, handoff, prewarm ordering, and fake-timer test setup points to Peter Steinberger, and the prior ClawSweeper review also tied recent gateway startup work to this area. (role: recent maintainer and startup-path owner; confidence: high; commits: 167e43345acc, bf2711b40ef7, 11f0244cf431; files: src/gateway/server-channels.ts, src/gateway/server-channels.test.ts, src/gateway/server-startup-post-attach.ts)
kevinslin: The previous ClawSweeper review identified recent channel restart lifecycle work by Kevin Lin in the same server-channels surface; local shallow history was insufficient to independently expand that trail. (role: recent adjacent lifecycle maintainer; confidence: medium; commits: 5a8ccb6fe0ef; files: src/gateway/server-channels.ts, src/gateway/server-channels.test.ts)

Remaining risk / open question:

I did not run the targeted Vitest suite in this read-only review, so the timer failure is source-proven rather than test-executed.
The related Windows/Node runtime degradation in [Bug]: Gateway runtime degradation: pricing fetch 60s timeouts, Telegram polling stalls, slow RPC — chronic across 4.23/4.25/4.26 on Windows 11 + Node 24 #73323 remains broader than this PR's startup-cascade fix.
The global 3-second delay affects all channel/plugin startup, not only multi-account Telegram hosts.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 97b07eaeaf38.

…ger contract

The previous assertions in `limits whole-channel account startup fanout to four` and `limits channel plugin startup fanout to four` expected `maxActive == 4` under the old `CHANNEL_STARTUP_CONCURRENCY = 4` contract. With this PR's serialization (concurrency 1, 3s handoff stagger), only one preflight runs at a time, so the tests are renamed and rewritten to walk through the staggered handoffs and verify single-active behavior plus per-account stagger calls.

Also adds the user-facing CHANGELOG entry the bot requested.

Addresses ClawSweeper P2 + P3 review findings on PR openclaw#78437.

openclaw-barnacle Bot added the labels gateway (Gateway runtime), size: XS, and triage: needs-real-behavior-proof (Candidate: external PR needs after-fix proof from a real setup.) on May 6, 2026

openclaw-barnacle Bot added the size: S label and removed the size: XS label on May 6, 2026

This was referenced May 6, 2026

fix(telegram): add OPENCLAW_TELEGRAM_FORCE_IPV4 opt-in env var #78438 (Open)

perf(models): memoize buildModelsProviderData with 60s TTL #78454 (Open)

openclaw-barnacle Bot added the proof: supplied label (External PR includes structured after-fix real behavior proof.) and removed the triage: needs-real-behavior-proof label (Candidate: external PR needs after-fix proof from a real setup.) on May 6, 2026

clawsweeper Bot added the proof: sufficient label (ClawSweeper judged the real behavior proof convincing.) on May 6, 2026

glasswings-lang wants to merge 2 commits into openclaw:main from glasswings-lang:fix/sequential-channel-startup
