fix: reduce WeChat channel cold-start delay from ~2min to ~5s#646

Merged
lefarcen merged 7 commits into main from claude/fix-wechat-startup-delay-GtnGL
Mar 30, 2026

Conversation

@lefarcen
Collaborator

What

Reduce WeChat channel first-response latency from ~2 minutes to ~5 seconds after connecting or restarting.

Why

After connecting a WeChat channel for the first time (or reconnecting after a desktop restart), the first message takes up to 2 minutes to get a response. Users perceive the bot as broken. The delay compounds from three independent causes:

  1. Full gateway restart (~20-45s): Unlike Feishu (which has a prewarm account), WeChat had no config subtree until first connect → OpenClaw couldn't match it to any loaded plugin's reload prefixes → full process restart instead of hot-reload.
  2. 35s initial long-poll timeout: After extension starts, the first getUpdates() used the full 35s timeout, delaying message pickup.
  3. No readiness wait: connectWechat() returned immediately after syncAll(), so the UI had no way to know the channel was actually ready.

Closes #610

How

Three compounding fixes:

  1. WeChat prewarm account (channel-binding-compiler.ts): Always include openclaw-weixin in the compiled config with a disabled prewarm account (same pattern as Feishu at L140-149). When the user later connects a real WeChat account, it's an account-level change that triggers a fast hot-reload (~500ms) instead of a full gateway restart (~20-45s).

  2. Short initial poll timeout (monitor.ts): First 3 polls use a 3s timeout instead of 35s. This quickly picks up any messages queued during startup. After the initial phase, switches to the normal 35s long-poll. Server-suggested timeout overrides both.

  3. Readiness wait (channel-service.ts): Added waitForWechatReady() (same pattern as existing waitForWhatsappReady()) that polls channel readiness for up to 30s after connect. The UI now shows accurate status.
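The timeout selection in step 2 can be sketched as a small pure function. This is a minimal sketch, not the code in monitor.ts: `pickPollTimeoutMs`, `PollState`, and the constant names are hypothetical, and the values (3 polls, 3s, 35s) are taken from the description above. It assumes "server-suggested timeout overrides both" means a server value wins whenever one is present.

```typescript
// Hypothetical constants; values come from the PR description.
const INITIAL_POLL_TIMEOUT_MS = 3_000;
const DEFAULT_LONG_POLL_TIMEOUT_MS = 35_000;

interface PollState {
  /** Short polls still to run before switching to the normal long-poll. */
  initialPollsRemaining: number;
  /** Timeout suggested by the server, if it sent one (assumed field). */
  serverTimeoutMs?: number;
}

function pickPollTimeoutMs(state: PollState): number {
  // Assumption: a server-suggested value takes priority in both phases.
  if (state.serverTimeoutMs !== undefined) return state.serverTimeoutMs;
  return state.initialPollsRemaining > 0
    ? INITIAL_POLL_TIMEOUT_MS
    : DEFAULT_LONG_POLL_TIMEOUT_MS;
}
```

Keeping the choice in one place makes it easy to see that the short phase ends only when the counter reaches zero, not after a fixed wall-clock interval.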

Affected areas

  • Controller (backend / API)
  • OpenClaw runtime

Checklist

  • pnpm typecheck passes
  • pnpm lint passes
  • pnpm test passes (519 passed)
  • pnpm generate-types run (if API routes/schemas changed) — N/A, no API route changes
  • No credentials or tokens in code or logs
  • No any types introduced (use unknown with narrowing)

Notes for reviewers

  • The prewarm pattern is identical to the existing Feishu prewarm at channel-binding-compiler.ts:140-149. The disabled prewarm account (enabled: false) is ignored by OpenClaw's channel manager but keeps the plugin loaded.
  • The WeChat plugin's startAccount() already gracefully handles unconfigured accounts (throws before starting the monitor), so the prewarm account won't cause any side effects.
  • The initial short-poll phase (3s × 3) adds minimal overhead (~9s of short polls) but dramatically reduces first-message latency.
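For readers unfamiliar with the prewarm pattern, here is a rough sketch of what "always include a disabled prewarm account" could look like. The config shape (`accounts` keyed by ID with an `enabled` flag) and the function `compileWeixinConfig` are assumptions for illustration; only the sentinel ID, the constant names, and the `enabled: false` flag come from this PR.

```typescript
// Names from this PR; the prewarm ID is derived from the shared prefix.
const NEXU_INTERNAL_ACCOUNT_PREFIX = "__nexu_internal_";
const INTERNAL_WECHAT_PREWARM_ACCOUNT_ID = `${NEXU_INTERNAL_ACCOUNT_PREFIX}wechat_prewarm__`;

// Hypothetical config shape, not the real OpenClaw schema.
interface AccountConfig {
  enabled: boolean;
}
interface WeixinChannelConfig {
  accounts: Record<string, AccountConfig>;
}

function compileWeixinConfig(realAccountIds: readonly string[]): WeixinChannelConfig {
  // The disabled prewarm entry is always present, so the channel subtree
  // exists from first boot and later connects are account-level changes.
  const accounts: Record<string, AccountConfig> = {
    [INTERNAL_WECHAT_PREWARM_ACCOUNT_ID]: { enabled: false },
  };
  for (const id of realAccountIds) {
    accounts[id] = { enabled: true };
  }
  return { accounts };
}
```

With no real accounts, the compiled config still contains the one disabled entry; connecting an account adds a sibling key rather than creating the subtree.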

https://claude.ai/code/session_01CSC1RKaRB9F7C4t3HQGpSs

Three changes that compound to eliminate the cold-start latency:

1. Add WeChat prewarm account (like Feishu) so openclaw-weixin is always
   in the config from first boot. When the user later connects WeChat,
   it's an account-level change → hot-reload (~500ms) instead of full
   gateway restart (~20-45s).

2. Use shorter initial poll timeout (3s × 3 polls) in the WeChat monitor
   before switching to the normal 35s long-poll. Picks up messages
   queued during startup within seconds instead of up to 35s.

3. Add waitForWechatReady() after connect (like WhatsApp) so the UI
   shows accurate status and users don't message a not-yet-started
   channel.

Closes #610

https://claude.ai/code/session_01CSC1RKaRB9F7C4t3HQGpSs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e450cca18

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread apps/controller/src/lib/channel-binding-compiler.ts
claude and others added 5 commits March 29, 2026 06:31
Change normalTimeoutMs from const to let so it stays in sync when the
server returns a longpolling_timeout_ms value. Prevents a theoretical
regression where the initial-poll phase end would fall back to the
original default instead of the server-suggested value.

https://claude.ai/code/session_01CSC1RKaRB9F7C4t3HQGpSs

When disconnecting a WeChat channel, remove the account's credential
file, sync state, and accounts.json index entry from the OpenClaw state
directory. Without this cleanup, old accounts accumulate across
disconnect/reconnect cycles and all start on the next cold boot,
wasting resources and causing session-expired errors.

Cover prewarm config compilation (4 tests), connect readiness polling
(2 tests), disconnect state cleanup (5 tests) including multi-cycle
accumulation regression.

Three fixes based on code review:

1. syncWeixinAccountIndex is now authoritative: only keeps account IDs
   present in the current config, and filters out __nexu_internal_*
   prewarm IDs. Prevents ghost accounts in accounts.json.

2. disconnectChannel cleanup runs BEFORE syncAll so the config writer
   never sees stale credential files during index sync.

3. connectWechat now rolls back (disconnect + cleanup) when readiness
   times out after 30s, matching the WhatsApp pattern. Previously it
   returned success even when the channel wasn't actually ready.
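
Fix 1 above ("authoritative" index sync) can be illustrated with a pure function. This is a sketch under assumptions: `buildAccountIndex` is a hypothetical name, the real syncWeixinAccountIndex also reads and writes accounts.json, and the prefix is shown as the shared constant that a later commit in this PR extracts.

```typescript
// Shared prefix for internal (non-user) account IDs, as named in this PR.
const NEXU_INTERNAL_ACCOUNT_PREFIX = "__nexu_internal_";

// Hypothetical helper: derive the index purely from the current config.
// Whatever the previous index contained is deliberately not an input,
// so stale ("ghost") IDs cannot survive a sync.
function buildAccountIndex(configuredIds: readonly string[]): string[] {
  return configuredIds.filter(
    (id, i) =>
      !id.startsWith(NEXU_INTERNAL_ACCOUNT_PREFIX) && // drop prewarm and other internal IDs
      configuredIds.indexOf(id) === i, // drop duplicates, keep first occurrence
  );
}
```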

@JiwaniZakir JiwaniZakir left a comment

The prewarm sentinel constant INTERNAL_WECHAT_PREWARM_ACCOUNT_ID is defined in channel-binding-compiler.ts but never imported by openclaw-config-writer.ts, which instead relies on the magic string prefix "__nexu_internal_" to filter it out. This implicit naming convention is fragile — a future internal account whose ID doesn't start with that prefix would silently leak into the persisted accounts.json index. Exporting a shared NEXU_INTERNAL_ACCOUNT_PREFIX constant (or an explicit set of reserved IDs) from a shared module would make this coupling explicit and type-safe.

Additionally, in the rollback path inside connectWechat, cleanupWechatAccountState directly mutates accounts.json on disk and then syncAll() is called immediately after. syncWeixinAccountIndex will rewrite the same file based on the authoritative config, making the manual index mutation in cleanupWechatAccountState redundant for the rollback case specifically. Clarifying which path is responsible for index cleanup — the direct file manipulation or the authoritative config writer — would reduce the risk of them diverging in future edits.

Address community review feedback:

1. Extract NEXU_INTERNAL_ACCOUNT_PREFIX as a shared exported constant
   from channel-binding-compiler.ts. The config writer now imports it
   instead of relying on a magic string prefix.

2. cleanupWechatAccountState only deletes credential/sync files — index
   reconciliation is exclusively owned by the authoritative config
   writer during syncAll(). This eliminates the dual-write ambiguity.
@lefarcen
Collaborator Author

@JiwaniZakir Thanks for the thorough review — both points are spot-on.

On the magic prefix: Extracted NEXU_INTERNAL_ACCOUNT_PREFIX as a shared exported constant from channel-binding-compiler.ts. The config writer now imports it directly instead of relying on a hardcoded string. The prewarm account IDs are also derived from this prefix via template literals, so any future internal accounts will be automatically filtered.

On the dual index mutation: Agreed — the rollback path's cleanupWechatAccountState no longer touches accounts.json. Index reconciliation is now exclusively owned by the authoritative config writer during syncAll(). The cleanup method only removes credential and sync files (which is appropriate since a failed connect produces useless credentials). This makes the ownership boundary clear: files → direct cleanup, index → writer.

Both changes are in 7f4f126. Thanks for catching these!

@JiwaniZakir

The three-pronged decomposition here is solid — each fix is independently correct and the combined effect makes sense. One thing worth confirming: the prewarm account approach for WeChat mirrors the Feishu pattern, so make sure the config subtree key used during prewarm doesn't collide with a real user account prefix if someone happens to register a WeChat ID matching it. Also, dropping the initial getUpdates() timeout to something short (< 5s) is the right call, but verify the fallback behavior if the first poll returns empty — you don't want a tight retry loop burning CPU before the channel is fully authenticated.

@lefarcen
Collaborator Author

@JiwaniZakir Good points to double-check — both are covered:

  1. Prewarm ID collision: The prewarm sentinel is __nexu_internal_wechat_prewarm__ (prefixed with __nexu_internal_), while real account IDs are cuid2-generated (e.g. a4946e575b9e-im-bot). No realistic collision risk, and the config writer now explicitly filters by the shared NEXU_INTERNAL_ACCOUNT_PREFIX constant as an extra guard.

  2. Short-poll empty response: The monitor has a RETRY_DELAY_MS = 2_000 sleep between each poll iteration regardless of response content, and initialPollsRemaining only decrements on successful (non-error) responses. So an empty first poll just waits 2s and retries — no tight loop.
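
The loop behaviour described in point 2 can be sketched as follows. This is not the monitor's actual code: `runPolls`, `PollResult`, and the injected `getUpdates`/`sleep` dependencies are hypothetical, and only `RETRY_DELAY_MS`, `initialPollsRemaining`, and the 3s/35s values come from this thread.

```typescript
const RETRY_DELAY_MS = 2_000;

// Hypothetical result shape: an error is modelled as ok: false.
type PollResult = { ok: true; messages: string[] } | { ok: false };

interface LoopDeps {
  getUpdates: (timeoutMs: number) => Promise<PollResult>;
  sleep: (ms: number) => Promise<void>;
}

// Runs a fixed number of iterations and returns the timeout used for each.
async function runPolls(deps: LoopDeps, iterations: number): Promise<number[]> {
  let initialPollsRemaining = 3;
  const timeoutsUsed: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const timeoutMs = initialPollsRemaining > 0 ? 3_000 : 35_000;
    timeoutsUsed.push(timeoutMs);
    const result = await deps.getUpdates(timeoutMs);
    // Only successful responses (empty or not) count toward the initial phase.
    if (result.ok && initialPollsRemaining > 0) initialPollsRemaining--;
    // Sleep on every iteration, regardless of response content.
    await deps.sleep(RETRY_DELAY_MS);
  }
  return timeoutsUsed;
}
```

Because the sleep is unconditional, an empty or failed poll costs at least `RETRY_DELAY_MS` before the next attempt, which is what rules out a tight retry loop.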

Thanks for keeping an eye on these edge cases!

@JiwaniZakir

The NEXU_INTERNAL_ACCOUNT_PREFIX extraction is the right call: relying on a hardcoded string in two separate places for collision avoidance was a latent bug waiting to happen. One thing worth double-checking: the readiness wait added to connectWechat() should have a bounded timeout rather than blocking indefinitely, otherwise a failed prewarm (e.g. network issue during initial sync) would hang the connect flow instead of surfacing an error. If that's already handled upstream in the caller, ignore this.

@lefarcen
Collaborator Author

@JiwaniZakir Already handled — waitForWechatReady() has a 30s deadline (Date.now() + 30_000). If it times out, connectWechat() rolls back the channel and throws, so the caller gets a clear error instead of a hang.
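
A minimal sketch of the bounded wait plus rollback described above. The dependency-injected shape (`ReadyDeps`, `waitForReady`, `connectWithReadiness`) and the 500ms poll interval are assumptions made so the example is testable; only the 30s deadline and the rollback-then-throw behaviour come from this thread.

```typescript
const READY_TIMEOUT_MS = 30_000;

// Hypothetical dependencies; the real service talks to the gateway instead.
interface ReadyDeps {
  isReady: () => Promise<boolean>;
  rollback: () => Promise<void>;
  now: () => number;
  sleep: (ms: number) => Promise<void>;
}

async function waitForReady(deps: ReadyDeps, pollIntervalMs = 500): Promise<boolean> {
  const deadline = deps.now() + READY_TIMEOUT_MS;
  while (deps.now() < deadline) {
    if (await deps.isReady()) return true;
    await deps.sleep(pollIntervalMs);
  }
  return false; // deadline passed without the channel becoming ready
}

async function connectWithReadiness(deps: ReadyDeps): Promise<void> {
  if (await waitForReady(deps)) return;
  // Timed out: undo the connect, then surface an error to the caller.
  await deps.rollback();
  throw new Error("WeChat channel did not become ready within 30s");
}
```

Injecting `now` and `sleep` lets a test advance a fake clock instead of waiting 30 real seconds.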

@JiwaniZakir

on the prewarm account leaking into syncWeixinAccountIndex() — the fix should filter it out at the merge boundary, similar to how internal sentinel values are excluded elsewhere. Adding a guard like if (accountId.startsWith('__nexu_internal_')) continue; in syncWeixinAccountIndex() before the merge would prevent it from persisting without requiring callers to be aware of the internal ID. Alternatively, if the prewarm account needs to stay out of the index entirely, the cleaner approach is to give it a dedicated config subtree that OpenClaw recognizes for hot-reload matching but never passes to the account sync path.

@lefarcen
Collaborator Author

@chatgpt-codex-connector This is already addressed in the latest commits:

  1. syncWeixinAccountIndex() is now authoritative — it only keeps IDs present in the current config, and explicitly filters out any ID starting with NEXU_INTERNAL_ACCOUNT_PREFIX (__nexu_internal_). The prewarm ID never lands in accounts.json.

  2. The writer also runs an orphan file sweep on openclaw-weixin/accounts/, removing credential files for accounts no longer in the authoritative set — so listStoredWeixinAccountIds() can't resurrect them either.
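
The orphan sweep in point 2 amounts to a set difference between what is on disk and what the config says should exist. The sketch below covers only that selection step and leaves deletion to the caller; `findOrphanAccountFiles` is a hypothetical name, and the `<accountId>.json` file naming is an assumption, not something this PR states.

```typescript
// Hypothetical helper: given file names found under openclaw-weixin/accounts/
// and the authoritative account IDs, return the files that should be removed.
// Assumes each credential file is named "<accountId>.json".
function findOrphanAccountFiles(
  storedFileNames: readonly string[],
  authoritativeIds: readonly string[],
): string[] {
  return storedFileNames.filter((name) => {
    const id = name.replace(/\.json$/, "");
    return authoritativeIds.indexOf(id) === -1;
  });
}
```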

See commits f799947 and 7f4f126.

@chatgpt-codex-connector

To use Codex here, create an environment for this repo.

@JiwaniZakir

The bounded timeout on the readiness wait is the critical piece here: without it, a failed prewarm silently blocks connectWechat() indefinitely. Worth confirming the fallback path when the timeout fires actually surfaces an error to the caller rather than swallowing it and returning a false-ready state. Also, dropping the initial getUpdates() timeout from 35s to something shorter is a good fix, but make sure the reduced value is configurable or at least documented, since aggressive polling on reconnect could interact poorly with WeChat's rate limits in high-restart scenarios.

@lefarcen lefarcen merged commit 40605df into main Mar 30, 2026
11 checks passed
@JiwaniZakir

The three-pronged approach makes sense — each fix is independently valid, but together they compound nicely just like the original delays did. One thing worth verifying: the reduced initial long-poll timeout (presumably dropping from 35s to something much shorter) should have a fallback or backoff strategy to avoid hammering the WeChat endpoint if the channel is genuinely idle at startup. Also worth confirming the prewarm account pattern mirrors exactly what Feishu does so the two code paths don't diverge in maintenance burden over time.

Development

Successfully merging this pull request may close these issues.

[Bug] WeChat channel: extremely slow first response after connection (up to 2 minutes)

4 participants