[Bug]: Regression — channel sidecar startup again blocks for ~3 min after ready on v2026.4.25 (recurrence of #63450) #72846

@RayWoo

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

The bug fixed in #63450 / PR #63480 ("Gateway channel sidecar startup blocked by chat.history WS request, ~80–110 s delay since v2026.4.8", closed 2026-04-09) has returned in v2026.4.25 with a longer delay (~180 s instead of ~80–110 s). The original issue thread is locked to comments, so filing this as a separate regression report.

Symptom matches #63450's description nearly verbatim: between the "starting channels and sidecars..." log line and the moment the channels (Browser control, Telegram provider, acpx runtime) actually start, the gateway sits silent for ~3 minutes. CLI WebSocket handshakes time out during the window, and any inbound Telegram message that arrives during it gets queued and replied to ~2 minutes after it lands.

Steps to reproduce

  1. Install v2026.4.25 stable on Linux Debian 11 (Node 24.13.0 via NVM, npm-global install). Eager bundled-plugin postinstall (OPENCLAW_EAGER_BUNDLED_PLUGIN_DEPS=1) completed cleanly — ~/.openclaw/plugin-runtime-deps/openclaw-2026.4.25-<hash>/ is fully populated, so this is not on-demand plugin compilation.
  2. Restart the gateway (systemctl --user restart openclaw-gateway on a user systemd setup).
  3. Watch the gateway log. The ready (N plugins: ...; XX.Xs) line lands in ~10–12 s. The next non-noise log entry is ~3 minutes later.
  4. During the ~3-minute window, run any openclaw CLI command (e.g. openclaw cron list). It hangs, and after 30 s the gateway log shows WARN handshake timeout conn=... peer=127.0.0.1:<ephemeral>.
  5. Send a Telegram message to the bot during the window. The reply is sent ~2 minutes later (specifically, telegram sendMessage ok appears in the gateway log roughly 8 minutes after the restart).

Expected behavior

Per the resolution of #63450, channel sidecars should start within ~3 s of the "starting channels and sidecars..." log line, not ~3 minutes later. PR #63480 fixed the regression that appeared in v2026.4.8; v2026.4.25 has reintroduced an equivalent or worse blocker.

Actual behavior — log evidence (two independent restarts on same machine)

Restart 1 — 2026-04-27 13:09:33Z

13:09:48.391  INFO  ready (10 plugins: acpx, active-memory, bonjour, browser, device-pair,
                    memory-core, memory-wiki, phone-control, talk-voice, telegram; 10.2s)
13:09:48.521  INFO  starting channels and sidecars...
13:09:48.733  INFO  loaded 4 internal hook handlers
                    ↑ ~3 min of silence ↑
13:12:53.xxx  WARN  handshake timeout conn=... peer=127.0.0.1:56754 -> 127.0.0.1:18789
13:12:59.xxx  INFO  Browser control listening on http://127.0.0.1:18791/ (auth=token)
13:12:59.xxx  INFO  [default] starting provider (@raywu07_bot)

Restart 2 — 2026-04-27 13:52:31Z

13:52:48.055  INFO  ready (10 plugins: ...; 11.6s)
13:52:48.190  INFO  starting channels and sidecars...
13:52:48.390  INFO  loaded 4 internal hook handlers
                    ↑ ~3 min of silence ↑
13:55:44.533  WARN  handshake timeout conn=... peer=127.0.0.1:32940 -> 127.0.0.1:18789
13:55:44.553  INFO  embedded acpx runtime backend registered
13:55:44.897  INFO  Browser control listening on http://127.0.0.1:18791/ (auth=token)
13:55:45.823  INFO  [default] starting provider (@raywu07_bot)
14:00:39.221  INFO  telegram sendMessage ok chat=... message=5593
                    ↑ first reply finally sent — 8 min after restart
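
For anyone checking their own logs for the same stall, the silent window can be measured mechanically. This is a minimal sketch that assumes the `HH:MM:SS.mmm  LEVEL  message` layout shown in the excerpts above; the function names are illustrative and not part of OpenClaw. Lines without a full millisecond timestamp (continuation lines, or the redacted `.xxx` entries from Restart 1) are skipped, and a log that crosses midnight is not handled.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpts in this report: "HH:MM:SS.mmm  LEVEL  message".
TS_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+(\w+)\s+(.*)$")


def parse_entries(lines):
    """Return (timestamp, message) pairs for lines that start with a full timestamp."""
    entries = []
    for line in lines:
        match = TS_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            entries.append((ts, match.group(3)))
    return entries


def largest_gap(lines):
    """Return (gap_seconds, message_before, message_after) for the longest silence."""
    entries = parse_entries(lines)
    best = (0.0, None, None)
    for (t0, before), (t1, after) in zip(entries, entries[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, before, after)
    return best


# Restart 2 excerpt, copied from the log evidence above.
RESTART_2 = """\
13:52:48.055  INFO  ready (10 plugins: ...; 11.6s)
13:52:48.190  INFO  starting channels and sidecars...
13:52:48.390  INFO  loaded 4 internal hook handlers
13:55:44.533  WARN  handshake timeout conn=... peer=127.0.0.1:32940 -> 127.0.0.1:18789
13:55:44.553  INFO  embedded acpx runtime backend registered
13:55:44.897  INFO  Browser control listening on http://127.0.0.1:18791/ (auth=token)
13:55:45.823  INFO  [default] starting provider (@raywu07_bot)
""".splitlines()

gap, before, after = largest_gap(RESTART_2)
print(f"{gap:.3f} s of silence between {before!r} and {after!r}")
```

On the Restart 2 excerpt this reports a gap of 176.143 s between the "loaded 4 internal hook handlers" entry and the first handshake timeout.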

Diagnostic notes

  • Telegram channel itself is healthy throughout the window: getMe, getMyCommands, and getWebhookInfo all respond fast (<300 ms) over the Bot API. The bot is reachable; the gateway just hasn't started its provider client yet.
  • HTTP routes work fast: GET /health returns 200 in ~5 ms during the window. Only WebSocket handshakes (/__openclaw__/ws) hang.
  • No plugin error in the ready line — all 10 plugins reported loaded successfully.
  • OPENCLAW_EAGER_BUNDLED_PLUGIN_DEPS=1 does not help — the bundled-plugin install ran at upgrade time and the staged tree was complete by the time the gateway started. Whatever the channel-startup is waiting on, it's not on-disk plugin compilation.
  • Possibly related secondary symptom: the active-memory plugin's pre-reply sub-agent run takes ~45 s despite a configured timeoutMs: 15000. Documented separately in #71127 ("[Bug]: Stuck processing sessions are detected but never aborted — gateway requires external restart to recover"). If both symptoms share the same lock/queue holdup, fixing this regression should fix that secondary case too.

Suspected cause (informed guess)

The v2026.4.25 release notes touch the runtime-context / chat-history path in several places:

  • "Heartbeat, cron, and exec wakeups submitted as transient runtime context (removed from visible transcripts)"
  • "Sessions separate reset freshness from store updatedAt (heartbeat/cron/exec no longer prevent daily/idle resets)"
  • "Embedded runtime context sent as hidden next-turn custom message (not visible user prompt)"
  • "Doctor repairs 2026.4.24 transcripts with duplicated prompt-rewrite branches"

If any of these reintroduced a sync chat.history read on the channel-startup path (which was the root cause #63480 originally fixed by deferring), the symptom and timing would match exactly. Worth comparing the channel-startup hook order against the pre-#63480 codepath.
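
To make the suspected ordering problem concrete, here is a minimal sketch of the difference between waiting on a slow step before channel startup and deferring it. It is a Python asyncio illustration of the pattern only, not the gateway's code (which runs on Node); `load_chat_history`, `start_channels`, and the delay value are hypothetical stand-ins.

```python
import asyncio

# Stand-in for the ~3 min stall, shortened so the sketch finishes quickly.
HISTORY_DELAY_S = 0.2


async def load_chat_history(events):
    # Hypothetical slow step: whatever the channel-startup path ends up waiting on.
    await asyncio.sleep(HISTORY_DELAY_S)
    events.append("history loaded")


async def start_channels(events):
    # Hypothetical channel startup: fast once it is allowed to run.
    events.append("channels started")


async def startup_blocking():
    """Channel startup waits for the slow step to finish first."""
    events = []
    await load_chat_history(events)
    await start_channels(events)
    return events


async def startup_deferred():
    """The slow step is scheduled in the background; channels start right away."""
    events = []
    history_task = asyncio.create_task(load_chat_history(events))
    await start_channels(events)
    await history_task  # still completed, just not ahead of channel startup
    return events


blocking_order = asyncio.run(startup_blocking())
deferred_order = asyncio.run(startup_deferred())
print("blocking:", blocking_order)
print("deferred:", deferred_order)
```

In the blocking variant the channels come up only after the slow step completes, which is the shape of the symptom reported here. In the deferred variant the same work still runs, but it no longer gates channel startup.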

Environment

| Item | Value |
| --- | --- |
| OpenClaw version | 2026.4.25 (stable) |
| OS | Debian 11 (AWS Bitnami) |
| Node | 24.13.0 (NVM) |
| Install | npm install -g openclaw@2026.4.25 --ignore-scripts + restore deps + eager bundled-plugin postinstall |
| Gateway service | systemd user-level |
| Plugins enabled | acpx, active-memory, bonjour (dormant via OPENCLAW_DISABLE_BONJOUR=1), browser, device-pair, memory-core, memory-wiki, phone-control, talk-voice, telegram |
| Hardening flags set | OPENCLAW_SERVICE_REPAIR_POLICY=external, OPENCLAW_DISABLE_BONJOUR=1, OPENCLAW_EAGER_BUNDLED_PLUGIN_DEPS=1 |

Mitigation suggestion

Two options worth considering, possibly in combination:

  1. Restore the defer behavior from PR #63480 ("fix(gateway): start channels before WebSocket handlers (#63450)"): whatever code path is again synchronously waiting on chat.history (or an equivalent runtime-context build) during channel startup should be deferred or made async, the same way #63480 originally addressed it.
  2. Stop logging ready (...) until channels are actually up: the misleading ready log makes ops scripts (and humans) assume the gateway is usable when it isn't. Either hold the ready line back until channels have registered (currently ~3 minutes later) or emit a separate available event after channels register.
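
Until something like option 2 exists, an ops script can avoid trusting the ready line by waiting for a channel-start entry instead. The sketch below classifies gateway state from log lines. The two patterns are copied from the log excerpts in this report; treating "starting provider" as the availability signal is an assumption, not a documented contract, and the function and state names are made up for illustration.

```python
import re

# Markers taken from the log excerpts in this report. Using them as readiness
# signals is an assumption; the gateway does not document them as a contract.
READY_RE = re.compile(r"\bready \(\d+ plugins:")
CHANNEL_UP_RE = re.compile(r"starting provider \(")


def gateway_state(lines):
    """Classify gateway state from the log lines seen so far.

    Returns "starting", "ready-logged" (ready line seen, channels not yet up),
    or "channels-up". A later ready line resets the state, so a log that spans
    several restarts reflects the most recent one.
    """
    state = "starting"
    for line in lines:
        if READY_RE.search(line):
            state = "ready-logged"
        elif CHANNEL_UP_RE.search(line) and state == "ready-logged":
            state = "channels-up"
    return state


# Lines from the Restart 2 excerpt: what the log looks like inside the window...
during_window = [
    "13:52:48.055  INFO  ready (10 plugins: ...; 11.6s)",
    "13:52:48.190  INFO  starting channels and sidecars...",
    "13:52:48.390  INFO  loaded 4 internal hook handlers",
]
# ...and once the provider has started.
after_window = during_window + [
    "13:55:44.897  INFO  Browser control listening on http://127.0.0.1:18791/ (auth=token)",
    "13:55:45.823  INFO  [default] starting provider (@raywu07_bot)",
]

print(gateway_state(during_window))
print(gateway_state(after_window))
```

A deployment script could poll this against the gateway log after a restart and proceed only once the state reaches channels-up, rather than as soon as the ready line appears.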
