[Bug]: 4.29 dispatch prep stages take ~73s of synchronous CPU work, blocking event loop
### Bug type
Performance regression (introduced in 4.29; not present in 4.27)
### Summary
Upgrading from 4.24/4.27 → 4.29 caused every agent dispatch to take 2–5 minutes to first reply. The gateway log shows new `prep stages` instrumentation in 4.29 that reports each dispatch spending ~73 s of synchronous CPU work before the LLM is even called, with single operations blocking the Node.js event loop for over 30 seconds.

The same 13-agent workspace setup on 4.27 returns replies in <1 minute.

A separate Python-based agent runtime (Hermes) on the same machine, using the same Z.AI/MiniMax/DeepSeek API keys and the same `glm-5-turbo` model, returns replies in <10 seconds, confirming the bottleneck is inside the OpenClaw runtime, not the LLM provider, network, or model.
### Evidence
#### Stage breakdown from a real 4.29 dispatch (commander, `glm-5-turbo`, ~5 min total)
```
[trace:embedded-run] startup stages totalMs=28630
  workspace:1ms, runtime-plugins:3ms, hooks:0ms,
  model-resolution:6794ms, auth:12471ms,
  context-engine:0ms, attempt-dispatch:11612ms

[trace:embedded-run] prep stages totalMs=73394
  workspace-sandbox:610ms, skills:0ms,
  core-plugin-tools:8765ms, bootstrap-context:8821ms,
  bundle-tools:3532ms,
  system-prompt:23317ms,            ← largest contributor
  session-resource-loader:7546ms,
  agent-session:5ms,
  stream-setup:20798ms              ← second-largest

[diagnostic] liveness warning:
  eventLoopDelayMaxMs=34024.2       ← single 34-second event-loop block
  eventLoopUtilization=1
  cpuCoreRatio=1.013
```
The `prep stages` total ~73 s and the `startup stages` add another ~28 s, so each dispatch consumes ~100 seconds of CPU time before the model even starts streaming. With the CPU saturated, the fallback chain then trips cascading fetch timeouts for another 1–3 minutes: 408 `[fetch-timeout] fetch timeout reached` log lines were observed in a 2-hour window of typical use.
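For independent confirmation of the liveness numbers, Node's built-in `perf_hooks` exposes the same two measurements outside OpenClaw's own diagnostics. A minimal sketch (the 20 ms resolution and 120 s window are arbitrary choices):

```ts
// Measure event-loop delay and utilization around one dispatch,
// independently of OpenClaw's instrumentation (Node >= 18).
import { monitorEventLoopDelay, performance } from 'node:perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
histogram.enable();
const eluBaseline = performance.eventLoopUtilization();

// ...trigger a single dispatch here (e.g. send "hi" to the Telegram bot)...

setTimeout(() => {
  const elu = performance.eventLoopUtilization(eluBaseline);
  histogram.disable();
  // histogram values are in nanoseconds; the gateway logs milliseconds
  console.log('eventLoopDelayMaxMs =', histogram.max / 1e6);
  console.log('eventLoopUtilization =', elu.utilization); // ~1 means fully busy
}, 120_000);
```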
#### 4.27 vs 4.29 instrumentation diff
`grep prepStages.mark` returns:

- 4.29: `dist/selection-CwAy0mf2.js`: 9 hits (workspace-sandbox, skills, core-plugin-tools, bootstrap-context, bundle-tools, system-prompt, session-resource-loader, agent-session, stream-setup)
- 4.27: `dist/selection-*.js`: 0 hits
The new `prep stages` instrumentation is the most visible signal that the dispatch flow was substantially reworked in 4.29.
#### Cross-runtime baseline (same machine, same provider, same model)
| Runtime | Reply latency | Notes |
| --- | --- | --- |
| Hermes (Python) | <10 s | Same `glm-5-turbo`, same Z.AI Coding Plan key |
| OpenClaw 4.27 | <60 s | Production agents, 13 Telegram channels |
| OpenClaw 4.29 | 2–5 min | Same workspace, same config |
### Reproduction steps
1. Install `openclaw@2026.4.29` with a non-trivial workspace (≥10 skills under `workspace-*/skills/`) and a Z.AI / MiniMax / DeepSeek primary model.
2. Bind a Telegram channel to one of the agents.
3. Send any short prompt (e.g. `hi`).
4. Observe in `journalctl --user -u openclaw-gateway`:
   - `prep stages totalMs >= 60000`
   - `eventLoopDelayMaxMs > 5000`
   - reply latency of 2–5 minutes
5. Downgrade to `openclaw@2026.4.27` (set `OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1`), restart the gateway, and repeat step 3; the reply now arrives in <60 s.
### Suspected hot paths
`dist/selection-CwAy0mf2.js` regions between the new prep stage marks:

- `system-prompt` stage (23 s): `buildEmbeddedSystemPrompt` → `buildAgentSystemPrompt` (in `system-prompt-DZrkA5Mv.js:282-648`) does large synchronous string concatenation, XML escaping, and conditional rendering of all skill metadata, with no cache keyed on (skills hash + workspace files hash). `bootstrap-cache-CmO66T4a.js` only caches per session, which is invalidated on each dispatch.
- `stream-setup` stage (21 s): covers `selection-CwAy0mf2.js:6934-7148`, including `applyExtraParamsToAgent` calls into provider runtime deps. (Not the new Google prompt cache path; `isGooglePromptCacheEligible` early-returns for non-Gemini models.)
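To make the first suspicion concrete, here is a schematic reconstruction of what the `system-prompt` stage appears to be doing. This is hypothetical code written from the trace, not OpenClaw's actual source; every name except `buildAgentSystemPrompt` is invented:

```ts
// Hypothetical reconstruction of the suspected hot path (NOT OpenClaw's real
// code). The point: all skill metadata is re-rendered synchronously on every
// dispatch, so ~3 MB of skills/ per workspace can turn into seconds of
// blocked event loop each time.
interface Skill { name: string; metadata: string }

function escapeXml(s: string): string {
  const map: Record<string, string> = {
    '<': '&lt;', '>': '&gt;', '&': '&amp;', "'": '&apos;', '"': '&quot;',
  };
  return s.replace(/[<>&'"]/g, (c) => map[c]);
}

function buildAgentSystemPrompt(skills: Skill[]): string {
  let prompt = '';
  for (const skill of skills) {
    // synchronous concat + escape over every skill, no memoization:
    prompt += `<skill name="${escapeXml(skill.name)}">`
      + escapeXml(skill.metadata)
      + '</skill>\n';
  }
  return prompt; // rebuilt from scratch on the next dispatch
}
```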
### Impact
- Telegram bots become unusable (a >2 min reply means users assume the bot is broken).
- Per-dispatch CPU saturation cascades: the gateway can only handle a single request at a time without queueing.
- `[telegram] sendChatAction failed` and `typing TTL reached (2m); stopping typing indicator` appear consistently.
### Workaround in production
Pinned to `openclaw@2026.4.27` and disabled `weekly-openclaw-update.timer` to prevent auto-upgrade. This required:

- an `Environment=OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1` systemd drop-in, shown below (4.27 refuses to start against a config last written by 4.29);
- stripping `plugins.entries.active-memory.config` (the 4.27 schema rejects it as additional properties).
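For reference, the drop-in (the unit name matches the `journalctl` command above; the file name is arbitrary):

```ini
# ~/.config/systemd/user/openclaw-gateway.service.d/downgrade.conf
[Service]
Environment=OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1
# then: systemctl --user daemon-reload && systemctl --user restart openclaw-gateway
```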
### Environment
- openclaw 2026.4.29 (regression) vs 2026.4.27 (working baseline)
- Node.js v22.22.2 (managed via nvm)
- Ubuntu 25.10 (Linux 6.17.0-22-generic)
- Gateway run via user systemd unit (`systemctl --user`)
- 13 agents, average workspace `skills/` size ~3 MB, several `glm-5-turbo` / `MiniMax-M2.7` / `deepseek-v4-flash` models in fallback chains
### Suggested fix direction
- Cache the built system prompt keyed on (skills `SKILL.md` hash + `AGENTS.md`/`SOUL.md`/`IDENTITY.md`/`USER.md`/`MEMORY.md` hashes); invalidate only when those files change, and skip `buildEmbeddedSystemPrompt` entirely on a cache hit (first sketch below).
- Move CPU-bound prep work off the main event loop, via a worker thread or a chunked yield (second sketch below).
- Reduce per-dispatch work in `stream-setup` if possible (verify wrapper layers don't re-initialize on every dispatch).
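A minimal sketch of the caching idea, assuming a content-hash key; every name here is hypothetical except `buildEmbeddedSystemPrompt` itself:

```ts
import { createHash } from 'node:crypto';
import { readFileSync } from 'node:fs';

// Hypothetical process-wide cache, keyed on the content of every file that
// can change the prompt and invalidated only when those files change.
const promptCache = new Map<string, string>();

function promptCacheKey(files: string[]): string {
  const h = createHash('sha256');
  for (const f of [...files].sort()) {
    h.update(f);
    h.update(readFileSync(f)); // SKILL.md, AGENTS.md, SOUL.md, IDENTITY.md, ...
  }
  return h.digest('hex');
}

function getSystemPrompt(files: string[], buildEmbeddedSystemPrompt: () => string): string {
  const key = promptCacheKey(files);
  let prompt = promptCache.get(key);
  if (prompt === undefined) {
    prompt = buildEmbeddedSystemPrompt(); // the 23 s build is only paid on a miss
    promptCache.set(key, prompt);
  }
  return prompt;
}
```

Hashing a few MB of files per dispatch costs milliseconds; an mtime-based key would be cheaper still.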
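And a sketch of the chunked-yield option, assuming the prep loops can be sliced per item (a worker thread would take the cost off the main thread entirely, at the price of serializing the inputs):

```ts
import { setImmediate as yieldToEventLoop } from 'node:timers/promises';

// Render items in ~25 ms time slices, yielding to the event loop between
// slices so heartbeats, typing indicators, and other dispatches keep
// being serviced.
async function renderChunked<T>(
  items: T[],
  renderOne: (item: T) => string,
  sliceMs = 25,
): Promise<string> {
  const parts: string[] = [];
  let sliceStart = Date.now();
  for (const item of items) {
    parts.push(renderOne(item));
    if (Date.now() - sliceStart >= sliceMs) {
      await yieldToEventLoop(); // let pending I/O and timers run
      sliceStart = Date.now();
    }
  }
  return parts.join('');
}
```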
Happy to provide additional traces or test patches against affected files.