
[Bug]: 4.29 dispatch prep stages take ~73s of synchronous CPU work, blocking event loop #75999

@zackchiutw

Description


Bug type

Performance regression (introduced in 4.29; not present in 4.27)

Summary

Upgrading from 4.24/4.27 → 4.29 caused every agent dispatch to take 2–5 minutes to first reply. The gateway log shows new prep stages instrumentation in 4.29 that reports each dispatch spending ~73 s of synchronous CPU work before the LLM is even called, with single operations blocking the Node.js event loop for over 30 seconds.

The same 13-agent workspace setup on 4.27 returns replies in <1 minute.

A separate Python-based agent runtime (Hermes) on the same machine, using the same Z.AI/MiniMax/DeepSeek API keys and same glm-5-turbo model, returns replies in <10 seconds — confirming the bottleneck is inside the OpenClaw runtime, not the LLM provider, network, or model.

Evidence

Stage breakdown from a real 4.29 dispatch (commander, glm-5-turbo, ~5 min total)

[trace:embedded-run] startup stages totalMs=28630
  workspace:1ms, runtime-plugins:3ms, hooks:0ms,
  model-resolution:6794ms, auth:12471ms,
  context-engine:0ms, attempt-dispatch:11612ms

[trace:embedded-run] prep stages totalMs=73394
  workspace-sandbox:610ms, skills:0ms,
  core-plugin-tools:8765ms, bootstrap-context:8821ms,
  bundle-tools:3532ms,
  system-prompt:23317ms,            ← largest contributor
  session-resource-loader:7546ms,
  agent-session:5ms,
  stream-setup:20798ms              ← second-largest

[diagnostic] liveness warning:
  eventLoopDelayMaxMs=34024.2 ← single 34-second event-loop block
  eventLoopUtilization=1
  cpuCoreRatio=1.013

The prep stages total ~73 s and the startup stages add another ~28 s, so each dispatch consumes ~100 seconds of CPU time before the model even starts streaming. With the CPU saturated, the fallback chain then trips cascading fetch timeouts for another 1–3 minutes.

408 occurrences of the [fetch-timeout] fetch timeout reached log line were observed in a 2-hour window of typical use.
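The stalls can be confirmed independently of OpenClaw's own instrumentation with Node's built-in perf_hooks histogram. A minimal sketch (hypothetical, not OpenClaw code; it must run inside the gateway process to observe its loop, e.g. preloaded via NODE_OPTIONS="--import /path/to/loop-monitor.mjs"):

  // loop-monitor.mjs (hypothetical preload, not part of OpenClaw)
  import { monitorEventLoopDelay } from 'node:perf_hooks';

  const h = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
  h.enable();

  // Print the worst stall seen in each 30 s window; histogram values are
  // nanoseconds, so divide by 1e6 to compare with eventLoopDelayMaxMs.
  setInterval(() => {
    console.error(`[loop-monitor] maxDelayMs=${(h.max / 1e6).toFixed(1)}`);
    h.reset();
  }, 30_000).unref();

On 4.29 a single dispatch should push maxDelayMs into the tens of thousands, matching the eventLoopDelayMaxMs=34024.2 line above.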

4.27 vs 4.29 instrumentation diff

grep prepStages.mark returns:

  • 4.29 dist/selection-CwAy0mf2.js: 9 hits (workspace-sandbox, skills, core-plugin-tools, bootstrap-context, bundle-tools, system-prompt, session-resource-loader, agent-session, stream-setup)
  • 4.27 dist/selection-*.js: 0 hits

The new prep stages instrumentation is the most visible signal that dispatch flow was substantially reworked in 4.29.

Cross-runtime baseline (same machine, same provider, same model)

  Runtime            Reply latency   Notes
  Hermes (Python)    <10 s           Same glm-5-turbo, same Z.AI Coding Plan key
  OpenClaw 4.27      <60 s           Production agents, 13 Telegram channels
  OpenClaw 4.29      2–5 min         Same workspace, same config

Reproduction steps

  1. Install openclaw@2026.4.29 with a non-trivial workspace (≥10 skills under workspace-*/skills/) and a Z.AI / MiniMax / DeepSeek primary model.
  2. Bind a Telegram channel to one of the agents.
  3. Send any short prompt (e.g. hi).
  4. Observe in journalctl --user -u openclaw-gateway:
    • prep stages totalMs >= 60000
    • eventLoopDelayMaxMs > 5000
    • Reply latency 2–5 minutes
  5. Downgrade to openclaw@2026.4.27 (set OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1), restart gateway, repeat step 3 — reply now <60 s.

Suspected hot paths

dist/selection-CwAy0mf2.js regions between the new prep stage marks:

  • system-prompt stage (23 s): buildEmbeddedSystemPrompt → buildAgentSystemPrompt (in system-prompt-DZrkA5Mv.js:282-648) does large synchronous string concatenation + XML escaping + conditional rendering of all skill metadata, with no cache keyed on (skills hash + workspace files hash). bootstrap-cache-CmO66T4a.js only caches per-session and is invalidated on each dispatch. A cache sketch follows this list.
  • stream-setup stage (21 s): covers selection-CwAy0mf2.js:6934-7148, including applyExtraParamsToAgent calls into provider runtime deps. (Not the new Google prompt cache path — isGooglePromptCacheEligible early-returns for non-Gemini models.)
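If the missing cache is indeed the culprit, the shape of the fix is small. A hedged sketch of a content-addressed prompt cache (every identifier here, getSystemPrompt, hashInputs, the build callback, is hypothetical; the real symbols live in the minified bundles):

  import { createHash } from 'node:crypto';
  import { readFileSync } from 'node:fs';

  // Key the cache on the content of every file that feeds the prompt
  // (SKILL.md files, AGENTS.md, SOUL.md, ...), not on the session.
  const promptCache = new Map<string, string>();

  function hashInputs(files: string[]): string {
    const h = createHash('sha256');
    for (const file of files) {
      h.update(file).update('\0');
      try {
        h.update(readFileSync(file));
      } catch {
        h.update('<missing>'); // absent files still contribute to the key
      }
    }
    return h.digest('hex');
  }

  // build() is the expensive synchronous builder (~23 s today); it only
  // runs when one of the input files actually changed.
  function getSystemPrompt(files: string[], build: () => string): string {
    const key = hashInputs(files);
    let prompt = promptCache.get(key);
    if (prompt === undefined) {
      prompt = build();
      promptCache.set(key, prompt);
    }
    return prompt;
  }

Unlike the per-session cache in bootstrap-cache-CmO66T4a.js, this survives across dispatches and only invalidates when file content changes.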

Impact

  • Telegram bots become unusable (>2 min reply means users assume the bot is broken).
  • Per-dispatch CPU saturation cascades: gateway can only handle a single request at a time without queueing.
  • The log lines [telegram] sendChatAction failed and typing TTL reached (2m); stopping typing indicator appear consistently.

Workaround in production

Pinned to openclaw@2026.4.27 and disabled weekly-openclaw-update.timer to prevent auto-upgrade. This required:

  • An Environment=OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1 systemd drop-in (since 4.27 refuses to start against a config last written by 4.29); see the drop-in after this list.
  • Stripping plugins.entries.active-memory.config (4.27 schema rejects it as additional properties).
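For reference, the drop-in looks roughly like this (path assumed from a default systemctl --user setup; the unit name matches the journalctl command above):

  # ~/.config/systemd/user/openclaw-gateway.service.d/override.conf
  [Service]
  Environment=OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1

followed by systemctl --user daemon-reload and a gateway restart.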

Environment

  • openclaw 2026.4.29 (regression) vs 2026.4.27 (baseline working)
  • Node.js v22.22.2 (managed via nvm)
  • Ubuntu 25.10 (Linux 6.17.0-22-generic)
  • Gateway run via user systemd unit (systemctl --user)
  • 13 agents, average workspace skills/ size ~3 MB, several glm-5-turbo / MiniMax-M2.7 / deepseek-v4-flash models in fallback chains

Suggested fix direction

  1. Cache the built system prompt keyed on (skills SKILL.md hash + AGENTS.md/SOUL.md/IDENTITY.md/USER.md/MEMORY.md hashes); invalidate only when those files change. Skip buildEmbeddedSystemPrompt on cache hit (see the cache sketch under "Suspected hot paths").
  2. Move CPU-bound prep work off the main event loop (worker thread or chunked yield); see the worker sketch after this list.
  3. Reduce per-dispatch work in stream-setup if possible (verify wrapper layers don't re-initialize per dispatch).
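For (2), the standard Node pattern is worker_threads. A minimal single-worker sketch; file names and buildPrompt are placeholders, not OpenClaw internals:

  // prompt-worker.mjs: runs the CPU-heavy build off the gateway's loop.
  import { parentPort, workerData } from 'node:worker_threads';
  import { buildPrompt } from './build-prompt.mjs'; // hypothetical module

  parentPort?.postMessage(buildPrompt(workerData));

  // Dispatch side (separate module): the event loop stays free to serve
  // other requests while the worker burns CPU.
  import { Worker } from 'node:worker_threads';

  export function buildPromptOffLoop(input: unknown): Promise<string> {
    return new Promise((resolve, reject) => {
      const worker = new Worker(new URL('./prompt-worker.mjs', import.meta.url), {
        workerData: input,
      });
      worker.once('message', resolve);
      worker.once('error', reject);
    });
  }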

Happy to provide additional traces or test patches against affected files.
