[Bug]: Gateway main thread CPU-bound at ~100% on v2026.4.26 / current main; clean on v2026.4.22 (fs.stat storm in microtask queue) #74328

@odrobnik

Description

Bug type

Regression (worked before, now fails)

Summary

After upgrading from v2026.4.22 to v2026.4.26 (also reproduces on current main at 9bb1e59, which package.json reports as 2026.4.27), the gateway pegs its single main thread at ~100% CPU and stops responding to local probes. Same host, same ~/.openclaw, no config changes; only the checked-out revision differs.

I've seen #74209 and the regression range overlaps, but on my machine the dominant signal in a CPU sample is an `fs.stat` storm in the JS microtask queue rather than bonjour. Filing separately so the maintainers can triage it as either the same root cause or a sibling regression.

Versions

  • macOS 26.4 (Mac Studio, ARM64), Node 22 via Homebrew
  • Bad: v2026.4.26 (be8c246), and current main at 9bb1e59 (reports as 2026.4.27)
  • Good: v2026.4.22 (00bd2cf)

Steps to reproduce

Same ~/.openclaw, just switch versions:

```shell
git checkout v2026.4.26 && pnpm install && pnpm build
openclaw gateway restart
sleep 30
ps -o stat,%cpu,etime $(pgrep -f 'dist/index.js gateway')
# → R  95-100  sustained

git checkout v2026.4.22 && pnpm install && pnpm build
openclaw gateway restart
sleep 30
ps -o stat,%cpu,etime $(pgrep -f 'dist/index.js gateway')
# → S  2.6
```

Side-by-side, same host

| Probe | 4.26 / main | 4.22 |
| --- | --- | --- |
| `ps` STAT / %CPU | R / 95-100 | S / 2.6 |
| eventLoopDelayMaxMs (liveness warning) | up to 314 866 ms | none reported |
| eventLoopUtilization | 0.95–1.00 | <0.10 |
| `curl -m 3 http://127.0.0.1:18789/` | 3 s timeout | 3-9 ms |
| Discord WS lifetime | closes 1000/zombie every 60-90 s | stable |
| `openclaw gateway status` | "Connectivity probe: failed (timeout)" while runtime is "active" | OK, admin-capable |

A representative liveness warning on main:

```
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization
  interval=320s eventLoopDelayP99Ms=314874.8 eventLoopDelayMaxMs=314874.8
  eventLoopUtilization=0.999 cpuCoreRatio=0.585 active=0 waiting=0 queued=0
```

i.e. the loop blocked for over 5 minutes with nothing in the active queue.

CPU sample on main (5 s, idle, no incoming traffic)

`/usr/bin/sample <gateway-pid> 5`; all 4074 stacks collapse to this single path:

```
4074 uv__io_poll
+ 4074 uv__async_io
+   4074 uv__work_done
+     4074 MakeLibuvRequestCallback<uv_fs_s>::Wrapper
+       4074 node::fs::AfterStat
+         4074 MicrotaskQueue::PerformCheckpointInternal
+           4074 MicrotaskQueue::RunMicrotasks
+             4074 Builtins_PromiseFulfillReactionJob
+               4074 AsyncFunctionAwaitResolveClosure
+                 4074 <JIT JS, unsymbolicated>
```

Every sample lands in `node::fs::AfterStat`: the main thread is consumed resolving promises from a flood of `fs.stat` calls. The same 5 s sample on 4.22, with the same config and data, spends almost the whole time waiting in kqueue.

Effect on real usage

Channel messages arrive (Discord WS frames are received), the session enters state=processing and ages indefinitely without a reply — I saw agent:main:discord:direct:oliver reach age=1376s queueDepth=1 before the gateway watchdog killed and restarted the process. No subprocess is involved (no docker, no acpx wrapper, no model API call) — the wedge is purely in-JS work on the main thread.
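The starvation mechanism itself is easy to show in isolation (again a hypothetical sketch, not gateway code): as long as the microtask queue keeps refilling, macrotasks such as timers, I/O callbacks, and queued message handlers never get a turn, which is why a session can sit at queueDepth=1 and age indefinitely.

```javascript
// Hypothetical illustration (not gateway code): a self-refilling microtask
// queue starves macrotasks. The zero-delay timer below is overdue almost
// immediately, yet it cannot fire until the microtask queue drains.
async function starveTimers(iterations) {
  let timerFired = false;
  setTimeout(() => { timerFired = true; }, 0); // macrotask, due at once

  for (let i = 0; i < iterations; i++) {
    await Promise.resolve(); // each await re-enters the microtask queue
  }
  // Still false: all iterations ran before a single macrotask could.
  return timerFired;
}

starveTimers(1_000_000).then((fired) =>
  console.log('timer ran during 1e6 microtasks?', fired)); // → false
```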

Workaround

Pin to v2026.4.22, and disable the OpenClaw Auto-Update cron so the bad version isn't reapplied on the next run.
