Summary
On OpenClaw 2026.4.26, the gateway intermittently stalls under active orchestration / multi-agent workload. During stalls, the gateway PID remains alive, but local `/readyz` probes time out and the main Node/OpenClaw gateway thread is CPU-hot while the host is otherwise mostly idle.
This presents externally as Discord/app-command failures, delayed `/status` output, dropped typing indicators, and delayed agent coordination, but the captured evidence points to gateway/control-plane starvation rather than a Discord-only issue.
Environment
- OpenClaw: 2026.4.26
- Runtime: Node 24.x
- OS: Linux, systemd user service
- Gateway: local loopback on `127.0.0.1:18789`
- Channel surfaces involved: Discord DM-only, WebUI/WS, local `/readyz`
- Plugins: `memory-lancedb-pro` present with auto-recall/smart extraction enabled
- Host had ample RAM; system CPU was mostly idle during captured stalls
Sensitive paths, tokens, hostnames, and private config values are intentionally omitted/redacted.
Observed behavior
Captured multiple stall windows where:
- `curl --max-time 2 http://127.0.0.1:18789/readyz` timed out locally
- the gateway PID stayed alive
- `top -H` showed the main OpenClaw/Node gateway thread at ~99.9% CPU
- overall host CPU stayed mostly idle, usually ~97-98% idle
- RSS was around 1.4-1.5 GB, not host memory pressure
- other V8/libuv/tokio worker threads were mostly sleeping
Representative packet excerpts:
```
LOOPSTALL_PACKET count=1 trigger_ts=2026-04-28T01:34:44-07:00 pid=229097 readyz=000 2.002326curlfail cpu=34.2 rss_kb=1464220 nlwp=47
Threads: 47 total, 1 running, 46 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.6 us, 0.7 sy, 0.0 ni, 97.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
229097 argus 20 0 23.8g 1.4g 110144 R 99.9 0.6 8:55.89 opencla+

LOOPSTALL_PACKET count=3 trigger_ts=2026-04-28T01:34:57-07:00 pid=229097 readyz=000 2.002098curlfail cpu=34.6 rss_kb=1466852 nlwp=47
Threads: 47 total, 1 running, 46 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.0 us, 0.8 sy, 0.0 ni, 97.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
229097 argus 20 0 23.8g 1.4g 110144 R 99.9 0.6 9:08.90 opencla+

LOOPSTALL_PACKET_WAIT count=2 trigger_ts=2026-04-28T01:37:29-07:00 pid=229097 readyz=000 2.002292curlfail cpu=38.3 rss_kb=1544108 nlwp=75
Threads: 75 total, 1 running, 74 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.3 us, 0.1 sy, 0.0 ni, 98.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
229097 argus 20 0 23.8g 1.5g 110144 R 99.9 0.6 9:54.53 opencla+
```
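For context, a minimal sketch of the kind of watchdog loop that produces packets like these (a reconstruction for illustration, not the actual capture script; the output format here is simplified). It assumes Node 18+ for built-in `fetch` and `AbortSignal.timeout`:

```ts
// watchdog.ts -- probe /readyz like `curl --max-time 2`, and snapshot
// per-thread CPU state for the gateway PID whenever the probe misses.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);
const PID = Number(process.argv[2]); // gateway PID, e.g. 229097
let count = 0;

async function probe(): Promise<void> {
  try {
    const res = await fetch("http://127.0.0.1:18789/readyz", {
      signal: AbortSignal.timeout(2_000), // mirrors `curl --max-time 2`
    });
    if (res.ok) return; // gateway responsive; nothing to record
  } catch {
    // timeout or connection refused -> fall through and capture a packet
  }
  count += 1;
  // Batch mode (-b), show threads (-H), one iteration (-n 1), this PID only.
  const { stdout } = await run("top", ["-bH", "-n", "1", "-p", String(PID)]);
  console.log(`LOOPSTALL_PACKET count=${count} ts=${new Date().toISOString()} pid=${PID}`);
  console.log(stdout);
}

setInterval(() => probe().catch(console.error), 5_000);
```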
During the same broader incident, reload/restart remained deferred for more than 9 minutes behind active bookkeeping:
```
2026-04-28T01:56:41-07:00 [reload] restart still deferred after 426856ms with 3 task run(s) active
2026-04-28T01:57:29-07:00 [reload] restart still deferred after 474798ms with 2 operation(s), 1 reply(ies), 1 embedded run(s), 3 task run(s) active
2026-04-28T01:57:59-07:00 [reload] restart still deferred after 505189ms with 2 operation(s), 1 reply(ies), 1 embedded run(s), 3 task run(s) active
2026-04-28T01:58:30-07:00 [reload] restart still deferred after 535678ms with 2 operation(s), 1 reply(ies), 1 embedded run(s), 3 task run(s) active
2026-04-28T01:59:00-07:00 [reload] restart still deferred after 566110ms with 3 task run(s) active
```
Adjacent plugin/memory activity was present, but I do not have enough evidence to say it is the root cause. Example adjacent lines:
```
2026-04-28T01:57:30-07:00 [plugins] memory-lancedb-pro: auto-recall query truncated from 6874 to 1000 chars
2026-04-28T01:57:32-07:00 [plugins] memory-lancedb-pro: strict governance filtered all candidates; using relaxed fallback recall (...)
2026-04-28T01:57:45-07:00 [plugins] memory-lancedb-pro: auto-recall timed out after 15000ms; skipping memory injection to avoid stalling agent startup
2026-04-28T01:58:05-07:00 [plugins] memory-lancedb-pro: injecting 2 memories into context for agent argus
2026-04-28T01:58:54-07:00 [plugins] memory-lancedb-pro: smart-extracted 1 created, 1 merged, 2 skipped for agent argus
```
User-visible effects during these windows:
- Discord `/status` returned an application failure, then stale status output was delivered much later, after a newer user message
- Discord typing indicators paused/dropped
- WebUI/WS operations such as `sessions.list` and `chat.history` became delayed or bunched after stalls
- reload/config changes did not apply while restart remained deferred behind active run bookkeeping
- orchestration became unsafe because corrective messages and status checks could be delayed/stale
Expected behavior
- `/readyz` should remain responsive, or return a degraded status, even during long agent/plugin/model work.
- Discord/app-command/control-plane traffic should not be blocked by a single CPU-hot gateway event-loop path.
- Deferred reload/restart should not remain blocked indefinitely by stale reply/embedded/task-run accounting.
- Long-running plugin hooks or recall work should be bounded/offloaded so they cannot starve gateway control paths (see the sketch after this list).
- If active runs block reload, there should be an inspectable source of truth and a safe timeout/reaper path.
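On the bounding/offloading point above, a minimal sketch of the pattern I mean. The names (`runHookBounded`, the hook signature) are hypothetical, not OpenClaw's actual plugin API, and the caveat in the comment is the crux of this report: a promise deadline only bounds awaited async work, so a hook that spins the CPU synchronously still blocks the event loop and has to move to a worker thread or child process.

```ts
// bounded-hook.ts -- hypothetical sketch of a hard per-hook deadline;
// illustrates the pattern only, not OpenClaw's real plugin interface.

/** Run a plugin hook with a deadline; on timeout, drop its result and move on. */
async function runHookBounded<T>(
  hookName: string,
  hook: () => Promise<T>,
  timeoutMs: number,
): Promise<T | undefined> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => {
      console.warn(`[plugins] ${hookName} timed out after ${timeoutMs}ms; skipping`);
      resolve(undefined);
    }, timeoutMs);
  });
  try {
    // NOTE: this only bounds *awaited async* work. Synchronous CPU spin in
    // the hook still starves the loop (the ~99.9% single-thread signature
    // above); that work needs a worker_threads Worker or a subprocess.
    return await Promise.race([hook(), deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

The `memory-lancedb-pro` auto-recall path already logs exactly this shape (`auto-recall timed out after 15000ms; skipping memory injection`), so a gateway-level deadline around every hook would be consistent with behavior the plugin layer already has in places.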
Why this looks gateway/control-plane related
This is not just Discord channel flakiness:
- the local `/readyz` endpoint missed on loopback while the process was alive
- the main gateway thread was running at ~99.9% CPU during misses
- the host was mostly idle and not RAM constrained
- reload/restart bookkeeping reported active reply/embedded/task runs for many minutes
- stale `/status` output was later delivered after the user had moved on
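The signature is easy to reproduce standalone, which is why I read it as event-loop starvation rather than anything Discord-specific. The sketch below is a toy server, not OpenClaw code; any synchronous hot path on the main thread makes loopback probes miss while the PID stays alive and one thread pegs:

```ts
// spin-repro.ts -- toy demonstration of the failure signature (not OpenClaw
// code): one synchronous hot path starves an otherwise-healthy /readyz on
// the same event loop. Port mirrors the report; any free port works.
import { createServer } from "node:http";

createServer((req, res) => {
  if (req.url === "/readyz") {
    res.end("ok"); // instant -- as long as the event loop is free
  } else if (req.url === "/spin") {
    const until = Date.now() + 10_000;
    while (Date.now() < until) {} // 10s synchronous spin; no I/O is serviced
    res.end("done");
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(18789, "127.0.0.1");

// While /spin is in flight, `curl --max-time 2 http://127.0.0.1:18789/readyz`
// times out, the PID stays alive, and `top -H` shows one thread near 100%
// with the rest of the host idle -- the same shape as the packets above.
```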
Related / possibly relevant issues
This looks distinct from, but likely related to, prior gateway/event-loop/resource issues. The case here is specifically 2026.4.26 gateway runtime behavior: local `/readyz` stalls + main gateway thread CPU spin + deferred reload behind active run accounting.
What would help
- Recommended way to capture a V8/Node CPU profile or event-loop delay trace from a live gateway during a stall
- Built-in gateway event-loop lag diagnostics exposed in logs or `/readyz` (a minimal sketch of the kind of diagnostic I mean follows this list)
- A hard timeout/reaper for stale active reply / embedded run / task-run accounting
- A way to inspect active reply/embedded/task-run blockers directly
- Guidance on whether plugin `before_prompt_build` hooks can block the gateway event loop
- A safe way to force reload/restart after a grace period without waiting indefinitely on stale task bookkeeping
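On the profiling and lag-diagnostic items above: for a live process, my understanding is that sending `SIGUSR1` to a Node process activates the inspector so DevTools can attach and record a CPU profile, and that starting with `--cpu-prof` writes a profile on exit; I have not verified either against this gateway build. For event-loop lag specifically, Node's built-in `perf_hooks` histogram would be enough for a first-pass diagnostic, roughly:

```ts
// loop-lag.ts -- sketch of first-pass event-loop lag reporting using Node's
// perf_hooks; where OpenClaw would surface it (logs, /readyz) is the ask.
import { monitorEventLoopDelay } from "node:perf_hooks";

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

const ms = (ns: number) => (ns / 1e6).toFixed(1);

setInterval(() => {
  // Values are nanoseconds. During a hard spin this timer cannot fire at
  // all, so the stall surfaces as a very large `max` on the first tick
  // after the loop unblocks -- still enough to confirm loop starvation
  // versus a Discord-side problem.
  console.log(
    `[loop-lag] p50=${ms(histogram.percentile(50))}ms ` +
      `p99=${ms(histogram.percentile(99))}ms max=${ms(histogram.max)}ms`,
  );
  histogram.reset();
}, 5_000).unref();
```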
Notes
I can provide more redacted packet/log excerpts if useful. I avoided attaching full logs because the original environment contains secrets and private channel/session details.