[2026.4.26] Gateway main thread stalls under orchestration load; /readyz timeouts and reload deferred behind active runs #73467

@cygnostik

Description

Summary

On OpenClaw 2026.4.26, the gateway intermittently stalls under active orchestration / multi-agent workload. During stalls, the gateway PID remains alive, but local /readyz probes time out and the main Node/OpenClaw gateway thread is CPU-hot while the host is otherwise mostly idle.

This presents externally as Discord/app-command failures, delayed /status output, dropped typing indicators, and delayed agent coordination, but the captured evidence points to gateway/control-plane starvation rather than a Discord-only issue.

Environment

  • OpenClaw: 2026.4.26
  • Runtime: Node 24.x
  • OS: Linux, systemd user service
  • Gateway: local loopback on 127.0.0.1:18789
  • Channel surface involved: Discord DM-only, WebUI/WS, local /readyz
  • Plugins: memory-lancedb-pro present with auto-recall/smart extraction enabled
  • Host had ample RAM; system CPU was mostly idle during captured stalls

Sensitive paths, tokens, hostnames, and private config values are intentionally omitted/redacted.

Observed behavior

I captured multiple stall windows in which:

  • curl --max-time 2 http://127.0.0.1:18789/readyz timed out locally
  • gateway PID stayed alive
  • top -H showed the main OpenClaw/Node gateway thread at ~99.9% CPU
  • overall host CPU stayed mostly idle (typically ~97-98% idle)
  • RSS was around 1.4-1.5 GB, so this was not host memory pressure
  • other V8/libuv/tokio worker threads were mostly sleeping

Representative packet excerpts:

LOOPSTALL_PACKET count=1 trigger_ts=2026-04-28T01:34:44-07:00 pid=229097 readyz=000 2.002326curlfail cpu=34.2 rss_kb=1464220 nlwp=47
Threads:  47 total,   1 running,  46 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.6 us,  0.7 sy,  0.0 ni, 97.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
 229097 argus     20   0   23.8g   1.4g 110144 R  99.9   0.6   8:55.89 opencla+
LOOPSTALL_PACKET count=3 trigger_ts=2026-04-28T01:34:57-07:00 pid=229097 readyz=000 2.002098curlfail cpu=34.6 rss_kb=1466852 nlwp=47
Threads:  47 total,   1 running,  46 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.0 us,  0.8 sy,  0.0 ni, 97.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
 229097 argus     20   0   23.8g   1.4g 110144 R  99.9   0.6   9:08.90 opencla+
LOOPSTALL_PACKET_WAIT count=2 trigger_ts=2026-04-28T01:37:29-07:00 pid=229097 readyz=000 2.002292curlfail cpu=38.3 rss_kb=1544108 nlwp=75
Threads:  75 total,   1 running,  74 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.3 us,  0.1 sy,  0.0 ni, 98.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
 229097 argus     20   0   23.8g   1.5g 110144 R  99.9   0.6   9:54.53 opencla+
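
For reference, a minimal sketch of the /readyz side of the probe loop that produced these excerpts, using Node's built-in fetch (names and intervals here are illustrative; the real capture also recorded the top -H / thread data shown above):

// Minimal /readyz probe, mirroring `curl --max-time 2` with a 2 s abort.
const READYZ_URL = "http://127.0.0.1:18789/readyz";

async function probeReadyz(): Promise<void> {
  const started = Date.now();
  try {
    const res = await fetch(READYZ_URL, { signal: AbortSignal.timeout(2000) });
    console.log(`readyz=${res.status} elapsed_ms=${Date.now() - started}`);
  } catch {
    // A timeout here corresponds to the readyz=000 ... curlfail entries above.
    console.log(`readyz=000 elapsed_ms=${Date.now() - started} probe_failed`);
  }
}

setInterval(probeReadyz, 5000);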

During the same broader incident, reload/restart remained deferred for more than 9 minutes behind active bookkeeping:

2026-04-28T01:56:41-07:00 [reload] restart still deferred after 426856ms with 3 task run(s) active
2026-04-28T01:57:29-07:00 [reload] restart still deferred after 474798ms with 2 operation(s), 1 reply(ies), 1 embedded run(s), 3 task run(s) active
2026-04-28T01:57:59-07:00 [reload] restart still deferred after 505189ms with 2 operation(s), 1 reply(ies), 1 embedded run(s), 3 task run(s) active
2026-04-28T01:58:30-07:00 [reload] restart still deferred after 535678ms with 2 operation(s), 1 reply(ies), 1 embedded run(s), 3 task run(s) active
2026-04-28T01:59:00-07:00 [reload] restart still deferred after 566110ms with 3 task run(s) active

Plugin/memory activity was present in the same windows, but I do not have enough evidence to say it is the root cause. Example log lines from those windows:

2026-04-28T01:57:30-07:00 [plugins] memory-lancedb-pro: auto-recall query truncated from 6874 to 1000 chars
2026-04-28T01:57:32-07:00 [plugins] memory-lancedb-pro: strict governance filtered all candidates; using relaxed fallback recall (...)
2026-04-28T01:57:45-07:00 [plugins] memory-lancedb-pro: auto-recall timed out after 15000ms; skipping memory injection to avoid stalling agent startup
2026-04-28T01:58:05-07:00 [plugins] memory-lancedb-pro: injecting 2 memories into context for agent argus
2026-04-28T01:58:54-07:00 [plugins] memory-lancedb-pro: smart-extracted 1 created, 1 merged, 2 skipped for agent argus

User-visible effects during these windows:

  • Discord /status returned an application failure, and stale status output was then delivered much later, after a newer user message had already been sent
  • Discord typing indicators paused/dropped
  • WebUI/WS operations such as sessions.list and chat.history became delayed or bunched after stalls
  • reload/config changes did not apply while restart remained deferred behind active run bookkeeping
  • orchestration became unsafe because corrective messages and status checks could be delayed/stale

Expected behavior

  • /readyz should remain responsive, or return a degraded status, even during long agent/plugin/model work.
  • Discord/app-command/control-plane traffic should not be blocked by a single CPU-hot gateway event-loop path.
  • Deferred reload/restart should not remain blocked indefinitely by stale reply/embedded/task-run accounting.
  • Long-running plugin hooks or recall work should be bounded/offloaded so they cannot starve gateway control paths (a sketch of what I mean follows this list).
  • If active runs block reload, there should be an inspectable source of truth and a safe timeout/reaper path.
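
On the bounded/offloaded point, a minimal sketch of the kind of guard I have in mind, assuming a generic async plugin hook (runHookWithDeadline and the hook shape are hypothetical, not existing OpenClaw APIs):

// Hypothetical guard: give a plugin hook a deadline and fall back if it
// overruns, so slow recall work cannot hold up gateway control paths.
async function runHookWithDeadline<T>(
  hook: () => Promise<T>,
  deadlineMs: number,
  fallback: T,
): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timedOut = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), deadlineMs);
  });
  try {
    return await Promise.race([hook(), timedOut]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

A deadline like this only bounds waiting on asynchronous work; if a hook spins the event loop synchronously (which is what the ~99.9% main-thread CPU above suggests), the work would have to be offloaded to a worker thread for /readyz to stay responsive.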

Why this looks gateway/control-plane related

This is not just Discord channel flakiness:

  • the local /readyz endpoint timed out on loopback while the process stayed alive
  • the main gateway thread was running at ~99.9% CPU during those timeouts
  • the host was mostly idle and not RAM constrained
  • reload/restart bookkeeping reported active reply/embedded/task runs for many minutes
  • stale /status output was later delivered after the user had moved on

Related / possibly relevant issues

This looks distinct from, but likely related to, prior gateway/event-loop/resource issues.

The case here is specifically 2026.4.26 gateway runtime behavior: local /readyz stalls, main gateway thread CPU spin, and reload deferred behind active-run accounting.

What would help

  • Recommended way to capture a V8/Node CPU profile or event-loop delay trace from a live gateway during a stall
  • Built-in gateway event-loop lag diagnostics exposed in logs or /readyz (a sketch of what I mean follows this list)
  • A hard timeout/reaper for stale active reply / embedded run / task-run accounting
  • A way to inspect active reply/embedded/task-run blockers directly
  • Guidance on whether plugin before_prompt_build hooks can block the gateway event loop
  • A safe way to force reload/restart after a grace period without waiting indefinitely on stale task bookkeeping
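
For the event-loop lag item above, a minimal sketch of the kind of sampler I mean, using plain Node perf_hooks (this is the shape of data I would like logs or /readyz to expose; it is not an existing OpenClaw API):

// Minimal event-loop delay sampler using Node's built-in perf_hooks.
// Histogram values are reported in nanoseconds.
import { monitorEventLoopDelay } from "node:perf_hooks";

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

setInterval(() => {
  const ms = (ns: number) => (ns / 1e6).toFixed(1);
  console.log(
    `[diag] loop_delay p50=${ms(loopDelay.percentile(50))}ms ` +
      `p99=${ms(loopDelay.percentile(99))}ms max=${ms(loopDelay.max)}ms`,
  );
  loopDelay.reset();
}, 10000).unref();

For the live CPU profile item, my fallback plan is to activate the inspector on the running gateway (Node enables it on SIGUSR1 on Linux) and record a profile from DevTools, but I am not confident that works while the main thread is pinned, hence the ask for a recommended approach.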

Notes

I can provide more redacted packet/log excerpts if useful. I avoided attaching full logs because the original environment contains secrets and private channel/session details.
