[2026.4.26] Gateway main thread stalls under orchestration load; /readyz timeouts and reload deferred behind active runs #73467

@cygnostik

Description

Summary

On OpenClaw 2026.4.26, the gateway intermittently stalls under active orchestration / multi-agent workload. During stalls, the gateway PID remains alive, but local /readyz probes time out and the main Node/OpenClaw gateway thread is CPU-hot while the host is otherwise mostly idle.

This presents externally as Discord/app-command failures, delayed /status output, dropped typing indicators, and delayed agent coordination, but the captured evidence points to gateway/control-plane starvation rather than a Discord-only issue.

Environment

  • OpenClaw: 2026.4.26
  • Runtime: Node 24.x
  • OS: Linux, systemd user service
  • Gateway: local loopback on 127.0.0.1:18789
  • Channel surface involved: Discord DM-only, WebUI/WS, local /readyz
  • Plugins: memory-lancedb-pro present with auto-recall/smart extraction enabled
  • Host had ample RAM; system CPU was mostly idle during captured stalls

Sensitive paths, tokens, hostnames, and private config values are intentionally omitted/redacted.

Observed behavior

I captured multiple stall windows in which:

  • curl --max-time 2 http://127.0.0.1:18789/readyz timed out locally
  • gateway PID stayed alive
  • top -H showed the main OpenClaw/Node gateway thread at ~99.9% CPU
  • overall host CPU stayed mostly idle (typically ~97-98% idle)
  • RSS was around 1.4-1.5 GB, so this was not host memory pressure
  • other V8/libuv/tokio worker threads were mostly sleeping

Representative packet excerpts:

LOOPSTALL_PACKET count=1 trigger_ts=2026-04-28T01:34:44-07:00 pid=229097 readyz=000 2.002326curlfail cpu=34.2 rss_kb=1464220 nlwp=47
Threads:  47 total,   1 running,  46 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.6 us,  0.7 sy,  0.0 ni, 97.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
 229097 argus     20   0   23.8g   1.4g 110144 R  99.9   0.6   8:55.89 opencla+
LOOPSTALL_PACKET count=3 trigger_ts=2026-04-28T01:34:57-07:00 pid=229097 readyz=000 2.002098curlfail cpu=34.6 rss_kb=1466852 nlwp=47
Threads:  47 total,   1 running,  46 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.0 us,  0.8 sy,  0.0 ni, 97.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
 229097 argus     20   0   23.8g   1.4g 110144 R  99.9   0.6   9:08.90 opencla+
LOOPSTALL_PACKET_WAIT count=2 trigger_ts=2026-04-28T01:37:29-07:00 pid=229097 readyz=000 2.002292curlfail cpu=38.3 rss_kb=1544108 nlwp=75
Threads:  75 total,   1 running,  74 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.3 us,  0.1 sy,  0.0 ni, 98.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
 229097 argus     20   0   23.8g   1.5g 110144 R  99.9   0.6   9:54.53 opencla+
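
For reference, a minimal sketch of the /readyz side of the probe loop that produced these excerpts, using Node's built-in fetch (names and intervals here are illustrative; the real capture also recorded the top -H / thread data shown above):

// Minimal /readyz probe, mirroring `curl --max-time 2` with a 2 s abort.
const READYZ_URL = "http://127.0.0.1:18789/readyz";

async function probeReadyz(): Promise<void> {
  const started = Date.now();
  try {
    const res = await fetch(READYZ_URL, { signal: AbortSignal.timeout(2000) });
    console.log(`readyz=${res.status} elapsed_ms=${Date.now() - started}`);
  } catch {
    // A timeout here corresponds to the readyz=000 ... curlfail entries above.
    console.log(`readyz=000 elapsed_ms=${Date.now() - started} probe_failed`);
  }
}

setInterval(probeReadyz, 5000);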

During the same broader incident, reload/restart remained deferred for more than 9 minutes behind active bookkeeping:

2026-04-28T01:56:41-07:00 [reload] restart still deferred after 426856ms with 3 task run(s) active
2026-04-28T01:57:29-07:00 [reload] restart still deferred after 474798ms with 2 operation(s), 1 reply(ies), 1 embedded run(s), 3 task run(s) active
2026-04-28T01:57:59-07:00 [reload] restart still deferred after 505189ms with 2 operation(s), 1 reply(ies), 1 embedded run(s), 3 task run(s) active
2026-04-28T01:58:30-07:00 [reload] restart still deferred after 535678ms with 2 operation(s), 1 reply(ies), 1 embedded run(s), 3 task run(s) active
2026-04-28T01:59:00-07:00 [reload] restart still deferred after 566110ms with 3 task run(s) active

Plugin/memory activity was present in the same windows, but I do not have enough evidence to say it is the root cause. Example log lines from those windows:

2026-04-28T01:57:30-07:00 [plugins] memory-lancedb-pro: auto-recall query truncated from 6874 to 1000 chars
2026-04-28T01:57:32-07:00 [plugins] memory-lancedb-pro: strict governance filtered all candidates; using relaxed fallback recall (...)
2026-04-28T01:57:45-07:00 [plugins] memory-lancedb-pro: auto-recall timed out after 15000ms; skipping memory injection to avoid stalling agent startup
2026-04-28T01:58:05-07:00 [plugins] memory-lancedb-pro: injecting 2 memories into context for agent argus
2026-04-28T01:58:54-07:00 [plugins] memory-lancedb-pro: smart-extracted 1 created, 1 merged, 2 skipped for agent argus

User-visible effects during these windows:

  • Discord /status returned an application failure, and stale status output was then delivered much later, after a newer user message had already been sent
  • Discord typing indicators paused/dropped
  • WebUI/WS operations such as sessions.list and chat.history became delayed or bunched after stalls
  • reload/config changes did not apply while restart remained deferred behind active run bookkeeping
  • orchestration became unsafe because corrective messages and status checks could be delayed/stale

Expected behavior

  • /readyz should remain responsive, or return a degraded status, even during long agent/plugin/model work.
  • Discord/app-command/control-plane traffic should not be blocked by a single CPU-hot gateway event-loop path.
  • Deferred reload/restart should not remain blocked indefinitely by stale reply/embedded/task-run accounting.
  • Long-running plugin hooks or recall work should be bounded/offloaded so they cannot starve gateway control paths (a sketch of what I mean follows this list).
  • If active runs block reload, there should be an inspectable source of truth and a safe timeout/reaper path.
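
On the bounded/offloaded point, a minimal sketch of the kind of guard I have in mind, assuming a generic async plugin hook (runHookWithDeadline and the hook shape are hypothetical, not existing OpenClaw APIs):

// Hypothetical guard: give a plugin hook a deadline and fall back if it
// overruns, so slow recall work cannot hold up gateway control paths.
async function runHookWithDeadline<T>(
  hook: () => Promise<T>,
  deadlineMs: number,
  fallback: T,
): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timedOut = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), deadlineMs);
  });
  try {
    return await Promise.race([hook(), timedOut]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

A deadline like this only bounds waiting on asynchronous work; if a hook spins the event loop synchronously (which is what the ~99.9% main-thread CPU above suggests), the work would have to be offloaded to a worker thread for /readyz to stay responsive.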

Why this looks gateway/control-plane related

This is not just Discord channel flakiness:

  • the local /readyz endpoint timed out on loopback while the process stayed alive
  • the main gateway thread was running at ~99.9% CPU during those timeouts
  • the host was mostly idle and not RAM constrained
  • reload/restart bookkeeping reported active reply/embedded/task runs for many minutes
  • stale /status output was later delivered after the user had moved on

Related / possibly relevant issues

This looks distinct from, but likely related to, prior gateway/event-loop/resource issues.

The case here is specifically 2026.4.26 gateway runtime behavior: local /readyz stalls, main gateway thread CPU spin, and reload deferred behind active-run accounting.

What would help

  • Recommended way to capture a V8/Node CPU profile or event-loop delay trace from a live gateway during a stall
  • Built-in gateway event-loop lag diagnostics exposed in logs or /readyz (a sketch of what I mean follows this list)
  • A hard timeout/reaper for stale active reply / embedded run / task-run accounting
  • A way to inspect active reply/embedded/task-run blockers directly
  • Guidance on whether plugin before_prompt_build hooks can block the gateway event loop
  • A safe way to force reload/restart after a grace period without waiting indefinitely on stale task bookkeeping
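
For the event-loop lag item above, a minimal sketch of the kind of sampler I mean, using plain Node perf_hooks (this is the shape of data I would like logs or /readyz to expose; it is not an existing OpenClaw API):

// Minimal event-loop delay sampler using Node's built-in perf_hooks.
// Histogram values are reported in nanoseconds.
import { monitorEventLoopDelay } from "node:perf_hooks";

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

setInterval(() => {
  const ms = (ns: number) => (ns / 1e6).toFixed(1);
  console.log(
    `[diag] loop_delay p50=${ms(loopDelay.percentile(50))}ms ` +
      `p99=${ms(loopDelay.percentile(99))}ms max=${ms(loopDelay.max)}ms`,
  );
  loopDelay.reset();
}, 10000).unref();

For the live CPU profile item, my fallback plan is to activate the inspector on the running gateway (Node enables it on SIGUSR1 on Linux) and record a profile from DevTools, but I am not confident that works while the main thread is pinned, hence the ask for a recommended approach.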

Notes

I can provide more redacted packet/log excerpts if useful. I avoided attaching full logs because the original environment contains secrets and private channel/session details.
