Summary
We hit a production-impacting regression after upgrading OpenClaw beyond 2026.4.23. In our environment, every tested version after 2026.4.23 showed instability or severe degradation in real use, including 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, beta attempts, and 2026.4.29. The system was usable again only after rolling back to 2026.4.23.
The visible symptom is that the Gateway / Control UI appears to hang while loading session-related surfaces. Locally, the strongest evidence is slow sessions.list, models.list, and node.list responses, event-loop delay/utilization diagnostics, and CPU saturation.
This looks related to existing reports about event-loop saturation, slow/unbounded sessions.list, Control UI polling, stuck sessions, and runtime-deps issues around 2026.4.26 / 2026.4.27 / current builds.
Environment
- Host OS: Linux 6.8.0-110-generic x64
- Node: v24.14.1
- Gateway: systemd user service
- Gateway bind: loopback (127.0.0.1:18789)
- Current stable rollback version: OpenClaw 2026.4.23 (a979721)
- Affected versions observed across the incident sequence: every tested version after 2026.4.23, including 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, beta attempts, and 2026.4.29
- Channels/plugins in use: Telegram, Control UI/webchat, ACP/Codex-related tooling, and browser/device-pair/talk-voice, among others
What happened
We had 2026.4.27 apparently working after addressing runtime dependency issues around memory-core, chokidar, and sqlite-vec. After a config/model correction and a gateway/server restart, the instance became heavily degraded. We then tried newer builds including 2026.4.29, but the symptoms remained. Rolling back to 2026.4.23 restored practical stability.
I am not claiming the config/model correction is the root cause; it may simply have been the restart that exposed the regression. The observed failure pattern points more strongly to gateway event-loop/session-list/provider-loading behavior.
Local evidence from the last affected attempt (2026.4.29)
From journalctl --user -u openclaw-gateway.service around 2026-04-30 19:07-19:10 ART:
19:07:55 [ws] res sessions.list 38151ms
19:07:55 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=64s eventLoopDelayP99Ms=24679.3 eventLoopDelayMaxMs=24679.3 eventLoopUtilization=1 cpuCoreRatio=1.085 active=0 waiting=0 queued=0
19:08:16 [ws] res models.list 59604ms
19:08:16 [ws] handshake timeout
19:08:33 gateway SIGUSR1 restart
19:09:21 [ws] res sessions.list 13437ms
19:09:21 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=39s eventLoopDelayP99Ms=14327.7 eventLoopDelayMaxMs=14327.7 eventLoopUtilization=0.993 cpuCoreRatio=1.035 active=0 waiting=0 queued=0
19:10:02 Stopping openclaw-gateway.service - OpenClaw Gateway (v2026.4.29)
19:10:05 Stopped openclaw-gateway.service - OpenClaw Gateway (v2026.4.29), CPU 2min44s, memory peak 1.6G
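For context on the liveness numbers above (eventLoopDelayP99Ms/eventLoopDelayMaxMs, eventLoopUtilization): they appear to correspond to Node's standard perf_hooks primitives. A minimal standalone sketch of how such values can be sampled, for reference when reading the warnings; this is not OpenClaw's own diagnostic code:

```ts
// Reference sketch only (not OpenClaw's diagnostic code). Samples event-loop
// delay percentiles and utilization with Node's perf_hooks, which is what the
// liveness warning's fields appear to report (delay p99/max in ms,
// utilization in [0, 1]).
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

const delay = monitorEventLoopDelay({ resolution: 20 });
delay.enable();
let lastElu = performance.eventLoopUtilization();

setInterval(() => {
  const nowElu = performance.eventLoopUtilization();
  const elu = performance.eventLoopUtilization(nowElu, lastElu); // delta over the interval
  lastElu = nowElu;

  console.log({
    eventLoopDelayP99Ms: delay.percentile(99) / 1e6, // histogram values are nanoseconds
    eventLoopDelayMaxMs: delay.max / 1e6,
    eventLoopUtilization: elu.utilization, // ~1 means the loop never goes idle
  });
  delay.reset();
}, 60_000);
```

Sustained utilization near 1 together with multi-second p99 delay means long synchronous work is blocking the loop, which is consistent with the 13-60s request timings above rather than with I/O wait.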
After rollback to 2026.4.23:
19:15:34 Started openclaw-gateway.service - OpenClaw Gateway (v2026.4.23)
19:16:46 gateway ready (6 plugins; 66.8s)
19:19:12 sessions.list 518ms
19:19:32 sessions.list 454ms, chat.history 521ms, models.list 316ms
19:23:52 sessions.list 694ms, chat.history 1103ms, health 1131ms
19:30-19:33 sessions.list roughly 330-690ms, chat.history roughly 50-500ms
Current verification after rollback:
OpenClaw 2026.4.23 (a979721)
Gateway probe: ok
Gateway status: systemd active, connectivity probe ok
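The "Gateway probe: ok" line is a basic connectivity check against the loopback bind. For anyone reproducing, a rough stand-in is to time the raw WebSocket handshake; this is hypothetical, not the OpenClaw CLI's own probe, and the real gateway may require an auth token or a specific URL path:

```ts
// Hypothetical probe: times the WebSocket handshake against the gateway bind
// listed under Environment (127.0.0.1:18789), using Node's built-in WebSocket
// client (Node >= 22). A handshake that takes seconds, or never opens, matches
// the "[ws] handshake timeout" behavior seen on the affected versions.
const started = performance.now();
const ws = new WebSocket("ws://127.0.0.1:18789");

ws.addEventListener("open", () => {
  console.log(`handshake ok in ${(performance.now() - started).toFixed(0)}ms`);
  ws.close();
});
ws.addEventListener("error", () => {
  console.error(`handshake failed after ${(performance.now() - started).toFixed(0)}ms`);
});
```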
Local state that may amplify the bug
This instance has a relatively large session/transcript footprint. Current session directory sizes:
2.5G /root/.openclaw/agents/scout-localidades/sessions
379M /root/.openclaw/agents/main/sessions
349M /root/.openclaw/agents/scout-artistas/sessions
312M /root/.openclaw/agents/bruno/sessions
266M /root/.openclaw/agents/validator/sessions
151M /root/.openclaw/agents/research/sessions
150M /root/.openclaw/agents/frankie/sessions
There is also a previously archived checkpoint bundle outside the hot sessions path:
4.5G /root/.openclaw/archive/session-checkpoints-2026-04-27-incident
This likely amplifies sessions.list / transcript scanning / Control UI behavior, but it does not seem to be the sole cause: the same local state is usable again on 2026.4.23.
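If entry counts are also useful for judging how much work an unbounded sessions.list scan has to do here, a small sketch like the following reports per-agent entry counts and totals. It assumes the /root/.openclaw/agents/<agent>/sessions layout visible in the listing above; adjust the root path for other installs:

```ts
// Sketch: per-agent session footprint (entry count + total bytes) under the
// layout shown above (/root/.openclaw/agents/<agent>/sessions). Assumed layout,
// not an OpenClaw tool.
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";

const root = "/root/.openclaw/agents";

function dirBytes(path: string): number {
  let total = 0;
  for (const entry of readdirSync(path, { withFileTypes: true })) {
    const full = join(path, entry.name);
    if (entry.isDirectory()) total += dirBytes(full);
    else if (entry.isFile()) total += statSync(full).size;
  }
  return total;
}

for (const agent of readdirSync(root)) {
  const sessions = join(root, agent, "sessions");
  let entries: string[];
  try {
    entries = readdirSync(sessions);
  } catch {
    continue; // agent without a sessions directory
  }
  const gib = dirBytes(sessions) / 1024 ** 3;
  console.log(`${agent}: ${entries.length} entries, ${gib.toFixed(2)} GiB`);
}
```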
Related issues that look relevant
These existing issues seem strongly related:
- Control UI remains slow although sessions.list returns quickly (#64004)
- sessions.list slow: N+1 transcript fallback + full row build before limit
- 2026.4.27 runtime-deps issues around sqlite-vec, chokidar, and auto-loaded memory-core
Expected behavior
A stable release newer than 2026.4.23 should not saturate the gateway event loop or make Control UI/session/model surfaces take 10-60 seconds on the same local state that remains usable on 2026.4.23.
Actual behavior
On tested versions after 2026.4.23, especially 2026.4.27/2026.4.29, the gateway becomes heavily degraded:
- sessions.list: 13-38s
- models.list: ~59s
- node.list / device.pair.list: can also take ~14s+
- WebSocket handshake timeouts
- event-loop delay warnings with p99/max in the 14-24s range
- eventLoopUtilization ~1
- CPU usage around one fully saturated core
Rollback to 2026.4.23 restores usable behavior.
Question
Is this expected to be covered by the fixes for the issues above, or should this be tracked as a separate regression? I can provide more logs/details if useful, but I wanted to report the concrete version-to-version behavior and timings from a real production-like OpenClaw state.