
Gateway event-loop saturation and very slow sessions.list/models.list on all tested versions after 2026.4.23; rollback restores stability #75297

@lisandromachado

Description


Summary

We hit a production-impacting regression after upgrading OpenClaw beyond 2026.4.23. In our environment, every tested version after 2026.4.23 showed instability or severe degradation in real use, including 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, beta attempts, and 2026.4.29. The system was usable again only after rolling back to 2026.4.23.

The visible symptom is that the Gateway / Control UI appears to hang while loading session-related surfaces. Locally, the strongest evidence is in sessions.list/models.list/node.list response times, event-loop delay diagnostics, and CPU saturation.

This looks related to existing reports about event-loop saturation, slow/unbounded sessions.list, Control UI polling, stuck sessions, and runtime-deps issues around 2026.4.26 / 2026.4.27 / current builds.

Environment

  • Host OS: Linux 6.8.0-110-generic x64
  • Node: v24.14.1
  • Gateway: systemd user service
  • Gateway bind: loopback 127.0.0.1:18789
  • Current stable rollback version: OpenClaw 2026.4.23 (a979721)
  • Affected versions observed across the incident sequence: every tested version after 2026.4.23, including 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, beta attempts, and 2026.4.29
  • Channels/plugins in use include Telegram, Control UI/webchat, ACP/Codex-related tooling, browser/device-pair/talk-voice, etc.

What happened

2026.4.27 initially appeared to work after we addressed runtime dependency issues around memory-core, chokidar, and sqlite-vec. After a config/model correction and a gateway/server restart, the instance became heavily degraded. We then tried newer builds, including 2026.4.29, but the symptoms remained. Rolling back to 2026.4.23 restored practical stability.

I am not claiming the config/model correction is the root cause; it may simply have been the restart that exposed the regression. The observed failure pattern points more strongly to gateway event-loop/session-list/provider-loading behavior.

Local evidence from the last affected attempt (2026.4.29)

From journalctl --user -u openclaw-gateway.service around 2026-04-30 19:07-19:10 ART:

19:07:55 [ws] res sessions.list 38151ms
19:07:55 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=64s eventLoopDelayP99Ms=24679.3 eventLoopDelayMaxMs=24679.3 eventLoopUtilization=1 cpuCoreRatio=1.085 active=0 waiting=0 queued=0
19:08:16 [ws] res models.list 59604ms
19:08:16 [ws] handshake timeout
19:08:33 gateway SIGUSR1 restart
19:09:21 [ws] res sessions.list 13437ms
19:09:21 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=39s eventLoopDelayP99Ms=14327.7 eventLoopDelayMaxMs=14327.7 eventLoopUtilization=0.993 cpuCoreRatio=1.035 active=0 waiting=0 queued=0
19:10:02 Stopping openclaw-gateway.service - OpenClaw Gateway (v2026.4.29)
19:10:05 Stopped openclaw-gateway.service - OpenClaw Gateway (v2026.4.29), CPU 2min44s, memory peak 1.6G
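
For reproducibility, the slow responses can be pulled out of the journal with a filter like the one below. This is only a sketch: the `[ws] res <method> <ms>ms` line shape is taken from the excerpt above and may differ on other setups, and a pasted sample stands in for the real `journalctl --user -u openclaw-gateway.service` output.

```shell
# Extract ws responses slower than 5s from gateway log lines.
# The heredoc sample stands in for:
#   journalctl --user -u openclaw-gateway.service --since "19:07" --until "19:11"
awk '$2 == "[ws]" && $3 == "res" && $5 ~ /ms$/ {
  ms = $5; sub(/ms$/, "", ms)           # strip the "ms" suffix
  if (ms + 0 > 5000) print $1, $4, ms "ms"
}' <<'EOF'
19:07:55 [ws] res sessions.list 38151ms
19:08:16 [ws] res models.list 59604ms
19:09:21 [ws] res sessions.list 13437ms
19:19:12 [ws] res sessions.list 518ms
EOF
```

With this sample, only the three 13-60s responses are printed; the post-rollback 518ms line is filtered out, which matches the before/after contrast reported here.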

After rollback to 2026.4.23:

19:15:34 Started openclaw-gateway.service - OpenClaw Gateway (v2026.4.23)
19:16:46 gateway ready (6 plugins; 66.8s)
19:19:12 sessions.list 518ms
19:19:32 sessions.list 454ms, chat.history 521ms, models.list 316ms
19:23:52 sessions.list 694ms, chat.history 1103ms, health 1131ms
19:30-19:33 sessions.list roughly 330-690ms, chat.history roughly 50-500ms

Current verification after rollback:

OpenClaw 2026.4.23 (a979721)
Gateway probe: ok
Gateway status: systemd active, connectivity probe ok

Local state that may amplify the bug

This instance has a relatively large session/transcript footprint. Current session directory sizes:

2.5G  /root/.openclaw/agents/scout-localidades/sessions
379M  /root/.openclaw/agents/main/sessions
349M  /root/.openclaw/agents/scout-artistas/sessions
312M  /root/.openclaw/agents/bruno/sessions
266M  /root/.openclaw/agents/validator/sessions
151M  /root/.openclaw/agents/research/sessions
150M  /root/.openclaw/agents/frankie/sessions
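
These numbers came from a du-style scan of the session directories; on a similar layout they can be regenerated with something like the following (a sketch; `OPENCLAW_HOME` is a hypothetical variable standing in for the install root, which is `/root/.openclaw` on this instance):

```shell
# Per-agent session footprint, largest first.
# OPENCLAW_HOME is an assumption, not an OpenClaw-defined variable.
du -sh "${OPENCLAW_HOME:-$HOME/.openclaw}"/agents/*/sessions 2>/dev/null | sort -rh
```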

There is also a previously archived checkpoint bundle outside the hot sessions path:

4.5G /root/.openclaw/archive/session-checkpoints-2026-04-27-incident

This likely amplifies sessions.list / transcript scanning / Control UI behavior, but it does not seem to be the sole cause: the same local state is usable again on 2026.4.23.

Related issues that look relevant

These existing issues seem strongly related:

Expected behavior

A stable release newer than 2026.4.23 should not saturate the gateway event loop or make Control UI/session/model surfaces take 10-60 seconds on the same local state that remains usable on 2026.4.23.

Actual behavior

On tested versions after 2026.4.23, especially 2026.4.27/2026.4.29, the gateway becomes heavily degraded:

  • sessions.list: 13-38s
  • models.list: ~59s
  • node.list/device.pair.list: can also take ~14s+
  • WebSocket handshake timeouts
  • event-loop delay warnings with p99/max >14-24s
  • eventLoopUtilization ~1
  • CPU around one saturated core

Rollback to 2026.4.23 restores usable behavior.

Question

Is this expected to be covered by the fixes for the issues above, or should this be tracked as a separate regression? I can provide more logs/details if useful, but I wanted to report the concrete version-to-version behavior and timings from a real production-like OpenClaw state.
