
Gateway event-loop saturation and very slow sessions.list/models.list on all tested versions after 2026.4.23; rollback restores stability #75297

@lisandromachado

Description


Summary

We hit a production-impacting regression after upgrading OpenClaw beyond 2026.4.23. In our environment, every tested version after 2026.4.23 showed instability or severe degradation in real use, including 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, beta attempts, and 2026.4.29. The system was usable again only after rolling back to 2026.4.23.

The visible symptom is that the Gateway / Control UI appears to hang while loading session-related surfaces. Locally, the strongest evidence is in sessions.list/models.list/node.list response times, event-loop delay diagnostics, and CPU saturation.

This looks related to existing reports about event-loop saturation, slow/unbounded sessions.list, Control UI polling, stuck sessions, and runtime-deps issues around 2026.4.26 / 2026.4.27 / current builds.

Environment

  • Host OS: Linux 6.8.0-110-generic x64
  • Node: v24.14.1
  • Gateway: systemd user service
  • Gateway bind: loopback 127.0.0.1:18789
  • Current stable rollback version: OpenClaw 2026.4.23 (a979721)
  • Affected versions observed across the incident sequence: every tested version after 2026.4.23, including 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, beta attempts, and 2026.4.29
  • Channels/plugins in use include Telegram, Control UI/webchat, ACP/Codex-related tooling, browser/device-pair/talk-voice, etc.

What happened

2026.4.27 initially appeared to work after we addressed runtime dependency issues around memory-core, chokidar, and sqlite-vec. After a config/model correction and a gateway/server restart, the instance became heavily degraded. We then tried newer builds, including 2026.4.29, but the symptoms remained. Rolling back to 2026.4.23 restored practical stability.

I am not claiming the config/model correction is the root cause; it may simply have been the restart that exposed the regression. The observed failure pattern points more strongly to gateway event-loop/session-list/provider-loading behavior.

Local evidence from the last affected attempt (2026.4.29)

From journalctl --user -u openclaw-gateway.service around 2026-04-30 19:07-19:10 ART:

19:07:55 [ws] res sessions.list 38151ms
19:07:55 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=64s eventLoopDelayP99Ms=24679.3 eventLoopDelayMaxMs=24679.3 eventLoopUtilization=1 cpuCoreRatio=1.085 active=0 waiting=0 queued=0
19:08:16 [ws] res models.list 59604ms
19:08:16 [ws] handshake timeout
19:08:33 gateway SIGUSR1 restart
19:09:21 [ws] res sessions.list 13437ms
19:09:21 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=39s eventLoopDelayP99Ms=14327.7 eventLoopDelayMaxMs=14327.7 eventLoopUtilization=0.993 cpuCoreRatio=1.035 active=0 waiting=0 queued=0
19:10:02 Stopping openclaw-gateway.service - OpenClaw Gateway (v2026.4.29)
19:10:05 Stopped openclaw-gateway.service - OpenClaw Gateway (v2026.4.29), CPU 2min44s, memory peak 1.6G
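
For reproducibility, the slow responses can be pulled out of the journal with a filter like the one below. This is only a sketch: the `[ws] res <method> <ms>ms` line shape is taken from the excerpt above and may differ on other setups, and a pasted sample stands in for the real `journalctl --user -u openclaw-gateway.service` output.

```shell
# Extract ws responses slower than 5s from gateway log lines.
# The heredoc sample stands in for:
#   journalctl --user -u openclaw-gateway.service --since "19:07" --until "19:11"
awk '$2 == "[ws]" && $3 == "res" && $5 ~ /ms$/ {
  ms = $5; sub(/ms$/, "", ms)           # strip the "ms" suffix
  if (ms + 0 > 5000) print $1, $4, ms "ms"
}' <<'EOF'
19:07:55 [ws] res sessions.list 38151ms
19:08:16 [ws] res models.list 59604ms
19:09:21 [ws] res sessions.list 13437ms
19:19:12 [ws] res sessions.list 518ms
EOF
```

With this sample, only the three 13-60s responses are printed; the post-rollback 518ms line is filtered out, which matches the before/after contrast reported here.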

After rollback to 2026.4.23:

19:15:34 Started openclaw-gateway.service - OpenClaw Gateway (v2026.4.23)
19:16:46 gateway ready (6 plugins; 66.8s)
19:19:12 sessions.list 518ms
19:19:32 sessions.list 454ms, chat.history 521ms, models.list 316ms
19:23:52 sessions.list 694ms, chat.history 1103ms, health 1131ms
19:30-19:33 sessions.list roughly 330-690ms, chat.history roughly 50-500ms

Current verification after rollback:

OpenClaw 2026.4.23 (a979721)
Gateway probe: ok
Gateway status: systemd active, connectivity probe ok

Local state that may amplify the bug

This instance has a relatively large session/transcript footprint. Current session directory sizes:

2.5G  /root/.openclaw/agents/scout-localidades/sessions
379M  /root/.openclaw/agents/main/sessions
349M  /root/.openclaw/agents/scout-artistas/sessions
312M  /root/.openclaw/agents/bruno/sessions
266M  /root/.openclaw/agents/validator/sessions
151M  /root/.openclaw/agents/research/sessions
150M  /root/.openclaw/agents/frankie/sessions
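
These numbers came from a du-style scan of the session directories; on a similar layout they can be regenerated with something like the following (a sketch; `OPENCLAW_HOME` is a hypothetical variable standing in for the install root, which is `/root/.openclaw` on this instance):

```shell
# Per-agent session footprint, largest first.
# OPENCLAW_HOME is an assumption, not an OpenClaw-defined variable.
du -sh "${OPENCLAW_HOME:-$HOME/.openclaw}"/agents/*/sessions 2>/dev/null | sort -rh
```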

There is also a previously archived checkpoint bundle outside the hot sessions path:

4.5G /root/.openclaw/archive/session-checkpoints-2026-04-27-incident

This likely amplifies sessions.list / transcript scanning / Control UI behavior, but it does not seem to be the sole cause: the same local state is usable again on 2026.4.23.

Related issues that look relevant

These existing issues seem strongly related:

Expected behavior

A stable release newer than 2026.4.23 should not saturate the gateway event loop or make Control UI/session/model surfaces take 10-60 seconds on the same local state that remains usable on 2026.4.23.

Actual behavior

On tested versions after 2026.4.23, especially 2026.4.27/2026.4.29, the gateway becomes heavily degraded:

  • sessions.list: 13-38s
  • models.list: ~59s
  • node.list/device.pair.list: can also take ~14s+
  • WebSocket handshake timeouts
  • event-loop delay warnings with p99/max >14-24s
  • eventLoopUtilization ~1
  • CPU around one saturated core

Rollback to 2026.4.23 restores usable behavior.

Question

Is this expected to be covered by the fixes for the issues above, or should this be tracked as a separate regression? I can provide more logs/details if useful, but I wanted to report the concrete version-to-version behavior and timings from a real production-like OpenClaw state.
