[Bug] sessions.list/send gateway timeout after event loop stall (v2026.5.2) #76421

@ryswork1993

Bug Description

Gateway WebSocket calls (sessions.list, sessions.send, sessions_history, etc.) time out after 10s when the event loop is blocked by compaction.

Environment

  • Version: OpenClaw 2026.5.2 (8b2a6e5)
  • Node: v22.22.2
  • Host: Linux 6.17, 10GB RAM, 4-core
  • Gateway: ws://127.0.0.1:18902, bind=lan, mode=local
  • Agents configured: 9 agents (main, pm1, director1, frontend1, backend1, qa1, clerk, devops1, reviewer)

Symptoms

  1. sessions.list / sessions_send timeout: tool calls to sessions_list, sessions_send, and sessions_history all fail with "gateway timeout after 10000ms"
  2. Event loop severely blocked: the log shows repeated liveness warnings with eventLoopDelayP99Ms up to 15000ms+ and eventLoopUtilization=1
  3. sessions.list response time: normally ~800ms, spiking to 72,826ms+ during a compaction stall and eventually timing out
  4. Root cause: compaction of a large transcript (the main session, which has extensive history) blocks the event loop for 10-15 seconds, so all WebSocket requests queue up and time out
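
To illustrate the mechanism described in the symptoms, here is a minimal sketch of how synchronous work on the main thread delays everything queued behind it. It is not OpenClaw code: blockEventLoop is a hypothetical stand-in for a CPU-bound compaction pass, and measureTimerLag is an illustrative helper. It only assumes standard Node.js timers and perf_hooks.

```typescript
import { performance } from "node:perf_hooks";

// Hypothetical stand-in for a CPU-bound, synchronous compaction pass.
function blockEventLoop(ms: number): void {
  const end = performance.now() + ms;
  while (performance.now() < end) {
    // Busy-wait: nothing else on this thread can run until the loop exits.
  }
}

// Resolves with how late (in ms) a 0 ms timer fired when synchronous work
// occupied the thread right after the timer was scheduled.
function measureTimerLag(blockMs: number): Promise<number> {
  return new Promise((resolve) => {
    const scheduledAt = performance.now();
    setTimeout(() => resolve(performance.now() - scheduledAt), 0);
    blockEventLoop(blockMs);
  });
}
```

Calling measureTimerLag(200) should report a lag of at least roughly 200 ms, because the timer callback cannot run until the blocking work returns. A WebSocket request handler waiting behind a 10-15 second stall would be delayed in the same way, which is consistent with the 10000ms gateway timeouts above.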

Log Evidence

[2026-05-03T11:06:40.759+08:00] liveness warning: reasons=event_loop_delay,cpu eventLoopDelayP99Ms=1756.4 eventLoopDelayMaxMs=2115 eventLoopUtilization=0.858
[2026-05-03T11:06:54.565+08:00] [tools] sessions_list failed: gateway timeout after 10000ms
[2026-05-03T11:11:56.319+08:00] agent cleanup timed out: runId=... sessionId=...
[2026-05-03T11:18:43.947+08:00] sessions.list 101231ms
[2026-05-03T11:18:49.607+08:00] sessions.usage 89006ms

Current Compaction Config

agents.defaults.compaction.maxActiveTranscriptBytes: "15mb"
agents.defaults.compaction.truncateAfterCompaction: true

Impact

  • Inter-agent communication (main → pm1, main → director1) completely broken during compaction
  • Heartbeat / cron tasks fail when they depend on sessions_list
  • Gateway probe/status commands fail (openclaw status hangs)

Questions / Requests

  1. Can compaction be made non-blocking (run in a background thread or worker)?
  2. Is there a way to limit compaction CPU impact so it does not freeze the gateway?
  3. Should sessions.list/usage have their own timeout/queue management separate from compaction stalls?
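
As one possible direction for request 2, here is a rough sketch of cooperative time-slicing: the work is split into items, and the loop yields to the event loop whenever a time budget is used up. This is an assumption about how compaction could be restructured, not a description of the current implementation; processInSlices, yieldToEventLoop, and budgetMs are hypothetical names.

```typescript
import { performance } from "node:perf_hooks";

// Yields to the event loop so queued I/O callbacks (for example pending
// WebSocket requests) get a chance to run before the next slice starts.
function yieldToEventLoop(): Promise<void> {
  return new Promise((resolve) => setImmediate(resolve));
}

// Runs `handle` over `items`, yielding whenever the current slice has used
// at least `budgetMs` of synchronous time.
async function processInSlices<T, R>(
  items: readonly T[],
  handle: (item: T) => R,
  budgetMs = 10,
): Promise<{ results: R[]; yields: number }> {
  const results: R[] = [];
  let yields = 0;
  let sliceStart = performance.now();
  for (const item of items) {
    results.push(handle(item));
    if (performance.now() - sliceStart >= budgetMs) {
      await yieldToEventLoop();
      yields += 1;
      sliceStart = performance.now();
    }
  }
  return { results, yields };
}
```

This only helps if each item is cheap on its own. If a single step is itself slow (for example parsing or serializing a multi-megabyte transcript in one call), slicing cannot interrupt it, and moving the work off the main thread with Node's worker_threads module (request 1) would be the more likely fix.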
