Summary
The OpenClaw Gateway event loop is severely blocked by synchronous fs.readFileSync() calls during agent session preparation. This manifests as WebSocket preauth handshake timeouts (same symptom as #74135, different root cause) and Telegram becoming completely unresponsive for minutes at a time.
The #74135 fix (fix(gateway): refresh model catalog off request path) addressed model-catalog blocking, but the session-transcript read path is a separate, still-unfixed blocker.
Environment
- OS: Raspberry Pi, Linux 6.12.75+rpt-rpi-v8, arm64/aarch64
- Node.js: v24.14.1
- OpenClaw: 2026.4.29 (a448042)
- Gateway bound on LAN, port 18789
- memorySearch: enabled (memory-core plugin, dreaming enabled)
Root cause
In session-utils.fs-BgGqlqA-.js, readSessionMessages() reads each transcript with a synchronous fs.readFileSync(file).split(...) and then JSON-parses the entire file. It is called during agent prompt-token estimation / preflight compaction and from multiple gateway methods (chat.history, session preview, session event sequencing).
With large or accumulated session transcripts, this blocks the Node.js event loop for tens of seconds, preventing any WebSocket handshakes, Telegram API calls, or RPCs from being processed.
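To illustrate the mechanism, here is a minimal sketch that reproduces the pattern on a synthetic transcript and measures how late a 0 ms timer fires while the synchronous read runs. It is a hypothetical reconstruction, not OpenClaw's actual source: readMessagesSync, writeSyntheticTranscript, measureTimerLateness and the message shape are all assumptions made for the example.

```javascript
// Hypothetical reconstruction of the blocking pattern; not OpenClaw's actual source.
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const { performance } = require('node:perf_hooks');

// Read a whole JSONL transcript and parse every line on the main thread.
function readMessagesSync(file) {
  return fs
    .readFileSync(file, 'utf8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Write a synthetic transcript with `count` messages and return its path.
// The message shape here is invented for the example.
function writeSyntheticTranscript(count) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'transcript-'));
  const file = path.join(dir, 'session.jsonl');
  const lines = [];
  for (let i = 0; i < count; i++) {
    lines.push(JSON.stringify({ id: i, role: 'user', content: 'x'.repeat(200) }));
  }
  fs.writeFileSync(file, lines.join('\n') + '\n');
  return file;
}

// Schedule a 0 ms timer, then run the synchronous read.
// Resolves with the number of milliseconds between scheduling and firing.
function measureTimerLateness(file) {
  return new Promise((resolve) => {
    const scheduledAt = performance.now();
    setTimeout(() => resolve(performance.now() - scheduledAt), 0);
    readMessagesSync(file); // the timer cannot fire until this returns
  });
}
```

Calling measureTimerLateness(writeSyntheticTranscript(200000)).then(console.log) should print a delay roughly equal to the read-and-parse time. Any WebSocket handshake or RPC queued in that window waits just as the timer does.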
Evidence
Agent prep stage trace (from gateway logs)
[agent/embedded] [trace:embedded-run] prep stages:
runId=63650e03 sessionId=24fb8e6e phase=stream-ready
totalMs=125276
stages=
workspace-sandbox:224ms,
skills:16ms,
core-plugin-tools:40499ms, ← 40 seconds
bootstrap-context:4571ms,
bundle-tools:6077ms,
system-prompt:31392ms, ← 31 seconds
session-resource-loader:10390ms,
agent-session:19ms,
stream-setup:32088ms ← 32 seconds
Total agent preparation: 125 seconds before any model call.
Second run (same session, different trigger)
[agent/embedded] [trace:embedded-run] startup stages:
phase=attempt-dispatch totalMs=83348
stages=
model-resolution:42519ms, ← 42 seconds
auth:21119ms,
attempt-dispatch:19703ms
Event loop diagnostics
[diagnostic] liveness warning:
reasons=event_loop_delay,event_loop_utilization,cpu
interval=93s
eventLoopDelayP99Ms=699.4
eventLoopDelayMaxMs=69860.3 ← 69.8 second max delay
eventLoopUtilization=0.914
cpuCoreRatio=0.932
[diagnostic] liveness warning:
reasons=event_loop_delay,event_loop_utilization,cpu
interval=32s
eventLoopDelayP99Ms=18555.6
eventLoopDelayMaxMs=18555.6 ← 18.5 second delay
eventLoopUtilization=1
cpuCoreRatio=1.05
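For anyone trying to reproduce these numbers outside the gateway, a rough sketch using Node's built-in perf_hooks is below. The field names are chosen to mirror the log lines above, but startLivenessProbe is a hypothetical helper and this is not the gateway's diagnostic code.

```javascript
const { monitorEventLoopDelay, performance } = require('node:perf_hooks');

// Start sampling event loop delay and utilization.
// Returns a stop() function that ends sampling and reports the interval's stats.
function startLivenessProbe(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  const eluStart = performance.eventLoopUtilization();

  return function stop() {
    histogram.disable();
    // Passing the earlier sample yields the utilization since that sample.
    const elu = performance.eventLoopUtilization(eluStart);
    return {
      // Histogram values are reported in nanoseconds.
      eventLoopDelayP99Ms: histogram.percentile(99) / 1e6,
      eventLoopDelayMaxMs: histogram.max / 1e6,
      eventLoopUtilization: elu.utilization,
    };
  };
}
```

Start the probe, trigger an agent run or a chat.history call, then call stop() once the loop is responsive again. A single long synchronous read should show up mainly in eventLoopDelayMaxMs.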
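Resulting symptoms
- openclaw logs --follow drops mid-stream (also observed in #74135, "Gateway intermittently stalls: WebSocket preauth handshakes time out late during model catalog/provider discovery")
- openclaw tui disconnects with "gateway not reachable"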
Session store size correlation
The blocking duration correlates directly with session store size. Before cleanup:
main/sessions/: 116 MB
large JSONL files: 14 files > 1MB
trajectory files: 21 files
deleted/reset files: 86 files (not GC'd by `sessions cleanup`)
sessions.json index: 722 KB (read synchronously)
After manual deletion of .deleted.* and .reset.* files: 116 MB → 36 MB.
Note: openclaw sessions cleanup --dry-run reported 0 files to remove despite 86 physical .deleted.*/.reset.* files on disk — the cleanup command only manages the index, not the physical files.
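As a stopgap for auditing the directory by hand, a small script along these lines can list the leftover files and total their size. It assumes the leftovers carry ".deleted." or ".reset." somewhere in the file name, which is inferred from the patterns above rather than confirmed from OpenClaw's source; findOrphanedTranscripts is a hypothetical helper. It uses synchronous calls on purpose, since it is meant to run as a one-off script and not inside the gateway.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Assumption: leftover transcripts carry ".deleted." or ".reset." in the file name.
const ORPHAN_PATTERN = /\.(deleted|reset)\./;

// List matching regular files directly inside sessionsDir and sum their sizes.
function findOrphanedTranscripts(sessionsDir) {
  const entries = fs.readdirSync(sessionsDir, { withFileTypes: true });
  const orphans = [];
  let totalBytes = 0;
  for (const entry of entries) {
    if (!entry.isFile() || !ORPHAN_PATTERN.test(entry.name)) continue;
    const file = path.join(sessionsDir, entry.name);
    const { size } = fs.statSync(file);
    orphans.push({ file, size });
    totalBytes += size;
  }
  return { orphans, totalBytes };
}
```

The function only reports; deleting anything is left to the operator after reviewing the list.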
Suggested fix direction
- Convert readSessionMessages() to async I/O (fs.promises.readFile)
- Avoid full-file reads where partial/streaming reads suffice
- Cache token counts to avoid repeated full transcript reads during preflight compaction
- Have sessions cleanup also remove physical .deleted.*/.reset.* files (separate issue filed)
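The first three points could be combined roughly as follows. This is a sketch under stated assumptions, not a patch against the real code: streamSessionMessages, cachedTokenCount and the estimateTokens callback are hypothetical names, and the cache is invalidated on file size and mtime only.

```javascript
const fs = require('node:fs');
const readline = require('node:readline');

// Stream a JSONL transcript line by line instead of loading it whole.
async function* streamSessionMessages(file) {
  const input = fs.createReadStream(file, { encoding: 'utf8' });
  const lines = readline.createInterface({ input, crlfDelay: Infinity });
  for await (const line of lines) {
    if (line.trim() === '') continue;
    yield JSON.parse(line);
  }
}

// Cache keyed by path; an entry is reused only while size and mtime are unchanged.
const tokenCountCache = new Map();

// estimateTokens is supplied by the caller (hypothetical signature: message -> number).
async function cachedTokenCount(file, estimateTokens) {
  const { size, mtimeMs } = await fs.promises.stat(file);
  const hit = tokenCountCache.get(file);
  if (hit && hit.size === size && hit.mtimeMs === mtimeMs) return hit.tokens;

  let tokens = 0;
  for await (const message of streamSessionMessages(file)) {
    tokens += estimateTokens(message);
  }
  tokenCountCache.set(file, { size, mtimeMs, tokens });
  return tokens;
}
```

Each JSON.parse call is still synchronous, but the work is split per line and the loop yields between stream chunks, so handshakes and RPCs can interleave with the read. Repeated preflight checks on an unchanged transcript cost one stat call.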
Related