fix: synchronous session transcript reads block Gateway event loop (WS handshake timeouts, Telegram unresponsive) #75656

@DerFlash

Summary

The OpenClaw Gateway event loop is severely blocked by synchronous fs.readFileSync() calls during agent session preparation. This manifests as WebSocket preauth handshake timeouts (same symptom as #74135, different root cause) and Telegram becoming completely unresponsive for minutes at a time.

The #74135 fix (fix(gateway): refresh model catalog off request path) addressed model-catalog blocking, but the session-transcript read path is a separate, still-unfixed blocker.

Environment

  • OS: Raspberry Pi, Linux 6.12.75+rpt-rpi-v8, arm64/aarch64
  • Node.js: v24.14.1
  • OpenClaw: 2026.4.29 (a448042)
  • Gateway bound on LAN, port 18789
  • memorySearch: enabled (memory-core plugin, dreaming enabled)

Root cause

In session-utils.fs-BgGqlqA-.js, readSessionMessages() uses synchronous fs.readFileSync(file).split(...) followed by JSON parsing of entire transcript files. It is called during agent prompt-token estimation / preflight compaction and from multiple gateway methods (chat.history, session preview, session event sequencing).

With large or accumulated session transcripts, this blocks the Node.js event loop for tens of seconds, preventing any WebSocket handshakes, Telegram API calls, or RPCs from being processed.
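
To illustrate the blocking pattern described above, here is a minimal sketch. It is not the OpenClaw source: `readTranscriptSync`, `writeSampleTranscript`, and `measureTimerLag` are hypothetical names, and the sample transcript shape is invented for the demo. The point is that a timer scheduled before the synchronous read cannot fire until the read-split-parse call returns.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical stand-in for the synchronous read path described in the issue:
// read the whole file, split it into lines, JSON.parse every line.
function readTranscriptSync(file) {
  return fs
    .readFileSync(file, 'utf8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Build a throwaway JSONL transcript so the sketch is self-contained.
function writeSampleTranscript(lineCount) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'transcript-'));
  const file = path.join(dir, 'session.jsonl');
  const lines = [];
  for (let i = 0; i < lineCount; i++) {
    lines.push(JSON.stringify({ role: 'user', seq: i, text: 'x'.repeat(200) }));
  }
  fs.writeFileSync(file, lines.join('\n') + '\n');
  return file;
}

// Measure how late a 0 ms timer fires when a synchronous read runs first.
function measureTimerLag(file) {
  return new Promise((resolve) => {
    const scheduledAt = process.hrtime.bigint();
    let readMs = 0;
    let count = 0;
    setTimeout(() => {
      const lagMs = Number(process.hrtime.bigint() - scheduledAt) / 1e6;
      resolve({ lagMs, readMs, count });
    }, 0);
    // Nothing else on the event loop, including the timer above, can run
    // until this synchronous call returns.
    const started = process.hrtime.bigint();
    count = readTranscriptSync(file).length;
    readMs = Number(process.hrtime.bigint() - started) / 1e6;
  });
}
```

Running `measureTimerLag(writeSampleTranscript(50000)).then(console.log)` should report a timer lag at least as large as the read time. The same applies to WebSocket handshakes and RPCs queued behind the read.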

Evidence

Agent prep stage trace (from gateway logs)

[agent/embedded] [trace:embedded-run] prep stages:
  runId=63650e03 sessionId=24fb8e6e phase=stream-ready
  totalMs=125276
  stages=
    workspace-sandbox:224ms,
    skills:16ms,
    core-plugin-tools:40499ms,   ← 40 seconds
    bootstrap-context:4571ms,
    bundle-tools:6077ms,
    system-prompt:31392ms,       ← 31 seconds
    session-resource-loader:10390ms,
    agent-session:19ms,
    stream-setup:32088ms         ← 32 seconds

Total agent preparation: 125 seconds before any model call.

Second run (same session, different trigger)

[agent/embedded] [trace:embedded-run] startup stages:
  phase=attempt-dispatch totalMs=83348
  stages=
    model-resolution:42519ms,   ← 42 seconds
    auth:21119ms,
    attempt-dispatch:19703ms

Event loop diagnostics

[diagnostic] liveness warning:
  reasons=event_loop_delay,event_loop_utilization,cpu
  interval=93s
  eventLoopDelayP99Ms=699.4
  eventLoopDelayMaxMs=69860.3   ← 69.9 second max delay
  eventLoopUtilization=0.914
  cpuCoreRatio=0.932
[diagnostic] liveness warning:
  reasons=event_loop_delay,event_loop_utilization,cpu
  interval=32s
  eventLoopDelayP99Ms=18555.6
  eventLoopDelayMaxMs=18555.6   ← 18.6 second delay
  eventLoopUtilization=1
  cpuCoreRatio=1.05
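
For anyone trying to reproduce these numbers outside the Gateway, similar metrics can be collected with Node's built-in `perf_hooks`. This is a sketch under assumptions: it is not necessarily how OpenClaw computes its liveness warnings, and `sampleEventLoop` and `blockFor` are hypothetical helpers. The output field names simply mirror the log above.

```javascript
const { monitorEventLoopDelay, performance } = require('node:perf_hooks');

// Hypothetical helper: sample event loop health over a window of `windowMs`.
// If `work` is given, it runs 50 ms into the window so that the delay
// histogram has already recorded a few baseline ticks.
function sampleEventLoop(windowMs, work) {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  const eluStart = performance.eventLoopUtilization();
  histogram.enable();
  return new Promise((resolve) => {
    if (work) setTimeout(work, 50);
    setTimeout(() => {
      histogram.disable();
      // Passing the earlier reading returns the delta for this window.
      const elu = performance.eventLoopUtilization(eluStart);
      resolve({
        // Histogram values are reported in nanoseconds.
        eventLoopDelayP99Ms: histogram.percentile(99) / 1e6,
        eventLoopDelayMaxMs: histogram.max / 1e6,
        eventLoopUtilization: elu.utilization,
      });
    }, windowMs);
  });
}

// Busy-wait to simulate a synchronous blocker such as a large readFileSync.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}
```

For example, `sampleEventLoop(1000, () => blockFor(300)).then(console.log)` should show a maximum delay in the region of the simulated block, while a run without `work` stays near the sampling resolution.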

Resulting symptoms

Session store size correlation

The blocking duration correlates directly with session store size. Before cleanup:

main/sessions/:        116 MB
large JSONL files:     14 files > 1MB
trajectory files:      21 files
deleted/reset files:   86 files (not GC'd by `sessions cleanup`)
sessions.json index:   722 KB (read synchronously)

After manual deletion of .deleted.* and .reset.* files: 116 MB → 36 MB.

Note: openclaw sessions cleanup --dry-run reported 0 files to remove despite 86 physical .deleted.*/.reset.* files on disk. The cleanup command only manages the index, not the physical files.
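
Until cleanup handles these files, they can be inventoried with a short script. The sketch below is an assumption-laden illustration: `findOrphanedTranscripts` is a hypothetical helper, it assumes orphaned files sit directly in the sessions directory, and it assumes their names contain a `.deleted.` or `.reset.` segment as the patterns above suggest. It only lists files and never deletes anything.

```javascript
const fsp = require('node:fs/promises');
const path = require('node:path');

// Assumption: orphaned files carry a ".deleted." or ".reset." segment in
// their name, matching the ".deleted.*" / ".reset.*" patterns in the issue.
const ORPHAN_PATTERN = /\.(deleted|reset)\./;

// Hypothetical helper: report orphaned files and their combined size.
// Uses async fs calls so the scan itself does not block the event loop.
async function findOrphanedTranscripts(sessionsDir) {
  const entries = await fsp.readdir(sessionsDir, { withFileTypes: true });
  const orphans = [];
  let totalBytes = 0;
  for (const entry of entries) {
    if (!entry.isFile() || !ORPHAN_PATTERN.test(entry.name)) continue;
    const fullPath = path.join(sessionsDir, entry.name);
    const { size } = await fsp.stat(fullPath);
    orphans.push({ path: fullPath, size });
    totalBytes += size;
  }
  return { orphans, totalBytes };
}
```

Review the returned list before removing anything by hand, since the naming assumption may not hold for every OpenClaw version.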

Suggested fix direction

  1. Convert readSessionMessages() to async I/O (fs.promises.readFile)
  2. Avoid full-file reads where partial/streaming reads suffice
  3. Cache token counts to avoid repeated full transcript reads during preflight compaction
  4. Have sessions cleanup also remove physical .deleted.*/.reset.* files (separate issue filed)
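
Items 1 to 3 could look roughly like the following. This is a sketch, not a patch against the real code: `streamSessionMessages`, `cachedTokenCount`, and the caller-supplied `estimateTokens` function are hypothetical, and it assumes transcripts are JSONL with one message per line. It streams the file line by line instead of reading it whole, and it reuses a token count while the file's size and mtime are unchanged.

```javascript
const fs = require('node:fs');
const readline = require('node:readline');

// Hypothetical async replacement for the read path: stream the JSONL file
// line by line instead of loading and splitting it in one synchronous call.
async function* streamSessionMessages(file) {
  const rl = readline.createInterface({
    input: fs.createReadStream(file, { encoding: 'utf8' }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  for await (const line of rl) {
    if (line.trim() === '') continue;
    yield JSON.parse(line);
  }
}

// Cache keyed by path; an entry is reused only while size and mtime match.
const tokenCountCache = new Map();

// `estimateTokens` is a caller-supplied function (message -> number).
async function cachedTokenCount(file, estimateTokens) {
  const { size, mtimeMs } = await fs.promises.stat(file);
  const hit = tokenCountCache.get(file);
  if (hit && hit.size === size && hit.mtimeMs === mtimeMs) {
    return hit.tokens;
  }
  let tokens = 0;
  for await (const message of streamSessionMessages(file)) {
    tokens += estimateTokens(message);
  }
  tokenCountCache.set(file, { size, mtimeMs, tokens });
  return tokens;
}
```

Parsing still happens on the main thread, but in small slices, so other callbacks get a chance to run between stream chunks. Whether size plus mtime is a strong enough cache key depends on how transcripts are written, which this issue does not establish.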

Related
