Summary
When a session accumulates a large trajectory file (50MB+), pi-trajectory-flush exceeds its 10s timeout. After the timeout, the flush keeps running in the background and the event loop stays saturated at 100% utilization for 25+ minutes, leaving the gateway completely unresponsive to new messages.
Environment
- OpenClaw: 2026.5.2 (installed via npm)
- OS: macOS 15.4, Apple Silicon (arm64)
- Node: v22
- Gateway: local, port 18789
- Model: deepseek/deepseek-v4-flash
Root Cause Analysis
The chain of failures
- Session accumulates massive trajectory: A Feishu session processed a wiki reorganization task involving 6,743 files. Each tool output (file lists, directory structures) was recorded as a trajectory event, resulting in a 51MB trajectory file with 749 events.
- Flush exceeds timeout: At turn end, pi-trajectory-flush tries to drain the queued file writer. With 50MB+ of pending writes, it exceeds the hardcoded 10s timeout in runAgentCleanupStep.
- Timeout doesn't abort the flush: The Promise.race in runAgentCleanupStep only logs a warning; the underlying trajectoryRecorder.flush() promise continues running indefinitely.
- Event loop saturation: The safeJsonStringify serialization + async file write chain blocks the Node.js event loop at 100% utilization, with P99 delays reaching 34,728ms.
- Gateway unresponsive: New messages arrive as queued instead of immediate. The session lane remains occupied by cleanup maintenance. Total downtime: 25+ minutes until forced restart (SIGKILL required).
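The "timeout doesn't abort the flush" step is general Promise.race behavior, and a minimal sketch can show it. This is not OpenClaw's code: slowFlush and raceWithTimeout are hypothetical stand-ins for trajectoryRecorder.flush() and the race inside runAgentCleanupStep, and the durations are shrunk to milliseconds.

```javascript
// Hypothetical stand-in for trajectoryRecorder.flush(): resolves after `ms`
// and records in `state` that it ran to completion.
function slowFlush(ms, state) {
  return new Promise((resolve) => {
    setTimeout(() => {
      state.flushCompleted = true;
      resolve("flushed");
    }, ms);
  });
}

// Races a promise against a timer. If the timer wins, the caller sees
// "timeout", but nothing cancels the losing promise.
function raceWithTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function demo() {
  const state = { flushCompleted: false };
  const flush = slowFlush(50, state);
  const outcome = await raceWithTimeout(flush, 10);
  // The race has already settled, yet the flush is still pending here.
  await flush;
  return { outcome, flushCompleted: state.flushCompleted };
}
```

In this sketch the race reports a timeout while the flush still runs to completion, which matches the pattern described above: the warning is logged, but the work is never stopped.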
Code paths involved
runAgentCleanupStep (attempt.tool-run-context-B2TarhD3.js:440): Hardcoded 10s timeout, no abort mechanism
QueuedFileWriter.flush() (runtime-qu4g1jFz.js): Drains entire promise chain, no backpressure
safeJsonStringify (safe-json-DCDclho7.js:80): Synchronous serialization of large event objects
createTrajectoryRuntimeRecorder (runtime-qu4g1jFz.js:143): maxFileBytes=52428800 (50MB cap exists but doesn't prevent large files)
Key diagnostic logs
agent cleanup timed out: runId=ffdf596f-... sessionId=425f129b-... step=pi-trajectory-flush timeoutMs=10000
liveness warning: eventLoopDelayP99Ms=34728.8 eventLoopUtilization=1 active=0 waiting=0 queued=1
liveness warning: eventLoopDelayP99Ms=27799.8 eventLoopUtilization=1 active=0 waiting=0 queued=1 (25 min later, still stalled)
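The two fields in the liveness warnings resemble values that Node's perf_hooks module exposes. Assuming that is roughly how they are computed (the logs alone don't confirm it), the following sketch shows how a synchronous stall shows up in both metrics. measureStall and busyWait are hypothetical helpers written for this illustration.

```javascript
const { monitorEventLoopDelay, performance } = require("node:perf_hooks");

// Blocks the event loop synchronously, standing in for a long serialization.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

// Measures event loop delay and utilization around a synchronous stall.
async function measureStall(blockMs) {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();

  // Let the loop run briefly so the baseline is taken while it is active.
  await new Promise((resolve) => setTimeout(resolve, 20));
  const eluStart = performance.eventLoopUtilization();

  busyWait(blockMs);

  // Give the delay histogram a chance to record the late timer.
  await new Promise((resolve) => setTimeout(resolve, 20));
  histogram.disable();

  const elu = performance.eventLoopUtilization(eluStart);
  return {
    // percentile() reports nanoseconds; convert to milliseconds.
    eventLoopDelayP99Ms: histogram.percentile(99) / 1e6,
    eventLoopUtilization: elu.utilization,
  };
}
```

Calling measureStall(200) should report a P99 delay far above the 10ms sampling resolution and a utilization close to 1, the same shape as the warnings above at a much smaller scale.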
Session data scale
agent/main/sessions/: 220MB total
8f42a5aa-*.trajectory.jsonl: 51MB (749 events)
7695a95f-*.trajectory.jsonl: 24MB (572 events)
425f129b-*.trajectory.jsonl: 17MB (480 events)
Related issues
Proposed solutions
- Make cleanup abortable: Pass an AbortSignal to runAgentCleanupStep so the flush can be stopped after timeout, rather than continuing in the background.
- Streaming/batched writes for trajectory: Replace per-event appendFile with a WriteStream that buffers writes and yields the event loop between batches.
- Dynamic timeout: Scale cleanup timeout based on pending queue size (e.g., 10s base + 1s per 100 queued events).
- Trajectory rotation: Start a new trajectory file when the current one exceeds N MB (e.g., 10MB), preventing any single file from growing too large.
- Short-term workaround: Expose OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS env var to allow users to increase the timeout for large sessions.
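Three of the proposals above (abortable cleanup, batched writes that yield, and the dynamic timeout) can be sketched together. Everything here is hypothetical: cleanupTimeoutMs, drainQueue, and runAbortableCleanupStep are illustrative names rather than OpenClaw functions, the queue is modeled as a plain array, and the env var is the one proposed above, not an existing setting. The abort is cooperative, so it only takes effect between batches.

```javascript
// Proposed dynamic timeout: 10s base plus 1s per 100 queued events (rounded up).
// The env var override is the proposed workaround, not an existing option.
function cleanupTimeoutMs(queuedEvents, env = process.env) {
  const override = Number(env.OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS);
  if (Number.isFinite(override) && override > 0) return override;
  return 10_000 + Math.ceil(queuedEvents / 100) * 1_000;
}

// Yields to the event loop so timers and I/O callbacks can run between batches.
const yieldToEventLoop = () => new Promise((resolve) => setImmediate(resolve));

// Drains `queue` (an array of events) through `write` in batches,
// stopping at the next batch boundary once `signal` is aborted.
// Returns the number of events written.
async function drainQueue(queue, write, signal, batchSize = 50) {
  let written = 0;
  while (queue.length > 0) {
    if (signal.aborted) break;
    const batch = queue.splice(0, batchSize);
    const chunk = batch.map((event) => JSON.stringify(event)).join("\n") + "\n";
    await write(chunk);
    written += batch.length;
    await yieldToEventLoop();
  }
  return written;
}

// Runs one cleanup step and aborts its signal when the timeout elapses,
// instead of abandoning the step while it keeps running.
async function runAbortableCleanupStep(step, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const result = await step(controller.signal);
    return { timedOut: controller.signal.aborted, result };
  } finally {
    clearTimeout(timer);
  }
}
```

A caller might combine them as runAbortableCleanupStep((signal) => drainQueue(queue, write, signal), cleanupTimeoutMs(queue.length)). Events left in the queue after an abort would still need a policy (drop, retry on the next turn, or spill to a new file), which this sketch leaves open.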