pi-trajectory-flush: 50MB trajectory file blocks event loop for 25+ minutes after flush timeout #77124

@loyur

Summary

When a session accumulates a large trajectory file (50MB+), pi-trajectory-flush exceeds its 10s timeout. The timeout does not stop the flush: it keeps running in the background and holds the event loop at 100% utilization for 25+ minutes, leaving the gateway completely unresponsive to new messages.

Environment

  • OpenClaw: 2026.5.2 (installed via npm)
  • OS: macOS 15.4, Apple Silicon (arm64)
  • Node: v22
  • Gateway: local, port 18789
  • Model: deepseek/deepseek-v4-flash

Root Cause Analysis

The chain of failures

  1. Session accumulates massive trajectory: A Feishu session processed a wiki reorganization task involving 6,743 files. Each tool output (file lists, directory structures) was recorded as a trajectory event, resulting in a 51MB trajectory file with 749 events.

  2. Flush exceeds timeout: At turn end, pi-trajectory-flush tries to drain the queued file writer. With 50MB+ of pending writes, it exceeds the hardcoded 10s timeout in runAgentCleanupStep.

  3. Timeout doesn't abort the flush: The Promise.race in runAgentCleanupStep only logs a warning when the timeout wins; the underlying trajectoryRecorder.flush() promise keeps running, because nothing cancels it.

  4. Event loop saturation: The safeJsonStringify serialization + async file write chain blocks the Node.js event loop at 100% utilization, with P99 delays reaching 34,728ms.

  5. Gateway unresponsive: New messages are queued instead of being processed immediately, because the session lane remains occupied by cleanup maintenance. Total downtime: 25+ minutes, ended only by a forced restart (SIGKILL required).
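
Step 3 follows from how Promise.race behaves in general: it settles with whichever promise settles first and has no way to cancel the others. The sketch below is a minimal illustration, not the OpenClaw source; `slowFlush` and `raceWithTimeout` are hypothetical stand-ins for trajectoryRecorder.flush() and the race inside runAgentCleanupStep, and the durations are scaled down to milliseconds.

```javascript
// Hypothetical stand-in for trajectoryRecorder.flush(): finishes after `ms`.
function slowFlush(ms, state) {
  return new Promise((resolve) => {
    setTimeout(() => {
      state.flushFinished = true; // the work completes even if nobody is waiting
      resolve("flushed");
    }, ms);
  });
}

// Hypothetical stand-in for the race in runAgentCleanupStep.
function raceWithTimeout(work, timeoutMs) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timed out"), timeoutMs);
  });
  // Promise.race settles with the first result; the loser is not cancelled.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

async function demo() {
  const state = { flushFinished: false };
  const outcome = await raceWithTimeout(slowFlush(50, state), 10);
  const finishedAtTimeout = state.flushFinished;
  // Wait past the flush duration: the "timed out" work still runs to completion.
  await new Promise((resolve) => setTimeout(resolve, 80));
  return { outcome, finishedAtTimeout, finishedLater: state.flushFinished };
}

demo().then((result) => console.log(result));
// outcome is "timed out", finishedAtTimeout is false, finishedLater is true
```

The caller sees "timed out" after 10ms, yet the flush still completes at 50ms. With a real 50MB backlog the same pattern would leave the flush running long after the warning is logged.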

Code paths involved

  • runAgentCleanupStep (attempt.tool-run-context-B2TarhD3.js:440): Hardcoded 10s timeout, no abort mechanism
  • QueuedFileWriter.flush() (runtime-qu4g1jFz.js): Drains entire promise chain, no backpressure
  • safeJsonStringify (safe-json-DCDclho7.js:80): Synchronous serialization of large event objects
  • createTrajectoryRuntimeRecorder (runtime-qu4g1jFz.js:143): maxFileBytes=52428800 (a 50 MiB cap exists, but it does not prevent files large enough to cause this stall)

Key diagnostic logs

agent cleanup timed out: runId=ffdf596f-... sessionId=425f129b-... step=pi-trajectory-flush timeoutMs=10000
liveness warning: eventLoopDelayP99Ms=34728.8 eventLoopUtilization=1 active=0 waiting=0 queued=1
liveness warning: eventLoopDelayP99Ms=27799.8 eventLoopUtilization=1 active=0 waiting=0 queued=1  (25 min later, still stalled)
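
For anyone trying to reproduce these numbers, Node's built-in perf_hooks module can produce comparable metrics. The sketch below is an assumption about how such a probe could look, not how the gateway computes its liveness warning; `startLivenessProbe` is a hypothetical helper, and the 200ms busy-wait stands in for a synchronous serialization stall.

```javascript
const { monitorEventLoopDelay, performance } = require("node:perf_hooks");

// Hypothetical probe that reports fields named like those in the liveness warning.
function startLivenessProbe() {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();
  const baseline = performance.eventLoopUtilization();
  return {
    read() {
      const elu = performance.eventLoopUtilization(baseline); // delta since baseline
      return {
        eventLoopDelayP99Ms: histogram.percentile(99) / 1e6, // histogram is in nanoseconds
        eventLoopUtilization: elu.utilization, // 0 = idle, 1 = fully busy
      };
    },
    stop() {
      histogram.disable();
    },
  };
}

async function demo() {
  const probe = startLivenessProbe();
  await new Promise((resolve) => setTimeout(resolve, 50)); // collect a few normal samples
  // Simulate a synchronous stall, e.g. serializing one very large event.
  const end = performance.now() + 200;
  while (performance.now() < end) {
    // busy-wait: timers and I/O callbacks cannot run during this loop
  }
  await new Promise((resolve) => setTimeout(resolve, 50)); // let the probe observe the stall
  const reading = probe.read();
  probe.stop();
  return reading;
}

demo().then((reading) => console.log(reading));
```

A utilization of 1 with active=0, as in the logs above, would be consistent with the loop being busy with work that is not tracked as an active agent run, such as the background flush.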

Session data scale

agent/main/sessions/: 220MB total
  8f42a5aa-*.trajectory.jsonl: 51MB (749 events)
  7695a95f-*.trajectory.jsonl: 24MB (572 events)
  425f129b-*.trajectory.jsonl: 17MB (480 events)

Related issues

Proposed solutions

  1. Make cleanup abortable: Pass an AbortSignal to runAgentCleanupStep so the flush can be stopped after timeout, rather than continuing in the background.

  2. Streaming/batched writes for trajectory: Replace per-event appendFile with a WriteStream that buffers writes and yields the event loop between batches.

  3. Dynamic timeout: Scale cleanup timeout based on pending queue size (e.g., 10s base + 1s per 100 queued events).

  4. Trajectory rotation: Start a new trajectory file when the current one exceeds N MB (e.g., 10MB), preventing any single file from growing too large.

  5. Short-term workaround: Expose OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS env var to allow users to increase the timeout for large sessions.
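
Proposals 1 to 3 could be combined roughly as follows. This is a sketch under stated assumptions, not a patch against the actual code: `cleanupTimeoutMs`, `flushInBatches`, `runCleanupWithAbort`, and the injected `writeBatch` callback are all hypothetical names, rounding partial hundreds up is one possible reading of "1s per 100 queued events", and the batch size of 50 is arbitrary.

```javascript
// Proposal 3: timeout grows with the backlog (10s base + 1s per 100 queued events).
function cleanupTimeoutMs(queuedEvents, baseMs = 10_000, perHundredMs = 1_000) {
  return baseMs + Math.ceil(queuedEvents / 100) * perHundredMs;
}

// Proposals 1 and 2: write in batches, yield between batches, stop on abort.
async function flushInBatches(events, writeBatch, { signal, batchSize = 50 } = {}) {
  let written = 0;
  for (let i = 0; i < events.length; i += batchSize) {
    if (signal?.aborted) break; // stop instead of running on in the background
    const batch = events.slice(i, i + batchSize);
    await writeBatch(batch.map((event) => JSON.stringify(event)).join("\n") + "\n");
    written += batch.length;
    // Yield to the event loop so timers and incoming messages get a turn.
    await new Promise((resolve) => setImmediate(resolve));
  }
  return { written, aborted: Boolean(signal?.aborted) && written < events.length };
}

// Hypothetical cleanup step: the timeout aborts the flush rather than only logging.
async function runCleanupWithAbort(events, writeBatch, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await flushInBatches(events, writeBatch, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

// Demo with an in-memory sink and an artificially slow writer (5ms per batch).
async function demo() {
  const sink = [];
  const slowWrite = (chunk) =>
    new Promise((resolve) => setTimeout(() => { sink.push(chunk); resolve(); }, 5));
  const events = Array.from({ length: 1000 }, (_, i) => ({ seq: i, type: "tool_output" }));
  // 20 batches at 5ms each need at least 100ms, so a 20ms timeout aborts early.
  return runCleanupWithAbort(events, slowWrite, 20);
}

console.log(cleanupTimeoutMs(749)); // 18000, i.e. 10s base + 8s for 749 events
demo().then((result) => console.log(result));
```

In a real integration, `writeBatch` might wrap `fs.promises.appendFile` or a WriteStream. Events left unwritten after an abort would need to stay queued for the next turn so they are not lost. Note that each JSON.stringify call is still synchronous, so a single very large event can still stall the loop for the time it takes to serialize.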
