
fix: prevent event loop saturation from trajectory flush (setImmediate yield + 10MB cap + 30s timeout) #77133

Closed
loyur wants to merge 1 commit into openclaw:main from loyur:fix/pi-trajectory-flush-event-loop-yield


loyur commented May 4, 2026

Problem

When a session accumulates a large trajectory (50MB+, 700+ events), pi-trajectory-flush blocks the event loop for 25+ minutes after the 10s timeout fires. The timeout logs a warning but does not stop the flush, so the event loop stays at 100% utilization with P99 delays of up to 34 seconds, leaving the gateway completely unresponsive.

agent cleanup timed out: step=pi-trajectory-flush timeoutMs=10000
liveness warning: eventLoopDelayP99Ms=34728.8 eventLoopUtilization=1
liveness warning: eventLoopDelayP99Ms=27799.8 eventLoopUtilization=1  ← 25 min later
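
For readers who want to observe similar numbers themselves, here is a minimal sketch of sampling event loop delay and utilization with Node's `perf_hooks`. The helper name `sampleLiveness` and the sampling window are assumptions for illustration; this is not how the gateway's liveness warning is necessarily implemented.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Hypothetical helper: sample event loop health over a fixed window.
export async function sampleLiveness(
  windowMs: number,
): Promise<{ delayP99Ms: number; utilization: number }> {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  const before = performance.eventLoopUtilization();

  await new Promise<void>((resolve) => setTimeout(resolve, windowMs));

  histogram.disable();
  // Passing the earlier reading yields the utilization delta for the window.
  const delta = performance.eventLoopUtilization(before);
  return {
    delayP99Ms: histogram.percentile(99) / 1e6, // histogram values are in nanoseconds
    utilization: delta.utilization, // 0 = idle, 1 = fully busy
  };
}
```

A utilization close to 1 combined with a multi-second P99 delay, as in the log lines above, means timers and I/O callbacks are waiting a long time before they run.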

Root Cause

QueuedFileWriter chains writes into an ever-growing promise chain without ever yielding the event loop. With 700+ individual appendFile calls, the chain consumes 100% of the event loop. After the cleanup timeout fires (10s), the chain keeps running: the Promise.race in runAgentCleanupStep only logs a warning and does not abort the cleanup.
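
The second half of the root cause, that losing a `Promise.race` does not cancel the losing promise, can be shown in isolation. The sketch below uses a hypothetical `raceWithWarning` helper as a stand-in; it is not the project's `runAgentCleanupStep`, only an illustration of the same pattern.

```typescript
// Hypothetical stand-in for a timeout-guarded cleanup step.
export async function raceWithWarning(
  step: Promise<void>,
  timeoutMs: number,
  warn: (message: string) => void,
): Promise<"done" | "timed-out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed-out">((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), timeoutMs);
  });
  const outcome = await Promise.race([step.then(() => "done" as const), timeout]);
  clearTimeout(timer);
  if (outcome === "timed-out") warn(`cleanup timed out: timeoutMs=${timeoutMs}`);
  return outcome; // the losing `step` promise is not cancelled
}

export async function demo() {
  const state = { finished: false };
  // Simulated slow flush that takes longer than the timeout.
  const slowStep = new Promise<void>((resolve) => {
    setTimeout(() => {
      state.finished = true;
      resolve();
    }, 50);
  });
  const warnings: string[] = [];
  const outcome = await raceWithWarning(slowStep, 10, (m) => warnings.push(m));
  const finishedAtTimeout = state.finished;
  await slowStep; // the work keeps going after the race was lost
  return { outcome, warnings, finishedAtTimeout, finishedLater: state.finished };
}
```

In `demo`, the race resolves as timed out while the slow step is still pending, and the step completes afterwards anyway. Stopping the work would need explicit cooperation from the step itself, for example a cancellation flag it checks between writes.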

Changes

1. queued-file-writer.ts — Yield event loop between writes

Add setImmediate between each queued write so the event loop gets control back:

queue = queue
  .then(() => ready)
  .then(() => new Promise<void>((resolve) => setImmediate(resolve)))  // ← yield
  .then(() => safeAppendFile(filePath, line, options))
  .catch(() => undefined);

This prevents the promise chain from monopolizing the event loop. Each write still happens sequentially, but other work (message dispatch, WebSocket events) can be processed between writes.
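
To make the pattern concrete outside the project, here is a self-contained sketch of a sequential writer that yields with `setImmediate` before each append. `createYieldingWriter` and `demo` are hypothetical names, and the sketch omits the `ready` gate and `safeAppendFile` wrapper from the snippet above, using plain `appendFile` instead.

```typescript
import { appendFile, mkdtemp, readFile, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical minimal writer; not the project's QueuedFileWriter.
export function createYieldingWriter(filePath: string) {
  let queue: Promise<void> = Promise.resolve();
  // Resolves in the check phase of the next loop iteration, after pending I/O callbacks.
  const yieldToEventLoop = () =>
    new Promise<void>((resolve) => setImmediate(resolve));

  return {
    write(line: string): void {
      queue = queue
        .then(yieldToEventLoop)
        .then(() => appendFile(filePath, line, "utf8"))
        .catch(() => undefined); // one failed write must not break the chain
    },
    flush(): Promise<void> {
      return queue;
    },
  };
}

// Usage: queue three lines, flush, and read the file back.
export async function demo(): Promise<string> {
  const dir = await mkdtemp(join(tmpdir(), "yielding-writer-"));
  const file = join(dir, "trajectory.jsonl");
  const writer = createYieldingWriter(file);
  for (let i = 0; i < 3; i++) writer.write(`{"event":${i}}\n`);
  await writer.flush();
  const content = await readFile(file, "utf8");
  await rm(dir, { recursive: true, force: true });
  return content;
}
```

Because every write is chained onto the previous one, ordering is preserved; the yield only changes when each append is scheduled, not the order in which lines reach the file.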

2. paths.ts — Reduce trajectory file cap

TRAJECTORY_RUNTIME_FILE_MAX_BYTES: 50MB → 10MB

A single session shouldn't produce 50MB of trajectory. 10MB (~140 events at 70KB avg) is sufficient for debugging while keeping flush time manageable.
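
The event estimate can be checked with a quick calculation. This sketch assumes binary units (1MB = 1024KB), which the description does not specify; with decimal units the result lands slightly lower, so both readings agree with the "~140" figure.

```typescript
const KB = 1024;
const MB = 1024 * KB;

// Average event size implied by a trajectory of a given size and event count.
export function avgEventBytes(totalBytes: number, eventCount: number): number {
  return totalBytes / eventCount;
}

// Number of whole events of a given average size that fit under a byte cap.
export function eventsThatFit(capBytes: number, avgBytes: number): number {
  return Math.floor(capBytes / avgBytes);
}

// 50MB over 700 events is roughly 73KB per event, close to the 70KB average used above.
export const observedAvgKB = avgEventBytes(50 * MB, 700) / KB;

// 10MB at 70KB per event fits 146 whole events.
export const eventsUnderNewCap = eventsThatFit(10 * MB, 70 * KB);
```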

3. run-cleanup-timeout.ts — Increase cleanup timeout

AGENT_CLEANUP_STEP_TIMEOUT_MS: 10s → 30s

With the event loop yielding (change #1), flushes complete faster. But 10s is still tight for large sessions. 30s provides adequate margin.

Verification

  • Tested locally on macOS with OpenClaw 2026.5.2
  • Applied equivalent patches to compiled bundle
  • Gateway restarted cleanly, Feishu WebSocket reconnected
  • No event loop saturation observed after fix
  • Existing unit tests pass without modification

Fixes #75839
Related: #76340, #77115, #76421

Three changes to prevent the trajectory flush from blocking the
event loop for 25+ minutes under heavy session load:

1. queued-file-writer: yield event loop via setImmediate between
   each queued write to prevent the promise chain from consuming
   100% of the event loop. Without this yield, 700+ queued writes
   (50MB trajectory) block the event loop continuously.

2. trajectory paths: reduce TRAJECTORY_RUNTIME_FILE_MAX_BYTES from
   50MB to 10MB to prevent any single trajectory file from growing
   large enough to cause multi-second flush delays.

3. run-cleanup-timeout: increase AGENT_CLEANUP_STEP_TIMEOUT_MS
   from 10s to 30s to give large trajectories more time to flush
   before the timeout warning fires.

Closes #75839
openclaw-barnacle Bot added the agents (Agent runtime and tooling) and size: XS labels May 4, 2026
clawsweeper Bot (Contributor) commented May 4, 2026

Thanks for the context here. I swept through the related work, and this is now duplicate or superseded.

Close as superseded: merged replacement work now covers the useful trajectory-flush changes on current main, while this branch’s remaining constant-only diff is less compatible than the landed implementation.

Canonical path: Keep the merged bounded-capture/yielding writer and configurable-timeout implementations, and close this unmerged branch instead of reviving its broader constant changes.

So I’m closing this here because the remaining work is already tracked in the canonical issue.

Review details

Best possible solution:

Keep the merged bounded-capture/yielding writer and configurable-timeout implementations, and close this unmerged branch instead of reviving its broader constant changes.

Do we have a high-confidence way to reproduce the issue?

No. I did not run a large-trajectory benchmark; current main no longer has the exact unbounded live-capture/no-yield/fixed-only timeout path because the trajectory writer, capture budget, and timeout resolver have been replaced.

Is this the best way to solve the issue?

No. This branch is no longer the best solution because it lowers the export cap globally and only raises a constant; the merged replacements preserve export compatibility and add an opt-in timeout knob.
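
As a rough illustration of what an opt-in timeout knob can look like compared with raising a constant, here is a sketch of an environment-based resolver. The variable name `EXAMPLE_CLEANUP_TIMEOUT_MS` and the function `resolveCleanupTimeoutMs` are hypothetical; the merged implementation's actual names and behavior are not shown in this thread.

```typescript
const DEFAULT_CLEANUP_TIMEOUT_MS = 10_000;

// Hypothetical resolver: use the override when it is a positive number,
// otherwise fall back to the default.
export function resolveCleanupTimeoutMs(
  env: Record<string, string | undefined>,
  name = "EXAMPLE_CLEANUP_TIMEOUT_MS", // assumed name, for illustration only
): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return DEFAULT_CLEANUP_TIMEOUT_MS;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0
    ? Math.floor(parsed)
    : DEFAULT_CLEANUP_TIMEOUT_MS;
}
```

With this shape, the default stays unchanged for everyone, and only deployments with unusually large sessions opt into a longer timeout, for example by calling `resolveCleanupTimeoutMs(process.env)` at startup.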

Security review:

Security review cleared: The diff changes local file-write scheduling and timeout/size constants only; it adds no dependencies, network calls, permissions, or secret-handling changes.

What I checked:

Likely related people:

  • steipete: Authored and merged the bounded trajectory runtime replacement that yields queued sidecar writes, separates live-capture and export limits, and updates trajectory tests/docs. (role: replacement fix author and recent area contributor; confidence: high; commits: 474bea162b4d, 817b5812e10a; files: src/agents/queued-file-writer.ts, src/trajectory/runtime.ts, src/trajectory/paths.ts)
  • BunsDev: Authored the merged cleanup-timeout replacement that adds trajectory-specific and general cleanup timeout env overrides, plus docs/tests/changelog coverage. (role: timeout replacement author and issue closer; confidence: high; commits: 5d4a8b00721a; files: src/agents/run-cleanup-timeout.ts, src/agents/run-cleanup-timeout.test.ts, docs/tools/trajectory.md)
  • scoootscooob: GitHub path history shows the default-on trajectory capture and export surface was introduced in the earlier trajectory bundle export PR. (role: introduced trajectory capture/export surface; confidence: medium; commits: a3d9c53db299; files: src/trajectory/runtime.ts, src/trajectory/paths.ts, docs/tools/trajectory.md)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 333f65fc8a12.


Development

Successfully merging this pull request may close these issues.

sessions.list latency around 10s and fixed 10s pi-trajectory-flush timeout under moderate session load
