fix: prevent event loop saturation from trajectory flush (setImmediate yield + 10MB cap + 30s timeout) by loyur · Pull Request #77133 · openclaw/openclaw

loyur · 2026-05-04T05:49:29Z

Problem

When a session accumulates a large trajectory (50MB+, 700+ events), pi-trajectory-flush blocks the event loop for 25+ minutes after the 10s timeout fires. The timeout warns but doesn't stop the flush, and the event loop stays at 100% utilization with P99 delays of 34 seconds — making the gateway completely unresponsive.

```
agent cleanup timed out: step=pi-trajectory-flush timeoutMs=10000
liveness warning: eventLoopDelayP99Ms=34728.8 eventLoopUtilization=1
liveness warning: eventLoopDelayP99Ms=27799.8 eventLoopUtilization=1  ← 25 min later
```

Root Cause

QueuedFileWriter chains writes into an ever-growing promise chain without ever yielding the event loop. With 700+ individual appendFile calls, the chain consumes 100% of the event loop. After the cleanup timeout fires (10s), the chain continues running — the Promise.race in runAgentCleanupStep only logs a warning, it doesn't abort the cleanup.
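To illustrate why the timeout alone does not help, here is a minimal sketch of a `Promise.race`-based timeout. `runStepWithTimeout` and `makeSlowFlush` are hypothetical names made up for this example and are not taken from `runAgentCleanupStep`; the only point is that the losing promise in a race keeps running unless something cancels it explicitly.

```typescript
// Illustrative sketch only; names are hypothetical, not the OpenClaw code.
type StepResult = "done" | "timed-out";

/** Race a cleanup step against a timer. The losing promise is NOT cancelled. */
async function runStepWithTimeout(
  step: () => Promise<void>,
  timeoutMs: number,
): Promise<StepResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<StepResult>((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), timeoutMs);
  });
  try {
    return await Promise.race([step().then(() => "done" as const), timeout]);
  } finally {
    // Clearing the timer stops the timer only; the step itself keeps going.
    if (timer !== undefined) clearTimeout(timer);
  }
}

/** Simulated flush: `count` async steps, counting how many have completed. */
function makeSlowFlush(count: number, stepMs: number) {
  const state = { completed: 0 };
  const flush = async (): Promise<void> => {
    for (let i = 0; i < count; i++) {
      await new Promise<void>((resolve) => setTimeout(resolve, stepMs));
      state.completed++;
    }
  };
  return { state, flush };
}
```

Calling `runStepWithTimeout(flush, 30)` on a flush that needs about 100 ms returns `"timed-out"`, yet `state.completed` still climbs to the full count afterwards. Actually stopping the work would need a cancellation signal (for example an `AbortSignal`) that the flush checks between writes.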

Changes

1. `queued-file-writer.ts` — Yield event loop between writes

Add setImmediate between each queued write so the event loop gets control back:

```ts
queue = queue
  .then(() => ready)
  .then(() => new Promise<void>((resolve) => setImmediate(resolve)))  // ← yield
  .then(() => safeAppendFile(filePath, line, options))
  .catch(() => undefined);
```

This prevents the promise chain from monopolizing the event loop. Each write still happens sequentially, but other work (message dispatch, WebSocket events) can be processed between writes.
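For readers who want to try the pattern in isolation, here is a self-contained sketch of a serialized writer that yields before each append. `createYieldingWriter` is a hypothetical name, and the sketch uses plain `appendFile` from `node:fs/promises` in place of the project's `safeAppendFile` and `ready` gate, so treat it as an approximation rather than the actual `QueuedFileWriter`.

```typescript
import { appendFile } from "node:fs/promises";

// Illustrative sketch only; not the project's QueuedFileWriter.
function createYieldingWriter(filePath: string) {
  // Tail of the chain: each write is appended behind the previous one.
  let queue: Promise<void> = Promise.resolve();

  return {
    /** Enqueue one line; writes run strictly in call order. */
    write(line: string): void {
      queue = queue
        // Hand control back so timers, I/O callbacks, and socket events can run.
        .then(() => new Promise<void>((resolve) => setImmediate(resolve)))
        .then(() => appendFile(filePath, line, "utf8"))
        // Swallow errors so one failed append does not reject every later write.
        .catch(() => undefined);
    },
    /** Resolves once every write enqueued so far has settled. */
    flush(): Promise<void> {
      return queue;
    },
  };
}
```

A caller would enqueue lines with `writer.write(...)` and `await writer.flush()` during cleanup. The yield adds one event loop turn per write, which trades a little throughput for responsiveness.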

2. `paths.ts` — Reduce trajectory file cap

TRAJECTORY_RUNTIME_FILE_MAX_BYTES: 50MB → 10MB

A single session shouldn't produce 50MB of trajectory. 10MB (~140 events at 70KB avg) is sufficient for debugging while keeping flush time manageable.
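The event estimate above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 MB = 1,000,000 bytes, 1 KB = 1,000 bytes), which the description does not specify; `eventsThatFit` is a hypothetical helper.

```typescript
// Figures come from the PR description; decimal units are an assumption.
const KB = 1_000;
const MB = 1_000_000;

/** Number of whole events of a given average size that fit under a byte cap. */
function eventsThatFit(capBytes: number, avgEventBytes: number): number {
  return Math.floor(capBytes / avgEventBytes);
}

// Average size implied by the reported 50MB / 700-event session.
const observedAvgBytes = (50 * MB) / 700;

const eventsAtNewCap = eventsThatFit(10 * MB, 70 * KB);
const eventsAtOldCap = eventsThatFit(50 * MB, 70 * KB);

console.log({ observedAvgBytes, eventsAtNewCap, eventsAtOldCap });
```

The implied average of roughly 71KB per event is consistent with the 70KB figure, and at that size the 10MB cap holds a little over 140 events, versus a little over 700 at 50MB.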

3. `run-cleanup-timeout.ts` — Increase cleanup timeout

AGENT_CLEANUP_STEP_TIMEOUT_MS: 10s → 30s

With the event loop yielding (change #1), flushes complete faster. But 10s is still tight for large sessions. 30s provides adequate margin.

Verification

Tested locally on macOS with OpenClaw 2026.5.2
Applied equivalent patches to compiled bundle
Gateway restarted cleanly, Feishu WebSocket reconnected
No event loop saturation observed after fix
Existing unit tests pass without modification

Fixes #75839
Related: #76340, #77115, #76421

Three changes to prevent the trajectory flush from blocking the event loop for 25+ minutes under heavy session load:

1. queued-file-writer: yield event loop via setImmediate between each queued write to prevent the promise chain from consuming 100% of the event loop. Without this yield, 700+ queued writes (50MB trajectory) block the event loop continuously.
2. trajectory paths: reduce TRAJECTORY_RUNTIME_FILE_MAX_BYTES from 50MB to 10MB to prevent any single trajectory file from growing large enough to cause multi-second flush delays.
3. run-cleanup-timeout: increase AGENT_CLEANUP_STEP_TIMEOUT_MS from 10s to 30s to give large trajectories more time to flush before the timeout warning fires.

Closes #75839

clawsweeper · 2026-05-04T05:52:47Z

Thanks for the context here. I swept through the related work, and this is now duplicate or superseded.

Close as superseded: merged replacement work now covers the useful trajectory-flush changes on current main, while this branch’s remaining constant-only diff is less compatible than the landed implementation.

Canonical path: Keep the merged bounded-capture/yielding writer and configurable-timeout implementations, and close this unmerged branch instead of reviving its broader constant changes.

So I’m closing this here because the remaining work is already tracked in the canonical issue.

Review details

Best possible solution:

Keep the merged bounded-capture/yielding writer and configurable-timeout implementations, and close this unmerged branch instead of reviving its broader constant changes.

Do we have a high-confidence way to reproduce the issue?

No. I did not run a large-trajectory benchmark; current main no longer has the exact unbounded live-capture/no-yield/fixed-only timeout path because the trajectory writer, capture budget, and timeout resolver have been replaced.

Is this the best way to solve the issue?

No. This branch is no longer the best solution because it lowers the export cap globally and only raises a constant; the merged replacements preserve export compatibility and add an opt-in timeout knob.
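The distinction between the two limits can be sketched as follows. The 10 MiB and 50 MiB values come from the review notes in this thread; the constant and function names are hypothetical and do not reflect the merged code in `src/trajectory/paths.ts`.

```typescript
const MIB = 1024 * 1024;

// Values from the review text; names are hypothetical, for illustration only.
const LIVE_CAPTURE_MAX_BYTES = 10 * MIB;
const EXPORT_ACCEPT_MAX_BYTES = 50 * MIB;

/** Live capture stops once the next event would cross the capture budget. */
function shouldCaptureEvent(bytesWritten: number, eventBytes: number): boolean {
  return bytesWritten + eventBytes <= LIVE_CAPTURE_MAX_BYTES;
}

/** Export keeps accepting existing files up to the larger export limit. */
function canExportFile(fileBytes: number): boolean {
  return fileBytes <= EXPORT_ACCEPT_MAX_BYTES;
}
```

Keeping the limits separate means a trajectory file written before the change (for example, 40 MiB) can still be exported, while new sessions stop growing their live file at 10 MiB. Lowering a single shared constant, as this branch does, would reject such older files.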

Security review:

Security review cleared: The diff changes local file-write scheduling and timeout/size constants only; it adds no dependencies, network calls, permissions, or secret-handling changes.

What I checked:

PR branch diff: The branch changes only src/agents/queued-file-writer.ts, src/agents/run-cleanup-timeout.ts, and src/trajectory/paths.ts: a queued-write yield, a 30s cleanup timeout constant, and a single 10 MiB trajectory file cap. (5dad918f652c)
Current main queues bounded yielding writes: Current main routes runtime trajectory writes through getQueuedFileWriter with maxFileBytes, maxQueuedBytes, and yieldBeforeWrite: true, which covers the central writer-yield and queue-bounding part of this PR. (src/trajectory/runtime.ts:256, 333f65fc8a12)
Current main has configurable trajectory cleanup timeout: Current main keeps the 10s default but resolves pi-trajectory-flush through OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS, then the general cleanup timeout env, instead of hard-coding a 30s constant. (src/agents/run-cleanup-timeout.ts:31, 333f65fc8a12)
Superseding bounded-runtime PR: Merged PR fix: bound trajectory runtime flush #77154 explicitly says it replaces this PR, bounds runtime trajectory payloads before write, stops live capture at 10 MiB, yields before sidecar appends, and preserves 50 MiB export compatibility. (474bea162b4d)
Superseding timeout PR: Merged PR fix(agents): make trajectory cleanup timeout configurable #81622 identifies this PR as the closest open trajectory PR, notes that current main already has the writer-yield/docs-cap pieces, and implements the remaining timeout behavior as a configurable override. (5d4a8b00721a)
Docs now match the replacement behavior: The public trajectory docs now document the 10 MiB live-capture stop, 50 MiB export acceptance, and OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS=30000 as an opt-in operator setting. Public docs: docs/tools/trajectory.md. (docs/tools/trajectory.md:171, 333f65fc8a12)
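As a rough picture of the timeout resolution order described in the checks above, here is a minimal sketch. `OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS` and the 10s default come from the review text. The general override's variable name is not given in this thread, so `EXAMPLE_CLEANUP_STEP_TIMEOUT_MS` is a placeholder; the function names and the positive-integer parsing rule are also assumptions.

```typescript
const DEFAULT_CLEANUP_STEP_TIMEOUT_MS = 10_000;
const TRAJECTORY_FLUSH_TIMEOUT_ENV = "OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS";
// Placeholder name: the review does not say what the general override is called.
const GENERAL_CLEANUP_TIMEOUT_ENV = "EXAMPLE_CLEANUP_STEP_TIMEOUT_MS";

type Env = Record<string, string | undefined>;

/** Accept only positive integer millisecond values (an assumed rule). */
function parsePositiveMs(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const value = Number(raw);
  return Number.isInteger(value) && value > 0 ? value : undefined;
}

/** Step-specific override first, then the general override, then the default. */
function resolveCleanupTimeoutMs(step: string, env: Env): number {
  const stepSpecific =
    step === "pi-trajectory-flush"
      ? parsePositiveMs(env[TRAJECTORY_FLUSH_TIMEOUT_ENV])
      : undefined;
  return (
    stepSpecific ??
    parsePositiveMs(env[GENERAL_CLEANUP_TIMEOUT_ENV]) ??
    DEFAULT_CLEANUP_STEP_TIMEOUT_MS
  );
}
```

With this shape, an operator who sets `OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS=30000` gets the 30s behavior this PR proposed, while everyone else keeps the 10s default.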

Likely related people:

steipete: Authored and merged the bounded trajectory runtime replacement that yields queued sidecar writes, separates live-capture and export limits, and updates trajectory tests/docs. (role: replacement fix author and recent area contributor; confidence: high; commits: 474bea162b4d, 817b5812e10a; files: src/agents/queued-file-writer.ts, src/trajectory/runtime.ts, src/trajectory/paths.ts)
BunsDev: Authored the merged cleanup-timeout replacement that adds trajectory-specific and general cleanup timeout env overrides, plus docs/tests/changelog coverage. (role: timeout replacement author and issue closer; confidence: high; commits: 5d4a8b00721a; files: src/agents/run-cleanup-timeout.ts, src/agents/run-cleanup-timeout.test.ts, docs/tools/trajectory.md)
scoootscooob: GitHub path history shows the default-on trajectory capture and export surface was introduced in the earlier trajectory bundle export PR. (role: introduced trajectory capture/export surface; confidence: medium; commits: a3d9c53db299; files: src/trajectory/runtime.ts, src/trajectory/paths.ts, docs/tools/trajectory.md)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 333f65fc8a12.

openclaw-barnacle Bot added the `agents` (Agent runtime and tooling) and `size: XS` labels May 4, 2026

steipete mentioned this pull request May 4, 2026

fix: bound trajectory runtime flush #77154

Merged

loyur mentioned this pull request May 4, 2026

Gateway FD leak: ~14,000 file descriptors leaked after 7h uptime causing spawn EBADF and event loop saturation #77327

Closed

clawsweeper Bot mentioned this pull request May 10, 2026

[Performance Regression] Blocked event loop and timeouts after v2026.4.23 #76340

Closed

BunsDev mentioned this pull request May 14, 2026

fix(agents): make trajectory cleanup timeout configurable #81622

Merged


clawsweeper Bot closed this May 15, 2026

galiniliev mentioned this pull request May 17, 2026

fix(agents): add trajectory flush timeout diagnostics #82962

Merged


loyur wants to merge 1 commit into openclaw:main from loyur:fix/pi-trajectory-flush-event-loop-yield
