fix: prevent event loop saturation from trajectory flush (setImmediate yield + 10MB cap + 30s timeout)#77133
Conversation
Three changes to prevent the trajectory flush from blocking the event loop for 25+ minutes under heavy session load: 1. queued-file-writer: yield event loop via setImmediate between each queued write to prevent the promise chain from consuming 100% of the event loop. Without this yield, 700+ queued writes (50MB trajectory) block the event loop continuously. 2. trajectory paths: reduce TRAJECTORY_RUNTIME_FILE_MAX_BYTES from 50MB to 10MB to prevent any single trajectory file from growing large enough to cause multi-second flush delays. 3. run-cleanup-timeout: increase AGENT_CLEANUP_STEP_TIMEOUT_MS from 10s to 30s to give large trajectories more time to flush before the timeout warning fires. Closes #75839
|
Thanks for the context here. I swept through the related work, and this is now duplicate or superseded. Close as superseded: merged replacement work now covers the useful trajectory-flush changes on current main, while this branch’s remaining constant-only diff is less compatible than the landed implementation. Canonical path: Keep the merged bounded-capture/yielding writer and configurable-timeout implementations, and close this unmerged branch instead of reviving its broader constant changes. So I’m closing this here because the remaining work is already tracked in the canonical issue. Review detailsBest possible solution: Keep the merged bounded-capture/yielding writer and configurable-timeout implementations, and close this unmerged branch instead of reviving its broader constant changes. Do we have a high-confidence way to reproduce the issue? No. I did not run a large-trajectory benchmark; current main no longer has the exact unbounded live-capture/no-yield/fixed-only timeout path because the trajectory writer, capture budget, and timeout resolver have been replaced. Is this the best way to solve the issue? No. This branch is no longer the best solution because it lowers the export cap globally and only raises a constant; the merged replacements preserve export compatibility and add an opt-in timeout knob. Security review: Security review cleared: The diff changes local file-write scheduling and timeout/size constants only; it adds no dependencies, network calls, permissions, or secret-handling changes. What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 333f65fc8a12. |
Problem
When a session accumulates a large trajectory (50MB+, 700+ events),
pi-trajectory-flushblocks the event loop for 25+ minutes after the 10s timeout fires. The timeout warns but doesn't stop the flush, and the event loop stays at 100% utilization with P99 delays of 34 seconds — making the gateway completely unresponsive.Root Cause
QueuedFileWriterchains writes into an ever-growing promise chain without ever yielding the event loop. With 700+ individualappendFilecalls, the chain consumes 100% of the event loop. After the cleanup timeout fires (10s), the chain continues running — thePromise.raceinrunAgentCleanupSteponly logs a warning, it doesn't abort the cleanup.Changes
1.
queued-file-writer.ts— Yield event loop between writesAdd
setImmediatebetween each queued write so the event loop gets control back:This prevents the promise chain from monopolizing the event loop. Each write still happens sequentially, but other work (message dispatch, WebSocket events) can be processed between writes.
2.
paths.ts— Reduce trajectory file capTRAJECTORY_RUNTIME_FILE_MAX_BYTES: 50MB → 10MBA single session shouldn't produce 50MB of trajectory. 10MB (~140 events at 70KB avg) is sufficient for debugging while keeping flush time manageable.
3.
run-cleanup-timeout.ts— Increase cleanup timeoutAGENT_CLEANUP_STEP_TIMEOUT_MS: 10s → 30sWith the event loop yielding (change #1), flushes complete faster. But 10s is still tight for large sessions. 30s provides adequate margin.
Verification
Fixes #75839
Related: #76340, #77115, #76421