fix(codex): prevent false idle timeouts under notification queue delay#87070
fix(codex): prevent false idle timeouts under notification queue delay#87070Marvinthebored wants to merge 1 commit into
Conversation
openclaw#86948) The completion-idle timer (60 s) fires based on when notifications are *processed* inside the chained notification queue, not when they are *received* from the app-server binary over stdio. Since ea16a5e added a `setImmediate` yield between each queued dispatch, every notification is delayed by at least one event-loop tick. Under heavy load (tool-call-intensive turns) the ticks accumulate, and the `turn/completed` notification can arrive from the binary but sit in the queue for longer than 60 s — at which point the idle timer fires and aborts the turn. Fix: 1. Touch `turnCompletionLastActivityAt` and `turnAttemptLastProgressAt` at *enqueue* time (when readline delivers the line) so the idle timer sees live binary I/O even when the queue is backed up. 2. Guard all three idle-timeout handlers (`fireTurnCompletionIdleTimeout`, `fireTurnAttemptIdleTimeout`, `fireTurnTerminalIdleTimeout`) with the existing `terminalTurnNotificationQueued` flag so they never fire when `turn/completed` has already been received but not yet processed. Closes openclaw#86948 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Codex review: needs maintainer review before merge. Reviewed May 26, 2026, 7:19 PM ET / 23:19 UTC. Summary PR surface: Source +18. Total +18 across 1 file. Reproducibility: yes. Source inspection and the PR's copied live logs give a high-confidence reproduction path: current main waits for processed notification activity while Review metrics: 2 noteworthy metrics.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the focused Codex app-server fix after maintainer availability review, ideally with exact proof-field formatting and a targeted queue-delay regression test or explicit acceptance that the live proof is enough for this race. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection and the PR's copied live logs give a high-confidence reproduction path: current main waits for processed notification activity while Is this the best way to solve the issue? Yes. The proposed change is a narrow owner-boundary fix that reuses existing terminal-queued state and updates active-turn receipt timestamps instead of changing provider routing or fallback policy; a focused regression test would make it stronger. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 068d88c14261. Label changesLabel justifications:
Evidence reviewedPR surface: Source +18. Total +18 across 1 file. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
ClawSweeper PR egg ✨ Hatched: 🌱 uncommon Velvet Test Hopper Hatch commandComment Hatchability rules:
Rarity: 🌱 uncommon. What is this egg doing here?
|
|
Thanks for the focused fix. I used the useful part of this patch in #87096, but this PR also made the queued-terminal flag suppress the full turn-attempt watchdog. That changes the failure mode from a false short idle timeout into a possible permanent run/lane hang if turn/completed is received while an earlier notification projection is stuck behind an awaited channel/progress callback. The replacement keeps the receive-time activity touch and suppresses only the short completion/terminal idle guards; the broader attempt watchdog remains independent so a wedged queue can still release the run. Closing this in favor of #87096 with a regression test for terminal completion queued behind projection. Thanks again for finding the race. |
Summary
terminalTurnNotificationQueuedin all three idle timeout handlers so they don't fire whenturn/completedis already queued but not yet processedturnCompletionLastActivityAt/turnAttemptLastProgressAtat enqueue time inenqueueNotification, not just at process timeCompanion to #87022 — that PR bounds the fallout (client retirement, failover policy) when a timeout fires; this PR prevents the false timeout from firing in the first place.
Root cause
waitForCodexNotificationDispatchTurn()yields viasetImmediate(introduced inea16a5e9e1/ #82333). Under event-loop saturation,turn/completedsits in the notification queue past the 60s idle threshold. The idle timer checks process-time timestamps, but the notification hasn't been processed yet — only enqueued. TheterminalTurnNotificationQueuedflag (added in4cbf616d30) already marks this at enqueue time, but the idle timeout handlers don't check it.Changes
extensions/codex/src/app-server/run-attempt.ts:fireTurnAttemptIdleTimeout: addterminalTurnNotificationQueuedto guard conditionfireTurnCompletionIdleTimeout: addterminalTurnNotificationQueuedto guard conditionfireTurnTerminalIdleTimeout: addterminalTurnNotificationQueuedto guard conditionenqueueNotification: touchturnCompletionLastActivityAtandturnAttemptLastProgressAtat receive time for notifications matching the active turn, so the idle timer sees live binary I/O even when the queue is backed upReal behavior proof
Before (on
5a7d5c6def, includes #87022): Two false idle timeouts in 25 minutes, TUI channel,openai/gpt-5.5:Monitor CSV (pre-fix):
After (fix applied, gateway restarted at 23:56): Turn ran 4m 38s+ past the 60s threshold, zero false timeouts:
Zero timeouts, CPU 0.1–1.4%, fds stable at 89–90, TCP stable at 6–7.
Separate issue noted
During post-fix testing, a genuine binary stall was observed — the codex binary stops sending notifications after
rawResponseItem/completedandturn/completednever arrives (377s+). This fix correctly does NOT suppress that timeout, sinceterminalTurnNotificationQueuedis never set whenturn/completedgenuinely never arrives. Filed separately.Refs #86948, #87022.
cc @steipete @Peetiegonzalez