Skip to content

fix(codex): fail fast after quiescent turn completion stalls#82172

Merged
steipete merged 3 commits into
openclaw:mainfrom
funmerlin:fix/codex-quiescent-turn-completion-stall
May 15, 2026
Merged

fix(codex): fail fast after quiescent turn completion stalls#82172
steipete merged 3 commits into
openclaw:mainfrom
funmerlin:fix/codex-quiescent-turn-completion-stall

Conversation

@funmerlin

Copy link
Copy Markdown
Contributor

Summary

  • fail fast when the last non-assistant current-turn item completes and Codex never follows with turn/completed
  • keep the completed-assistant watchdog path separate so assistant-item release semantics stay unchanged
  • add a regression test for the remaining silent-turn gap that was not covered by the existing post-tool or completed-assistant tests

Refs #82171.

Verification

  • pnpm test extensions/codex/src/app-server/run-attempt.test.ts -t 'keeps the post-tool completion watchdog armed across dynamic tool completion bookkeeping|times out promptly when the last completed non-assistant current-turn item is not followed by turn completion|does not release the session after only a raw assistant response item|releases the session when a completed agent message item goes quiet'
  • pnpm build

@openclaw-barnacle openclaw-barnacle Bot added extensions: codex size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 15, 2026
@clawsweeper

clawsweeper Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Summary
The PR re-arms the Codex app-server short completion idle watchdog after the final non-assistant current-turn item completes and adds a regression test for that stall.

Reproducibility: yes. by source inspection: the current-main branch disarms the short completion watchdog after the reduced final item/completed sequence, so it waits for the longer terminal guard. I did not execute the test because this review was constrained to read-only inspection.

Real behavior proof
Needs real behavior proof before merge: The PR body provides test and build commands only, with no after-fix proof from a real Codex/OpenClaw run. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
Contributor action is needed for real behavior proof before normal maintainer review or merge; there is no narrow automated repair to make on the branch.

Security
Cleared: The diff only changes Codex app-server timeout control flow and a colocated test; it does not touch dependencies, CI, secrets, permissions, downloads, or publishing paths.

Review details

Best possible solution:

Land this after maintainer review and redacted real-run proof, keeping the change scoped to Codex app-server watchdog semantics and its regression coverage.

Do we have a high-confidence way to reproduce the issue?

Yes by source inspection: the current-main branch disarms the short completion watchdog after the reduced final item/completed sequence, so it waits for the longer terminal guard. I did not execute the test because this review was constrained to read-only inspection.

Is this the best way to solve the issue?

Yes for the code direction: re-arming the existing short watchdog at the final non-assistant current-turn completion is the narrowest fix for the identified gap. Merge should still wait for real behavior proof from the contributor's setup.

What I checked:

  • Current-main stall path: Current main updates active turn items before watchdog handling, then disarms the short completion watchdog for current-turn non-terminal notifications unless the notification is a tracked OpenClaw dynamic-tool completion; there is no last-item-completed exception on main. (extensions/codex/src/app-server/run-attempt.ts:1247, 1b87ba8ca57b)
  • Short-timeout behavior: When armed, the short completion watchdog marks the run timed out and aborts with the codex app-server turn idle timed out waiting for turn/completed message, which is the behavior the PR is trying to trigger for the final-item silence case. (extensions/codex/src/app-server/run-attempt.ts:1006, 1b87ba8ca57b)
  • PR implementation: The head diff adds shouldRearmCompletionIdleWatchAfterLastCurrentTurnItem, reuses the tracked dynamic-tool completion check, and skips the later disarm branch when the final non-assistant completion should keep the short watchdog armed. (extensions/codex/src/app-server/run-attempt.ts:1254, 6be7297e3480)
  • Regression coverage in PR: The PR adds a test that starts a turn, sends item/started, sends the matching item/completed, withholds turn/completed, and expects the short completion timeout plus best-effort turn/interrupt. (extensions/codex/src/app-server/run-attempt.test.ts:2458, 6be7297e3480)
  • Related report context: The linked report at Codex app-server can stall after the last current-turn item completes without turn/completed #82171 describes the reduced reproduction shape and distinguishes it from the already-covered tracked dynamic-tool and completed-assistant cases.
  • Real behavior proof gap: The PR body lists a targeted test command and pnpm build, but no terminal output, runtime log, screenshot, recording, or linked artifact showing the after-fix behavior in a real setup; the PR is also labeled triage: needs-real-behavior-proof. (6be7297e3480)

Likely related people:

  • @steipete: Recent current-main commits in the Codex app-server watchdog path introduced and adjusted the silent-turn timeout behavior this PR extends. (role: recent area contributor; confidence: high; commits: 1e31bd2ac29d, 934fc6ceeb10; files: extensions/codex/src/app-server/run-attempt.ts, extensions/codex/src/app-server/run-attempt.test.ts, docs/plugins/codex-harness.md)

Remaining risk / open question:

  • The contributor has not provided after-fix real behavior proof from a live Codex/OpenClaw run; tests and build output alone do not satisfy the external PR proof gate.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1b87ba8ca57b.

@steipete

steipete commented May 15, 2026

Copy link
Copy Markdown
Contributor

Maintainer proof for landing #82172.

Behavior addressed: Codex app-server can go quiet after the last non-assistant current-turn item/completed without sending turn/completed; current main waits for the outer attempt timeout instead of the short turn-completion idle watchdog.

Before proof on current main:

  • Command: node scripts/run-vitest.mjs extensions/codex/src/app-server/run-attempt.test.ts -t 'times out promptly when the last completed non-assistant current-turn item is not followed by turn completion' --reporter=verbose with only the new regression test temporarily applied.
  • Result: failed as expected; actual promptError was codex app-server attempt timed out, not codex app-server turn idle timed out waiting for turn/completed.

After-fix proof on PR logic:

  • Command: node scripts/run-vitest.mjs extensions/codex/src/app-server/run-attempt.test.ts -t 'keeps the post-tool completion watchdog armed across dynamic tool completion bookkeeping|times out promptly when the last completed non-assistant current-turn item is not followed by turn completion|does not release the session after only a raw assistant response item|releases the session when a completed agent message item goes quiet|keeps waiting when a current-turn item is still active' --reporter=verbose
  • Result: 5 passed, 115 skipped.
  • Command: git diff --check
  • Result: passed.

Real environment tested:

  • Local OpenClaw CLI turn using the real local embedded agent path and WebSocket app-server transport, with an isolated config and a minimal local WebSocket app-server that responds to initialize, thread/start, and turn/start, then emits item/started plus final non-assistant item/completed and deliberately omits turn/completed.
  • Command shape: pnpm openclaw agent --local --agent main --message 'prove watchdog' --model codex/gpt-5.5 --json --timeout 5 with plugins.entries.codex.config.appServer.transport=websocket and turnCompletionIdleTimeoutMs=50.
  • Observed transport sequence: initialize -> thread/start -> turn/start -> item/started -> item/completed -> turn/interrupt.
  • Observed runtime log: codex app-server turn idle timed out waiting for completion.
  • Observed JSON result: meta.aborted: true, agentMeta.provider: codex, agentHarnessId: codex, toolSummary.tools: [sessions_list].

CI/proof gaps:

  • Did not run full suite locally; this is a narrow Codex app-server watchdog change with focused regression and transport-level proof.
  • GitHub Real behavior proof check failed only because the external PR body lacked structured proof; applying maintainer proof: override based on the above local real-run proof.

Thanks @funmerlin.

Re-review progress:

@steipete steipete added proof: override Maintainer override for the external PR real behavior proof gate. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 15, 2026
@steipete steipete force-pushed the fix/codex-quiescent-turn-completion-stall branch from b352ae6 to d73b187 Compare May 15, 2026 15:14
@steipete steipete merged commit 127156a into openclaw:main May 15, 2026
83 of 84 checks passed
@steipete

Copy link
Copy Markdown
Contributor

Landed in 127156a from source head d73b187.

Verification before merge:

  • node scripts/run-vitest.mjs extensions/codex/src/app-server/run-attempt.test.ts -t 'keeps the post-tool completion watchdog armed across dynamic tool completion bookkeeping|times out promptly when the last completed non-assistant current-turn item is not followed by turn completion|does not release the session after only a raw assistant response item|releases the session when a completed agent message item goes quiet|keeps waiting when a current-turn item is still active' --reporter=verbose: 5 passed, 115 skipped on the rebased head.
  • git diff --check origin/main..HEAD: passed.
  • Real behavior proof: local OpenClaw CLI turn over WebSocket app-server transport emitted turn/interrupt after the short idle watchdog when the fake app-server sent final non-assistant item/completed and omitted turn/completed.

Thanks @funmerlin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

extensions: codex proof: override Maintainer override for the external PR real behavior proof gate. size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants