Codex app-server emits notification:turn/started then goes silent; embedded run wedges for the full stuck-session recovery window #85251
Open
Labels
- `P1`: High-priority user-facing bug, regression, or broken workflow.
- `clawsweeper:fix-shape-clear`: ClawSweeper found a clear likely implementation shape for this issue.
- `clawsweeper:queueable-fix`: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
- `clawsweeper:source-repro`: ClawSweeper found a high-confidence source-level issue reproduction.
- `impact:message-loss`: Channel message delivery can be lost, duplicated, or misrouted.
- `impact:session-state`: Session, memory, transcript, context, or agent state can drift or corrupt.
- `issue-rating: 🦞 diamond lobster`: Very strong issue quality with high-confidence source-level or clear reproduction.
Summary
Codex app-server emits `notification:turn/started` for a turn and then goes completely silent: no deltas, no `turn/completed`, no `turn/error`. The session sits in `embedded_run` state until OpenClaw's `stuck session recovery` fires (default 360s) and force-aborts it. Operator-visible symptom: the agent received the message but never replies.

Environment
- `2026.5.20` (`/opt/homebrew/lib/node_modules/openclaw`)
- `node /opt/homebrew/bin/codex app-server --enable goals --listen unix://app-server.sock`
- `openai-codex` via OAuth (ChatGPT account)
- `system-architect`, `librarian`, `codex`

Reproduction (observed 2026-05-22 13:46–13:52 local)
Two librarian sessions started at 13:46:09 after a fresh gateway kickstart and immediately wedged on `codex_app_server:notification:turn/started`. Both stalled for ~360s with no progress events until stuck-session recovery aborted them:
- `sessionId=d729f8f8-fe2a-40e4-8778-8be979511d1f` `sessionKey=agent:librarian:telegram:direct:6689123501`
- `sessionId=681d1bb8-4291-4977-b0ca-69e059beaf66` `sessionKey=agent:librarian:main`

A third session hit a different stall mode (`activeTool=bash` exec hang) on the same Codex backend:
- `sessionId=8a6d0f15-3d6b-4e7d-91e8-3b60ee451207` `sessionKey=agent:system-architect:telegram:group:-1003731083010:topic:479`

Recovery log lines:
Correlated `[diagnostics/liveness] liveness warning: reasons=event_loop_delay … eventLoopDelayMaxMs=5469` immediately after the first stall: the gateway main thread was blocked for ~5s, likely while waiting on a Codex socket response that never came.

Why this matters
When the Codex app-server hangs after `turn/started`, the user sees a completely silent failure: gateway and OpenClaw look healthy and Telegram outbound works, but their request just disappears for 6 minutes until recovery. The recovery does kill the wedged run, but it does not retry the user's request, so the message is effectively lost.

Suspected cause
Open hypotheses:
- The model is referenced through an `agents.defaults.models.*` alias but the underlying provider entry is missing (related: my other bug today involved a hot reload adding `openai/gpt-5.4-mini`, `google/gemini-3.x`, `xai/grok-4.20-0309-*` aliases without registering them under `models.providers`). Codex may have accepted the turn and then deadlocked when resolving the model.
- Stale `node /opt/homebrew/bin/codex app-server` processes from 5-8 days ago, all reparented to launchd with `app-server.sock` bindings. The gateway's choice of which socket to talk to may be hitting a dead one. (Filing this separately if it stays after socket re-resolution improvements.)
- No bounded timeout on the `turn/started` → `turn/completed` round-trip. Recovery is `min_age=300s`, which is much longer than any reasonable user-visible patience.

Expected behavior
Either: (a) Codex app-server should emit `turn/error` (or close the stream) within bounded time if it cannot make progress; or (b) the gateway side should have a per-turn watchdog that surfaces `event:codex_turn_timeout` to the user and lets them retry, rather than burying the failure under a generic `stuck session recovery`.

Proposed fix shape
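A gateway-side per-turn watchdog could be sketched roughly as follows. This is a minimal TypeScript illustration under stated assumptions, not OpenClaw's or Codex's actual code: the class `TurnProgressWatchdog`, the `Scheduler` interface, the method names, and the shape of the synthetic error object are all hypothetical. Only the `CODEX_TURN_STALLED` code and the 60s default mirror values suggested in this issue.

```typescript
// Hypothetical sketch: none of these names exist in OpenClaw or Codex.

// Timer abstraction so the watchdog can be driven by fake timers in tests.
interface Scheduler {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const realScheduler: Scheduler = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

// Assumed shape of the synthetic error handed to the embedded run consumer.
interface SyntheticTurnError {
  method: "turn/error";
  turnId: string;
  code: "CODEX_TURN_STALLED";
  message: string;
}

class TurnProgressWatchdog {
  private readonly timers = new Map<string, unknown>();
  private readonly onStall: (err: SyntheticTurnError) => void;
  private readonly thresholdMs: number;
  private readonly scheduler: Scheduler;

  constructor(
    onStall: (err: SyntheticTurnError) => void,
    thresholdMs: number = 60_000, // stands in for channels.codex.turnProgressThresholdMs
    scheduler: Scheduler = realScheduler,
  ) {
    this.onStall = onStall;
    this.thresholdMs = thresholdMs;
    this.scheduler = scheduler;
  }

  // Call when notification:turn/started arrives: start the stall timer.
  turnStarted(turnId: string): void {
    this.disarm(turnId);
    const handle = this.scheduler.set(() => {
      this.timers.delete(turnId);
      this.onStall({
        method: "turn/error",
        turnId,
        code: "CODEX_TURN_STALLED",
        message: `no progress within ${this.thresholdMs}ms of turn/started`,
      });
    }, this.thresholdMs);
    this.timers.set(turnId, handle);
  }

  // Call on the first turn/delta or on turn/completed: the turn made progress.
  progressSeen(turnId: string): void {
    this.disarm(turnId);
  }

  private disarm(turnId: string): void {
    if (this.timers.has(turnId)) {
      this.scheduler.clear(this.timers.get(turnId));
      this.timers.delete(turnId);
    }
  }
}
```

In this sketch the timer is cleared on the first sign of progress, which matches the "started without progress" case reported here. Re-arming it on every delta instead would also catch turns that stall midway, at the cost of more timer churn.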
In `dist/cli/codex-runtime.ts` (or equivalent), add a `turn-started-without-progress` watchdog:
- Start a timer when `notification:turn/started` is received.
- If no `notification:turn/delta` or `notification:turn/completed` arrives within `channels.codex.turnProgressThresholdMs` (default e.g. 60s), emit a synthetic `turn/error` to the embedded run consumer with `code=CODEX_TURN_STALLED`.

Also worth investigating whether long-lived stale `app-server` sockets need broker-side health probes before being selected.

Workaround
`launchctl kickstart -k gui/$(id -u)/ai.openclaw.gateway` clears all wedged sessions and spawns fresh Codex socket connections. Today this restored service within ~10s.

Related