[Bug]: Codex app-server stalls after `item/completed`, then aborts without recovery/status

### Bug type

Behavior bug (incorrect output/state without crash)

### Beta release blocker

No

### Summary

OpenClaw `2026.5.18` still loses productive Codex app-server turns when the last observed current-turn notification is `item/completed` and no `turn/completed` follows.

The already-merged fixes for #78756 and #82171 appear to be present in this installation. The current behavior is therefore not a missing-fix case, but a remaining recovery/turn-semantics problem:

- the session lane enters `processing`
- diagnostics report `active_work_without_progress`
- `lastProgress=codex_app_server:notification:item/completed`
- `recovery=none`
- after `turnCompletionIdleTimeoutMs`, OpenClaw aborts the run
- no useful visible recovery/status is delivered for the failed work
- already-started work is not resumed

This makes chat lanes look silent or stuck and can drop real work after a completed tool call.

### Steps to reproduce

1. Run OpenClaw with a user-facing chat lane (reproduced here in a Discord channel and in a Telegram direct chat).
2. Configure an OpenAI GPT model to use the Codex app-server runtime.
3. Disable model fallbacks so that an Anthropic fallback does not hide the Codex failure.
4. Set `plugins.entries.codex.config.appServer.turnCompletionIdleTimeoutMs` to `180000` so that it is unambiguous which watchdog fires.
5. In Discord, ask the agent to do a multi-step file-producing task, for example building a static multi-page web presence from existing project drafts.
6. Observe that the assistant completes one tool item and then no `turn/completed` arrives.
7. Watch diagnostics until the completion-idle timeout fires.

Relevant redacted config used during the test:

```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "openai/gpt-5.5",
        "fallbacks": []
      },
      "timeoutSeconds": 900
    }
  },
  "plugins": {
    "entries": {
      "codex": {
        "config": {
          "appServer": {
            "turnCompletionIdleTimeoutMs": 180000
          }
        }
      }
    }
  }
}
```

Discord reproduction sequence from the session JSONL:

```text
2026-05-19T08:07:43.604Z user prompt from Discord
2026-05-19T08:07:43.995Z assistant toolCall: bash mkdir -p /home/casper/.openclaw/workspace/artifacts/maria-ward-smartphone-start/site/assets/img
2026-05-19T08:07:44.092Z toolResult: completed exitCode 0 durationMs 0
```

No further assistant work for the requested site build was written before the timeout. The only filesystem result was the directory creation.

### Expected behavior

OpenClaw should not silently drop a productive Codex app-server turn after a completed tool item if the turn is still expected to continue.

At minimum, if OpenClaw decides the app-server turn is unrecoverably incomplete because `turn/completed` never arrived, it should:

- release the session lane
- send a visible channel status explaining the failed turn
- preserve enough state to allow the user to retry/resume
- avoid misleading explanations such as user/UI interruption when the log cause is `turn_completion_idle_timeout`
- avoid losing already-started work without a user-visible failure/recovery message
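
As an illustration of the visible channel status described above, here is a minimal sketch. The `TurnIdleTimeout` interface and `formatTimeoutStatus` function are hypothetical names, not OpenClaw APIs; the fields are modelled on the gateway log excerpt in this report, and `lastCompletedTool` is an assumed extra field.

```typescript
// Hypothetical shape, modelled on the fields of the
// "turn idle timed out waiting for completion" log entry.
// This is not OpenClaw's actual internal type.
interface TurnIdleTimeout {
  threadId: string;
  turnId: string;
  idleMs: number;
  timeoutMs: number;
  lastNotificationMethod: string;
  lastCompletedTool?: string; // assumed extra field, e.g. "bash"
}

// Build a channel-visible status for a turn that never reached turn/completed.
function formatTimeoutStatus(t: TurnIdleTimeout): string {
  const idleSeconds = Math.round(t.idleMs / 1000);
  const lastStep = t.lastCompletedTool
    ? `the last completed step was a "${t.lastCompletedTool}" tool call`
    : `the last notification was "${t.lastNotificationMethod}"`;
  return [
    `The Codex app-server turn went silent for ${idleSeconds}s and was aborted (cause: turn_completion_idle_timeout).`,
    `This was not a user or UI interruption; ${lastStep}.`,
    "The requested work is incomplete. Resend the request to retry.",
  ].join("\n");
}

// Usage with the values from the Discord reproduction.
console.log(
  formatTimeoutStatus({
    threadId: "019e3f2d-b7f2-7443-ab96-4e72fe219fe1",
    turnId: "019e3f43-8034-7001-88af-70ffeb9bdb43",
    idleMs: 180003,
    timeoutMs: 180000,
    lastNotificationMethod: "item/completed",
    lastCompletedTool: "bash",
  }),
);
```

Naming the logged cause in the message is the point of the sketch: it keeps later status questions from being answered with a user/UI-abort explanation.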

Better behavior would distinguish:

- completed tool call followed by expected assistant continuation
- genuinely terminal item completion
- missing/late `turn/completed`
- app-server still computing vs. app-server protocol dead-air
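
The distinctions above can be sketched as a classifier over two independent questions: what kind of item was seen last, and whether the app-server is still showing signs of life. All names here are hypothetical, and the sketch assumes a liveness signal (`msSinceServerActivity`) that the current protocol may not expose.

```typescript
// Hypothetical observation of a turn that has not (yet) completed.
interface TurnObservation {
  lastItemKind: "tool_result" | "assistant_message"; // assumed item kinds
  turnCompletedSeen: boolean;
  msSinceServerActivity: number; // assumed liveness signal
  deadAirThresholdMs: number;
}

interface StallClassification {
  // Was more assistant work expected after the last item?
  expectation: "continuation_expected" | "terminal_item";
  // Is the server plausibly still working, or has the protocol gone quiet?
  liveness: "still_computing" | "dead_air";
}

// Returns null when turn/completed was seen, i.e. there is no stall to classify.
function classifyStall(o: TurnObservation): StallClassification | null {
  if (o.turnCompletedSeen) return null;
  return {
    // A tool result is normally followed by more assistant output.
    expectation:
      o.lastItemKind === "tool_result" ? "continuation_expected" : "terminal_item",
    liveness:
      o.msSinceServerActivity < o.deadAirThresholdMs ? "still_computing" : "dead_air",
  };
}

// Usage: the Discord reproduction, where a bash tool result was the last item.
const observed = classifyStall({
  lastItemKind: "tool_result",
  turnCompletedSeen: false,
  msSinceServerActivity: 180003,
  deadAirThresholdMs: 180000,
});
console.log(observed);
```

Keeping the two axes separate means a missing `turn/completed` after final assistant text and one after a tool result get different classifications, even when the idle time is identical.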

### Actual behavior

The run is aborted after the completion idle timeout. Diagnostics explicitly say `recovery=none`.

In the Discord reproduction, only a directory was created; no requested site files were produced. The user saw typing/activity disappear and no useful recovery surfaced.

Subsequent status questions can create confusing assistant explanations that imply a user/UI abort, even though the durable gateway evidence for the original run points to `turn_completion_idle_timeout`.

### OpenClaw version

OpenClaw 2026.5.18 (50a2481)

### Operating system

Ubuntu

### Install method

npm global

### Model

gpt-5.5

### Provider / routing chain

openai-codex/gpt-5.5 -> Codex app-server harness -> OpenClaw embedded run -> Discord/Telegram chat lane

### Additional provider/model setup details

Fallbacks were disabled during the primary test:

```json
"fallbacks": []
```

This was intentional to avoid an Anthropic fallback hiding the Codex app-server failure.

`turnCompletionIdleTimeoutMs` was deliberately raised to `180000` during testing. The same pattern had previously been observed with the shorter default idle timeout; raising the value made it unambiguous which watchdog fired.

Earlier tests with fallbacks enabled caused additional confusing behavior: OpenClaw fell back to Anthropic, then hit context overflow/compaction and separate `message` tool delivery errors.

Related issues/PRs:

- #78756: Codex app-server turns time out after 60s despite meaningful progress
- #79667: fix(codex): ignore account updates for turn liveness
- #82171: Codex app-server can stall after the last current-turn item completes without turn/completed
- #82172: fix(codex): fail fast after quiescent turn completion stalls

### Logs, screenshots, and evidence

Gateway log excerpts:

```text
2026-05-19T08:04:04.805Z [agent/embedded]
strict-agentic execution contract active:
runId=fa6f5365-411f-4028-8985-a9ec7a9b35a4
sessionId=ac54314e-d1ad-4145-b8fe-932309953759
provider=openai-codex/gpt-5.5 harness=codex



2026-05-19T08:07:04.822Z [diagnostic]
stalled session:
sessionId=ac54314e-d1ad-4145-b8fe-932309953759
sessionKey=agent:main:discord:channel:1497109509825626232
state=processing age=142s queueDepth=1
reason=active_work_without_progress
classification=stalled_agent_run
activeWorkKind=embedded_run
lastProgress=codex_app_server:notification:item/completed
lastProgressAge=141s
recovery=none



2026-05-19T08:07:34.819Z [diagnostic]
stalled session:
sessionId=ac54314e-d1ad-4145-b8fe-932309953759
sessionKey=agent:main:discord:channel:1497109509825626232
state=processing age=172s queueDepth=1
reason=active_work_without_progress
classification=stalled_agent_run
activeWorkKind=embedded_run
lastProgress=codex_app_server:notification:item/completed
lastProgressAge=171s
recovery=none



2026-05-19T08:07:43.435Z [agent/embedded]
codex app-server turn idle timed out waiting for completion
{
  threadId: "019e3f2d-b7f2-7443-ab96-4e72fe219fe1",
  turnId: "019e3f43-8034-7001-88af-70ffeb9bdb43",
  idleMs: 180003,
  timeoutMs: 180000,
  lastActivityReason: "notification:item/completed",
  lastNotificationMethod: "item/completed"
}



2026-05-19T08:07:43.457Z [agent/embedded]
codex app-server client retired after timed-out turn
{
  threadId: "019e3f2d-b7f2-7443-ab96-4e72fe219fe1",
  turnId: "019e3f43-8034-7001-88af-70ffeb9bdb43",
  reason: "turn_completion_idle_timeout",
  clearedSharedClient: true
}



2026-05-19T08:07:44.198Z [agent/embedded]
embedded run failover decision
{
  runId: "fa6f5365-411f-4028-8985-a9ec7a9b35a4",
  stage: "assistant",
  decision: "surface_error",
  failoverReason: "timeout",
  profileFailureReason: "timeout",
  provider: "openai-codex",
  model: "gpt-5.5",
  fallbackConfigured: false,
  timedOut: true,
  aborted: true
}
```

While diagnosing the Discord stall from Telegram, the Telegram direct session itself hit the same failure mode:

```text
2026-05-19T08:14:59.977Z [agent/embedded]
strict-agentic execution contract active:
runId=6e9f7eb1-5418-4d5c-aabc-df8a1e7f7619
sessionId=9578d939-b2fd-4ec9-b65b-8a93348ca570
provider=openai-codex/gpt-5.5 harness=codex



2026-05-19T08:17:38.070Z [diagnostic]
stalled session:
sessionId=9578d939-b2fd-4ec9-b65b-8a93348ca570
sessionKey=agent:main:telegram:direct:287384854
state=processing age=129s queueDepth=1
reason=active_work_without_progress
classification=stalled_agent_run
activeWorkKind=embedded_run
lastProgress=codex_app_server:notification:item/completed
lastProgressAge=129s
recovery=none



2026-05-19T08:18:08.068Z [diagnostic]
stalled session:
sessionId=9578d939-b2fd-4ec9-b65b-8a93348ca570
sessionKey=agent:main:telegram:direct:287384854
state=processing age=159s queueDepth=1
reason=active_work_without_progress
classification=stalled_agent_run
activeWorkKind=embedded_run
lastProgress=codex_app_server:notification:item/completed
lastProgressAge=159s
recovery=none



2026-05-19T08:18:29.525Z [agent/embedded]
codex app-server turn idle timed out waiting for completion
{
  threadId: "019e3ef4-0e36-7b32-b9e1-36b98cc115a8",
  turnId: "019e3f4d-7f38-74e2-82fc-2557e24a98b1",
  idleMs: 180001,
  timeoutMs: 180000,
  lastActivityReason: "notification:item/completed",
  lastNotificationMethod: "item/completed"
}



2026-05-19T08:18:30.061Z [agent/embedded]
embedded run failover decision
{
  runId: "6e9f7eb1-5418-4d5c-aabc-df8a1e7f7619",
  stage: "assistant",
  decision: "surface_error",
  failoverReason: "timeout",
  profileFailureReason: "timeout",
  provider: "openai-codex",
  model: "gpt-5.5",
  fallbackConfigured: false,
  timedOut: true,
  aborted: true
}
```
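
To find this pattern in a gateway log, the `key=value` tokens of a `[diagnostic]` entry can be parsed and matched. The sketch below is a standalone helper, not part of OpenClaw; the function names are hypothetical and it assumes entries keep the whitespace-separated `key=value` layout shown above.

```typescript
// Collect whitespace-separated key=value tokens from one diagnostic entry.
// Tokens without "=" (timestamp, "[diagnostic]", "stalled session:") are skipped.
function parseDiagnosticFields(entry: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const token of entry.split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  return fields;
}

// True for the failure mode in this report: a stall whose last progress was
// an item/completed notification and for which no recovery was attempted.
function isUnrecoveredItemCompletedStall(fields: Record<string, string>): boolean {
  const lastProgress = fields["lastProgress"];
  return (
    fields["reason"] === "active_work_without_progress" &&
    lastProgress !== undefined &&
    lastProgress.endsWith("notification:item/completed") &&
    fields["recovery"] === "none"
  );
}

// Usage with the first Discord diagnostic entry from the excerpt.
const entry = [
  "2026-05-19T08:07:04.822Z [diagnostic]",
  "stalled session:",
  "sessionId=ac54314e-d1ad-4145-b8fe-932309953759",
  "state=processing age=142s queueDepth=1",
  "reason=active_work_without_progress",
  "lastProgress=codex_app_server:notification:item/completed",
  "lastProgressAge=141s",
  "recovery=none",
].join("\n");

const fields = parseDiagnosticFields(entry);
console.log(fields["sessionId"], isUnrecoveredItemCompletedStall(fields));
```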

### Impact and severity

Severity: high for user-facing chat lanes using Codex app-server.

Impact:

- User-facing Discord/Telegram lanes can appear silent or stuck.
- Real work may be dropped after a completed tool call.
- Diagnostics say `recovery=none`, leaving no clear user-facing recovery path.
- The failure can be confused with a user/UI abort even though logs show `turn_completion_idle_timeout`.
- Increasing `turnCompletionIdleTimeoutMs` only delays the abort; it does not solve recovery.


### Additional information

Why #78756 and #82171 do not fully cover this:

The fixes appear to be present and working in a narrow sense:

- account/rate-limit updates are not prolonging this stall indefinitely
- the session does not wait for the 30-minute terminal cap
- the configured completion-idle watchdog fires

However, that still leaves a correctness/recovery gap:

- productive work can be aborted after the last observed `item/completed`
- no useful visible recovery is emitted
- no resume/retry path is provided
- the lane is not self-healing in a user-meaningful way

This looks like a remaining bug adjacent to #82171: the fail-fast behavior prevents long hangs, but it does not provide correct turn semantics or recovery when `turn/completed` is missing.

Suggested fix direction:

1. Preserve and expose a structured recovery result when `turn_completion_idle_timeout` fires after `item/completed`.
2. Emit a visible channel message when a user-facing lane aborts due to missing `turn/completed`, including the last completed item/tool and retry guidance.
3. Add a retry/resume mechanism that restarts the turn with a compact summary of already-completed tool calls and their results.
4. Improve app-server protocol handling so that if the final observed current-turn item is a tool result, OpenClaw does not treat silence as terminal without preserving recovery.
5. Add diagnostics that distinguish:
   - `turn/completed` missing after assistant final text
   - `turn/completed` missing after tool result where more assistant work is expected
   - raw response completion stalls
   - user/UI aborts
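
Items 1 and 3 could be combined roughly as follows. This is only a sketch of the idea: `TurnRecovery`, `CompletedToolCall`, and `buildRetryPrompt` are hypothetical names, and how OpenClaw would persist or resubmit such a record is left open.

```typescript
// Hypothetical record of a tool call that finished before the stall.
interface CompletedToolCall {
  tool: string;
  command: string;
  exitCode: number;
}

// Hypothetical structured recovery result, preserved when the
// completion-idle watchdog aborts a turn.
interface TurnRecovery {
  cause: "turn_completion_idle_timeout";
  threadId: string;
  turnId: string;
  originalPrompt: string;
  completed: CompletedToolCall[];
}

// Build a retry prompt that restarts the turn with a compact summary
// of the tool calls that already completed.
function buildRetryPrompt(r: TurnRecovery): string {
  const done =
    r.completed.length === 0
      ? "(none)"
      : r.completed
          .map((c, i) => `${i + 1}. ${c.tool}: ${c.command} (exit ${c.exitCode})`)
          .join("\n");
  return [
    `The previous attempt was aborted (${r.cause}) before it finished.`,
    "Tool calls that already completed:",
    done,
    "Continue from this state without repeating completed steps.",
    "",
    "Original request:",
    r.originalPrompt,
  ].join("\n");
}

// Usage, loosely based on the Discord reproduction (path and prompt shortened).
console.log(
  buildRetryPrompt({
    cause: "turn_completion_idle_timeout",
    threadId: "019e3f2d-b7f2-7443-ab96-4e72fe219fe1",
    turnId: "019e3f43-8034-7001-88af-70ffeb9bdb43",
    originalPrompt: "Build the static multi-page site from the existing drafts.",
    completed: [{ tool: "bash", command: "mkdir -p site/assets/img", exitCode: 0 }],
  }),
);
```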

Workaround in this environment: avoid the Codex app-server runtime for user-facing chat lanes until this recovery gap is fixed. For OpenAI GPT models, forcing `harness=pi` is only viable if the OpenAI provider credentials have `api.responses.write`; otherwise the normal OpenAI Responses API path fails with HTTP 401.
