Discord embedded-run prep wedge before strict-agentic, recovery skips sessionId=unknown lanes #84477

# Discord embedded-run prep wedge before strict-agentic, recovery skips sessionId=unknown lanes

## Summary

- **Version**: OpenClaw 2026.5.18 (50a2481)
- **Channel**: Discord (`@openclaw/discord`)
- **Runtime / providers reproduced**: `harness=codex` (`openai-codex/gpt-5.5`), `harness=pi` (Pi-compat over codex OAuth, `openai/gpt-5.5`), and at least one confirmed wedge on **`provider=google model=gemini-3.5-flash`** (channel sticky-pinned via session.json) — so the wedge is **not provider-specific**.
- **Symptom**:
  - `before_dispatch` / `embedded_run:started` observed
  - stalls before `[agent/embedded] strict-agentic execution contract active`
  - no `llm_input`, no `before_tool_call`, no `agent_end`
  - **known-session lane**: `stuck session recovery` fires at age=360s with `action=abort_embedded_run aborted=true drained=false forceCleared=true released=1`
  - **unknown-session lane**: `recovery=none` continues past 270s until gateway restart — no recovery observed
  - A gateway restart clears in-flight stuck lanes, but new dispatches re-wedge within minutes; the condition appears persistent in the gateway runtime state.
- **References (already landed in 2026.5.18, this happens after them)**:
  - #82782 / `91ae1a6c03` — split embedded attempt dispatch timing
  - #82891 / `8a060b2904` — release embedded session write lock before model I/O
  - `e30be460e1` — shortened stalled Codex recovery window

## Environment

- macOS 26.3 (arm64), Node 22.22.1
- Profile: secondary (`~/.openclaw-hidetoshi/`)
- Auth: `openai-codex:*` OAuth profile only (no `OPENAI_API_KEY`)
- 21 plugins (incl. observability for the diagnostic markers)
- Pi-compat route configured via model-level override:
  ```json
  {
    "agents.defaults.model.primary": "openai/gpt-5.5",
    "agents.defaults.models": {
      "openai/gpt-5.5":   { "agentRuntime": { "id": "pi" } },
      "openai/gpt-5.4":   { "agentRuntime": { "id": "pi" } },
      "openai/chat-latest": { "agentRuntime": { "id": "pi" } }
    }
  }
  ```
  Verified resolving: `/status` shows `Runtime: OpenClaw Pi Default` + `🔑 oauth (openai-codex:<email>)`; successful turns log `provider=openai-codex/gpt-5.5 harness=pi`.

## Reproduction signal — primary case (full lifecycle through 360s auto-recovery, `harness=codex`)

24-line redacted slice from a single channel covering one successful turn, a wedged turn on the same session, the 30s-cadence stall diagnostics, and the 360s abort:

```
# Prior sessionId=unknown wedge being cleared by lane-suspension TTL — different recovery path:
13:14:34.648 [diagnostic] stalled session: sessionId=unknown sessionKey=…/channel:<chan-1>
              state=processing age=149s queueDepth=1 reason=active_work_without_progress
              classification=stalled_agent_run activeWorkKind=embedded_run
              lastProgress=embedded_run:started lastProgressAge=148s recovery=none
13:14:36.183 [diagnostic] lane wait exceeded: lane=main waitedMs=1626821 …
13:14:36.188 [diagnostic] lane wait exceeded: lane=main waitedMs=149154 queueAhead=1 activeAhead=0 activeNow=1
13:14:36.192 [session-suspension] auto-resumed lane after suspension TTL    ← recovery via TTL

# Successful turn — full marker sequence:
13:14:37.049 [agent/embedded] strict-agentic execution contract active:
              runId=4fcbf81b… sessionId=0e9608aa… provider=openai-codex/gpt-5.5 harness=codex
13:14:40.516 [observability] llm_input sessionKey=…/channel:<chan-1> provider=openai-codex model=gpt-5.5
13:14:50.717 [observability] agent_end sessionKey=…/channel:<chan-1> success=true durationMs=13664

# Wedged turn — reuses sessionId=0e9608aa…, no contract activation follows:
13:19:26.153 [observability] before_dispatch …/channel:<chan-1> conversationId=channel:slash:…
13:19:34.810 [observability] before_dispatch …/channel:<chan-1> conversationId=channel:<chan-1>
              (no `strict-agentic execution contract active`, no `llm_input`,
               no tool calls, no agent_end — for 360s)

# 30s-cadence stall diagnostics (gateway.err.log):
13:21:35.159 [diagnostic] long-running session: sessionId=0e9608aa… age=120s queueDepth=1
              reason=queued_behind_active_work classification=long_running activeWorkKind=embedded_run
              lastProgress=embedded_run:started lastProgressAge=118s recovery=none
13:22:05.159 [diagnostic] stalled session: … age=150s … recovery=none
13:22:35.161 [diagnostic] stalled session: … age=180s … recovery=none
13:23:05.162 [diagnostic] stalled session: … age=210s … recovery=none
13:23:35.166 [diagnostic] stalled session: … age=240s … recovery=none
13:24:05.163 [diagnostic] stalled session: … age=270s … recovery=none
13:24:35.164 [diagnostic] stalled session: … age=300s … recovery=none
13:25:05.168 [diagnostic] stalled session: … age=330s … recovery=none
13:25:35.170 [diagnostic] stalled session: … age=360s … recovery=checking    ← threshold reached

# Auto-recovery fires:
13:25:50.196 [diagnostic] stuck session recovery: sessionId=0e9608aa… age=360s
              action=abort_embedded_run aborted=true drained=false released=1
13:25:50.199 [diagnostic] stuck session recovery outcome: status=aborted
              action=abort_embedded_run … activeWorkKind=embedded_run
              lane=session:agent:hidetoshi:discord:channel:<chan-1>
              aborted=true drained=false forceCleared=true released=1

# Next dispatch was a user /new, about 5.8 min later:
13:31:38.913 [observability] before_dispatch …/channel:<chan-1> conversationId=channel:slash:…
13:31:39.310 [observability] session_end …/channel:<chan-1> reason=new hadBinding=false
```

(Full file: [`stuck-turn-window-filtered.log`](./stuck-turn-window-filtered.log).)
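The 30s-cadence diagnostics above can be summarized per lane to show at a glance which lanes ever left `recovery=none`. This is a minimal triage sketch, not part of OpenClaw: `summarizeStalls`, `lanesNeverRecovering`, and `LaneSummary` are hypothetical names, and it assumes each diagnostic record has been joined onto a single line of `key=value` fields (the excerpt above wraps them across lines).

```typescript
// Hypothetical triage helper; field names are taken from the redacted log lines.
interface LaneSummary {
  sessionKey: string;
  sessionId: string;
  maxAgeSec: number;
  recoveryStates: string[]; // distinct recovery= values, in first-seen order
}

// Extract a whitespace-delimited key=value field, e.g. age=272s.
// The leading (?:^|\s) keeps "age" from matching inside "lastProgressAge".
function field(line: string, name: string): string | undefined {
  const match = line.match(new RegExp(`(?:^|\\s)${name}=(\\S+)`));
  return match ? match[1] : undefined;
}

function summarizeStalls(lines: string[]): Map<string, LaneSummary> {
  const lanes = new Map<string, LaneSummary>();
  for (const line of lines) {
    if (!/\[diagnostic\] (stalled|long-running) session:/.test(line)) continue;
    const sessionKey = field(line, "sessionKey");
    const age = field(line, "age");
    if (sessionKey === undefined || age === undefined) continue;
    const ageSec = Number.parseInt(age, 10); // "272s" parses as 272
    if (Number.isNaN(ageSec)) continue;

    let lane = lanes.get(sessionKey);
    if (lane === undefined) {
      lane = {
        sessionKey,
        sessionId: field(line, "sessionId") ?? "unknown",
        maxAgeSec: 0,
        recoveryStates: [],
      };
      lanes.set(sessionKey, lane);
    }
    lane.maxAgeSec = Math.max(lane.maxAgeSec, ageSec);
    const recovery = field(line, "recovery");
    if (recovery !== undefined && lane.recoveryStates.indexOf(recovery) === -1) {
      lane.recoveryStates.push(recovery);
    }
  }
  return lanes;
}

// Lanes whose diagnostics only ever reported recovery=none.
function lanesNeverRecovering(lanes: Map<string, LaneSummary>): LaneSummary[] {
  return Array.from(lanes.values()).filter((lane) =>
    lane.recoveryStates.every((state) => state === "none"),
  );
}
```

Keying on `sessionKey` rather than `sessionId` is deliberate: it is the only identity present on lanes that log `sessionId=unknown`.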

## Reproduction signal — same wedge on `harness=pi` with `sessionId=unknown` (no recovery)

After a gateway restart that picked up `agentRuntime.id: "pi"` (subsequent successful turns logged `harness=pi`), two Discord channels wedged with identical signature but `sessionId=unknown` — `recovery=none` for the full observation window:

```
15:06:45.110 before_dispatch …/channel:<chan-B>
15:07:50.477 before_dispatch …/channel:<chan-A>
              (no contract activation, no llm_input, no agent_end follows either)

15:08:47 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-B> age=122s recovery=none
…30s cadence continues, both lanes…
15:11:17 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-B> age=272s recovery=none
15:11:17 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-A> age=207s recovery=none
15:11:23 [gateway] SIGTERM (manual restart — would not have hit the 360s threshold)
```

Both lanes had `lastProgress=embedded_run:started`, `lastProgressAge≈age`, and `recovery=none` throughout. The diagnostic emitter has enough identity to log the channel and session key and to track the lane as wedged, but the recovery path appears to require a known `sessionId`, so these lanes were never aborted automatically during the observation window.

(Full file: [`stuck-turn-window-pi-harness.log`](./stuck-turn-window-pi-harness.log).)
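To make the suspected gap concrete, here is a minimal sketch of a recovery decision that falls back to the session key when no `sessionId` is registered, or otherwise reports an explicit skip reason. Everything in it is an assumption for illustration: `StalledLane`, `decideRecovery`, the `allowSessionKeyFallback` flag, and the reason strings are hypothetical and do not describe OpenClaw's actual recovery code. Only the 360s threshold and the `abort_embedded_run` action name come from the logs above.

```typescript
// Hypothetical shapes; not OpenClaw's real types.
interface StalledLane {
  sessionId: string | undefined; // undefined where the log prints sessionId=unknown
  sessionKey: string;
  ageSec: number;
  activeWorkKind: string; // e.g. "embedded_run"
}

type RecoveryDecision =
  | { action: "abort_embedded_run"; key: string; keyedBy: "sessionId" | "sessionKey" }
  | { action: "none"; reason: string };

// Age at which `stuck session recovery` fired in the known-session log.
const RECOVERY_THRESHOLD_SEC = 360;

function decideRecovery(
  lane: StalledLane,
  allowSessionKeyFallback: boolean,
): RecoveryDecision {
  if (lane.activeWorkKind !== "embedded_run") {
    return { action: "none", reason: "not_embedded_run" };
  }
  if (lane.ageSec < RECOVERY_THRESHOLD_SEC) {
    return { action: "none", reason: "below_threshold" };
  }
  if (lane.sessionId !== undefined) {
    return { action: "abort_embedded_run", key: lane.sessionId, keyedBy: "sessionId" };
  }
  if (allowSessionKeyFallback) {
    // Option A: recover by lane identity when no sessionId was registered.
    return { action: "abort_embedded_run", key: lane.sessionKey, keyedBy: "sessionKey" };
  }
  // Option B: stay passive, but say why instead of a bare recovery=none.
  return { action: "none", reason: "missing_session_id" };
}
```

Either branch would change what the logs show for these lanes: an abort keyed by `sessionKey`, or a skip reason that distinguishes "not yet due" from "cannot recover".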

## Notes and questions

1. **Wedge zone is pre-runtime.** The lane reaches `embedded_run:started` but never `strict-agentic execution contract active`. Whatever blocks sits in the embedded-run prep path (workspace-sandbox / runtime-plugins / hooks / model-resolution / auth / context-engine / attempt-workspace / attempt-prompt), the same prep stages I see traced in `[trace:embedded-run] prep stages` lines elsewhere. Is there a code path in there that can block forever without a timeout enforcer?

2. **Affects both harnesses.** Same signature on `harness=codex` and `harness=pi`, on the same gateway and the same channel. This points away from a harness-specific runtime bug and toward the layer underneath both. The three adjacent fixes (`#82782`, `#82891`, `e30be460e1`) are already in 2026.5.18 and don't cover this case.

3. **`sessionId=unknown` falls outside recovery.** The 360s `stuck session recovery` appears to fire only when a sessionId has been registered. Wedges that hang before sessionId registration appear to be unrecoverable by the diagnostic path: the lane has channel/session-key identity, but recovery skips it. This looks like either a recovery coverage gap (recovery should key on lane / sessionKey too) or a missing "cannot recover because missing session id" diagnostic reason.

4. **`drained=false forceCleared=true`** on the recovery outcome: the lane wasn't drained, only force-cleared. Is the embedded-run / codex-app-server child cleaned up properly when this happens, or could a leak contribute to the gateway accumulating stale state over time? (For context: this gateway has been seeing wedges roughly every few hours of active Discord use.)
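On the first question above, one way to rule out an unbounded wait is a per-stage deadline around each prep stage. The sketch below is hypothetical: `withStageTimeout`, the use of stage names as plain strings, and the timeout values are illustrative assumptions, not OpenClaw's API. Note that `Promise.race` only stops waiting. It does not cancel the underlying work, so a real fix would likely also need cancellation (for example via an `AbortSignal`) to release whatever the stage holds.

```typescript
// Hypothetical wrapper: fail a prep stage with a named error instead of
// letting it block indefinitely.
async function withStageTimeout<T>(
  stage: string,
  timeoutMs: number,
  work: () => Promise<T>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`prep stage "${stage}" exceeded ${timeoutMs}ms`));
    }, timeoutMs);
  });
  try {
    // Whichever settles first wins; the losing promise is left to settle
    // (or not) on its own, which is why cancellation is a separate concern.
    return await Promise.race([work(), deadline]);
  } finally {
    // Always clear the timer so a fast stage does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

With a wrapper like this, a hang in, say, `model-resolution` would surface as an error naming that stage rather than as a lane that stays at `lastProgress=embedded_run:started`.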

## Adjacent observation (possibly related — not proven)

In the same window as the primary wedge, `lane=main` showed accumulated congestion:

- **`[diagnostic] lane wait exceeded: lane=main waitedMs=1626821`** (27 minutes of accumulated wait) at 13:14:36, cleared by `[session-suspension] auto-resumed lane after suspension TTL`
- Cascade of failing-fast embedded runs on `lane=main` 13:14:59–13:15:30:
  - five workspace-lead agents (`baikinman`, `design-library-lead`, `takeshi`, `design-lead`, `content-ops-lead`) fail with `No API key found for provider "openai-codex"` in ~1s each; each emits `[diagnostic] lane task error: lane=main durationMs=~1140`
  - `infra-lead` fails fast with an xAI Grok 403 billing error (`agent_end success=true durationMs=696`; the billing-error fast-fail is reported as success at the observability layer); recurs every ~30 min on the gateway
  - `[fetch-timeout] fetch timeout after 10000ms (elapsed 13365ms) operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me`
- ~4 min after the cascade, the primary wedge fires.

Some of these failing-fast paths emit lane-task errors while the observability layer logs `agent_end success=true`, which may pollute shared scheduler state (lane queue, codex-app-server pool, auth cache, workspace-sandbox prep) without surfacing a leak. I couldn't prove causation without source-level tracing; I'm flagging it in case it points to a shared lock / pool / queue path that a maintainer would recognize.

## What I can share on request

- Full unfiltered merged log window (gateway + err) for both cases
- The `lane=main` cascade window (workspace-lead + xai heartbeat failures)
- `models status`, `config get agents.defaults.models`, `config get channelModels` outputs
- `[trace:embedded-run] startup stages` / `prep stages` line samples from successful turns for comparison
- Longer-window repro if a "leave-it-running" repro would be useful

