Discord embedded-run prep wedge before strict-agentic, recovery skips sessionId=unknown lanes
Summary
- Version: OpenClaw 2026.5.18 (50a2481).
- Channel: Discord (@openclaw/discord)
- Runtime / providers reproduced: harness=codex (openai-codex/gpt-5.5), harness=pi (Pi-compat over codex OAuth, openai/gpt-5.5), and at least one confirmed wedge on provider=google model=gemini-3.5-flash (channel sticky-pinned via session.json), so the wedge is not provider-specific.
- Symptom:
  - before_dispatch / embedded_run:started observed
  - stalls before [agent/embedded] strict-agentic execution contract active
  - no llm_input, no before_tool_call, no agent_end
  - known-session lane: stuck session recovery fires at age=360s with action=abort_embedded_run aborted=true drained=false forceCleared=true released=1
  - unknown-session lane: recovery=none continues past 270s until gateway restart; no recovery observed
- Gateway restart clears in-flight stuck lanes, but new dispatches re-wedge within minutes, so the condition appears persistent in the gateway runtime state.
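- References (already landed in 2026.5.18; this happens after them):
  - 91ae1a6c03: split embedded attempt dispatch timing
  - 8a060b2904: release embedded session write lock before model I/O
  - e30be460e1: shortened stalled Codex recovery window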
Environment
- macOS 26.3 (arm64), Node 22.22.1
- Profile: secondary (~/.openclaw-hidetoshi/)
- Auth: openai-codex:* OAuth profile only (no OPENAI_API_KEY)
- 21 plugins (incl. observability for the diagnostic markers)
- Pi-compat route configured via model-level override:
{
  "agents.defaults.model.primary": "openai/gpt-5.5",
  "agents.defaults.models": {
    "openai/gpt-5.5": { "agentRuntime": { "id": "pi" } },
    "openai/gpt-5.4": { "agentRuntime": { "id": "pi" } },
    "openai/chat-latest": { "agentRuntime": { "id": "pi" } }
  }
}
Verified resolving: /status shows Runtime: OpenClaw Pi Default + 🔑 oauth (openai-codex:<email>); successful turns log provider=openai-codex/gpt-5.5 harness=pi.
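For anyone cross-checking a similar setup, here is a minimal sketch that lists which models the override above routes to a given runtime. It assumes the config is available as JSON with the flat dotted keys exactly as shown; `models_routed_to` is a hypothetical helper name, not part of OpenClaw.

```python
import json

# The override from this report, with flat dotted keys as shown above.
CONFIG_TEXT = """
{
  "agents.defaults.model.primary": "openai/gpt-5.5",
  "agents.defaults.models": {
    "openai/gpt-5.5": { "agentRuntime": { "id": "pi" } },
    "openai/gpt-5.4": { "agentRuntime": { "id": "pi" } },
    "openai/chat-latest": { "agentRuntime": { "id": "pi" } }
  }
}
"""

def models_routed_to(config: dict, runtime_id: str) -> list:
    """Return the sorted model names whose agentRuntime.id equals runtime_id."""
    models = config.get("agents.defaults.models", {})
    return sorted(
        name
        for name, entry in models.items()
        if entry.get("agentRuntime", {}).get("id") == runtime_id
    )

config = json.loads(CONFIG_TEXT)
pi_models = models_routed_to(config, "pi")
primary = config["agents.defaults.model.primary"]
print(pi_models, primary in pi_models)
```

If the primary model is missing from the returned list, the Pi-compat route would not apply to default turns under this assumed layout.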
Reproduction signal — primary case (full lifecycle through 360s auto-recovery, harness=codex)
24-line redacted slice from a single channel covering one successful turn, a wedged turn on the same session, the 30s-cadence stall diagnostics, and the 360s abort:
# Prior sessionId=unknown wedge being cleared by lane-suspension TTL — different recovery path:
13:14:34.648 [diagnostic] stalled session: sessionId=unknown sessionKey=…/channel:<chan-1>
state=processing age=149s queueDepth=1 reason=active_work_without_progress
classification=stalled_agent_run activeWorkKind=embedded_run
lastProgress=embedded_run:started lastProgressAge=148s recovery=none
13:14:36.183 [diagnostic] lane wait exceeded: lane=main waitedMs=1626821 …
13:14:36.188 [diagnostic] lane wait exceeded: lane=main waitedMs=149154 queueAhead=1 activeAhead=0 activeNow=1
13:14:36.192 [session-suspension] auto-resumed lane after suspension TTL ← recovery via TTL
# Successful turn — full marker sequence:
13:14:37.049 [agent/embedded] strict-agentic execution contract active:
runId=4fcbf81b… sessionId=0e9608aa… provider=openai-codex/gpt-5.5 harness=codex
13:14:40.516 [observability] llm_input sessionKey=…/channel:<chan-1> provider=openai-codex model=gpt-5.5
13:14:50.717 [observability] agent_end sessionKey=…/channel:<chan-1> success=true durationMs=13664
# Wedged turn — reuses sessionId=0e9608aa…, no contract activation follows:
13:19:26.153 [observability] before_dispatch …/channel:<chan-1> conversationId=channel:slash:…
13:19:34.810 [observability] before_dispatch …/channel:<chan-1> conversationId=channel:<chan-1>
(no `strict-agentic execution contract active`, no `llm_input`,
no tool calls, no agent_end — for 360s)
# 30s-cadence stall diagnostics (gateway.err.log):
13:21:35.159 [diagnostic] long-running session: sessionId=0e9608aa… age=120s queueDepth=1
reason=queued_behind_active_work classification=long_running activeWorkKind=embedded_run
lastProgress=embedded_run:started lastProgressAge=118s recovery=none
13:22:05.159 [diagnostic] stalled session: … age=150s … recovery=none
13:22:35.161 [diagnostic] stalled session: … age=180s … recovery=none
13:23:05.162 [diagnostic] stalled session: … age=210s … recovery=none
13:23:35.166 [diagnostic] stalled session: … age=240s … recovery=none
13:24:05.163 [diagnostic] stalled session: … age=270s … recovery=none
13:24:35.164 [diagnostic] stalled session: … age=300s … recovery=none
13:25:05.168 [diagnostic] stalled session: … age=330s … recovery=none
13:25:35.170 [diagnostic] stalled session: … age=360s … recovery=checking ← threshold reached
# Auto-recovery fires:
13:25:50.196 [diagnostic] stuck session recovery: sessionId=0e9608aa… age=360s
action=abort_embedded_run aborted=true drained=false released=1
13:25:50.199 [diagnostic] stuck session recovery outcome: status=aborted
action=abort_embedded_run … activeWorkKind=embedded_run
lane=session:agent:hidetoshi:discord:channel:<chan-1>
aborted=true drained=false forceCleared=true released=1
# Next dispatch was a user /new, 5.5 min later:
13:31:38.913 [observability] before_dispatch …/channel:<chan-1> conversationId=channel:slash:…
13:31:39.310 [observability] session_end …/channel:<chan-1> reason=new hadBinding=false
(Full file: stuck-turn-window-filtered.log.)
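The cadence and threshold in this slice can be checked mechanically. The sketch below parses the abbreviated single-line form of the [diagnostic] entries shown above and reports the gap between consecutive entries and the first age at which recovery leaves `none`. It assumes each diagnostic sits on one physical line (the wrapped entries in the slice would need joining first); the regex and function names are illustrative only.

```python
import re
from datetime import datetime

# Timestamp, age=<n>s and recovery=<state> as they appear in the slice above.
DIAG_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{3}) \[diagnostic\] .*?"
    r"age=(?P<age>\d+)s.*?recovery=(?P<recovery>\w+)"
)

def parse_diagnostics(lines):
    """Return (timestamp, age_seconds, recovery_state) for each matching line."""
    rows = []
    for line in lines:
        m = DIAG_RE.search(line)
        if m:
            ts = datetime.strptime(m["ts"], "%H:%M:%S.%f")
            rows.append((ts, int(m["age"]), m["recovery"]))
    return rows

def first_recovery_age(rows):
    """Age at which recovery first differs from 'none', or None if it never does."""
    for _, age, recovery in rows:
        if recovery != "none":
            return age
    return None

sample = [
    "13:22:05.159 [diagnostic] stalled session: … age=150s … recovery=none",
    "13:22:35.161 [diagnostic] stalled session: … age=180s … recovery=none",
    "13:25:35.170 [diagnostic] stalled session: … age=360s … recovery=checking",
]
rows = parse_diagnostics(sample)
# Whole-second gaps between consecutive diagnostics in the sample.
gaps = [round((b[0] - a[0]).total_seconds()) for a, b in zip(rows, rows[1:])]
print(gaps, first_recovery_age(rows))  # [30, 180] 360
```

The 180 s gap in the sample only reflects the lines omitted from the three-line excerpt; run over the full slice, the gaps should all be about 30 s.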
Reproduction signal — same wedge on harness=pi with sessionId=unknown (no recovery)
After a gateway restart that picked up agentRuntime.id: "pi" (subsequent successful turns logged harness=pi), two Discord channels wedged with identical signature but sessionId=unknown — recovery=none for the full observation window:
15:06:45.110 before_dispatch …/channel:<chan-B>
15:07:50.477 before_dispatch …/channel:<chan-A>
(no contract activation, no llm_input, no agent_end follows either)
15:08:47 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-B> age=122s recovery=none
…30s cadence continues, both lanes…
15:11:17 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-B> age=272s recovery=none
15:11:17 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-A> age=207s recovery=none
15:11:23 [gateway] SIGTERM (manual restart — would not have hit the 360s threshold)
Both lanes had lastProgress=embedded_run:started lastProgressAge≈age and recovery=none throughout. The diagnostic emitter has enough identity to log the channel/session-key and keep the lane wedged, but the recovery path appears to require sessionId=known — so these lanes never get aborted automatically.
(Full file: stuck-turn-window-pi-harness.log.)
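To pick these lanes out of a longer log, a filter along the following lines may help. It is a sketch over the condensed line format used in this section (sessionId, sessionKey, age, and recovery on one line, separated by single spaces); the field order and the helper name are assumptions, and wrapped entries would need joining first.

```python
import re
from collections import defaultdict

# Condensed form used in this section:
#   sessionId=<id> sessionKey=<key> age=<n>s recovery=<state>
LANE_RE = re.compile(
    r"sessionId=(?P<sid>\S+) sessionKey=(?P<key>\S+) "
    r"age=(?P<age>\d+)s recovery=(?P<recovery>\w+)"
)

def unrecovered_unknown_lanes(lines):
    """Map sessionKey -> max observed age for lanes that only ever logged
    sessionId=unknown together with recovery=none."""
    max_age = defaultdict(int)
    disqualified = set()
    for line in lines:
        m = LANE_RE.search(line)
        if not m:
            continue
        key = m["key"]
        if m["sid"] != "unknown" or m["recovery"] != "none":
            # The lane had a registered session or some recovery activity.
            disqualified.add(key)
        max_age[key] = max(max_age[key], int(m["age"]))
    return {k: v for k, v in max_age.items() if k not in disqualified}

sample = [
    "15:08:47 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-B> age=122s recovery=none",
    "15:11:17 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-B> age=272s recovery=none",
    "15:11:17 [diagnostic] stalled session: sessionId=unknown sessionKey=…/<chan-A> age=207s recovery=none",
]
stuck = unrecovered_unknown_lanes(sample)
print(stuck)
```

On the three sample lines this reports both channels, with maximum ages of 272 s and 207 s, matching the last entries before the manual restart.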
Notes and questions
- Wedge zone is pre-runtime. The lane reaches embedded_run:started but never reaches strict-agentic execution contract active. Whatever blocks sits in the embedded-run prep path (workspace-sandbox / runtime-plugins / hooks / model-resolution / auth / context-engine / attempt-workspace / attempt-prompt), the same prep stages I see traced in [trace:embedded-run] prep stages lines elsewhere. Is there a code path in there that can block forever without a timeout enforcer?
- Affects both harnesses. Same signature on harness=codex and harness=pi, on the same gateway, same channel. That points away from a harness-specific runtime bug and toward the layer underneath both. The three adjacent fixes (#82782, #82891, e30be460e1) are already in 2026.5.18 and don't cover this case.
- sessionId=unknown falls outside recovery. The 360s stuck session recovery appears to fire only when a sessionId has been registered. Wedges that hang before sessionId registration appear to be unrecoverable by the diagnostic path: the lane has channel/session-key identity, but recovery skips it. This looks like either a recovery coverage gap (recovery should key on lane / sessionKey too) or a missing "cannot recover because missing session id" diagnostic reason.
- drained=false forceCleared=true on the recovery outcome: the lane wasn't drained, only force-cleared. Is the embedded-run / codex-app-server child cleaned up cleanly when this happens, or could there be a leak that contributes to the gateway accumulating stale state over time? (For context: this gateway has been seeing wedges roughly every few hours of active Discord use.)
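To make the two options in the sessionId=unknown note concrete, here is a hypothetical decision function. It is not OpenClaw's recovery logic: the `StalledLane` type, the `plan_recovery` name, and the `missing_session_id` reason string are all invented for illustration, and only the 360 s threshold and the abort_embedded_run action name come from the logs above.

```python
from dataclasses import dataclass
from typing import Optional

RECOVERY_AGE_S = 360  # threshold observed in the primary-case log

@dataclass
class StalledLane:
    session_id: Optional[str]   # None or "unknown" when not yet registered
    session_key: Optional[str]  # e.g. "…/channel:<chan-1>"
    age_s: int

def plan_recovery(lane: StalledLane) -> dict:
    """Decide how a stalled lane could be handled (illustrative policy only)."""
    if lane.age_s < RECOVERY_AGE_S:
        return {"recovery": "none", "reason": "below_threshold"}
    if lane.session_id and lane.session_id != "unknown":
        # Behavior that matches the known-session case in the log.
        return {"recovery": "abort_embedded_run", "key": lane.session_id}
    if lane.session_key:
        # Option 1: fall back to the lane / sessionKey identity.
        return {"recovery": "abort_embedded_run", "key": lane.session_key}
    # Option 2: no usable identity, so at least report why nothing happened.
    return {"recovery": "none", "reason": "missing_session_id"}

print(plan_recovery(StalledLane("unknown", "…/channel:<chan-A>", 390)))
```

Under this policy an unknown-session lane that still carries a sessionKey would be aborted at the same threshold as a known-session lane, and a lane with no identity at all would log an explicit reason instead of a bare recovery=none.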
Adjacent observation (possibly related — not proven)
In the same window as the primary wedge, lane=main showed accumulated congestion:
- [diagnostic] lane wait exceeded: lane=main waitedMs=1626821 (27 minutes of accumulated wait) at 13:14:36, cleared by [session-suspension] auto-resumed lane after suspension TTL
- Cascade of failing-fast embedded runs on lane=main 13:14:59–13:15:30:
  - five workspace-lead agents (baikinman, design-library-lead, takeshi, design-lead, content-ops-lead) fail with No API key found for provider "openai-codex" in ~1s each; each emits [diagnostic] lane task error: lane=main durationMs=~1140
  - infra-lead failing-fast with xai grok 403 billing error (agent_end success=true durationMs=696; the billing-error fast-fail is papered over as success at the observability layer); recurs every ~30 min on the gateway
- [fetch-timeout] fetch timeout after 10000ms (elapsed 13365ms) operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
- ~4 min after the cascade, the primary wedge fires.
These failing-fast paths emit lane-task errors but the observability layer logs agent_end success=true for some — possibly polluting shared scheduler state (lane queue, codex-app-server pool, auth cache, workspace-sandbox prep) without surfacing a leak. Couldn't prove causation without source-level tracing — flagging in case it points to a shared lock / pool / queue path that the maintainer would recognize.
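The timing figures quoted in this section can be reproduced from the log values. The sketch below takes the first wedged before_dispatch at 13:19:26 as the start of the primary wedge, which is an assumption about what "fires" refers to; the helper names are illustrative.

```python
from datetime import datetime

def ms_to_minutes(ms: int) -> float:
    """Convert a waitedMs value to minutes."""
    return ms / 60_000

def gap_seconds(start: str, end: str) -> int:
    """Seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

wait_min = ms_to_minutes(1626821)                        # lane=main waitedMs
cascade_to_wedge = gap_seconds("13:15:30", "13:19:26")   # cascade end -> wedged dispatch
print(f"{wait_min:.1f} min waited, {cascade_to_wedge} s from cascade end to wedge")
# prints: 27.1 min waited, 236 s from cascade end to wedge
```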
What I can share on request
- Full unfiltered merged log window (gateway + err) for both cases
- The lane=main cascade window (workspace-lead + xai heartbeat failures)
- models status, config get agents.defaults.models, config get channelModels outputs
- [trace:embedded-run] startup stages / prep stages line samples from successful turns for comparison
- Longer-window repro if a "leave-it-running" repro would be useful