Summary
In @openclaw/codex 2026.5.7 (and still in 2026.5.12), the codex_app_server:notification handler appears to do synchronous work heavy enough to starve the Node event loop when invoked during an embedded_run cycle on an agentRuntime: { id: "codex" } agent. Two notification types are confirmed wedge triggers:
- account/rateLimits/updated: fired during a cron-triggered embedded_run. Hard, instant wedge; the gateway becomes unreachable to AWS health checks within seconds.
- mcpServer/startupStatus/updated: fired when the codex plugin spawns MCP servers for a codex-runtime agent's tool-use chat turn. Wedge within seconds of the notification, even on a single chat turn with normal workspace-bootstrap reads + memory_search + tool use.
These are distinct from the bedrock-mantle model-ref claim bug fixed by #81511 in 2026.5.12. They're also distinct from the orphan-process leak tracked in #44790.
Affected versions
Reproduced on OpenClaw + @openclaw/codex 2026.5.7 and 2026.5.12 (stable). Not addressed by #81511.
Evidence
Variant #2: cron + account/rateLimits/updated
Trigger: a daily cron 0 9 * * * Australia/Sydney on an agentRuntime: { id: "codex" } agent (valvest, with openai/gpt-5.5). Within ~4 seconds of the cron firing:
Then zero further journal entries for the remaining ~8 hours of that boot. The OS was dead at the journal level from 09:01 onward; AWS StatusCheckFailed=5 per 5-min window continuously. Recovery required lightsail stop-instance --force + start-instance.
Variant #4: chat + MCP tool use + mcpServer/startupStatus/updated
Trigger: a routine chat DM ("What's today's workout?") to a @openclaw/codex-runtime agent that uses MCP (valpeak, with GitHub MCP). During the normal turn (memory_search, workspace bootstrap-file reads of MEMORY.md, SOUL.md, training-plan.md and daily notes, and the agent's first MCP tool call):
eventLoopDelayMaxMs=3147.8 from a single notification dispatch. Combined with the heavy first-turn tool use, enough to cross the wedge threshold.
Risk ladder observed
- embedded_run invoking amazon-bedrock/* with codex loaded (pre-2026.5.12)
- account/rateLimits/updated
- mcpServer/startupStatus/updated
- … rawResponseItem/completed observed)
- Slow heap growth → OOM at ~2.0 GB after ~42 h uptime (separate, see #44790)
Notification handlers observed to fire harmlessly (no wedge): rawResponseItem/completed. So it's not all codex_app_server:notification types; it appears to be specifically the ones that do non-trivial sync work in the dispatch path.
Reproduction
1. Install @openclaw/codex and configure an agent with agentRuntime: { id: "codex" } and model: openai/gpt-5.5.
2. For variant #2: schedule a cron that triggers an embedded_run on that agent (e.g. a daily report job). Wait for account/rateLimits/updated to fire (eventually inevitable; it happens on every Codex auth state change).
3. For variant #4: configure that agent to use an MCP server (e.g. GitHub MCP) for tool calls. Send a chat DM that triggers a tool-use turn. The first MCP server spawn fires mcpServer/startupStatus/updated.
The cron-driven path (variant #2) is unrecoverable because the gateway never re-enters the loop to drain Telegram polls and respond to AWS health checks. The chat-driven path (variant #4) presents identically in our deployment.
Hypothesis
The codex_app_server:notification dispatch path in @openclaw/codex does synchronous work (parse + dispatch + log + side-effects like cache updates) on the event-loop thread. For light notification types (rawResponseItem/completed) the work is small enough to fit in a normal loop tick; for account/rateLimits/updated and mcpServer/startupStatus/updated it's large enough to block for multiple seconds. Combined with a concurrent embedded_run (which is doing its own LLM-shaped work), the loop crosses the saturation threshold and never recovers.
Adjacent: openai/codex#17501 discusses exposing MCP startup notifications to JSONL. It is a different surface, but it suggests the notification path is rich enough on the OpenAI side to carry real payload.
Mitigation we're running
Until upstream resolves:
- plugins.entries.codex.enabled: true (after the 2026.5.12 upgrade; needed for active-memory's resolver to behave)
- All 4 chat agents kept on amazon-bedrock/us.anthropic.claude-sonnet-4-6 (no agentRuntime); codex plugin loaded but no agents codex-runtime-routed
- All codex-routed crons in ~/.openclaw/cron/jobs.json disabled
- … openai/gpt-5.5 for soak observation, but the canary can't actually exercise variant #4 because it has no MCP
Full forensic trail and four-variant decomposition: ValantisV/OpenClaw-Personal#57.
Suspected fix shape
Move the heavy work out of the notification dispatch path:
- Either dispatch the notification to a worker thread / microtask queue, returning from the handler synchronously
- Or split each handler so only the minimum bookkeeping happens synchronously and the rest defers
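To make the "minimum bookkeeping now, the rest later" shape concrete, here is a minimal sketch in plain Node.js. It is not taken from @openclaw/codex: createDeferredDispatcher, processHeavy, sliceBudgetMs and onError are hypothetical names, and the 5 ms slice is an arbitrary assumption. The handler only enqueues the notification; a drain function works through the queue in short slices and yields with setImmediate between slices.

```javascript
const { performance } = require("node:perf_hooks");

// Hypothetical sketch: every name here is illustrative and is NOT part of
// @openclaw/codex. processHeavy stands in for whatever expensive work a
// notification handler currently does inline.
function createDeferredDispatcher(processHeavy, sliceBudgetMs = 5, onError = console.error) {
  const pending = [];
  let scheduled = false;

  function drain() {
    const sliceStart = performance.now();
    // Process queued notifications, but only for one short slice at a time.
    while (pending.length > 0 && performance.now() - sliceStart < sliceBudgetMs) {
      const notification = pending.shift();
      try {
        processHeavy(notification);
      } catch (err) {
        onError(err); // one bad notification must not stall the queue
      }
    }
    if (pending.length > 0) {
      // Yield so timers, sockets and health checks can run before the next slice.
      setImmediate(drain);
    } else {
      scheduled = false;
    }
  }

  return {
    // Called from the dispatch path: constant-time bookkeeping, then return.
    onNotification(notification) {
      pending.push(notification);
      if (!scheduled) {
        scheduled = true;
        setImmediate(drain);
      }
    },
    pendingCount: () => pending.length,
  };
}
```

Two caveats on this sketch. It uses setImmediate rather than a microtask because microtasks (queueMicrotask, promise callbacks) run before the loop moves on to its next phase, so work queued there would still hold up I/O. And slicing only helps between notifications: if a single processHeavy call blocks for seconds on its own, that call would need to be split further or moved to a worker thread.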
Light-touch alternative: add an event-loop budget check around the synchronous payload-processing call and bail early if eventLoopUtilization is already elevated; better to drop a notification than wedge the gateway.
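A budget check of this kind could look roughly like the sketch below. It relies on performance.eventLoopUtilization() from node:perf_hooks, which reports cumulative idle and active loop time; everything else (createLoopBudget, guardedHandle, the 0.9 threshold, the injectable sample function) is a hypothetical illustration and not an API of @openclaw/codex.

```javascript
const { performance } = require("node:perf_hooks");

// Hypothetical helper: names and the 0.9 default are assumptions.
// `sample` is injectable so the gate can be exercised with fake readings.
function createLoopBudget({
  maxUtilization = 0.9,
  sample = () => performance.eventLoopUtilization(),
} = {}) {
  let previous = sample();
  return {
    // True when the loop was below the threshold since the previous check.
    hasBudget() {
      const current = sample();
      const active = current.active - previous.active;
      const idle = current.idle - previous.idle;
      previous = current;
      const total = active + idle;
      const utilization = total > 0 ? active / total : 0;
      return utilization < maxUtilization;
    },
  };
}

// Wraps the synchronous payload processing: drop instead of piling on
// when the loop is already saturated.
function guardedHandle(budget, notification, processPayload, onDrop) {
  if (!budget.hasBudget()) {
    onDrop(notification);
    return false;
  }
  processPayload(notification);
  return true;
}
```

The check measures utilization over the window since the previous notification, so a burst that arrives while the loop is saturated is dropped rather than processed. Whether dropping is acceptable for a given notification type is a judgment call this sketch does not make; onDrop is the place to log or count what was skipped.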