[Bug]: @openclaw/codex notification handlers (account/rateLimits/updated, mcpServer/startupStatus/updated) synchronously block Node event loop #81936

## Summary

In `@openclaw/codex` 2026.5.7 (and still in 2026.5.12), the `codex_app_server:notification` handler appears to do synchronous work heavy enough to **starve the Node event loop** when invoked during an `embedded_run` cycle on an `agentRuntime: { id: "codex" }` agent. Two notification types are confirmed wedge triggers:

1. `account/rateLimits/updated` — fired during a cron-triggered embedded_run. **Hard wedge, instant**: gateway becomes unreachable to AWS health checks within seconds.
2. `mcpServer/startupStatus/updated` — fired when the codex plugin spawns MCP servers for a codex-runtime agent's tool-use chat turn. **Wedge within seconds** of the notification, even on a single chat turn with normal workspace-bootstrap reads + memory_search + tool use.

These are distinct from the `bedrock-mantle` model-ref claim bug fixed by [#81511](https://github.com/openclaw/openclaw/pull/81511) in 2026.5.12. They're also distinct from the orphan-process leak tracked in [#44790](https://github.com/openclaw/openclaw/issues/44790).

## Affected versions

Reproduced on OpenClaw + `@openclaw/codex` **2026.5.7** and **2026.5.12** (stable). Not addressed by `#81511`.

## Evidence

### Variant #2: cron + `account/rateLimits/updated`

Trigger: a daily cron (`0 9 * * *`, Australia/Sydney) on an `agentRuntime: { id: "codex" }` agent (valvest, with `openai/gpt-5.5`). Within ~4 seconds of the cron firing:

```
2026-05-12 09:01:09 [fetch-timeout] fetch timeout after 10000ms operation=fetchWithTimeout
  url=https://api.telegram.org/bot.../getMe

2026-05-12 09:01:22 [diagnostic] liveness warning:
  reasons=event_loop_delay,cpu
  eventLoopDelayP99Ms=2831.2 eventLoopDelayMaxMs=8011.1
  eventLoopUtilization=0.934 cpuCoreRatio=0.99
  work=[active=agent:valvest:cron:fd0c69ed-...(processing/embedded_run,q=0,age=4s
    last=codex_app_server:notification:account/rateLimits/updated)]
```

Then **zero further journal entries for the remaining ~8 hours** of that boot. The whole OS journal went silent from 09:01 onward, and AWS reported `StatusCheckFailed=5` in every 5-minute window. Recovery required `lightsail stop-instance --force` followed by `start-instance`.

### Variant #4: chat + MCP tool use + `mcpServer/startupStatus/updated`

Trigger: a routine chat DM (`"What's today's workout?"`) to a `@openclaw/codex`-runtime agent that uses MCP (valpeak, with GitHub MCP). During the normal turn (memory_search + workspace bootstrap-file reads — `MEMORY.md`, `SOUL.md`, `training-plan.md`, daily notes — and the agent's first MCP tool call):

```
2026-05-14 06:10:39 [agent/embedded] workspace bootstrap file MEMORY.md is 14561 chars
  (limit 12000); truncating in injected context

2026-05-14 06:10:46 [diagnostic] liveness warning:
  reasons=event_loop_delay
  eventLoopDelayP99Ms=264.6 eventLoopDelayMaxMs=3147.8
  eventLoopUtilization=0.564
  work=[active=agent:valpeak:main(processing/embedded_run,q=1,age=2s
    last=codex_app_server:notification:mcpServer/startupStatus/updated)]

2026-05-14 06:11:11 [ws] ⇄ res ✓ sessions.list 70ms ...
[then 12 minutes of nothing — wedged hard]
```

That is `eventLoopDelayMaxMs=3147.8` from a **single notification** dispatch. Combined with the heavy first-turn tool use, it was enough to cross the wedge threshold.
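For anyone who wants to cross-check these numbers outside the gateway's own diagnostics, here is a minimal sketch using Node's built-in `perf_hooks`. I don't know how OpenClaw computes its liveness fields internally; this assumes they are comparable to `monitorEventLoopDelay` and `performance.eventLoopUtilization()`. The helper name `startLoopMonitor` and the result shape are made up for illustration.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Hypothetical result shape, loosely mirroring the fields in the liveness warning.
interface LoopSample {
  p99Ms: number;
  maxMs: number;
  utilization: number;
}

// Hypothetical helper: starts sampling and returns a reader for one window at a time.
function startLoopMonitor(resolutionMs = 10) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let previous = performance.eventLoopUtilization();

  return {
    // Reads the window since the last call, then resets it.
    sample(): LoopSample {
      const current = performance.eventLoopUtilization();
      const delta = performance.eventLoopUtilization(current, previous);
      previous = current;
      const result: LoopSample = {
        p99Ms: histogram.percentile(99) / 1e6, // histogram values are nanoseconds
        maxMs: histogram.max / 1e6,
        // Guard against 0/0 when the window is empty.
        utilization: Number.isFinite(delta.utilization) ? delta.utilization : 0,
      };
      histogram.reset();
      return result;
    },
    stop(): void {
      histogram.disable();
    },
  };
}
```

Calling `sample()` on an interval (say every 5 s) and logging the result next to the last notification method seen should show whether a given notification lines up with a delay spike.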

### Risk ladder observed

| Trigger | Notification | Outcome |
|---|---|---|
| Active-memory `embedded_run` invoking `amazon-bedrock/*` with codex loaded (pre-2026.5.12) | n/a (resolver hijack → bedrock-mantle 404 loop) | Slow death spiral (~8 h) **— FIXED by #81511 in 2026.5.12** |
| Cron on codex-runtime agent | `account/rateLimits/updated` | **Hard wedge, instant** — this report, variant #2 |
| Chat on codex-runtime agent + MCP tool use | `mcpServer/startupStatus/updated` | **Wedge within seconds** — this report, variant #4 |
| Steady-state idle, codex+acpx loaded | (only `rawResponseItem/completed` observed) | Slow heap growth → OOM at ~2.0 GB after ~42 h uptime — **separate, see [#44790](https://github.com/openclaw/openclaw/issues/44790)** |

Notification handlers observed to fire harmlessly (no wedge): `rawResponseItem/completed`. So it's not *all* `codex_app_server:notification` types — it appears to be specifically the ones that do non-trivial sync work in the dispatch path.

## Reproduction

1. Install `@openclaw/codex` and configure an agent with `agentRuntime: { id: "codex" }` and `model: openai/gpt-5.5`.
2. **For variant #2:** schedule a cron that triggers an `embedded_run` on that agent (e.g. a daily report job). Wait for `account/rateLimits/updated` to fire (eventually inevitable — happens on every Codex auth state change).
3. **For variant #4:** configure that agent to use an MCP server (e.g. GitHub MCP) for tool calls. Send a chat DM that triggers a tool-use turn. The first MCP server spawn fires `mcpServer/startupStatus/updated`.

The cron-driven path (variant #2) is unrecoverable because the gateway never re-enters the loop to drain Telegram polls and respond to AWS health checks. The chat-driven path (variant #4) presents identically in our deployment.
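The starvation mechanism itself can be shown without OpenClaw at all. This is a standalone sketch, not the plugin's code: `busyHandler` is a stand-in for a notification handler doing heavy synchronous work, and `measureTimerLateness` reports how late a timer fires when the handler runs on the same thread.

```typescript
import { performance } from "node:perf_hooks";

// Stand-in for a handler that does heavy synchronous work (hypothetical).
function busyHandler(blockMs: number): void {
  const end = performance.now() + blockMs;
  while (performance.now() < end) {
    // Spin: no timer, poll, or health-check response can run until this returns.
  }
}

// Schedules a timer, runs the blocking handler, and resolves with how many
// milliseconds past its due time the timer actually fired.
function measureTimerLateness(blockMs: number, timerMs = 10): Promise<number> {
  return new Promise((resolve) => {
    const scheduledAt = performance.now();
    setTimeout(() => {
      resolve(performance.now() - scheduledAt - timerMs);
    }, timerMs);
    busyHandler(blockMs);
  });
}
```

With `blockMs` set to a few seconds, the lateness approaches the block time, which is the same shape as the Telegram `fetchWithTimeout` and health-check failures above.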

## Hypothesis

The `codex_app_server:notification` dispatch path in `@openclaw/codex` does synchronous work (parse + dispatch + log + side-effects like cache updates) on the event-loop thread. For light notification types (`rawResponseItem/completed`) the work is small enough to fit in a normal loop tick; for `account/rateLimits/updated` and `mcpServer/startupStatus/updated` it's large enough to block for multiple seconds. Combined with a concurrent `embedded_run` (which is doing its own LLM-shaped work), the loop crosses the saturation threshold and never recovers.
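One way to confirm or rule out this hypothesis would be to time each synchronous dispatch per notification method. The sketch below assumes handlers live in a plain method-to-function table, which is a guess on my part; I have not read the plugin's dispatch code, and `instrumentHandlers` is a hypothetical name.

```typescript
import { performance } from "node:perf_hooks";

type NotificationHandler = (params: unknown) => void;

interface SlowDispatch {
  method: string;
  syncMs: number;
}

// Wraps every handler so that synchronous time at or above thresholdMs is reported.
function instrumentHandlers(
  handlers: Record<string, NotificationHandler>,
  thresholdMs: number,
  onSlow: (dispatch: SlowDispatch) => void,
): Record<string, NotificationHandler> {
  const wrapped: Record<string, NotificationHandler> = {};
  for (const [method, handler] of Object.entries(handlers)) {
    wrapped[method] = (params) => {
      const start = performance.now();
      try {
        handler(params);
      } finally {
        // Measured even if the handler throws.
        const syncMs = performance.now() - start;
        if (syncMs >= thresholdMs) onSlow({ method, syncMs });
      }
    };
  }
  return wrapped;
}
```

If the hypothesis holds, `account/rateLimits/updated` and `mcpServer/startupStatus/updated` should show up in `onSlow` with multi-second `syncMs`, while `rawResponseItem/completed` should not.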

Adjacent: [openai/codex#17501](https://github.com/openai/codex/issues/17501) discusses *exposing* MCP startup notifications to JSONL — different surface, but suggests the notification path is rich enough on the OpenAI side to carry real payload.

## Mitigation we're running

Until upstream resolves:
- `plugins.entries.codex.enabled: true` (after 2026.5.12 upgrade — needed for active-memory's resolver to behave)
- All 4 chat agents kept on `amazon-bedrock/us.anthropic.claude-sonnet-4-6` (no `agentRuntime`) — codex plugin loaded but no agents codex-runtime-routed
- All codex-routed crons in `~/.openclaw/cron/jobs.json` disabled
- One canary agent (oikos — no MCP, no Telegram traffic) currently on `openai/gpt-5.5` for soak observation, but the canary can't actually exercise variant #4 because it has no MCP

Full forensic trail and four-variant decomposition: [ValantisV/OpenClaw-Personal#57](https://github.com/ValantisV/OpenClaw-Personal/issues/57).

## Suspected fix shape

Move the heavy work out of the notification dispatch path:
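- Either dispatch the notification to a worker thread or defer it to a later macrotask (e.g. `setImmediate`), so the handler returns immediately. A microtask would not help, because the microtask queue drains before the loop gets back to timers and I/O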
- Or split each handler so only the minimum bookkeeping happens synchronously and the rest defers
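A rough sketch of the second option, assuming the heavy part of a handler can be expressed as per-item work over a list (I don't know whether the real payloads look like that). `processInChunks` is a hypothetical helper. It uses `setImmediate` rather than a microtask, since microtasks drain before the loop returns to timers and I/O.

```typescript
// Processes items in slices and yields to the event loop between slices.
// Nothing runs synchronously in the caller; the first slice is also deferred.
function processInChunks<T>(
  items: readonly T[],
  processItem: (item: T) => void,
  chunkSize = 100,
): Promise<void> {
  return new Promise((resolve, reject) => {
    let index = 0;
    const runSlice = (): void => {
      const end = Math.min(index + chunkSize, items.length);
      try {
        for (const item of items.slice(index, end)) processItem(item);
      } catch (err) {
        reject(err);
        return;
      }
      index = end;
      if (index < items.length) {
        // Next slice runs on a later loop iteration, after timers and I/O callbacks.
        setImmediate(runSlice);
      } else {
        resolve();
      }
    };
    setImmediate(runSlice);
  });
}
```

A handler would then do only its minimum bookkeeping inline and hand the rest to `processInChunks`. This bounds the longest single block to one slice. It does not reduce total CPU, so a worker thread may still be the better fit if the work is genuinely heavy.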

Light-touch alternative: add an event-loop budget check around the synchronous payload-processing call and bail early if `eventLoopUtilization` is already elevated — better to drop a notification than wedge the gateway.
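The budget check could look something like this. It is a sketch only: `createLoopBudgetGate`, the 0.9 default, and the injectable `readElu` parameter are my own illustrative choices, not anything from the plugin. The only real API used is `performance.eventLoopUtilization()`.

```typescript
import { performance } from "node:perf_hooks";

// Only the cumulative idle/active times are needed to compute a window delta.
interface EluReading {
  idle: number;
  active: number;
}

// Returns a gate that reports whether the loop was below maxUtilization
// during the window since the previous check.
function createLoopBudgetGate(
  maxUtilization = 0.9,
  readElu: () => EluReading = () => performance.eventLoopUtilization(),
) {
  let previous = readElu();
  return {
    hasBudget(): boolean {
      const current = readElu();
      const active = current.active - previous.active;
      const idle = current.idle - previous.idle;
      previous = current;
      const total = active + idle;
      // An empty window counts as idle.
      const utilization = total > 0 ? active / total : 0;
      return utilization < maxUtilization;
    },
  };
}
```

A handler would call `gate.hasBudget()` before the expensive payload processing and skip or defer it when the result is `false`. Whether dropping is acceptable for a given notification type is a call for the maintainers.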
