TL;DR
Two related but distinct bugs reproduced live on 2026-05-15 against OpenClaw 2026.5.12 + Codex CLI/app-server 0.130.0. Filed in one issue per Krill's guidance on the OpenClaw Discord support thread (2026-05-15, "Codex app-server turn idle timed out").
- (A) Telegram isolated-ingress spool drain is serially head-of-line (HOL) blocked by the in-flight agent turn. A long thinking-heavy turn in one chat freezes the spool drain for all other chats of the same accountId/agent: they don't reach the embedded run queue until the in-flight turn finishes.
- (B) Codex app-server stops emitting JSON-RPC notifications mid-turn, after a tool round-trip, causing a 30 min terminal idle timeout. Internally Codex keeps processing (1000+ log entries; multiple response.completed and custom_tool_call_input.delta events). Externally OpenClaw sees no events between notification:rawResponseItem/completed and the watchdog firing 30 min later. The user-facing result is a partial assistant text followed by "Request timed out before a response was generated." This is the same symptom I reported in Krill's Discord thread; the new evidence below comes from a clean fresh-app-server repro after wiping all per-agent codex-home/ dirs.
Environment
| Component | Value |
|---|---|
| OpenClaw | 2026.5.12 (f066dd2) |
| Codex CLI / app-server | 0.130.0 (/root/.openclaw/npm/node_modules/@openai/codex/bin/codex.js, only install on PATH) |
| Node | 22.22.1 |
| OS | Linux 5.15.0-174-generic x64, Ubuntu 22.04.5 LTS |
| Gateway | systemd user service, loopback ws://127.0.0.1:18789 |
| Channels | Telegram (10 isolated accounts incl. arkadiy, nikita) + WhatsApp (1) |
| Auth | ChatGPT Subscription OAuth openai-codex:pashaganson@gmail.com (Weekly 36% used, Short-term 1% used; not rate-limited) |
| Runtime selector | per-model agentRuntime.id: "codex" on all openai/* models. No top-level embeddedHarness. plugins.entries.codex.enabled = true. |
| Streaming | channels.telegram.streaming = { mode: "partial", preview.toolProgress: true, progress.toolProgress: true, render: "rich" } |
| Model | openai/gpt-5.5, thinking: "high", fastMode: true, textVerbosity: "medium" |
| Active Memory plugin | enabled, default config |
Pre-repro install audit + cleanup we did
Following Krill's first-pass checklist on Discord, we audited the install and found several pieces of stale state, which we cleaned up before this repro. They are listed here so they aren't re-suggested:
- 3 different Codex CLI binaries on the host: openclaw bundle 0.130.0, system-global /usr/lib/node_modules/@openai/codex@0.130.0, and snap codex@0.114.0 in /snap/bin. Only the bundle was actually being launched by openclaw (via absolute path), but the others were latent risk. Removed snap and system-global; kept only the bundle.
- 5 of 10 per-agent ~/.openclaw/agents/<agent>/agent/codex-home/ had no auth.json (vanechka, kirya, elena, nikita, dasha). The other 5 did.
- One codex app-server process was being reused across all 10 agents, with CODEX_HOME pinned to ~/.openclaw/agents/angela/agent/codex-home regardless of which agent owned the turn. Per-agent isolation was effectively not isolating.
- codex-home-main had grown to ~1 GB (logs_2.sqlite, state_5.sqlite, shell snapshots). Other codex-home dirs up to ~200 MB each.
Cleanup: stopped gateway, moved all 10 codex-home dirs to a timestamped backup (preserving agent/auth-profiles.json OAuth profiles), removed /root/.codex (personal CLI home, unused by openclaw), ran openclaw doctor --fix, restarted gateway. After this, codex app-server now spawns lazily per agent with the correct isolated CODEX_HOME. Both bugs below reproduced anyway, so neither is caused by the stale state we cleaned up.
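For anyone who wants to re-check the auth.json finding on their own host, a few lines of Node are enough. This is only a sketch: `agentsMissingAuth` is a made-up helper name, not an OpenClaw command, and the directory layout is assumed from the paths quoted in this report rather than from any documentation.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper (not an OpenClaw API): list agents whose codex-home has no auth.json.
// ASSUMPTION: layout is <agentsRoot>/<agent>/agent/codex-home/auth.json, as seen on this host.
function agentsMissingAuth(agentsRoot) {
  return fs
    .readdirSync(agentsRoot, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => entry.name)
    .filter((agent) => {
      const authFile = path.join(agentsRoot, agent, "agent", "codex-home", "auth.json");
      return !fs.existsSync(authFile);
    })
    .sort();
}
```

Called with the agents directory (here `~/.openclaw/agents`), it returns the sorted names of agents that have no auth.json in their codex-home.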
(A) Telegram isolated-ingress HOL blocking
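Confirmed by Krill from source: the spool drain loop is, in pseudocode:
for (const update of updates) {
  await bot.handleUpdate(update); // blocks until the full agent turn completes
  await deleteSpooledFile(update); // pseudocode: the spooled file is removed only after the turn returns
}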
acp.maxConcurrentSessions: 8 is ACP-only; agents.defaults.maxConcurrent and messages.queue only apply after dequeue — none of them decouple this drain loop.
Live repro at 2026-05-15T18:20:09Z
User 854067528 sent 3 messages to agent arkadiy at ~18:20:10Z (light → heavy → light), one per channel: group topic, DM, DM topic.
| Inbound (UTC) | Channel | Body size | Spool wait |
|---|---|---|---|
| 18:20:09.142Z | group -1003794846986:topic:9 | 14 chars | ~0s (1st arrival, spool empty) |
| 18:20:40.091Z | DM 854067528 | 147 chars | ~30s (file 408 sat in spool 18:20:10 → 18:20:40) |
| 18:23:21.479Z | DM topic 854067528:806808 | 16 chars | ~3m10s (file 409 sat in spool 18:20:11 → ~18:23:21) |
Total drain time for the 3-message burst: ~3 minutes, all because the 1st turn (a 14-char тест ("test") message answered by gpt-5.5 with thinking: high) ran for 3m30s and HOL-blocked the spool.
Spool drain timeline captured live in the attached spool-drain-monitor.log (background watcher snapshotting /root/.openclaw/telegram/ingress-spool-arkadiy/ every 30s).
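A watcher of this kind is easy to reproduce. The sketch below is not the exact script used for the attached log; the function names and the JSON-lines output format are illustrative, and it assumes that a spooled file's mtime is a reasonable stand-in for its arrival time.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical watcher (not part of OpenClaw): record which files are waiting in a spool dir.
// ASSUMPTION: file mtime approximates the moment the update was spooled.
function snapshotSpool(dir, now = Date.now()) {
  const files = [];
  for (const name of fs.readdirSync(dir).sort()) {
    try {
      const { mtimeMs } = fs.statSync(path.join(dir, name));
      files.push({ name, ageSeconds: Math.max(0, Math.round((now - mtimeMs) / 1000)) });
    } catch (err) {
      // The drain may delete a file between readdir and stat; skip it.
      if (err.code !== "ENOENT") throw err;
    }
  }
  return { at: new Date(now).toISOString(), dir, depth: files.length, files };
}

// Append one JSON line per snapshot; returns the timer so the caller can clearInterval() it.
function watchSpool(dir, logFile, intervalMs = 30_000) {
  const tick = () => fs.appendFileSync(logFile, JSON.stringify(snapshotSpool(dir)) + "\n");
  tick();
  return setInterval(tick, intervalMs);
}
```

Pointing `watchSpool` at an ingress-spool directory before sending the burst gives one line per interval; a file whose `ageSeconds` keeps growing across snapshots is one that is stuck behind the in-flight turn.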
Impact
For a multi-channel personal-assistant install (10 agents, dozens of chats per agent), one long turn anywhere will freeze ALL inbound reception for that agent across every channel — including unrelated quick messages, status checks, and other users. The agent looks dead to anyone trying to ping it while busy.
Ask
- Decouple the spool drain from bot.handleUpdate agent-turn completion. The drain should enqueue updates into a downstream queue and return promptly, letting the spooled file be deleted; agent execution then runs from the downstream queue with its own concurrency knob.
- Or, document an explicit config switch that does this. Currently we don't know of one.
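To make the first ask concrete, here is a minimal sketch of what a decoupled drain could look like. Everything in it is hypothetical: `TurnQueue`, `drainSpool`, `runTurn` and `deleteSpooledFile` are names invented for illustration and do not correspond to OpenClaw internals.

```javascript
// Hypothetical sketch, not OpenClaw code: a bounded-concurrency queue that runs agent turns.
class TurnQueue {
  constructor(concurrency, runTurn) {
    this.concurrency = concurrency; // the "own concurrency knob" from the ask above
    this.runTurn = runTurn;         // async function that executes one agent turn
    this.pending = [];
    this.active = 0;
  }

  // Accept an update and return immediately; the turn runs later.
  enqueue(update) {
    this.pending.push(update);
    this.pump();
  }

  // Start as many pending turns as the concurrency limit allows.
  pump() {
    while (this.active < this.concurrency && this.pending.length > 0) {
      const update = this.pending.shift();
      this.active += 1;
      Promise.resolve()
        .then(() => this.runTurn(update))
        .catch((err) => console.error("turn failed", err))
        .finally(() => {
          this.active -= 1;
          this.pump();
        });
    }
  }
}

// The drain only hands updates over; it never awaits a turn.
function drainSpool(spooled, queue, deleteSpooledFile) {
  for (const entry of spooled) {
    queue.enqueue(entry.update);
    deleteSpooledFile(entry.file);
  }
}
```

Two caveats a real implementation would have to address: deleting the spooled file once the update sits in an in-memory queue gives up crash durability unless the downstream queue is itself persistent, and updates from the same chat probably still need to run in order, so the concurrency limit would more likely be applied per chat than globally.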
(B) Codex app-server stops emitting events mid-turn → 30 min terminal idle timeout
Live repro
Session agent:nikita:telegram:direct:854067528. User sent one analytics request at 17:55:51Z: "Сделай аналитику по маминому питанию подробную" (35 chars; in English, "Do a detailed analysis of Mom's diet"). After ~70s of normal lifecycle (1 tool round-trip), codex app-server went silent toward OpenClaw for 30:26, until the OpenClaw watchdog fired.
Runtime event timeline (from /export-trajectory)
Smoking-gun events
User-facing result in Telegram: that partial 285-token assistant message, followed by:
Request timed out before a response was generated. Please try again, or increase agents.defaults.timeoutSeconds in your config.
What Codex was actually doing internally
I dumped ~/.openclaw/agents/nikita/agent/codex-home/logs_2.sqlite while the wedge was in progress. Of 5110 internal log entries, 1000 belong to this turn's threadId 019e2cc7-f734, the last being:
TRACE codex_core::session::turn post sampling token usage turn_id=019e2cc7-f773-...
at 1778867799 (= 18:03:19Z). Activity inside Codex went silent at 18:03:19Z, but the JSON-RPC stdio stream toward openclaw stopped earlier than that (last runtime event reached openclaw at 17:56:12Z, ~7 minutes earlier). So two layers of silence:
- 17:56:12Z → ~18:03:19Z (~7 min): codex internally still active (model sampling, custom_tool_call_input.delta x355, response.completed x6, etc.), but its codex_app_server::outgoing_message stream stopped emitting after the last rawResponseItem/completed it had pushed to openclaw.
- ~18:03:19Z onward: codex itself fully idle. Process state S (sleeping) in futex_wait. No new internal log entries. No CPU activity.
Note lastNotificationItemType: "custom_tool_call_output" — the last codex notification successfully delivered was the result of a tool call. The model would normally then run another sampling round to consume that tool result and either issue more tool calls or finalize. Codex did the model round-trip internally (1000 more internal log entries) but never emitted any further rawResponseItem/started, item/completed, turn/completed etc. over JSON-RPC.
Success-case trajectory diff (same harness, different turn)
A simultaneous arkadiy DM turn (agent:arkadiy:telegram:direct:854067528:thread:854067528:807447, thread 019e2cad) completed cleanly at 18:23 with finishReason=stop, output 1.2k tokens, 51% context fill, no truncation, full lifecycle events emitted:
| Event type | SUCCESS (arkadiy) | FAIL (nikita) |
|---|---|---|
| user.message | 2 | 1 |
| assistant.message | 38 | 13 |
| tool.call / tool.result | 34/34 | 11/11 |
| prompt.submitted | 2 | 1 |
| context.compiled | 2 | 1 |
| model.completed | 2 | 1 (forced, with timedOut:true) |
| session.ended | 2 | 1 (with timedOut:true) |
| turn.terminal_idle_timeout | 0 | 1 |
Curious side-detail in the FAIL transcript: tool.call/result events seq=11..39 (28 transcript events) all stamped within 60ms of the watchdog at 18:26:38.880-944Z, i.e. they appear flushed in a burst at session-end rather than streamed in real time. The 30-minute gap in the runtime stream between seq=5 (last real-time event) and seq=6 (watchdog) is unbroken.
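The gap and the burst flush can both be checked mechanically from a runtime-events JSONL file. The sketch below assumes each line is a JSON object with a numeric `seq` and an ISO-8601 `ts` field; those field names are a guess for illustration, and the real export may name them differently.

```javascript
// ASSUMPTION: each JSONL line looks like {"seq": 5, "ts": "2026-05-15T17:56:12Z", ...}.
// Field names are illustrative; adjust them to the real export schema.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Return every pair of consecutive events (by seq) separated by at least thresholdMs.
function findGaps(events, thresholdMs) {
  const sorted = [...events].sort((a, b) => a.seq - b.seq);
  const gaps = [];
  for (let i = 1; i < sorted.length; i++) {
    const gapMs = Date.parse(sorted[i].ts) - Date.parse(sorted[i - 1].ts);
    if (gapMs >= thresholdMs) {
      gaps.push({ afterSeq: sorted[i - 1].seq, beforeSeq: sorted[i].seq, gapMs });
    }
  }
  return gaps;
}
```

Run with a threshold of, say, 60 seconds against the FAIL trajectory, this should report a single gap of roughly 30 minutes between seq=5 and seq=6. Events that were flushed in a burst show up the opposite way: many consecutive seq values with near-zero gaps.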
Ask
- Investigate why codex_app_server::outgoing_message stops emitting JSON-RPC notifications after a rawResponseItem/completed for an item of type custom_tool_call_output, while the underlying session loop continues to run sampling rounds and tool calls internally.
- Or: surface a watchdog inside codex itself that detects "internal sampling rounds happening but outgoing stream is silent for N seconds" and either heals or fails the turn fast, instead of letting the openclaw-side 30-min idle watchdog be the only safety net.
- Bonus: it would also be useful to log the raw finish_reason from the OpenAI Responses API alongside the normalized stop/error/aborted/toolUse, so we can rule out finish_reason=length (truncation) cases for separate reports we're investigating.
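The watchdog condition in the second ask boils down to comparing two timestamps. The sketch below shows only that logic, in JavaScript for consistency with the rest of this report; the class and method names are invented, and nothing here reflects how codex is actually structured internally.

```javascript
// Hypothetical sketch of the stall condition; not codex code.
// Timestamps are plain millisecond numbers supplied by the caller, which keeps this testable.
class OutgoingStallDetector {
  constructor(maxSilenceMs, turnStartedAt) {
    this.maxSilenceMs = maxSilenceMs;
    this.lastOutgoingAt = turnStartedAt; // treat turn start as the first "outgoing" moment
    this.lastInternalAt = null;
  }

  // Call whenever a JSON-RPC notification is written to the client.
  noteOutgoing(now) {
    this.lastOutgoingAt = now;
  }

  // Call whenever the session loop does work (sampling round, tool call, delta).
  noteInternal(now) {
    this.lastInternalAt = now;
  }

  // Stalled = internal work happened after the last outgoing message,
  // and the outgoing stream has been silent for longer than the limit.
  isStalled(now) {
    if (this.lastInternalAt === null) return false;
    const internalAfterOutgoing = this.lastInternalAt > this.lastOutgoingAt;
    const silentTooLong = now - this.lastOutgoingAt > this.maxSilenceMs;
    return internalAfterOutgoing && silentTooLong;
  }
}
```

The point of requiring internal activity after the last outgoing message is to separate this wedge from a turn that is simply idle. In the repro above the condition would have held within the first silent window, long before the 30-minute idle watchdog on the OpenClaw side.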
Artifact bundle
Files attached as a secret gist:
- SUCCESS-manifest.json, SUCCESS-runtime-events.jsonl: clean turn trajectory (arkadiy, 18:20-18:23Z)
- FAIL-manifest.json, FAIL-runtime-events.jsonl: wedged turn trajectory (nikita, 17:56-18:26Z)
- FAIL-tool-chronology.jsonl: all tool.call/tool.result timestamps from the wedged turn
- openclaw-log-slice-1755-1830.log: filtered openclaw gateway log for both repro windows
- spool-drain-monitor.log: 30-second snapshots of ingress-spool-arkadiy/ + ingress-spool-nikita/ + codex app-server PIDs across the repro window
Diagnostics zip from openclaw gateway diagnostics export (33 KiB, payload-free, sanitized) available on request — happy to attach to the issue if useful.
Reproducer (short form)
- Fresh OpenClaw 2026.5.12 install with native Codex harness (agentRuntime.id="codex"), ChatGPT Subscription OAuth, Telegram channel with multiple chats per agent
- Pick any agent, e.g. nikita
- From a Telegram client send a thinking-heavy multi-tool prompt (e.g. an analytics request that requires Memory Search + several Bash tool calls)
- Within ~10 sec from another chat tied to the same agent, send 2 more simple messages
- Expected: agent replies to all 3 within reasonable wall-clock time, terminal events emitted normally
- Actual: light messages sit in ~/.openclaw/telegram/ingress-spool-<agent>/ for the duration of the heavy turn (bug A), AND for some thinking-heavy tool-using turns the codex app-server runs the model+tools internally but stops emitting JSON-RPC notifications mid-turn, leading to the 30-minute idle timeout (bug B)
I can also enable diagnostics.flags=["*"] + logging.level=debug and re-capture with a full event trace if the runtime-event JSONL above is not enough.
Originally discussed with Krill on the OpenClaw Discord thread linked at the top.