2026.5.12: Telegram isolated-ingress HOL blocking + Codex app-server stalls mid-turn after custom_tool_call_output → 30 min idle timeout #82274

@PashaGanson

TL;DR

Two related but distinct bugs reproduced live on 2026-05-15 against OpenClaw 2026.5.12 + Codex CLI/app-server 0.130.0. Filed in one issue per Krill's guidance on the OpenClaw Discord support thread (2026-05-15, "Codex app-server turn idle timed out").

  • (A) Telegram isolated-ingress spool drain is serially HOL-blocked by the in-flight agent turn. A long thinking-heavy turn in one chat freezes the spool drain for all other chats of the same accountId/agent — they don't reach the embedded run queue until the in-flight turn finishes.
  • (B) Codex app-server stops emitting JSON-RPC notifications mid-turn, after a tool round-trip, causing a 30 min terminal idle timeout. Internally Codex keeps processing (1000+ log entries; multiple response.completed, custom_tool_call_input.delta events). Externally OpenClaw sees no events between notification:rawResponseItem/completed and the watchdog firing 30 min later. The user-facing result is a partial assistant text followed by "Request timed out before a response was generated." This is the same symptom I reported in Krill's Discord thread; the new evidence below comes from a clean fresh-app-server repro after wiping all per-agent codex-home/ dirs.

Environment

| Component | Value |
| --- | --- |
| OpenClaw | 2026.5.12 (f066dd2) |
| Codex CLI / app-server | 0.130.0 (/root/.openclaw/npm/node_modules/@openai/codex/bin/codex.js, only install on PATH) |
| Node | 22.22.1 |
| OS | Linux 5.15.0-174-generic x64, Ubuntu 22.04.5 LTS |
| Gateway | systemd user service, loopback ws://127.0.0.1:18789 |
| Channels | Telegram (10 isolated accounts incl. arkadiy, nikita) + WhatsApp (1) |
| Auth | ChatGPT Subscription OAuth openai-codex:pashaganson@gmail.com (Weekly 36% used, Short-term 1% used; not rate-limited) |
| Runtime selector | per-model agentRuntime.id: "codex" on all openai/* models. No top-level embeddedHarness. plugins.entries.codex.enabled = true. |
| Streaming | channels.telegram.streaming = { mode: "partial", preview.toolProgress: true, progress.toolProgress: true, render: "rich" } |
| Model | openai/gpt-5.5, thinking: "high", fastMode: true, textVerbosity: "medium" |
| Active Memory | plugin enabled, default config |

Pre-repro install audit + cleanup we did

Per Krill's first-pass checklist on Discord we audited the install and found several pieces of stale state, which we cleaned up before this repro. Calling them out so they aren't re-suggested:

  1. 3 different Codex CLI binaries on the host: openclaw bundle 0.130.0, system-global /usr/lib/node_modules/@openai/codex@0.130.0, and snap codex@0.114.0 in /snap/bin. Only the bundle was actually being launched by openclaw (via absolute path), but the others were latent risk. Removed snap and system-global; kept only the bundle.
  2. 5 of 10 per-agent ~/.openclaw/agents/<agent>/agent/codex-home/ had no auth.json (vanechka, kirya, elena, nikita, dasha). The other 5 did.
  3. One codex app-server process was being reused across all 10 agents, with CODEX_HOME pinned to ~/.openclaw/agents/angela/agent/codex-home regardless of which agent owned the turn. Per-agent isolation was effectively not isolating.
  4. codex-home-main had grown to ~1 GB (logs_2.sqlite, state_5.sqlite, shell snapshots). Other codex-home dirs up to ~200 MB each.

Cleanup: stopped gateway, moved all 10 codex-home dirs to a timestamped backup (preserving agent/auth-profiles.json OAuth profiles), removed /root/.codex (personal CLI home, unused by openclaw), ran openclaw doctor --fix, restarted gateway. After this, codex app-server now spawns lazily per agent with the correct isolated CODEX_HOME. Both bugs below reproduced anyway, so neither is caused by the stale state we cleaned up.


(A) Telegram isolated-ingress HOL blocking

Confirmed by Krill from source: the spool drain loop is

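// Simplified paraphrase of the drain loop, not the verbatim source
for (const update of updates) {
  await bot.handleUpdate(update);  // blocks until the full agent turn completes
  delete spooledFile;              // pseudocode: the spooled file is removed only afterwards
}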

acp.maxConcurrentSessions: 8 is ACP-only; agents.defaults.maxConcurrent and messages.queue only apply after dequeue — none of them decouple this drain loop.

Live repro at 2026-05-15T18:20:09Z

User 854067528 sent 3 messages to agent arkadiy at ~18:20:10Z (light → heavy → light), one per channel: group topic, DM, DM topic.

| Inbound (UTC) | Channel | Body size | Spool wait |
| --- | --- | --- | --- |
| 18:20:09.142Z | group -1003794846986:topic:9 | 14 chars | ~0s (1st arrival, spool empty) |
| 18:20:40.091Z | DM 854067528 | 147 chars | ~30s (file 408 sat in spool 18:20:10 → 18:20:40) |
| 18:23:21.479Z | DM topic 854067528:806808 | 16 chars | ~3m10s (file 409 sat in spool 18:20:11 → ~18:23:21) |

Total drain time for the 3-message burst: ~3 minutes, all because the 1st turn (a 14-char "тест" message, Russian for "test", answered by gpt-5.5 with thinking: high) ran for 3m30s and HOL-blocked the spool.
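
For anyone re-checking the table, the spool-wait figures can be recomputed from the timestamps. The snippet below is a minimal sketch: `SpoolSample` and `spoolWaitSeconds` are hypothetical names, and the spool arrival times are assumed to fall exactly on the second because the report only gives them to second precision.

```typescript
// Hypothetical helper (not an OpenClaw API): recompute how long each spooled
// update waited before the drain loop picked it up.
interface SpoolSample {
  label: string;
  spooledAt: string;   // ISO-8601 UTC; assumed to be on the exact second
  dequeuedAt: string;  // ISO-8601 UTC; the "Inbound (UTC)" column of the table
}

function spoolWaitSeconds(sample: SpoolSample): number {
  const waitMs = Date.parse(sample.dequeuedAt) - Date.parse(sample.spooledAt);
  return waitMs / 1000;
}

const samples: SpoolSample[] = [
  {
    label: "file 408 (DM)",
    spooledAt: "2026-05-15T18:20:10.000Z",
    dequeuedAt: "2026-05-15T18:20:40.091Z",
  },
  {
    label: "file 409 (DM topic)",
    spooledAt: "2026-05-15T18:20:11.000Z",
    dequeuedAt: "2026-05-15T18:23:21.479Z",
  },
];

// One entry per spooled file: about 30 s for file 408 and
// about 190 s (3m10s) for file 409.
const waits = samples.map((s) => ({
  label: s.label,
  seconds: spoolWaitSeconds(s),
}));
```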

Spool drain timeline captured live in the attached spool-drain-monitor.log (background watcher snapshotting /root/.openclaw/telegram/ingress-spool-arkadiy/ every 30s).

Impact

For a multi-channel personal-assistant install (10 agents, dozens of chats per agent), one long turn anywhere will freeze ALL inbound reception for that agent across every channel — including unrelated quick messages, status checks, and other users. The agent looks dead to anyone trying to ping it while busy.

Ask

  • Decouple the spool drain from bot.handleUpdate agent-turn completion. The drain should enqueue updates into a downstream queue and return promptly, letting the spooled file be deleted; agent execution then runs from the downstream queue with its own concurrency knob.
  • Or, document an explicit config switch that does this. Currently we don't know of one.
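
To make the first ask concrete, here is a rough sketch of a drain loop that hands updates to a bounded downstream queue instead of awaiting each turn. Every name in it (`TurnQueue`, `drainSpool`, `Update`, `removeSpooledFile`) is hypothetical and not taken from the OpenClaw source; it only illustrates the shape of the decoupling.

```typescript
// Sketch under stated assumptions: none of these names are OpenClaw APIs.
type Update = { id: number; chat: string };
type Handler = (update: Update) => Promise<void>;

class TurnQueue {
  private pending: Update[] = [];
  private running = 0;
  private readonly handler: Handler;
  private readonly maxConcurrent: number;

  constructor(handler: Handler, maxConcurrent: number) {
    this.handler = handler;
    this.maxConcurrent = maxConcurrent;
  }

  // Returns immediately; the turn itself runs in the background.
  enqueue(update: Update): void {
    this.pending.push(update);
    this.pump();
  }

  get stats(): { pending: number; running: number } {
    return { pending: this.pending.length, running: this.running };
  }

  private pump(): void {
    while (this.running < this.maxConcurrent) {
      const next = this.pending.shift();
      if (next === undefined) break;
      this.running++;
      const done = (): void => {
        this.running--;
        this.pump();
      };
      // A failed turn must not stall the queue, so both outcomes call done().
      this.handler(next).then(done, done);
    }
  }
}

// The drain loop no longer awaits the agent turn: it enqueues, removes the
// spooled file, and moves on to the next update.
function drainSpool(
  updates: Update[],
  queue: TurnQueue,
  removeSpooledFile: (update: Update) => void,
): void {
  for (const update of updates) {
    queue.enqueue(update);
    removeSpooledFile(update);
  }
}

// Usage: three updates, turns that never finish, concurrency limit of 2.
const started: number[] = [];
const removed: number[] = [];
const queue = new TurnQueue((u) => {
  started.push(u.id);
  return new Promise<void>(() => {}); // simulates a very long turn
}, 2);
drainSpool(
  [
    { id: 407, chat: "group" },
    { id: 408, chat: "dm" },
    { id: 409, chat: "dm-topic" },
  ],
  queue,
  (u) => removed.push(u.id),
);
```

In this sketch all three spooled files are released right away even though no turn has finished; two turns run and one waits in the queue. The trade-off is that a purely in-memory queue gives up whatever crash-safety the on-disk spool provides, so a real fix would probably need a persistent queue or a "claimed" marker rather than outright deletion.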

(B) Codex app-server stops emitting events mid-turn → 30 min terminal idle timeout

Live repro

agent:nikita:telegram:direct:854067528. User sent one analytics request at 17:55:51Z: "Сделай аналитику по маминому питанию подробную" (35 chars; in English: "Do a detailed analysis of Mom's diet"). After ~70s of normal lifecycle (1 tool round-trip), codex app-server went silent toward OpenClaw for 30:26, until the OpenClaw watchdog fired.

Runtime event timeline (from /export-trajectory)

seq=1 17:56:03.053Z session.started
seq=2 17:56:03.054Z context.compiled
seq=3 17:56:03.065Z prompt.submitted
seq=4 17:56:11.092Z tool.call           // last lifecycle event from codex
seq=5 17:56:12.730Z tool.result         //   "
//   ─── 30 min 26 s of silence on the JSON-RPC stdio stream ───
seq=6 18:26:38.855Z turn.terminal_idle_timeout
seq=7 18:26:38.871Z model.completed      (timedOut=true, aborted=true, promptError="codex app-server attempt timed out")
seq=8 18:26:38.874Z session.ended        (timedOut=true, promptError="codex app-server turn idle timed out waiting for turn/completed")
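
The size of the silent window can be checked mechanically from the timeline above. This is a minimal sketch; `RuntimeEvent` and `largestGap` are hypothetical names, and the events are transcribed by hand from the eight lines above rather than parsed from the exported JSONL.

```typescript
// Hypothetical helper: find the largest gap between consecutive runtime events.
interface RuntimeEvent {
  seq: number;
  ts: string; // ISO-8601 UTC
  type: string;
}

interface Gap {
  afterSeq: number;
  beforeSeq: number;
  gapMs: number;
}

function largestGap(events: RuntimeEvent[]): Gap | null {
  let worst: Gap | null = null;
  let prev: RuntimeEvent | null = null;
  for (const ev of events) {
    if (prev !== null) {
      const gapMs = Date.parse(ev.ts) - Date.parse(prev.ts);
      if (worst === null || gapMs > worst.gapMs) {
        worst = { afterSeq: prev.seq, beforeSeq: ev.seq, gapMs };
      }
    }
    prev = ev;
  }
  return worst;
}

const timeline: RuntimeEvent[] = [
  { seq: 1, ts: "2026-05-15T17:56:03.053Z", type: "session.started" },
  { seq: 2, ts: "2026-05-15T17:56:03.054Z", type: "context.compiled" },
  { seq: 3, ts: "2026-05-15T17:56:03.065Z", type: "prompt.submitted" },
  { seq: 4, ts: "2026-05-15T17:56:11.092Z", type: "tool.call" },
  { seq: 5, ts: "2026-05-15T17:56:12.730Z", type: "tool.result" },
  { seq: 6, ts: "2026-05-15T18:26:38.855Z", type: "turn.terminal_idle_timeout" },
  { seq: 7, ts: "2026-05-15T18:26:38.871Z", type: "model.completed" },
  { seq: 8, ts: "2026-05-15T18:26:38.874Z", type: "session.ended" },
];

// The gap between seq=5 and seq=6 is 1826125 ms, i.e. 30 min 26.125 s.
const gap = largestGap(timeline);
```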

Smoking-gun events

// seq=6 — the watchdog event
{
  "type": "turn.terminal_idle_timeout",
  "ts":   "2026-05-15T18:26:38.855Z",
  "sessionKey": "agent:nikita:telegram:direct:854067528",
  "threadId":   "019e2cc7-f734-7302-b9a8-d8de60ab84f1",
  "turnId":     "019e2cc7-f773-7162-8d1d-59aa148293e5",
  "provider":   "openai",
  "modelId":    "gpt-5.5",
  "modelApi":   "openai-responses",
  "data": {
    "idleMs":   1800001,
    "timeoutMs": 1800000,
    "lastActivityReason":       "notification:rawResponseItem/completed",
    "lastNotificationMethod":   "rawResponseItem/completed",
    "lastNotificationItemType": "custom_tool_call_output"   //  <-- key
  }
}
// seq=7 — model.completed forced by openclaw watchdog
{
  "type": "model.completed",
  "ts":   "2026-05-15T18:26:38.871Z",
  "data": {
    "threadId":    "019e2cc7-f734-7302-b9a8-d8de60ab84f1",
    "turnId":      "019e2cc7-f773-7162-8d1d-59aa148293e5",
    "timedOut":    true,
    "aborted":     true,
    "yieldDetected": false,
    "promptError": "codex app-server attempt timed out",
    "usage": { "input": 8099, "output": 285, "cacheRead": 36864, "total": 45248 },
    "assistantTexts": [
      "Данные есть с 28 апреля по 15 мая, плюс два веса: 86.5 → 84.5 кг. Важный нюанс: часть дней явно неполные, поэтому отделю «по записанному» от выводов по реальному рациону."
    ]
  }
}

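User-facing result in Telegram: that partial 285-token assistant message (in English: "Data is available from April 28 to May 15, plus two weights: 86.5 → 84.5 kg. Important caveat: some days are clearly incomplete, so I will separate 'as recorded' from conclusions about the actual diet."), followed by: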

Request timed out before a response was generated. Please try again, or increase agents.defaults.timeoutSeconds in your config.

What Codex was actually doing internally

I dumped ~/.openclaw/agents/nikita/agent/codex-home/logs_2.sqlite while the wedge was in progress. Of 5110 internal log entries, 1000 belong to the same threadId 019e2cc7-f734, the last being:

TRACE codex_core::session::turn  post sampling token usage turn_id=019e2cc7-f773-...

at 1778867799 (= 18:03:19Z). Activity inside Codex went silent at 18:03:19Z, but the JSON-RPC stdio stream toward openclaw stopped earlier than that (last runtime event reached openclaw at 17:56:12Z, ~7 minutes earlier). So two layers of silence:

  1. 17:56:12Z → ~18:03:19Z (~7 min): codex internally still active (model sampling, custom_tool_call_input.delta x355, response.completed x6, etc.), but its codex_app_server::outgoing_message stream stopped emitting after the last rawResponseItem/completed it had pushed to openclaw.
  2. ~18:03:19Z onward: codex itself fully idle. Process state S (sleeping) in futex_wait. No new internal log entries. No CPU activity.

Note lastNotificationItemType: "custom_tool_call_output" — the last codex notification successfully delivered was the result of a tool call. The model would normally then run another sampling round to consume that tool result and either issue more tool calls or finalize. Codex did the model round-trip internally (1000 more internal log entries) but never emitted any further rawResponseItem/started, item/completed, turn/completed etc. over JSON-RPC.

Success-case trajectory diff (same harness, different turn)

A simultaneous arkadiy DM turn (agent:arkadiy:telegram:direct:854067528:thread:854067528:807447, thread 019e2cad) completed cleanly at 18:23 with finishReason=stop, output 1.2k tokens, 51% context fill, no truncation, full lifecycle events emitted:

| Event type | SUCCESS (arkadiy) | FAIL (nikita) |
| --- | --- | --- |
| user.message | 2 | 1 |
| assistant.message | 38 | 13 |
| tool.call / tool.result | 34/34 | 11/11 |
| prompt.submitted | 2 | 1 |
| context.compiled | 2 | 1 |
| model.completed | 2 | 1 (forced, with timedOut:true) |
| session.ended | 2 | 1 (with timedOut:true) |
| turn.terminal_idle_timeout | 0 | 1 |

Curious side-detail in the FAIL transcript: tool.call/result events seq=11..39 (28 transcript events) all stamped within 60ms of the watchdog at 18:26:38.880-944Z, i.e. they appear flushed in a burst at session-end rather than streamed in real time. The 30-minute gap in the runtime stream between seq=5 (last real-time event) and seq=6 (watchdog) is unbroken.

Ask

  • Investigate why codex_app_server::outgoing_message stops emitting JSON-RPC notifications after a rawResponseItem/completed for an item of type custom_tool_call_output, while the underlying session loop continues to run sampling rounds and tool calls internally.
  • Or: surface a watchdog inside codex itself that detects "internal sampling rounds happening but outgoing stream is silent for N seconds" and either heals or fails the turn fast, instead of letting the openclaw-side 30-min idle watchdog be the only safety net.
  • Bonus: would also be useful to log the raw finish_reason from the OpenAI Responses API alongside the normalized stop/error/aborted/toolUse, so we can rule out finish_reason=length (truncation) cases for separate reports we're investigating.
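
As an illustration of the second ask (a codex-side watchdog), the check could be as simple as comparing the time of the last internal activity with the time of the last outgoing notification. The sketch below is written in TypeScript purely for illustration; `OutgoingStallDetector` and its methods are hypothetical and do not exist in Codex or OpenClaw, and the 60-second threshold is an arbitrary example value.

```typescript
// Hypothetical sketch: flags a turn as stalled when internal work has happened
// since the last outgoing notification AND the outgoing stream has been silent
// for longer than maxSilenceMs.
type StallVerdict = "ok" | "stalled";

class OutgoingStallDetector {
  private lastInternalMs: number | null = null;
  private lastOutgoingMs: number | null = null;
  private readonly maxSilenceMs: number;

  constructor(maxSilenceMs: number) {
    this.maxSilenceMs = maxSilenceMs;
  }

  // Call on every internal event (sampling round, tool call delta, ...).
  recordInternalActivity(nowMs: number): void {
    this.lastInternalMs = nowMs;
  }

  // Call whenever a JSON-RPC notification is actually written out.
  recordOutgoingNotification(nowMs: number): void {
    this.lastOutgoingMs = nowMs;
  }

  check(nowMs: number): StallVerdict {
    if (this.lastInternalMs === null || this.lastOutgoingMs === null) {
      return "ok"; // not enough information yet
    }
    const internalIsNewer = this.lastInternalMs > this.lastOutgoingMs;
    const outgoingSilenceMs = nowMs - this.lastOutgoingMs;
    return internalIsNewer && outgoingSilenceMs > this.maxSilenceMs
      ? "stalled"
      : "ok";
  }
}

// Usage with the last delivered notification from this report as the anchor.
const lastDelivered = Date.parse("2026-05-15T17:56:12.730Z");
const detector = new OutgoingStallDetector(60000);
detector.recordOutgoingNotification(lastDelivered);
detector.recordInternalActivity(lastDelivered + 5000); // internal work continues

const verdictAt30s = detector.check(lastDelivered + 30000); // within threshold
const verdictAt90s = detector.check(lastDelivered + 90000); // past threshold
```

With a threshold like this the turn would be flagged roughly a minute after the outgoing stream went quiet, instead of relying on the 30-minute openclaw-side idle timeout.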

Artifact bundle

Files attached as a secret gist:

  • SUCCESS-manifest.json, SUCCESS-runtime-events.jsonl — clean turn trajectory (arkadiy, 18:20-18:23Z)
  • FAIL-manifest.json, FAIL-runtime-events.jsonl — wedged turn trajectory (nikita, 17:56-18:26Z)
  • FAIL-tool-chronology.jsonl — all tool.call/tool.result timestamps from the wedged turn
  • openclaw-log-slice-1755-1830.log — filtered openclaw gateway log for both repro windows
  • spool-drain-monitor.log — 30-second snapshots of ingress-spool-arkadiy/ + ingress-spool-nikita/ + codex app-server PIDs across the repro window

Diagnostics zip from openclaw gateway diagnostics export (33 KiB, payload-free, sanitized) available on request — happy to attach to the issue if useful.


Reproducer (short form)

  1. Fresh OpenClaw 2026.5.12 install with native Codex harness (agentRuntime.id="codex"), ChatGPT Subscription OAuth, Telegram channel with multiple chats per agent
  2. Pick any agent, e.g. nikita
  3. From a Telegram client send a thinking-heavy multi-tool prompt (e.g. an analytics request that requires Memory Search + several Bash tool calls)
  4. Within ~10 sec from another chat tied to the same agent, send 2 more simple messages
  5. Expected: agent replies to all 3 within reasonable wall-clock time, terminal events emitted normally
  6. Actual: light messages sit in ~/.openclaw/telegram/ingress-spool-<agent>/ for the duration of the heavy turn (bug A), AND for some thinking-heavy tool-using turns the codex app-server runs the model+tools internally but stops emitting JSON-RPC notifications mid-turn, leading to the 30-minute idle timeout (bug B)

I can also enable diagnostics.flags=["*"] + logging.level=debug and re-capture with a full event trace if the runtime-event JSONL above is not enough.

Originally discussed with Krill on the OpenClaw Discord thread linked at the top.
