# 2026.5.12: Telegram isolated-ingress HOL blocking + Codex app-server stalls mid-turn after custom_tool_call_output → 30 min idle timeout (#82274)

## TL;DR

Two related but distinct bugs reproduced live on 2026-05-15 against OpenClaw 2026.5.12 + Codex CLI/app-server 0.130.0. Filed in one issue per Krill's guidance on the OpenClaw Discord support thread (2026-05-15, "Codex app-server turn idle timed out").

- **(A) Telegram isolated-ingress spool drain is serially HOL-blocked by the in-flight agent turn.** A long thinking-heavy turn in one chat freezes the spool drain for all other chats of the same `accountId`/agent — they don't reach the embedded run queue until the in-flight turn finishes.
- **(B) Codex app-server stops emitting JSON-RPC notifications mid-turn, after a tool round-trip, causing a 30 min terminal idle timeout.** Internally Codex keeps processing (1000+ log entries; multiple `response.completed`, `custom_tool_call_input.delta` events). Externally OpenClaw sees no events between `notification:rawResponseItem/completed` and the watchdog firing 30 min later. The user-facing result is a partial assistant text followed by `Request timed out before a response was generated`. This is the same symptom I reported in Krill's Discord thread; the new evidence below comes from a clean repro on a fresh app-server, after wiping all per-agent `codex-home/` dirs.

---

## Environment

| | |
|---|---|
| OpenClaw | `2026.5.12` (f066dd2) |
| Codex CLI / app-server | `0.130.0` (`/root/.openclaw/npm/node_modules/@openai/codex/bin/codex.js`, only install on PATH) |
| Node | `22.22.1` |
| OS | `Linux 5.15.0-174-generic x64`, Ubuntu 22.04.5 LTS |
| Gateway | systemd user service, loopback `ws://127.0.0.1:18789` |
| Channels | Telegram (10 isolated accounts incl. `arkadiy`, `nikita`) + WhatsApp (1) |
| Auth | ChatGPT Subscription OAuth `openai-codex:pashaganson@gmail.com` (Weekly 36% used, Short-term 1% used — not rate-limited) |
| Runtime selector | per-model `agentRuntime.id: "codex"` on all `openai/*` models. No top-level `embeddedHarness`. `plugins.entries.codex.enabled = true`. |
| Streaming | `channels.telegram.streaming = { mode: "partial", preview.toolProgress: true, progress.toolProgress: true, render: "rich" }` |
| Model | `openai/gpt-5.5`, `thinking: "high"`, `fastMode: true`, `textVerbosity: "medium"` |
| Active Memory plugin | enabled, default config |

---

## Pre-repro install audit + cleanup we did

Per Krill's first-pass checklist on Discord, we audited the install and found several pieces of stale state, which we cleaned up before this repro. Calling them out so they aren't re-suggested:

1. **Three different Codex CLI binaries on the host**: the openclaw bundle 0.130.0, system-global `/usr/lib/node_modules/@openai/codex@0.130.0`, and `snap codex@0.114.0` in `/snap/bin`. Only the bundle was actually being launched by openclaw (via absolute path), but the others were a latent risk. Removed snap and system-global; kept only the bundle.
2. **5 of 10 per-agent `~/.openclaw/agents/<agent>/agent/codex-home/` had no `auth.json`** (`vanechka`, `kirya`, `elena`, `nikita`, `dasha`). The other 5 did.
3. **One `codex app-server` process was being reused across all 10 agents**, with `CODEX_HOME` pinned to `~/.openclaw/agents/angela/agent/codex-home` regardless of which agent owned the turn. Per-agent isolation was effectively not isolating.
4. `codex-home-main` had grown to ~1 GB (`logs_2.sqlite`, `state_5.sqlite`, shell snapshots). Other codex-home dirs up to ~200 MB each.

Cleanup: stopped gateway, moved all 10 codex-home dirs to a timestamped backup (preserving `agent/auth-profiles.json` OAuth profiles), removed `/root/.codex` (personal CLI home, unused by openclaw), ran `openclaw doctor --fix`, restarted gateway. After this, the codex app-server spawns lazily per agent with the correct isolated `CODEX_HOME`. **Both bugs below reproduced anyway**, so neither is caused by the stale state we cleaned up.

---

## (A) Telegram isolated-ingress HOL blocking

**Confirmed by Krill from source**: the spool drain loop is

```js
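// Pseudocode of the drain loop: spooled updates are handled strictly one at a time.
for (const update of updates) {
  await bot.handleUpdate(update);  // resolves only after the full agent turn completes
  delete spooledFile;              // pseudocode: the spooled file is removed only afterwards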
}
```

`acp.maxConcurrentSessions: 8` is ACP-only; `agents.defaults.maxConcurrent` and `messages.queue` only apply after dequeue — none of them decouple this drain loop.
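
To make the head-of-line effect concrete, here is a minimal simulation of a serial drain loop. It assumes only what the loop above shows: the handler resolves when the turn finishes. The names `drainSerially`, `fakeTurn` and `demo`, and the scaled-down durations, are illustrative and not taken from the OpenClaw source.

```js
// Simulates a serial spool drain: each update waits for every earlier turn to finish.
// All names are hypothetical; durations are scaled-down stand-ins for real turns.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function drainSerially(updates, handleUpdate) {
  const startedAt = Date.now();
  const waits = [];
  for (const update of updates) {
    // Time this update spent queued behind earlier turns.
    waits.push({ id: update.id, waitedMs: Date.now() - startedAt });
    await handleUpdate(update); // blocks the loop for the whole turn
  }
  return waits;
}

// A fake agent turn that simply takes `turnMs` to complete.
const fakeTurn = (update) => sleep(update.turnMs);

async function demo() {
  const updates = [
    { id: "light-1", turnMs: 20 },
    { id: "heavy", turnMs: 200 },
    { id: "light-2", turnMs: 20 },
  ];
  const waits = await drainSerially(updates, fakeTurn);
  for (const w of waits) console.log(`${w.id} waited ~${w.waitedMs} ms`);
  return waits;
}
```

Calling `demo()` shows that the last light update cannot start until the sum of all earlier turn durations has elapsed, regardless of how small it is.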

### Live repro at 2026-05-15T18:20:09Z

User `854067528` sent 3 messages to agent `arkadiy` at ~18:20:10Z (light → heavy → light), one per channel: group topic, DM, DM topic.

| Inbound (UTC) | Channel | Body size | Spool wait |
|---|---|---|---|
| `18:20:09.142Z` | group `-1003794846986:topic:9` | 14 chars | ~0s (1st arrival, spool empty) |
| `18:20:40.091Z` | DM `854067528` | 147 chars | **~30s** (file 408 sat in spool 18:20:10 → 18:20:40) |
| `18:23:21.479Z` | DM topic `854067528:806808` | 16 chars | **~3m10s** (file 409 sat in spool 18:20:11 → ~18:23:21) |

Total drain time for the 3-message burst: ~3 minutes, all because the 1st turn (a 14-char `тест` ("test") message answered by `gpt-5.5` with `thinking: high`) ran for 3m30s and HOL-blocked the spool.

Spool drain timeline captured live in the attached `spool-drain-monitor.log` (background watcher snapshotting `/root/.openclaw/telegram/ingress-spool-arkadiy/` every 30s).
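
The watcher script itself is not part of this report. A minimal sketch of such a monitor, assuming only that spooled updates appear as files in the spool directory, could look like this (`snapshotSpool` and `watchSpool` are hypothetical names, not OpenClaw tooling):

```js
// Periodically lists a spool directory so files stuck behind a long turn show up in a log.
// Hypothetical helper; not the actual watcher used to produce spool-drain-monitor.log.
const fs = require("node:fs");

function snapshotSpool(dir) {
  // Sort for stable output; a missing directory is reported as empty.
  const files = fs.existsSync(dir) ? fs.readdirSync(dir).sort() : [];
  return { ts: new Date().toISOString(), count: files.length, files };
}

function watchSpool(dir, intervalMs = 30_000, log = console.log) {
  const tick = () => log(JSON.stringify(snapshotSpool(dir)));
  tick(); // take one snapshot immediately
  const timer = setInterval(tick, intervalMs);
  return () => clearInterval(timer); // call the returned function to stop watching
}
```

For example, `const stop = watchSpool("/root/.openclaw/telegram/ingress-spool-arkadiy");` would emit one JSON line every 30 seconds until `stop()` is called.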

### Impact

For a multi-channel personal-assistant install (10 agents, dozens of chats per agent), one long turn anywhere will freeze ALL inbound reception for that agent across every channel — including unrelated quick messages, status checks, and other users. The agent looks dead to anyone trying to ping it while busy.

### Ask

- Decouple the spool drain from `bot.handleUpdate` agent-turn completion. The drain should enqueue updates into a downstream queue and return promptly, letting the spooled file be deleted; agent execution then runs from the downstream queue with its own concurrency knob.
- Or, document an explicit config switch that does this. Currently we don't know of one.
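
A rough sketch of the shape being asked for, under stated assumptions: `TurnQueue`, `drainDecoupled` and `removeSpooledFile` are hypothetical names, and the in-memory queue stands in for whatever downstream queue OpenClaw would actually use.

```js
// Sketch of a drain loop decoupled from turn execution (all names hypothetical).
// The drain only enqueues; a bounded worker pool runs the turns.
class TurnQueue {
  constructor(handleUpdate, maxConcurrent = 2) {
    this.handleUpdate = handleUpdate;
    this.maxConcurrent = maxConcurrent;
    this.pending = [];
    this.running = 0;
    this.idleWaiters = [];
  }

  enqueue(update) {
    this.pending.push(update);
    this.pump();
  }

  pump() {
    // Start turns until the concurrency limit is reached.
    while (this.running < this.maxConcurrent && this.pending.length > 0) {
      const update = this.pending.shift();
      this.running += 1;
      Promise.resolve()
        .then(() => this.handleUpdate(update))
        .catch((err) => console.error("turn failed:", err))
        .finally(() => {
          this.running -= 1;
          this.pump();
        });
    }
    if (this.running === 0 && this.pending.length === 0) {
      for (const resolve of this.idleWaiters.splice(0)) resolve();
    }
  }

  // Resolves once every queued turn has finished.
  onIdle() {
    if (this.running === 0 && this.pending.length === 0) return Promise.resolve();
    return new Promise((resolve) => this.idleWaiters.push(resolve));
  }
}

async function drainDecoupled(updates, queue, removeSpooledFile) {
  for (const update of updates) {
    queue.enqueue(update);           // returns immediately
    await removeSpooledFile(update); // spool entry is released without waiting for the turn
  }
}
```

Two trade-offs are worth noting. Deleting the spooled file before the turn completes gives up crash durability unless the downstream queue is itself persisted, and running turns concurrently would likely need per-chat ordering so that two messages from the same chat are not processed out of order.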

---

## (B) Codex app-server stops emitting events mid-turn → 30 min terminal idle timeout

### Live repro

`agent:nikita:telegram:direct:854067528`. User sent one analytics request at `17:55:51Z`: *"Сделай аналитику по маминому питанию подробную"* (35 chars; in English, roughly "Do a detailed analysis of mom's nutrition"). After ~70s of normal lifecycle (1 tool round-trip), codex app-server went silent toward OpenClaw for **30:26**, until the OpenClaw watchdog fired.

### Runtime event timeline (from `/export-trajectory`)

```jsonc
seq=1 17:56:03.053Z session.started
seq=2 17:56:03.054Z context.compiled
seq=3 17:56:03.065Z prompt.submitted
seq=4 17:56:11.092Z tool.call           // seq=4 and seq=5: last lifecycle events from codex
seq=5 17:56:12.730Z tool.result
//   ─── 30 min 26 s of silence on the JSON-RPC stdio stream ───
seq=6 18:26:38.855Z turn.terminal_idle_timeout
seq=7 18:26:38.871Z model.completed      (timedOut=true, aborted=true, promptError="codex app-server attempt timed out")
seq=8 18:26:38.874Z session.ended        (timedOut=true, promptError="codex app-server turn idle timed out waiting for turn/completed")
```
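
The size of the gap can be checked mechanically from the timestamps in the timeline. The helper below is an illustrative sketch (`longestGap` and `formatMs` are made-up names); the event data is copied from the timeline, with the repro date 2026-05-15 filled in.

```js
// Finds the longest gap between consecutive runtime events.
// Timestamps are copied from the timeline; the date is the repro date (2026-05-15).
const events = [
  { seq: 1, ts: "2026-05-15T17:56:03.053Z", type: "session.started" },
  { seq: 2, ts: "2026-05-15T17:56:03.054Z", type: "context.compiled" },
  { seq: 3, ts: "2026-05-15T17:56:03.065Z", type: "prompt.submitted" },
  { seq: 4, ts: "2026-05-15T17:56:11.092Z", type: "tool.call" },
  { seq: 5, ts: "2026-05-15T17:56:12.730Z", type: "tool.result" },
  { seq: 6, ts: "2026-05-15T18:26:38.855Z", type: "turn.terminal_idle_timeout" },
  { seq: 7, ts: "2026-05-15T18:26:38.871Z", type: "model.completed" },
  { seq: 8, ts: "2026-05-15T18:26:38.874Z", type: "session.ended" },
];

function longestGap(evts) {
  let worst = { fromSeq: null, toSeq: null, gapMs: -1 };
  for (let i = 1; i < evts.length; i += 1) {
    const gapMs = Date.parse(evts[i].ts) - Date.parse(evts[i - 1].ts);
    if (gapMs > worst.gapMs) {
      worst = { fromSeq: evts[i - 1].seq, toSeq: evts[i].seq, gapMs };
    }
  }
  return worst;
}

function formatMs(ms) {
  const totalSeconds = Math.floor(ms / 1000);
  return `${Math.floor(totalSeconds / 60)}m ${totalSeconds % 60}s`;
}

const gap = longestGap(events);
console.log(`seq ${gap.fromSeq} -> ${gap.toSeq}: ${formatMs(gap.gapMs)}`); // seq 5 -> 6: 30m 26s
```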

### Smoking-gun events

```jsonc
// seq=6 — the watchdog event
{
  "type": "turn.terminal_idle_timeout",
  "ts":   "2026-05-15T18:26:38.855Z",
  "sessionKey": "agent:nikita:telegram:direct:854067528",
  "threadId":   "019e2cc7-f734-7302-b9a8-d8de60ab84f1",
  "turnId":     "019e2cc7-f773-7162-8d1d-59aa148293e5",
  "provider":   "openai",
  "modelId":    "gpt-5.5",
  "modelApi":   "openai-responses",
  "data": {
    "idleMs":   1800001,
    "timeoutMs": 1800000,
    "lastActivityReason":       "notification:rawResponseItem/completed",
    "lastNotificationMethod":   "rawResponseItem/completed",
    "lastNotificationItemType": "custom_tool_call_output"   //  <-- key
  }
}
```
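
The watchdog event also lets us back out when the idle clock started. This is a small sketch that assumes `idleMs` is measured from the last notification OpenClaw received, which is not confirmed from source; `lastActivityFromWatchdog` is a made-up helper name.

```js
// Derives when the idle clock started from the watchdog event's own fields.
// Assumption: idleMs counts from the last notification OpenClaw received.
function lastActivityFromWatchdog(event) {
  const firedAt = Date.parse(event.ts);
  return new Date(firedAt - event.data.idleMs).toISOString();
}

const watchdogEvent = {
  ts: "2026-05-15T18:26:38.855Z",
  data: { idleMs: 1800001, timeoutMs: 1800000 },
};

console.log(lastActivityFromWatchdog(watchdogEvent)); // 2026-05-15T17:56:38.854Z
```

Under that assumption, the last `rawResponseItem/completed` notification arrived roughly 26 seconds after the `tool.result` runtime event at `17:56:12.730Z`.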

```jsonc
// seq=7 — model.completed forced by openclaw watchdog
{
  "type": "model.completed",
  "ts":   "2026-05-15T18:26:38.871Z",
  "data": {
    "threadId":    "019e2cc7-f734-7302-b9a8-d8de60ab84f1",
    "turnId":      "019e2cc7-f773-7162-8d1d-59aa148293e5",
    "timedOut":    true,
    "aborted":     true,
    "yieldDetected": false,
    "promptError": "codex app-server attempt timed out",
    "usage": { "input": 8099, "output": 285, "cacheRead": 36864, "total": 45248 },
    "assistantTexts": [
      "Данные есть с 28 апреля по 15 мая, плюс два веса: 86.5 → 84.5 кг. Важный нюанс: часть дней явно неполные, поэтому отделю «по записанному» от выводов по реальному рациону."
    ]
  }
}
```

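User-facing result in Telegram: that partial 285-token assistant message (in English, roughly: "There is data from April 28 to May 15, plus two weights: 86.5 → 84.5 kg. Important caveat: some days are clearly incomplete, so I will separate 'as recorded' from conclusions about the actual diet."), followed by: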

> `Request timed out before a response was generated. Please try again, or increase agents.defaults.timeoutSeconds in your config.`

### What Codex was actually doing internally

I dumped `~/.openclaw/agents/nikita/agent/codex-home/logs_2.sqlite` while the wedge was in progress. Of 5110 internal log entries, **1000 belonged to the same `threadId` `019e2cc7-f734`**, the last being:

```
TRACE codex_core::session::turn  post sampling token usage turn_id=019e2cc7-f773-...
```

at `1778867799` (= 18:03:19Z). Activity inside Codex went silent at 18:03:19Z, but the JSON-RPC stdio stream toward openclaw stopped earlier than that (last runtime event reached openclaw at 17:56:12Z, ~7 minutes earlier). So **two layers** of silence:

1. **17:56:12Z → ~18:03:19Z (~7 min)**: codex *internally* still active (model sampling, custom_tool_call_input.delta x355, response.completed x6, etc.), but its `codex_app_server::outgoing_message` stream stopped emitting after the last `rawResponseItem/completed` it had pushed to openclaw.
2. **~18:03:19Z onward**: codex itself fully idle. Process state `S (sleeping)` in `futex_wait`. No new internal log entries. No CPU activity.

Note `lastNotificationItemType: "custom_tool_call_output"` — the last codex notification successfully delivered was the **result** of a tool call. The model would normally then run another sampling round to consume that tool result and either issue more tool calls or finalize. Codex did the model round-trip internally (1000 more internal log entries) but never emitted any further `rawResponseItem/started`, `item/completed`, `turn/completed` etc. over JSON-RPC.

### Success-case trajectory diff (same harness, different turn)

A simultaneous arkadiy DM turn (`agent:arkadiy:telegram:direct:854067528:thread:854067528:807447`, thread `019e2cad`) completed cleanly at 18:23 with `finishReason=stop`, output 1.2k tokens, 51% context fill, no truncation, full lifecycle events emitted:

| Event type | SUCCESS (arkadiy) | **FAIL (nikita)** |
|---|---|---|
| user.message | 2 | 1 |
| assistant.message | 38 | 13 |
| tool.call / tool.result | 34/34 | 11/11 |
| prompt.submitted | 2 | 1 |
| context.compiled | 2 | 1 |
| model.completed | 2 | 1 (forced, with `timedOut:true`) |
| session.ended | 2 | 1 (with `timedOut:true`) |
| **`turn.terminal_idle_timeout`** | 0 | **1** |

Curious side-detail in the FAIL transcript: tool.call/result events `seq=11..39` (28 transcript events) all stamped within **60ms of the watchdog at 18:26:38.880-944Z**, i.e. they appear flushed in a burst at session-end rather than streamed in real time. The 30-minute gap in the runtime stream between `seq=5` (last real-time event) and `seq=6` (watchdog) is unbroken.
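
One way to check for this kind of end-of-session burst in an exported JSONL file is to filter records by distance from the watchdog timestamp. The sketch below assumes each record carries `type` and `ts` fields like the runtime events shown earlier; `parseJsonl` and `eventsNear` are hypothetical helpers, and the sample records are invented for illustration.

```js
// Flags records stamped within a small window of a reference time (here: the watchdog firing).
// Assumes each JSONL record has `type` and `ts` fields; the sample data below is made up.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

function eventsNear(records, referenceTs, windowMs = 100) {
  const ref = Date.parse(referenceTs);
  return records.filter((r) => Math.abs(Date.parse(r.ts) - ref) <= windowMs);
}

const sampleJsonl = [
  '{"type":"tool.call","ts":"2026-05-15T17:56:11.092Z"}',
  '{"type":"tool.call","ts":"2026-05-15T18:26:38.880Z"}',
  '{"type":"tool.result","ts":"2026-05-15T18:26:38.944Z"}',
].join("\n");

const burst = eventsNear(parseJsonl(sampleJsonl), "2026-05-15T18:26:38.855Z");
console.log(burst.length); // 2
```

Run against the real export, a large count here alongside an empty stream during the preceding 30 minutes would match the flush-at-session-end pattern described above.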

### Ask

- Investigate why `codex_app_server::outgoing_message` stops emitting JSON-RPC notifications after a `rawResponseItem/completed` for an item of type `custom_tool_call_output`, while the underlying session loop continues to run sampling rounds and tool calls internally.
- Or: surface a watchdog inside codex itself that detects "internal sampling rounds happening but outgoing stream is silent for N seconds" and either heals or fails the turn fast, instead of letting the openclaw-side 30-min idle watchdog be the only safety net.
- Bonus: would also be useful to log the raw `finish_reason` from the OpenAI Responses API alongside the normalized `stop/error/aborted/toolUse`, so we can rule out `finish_reason=length` (truncation) cases for separate reports we're investigating.
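
The second ask could be approximated by a detector along these lines. This is only a sketch of the idea, written in JavaScript purely for illustration; `OutgoingStallDetector` and its methods are hypothetical and do not correspond to anything in Codex.

```js
// Sketch of a "busy inside, silent outside" detector (hypothetical design).
// Timestamps are passed in explicitly so the logic is easy to test.
class OutgoingStallDetector {
  constructor(maxSilenceMs, startMs) {
    this.maxSilenceMs = maxSilenceMs;
    this.lastInternalAt = startMs; // last internal sampling/tool activity
    this.lastOutgoingAt = startMs; // last message written to the outgoing stream
  }

  noteInternalActivity(nowMs) {
    this.lastInternalAt = nowMs;
  }

  noteOutgoingMessage(nowMs) {
    this.lastOutgoingAt = nowMs;
  }

  // True when internal work happened after the last outgoing message
  // and the outgoing stream has been silent for longer than the limit.
  isStalled(nowMs) {
    const internalSinceLastOut = this.lastInternalAt > this.lastOutgoingAt;
    return internalSinceLastOut && nowMs - this.lastOutgoingAt > this.maxSilenceMs;
  }
}

const detector = new OutgoingStallDetector(60_000, 0);
detector.noteOutgoingMessage(1_000);   // e.g. a notification is emitted
detector.noteInternalActivity(5_000);  // e.g. another sampling round runs internally
console.log(detector.isStalled(30_000)); // false
console.log(detector.isStalled(61_001)); // true
```

A turn flagged this way could then be failed after about a minute rather than after the 30-minute idle timeout.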

---

## Artifact bundle

Files attached as a [secret gist](https://gist.github.com/PashaGanson/55c710b144841ad6de3d1054a1cb881e):

- `SUCCESS-manifest.json`, `SUCCESS-runtime-events.jsonl` — clean turn trajectory (arkadiy, 18:20-18:23Z)
- `FAIL-manifest.json`, `FAIL-runtime-events.jsonl` — wedged turn trajectory (nikita, 17:56-18:26Z)
- `FAIL-tool-chronology.jsonl` — all tool.call/tool.result timestamps from the wedged turn
- `openclaw-log-slice-1755-1830.log` — filtered openclaw gateway log for both repro windows
- `spool-drain-monitor.log` — 30-second snapshots of `ingress-spool-arkadiy/` + `ingress-spool-nikita/` + codex app-server PIDs across the repro window

Diagnostics zip from `openclaw gateway diagnostics export` (33 KiB, payload-free, sanitized) available on request — happy to attach to the issue if useful.

---

## Reproducer (short form)

1. Fresh OpenClaw 2026.5.12 install with native Codex harness (`agentRuntime.id="codex"`), ChatGPT Subscription OAuth, Telegram channel with multiple chats per agent
2. Pick any agent, e.g. `nikita`
3. From a Telegram client send a thinking-heavy multi-tool prompt (e.g. an analytics request that requires Memory Search + several Bash tool calls)
4. Within ~10 sec from another chat tied to the same agent, send 2 more simple messages
5. Expected: agent replies to all 3 within reasonable wall-clock time, terminal events emitted normally
6. Actual: light messages sit in `~/.openclaw/telegram/ingress-spool-<agent>/` for the duration of the heavy turn (bug A), AND for some thinking-heavy tool-using turns the codex app-server runs the model+tools internally but stops emitting JSON-RPC notifications mid-turn, leading to the 30-minute idle timeout (bug B)

I can also enable `diagnostics.flags=["*"]` + `logging.level=debug` and re-capture with a full event trace if the runtime-event JSONL above is not enough.

Originally discussed with Krill on the OpenClaw Discord thread referenced in the TL;DR.

