Skip to content

[Bug]: codex/gpt-5.4 via bundled Codex harness hangs indefinitely on initialize in v2026.4.10 (macOS, codex-cli 0.118.0 and 0.120.0) #64744

@jankutschera

Description

@jankutschera

Summary

In OpenClaw 2026.4.10 (PR #64298, released 2026-04-10), using a codex/gpt-* model (bundled Codex plugin) causes inference to hang indefinitely with no error, no log output from the codex subsystem, and no timeout. The child codex app-server subprocess is spawned successfully and sits idle. The bug occurs with both stdio and websocket transports and with both codex-cli 0.118.0 and 0.120.0, ruling out a transport-layer or codex-cli version issue.

The existing openai-codex/gpt-5.4 OAuth path continues to work fine, so this is specific to the new bundled Codex harness.

Environment

  • openclaw 2026.4.10 (44e5b62)
  • codex-cli 0.118.0 and 0.120.0 (both reproduce)
  • macOS 26.4 (arm64), Node v25.9.0
  • Codex CLI auth mode: chatgpt (ChatGPT Pro OAuth), verified with codex login status
  • CODEX_INTERNAL_ORIGINATOR_OVERRIDE is not set (checked env and launchctl getenv), so PR fix: [codex-harness] parse Desktop app-server user agents #64666 parser bug is ruled out

Reproduction

Wicki agent config:

{
  "id": "main",
  "model": {
    "primary": "codex/gpt-5.4",
    "fallbacks": ["openai-codex/gpt-5.4", "ollama/minimax-m2.7:cloud"]
  }
}

Plugin config:

{
  "plugins": {
    "entries": {
      "codex": {
        "enabled": true,
        "config": { "discovery": { "enabled": true } }
      }
    }
  }
}

Run:

openclaw capability model run --local --model codex/gpt-5.4 --prompt "say hi"

Expected: A reply within a few seconds.
Actual: The command hangs for 4+ minutes until killed. No output. No error. No log entry from the codex subsystem in /tmp/openclaw/openclaw-YYYY-MM-DD.log.

ps during hang shows the child process spawned but idle:

jankutschera  65333   0.0  0.1   codex app-server --listen stdio://
jankutschera  65332   0.0  0.1   node /opt/homebrew/bin/codex app-server --listen stdio://
jankutschera  65089   0.0  1.1   openclaw-infer

What works (to narrow the scope)

  1. Direct Codex CLI inference works (so auth + ChatGPT Pro OAuth quota are fine):

    echo "say hi" | codex exec --model gpt-5.4 --skip-git-repo-check
    # => responds in ~3s
  2. Manual JSON-RPC handshake to codex app-server --listen stdio:// works:

    (echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"clientInfo":{"name":"openclaw","title":"OpenClaw","version":"2026.4.10"},"capabilities":{"experimentalApi":true}}}'; sleep 2) | codex app-server --listen stdio://

    Returns a clean response:

    {"id":1,"result":{"userAgent":"openclaw/0.118.0 (Mac OS 26.4.0; arm64) kitty (openclaw; 2026.4.10)","codexHome":"/Users/jankutschera/.codex","platformFamily":"unix","platformOs":"macos"}}

    The returned userAgent parses fine through the current regex /^[^/\s]+\/(\d+\.\d+\.\d+(?:[-+][^\s()]*)?)/ at dist/harness-uOoDA-Wv.js:588, so the parser path does not appear to be the failure point in this case.

  3. openai-codex/gpt-5.4 (legacy OAuth path) works fine — the issue is isolated to the new bundled Codex harness.

  4. WebSocket transport hangs identically. I started an external codex app-server --listen ws://127.0.0.1:39175 (confirmed reachable via nc -zv), switched the plugin to appServer.transport: "websocket" with that URL, and reran. OpenClaw connected (I can see the websocket accepted and codex_core::codex logs a skill load on the server side), but no reply ever flows back to OpenClaw. This rules out transport-specific stdio issues.

Source analysis

File: dist/harness-uOoDA-Wv.js (bundled)

No timeout on initial handshake

CodexAppServerClient.initialize() at line 380-392:

async initialize() {
    if (this.initialized) return;
    assertSupportedCodexAppServerVersion(await this.request("initialize", {
        clientInfo: { name: "openclaw", title: "OpenClaw", version: VERSION },
        capabilities: { experimentalApi: true }
    }));
    this.notify("initialized");
    this.initialized = true;
}

CodexAppServerClient.request() at line 393-449:

request(method, params, options = {}) {
    // ...
    if (options.timeoutMs && Number.isFinite(options.timeoutMs) && options.timeoutMs > 0) {
        timeout = setTimeout(() => rejectPending(new Error(`${method} timed out`)), ...);
    }
    // ...
}

initialize() calls this.request("initialize", ...) without any options.timeoutMs, so the if branch at line 419 is false and no timeout is ever set for the initialize request. If the response doesn't arrive (or isn't parsed/matched back to the pending promise), OpenClaw waits forever.

Shared-client bootstrap timeout also 0ms

At line 651 (shared client initializer):

return await withTimeout(state.promise, options?.timeoutMs ?? 0, "codex app-server initialize timed out");

The caller at line 643 (getSharedCodexAppServerClient()) does not pass timeoutMs, so the wrapper effectively applies a 0ms timeout (no timeout). The "codex app-server initialize timed out" message is dead code for the current call path.

Transport is correct

createStdioTransport at line 236-246:

return spawn(options.command, options.args, {
    env: process.env,
    detached: process.platform !== "win32",
    stdio: ["pipe", "pipe", "pipe"]
});

writeMessage at line 476-477: this.child.stdin.write(\${JSON.stringify(message)}\n`)— newline-terminated, nojsonrpc: "2.0"field. Manual pipe test above confirmscodex app-server` accepts this format.

Ruled out

  • PR fix: [codex-harness] parse Desktop app-server user agents #64666 parser bugCODEX_INTERNAL_ORIGINATOR_OVERRIDE is not set in my environment; userAgent parses cleanly. Also, this PR's symptom is a thrown error, not a silent hang.
  • codex-cli version mismatch — reproduced on both 0.118.0 (min required per assertSupportedCodexAppServerVersion) and 0.120.0.
  • auth / quota — direct codex exec works, ChatGPT Pro OAuth is verified via codex login status.
  • Transport layer — both stdio and websocket transports hang with identical symptoms.
  • Malformed SKILL.md — I had one skill missing frontmatter (/Users/.../github/.agents/claw-score/SKILL.md); I fixed it, no change.

Root cause hypothesis

The hang is on the OpenClaw client side, not the transport and not codex app-server. Either:

  1. The request/response correlation (mapping the id on the incoming response back to the pending Map) fails silently, or
  2. The stdout read loop never sees the bytes (possibly a line-splitting / buffering issue in the stdio reader — not visible in the bundled code excerpts above), or
  3. There's an unhandled-promise-rejection path that swallows an exception during assertSupportedCodexAppServerVersion without logging.

Either way, the immediate visible bug is that CodexAppServerClient.initialize() has no timeout protection — which makes this latent failure mode manifest as an infinite hang instead of a clear error.

Suggested minimal fix (to surface the underlying issue)

At line 382, pass a reasonable timeout:

assertSupportedCodexAppServerVersion(await this.request("initialize", {
    clientInfo: { name: "openclaw", title: "OpenClaw", version: VERSION },
    capabilities: { experimentalApi: true }
}, { timeoutMs: 10_000 }));  // <— add this

And/or at line 651, pass the requestTimeoutMs from plugin config (60_000 default) instead of 0. That will at least convert the silent hang into an actionable initialize timed out error that can be diagnosed properly.

Related

Happy to gather more diagnostics (e.g., dtruss on the child stdio, NODE_DEBUG=child_process log, or a minimal standalone Node client that reproduces the spawn) if helpful — just let me know what would move this forward.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions