openai-codex can hang on Working... with zero-usage aborted turns #4945

@liushuaiiu

Summary

openai-codex / gpt-5.5 sometimes leaves the interactive TUI stuck on Working... with no streamed text, no tool call, and no visible error. The only way to recover is pressing Escape, which records an aborted assistant turn.

This has happened repeatedly over the last couple of days in normal interactive use.

Environment

  • pi: 0.75.5
  • Node: v22.22.1
  • Provider/model: openai-codex / gpt-5.5
  • Thinking level: xhigh
  • No explicit transport setting in user settings
  • Default tools/extensions enabled

Observed behavior

When the issue occurs:

  • TUI keeps showing Working... for minutes.
  • No assistant text is streamed.
  • No tool call is emitted.
  • Pressing Escape aborts the turn.
  • The saved session entry for that assistant turn has:
{
  "role": "assistant",
  "stopReason": "aborted",
  "content": [],
  "usage": {
    "input": 0,
    "output": 0,
    "cacheRead": 0,
    "cacheWrite": 0,
    "totalTokens": 0
  }
}

I saw this pattern multiple times, including cases where the previous turn had completed normally and the next user message then hung before any provider usage was recorded.

This looks different from a long reasoning turn: there is no usage, no partial reasoning/text, and no tool call.
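To count these turns in saved session data without sharing raw logs, a filter along the following lines could work. This is only a sketch: the `SessionEntry` and `Usage` shapes are inferred from the entry quoted above, and `isZeroUsageAbort` / `countZeroUsageAborts` are hypothetical helper names, not part of pi.

```typescript
// Shapes inferred from the session entry quoted in this report (assumption:
// real entries may carry additional fields, which this sketch ignores).
interface Usage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

interface SessionEntry {
  role: string;
  stopReason?: string;
  content?: unknown[];
  usage?: Usage;
}

// Hypothetical helper: true when an assistant turn was aborted before any
// content was produced and before any provider usage was recorded.
function isZeroUsageAbort(entry: SessionEntry): boolean {
  if (entry.role !== "assistant" || entry.stopReason !== "aborted") {
    return false;
  }
  const hasContent = (entry.content?.length ?? 0) > 0;
  const u = entry.usage;
  const hasUsage =
    u !== undefined &&
    u.input + u.output + u.cacheRead + u.cacheWrite + u.totalTokens > 0;
  return !hasContent && !hasUsage;
}

// Hypothetical helper: number of matching turns in a list of entries.
function countZeroUsageAborts(entries: SessionEntry[]): number {
  return entries.filter(isZeroUsageAbort).length;
}
```

A turn that was aborted after partial text or after some usage was recorded would not match, which keeps long reasoning turns out of the count.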

Expected behavior

If the provider/transport stalls before the first event, pi should eventually surface a timeout/transport error or retry in a bounded way, instead of keeping the TUI in Working... indefinitely until the user manually aborts.

Suspected area

From the installed 0.75.5 package:

  • SettingsManager.getTransport() returns "auto" when no setting is present.
  • The docs/settings table says the default is "sse", so there may also be a docs/runtime default mismatch.
  • For openai-codex-responses, transport=auto attempts WebSocket first.
  • retry.provider.timeoutMs appears to be passed into streamSimple(), but the openai-codex-responses implementation does not seem to apply it to the Codex fetch/WebSocket wait path in the same way as SDK-based providers.
  • The WebSocket event loop can wait for the first message/completion without an obvious idle timer.

So the likely failure mode is: the Codex WebSocket/transport stalls before the first event, so no assistant event is ever produced; the interactive UI keeps showing Working...; and Escape finally records an aborted zero-usage turn.

Suggested fix / mitigation

Possible fixes:

  1. Add a hard idle timeout for openai-codex-responses WebSocket and SSE stream waits, especially before the first event.
  2. Ensure retry.provider.timeoutMs or httpIdleTimeoutMs applies consistently to this provider path.
  3. If auto is intended as the runtime default, update docs; if sse is intended, adjust SettingsManager.getTransport().
  4. Optionally show a clearer status/error when a provider turn has produced zero events for a long time.
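To illustrate fixes 1 and 4, here is a minimal sketch of an idle watchdog that distinguishes "no first event yet" from "stream went quiet". It does not reflect pi's actual internals: `StreamWatchdog`, its method names, and the injected clock are all hypothetical, and the real provider code would still need to cancel the fetch/WebSocket when the watchdog reports a stall.

```typescript
// Hypothetical states for a provider turn, as seen by the watchdog.
type WatchdogState = "waiting-first-event" | "streaming" | "stalled";

// Hypothetical idle watchdog. The clock is injected so the logic can be
// tested without real timers; it defaults to the wall clock.
class StreamWatchdog {
  private readonly idleTimeoutMs: number;
  private readonly now: () => number;
  private readonly startedAt: number;
  private lastEventAt: number | undefined = undefined;

  constructor(idleTimeoutMs: number, now: () => number = () => Date.now()) {
    this.idleTimeoutMs = idleTimeoutMs;
    this.now = now;
    this.startedAt = now();
  }

  // Call once for every event received from the transport.
  onEvent(): void {
    this.lastEventAt = this.now();
  }

  // "stalled" once idleTimeoutMs has passed since the last event, or since
  // the start of the turn if no event has arrived at all.
  state(): WatchdogState {
    const reference = this.lastEventAt ?? this.startedAt;
    if (this.now() - reference >= this.idleTimeoutMs) {
      return "stalled";
    }
    return this.lastEventAt === undefined ? "waiting-first-event" : "streaming";
  }
}
```

A caller could poll `state()` on an interval: on "stalled" it would abort the underlying request and surface a timeout error or trigger a bounded retry, and while the state is "waiting-first-event" the TUI could show something more specific than Working....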

Local workaround I am considering, but have not applied yet:

{
  "transport": "sse",
  "httpIdleTimeoutMs": 120000
}

I can provide more sanitized session metadata if helpful, but I avoided attaching raw session logs because they contain private conversation/tool context.
