bug: WebSocket transport disconnects are handled as fatal termination with multiple abrupt disconnect paths #13949

@William17738

Summary

WebSocket disconnections (from network issues, slow clients, or server-side close) are handled as fatal session termination rather than recoverable events. Multiple code paths convert transient transport issues into permanent session death with no reconnection or graceful degradation.

I did not find a guaranteed process panic in static review. Taken together, though, these paths would explain the reported behavior, where transient disconnects end the session as if they were fatal.

Finding 1: Slow or congested clients are proactively disconnected

File: codex-rs/app-server/src/transport.rs

  • Line 46: CHANNEL_CAPACITY = 128 (bounded outbound queue).
  • Line 592: writer.try_send(message) for disconnectable connections.
  • Line 596: TrySendError::Full(_) → immediate disconnect.
  • Lines 445-450: inbound Ping handler uses try_send for Pong; queue-full → connection closed.

A slow client that falls 128 messages behind is disconnected without warning or backpressure signal.
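One possible alternative to disconnecting on the first full queue is to shed load by message class. The sketch below is a minimal illustration, not the code in transport.rs: it uses the standard library's bounded `sync_channel` as a stand-in for the real outbound queue, and the `Outbound`, `Enqueue`, `enqueue`, and `bounded` names are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical message type; the real outbound type in transport.rs differs.
#[derive(Debug, PartialEq)]
pub enum Outbound {
    Critical(String),
    BestEffort(String),
}

/// What the caller should do after attempting to enqueue a message.
#[derive(Debug, PartialEq)]
pub enum Enqueue {
    Sent,
    Dropped,    // queue full; the best-effort message was skipped
    Disconnect, // queue full on a critical message, or the receiver is gone
}

/// Try to enqueue without blocking. A full queue forces a disconnect only
/// when the message cannot be safely skipped.
pub fn enqueue(tx: &SyncSender<Outbound>, msg: Outbound) -> Enqueue {
    match tx.try_send(msg) {
        Ok(()) => Enqueue::Sent,
        Err(TrySendError::Full(Outbound::BestEffort(_))) => Enqueue::Dropped,
        Err(TrySendError::Full(Outbound::Critical(_))) => Enqueue::Disconnect,
        Err(TrySendError::Disconnected(_)) => Enqueue::Disconnect,
    }
}

/// Bounded queue for experiments (assumed to play the role of the
/// CHANNEL_CAPACITY-sized queue described above).
pub fn bounded(capacity: usize) -> (SyncSender<Outbound>, Receiver<Outbound>) {
    sync_channel(capacity)
}
```

With this shape, a momentarily slow client loses only best-effort traffic. Which messages count as best-effort is a protocol decision that this sketch does not settle.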

Finding 2: WebSocket task teardown is hard-abort

File: codex-rs/app-server/src/transport.rs, lines 363-370

tokio::select! {
    _ = &mut outbound_task => {
        disconnect_token.cancel();
        inbound_task.abort();  // hard abort
    }
    _ = &mut inbound_task => {
        disconnect_token.cancel();
        outbound_task.abort();  // hard abort
    }
}

When either loop exits (including on transient errors), the peer task is force-aborted. This risks dropping tail messages in the outbound queue and bypasses WebSocket close frame semantics.
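To illustrate the difference between aborting a writer and letting it finish, here is a minimal sketch using standard-library threads and channels rather than the async tasks in transport.rs. `Frame`, `spawn_writer`, and `shutdown` are hypothetical names. The writer flushes whatever is still queued and ends with a Close frame.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a WebSocket frame.
#[derive(Debug, PartialEq)]
pub enum Frame {
    Text(String),
    Close,
}

/// Spawn a writer that forwards queued messages to `sink` and, once every
/// sender is dropped, finishes with a Close frame instead of vanishing.
pub fn spawn_writer(sink: Sender<Frame>) -> (Sender<String>, JoinHandle<()>) {
    let (tx, rx) = channel::<String>();
    let handle = thread::spawn(move || {
        // Iterating `rx` ends only after the queue is empty AND all senders
        // are gone, so tail messages are flushed before shutdown.
        for msg in rx {
            if sink.send(Frame::Text(msg)).is_err() {
                return; // peer is gone; nothing left to flush to
            }
        }
        let _ = sink.send(Frame::Close);
    });
    (tx, handle)
}

/// Graceful counterpart to `task.abort()`: stop producing, then wait.
pub fn shutdown(tx: Sender<String>, handle: JoinHandle<()>) {
    drop(tx); // signal: no more messages
    handle.join().expect("writer thread panicked");
}
```

An async version would presumably bound the wait with a timeout and fall back to abort only if the peer task fails to finish in time.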

Finding 3: Realtime WebSocket has zero reconnection logic

File: codex-rs/core/src/realtime_conversation.rs, lines 317-388

spawn_realtime_input_task breaks on any send failure (line 330) or error/close event (lines 351, 359). It sends a RealtimeConversationClosed event and exits permanently. There is no retry or reconnection loop.

File: codex-rs/codex-api/src/endpoint/realtime_websocket/methods.rs

  • Line 57: spawns the pump task.
  • Line 74: breaks pump loop on send error.
  • Line 154: turns failed command delivery into WsError::ConnectionClosed.

Any network hiccup permanently kills the realtime conversation session.
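For reference, a bounded reconnect policy can be small. The following is a sketch under stated assumptions, not the realtime client's API: `connect_with_retry` and all of its parameters are hypothetical, the connection attempt is passed in as a closure, and `sleep` is injected so the policy can be exercised without real delays.

```rust
use std::time::Duration;

/// Retry `connect` up to `max_attempts` times with exponential backoff,
/// capped at `max_delay`. Returns the last error if every attempt fails.
pub fn connect_with_retry<C, E>(
    mut connect: impl FnMut() -> Result<C, E>,
    mut sleep: impl FnMut(Duration),
    max_attempts: u32,
    base_delay: Duration,
    max_delay: Duration,
) -> Result<C, E> {
    assert!(max_attempts > 0, "need at least one attempt");
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            // Out of attempts: surface the final error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay = (delay * 2).min(max_delay); // double, up to the cap
                attempt += 1;
            }
        }
    }
}
```

In the realtime path, something like this would presumably wrap the connection setup, so that RealtimeConversationClosed is emitted only once retries are exhausted. Whether session state can be resumed after a reconnect is a separate question this sketch does not address.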

Finding 4: WsStream::Drop aborts pump without sending Close frame

File: codex-rs/codex-api/src/endpoint/responses_websocket.rs, lines 157-161

impl Drop for WsStream {
    fn drop(&mut self) {
        self.pump_task.abort();  // no Close frame sent
    }
}

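The remote peer sees the TCP connection drop without a WebSocket Close frame (an abnormal closure, status code 1006 in RFC 6455 terms) rather than a clean close handshake. This can cause the peer to log errors or treat the connection as abnormally terminated.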
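One way to get a Close frame out on drop is to hand the pump a final command instead of aborting it. This is a minimal sketch with standard-library threads, not the WsStream implementation: `Stream`, `Command`, and `Wire` are hypothetical names. It blocks in `Drop` for simplicity, and async code would more likely queue the Close command and let the pump exit on its own.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for frames written to the socket.
#[derive(Debug, PartialEq)]
pub enum Wire {
    Text(String),
    Close,
}

enum Command {
    Send(String),
    Close,
}

pub struct Stream {
    commands: Sender<Command>,
    pump: Option<JoinHandle<()>>,
}

impl Stream {
    /// Start a pump thread that turns commands into frames on `wire`.
    pub fn open(wire: Sender<Wire>) -> Self {
        let (commands, rx) = channel();
        let pump = thread::spawn(move || {
            for cmd in rx {
                match cmd {
                    Command::Send(text) => {
                        let _ = wire.send(Wire::Text(text));
                    }
                    Command::Close => {
                        let _ = wire.send(Wire::Close);
                        return;
                    }
                }
            }
        });
        Stream { commands, pump: Some(pump) }
    }

    pub fn send(&self, text: &str) {
        let _ = self.commands.send(Command::Send(text.to_string()));
    }
}

impl Drop for Stream {
    fn drop(&mut self) {
        // Ask the pump to emit a Close frame, then wait for it to finish.
        let _ = self.commands.send(Command::Close);
        if let Some(pump) = self.pump.take() {
            let _ = pump.join();
        }
    }
}
```

Because the Close command is queued behind any pending sends, earlier messages are written before the close.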

Finding 5: TOCTOU race in send_json

File: codex-rs/codex-api/src/endpoint/realtime_websocket/methods.rs, lines 330-346

async fn send_json(&self, message: RealtimeOutboundMessage) -> Result<(), ApiError> {
    if self.is_closed.load(Ordering::SeqCst) {   // check
        return Err(ApiError::Stream(...));
    }
    self.stream.send(Message::Text(...)).await     // send — connection may close between check and here
        .map_err(...)?;
}

The is_closed check and the actual send() are not atomic. The connection can close between these operations. The error is properly mapped (not a panic), but callers relying on the is_closed guard may not expect send() to still fail.
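A flag check followed by a send can never be made fully atomic against a remote close, so one option is to treat the flag as a fast path and the send result as authoritative. The sketch below is illustrative only: `Writer` and `SendError` are hypothetical, and a standard-library channel stands in for the WebSocket stream.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Sender;

#[derive(Debug, PartialEq)]
pub enum SendError {
    Closed,
}

pub struct Writer {
    is_closed: AtomicBool,
    stream: Sender<String>,
}

impl Writer {
    pub fn new(stream: Sender<String>) -> Self {
        Writer { is_closed: AtomicBool::new(false), stream }
    }

    pub fn is_closed(&self) -> bool {
        self.is_closed.load(Ordering::SeqCst)
    }

    pub fn send_json(&self, payload: &str) -> Result<(), SendError> {
        // Fast path only: a `false` here proves nothing about the send below.
        if self.is_closed() {
            return Err(SendError::Closed);
        }
        self.stream.send(payload.to_string()).map_err(|_| {
            // Record the failure so later calls take the fast path.
            self.is_closed.store(true, Ordering::SeqCst);
            SendError::Closed
        })
    }
}
```

Callers then handle a single error path regardless of when the close happened, and the flag never claims more than "a previous send already failed".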

Context: the responses WebSocket client does have reconnection

For comparison, the responses WebSocket client (codex-rs/core/src/client.rs:730-771) checks conn.is_closed() before each turn and reconnects if needed, falling back to HTTP after exhausting retries (try_switch_fallback_transport(), lines 1062-1088). The realtime path lacks equivalent recovery.

Suggested improvements

  1. Add backpressure signaling before disconnecting slow clients (e.g., skip non-critical messages instead of disconnecting).
  2. Use graceful shutdown with close frames instead of task.abort().
  3. Add a reconnection loop in the realtime conversation path, similar to the responses WebSocket client.
  4. Send a WebSocket Close frame in WsStream::Drop before aborting the pump task.
  5. Consider merging the is_closed check and send() into a single atomic operation.

I have a fix ready and can submit a PR if invited.

Labels: bug (Something isn't working); connectivity (Issues involving networking or endpoint connectivity problems (disconnections))
