Summary
WebSocket disconnections (from network issues, slow clients, or server-side close) are handled as fatal session termination rather than recoverable events. Multiple code paths convert transient transport issues into permanent session death with no reconnection or graceful degradation.
This is not a guaranteed process panic (I did not find one in static review), but the combination of these paths explains reported behavior where transient disconnects behave like fatal session termination.
Finding 1: Slow or congested clients are proactively disconnected
File: codex-rs/app-server/src/transport.rs
- Line 46: CHANNEL_CAPACITY = 128 (bounded outbound queue).
- Line 592: writer.try_send(message) for disconnectable connections.
- Line 596: TrySendError::Full(_) → immediate disconnect.
- Lines 445-450: inbound Ping handler uses try_send for Pong; queue-full → connection closed.
A slow client that falls 128 messages behind is disconnected without warning or backpressure signal.
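To illustrate the alternative, here is a minimal sketch of an enqueue step that degrades before it disconnects. It uses std::sync::mpsc::sync_channel as a stand-in for the bounded async channel in transport.rs, and the Outbound, Enqueued, enqueue, and bounded_queue names are hypothetical, not taken from the codebase. Which messages count as droppable is also an assumption that would need a real policy.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical message classification; the real transport has no such type.
#[derive(Debug, Clone, PartialEq)]
pub enum Outbound {
    /// Must be delivered (for example, a response to a request).
    Critical(String),
    /// Safe to skip under load (for example, a progress notification).
    BestEffort(String),
}

/// What happened to a message handed to `enqueue`.
#[derive(Debug, PartialEq)]
pub enum Enqueued {
    Sent,
    /// The queue was full and the message was droppable, so it was skipped.
    Dropped,
    /// The queue was full for a critical message, or the writer is gone.
    Disconnect,
}

/// Non-blocking enqueue that only asks for a disconnect as a last resort.
pub fn enqueue(writer: &SyncSender<Outbound>, message: Outbound) -> Enqueued {
    match writer.try_send(message) {
        Ok(()) => Enqueued::Sent,
        Err(TrySendError::Full(Outbound::BestEffort(_))) => Enqueued::Dropped,
        Err(TrySendError::Full(Outbound::Critical(_))) => Enqueued::Disconnect,
        Err(TrySendError::Disconnected(_)) => Enqueued::Disconnect,
    }
}

/// Builds a bounded queue; the capacity is a parameter so the sketch is easy to test.
pub fn bounded_queue(capacity: usize) -> (SyncSender<Outbound>, Receiver<Outbound>) {
    sync_channel(capacity)
}
```

With a capacity of 2, a third best-effort message is reported as Dropped while a third critical message still yields Disconnect, so the caller decides what to do with the connection instead of the queue deciding for it.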
Finding 2: WebSocket task teardown is hard-abort
File: codex-rs/app-server/src/transport.rs, lines 363-370
tokio::select! {
    _ = &mut outbound_task => {
        disconnect_token.cancel();
        inbound_task.abort(); // hard abort
    }
    _ = &mut inbound_task => {
        disconnect_token.cancel();
        outbound_task.abort(); // hard abort
    }
}
When either loop exits (including on transient errors), the other task is force-aborted. This risks dropping messages still queued for the outbound writer and skips the WebSocket closing handshake.
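A cooperative teardown could look roughly like the sketch below. It uses std threads and an unbounded std channel in place of the tokio tasks, and a Vec in place of the socket, so it only demonstrates the ordering: queued messages are flushed, then a close marker is written, then the task ends. WriterCmd, spawn_writer, and shutdown_gracefully are hypothetical names.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical commands for an outbound writer loop.
pub enum WriterCmd {
    Send(String),
    /// Ask the loop to finish; everything queued before this is still written.
    Shutdown,
}

/// Spawns a writer that records "frames" in a Vec standing in for the socket.
pub fn spawn_writer() -> (Sender<WriterCmd>, JoinHandle<Vec<String>>) {
    let (tx, rx) = channel::<WriterCmd>();
    let handle = thread::spawn(move || {
        let mut wire = Vec::new();
        // Commands are processed in FIFO order, so messages queued before
        // Shutdown are flushed before the close marker goes out.
        for cmd in rx {
            match cmd {
                WriterCmd::Send(text) => wire.push(text),
                WriterCmd::Shutdown => break,
            }
        }
        // Stand-in for sending a WebSocket Close frame as the final write.
        wire.push("CLOSE".to_string());
        wire
    });
    (tx, handle)
}

/// Graceful teardown: signal the writer and wait for it instead of aborting.
pub fn shutdown_gracefully(
    tx: Sender<WriterCmd>,
    handle: JoinHandle<Vec<String>>,
) -> Vec<String> {
    // Ignore the error: the writer may already have exited on its own.
    let _ = tx.send(WriterCmd::Shutdown);
    handle.join().expect("writer thread panicked")
}
```

In the real async code the wait would presumably need a timeout, with abort() kept as the fallback when the writer does not finish in time.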
Finding 3: Realtime WebSocket has zero reconnection logic
File: codex-rs/core/src/realtime_conversation.rs, lines 317-388
spawn_realtime_input_task breaks on any send failure (line 330) or error/close event (lines 351, 359). It sends a RealtimeConversationClosed event and exits permanently. There is no retry or reconnection loop.
File: codex-rs/codex-api/src/endpoint/realtime_websocket/methods.rs
- Line 57: spawns the pump task.
- Line 74: breaks pump loop on send error.
- Line 154: turns failed command delivery into WsError::ConnectionClosed.
Any network hiccup permanently kills the realtime conversation session.
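For reference, a bounded reconnection loop with exponential backoff can be quite small. The sketch below is synchronous and generic over the connect step, so it says nothing about how the realtime session state would be restored after reconnecting. RetryPolicy, connect_with_retry, and the delay values are hypothetical and not taken from the codebase.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical retry policy; the values are chosen by the caller.
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub initial_delay: Duration,
    pub max_delay: Duration,
}

/// Calls `connect` until it succeeds or `max_attempts` is exhausted,
/// doubling the delay (capped at `max_delay`) between attempts.
/// Returns the connection plus the number of attempts used, or the last error.
pub fn connect_with_retry<C, E, F>(policy: &RetryPolicy, mut connect: F) -> Result<(C, u32), E>
where
    F: FnMut() -> Result<C, E>,
{
    let mut delay = policy.initial_delay;
    let mut attempt: u32 = 1;
    loop {
        match connect() {
            Ok(conn) => return Ok((conn, attempt)),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= policy.max_attempts => return Err(err),
            Err(_) => {
                thread::sleep(delay);
                delay = (delay * 2).min(policy.max_delay);
                attempt += 1;
            }
        }
    }
}
```

Only after this loop gives up would the caller emit a terminal event such as RealtimeConversationClosed. An async version would swap thread::sleep for the runtime's sleep.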
Finding 4: WsStream::Drop aborts pump without sending Close frame
File: codex-rs/codex-api/src/endpoint/responses_websocket.rs, lines 157-161
impl Drop for WsStream {
    fn drop(&mut self) {
        self.pump_task.abort(); // no Close frame sent
    }
}
The remote peer sees the connection drop without a WebSocket Close frame (an abnormal closure, reported as status 1006 under RFC 6455) rather than a clean close. This can cause the peer to log errors or treat the session as failed.
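Because Drop cannot await, one option is a best-effort request to the pump rather than a direct socket write. The sketch below assumes the handle holds the sending side of a bounded command queue and that the pump reacts to a Close command by sending the Close frame and exiting. StreamHandle and PumpCmd are hypothetical stand-ins, not the actual WsStream layout.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Hypothetical commands consumed by a pump task.
#[derive(Debug, PartialEq)]
pub enum PumpCmd {
    Text(String),
    /// Ask the pump to send a Close frame and then exit.
    Close,
}

/// Stand-in for a stream handle that owns the sending side of the pump queue.
pub struct StreamHandle {
    commands: SyncSender<PumpCmd>,
}

impl StreamHandle {
    pub fn new(capacity: usize) -> (Self, Receiver<PumpCmd>) {
        let (commands, rx) = sync_channel(capacity);
        (Self { commands }, rx)
    }

    /// Returns false if the queue is full or the pump has gone away.
    pub fn send_text(&self, text: &str) -> bool {
        self.commands
            .try_send(PumpCmd::Text(text.to_string()))
            .is_ok()
    }
}

impl Drop for StreamHandle {
    fn drop(&mut self) {
        // Drop cannot block or await, so this is best effort: if the queue
        // is full or the pump is gone, the close request is skipped.
        let _ = self.commands.try_send(PumpCmd::Close);
    }
}
```

The pump would still need a bounded grace period, with abort() as the fallback, so that a stalled peer cannot keep the task alive indefinitely.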
Finding 5: TOCTOU race in send_json
File: codex-rs/codex-api/src/endpoint/realtime_websocket/methods.rs, lines 330-346
async fn send_json(&self, message: RealtimeOutboundMessage) -> Result<(), ApiError> {
    if self.is_closed.load(Ordering::SeqCst) { // check
        return Err(ApiError::Stream(...));
    }
    // send: the connection may close between the check above and this call
    self.stream.send(Message::Text(...)).await
        .map_err(...)?;
}
The is_closed check and the actual send() are not atomic. The connection can close between these operations. The error is properly mapped (not a panic), but callers relying on the is_closed guard may not expect send() to still fail.
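One way to serialize the check and the send is to keep the closed flag and the send path behind the same lock. The sketch below is synchronous and uses a Vec as the "socket"; Connection, Inner, and SendError are hypothetical names. It removes only the local race with a concurrent close() call. A remote close can still make a real send fail, so the error returned by send remains the authoritative signal.

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub enum SendError {
    Closed,
}

/// Stand-in for connection state; `sent` plays the role of the socket.
struct Inner {
    closed: bool,
    sent: Vec<String>,
}

pub struct Connection {
    inner: Mutex<Inner>,
}

impl Connection {
    pub fn new() -> Self {
        Self {
            inner: Mutex::new(Inner { closed: false, sent: Vec::new() }),
        }
    }

    /// The check and the send happen under one lock, so `close` cannot
    /// interleave between them.
    pub fn send(&self, text: &str) -> Result<(), SendError> {
        let mut inner = self.inner.lock().unwrap();
        if inner.closed {
            return Err(SendError::Closed);
        }
        inner.sent.push(text.to_string());
        Ok(())
    }

    /// Marks the connection closed; later sends fail with `SendError::Closed`.
    pub fn close(&self) {
        self.inner.lock().unwrap().closed = true;
    }

    pub fn sent_count(&self) -> usize {
        self.inner.lock().unwrap().sent.len()
    }
}
```

In async code the guard would be held across an .await, which generally calls for an async-aware mutex such as tokio::sync::Mutex rather than std::sync::Mutex.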
Context: the responses WebSocket client does have reconnection
For comparison, the responses WebSocket client (codex-rs/core/src/client.rs:730-771) checks conn.is_closed() before each turn and reconnects if needed, falling back to HTTP after exhausting retries (try_switch_fallback_transport(), lines 1062-1088). The realtime path lacks equivalent recovery.
Suggested improvements
- Add backpressure signaling before disconnecting slow clients (e.g., skip non-critical messages instead of disconnecting).
- Use graceful shutdown with close frames instead of task.abort().
- Add a reconnection loop in the realtime conversation path, similar to the responses WebSocket client.
- Send a WebSocket Close frame in WsStream::Drop before aborting the pump task.
- Consider merging the is_closed check and send() into a single atomic operation.
I have a fix ready and can submit a PR if invited.