Responses streams need lifecycle diagnostics for close, timeout, and partial-output failures #19745

@jkaunert

What version of Codex is running?

Current main as of 2026-04-27.

What issue are you seeing?

When a Responses stream closes, errors, or idles before response.completed, Codex reports the failure but does not expose enough lifecycle evidence to diagnose where the stream died.

Today, a stream that never produced meaningful output and one that reached response.output_item.added or response.output_item.done can surface as nearly the same operator-visible failure. That makes it hard to distinguish:

  • provider silence before first event
  • connection loss after response.created
  • an early stall after response.output_item.added
  • a later stall after durable output started
  • close-before-completion after partial assistant output
  • response-id correlation problems between created and completed events
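As a rough illustration of how the cases above could be told apart once the observed event kinds are recorded, here is a minimal Python sketch. The function name, the returned labels, and the "closed"/"idle_timeout" state strings are hypothetical and not part of Codex; treating response.output_item.done or response.output_text.delta as the start of "durable output" is an assumption.

```python
from typing import Optional, Sequence


def classify_failure(
    kinds: Sequence[str],
    terminal_state: str,
    created_id: Optional[str] = None,
    completed_id: Optional[str] = None,
) -> str:
    """Map a stream's observed event kinds to one of the cases listed above.

    All labels are hypothetical; they only name the buckets from the issue text.
    """
    seen = set(kinds)
    if "response.completed" in seen:
        # A completed stream can still carry a correlation problem.
        if created_id != completed_id:
            return "response_id_mismatch"
        return "completed"
    if not kinds:
        return "silence_before_first_event"
    # Assumption: item-done or text-delta means durable output has started.
    if "response.output_item.done" in seen or "response.output_text.delta" in seen:
        if terminal_state == "idle_timeout":
            return "late_stall_after_durable_output"
        return "close_after_partial_output"
    if "response.output_item.added" in seen:
        return "early_stall_after_output_item_added"
    if "response.created" in seen:
        return "loss_after_response_created"
    return "failure_before_response_created"


print(classify_failure(["response.created"], "closed"))
```

The point of the sketch is only that the classification needs nothing beyond the event kinds, the terminal state, and the two response ids, which is the evidence this issue asks Codex to record.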

Why this matters

These cases need different follow-up behavior. Some are plain stream failures, some are partial-output failures, some are retry/resume candidates, and some may indicate a transport-specific issue.

Without lifecycle diagnostics, downstream retry/fallback bugs are harder to debug, and issue reports have to rely on private logs or manual transcript forensics.

Expected behavior

For Responses SSE/WebSocket stream completion and failure paths, Codex should record structured lifecycle evidence such as:

  • request attempt sequence
  • transport path or transport reason
  • created response id
  • completed response id, if any
  • first event elapsed time
  • last event elapsed time
  • last event kind
  • first response.output_item.added elapsed time, if any
  • first response.output_item.done elapsed time, if any
  • first response.output_text.delta elapsed time, if any
  • observed stream event kinds
  • stream event count
  • terminal stream state: completed, closed before completion, idle timeout, or stream error

This would make close-before-completion and idle-timeout reports actionable without changing retry or fallback policy.
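To make the field list concrete, the following is a minimal sketch of what such a record could look like, written in Python for brevity. Every name here (StreamLifecycle, observe, finish, the snake_case state strings) is an assumption for illustration and does not describe Codex's actual types. The per-kind first-seen map covers the "first ... elapsed time" fields and the observed event kinds in one structure.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical spellings of the four terminal states named in the list.
TERMINAL_STATES = {"completed", "closed_before_completion", "idle_timeout", "stream_error"}


@dataclass
class StreamLifecycle:
    attempt: int                                  # request attempt sequence
    transport: str                                # e.g. "sse" or "websocket"
    created_response_id: Optional[str] = None
    completed_response_id: Optional[str] = None
    first_event_ms: Optional[int] = None
    last_event_ms: Optional[int] = None
    last_event_kind: Optional[str] = None
    first_seen_ms: dict = field(default_factory=dict)  # event kind -> first elapsed ms
    event_count: int = 0
    terminal_state: Optional[str] = None

    def observe(self, kind: str, elapsed_ms: int, response_id: Optional[str] = None) -> None:
        """Record one stream event and its elapsed time since the request began."""
        if self.first_event_ms is None:
            self.first_event_ms = elapsed_ms
        self.last_event_ms = elapsed_ms
        self.last_event_kind = kind
        self.event_count += 1
        self.first_seen_ms.setdefault(kind, elapsed_ms)  # keep only the first timing
        if kind == "response.created":
            self.created_response_id = response_id
        elif kind == "response.completed":
            self.completed_response_id = response_id

    def finish(self, terminal_state: str) -> None:
        """Record how the stream ended; called once on completion or failure."""
        if terminal_state not in TERMINAL_STATES:
            raise ValueError(f"unknown terminal state: {terminal_state}")
        self.terminal_state = terminal_state

    @property
    def observed_kinds(self) -> list:
        """Distinct event kinds in first-seen order."""
        return list(self.first_seen_ms)


# Example: a stream that stalls after the first output item is added.
rec = StreamLifecycle(attempt=1, transport="sse")
rec.observe("response.created", 120, "resp_1")
rec.observe("response.output_item.added", 340)
rec.finish("idle_timeout")
print(rec.last_event_kind, rec.first_seen_ms, rec.terminal_state)
```

A record like this is purely observational: it is filled in as events arrive and read when the stream ends, so it does not need to influence retry or fallback decisions.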

Non-goals

This issue is not asking Codex to:

  • shorten the default stream idle timeout
  • switch to HTTP fallback after a particular event pattern
  • change retry budgets
  • change model behavior

The ask is diagnostics only: expose enough stream lifecycle evidence that the correct retry/fallback policy can be reasoned about separately.

Minimal validation shape

A useful test suite would cover:

  1. completed stream: records created/completed response id and terminal state completed
  2. close before response.completed: preserves the ordinary error and records the last observed event kind
  3. idle timeout before any event: records no first event and terminal state idle_timeout
  4. idle timeout after response.output_item.added: records the first output-item-added timing and terminal state idle_timeout
  5. idle timeout after durable output: records first output-item-done and/or first text-delta timing separately from the early-output case
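The five cases above could be exercised against scripted event sequences rather than a live provider. The sketch below uses a hypothetical pure function, summarize, as a stand-in for whatever recorder Codex would actually implement; the event tuples, timings, ids, and state strings are made up for the test.

```python
def summarize(events, terminal_state):
    """Fold scripted (elapsed_ms, kind, response_id) tuples into a summary dict."""
    first_seen, ids = {}, {}
    for elapsed_ms, kind, response_id in events:
        first_seen.setdefault(kind, elapsed_ms)
        if response_id is not None:
            ids[kind] = response_id
    return {
        "terminal_state": terminal_state,
        "first_event_ms": events[0][0] if events else None,
        "last_event_kind": events[-1][1] if events else None,
        "first_seen_ms": first_seen,
        "created_id": ids.get("response.created"),
        "completed_id": ids.get("response.completed"),
    }


# Scripted events (all values are illustrative).
CREATED = (100, "response.created", "resp_1")
ADDED = (250, "response.output_item.added", None)
DELTA = (400, "response.output_text.delta", None)
DONE = (900, "response.output_item.done", None)
COMPLETED = (950, "response.completed", "resp_1")


def test_completed_stream():
    s = summarize([CREATED, ADDED, DELTA, DONE, COMPLETED], "completed")
    assert s["created_id"] == s["completed_id"] == "resp_1"
    assert s["terminal_state"] == "completed"


def test_close_before_completed():
    s = summarize([CREATED, ADDED], "closed_before_completion")
    assert s["last_event_kind"] == "response.output_item.added"
    assert s["completed_id"] is None


def test_idle_timeout_before_any_event():
    s = summarize([], "idle_timeout")
    assert s["first_event_ms"] is None
    assert s["terminal_state"] == "idle_timeout"


def test_idle_timeout_after_item_added():
    s = summarize([CREATED, ADDED], "idle_timeout")
    assert s["first_seen_ms"]["response.output_item.added"] == 250
    assert "response.output_text.delta" not in s["first_seen_ms"]


def test_idle_timeout_after_durable_output():
    s = summarize([CREATED, ADDED, DELTA, DONE], "idle_timeout")
    # Durable-output timings are recorded separately from the early-output one.
    assert s["first_seen_ms"]["response.output_text.delta"] == 400
    assert s["first_seen_ms"]["response.output_item.done"] == 900
    assert s["first_seen_ms"]["response.output_item.added"] == 250


for test in (test_completed_stream, test_close_before_completed,
             test_idle_timeout_before_any_event, test_idle_timeout_after_item_added,
             test_idle_timeout_after_durable_output):
    test()
```

Case 2 also asks that the ordinary error be preserved; that part depends on the real error path and is not modeled here.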


Labels

  • connectivity: Issues involving networking or endpoint connectivity problems (disconnections)
  • enhancement: New feature or request
