Responses streams need lifecycle diagnostics for close, timeout, and partial-output failures

### What version of Codex is running?

Current `main` as of 2026-04-27.

### What issue are you seeing?

When a Responses stream closes, errors, or idles before `response.completed`, Codex reports the failure but does not expose enough lifecycle evidence to diagnose where the stream died.

Today, a stream that never produced meaningful output and a stream that reached `response.output_item.added` or `response.output_item.done` can surface as nearly identical operator-visible failures. That makes it hard to distinguish:

- provider silence before first event
- connection loss after `response.created`
- an early stall after `response.output_item.added`
- a later stall after durable output started (`response.output_item.done` or `response.output_text.delta` already observed)
- close-before-completion after partial assistant output
- response-id correlation problems between created and completed events
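
To make the distinction concrete, here is a minimal sketch of how the ordered event kinds seen on a stream could be mapped to a coarse failure phase. It is illustrative only: `classify_failure_phase` and the phase labels are hypothetical names, not existing Codex code, and treating `response.output_item.done` or `response.output_text.delta` as "durable output" is an assumption based on the validation cases below.

```python
from typing import Sequence

# Assumption: these event kinds mark "durable output" for triage purposes.
DURABLE_KINDS = {"response.output_item.done", "response.output_text.delta"}


def classify_failure_phase(event_kinds: Sequence[str]) -> str:
    """Return a coarse, hypothetical label for where a stream stopped."""
    if "response.completed" in event_kinds:
        return "completed"
    if not event_kinds:
        return "silent_before_first_event"
    if any(kind in DURABLE_KINDS for kind in event_kinds):
        return "stalled_after_durable_output"
    if "response.output_item.added" in event_kinds:
        return "stalled_after_output_item_added"
    if "response.created" in event_kinds:
        return "lost_after_created"
    # Events arrived, but none of the lifecycle markers above were seen.
    return "stalled_before_created"
```

Response-id correlation is deliberately left out of this sketch, since it needs the ids themselves rather than only the event kinds.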

### Why this matters

These cases need different follow-up behavior. Some are plain stream failures, some are partial-output failures, some are retry/resume candidates, and some may indicate a transport-specific issue.

Without lifecycle diagnostics, downstream retry/fallback bugs are harder to debug, and issue reports have to rely on private logs or manual transcript forensics.

### Expected behavior

For Responses SSE/WebSocket stream completion and failure paths, Codex should record structured lifecycle evidence such as:

- request attempt sequence
- transport path or transport reason
- created response id
- completed response id, if any
- first event elapsed time
- last event elapsed time
- last event kind
- first `response.output_item.added` elapsed time, if any
- first `response.output_item.done` elapsed time, if any
- first `response.output_text.delta` elapsed time, if any
- observed stream event kinds
- stream event count
- terminal stream state: completed, closed before completion, idle timeout, or stream error

This would make close-before-completion and idle-timeout reports actionable without changing retry or fallback policy.
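
As a rough illustration of the shape such a record could take, the sketch below accumulates the fields listed above as events arrive. Everything here is hypothetical: `StreamLifecycle`, its field names, and the `closed_before_completion` / `stream_error` spellings are assumptions for discussion, not an existing Codex type. The clock is injectable so the timings can be tested deterministically.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Event kinds whose first-seen elapsed time is recorded separately.
TRACKED_FIRSTS = (
    "response.output_item.added",
    "response.output_item.done",
    "response.output_text.delta",
)
# "completed" and "idle_timeout" come from this issue; the other two are assumed names.
TERMINAL_STATES = {"completed", "closed_before_completion", "idle_timeout", "stream_error"}


@dataclass
class StreamLifecycle:
    attempt: int                      # request attempt sequence
    transport: str                    # transport path or reason, e.g. "sse"
    clock: Callable[[], float] = time.monotonic
    created_response_id: Optional[str] = None
    completed_response_id: Optional[str] = None
    first_event_elapsed: Optional[float] = None
    last_event_elapsed: Optional[float] = None
    last_event_kind: Optional[str] = None
    first_seen: Dict[str, float] = field(default_factory=dict)
    event_kinds: List[str] = field(default_factory=list)  # distinct kinds, first-seen order
    event_count: int = 0
    terminal_state: Optional[str] = None
    started_at: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.started_at = self.clock()

    def observe(self, kind: str, response_id: Optional[str] = None) -> None:
        """Record one stream event; elapsed time is measured from stream start."""
        elapsed = self.clock() - self.started_at
        if self.first_event_elapsed is None:
            self.first_event_elapsed = elapsed
        self.last_event_elapsed = elapsed
        self.last_event_kind = kind
        self.event_count += 1
        if kind not in self.event_kinds:
            self.event_kinds.append(kind)
        if kind in TRACKED_FIRSTS and kind not in self.first_seen:
            self.first_seen[kind] = elapsed
        if kind == "response.created":
            self.created_response_id = response_id
        elif kind == "response.completed":
            self.completed_response_id = response_id

    def finish(self, state: str) -> None:
        """Mark how the stream ended."""
        if state not in TERMINAL_STATES:
            raise ValueError(f"unknown terminal state: {state}")
        self.terminal_state = state
```

A record like this is purely observational: it is filled in on both the success and failure paths and never influences whether a retry or fallback happens.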

### Non-goals

This issue is not asking Codex to:

- shorten the default stream idle timeout
- switch to HTTP fallback after a particular event pattern
- change retry budgets
- change model behavior

The ask is diagnostics only: expose enough stream lifecycle evidence that the correct retry/fallback policy can be reasoned about separately.

### Minimal validation shape

A useful test suite would cover:

1. completed stream: records the created and completed response ids and terminal state `completed`
2. close before `response.completed`: preserves the ordinary error and records the last observed event kind
3. idle timeout before any event: records no first event and terminal state `idle_timeout`
4. idle timeout after `response.output_item.added`: records the first output-item-added timing and terminal state `idle_timeout`
5. idle timeout after durable output: records first output-item-done and/or first text-delta timing separately from the early-output case
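
The five cases above could be expressed roughly as follows. This is a self-contained sketch against a hypothetical `summarize` helper that folds a scripted list of `(elapsed, kind, response_id)` tuples into a summary dict; the helper, the key names, and the `closed_before_completion` spelling are assumptions. A real test would drive the actual stream code with a mock SSE/WebSocket server, and "preserves the ordinary error" is not modelled here.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[float, str, Optional[str]]  # (elapsed seconds, kind, response id)

FIRST_KEYS = {
    "response.output_item.added": "first_output_item_added",
    "response.output_item.done": "first_output_item_done",
    "response.output_text.delta": "first_output_text_delta",
}


def summarize(events: List[Event], terminal_state: str) -> Dict[str, object]:
    """Hypothetical helper: fold scripted events into lifecycle evidence."""
    summary: Dict[str, object] = {
        "created_response_id": None,
        "completed_response_id": None,
        "first_event_elapsed": None,
        "last_event_kind": None,
        "terminal_state": terminal_state,
    }
    summary.update({key: None for key in FIRST_KEYS.values()})
    for elapsed, kind, response_id in events:
        if summary["first_event_elapsed"] is None:
            summary["first_event_elapsed"] = elapsed
        summary["last_event_kind"] = kind
        if kind == "response.created":
            summary["created_response_id"] = response_id
        elif kind == "response.completed":
            summary["completed_response_id"] = response_id
        key = FIRST_KEYS.get(kind)
        if key is not None and summary[key] is None:
            summary[key] = elapsed
    return summary


CREATED: Event = (0.1, "response.created", "resp_1")
ADDED: Event = (0.2, "response.output_item.added", None)


def test_completed_stream() -> None:
    s = summarize([CREATED, (0.9, "response.completed", "resp_1")], "completed")
    assert s["created_response_id"] == s["completed_response_id"] == "resp_1"
    assert s["terminal_state"] == "completed"


def test_close_before_completed() -> None:
    s = summarize([CREATED, ADDED], "closed_before_completion")
    assert s["last_event_kind"] == "response.output_item.added"
    assert s["completed_response_id"] is None


def test_idle_timeout_before_any_event() -> None:
    s = summarize([], "idle_timeout")
    assert s["first_event_elapsed"] is None
    assert s["terminal_state"] == "idle_timeout"


def test_idle_timeout_after_item_added() -> None:
    s = summarize([CREATED, ADDED], "idle_timeout")
    assert s["first_output_item_added"] == 0.2
    assert s["first_output_text_delta"] is None


def test_idle_timeout_after_durable_output() -> None:
    events = [CREATED, ADDED, (0.4, "response.output_text.delta", None),
              (0.9, "response.output_item.done", None)]
    s = summarize(events, "idle_timeout")
    # Durable-output timings are recorded separately from the early-output timing.
    assert s["first_output_item_added"] == 0.2
    assert s["first_output_text_delta"] == 0.4
    assert s["first_output_item_done"] == 0.9
```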


Responses streams need lifecycle diagnostics for close, timeout, and partial-output failures #19745
