[Task] Design run-level LLM stream diagnostics for recurring terminated failures #783

## Goal

Design and implement a first-class LLM run diagnostics architecture so recurring `terminated` failures can be debugged, classified, and eventually recovered without one-off string patches.

What should be true when this is done:

- Session exports identify where or why an LLM request failed: before connection, in the early stream, mid-generation, during a tool side effect, or through a local abort, watchdog timeout, or provider error.
- The exported diagnostics distinguish user/PawWork aborts from upstream/provider/socket disconnects.
- The diagnostics include enough safe correlation data to compare PawWork behavior against Codex App and upstream OpenCode.
- Recovery policy can be derived from facts: safe auto-retry, user-confirmed retry, or no retry.
- Raw `terminated` / `UND_ERR_SOCKET` should no longer be treated as an opaque user-facing failure.
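
To make the "derived from facts" goal concrete, here is a minimal sketch of how a retry policy could be computed from recorded run facts. The `RunFacts` shape, the policy names, and the decision order are all assumptions for illustration, not the existing PawWork schema or an agreed policy.

```typescript
// Hypothetical fact shape; field names are illustrative only.
interface RunFacts {
  localAbort: boolean;            // the user or PawWork aborted the request
  visibleTextEmitted: boolean;    // the user already saw partial output
  toolSideEffectStarted: boolean; // a tool call may have changed state
}

type RetryPolicy = "auto_retry" | "user_confirmed_retry" | "no_retry";

// Assumed ordering: an intentional abort is never retried, anything the
// user has seen or that may have had side effects needs confirmation,
// and only a failure with no observable effect is retried silently.
function deriveRetryPolicy(facts: RunFacts): RetryPolicy {
  if (facts.localAbort) return "no_retry";
  if (facts.toolSideEffectStarted) return "user_confirmed_retry";
  if (facts.visibleTextEmitted) return "user_confirmed_retry";
  return "auto_retry";
}
```

Under these assumptions, a disconnect that happens before any text or tool output is the only case that qualifies for a silent retry.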

## Scope

In scope:

- Revisit the diagnostics from #760 / PR #771 after repeated real-world `terminated` samples.
- Design a run-level diagnostics recorder for LLM requests, not just additional ad hoc fields on `llm_trace.stream.error`.
- Define a failure classifier and retry-safety classifier based on stream facts, tool side effects, abort provenance, watchdog state, and transport errors.
- Extend session exports with a human-readable summary plus a bounded engineering timeline.
- Add deterministic local harness coverage for early disconnect, mid-stream disconnect, local abort, watchdog timeout, and post-tool-call disconnect cases.
- Keep exported fields safe: no prompts, secrets, raw auth headers, or unbounded response bodies.
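
One possible shape for the run-level recorder with a bounded timeline is sketched below. The class name, the event fields, and the default limits are hypothetical. The sketch drops the oldest events first, on the assumption that the terminal error at the end of a run is the most valuable entry.

```typescript
// Hypothetical timeline entry; not the existing llm_trace schema.
interface TimelineEvent {
  atMs: number;
  kind: string;
  detail?: string;
}

class RunDiagnosticsRecorder {
  private events: TimelineEvent[] = [];
  private dropped = 0;
  private readonly maxEvents: number;
  private readonly maxDetailChars: number;

  constructor(maxEvents = 64, maxDetailChars = 200) {
    this.maxEvents = maxEvents;
    this.maxDetailChars = maxDetailChars;
  }

  // Records one event, truncating the detail string and evicting the
  // oldest entry once the cap is exceeded.
  record(kind: string, detail?: string, atMs: number = Date.now()): void {
    const event: TimelineEvent = { atMs, kind };
    if (detail !== undefined) event.detail = detail.slice(0, this.maxDetailChars);
    this.events.push(event);
    if (this.events.length > this.maxEvents) {
      this.events.shift();
      this.dropped += 1;
    }
  }

  // Returns a copy so the export cannot mutate recorder state.
  snapshot(): { events: TimelineEvent[]; dropped: number } {
    return { events: [...this.events], dropped: this.dropped };
  }
}
```

Exporting the `dropped` count alongside the events lets a reader tell a short run apart from a truncated timeline.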

Out of scope for the first design pass:

- Changing provider credentials or OpenAI account routing.
- Blindly adding substring-specific UI translations for `terminated`.
- Auto-retrying all stream failures before retry-safety is explicit.
- Uploading session exports automatically.
- Replacing the whole LLM runtime in the same PR unless the design proves it is necessary.

## Relevant files or context

Recent evidence:

- `docs/debug-session-log/pawwork-session-hidden-mountain-2026-05-20-04-10-54-terminated.json`
- `docs/debug-session-log/pawwork-session-quiet-wizard-2026-05-20-08-25-08-terminated.json`

Both show the same failure signature after PR #771 diagnostics:

- `stream.error.constructor_name: TypeError`
- `stream.error.cause_name: SocketError`
- `stream.error.cause_code: UND_ERR_SOCKET`
- `stream.error.cause_message: other side closed`
- `stack_hint: Fetch.onAborted ... undici`
- `watchdog.fired: false`
- `abort.signal_aborted_at_error: false`
- `watchdog.provider_progressed: true`
- In the most recent sample: no finish event, no tool call, no tool result, and no text output
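
The signature above already carries enough facts for a first-pass classification. The sketch below assumes a flattened `StreamErrorFacts` shape and made-up class names. The check order (watchdog before abort signal) is also an assumption, based on the guess that a firing watchdog would itself abort the signal.

```typescript
// Hypothetical flattened view of the exported stream.error fields.
interface StreamErrorFacts {
  causeCode?: string;
  watchdogFired: boolean;
  signalAbortedAtError: boolean;
  providerProgressed: boolean;
}

type FailureClass =
  | "watchdog_timeout"
  | "local_abort"
  | "upstream_disconnect_after_progress"
  | "upstream_disconnect_before_progress"
  | "unknown";

function classifyFailure(facts: StreamErrorFacts): FailureClass {
  if (facts.watchdogFired) return "watchdog_timeout";
  if (facts.signalAbortedAtError) return "local_abort";
  if (facts.causeCode === "UND_ERR_SOCKET") {
    return facts.providerProgressed
      ? "upstream_disconnect_after_progress"
      : "upstream_disconnect_before_progress";
  }
  return "unknown";
}

// Facts taken from the two recorded samples.
const sample: StreamErrorFacts = {
  causeCode: "UND_ERR_SOCKET",
  watchdogFired: false,
  signalAbortedAtError: false,
  providerProgressed: true,
};
const sampleClass = classifyFailure(sample);
```

With these rules, both recorded samples classify as an upstream disconnect after provider progress rather than as a local abort or watchdog timeout.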

Related work:

- #754 captured the earlier raw `terminated` leak but was intentionally conservative while waiting for more samples.
- #760 / PR #771 added nested LLM stream diagnostics, watchdog state, error fingerprints, and abort provenance.
- #721 and #755 remain related session/harness reliability issues, but this task should not close them by itself.

Likely code areas:

- `packages/opencode/src/session/llm.ts`
- `packages/opencode/src/session/llm-trace/*`
- `packages/opencode/src/session/export.ts`
- `packages/opencode/src/session/retry.ts`
- `packages/opencode/src/session/processor.ts`
- `packages/opencode/test/session/llm*.test.ts`

Upstream context:

- Upstream OpenCode appears to rely mostly on logs / optional OpenTelemetry / AI SDK `onError` for this path.
- Upstream has an experimental native LLM runtime path, which may eventually help compare AI SDK transport behavior against a more controlled runtime.

## Verification

A design is acceptable when it specifies:

- The diagnostic events and their bounded schema.
- The failure taxonomy and retry-safety taxonomy.
- How request/runtime/transport facts are captured without leaking secrets.
- How session export summary and timeline are structured.
- How deterministic tests simulate stream disconnects.
- Which part is minimum viable diagnostics, which part is architectural foundation, and which part is deferred.
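
For the question of capturing transport facts without leaking secrets, one option is an allowlist applied at export time. The sketch below is a minimal illustration. The helper name, the set of allowed header names, and the length cap are assumptions to be replaced by whatever the design settles on.

```typescript
// Assumed allowlist; the real set of safe headers is a design decision.
const SAFE_HEADERS = new Set(["content-type", "retry-after", "x-request-id"]);

// Keeps allowlisted header values (length-capped) and redacts everything
// else, so unknown or sensitive headers are never exported by default.
function sanitizeHeaders(
  headers: Record<string, string>,
  maxValueChars = 128,
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    const key = name.toLowerCase();
    out[key] = SAFE_HEADERS.has(key) ? value.slice(0, maxValueChars) : "[redacted]";
  }
  return out;
}
```

An allowlist fails closed: a header nobody anticipated is redacted, whereas a denylist would export it.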

Implementation verification should eventually include targeted tests for:

- Early provider progress followed by `UND_ERR_SOCKET` before text/tool output.
- Mid-generation disconnect after visible text.
- Disconnect after tool call / side effect started.
- Local abort with provenance.
- Watchdog connect timeout and silent stream timeout.
- Export sanitizer behavior: secrets and raw auth headers are redacted while safe transport metadata is preserved.
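
The disconnect cases above can be simulated deterministically without a network. The sketch below uses an async generator as a stand-in for the provider stream and builds an error that mirrors the recorded signature (`TypeError: terminated` with a `SocketError` / `UND_ERR_SOCKET` cause). All helper names are hypothetical, and the real harness would need to feed this through whatever stream interface `llm.ts` actually consumes.

```typescript
// Builds an error shaped like the recorded samples. This imitates the
// observed fields only; it is not undici's own error class.
function makeSocketError(): TypeError {
  const cause = Object.assign(new Error("other side closed"), {
    name: "SocketError",
    code: "UND_ERR_SOCKET",
  });
  return Object.assign(new TypeError("terminated"), { cause });
}

// Yields the given chunks, then fails. An empty chunk list models an
// early disconnect; a non-empty one models a mid-generation disconnect.
async function* disconnectingStream(chunks: string[], error: Error): AsyncGenerator<string> {
  for (const chunk of chunks) yield chunk;
  throw error;
}

// Drains a stream and reports what arrived before the failure.
async function consume(
  stream: AsyncIterable<string>,
): Promise<{ received: string[]; message?: string; causeCode?: string }> {
  const received: string[] = [];
  try {
    for await (const chunk of stream) received.push(chunk);
    return { received };
  } catch (err) {
    const cause = (err as { cause?: { code?: string } }).cause;
    return { received, message: (err as Error).message, causeCode: cause?.code };
  }
}
```

A test would call `consume(disconnectingStream(["Hello"], makeSocketError()))` and assert on both the chunks received and the captured cause code.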

## Execution mode

Investigate and propose a plan first. The agent must post the plan as an issue comment and wait for explicit approval before writing code or opening a PR.
