[Task] Design run-level LLM stream diagnostics for recurring terminated failures #783

@Astro-Han

Goal

Design and implement a first-class LLM run diagnostics architecture so recurring terminated failures can be debugged, classified, and eventually recovered without one-off string patches.

What should be true when this is done:

  • Session exports answer where an LLM request failed: before connection, early stream, mid-generation, tool side-effect, local abort, watchdog, or provider error.
  • The exported diagnostics distinguish user/PawWork aborts from upstream/provider/socket disconnects.
  • The diagnostics include enough safe correlation data to compare PawWork behavior against Codex App and upstream OpenCode.
  • Recovery policy can be derived from facts: safe auto-retry, user-confirmed retry, or no retry.
  • Raw terminated / UND_ERR_SOCKET should no longer be treated as an opaque user-facing failure.
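
As a starting point for discussion, the failure locations and recovery outcomes listed above could be modeled as closed unions, with the policy derived from recorded facts rather than from error strings. This is a minimal sketch: every name here (`FailurePhase`, `RunFacts`, `deriveRecoveryPolicy`) is hypothetical, and the mapping from facts to policy is an assumption to be settled in the design, not a decided rule.

```typescript
// Hypothetical types; none of these exist in the codebase today.
type FailurePhase =
  | "before_connection"
  | "early_stream"
  | "mid_generation"
  | "tool_side_effect"
  | "local_abort"
  | "watchdog"
  | "provider_error";

type RecoveryPolicy = "safe_auto_retry" | "user_confirmed_retry" | "no_retry";

interface RunFacts {
  phase: FailurePhase;
  visibleOutputEmitted: boolean; // text was already shown to the user
  toolSideEffectStarted: boolean; // a tool call began executing
}

// Assumed mapping, for illustration only.
function deriveRecoveryPolicy(facts: RunFacts): RecoveryPolicy {
  // A deliberate user/PawWork abort should not be retried behind the user's back.
  if (facts.phase === "local_abort") return "no_retry";
  // Tool side effects may not be idempotent, so a human decides.
  if (facts.toolSideEffectStarted || facts.phase === "tool_side_effect") {
    return "user_confirmed_retry";
  }
  // Replaying after visible output would duplicate or contradict what was shown.
  if (facts.visibleOutputEmitted || facts.phase === "mid_generation") {
    return "user_confirmed_retry";
  }
  // Provider errors may be permanent (for example a rejected request); ask first.
  if (facts.phase === "provider_error") return "user_confirmed_retry";
  // Nothing visible and no side effects: retrying cannot change observable state.
  return "safe_auto_retry";
}
```

Keeping the policy a pure function of `RunFacts` means the same facts can be exported, replayed in tests, and compared across runtimes.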

Scope

In scope:

  • Revisit the diagnostics from issue #760 ([Task] Add LLM stream failure diagnostics v2 to session exports) and PR #771 (feat(session): add llm stream failure diagnostics) after repeated real-world terminated samples.
  • Design a run-level diagnostics recorder for LLM requests, not just additional ad hoc fields on llm_trace.stream.error.
  • Define a failure classifier and retry-safety classifier based on stream facts, tool side effects, abort provenance, watchdog state, and transport errors.
  • Extend session exports with a human-readable summary plus a bounded engineering timeline.
  • Add deterministic local harness coverage for early disconnect, mid-stream disconnect, local abort, watchdog timeout, and post-tool-call disconnect cases.
  • Keep exported fields safe: no prompts, secrets, raw auth headers, or unbounded response bodies.
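
To make "bounded engineering timeline" and "safe exported fields" concrete, here is a rough sketch of what a run-level recorder might look like. The class name, the event cap, the string cap, and the blocked-key pattern are all illustrative assumptions; the real limits and the allow/deny strategy belong in the design.

```typescript
// Hypothetical sketch; names and limits are assumptions, not existing PawWork APIs.
type SafeValue = string | number | boolean;

interface TimelineEvent {
  offsetMs: number; // milliseconds since the recorder was created
  kind: string; // e.g. "request_start", "first_chunk", "stream_error"
  detail: Record<string, SafeValue>;
}

const MAX_EVENTS = 64; // assumed cap on timeline length
const MAX_STRING_LENGTH = 200; // assumed cap on any exported string
const BLOCKED_KEY = /authorization|cookie|api[-_]?key|token|secret|prompt|body/i;

function sanitizeDetail(detail: Record<string, SafeValue>): Record<string, SafeValue> {
  const safe: Record<string, SafeValue> = {};
  for (const [key, value] of Object.entries(detail)) {
    if (BLOCKED_KEY.test(key)) continue; // drop keys that could carry secrets or content
    safe[key] = typeof value === "string" ? value.slice(0, MAX_STRING_LENGTH) : value;
  }
  return safe;
}

class RunDiagnosticsRecorder {
  private events: TimelineEvent[] = [];
  private dropped = 0;
  private readonly startedAt: number;
  private readonly now: () => number;

  // The clock is injectable so tests can be deterministic.
  constructor(now: () => number = Date.now) {
    this.now = now;
    this.startedAt = now();
  }

  record(kind: string, detail: Record<string, SafeValue> = {}): void {
    this.events.push({
      offsetMs: this.now() - this.startedAt,
      kind,
      detail: sanitizeDetail(detail),
    });
    if (this.events.length > MAX_EVENTS) {
      this.events.shift(); // keep the newest events, which sit closest to the failure
      this.dropped += 1;
    }
  }

  snapshot(): { events: TimelineEvent[]; dropped: number } {
    return { events: [...this.events], dropped: this.dropped };
  }
}
```

Exporting the `dropped` count alongside the events lets a reader tell a short run from a truncated one.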

Out of scope for the first design pass:

  • Changing provider credentials or OpenAI account routing.
  • Blindly adding substring-specific UI translations for terminated.
  • Auto-retrying all stream failures before retry-safety is explicit.
  • Uploading session exports automatically.
  • Replacing the whole LLM runtime in the same PR unless the design proves it is necessary.

Relevant files or context

Recent evidence:

  • docs/debug-session-log/pawwork-session-hidden-mountain-2026-05-20-04-10-54-terminated.json
  • docs/debug-session-log/pawwork-session-quiet-wizard-2026-05-20-08-25-08-terminated.json

Both show the same failure signature after PR #771 diagnostics:

  • stream.error.constructor_name: TypeError
  • stream.error.cause_name: SocketError
  • stream.error.cause_code: UND_ERR_SOCKET
  • stream.error.cause_message: other side closed
  • stack_hint: Fetch.onAborted ... undici
  • watchdog.fired: false
  • abort.signal_aborted_at_error: false
  • watchdog.provider_progressed: true
  • no finish, no tool call, no tool result, no text output in the most recent sample
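
For illustration, the signature above can be fed into a small fact-based classifier. The sketch below is only one possible shape: the `TraceSample` interface flattens the exported field names listed above into hypothetical camelCase properties, and the class names and rule ordering are assumptions.

```typescript
// Hypothetical flattened view of the exported trace fields.
interface TraceSample {
  causeCode: string | undefined; // stream.error.cause_code
  watchdogFired: boolean; // watchdog.fired
  signalAbortedAtError: boolean; // abort.signal_aborted_at_error
  providerProgressed: boolean; // watchdog.provider_progressed
  sawText: boolean; // any text output before the error
  sawToolCall: boolean; // any tool call before the error
}

type FailureClass =
  | "watchdog_timeout"
  | "local_abort"
  | "upstream_disconnect_before_progress"
  | "upstream_disconnect_early_stream"
  | "upstream_disconnect_mid_generation"
  | "upstream_disconnect_after_tool_call"
  | "unclassified";

function classifyFailure(sample: TraceSample): FailureClass {
  // Checked first on the assumption that a watchdog may abort the signal itself.
  if (sample.watchdogFired) return "watchdog_timeout";
  if (sample.signalAbortedAtError) return "local_abort";
  if (sample.causeCode !== "UND_ERR_SOCKET") return "unclassified";
  if (sample.sawToolCall) return "upstream_disconnect_after_tool_call";
  if (sample.sawText) return "upstream_disconnect_mid_generation";
  return sample.providerProgressed
    ? "upstream_disconnect_early_stream"
    : "upstream_disconnect_before_progress";
}

// The most recent sample, expressed with the facts listed above.
const observedSample: TraceSample = {
  causeCode: "UND_ERR_SOCKET",
  watchdogFired: false,
  signalAbortedAtError: false,
  providerProgressed: true,
  sawText: false,
  sawToolCall: false,
};

const observedClass = classifyFailure(observedSample);
```

Under these assumed rules, the recurring sample lands in `upstream_disconnect_early_stream`: the provider made progress, nothing aborted locally, and the socket closed before any text or tool output.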

Related work:

Likely code areas:

  • packages/opencode/src/session/llm.ts
  • packages/opencode/src/session/llm-trace/*
  • packages/opencode/src/session/export.ts
  • packages/opencode/src/session/retry.ts
  • packages/opencode/src/session/processor.ts
  • packages/opencode/test/session/llm*.test.ts

Upstream context:

  • Upstream OpenCode appears to rely mostly on logs / optional OpenTelemetry / AI SDK onError for this path.
  • Upstream has an experimental native LLM runtime path, which may eventually help compare AI SDK transport behavior against a more controlled runtime.

Verification

A design is acceptable when it specifies:

  • The diagnostic events and their bounded schema.
  • The failure taxonomy and retry-safety taxonomy.
  • How request/runtime/transport facts are captured without leaking secrets.
  • How session export summary and timeline are structured.
  • How deterministic tests simulate stream disconnects.
  • Which part is minimum viable diagnostics, which part is architectural foundation, and which part is deferred.

Implementation verification should eventually include targeted tests for:

  • Early provider progress followed by UND_ERR_SOCKET before text/tool output.
  • Mid-generation disconnect after visible text.
  • Disconnect after tool call / side effect started.
  • Local abort with provenance.
  • Watchdog connect timeout and silent stream timeout.
  • Export sanitizer behavior: secrets are redacted while safe transport metadata is preserved.
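
One way the disconnect cases above could be simulated deterministically, without a real network, is a scripted async stream that fails at a chosen chunk index with an error shaped like the observed one (`TypeError` with a `SocketError` / `UND_ERR_SOCKET` cause). This is a sketch under assumptions: `disconnectingStream`, `drain`, and `makeTerminatedError` are hypothetical helpers, and the real harness would need to plug into however llm.ts consumes the provider stream.

```typescript
// Hypothetical test helpers; the error shape mirrors the fields seen in the exports.
interface SocketCause {
  name: string;
  code: string;
  message: string;
}

function makeTerminatedError(): TypeError & { cause: SocketCause } {
  const cause: SocketCause = {
    name: "SocketError",
    code: "UND_ERR_SOCKET",
    message: "other side closed",
  };
  return Object.assign(new TypeError("terminated"), { cause });
}

// Yields chunks in order and throws just before the chunk at index `failAt`.
// failAt = 0 models a disconnect before any output; a larger value models
// a mid-generation disconnect; failAt >= chunks.length never fails.
async function* disconnectingStream(
  chunks: string[],
  failAt: number,
): AsyncGenerator<string> {
  let index = 0;
  for (const chunk of chunks) {
    if (index === failAt) throw makeTerminatedError();
    yield chunk;
    index += 1;
  }
}

interface DrainResult {
  received: string[];
  errorCode: string | undefined;
}

// Consumes a stream and reports what arrived before the failure, if any.
async function drain(stream: AsyncIterable<string>): Promise<DrainResult> {
  const received: string[] = [];
  try {
    for await (const chunk of stream) received.push(chunk);
    return { received, errorCode: undefined };
  } catch (err) {
    const cause = (err as { cause?: { code?: string } } | undefined)?.cause;
    return { received, errorCode: cause?.code };
  }
}
```

Because the failure point is an explicit index, each case (early, mid-stream, post-tool-call) becomes a one-line fixture rather than a timing-dependent network test.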

Execution mode

Investigate and propose a plan first — the agent must post the plan as an issue comment and wait for explicit approval before writing code or opening a PR.

Metadata

Assignees

No one assigned

    Labels

      • P1: High priority
      • harness: Model harness, prompts, tool descriptions, and session mechanics
      • task: Narrow execution, audit, spike, migration, tracking, or upstream follow-up work
      • tech-debt: Supplemental cleanup, maintainability, architecture, test, or quality debt context
      • upstream: Tracked upstream or vendor behavior
