[Feature] Refine run diagnostics for interrupted streaming tool calls #803

## Goal

Run-level diagnostics should distinguish a transport interruption during tool-call generation from a real tool execution failure.

In the observed GPT-5.5 terminated session, the lower-level trace correctly captured an OpenAI SDK transport interruption:

- error: `TypeError: terminated`
- cause: `SocketError: other side closed`
- cause code: `UND_ERR_SOCKET`
- boundary: `sdk_transport`
- watchdog fired: `false`
- abort signal at error: `false`
- provider progress seen: `true`

But run observability classified the run as `tool_failure.tool_execution_failed`, even though no tool execution started.

When this change is done, the diagnosis should say the simple truth: the model stream disconnected while the assistant was still producing a tool call; the tool had not run.
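As a rough illustration of that target wording, the sketch below maps a few evidence flags to a plain-language sentence. The `InterruptionEvidence` shape, the `describeInterruption` name, and the two fallback sentences are hypothetical and not part of the current export or codebase; only the first sentence comes from this issue.

```typescript
// Hypothetical, flattened view of the evidence; the real export schema may differ.
interface InterruptionEvidence {
  boundary: string;              // e.g. "sdk_transport"
  toolInputStarted: boolean;     // a tool_input_start event was seen
  toolInputEnded: boolean;       // a tool_input_end event was seen
  toolExecutionStarted: boolean; // the tool actually began running
}

// Returns a user-facing sentence for a run that ended with a stream error.
function describeInterruption(ev: InterruptionEvidence): string {
  const midToolInput = ev.toolInputStarted && !ev.toolInputEnded;
  if (ev.boundary === "sdk_transport" && midToolInput && !ev.toolExecutionStarted) {
    return "The model stream disconnected while the assistant was still producing a tool call; the tool had not run.";
  }
  if (ev.toolExecutionStarted) {
    // Assumed fallback wording, for illustration only.
    return "A tool started running before the run ended with an error.";
  }
  // Assumed fallback wording, for illustration only.
  return "The run ended with an error before any tool ran.";
}
```

For the observed session (`sdk_transport`, tool input started but not ended, no execution), this returns the sentence quoted in the goal.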

## Scope

In scope:
- Add a more precise classification for streaming interruptions that happen after `tool_input_start` but before `tool_input_end` / `tool_call`.
- Preserve evidence that separates local aborts, watchdog timeouts, provider-side transport failures, and actual tool execution failures.
- Make the exported session diagnosis readable enough to explain in user-facing language.
- Link this as a follow-up to #794.
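
One possible shape for the more precise classification is sketched below. It is a minimal sketch under assumptions, not the existing classifier: the `RunEvidence` shape, the `classifyFailedRun` name, and every summary key other than `tool_failure.tool_execution_failed` are hypothetical. It assumes the function is only called for runs that already ended in an error.

```typescript
// Hypothetical flattened evidence; comments note the export field each one mirrors.
interface RunEvidence {
  toolInputStart: number;        // stream_events.tool_input_start
  toolInputEnd: number;          // stream_events.tool_input_end
  toolCall: number;              // stream_events.tool_call
  toolExecutionStarted: boolean; // run_observability.tool_execution_started
  errorBoundary?: string;        // stream.error.boundary
  watchdogFired: boolean;        // "watchdog fired" in the low-level trace
  abortSignalAtError: boolean;   // "abort signal at error" in the low-level trace
}

// Only the tool_failure key exists today; the others are placeholder names.
type Classification =
  | "local_abort"
  | "watchdog_timeout"
  | "provider_transport.interrupted_during_tool_input"
  | "provider_transport.interrupted"
  | "tool_failure.tool_execution_failed"
  | "unclassified";

function classifyFailedRun(ev: RunEvidence): Classification {
  // Local causes are checked first so they are never blamed on the provider.
  if (ev.abortSignalAtError) return "local_abort";
  if (ev.watchdogFired) return "watchdog_timeout";

  if (ev.errorBoundary === "sdk_transport") {
    // A tool input is "open" when it started but neither ended nor
    // materialized as a tool call.
    const toolInputOpen =
      ev.toolInputStart > ev.toolInputEnd && ev.toolInputStart > ev.toolCall;
    return toolInputOpen
      ? "provider_transport.interrupted_during_tool_input"
      : "provider_transport.interrupted";
  }

  // Reserve the tool-failure key for runs where a tool actually started.
  if (ev.toolExecutionStarted) return "tool_failure.tool_execution_failed";
  return "unclassified";
}
```

The ordering is the design choice worth reviewing: abort and watchdog evidence win over transport evidence, and `tool_failure.tool_execution_failed` is only reachable when execution started.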

Out of scope:
- Implementing retry or recovery behavior after the interruption.
- Changing provider SDK behavior.
- Broad redesign of session export format beyond the minimum fields needed for clearer classification.

## Relevant files or context

Follow-up to #794.

Observed session export:
- `/Users/yuhan/Downloads/pawwork-session-shiny-knight-2026-05-21-02-02-17-terminated.json`

Important observed fields:
- `stream_events.tool_input_start: 1`
- `stream_events.tool_input_end: 0`
- `stream_events.tool_call: 0`
- `run_observability.tool_execution_started: false`
- `run_observability.classification: tool_failure`
- `run_observability.summary_key: tool_failure.tool_execution_failed`
- `stream.error.boundary: sdk_transport`
- `stream.error.cause_code: UND_ERR_SOCKET`

Related but different:
- #755 tracks OpenAI connect-timeout behavior. This issue is about a stream that already started and then disconnected.

## Verification

- Add or update a focused diagnostic test/fixture for this state:
  - provider progress seen
  - visible output may exist
  - tool input started
  - tool input did not end
  - tool call did not materialize
  - tool execution did not start
  - transport error is `UND_ERR_SOCKET`
- Confirm the resulting classification is not `tool_failure.tool_execution_failed`.
- Confirm session export still includes the raw low-level evidence for future debugging.
- Manually inspect a generated/exported diagnostic sample and verify the plain-language summary matches the state.
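
The check that raw evidence survives in the export could be automated along these lines. This is a sketch under assumptions: the dotted paths come from the observed fields listed above, but treating them as nested object keys is a guess about the export layout, and `getPath` and `missingEvidencePaths` are hypothetical helper names.

```typescript
// Paths copied from the observed fields; the nesting they imply is assumed.
const REQUIRED_EVIDENCE_PATHS: readonly string[] = [
  "stream_events.tool_input_start",
  "stream_events.tool_input_end",
  "stream_events.tool_call",
  "run_observability.tool_execution_started",
  "stream.error.boundary",
  "stream.error.cause_code",
];

// Walks a dotted path through nested objects; returns undefined when any
// segment is missing.
function getPath(root: unknown, path: string): unknown {
  let current: unknown = root;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null || !(key in current)) {
      return undefined;
    }
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Lists required paths that are absent from an exported session object.
// Values such as 0 and false count as present; only undefined counts as missing.
function missingEvidencePaths(
  exported: unknown,
  paths: readonly string[] = REQUIRED_EVIDENCE_PATHS,
): string[] {
  return paths.filter((p) => getPath(exported, p) === undefined);
}
```

A fixture test can then assert that `missingEvidencePaths(exportedSession)` is empty after the classification change.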

## Execution mode

Investigate first and get the design plan approved. Here, "plan" means the issue-level design and scope proposal, not a PR-level implementation checklist. Once the design is approved, agents may proceed with implementation plans inside the agreed scope. Post a new issue comment and wait for an explicit "approved" only when the implementation would change that design scope.
