
[Feature] Refine run diagnostics for interrupted streaming tool calls #803

@Astro-Han

Goal

Run-level diagnostics should distinguish a transport interruption during tool-call generation from a real tool execution failure.

In the observed GPT-5.5 terminated session, the lower-level trace correctly captured an OpenAI SDK transport interruption:

  • error: TypeError: terminated
  • cause: SocketError: other side closed
  • cause code: UND_ERR_SOCKET
  • boundary: sdk_transport
  • watchdog fired: false
  • abort signal at error: false
  • provider progress seen: true

But run observability classified the run as tool_failure.tool_execution_failed, even though no tool execution started.

When this change is done, the diagnosis should say the simple truth: the model stream disconnected while the assistant was still producing a tool call; the tool had not run.
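The cause chain listed above (TypeError: terminated, then SocketError with code UND_ERR_SOCKET) can be read by walking `cause` links until a string `code` turns up. The following is only a sketch: `findCauseCode` is a hypothetical helper name, and the error objects are simulated to mirror the observed shape rather than produced by the real SDK.

```typescript
// Hypothetical helper: walk an error's `cause` chain and return the first
// string `code` found, or undefined if none exists within maxDepth links.
function findCauseCode(err: unknown, maxDepth = 8): string | undefined {
  let current: unknown = err;
  for (let depth = 0; depth < maxDepth && current != null; depth++) {
    if (typeof current !== "object") return undefined;
    const code = (current as { code?: unknown }).code;
    if (typeof code === "string") return code;
    current = (current as { cause?: unknown }).cause;
  }
  return undefined;
}

// Simulated errors shaped like the observed trace (assumption, not SDK output).
const socketError = Object.assign(new Error("other side closed"), {
  name: "SocketError",
  code: "UND_ERR_SOCKET",
});
const terminated = Object.assign(new TypeError("terminated"), {
  cause: socketError,
});

console.log(findCauseCode(terminated)); // "UND_ERR_SOCKET"
```

The depth limit is there so a cyclic `cause` chain cannot loop forever.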

Scope

In scope:

  • Add a more precise classification for streaming interruptions that happen after tool_input_start but before tool_input_end / tool_call.
  • Preserve evidence that separates local aborts, watchdog timeouts, provider-side transport failures, and actual tool execution failures.
  • Make the exported session diagnosis readable enough that it can be explained to users in plain language.
  • Link this as a follow-up to feat(session): trace lifecycle close provenance #794.

Out of scope:

  • Implementing retry or recovery behavior after the interruption.
  • Changing provider SDK behavior.
  • Broad redesign of session export format beyond the minimum fields needed for clearer classification.
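One possible shape for the first two in-scope items, as a sketch for discussion and not a proposed final design. The `StreamEvidence` type, the `classifyInterruption` function, and every summary key other than tool_failure.tool_execution_failed are hypothetical names. It assumes the function is only called for runs that ended in an error.

```typescript
// Hypothetical evidence shape; the fields mirror the observed export, but the
// type itself is an assumption.
interface StreamEvidence {
  toolInputStart: number;
  toolInputEnd: number;
  toolCall: number;
  toolExecutionStarted: boolean;
  watchdogFired: boolean;
  abortSignalAtError: boolean;
  errorBoundary?: string; // e.g. "sdk_transport"
}

// Only "tool_failure.tool_execution_failed" exists today; the rest are
// placeholder keys for illustration.
type SummaryKey =
  | "local_abort"
  | "watchdog_timeout"
  | "tool_failure.tool_execution_failed"
  | "stream_interrupted.during_tool_input"
  | "stream_interrupted.transport"
  | "unknown";

function classifyInterruption(e: StreamEvidence): SummaryKey {
  // Local causes are checked first so they are never blamed on the provider.
  if (e.abortSignalAtError) return "local_abort";
  if (e.watchdogFired) return "watchdog_timeout";
  // A tool failure is only plausible once execution has actually started.
  if (e.toolExecutionStarted) return "tool_failure.tool_execution_failed";
  if (e.errorBoundary === "sdk_transport") {
    // Tool input was opened but neither closed nor materialized as a call.
    const toolInputOpen =
      e.toolInputStart > e.toolInputEnd && e.toolInputStart > e.toolCall;
    return toolInputOpen
      ? "stream_interrupted.during_tool_input"
      : "stream_interrupted.transport";
  }
  return "unknown";
}

// Values taken from the observed session.
const observed: StreamEvidence = {
  toolInputStart: 1,
  toolInputEnd: 0,
  toolCall: 0,
  toolExecutionStarted: false,
  watchdogFired: false,
  abortSignalAtError: false,
  errorBoundary: "sdk_transport",
};

console.log(classifyInterruption(observed)); // "stream_interrupted.during_tool_input"
```

The check order is the design choice that matters here: tool_execution_started gates the tool-failure branch, so an interrupted tool-call stream cannot fall through to tool_failure.tool_execution_failed.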

Relevant files or context

Follow-up to #794.

Observed session export:

  • /Users/yuhan/Downloads/pawwork-session-shiny-knight-2026-05-21-02-02-17-terminated.json

Important observed fields:

  • stream_events.tool_input_start: 1
  • stream_events.tool_input_end: 0
  • stream_events.tool_call: 0
  • run_observability.tool_execution_started: false
  • run_observability.classification: tool_failure
  • run_observability.summary_key: tool_failure.tool_execution_failed
  • stream.error.boundary: sdk_transport
  • stream.error.cause_code: UND_ERR_SOCKET

Related but different:

Verification

  • Add or update a focused diagnostic test/fixture for: provider progress seen, visible output may exist, tool input started, tool input did not end, tool call did not materialize, tool execution did not start, transport error is UND_ERR_SOCKET.
  • Confirm the resulting classification is not tool_failure.tool_execution_failed.
  • Confirm session export still includes the raw low-level evidence for future debugging.
  • Manually inspect a generated/exported diagnostic sample and verify that the plain-language summary matches the actual state (stream disconnected mid tool call, tool not run).
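The checks above could be expressed as a small invariant function over a fixture, roughly as below. This is a sketch under assumptions: the `Diagnosis` shape, `checkDiagnosis`, the flat evidence keys, and the `summaryKey` value in the sample are all hypothetical, and the real test would run against the actual export format.

```typescript
// Hypothetical diagnosis shape for the fixture; not the real export schema.
interface Diagnosis {
  summaryKey: string;
  summary: string; // plain-language explanation
  evidence: Record<string, unknown>; // raw low-level fields, kept for debugging
}

// Raw fields the export is expected to keep (names follow the observed export).
const REQUIRED_EVIDENCE = [
  "stream_events.tool_input_start",
  "stream_events.tool_input_end",
  "stream_events.tool_call",
  "run_observability.tool_execution_started",
  "stream.error.boundary",
  "stream.error.cause_code",
];

// Returns a list of violated expectations; an empty list means the fixture passes.
function checkDiagnosis(d: Diagnosis): string[] {
  const problems: string[] = [];
  if (d.summaryKey === "tool_failure.tool_execution_failed") {
    problems.push("classified as tool execution failure");
  }
  for (const key of REQUIRED_EVIDENCE) {
    if (!(key in d.evidence)) problems.push(`missing raw evidence: ${key}`);
  }
  if (d.summary.trim() === "") problems.push("empty plain-language summary");
  return problems;
}

// Fixture built from the observed values; summaryKey is a placeholder name.
const sample: Diagnosis = {
  summaryKey: "stream_interrupted.during_tool_input",
  summary:
    "The model stream disconnected while the assistant was still producing a tool call; the tool had not run.",
  evidence: {
    "stream_events.tool_input_start": 1,
    "stream_events.tool_input_end": 0,
    "stream_events.tool_call": 0,
    "run_observability.tool_execution_started": false,
    "stream.error.boundary": "sdk_transport",
    "stream.error.cause_code": "UND_ERR_SOCKET",
  },
};

console.log(checkDiagnosis(sample)); // []
```

Returning a list of problems instead of throwing on the first one makes a failing fixture easier to read, since all violated expectations show up at once.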

Execution mode

Investigate and get the design plan approved first. Here, "plan" means the issue-level design / scope proposal, not a PR-level implementation checklist. Once the approved design exists, agents may proceed with implementation plans inside the agreed scope; post a new issue comment and wait for explicit "approved" only when the implementation would change that design scope.

Labels

  • P2: Medium priority
  • enhancement: New feature or request
  • harness: Model harness, prompts, tool descriptions, and session mechanics
