Goal
Run-level diagnostics should distinguish a transport interruption during tool-call generation from a real tool execution failure.
In the observed GPT-5.5 terminated session, the lower-level trace correctly captured an OpenAI SDK transport interruption:
error: TypeError: terminated
cause: SocketError: other side closed
cause code: UND_ERR_SOCKET
boundary: sdk_transport
watchdog fired: false
abort signal at error: false
provider progress seen: true
But run observability classified the run as tool_failure.tool_execution_failed, even though no tool execution started.
When this change is done, the diagnosis should say the simple truth: the model stream disconnected while the assistant was still producing a tool call; the tool had not run.
Scope
In scope:
Add a more precise classification for streaming interruptions that happen after tool_input_start but before tool_input_end / tool_call.
Preserve evidence that separates local aborts, watchdog timeouts, provider-side transport failures, and actual tool execution failures.
Make the exported session diagnosis readable enough to explain in user-facing language.
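One way to picture the separation described in the in-scope items is a small classifier over the evidence fields quoted in this issue. This is only a sketch under stated assumptions: the `RunEvidence` shape, the `classifyRun` function, and every classification key other than `tool_execution_failed` are hypothetical names chosen for illustration, not identifiers from the codebase. The precedence between watchdog and abort signal is also an assumption.

```typescript
// Hypothetical evidence shape. Field names mirror the trace fields quoted in
// this issue, but the interface itself is an illustration, not existing code.
interface RunEvidence {
  toolInputStart: number;        // count of tool_input_start stream events
  toolInputEnd: number;          // count of tool_input_end stream events
  toolCall: number;              // count of tool_call stream events
  toolExecutionStarted: boolean;
  errorBoundary?: string;        // e.g. "sdk_transport"
  causeCode?: string;            // e.g. "UND_ERR_SOCKET"
  watchdogFired: boolean;
  abortSignalAtError: boolean;
  providerProgressSeen: boolean;
}

// Hypothetical classification keys, except "tool_execution_failed".
type Classification =
  | "watchdog_timeout"
  | "local_abort"
  | "transport_interrupted_during_tool_input"
  | "transport_interrupted"
  | "tool_execution_failed"
  | "unknown";

// Classifies a run that ended with an error.
function classifyRun(e: RunEvidence): Classification {
  // Assumption: a watchdog may also raise the abort signal, so check it first.
  if (e.watchdogFired) return "watchdog_timeout";
  if (e.abortSignalAtError) return "local_abort";

  if (e.errorBoundary === "sdk_transport") {
    // Tool input started but never finished, no tool call materialized,
    // and no tool execution began: the stream dropped mid tool-call generation.
    const midToolInput =
      e.toolInputStart > e.toolInputEnd &&
      e.toolCall === 0 &&
      !e.toolExecutionStarted;
    return midToolInput
      ? "transport_interrupted_during_tool_input"
      : "transport_interrupted";
  }

  // Only a run where a tool actually started can be a tool execution failure.
  if (e.toolExecutionStarted) return "tool_execution_failed";
  return "unknown";
}

// Evidence values taken from the observed terminated session in this issue.
const observed: RunEvidence = {
  toolInputStart: 1,
  toolInputEnd: 0,
  toolCall: 0,
  toolExecutionStarted: false,
  errorBoundary: "sdk_transport",
  causeCode: "UND_ERR_SOCKET",
  watchdogFired: false,
  abortSignalAtError: false,
  providerProgressSeen: true,
};

const observedClassification = classifyRun(observed);
```

With the observed values, the sketch never reaches the tool-execution branch, because the transport boundary and the unfinished tool input are checked before it.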
Verification
Add or update a focused diagnostic test/fixture for: provider progress seen, visible output may exist, tool input started, tool input did not end, tool call did not materialize, tool execution did not start, transport error is UND_ERR_SOCKET.
Confirm the resulting classification is not tool_failure.tool_execution_failed.
Confirm session export still includes the raw low-level evidence for future debugging.
Manually inspect a generated/exported diagnostic sample and verify the plain-language summary matches the state.
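The two "Confirm" checks above could be expressed as a small consistency check over a session export. The sketch below assumes the dotted field names listed in this issue map to a nested JSON object. That layout, the `SessionExport` type, and the `checkInterruptedExport` helper are hypothetical and only illustrate the checks.

```typescript
// Hypothetical export shape, assuming the dotted field names quoted in this
// issue correspond to nested JSON objects.
interface SessionExport {
  stream_events: { tool_input_start: number; tool_input_end: number; tool_call: number };
  run_observability: {
    tool_execution_started: boolean;
    classification: string;
    summary_key: string;
  };
  stream: { error?: { boundary?: string; cause_code?: string } };
}

// Returns a list of problems for an export of an interrupted run.
// An empty list means both checks passed.
function checkInterruptedExport(s: SessionExport): string[] {
  const problems: string[] = [];

  // The classification must not claim a tool execution failure
  // when no tool execution started.
  if (
    !s.run_observability.tool_execution_started &&
    s.run_observability.summary_key === "tool_failure.tool_execution_failed"
  ) {
    problems.push("summary_key is tool_failure.tool_execution_failed but no tool execution started");
  }

  // The raw low-level evidence must still be present for future debugging.
  if (s.stream.error?.boundary === undefined) {
    problems.push("missing stream.error.boundary");
  }
  if (s.stream.error?.cause_code === undefined) {
    problems.push("missing stream.error.cause_code");
  }

  return problems;
}

// Values taken from the observed session export described in this issue.
const observedExport: SessionExport = {
  stream_events: { tool_input_start: 1, tool_input_end: 0, tool_call: 0 },
  run_observability: {
    tool_execution_started: false,
    classification: "tool_failure",
    summary_key: "tool_failure.tool_execution_failed",
  },
  stream: { error: { boundary: "sdk_transport", cause_code: "UND_ERR_SOCKET" } },
};

const observedProblems = checkInterruptedExport(observedExport);
```

Run against the observed values, the check reports only the misclassification, since the raw transport evidence is already present in the export.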
Execution mode
Investigate and get the design plan approved first. Here, "plan" means the issue-level design / scope proposal, not a PR-level implementation checklist. Once the approved design exists, agents may proceed with implementation plans inside the agreed scope; post a new issue comment and wait for explicit "approved" only when the implementation would change that design scope.
Relevant files or context
Follow-up to #794.
Observed session export:
/Users/yuhan/Downloads/pawwork-session-shiny-knight-2026-05-21-02-02-17-terminated.json
Important observed fields:
stream_events.tool_input_start: 1
stream_events.tool_input_end: 0
stream_events.tool_call: 0
run_observability.tool_execution_started: false
run_observability.classification: tool_failure
run_observability.summary_key: tool_failure.tool_execution_failed
stream.error.boundary: sdk_transport
stream.error.cause_code: UND_ERR_SOCKET
Related but different: