[Feature] Add safe recovery for interrupted streaming runs #804

## Goal

When a model stream is interrupted after it has started, PawWork should not leave the session feeling dead or mysterious. It should explain what happened in plain language and offer the safest next step.

This follows #794 and complements #803:

- #794 added lifecycle close provenance.
- #803 tracks making the diagnosis more precise when OpenAI streaming disconnects during tool-call generation.
- This issue tracks the user experience and recovery policy after such interruptions.

Target user-facing behavior:

> The connection broke while PawWork was preparing the next step. No tool was executed. You can safely continue.

## Scope

In scope:
- Add a recovery policy for interrupted LLM runs based on already-recorded run facts:
  - whether provider progress was seen,
  - whether visible assistant output was shown,
  - whether a tool call fully materialized,
  - whether tool execution started,
  - whether unsafe side effects may have started,
  - whether side-effect facts are complete.
- Avoid treating every interruption as a terminal session failure.
- Show a calm, user-readable interruption state in the session UI.
- Provide the right next action:
  - auto-retry only when no visible output was shown and no tool side effects occurred,
  - offer a one-click Continue/Resume when visible text exists but no tool executed,
  - require explicit user confirmation when tool execution or unsafe side effects may have started,
  - explain when recovery is unsafe or unknown.
- Prevent duplicate user-visible text or duplicate tool execution after retry/resume.
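As a rough illustration of the calm interruption state described above, the sketch below turns the recorded run facts into a plain-language notice and keeps the raw error as secondary detail. The `InterruptedRunFacts` shape, its field names, the `describeInterruption` helper, and the wording of the cautious message are all hypothetical; only the "safe to continue" copy comes from the target behavior in this issue.

```typescript
// Hypothetical record of the run facts listed in the scope; field names are assumptions.
interface InterruptedRunFacts {
  providerProgressSeen: boolean;
  visibleOutputShown: boolean;
  toolCallMaterialized: boolean;
  toolExecutionStarted: boolean;
  unsafeSideEffectsPossible: boolean;
  sideEffectFactsComplete: boolean;
}

interface InterruptionNotice {
  primary: string; // plain-language message shown to the user
  detail?: string; // raw error text, kept out of the primary message
}

function describeInterruption(
  facts: InterruptedRunFacts,
  rawError?: string,
): InterruptionNotice {
  // Be conservative: incomplete facts are treated like a possible tool run.
  const toolMayHaveRun =
    facts.toolExecutionStarted ||
    facts.unsafeSideEffectsPossible ||
    !facts.sideEffectFactsComplete;

  const primary = toolMayHaveRun
    ? // Placeholder copy, not from the issue.
      "The connection broke while a step may have been running. Please review before continuing."
    : "The connection broke while PawWork was preparing the next step. No tool was executed. You can safely continue.";

  return rawError ? { primary, detail: rawError } : { primary };
}
```

Keeping the raw error in a separate `detail` field lets the UI show it behind a disclosure control without making `terminated` the headline.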

Out of scope:
- Making transport-level diagnosis more precise; that is #803.
- Local lifecycle causality diagnostics; that is #802.
- Changing OpenAI SDK behavior or provider networking.
- Broad redesign of the session timeline.

## Relevant files or context

Observed failure case:
- GPT-5.5 stream started normally.
- The assistant showed user-visible text.
- The model began producing an `enter-worktree` tool call.
- Tool input did not finish.
- The tool call did not materialize.
- Tool execution did not start.
- The stream failed with `TypeError: terminated`, caused by `SocketError: other side closed`, `UND_ERR_SOCKET`.
- Current run observability says `do_not_auto_retry` because visible output was seen. That is cautious, but the UI should offer a safe resume path instead of leaving the session stuck.

Related issues/PRs:
- #794 — lifecycle close provenance foundation.
- #802 — local lifecycle causality diagnostics.
- #803 — precise classification for interrupted streaming tool calls.
- #755 — related OpenAI timeout reliability issue, but focused on connect-timeout behavior.

## Proposed recovery matrix

- No visible output, no tool call, no tool execution:
  - Safe to auto-retry once with backoff.
- Visible output, partial tool input, no tool execution:
  - Do not silently replay the same visible text.
  - Offer Continue/Resume and preserve the existing transcript.
- Tool call completed, tool execution did not start:
  - Usually safe to re-run the assistant turn, but the UI should say no tool ran.
- Read-only tool execution started/completed:
  - May auto-resume if side-effect facts are complete and the tool is known read-only.
- Unsafe or unknown side-effect tool started:
  - Do not auto-retry.
  - Ask the user before continuing.
- Side-effect facts incomplete:
  - Prefer confirmation over automation.
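
One way the matrix above might be encoded is as a single pure function over the run facts, which also makes it straightforward to cover in tests. This is a minimal sketch, not the intended implementation: the `RunFacts` fields, the `RecoveryAction` names, and the order of the checks (for example, treating a materialized tool call as `rerun_turn` regardless of visible output) are assumptions.

```typescript
// Hypothetical action names for the rows of the recovery matrix.
type RecoveryAction =
  | "auto_retry"        // no visible output, no tool call, no execution
  | "offer_resume"      // visible output, no tool executed
  | "rerun_turn"        // tool call completed, execution never started
  | "auto_resume"       // known read-only tool ran, facts complete
  | "confirm_with_user"; // unsafe, unknown, or incomplete facts

// Hypothetical subset of the recorded run facts needed by the policy.
interface RunFacts {
  visibleOutputShown: boolean;
  toolCallMaterialized: boolean;
  toolExecutionStarted: boolean;
  toolKnownReadOnly: boolean;
  sideEffectFactsComplete: boolean;
}

function decideRecovery(f: RunFacts): RecoveryAction {
  // Incomplete side-effect facts: prefer confirmation over automation.
  if (!f.sideEffectFactsComplete) return "confirm_with_user";

  // A tool started: only a known read-only tool may resume automatically.
  if (f.toolExecutionStarted) {
    return f.toolKnownReadOnly ? "auto_resume" : "confirm_with_user";
  }

  // Tool call fully materialized but never ran: re-run the assistant turn.
  if (f.toolCallMaterialized) return "rerun_turn";

  // Visible text exists: do not silently replay it; offer Continue/Resume.
  if (f.visibleOutputShown) return "offer_resume";

  // Nothing visible and nothing executed: safe to auto-retry once.
  return "auto_retry";
}
```

Under these assumptions, the observed GPT-5.5 case (visible output, partial tool input, no tool execution) maps to `offer_resume`.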

## Verification

- Add tests for the recovery-policy matrix above.
- Add a fixture matching the observed GPT-5.5 `UND_ERR_SOCKET` case: visible output seen, partial tool input, no tool execution.
- Confirm that case presents a Continue/Resume path rather than only a terminal error.
- Confirm auto-retry is limited to safe cases and capped, with backoff.
- Confirm unsafe/unknown side-effect cases never auto-repeat tools.
- Confirm the UI copy is plain-language and does not expose raw `terminated` as the primary message.
- Manually verify the session page still behaves correctly after an interrupted run and after using Continue/Resume.
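
To make the "capped, with backoff" check concrete, the following sketch shows one possible shape for a bounded retry helper. The names `retryWithBackoff`, `backoffDelayMs`, and `RetryOptions`, the exponential schedule, and the injectable `sleep` parameter are assumptions for illustration; the issue only requires that auto-retry is limited to safe cases and capped.

```typescript
interface RetryOptions {
  maxRetries: number;  // assumed cap; the matrix suggests a single retry
  baseDelayMs: number; // assumed base delay for the backoff schedule
}

// Exponential backoff: attempt 0 waits base, attempt 1 waits 2x base, and so on.
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

async function retryWithBackoff<T>(
  run: () => Promise<T>,
  isSafeToRetry: () => boolean, // e.g. the policy says nothing visible and no tool ran
  opts: RetryOptions,
  // Injectable so tests can avoid real timers.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let attempt = 0;
  for (;;) {
    try {
      return await run();
    } catch (err) {
      // Stop when the cap is reached or the run is no longer safe to repeat.
      if (attempt >= opts.maxRetries || !isSafeToRetry()) throw err;
      await sleep(backoffDelayMs(attempt, opts.baseDelayMs));
      attempt += 1;
    }
  }
}
```

Passing a fake `sleep` and a `run` that fails a fixed number of times lets a test assert both the retry cap and that unsafe cases rethrow immediately.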

## Execution mode

Investigate and propose a plan first — the agent must post the plan as an issue comment and wait for an explicit "approved" comment before writing code or opening a PR.
