LLM stream timeout before first provider progress is not auto-retried #857

## What happened?
A session can fail after a tool result because the follow-up LLM stream stays in the pre-provider-progress phase until the local connect watchdog fires:

`LLM stream connection timed out after 120000ms without provider progress`

In the captured session, the first assistant step successfully received provider progress, read `STATUS.md`, and completed a read-only tool call. The second assistant step started immediately afterward, emitted only the AI SDK `start` event, then produced no `reasoning-start`, `text-start`, `tool-input-start`, `tool-call`, or other provider progress events before the 120s watchdog aborted it.
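The distinction between the `start` event and real provider progress can be sketched as a small pure function. This is an illustration only, not the logic in `llm.ts`: the `TimedEvent` and `WatchdogState` shapes, the `evaluateConnectWatchdog` name, and the `after_provider_progress` phase label are hypothetical, and the set of progress events contains only the four types named above. It assumes the stream is observed for the full timeout window.

```typescript
// Event types this issue names as provider progress. The real list in
// PawWork may be longer; this set is an assumption for illustration.
const PROGRESS_EVENTS = new Set(["reasoning-start", "text-start", "tool-input-start", "tool-call"])

// Hypothetical event shape: atMs is milliseconds since the step started.
interface TimedEvent {
  type: string
  atMs: number
}

// Hypothetical result shape, loosely mirroring the watchdog.* diagnostics.
interface WatchdogState {
  providerProgressed: boolean
  fired: boolean
  phaseAtEnd: "before_first_provider_progress" | "after_provider_progress"
}

function evaluateConnectWatchdog(events: TimedEvent[], timeoutMs: number): WatchdogState {
  // Only a progress event that arrives before the deadline disarms the watchdog.
  // The AI SDK `start` event is not in PROGRESS_EVENTS, so it never counts.
  const providerProgressed = events.some((e) => PROGRESS_EVENTS.has(e.type) && e.atMs < timeoutMs)
  return {
    providerProgressed,
    fired: !providerProgressed,
    phaseAtEnd: providerProgressed ? "after_provider_progress" : "before_first_provider_progress",
  }
}

// The failing step: only `start` was emitted before the 120s deadline.
const failingStep = evaluateConnectWatchdog([{ type: "start", atMs: 5 }], 120_000)
// A healthy step: the provider begins emitting text after two seconds.
const healthyStep = evaluateConnectWatchdog(
  [{ type: "start", atMs: 5 }, { type: "text-start", atMs: 2_000 }],
  120_000,
)
```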

The user-visible problem is that PawWork stops the task even though no visible output or tool side effect happened in the failing attempt. This case is safe to retry automatically once, but today it surfaces as a terminal error.

Area: Model harness, prompts, tools, or session mechanics
Impact: Breaks an important workflow

## Steps to reproduce
1. Open PawWork on macOS.
2. Use `openai/gpt-5.5` in a project session.
3. Ask the assistant to update a local todo after inspecting a file.
4. Let the first assistant step read the file successfully.
5. If the provider or transport never emits first progress, the next assistant step fails after 120 seconds with `LLM stream connection timed out after 120000ms without provider progress`.

Captured session evidence:
- Session export: `docs/debug-session-log/pawwork-session-shiny-harbor-2026-05-23-06-46-15-LLM stream connection timed out after 120000ms without provider progress.json`
- Runtime: `0.0.0-prod-202605230200`, `prod`, `darwin`, timezone `Asia/Shanghai`
- Model: `openai/gpt-5.5`
- Root session: `ses_1ac6c805affeV8HWYe6Jac1YdT`
- Failed message: `msg_e539395a5001aYrXQ96KNgQI6l`

Key diagnostics from the failed trace:
- `stream_events.start = 1`
- `stream_events.reasoning_start = 0`
- `stream_events.text_start = 0`
- `stream_events.tool_input_start = 0`
- `stream_events.tool_call = 0`
- `stored_parts.* = 0`
- `tokens.* = 0`
- `watchdog.provider_progressed = false`
- `watchdog.phase_at_end = "before_first_provider_progress"`
- `watchdog.fired = true`
- `watchdog.fired_phase = "connect"`
- `stream.error.boundary = "watchdog"`
- `run_observability.retry_safety.recommendation = "candidate_safe_auto_retry"`
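Taken together, these diagnostics suggest a simple retry-safety predicate. The sketch below is a hedged illustration, not the recorder's actual logic: the `FailedAttemptDiagnostics` shape and the `isSafeAutoRetryCandidate` name are hypothetical, and `storedPartsTotal` and `tokensTotal` stand in for summing the `stored_parts.*` and `tokens.*` counters.

```typescript
// Hypothetical shape mirroring the diagnostic keys listed above;
// the real trace types in PawWork may differ.
interface FailedAttemptDiagnostics {
  streamEvents: {
    start: number
    reasoningStart: number
    textStart: number
    toolInputStart: number
    toolCall: number
  }
  storedPartsTotal: number // assumed sum of stored_parts.*
  tokensTotal: number // assumed sum of tokens.*
  watchdog: {
    providerProgressed: boolean
    phaseAtEnd: string
    fired: boolean
  }
}

function isSafeAutoRetryCandidate(d: FailedAttemptDiagnostics): boolean {
  const e = d.streamEvents
  // No reasoning, text, or tool activity was ever streamed.
  const noProviderEvents =
    e.reasoningStart === 0 && e.textStart === 0 && e.toolInputStart === 0 && e.toolCall === 0
  // Nothing was persisted and no tokens were counted.
  const nothingPersisted = d.storedPartsTotal === 0 && d.tokensTotal === 0
  // The watchdog ended the attempt before any provider progress.
  const timedOutBeforeProgress =
    d.watchdog.fired &&
    !d.watchdog.providerProgressed &&
    d.watchdog.phaseAtEnd === "before_first_provider_progress"
  return noProviderEvents && nothingPersisted && timedOutBeforeProgress
}

// Values copied from the failed trace in this issue.
const capturedAttempt: FailedAttemptDiagnostics = {
  streamEvents: { start: 1, reasoningStart: 0, textStart: 0, toolInputStart: 0, toolCall: 0 },
  storedPartsTotal: 0,
  tokensTotal: 0,
  watchdog: { providerProgressed: false, phaseAtEnd: "before_first_provider_progress", fired: true },
}

const capturedIsRetryable = isSafeAutoRetryCandidate(capturedAttempt)
```

Any single non-zero counter makes the predicate return `false`, which keeps the retry limited to attempts that provably had no visible effect.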

## What did you expect to happen?
For a pre-first-provider-progress timeout with no visible output and no tool execution, PawWork should recover without making the user manually restart the task.

Minimum expected behavior:
- Automatically retry the failed assistant step once when diagnostics prove there was no visible output, tool call, tool execution, or unsafe side effect.
- If the retry also fails, show a clear message that the model connection timed out before first response and that no tools were executed.
- Keep the failed attempt in diagnostics so RCA can still distinguish first-attempt timeout from retry success/failure.
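A minimal sketch of the three behaviors above, assuming a hypothetical `runStep` callback that reports whether a failed attempt was safe to retry. The `AttemptResult` and `RetryOutcome` types, the `runWithSingleAutoRetry` name, and the message wording are illustrative and are not PawWork's API.

```typescript
// Hypothetical result of one assistant step attempt.
type AttemptResult<T> =
  | { ok: true; value: T }
  | { ok: false; safeToRetry: boolean; error: Error }

// Every attempt is kept so RCA can tell a first-attempt timeout
// apart from retry success or retry failure.
interface RetryOutcome<T> {
  result: AttemptResult<T>
  attempts: AttemptResult<T>[]
}

// Illustrative wording for the message shown after the retry also fails.
const TIMEOUT_MESSAGE =
  "The model connection timed out before the first response. No tools were executed."

async function runWithSingleAutoRetry<T>(
  runStep: () => Promise<AttemptResult<T>>,
): Promise<RetryOutcome<T>> {
  const attempts: AttemptResult<T>[] = []

  const first = await runStep()
  attempts.push(first)
  // Success, or a failure that is not provably side-effect free: do not retry.
  if (first.ok || !first.safeToRetry) return { result: first, attempts }

  // Exactly one automatic retry.
  const second = await runStep()
  attempts.push(second)
  if (second.ok) return { result: second, attempts }

  // Both attempts failed: surface a clear, terminal message.
  return {
    result: { ok: false, safeToRetry: false, error: new Error(TIMEOUT_MESSAGE) },
    attempts,
  }
}
```

The retry budget is fixed at one attempt on purpose, so a persistently unreachable provider costs the user at most two timeout windows before the error is shown.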

Expected diagnostics correctness:
- The run incident facts should not contradict the LLM trace. In this capture, `llm_trace.stream.watchdog.fired` is `true`, but `run_incidents[0].facts.watchdog_fired` is `false`.
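A check like the following could catch this contradiction in a session export. It is only a sketch: the `SessionExport` type models just the two paths quoted above, and `findWatchdogContradictions` is a hypothetical helper, not an existing PawWork function.

```typescript
// Minimal, assumed view of a session export: only the two fields
// quoted in this issue are modeled.
interface SessionExport {
  llm_trace: { stream: { watchdog: { fired: boolean } } }
  run_incidents: { facts: { watchdog_fired: boolean } }[]
}

// Returns the indexes of run incidents whose watchdog_fired fact
// disagrees with the LLM trace.
function findWatchdogContradictions(exp: SessionExport): number[] {
  const traceFired = exp.llm_trace.stream.watchdog.fired
  const mismatches: number[] = []
  for (let i = 0; i < exp.run_incidents.length; i++) {
    if (exp.run_incidents[i].facts.watchdog_fired !== traceFired) mismatches.push(i)
  }
  return mismatches
}

// The state described in this capture: the trace says fired, the incident says not.
const capturedExport: SessionExport = {
  llm_trace: { stream: { watchdog: { fired: true } } },
  run_incidents: [{ facts: { watchdog_fired: false } }],
}

const contradictions = findWatchdogContradictions(capturedExport)
```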

## PawWork version
2026.5.23 / runtime `0.0.0-prod-202605230200`

## OS version
macOS Darwin 25.5.0

## Can you reproduce it again?
Only once so far in this session. App logs show similar timeout events in other sessions within the same time window.

## Diagnostics
Relevant local code paths:
- `packages/opencode/src/session/llm.ts`: connect watchdog and provider progress detection.
- `packages/opencode/src/provider/transform.ts`: reasoning-capable models use a 120s connect timeout override.
- `packages/opencode/src/session/run-observability/recorder.ts`: retry safety and transport classification.
- `packages/opencode/src/session/run-incident/derive.ts`: incident facts currently miss the watchdog-fired evidence.

Proposed smallest fix:
- Auto-retry once for `before_first_provider_progress` failures when the failed attempt has no visible output, no tool input/call, no tool execution, and no unsafe side effect.
- Add or propagate a `watchdog_fired` incident evidence event so exported incident facts match the LLM trace.

Out of scope for the first fix:
- Proving whether OpenAI, Cloudflare, local network, or SDK internals caused the missing first provider progress. That requires request-level provider correlation or lower-level transport logs.
- Broad run-state architecture changes.

