[Feature] Recover faster from stalled reasoning-model connections before safe retry #918

## What task are you trying to do?

I want PawWork to recover faster when a connection to a GPT or other reasoning model stalls before the provider produces any content.

Recent production session exports showed repeated failures where OpenAI `gpt-5.5` never reached first provider progress. Each time, PawWork waited out the full reasoning-model connect watchdog ceiling (`120000ms`) before deciding the connection was interrupted. Once #914 lands, these early no-output/no-tool failures can be retried safely, but the first attempt can still leave the user waiting roughly two minutes before recovery begins.

## Which area would this change affect?

Model harness, prompts, tools, or session mechanics

## What do you do today?

Today, when a reasoning model connection stalls before first provider progress, PawWork waits up to 120 seconds before it can retry or show recovery. This is especially painful for high-frequency GPT usage because one provider/network wobble can freeze the visible workflow for two minutes even though the attempt produced no output and ran no tools.

The 120-second ceiling was introduced by #758 for a real reason: some reasoning-model runs can take more than 30 seconds before first observable provider progress. That means simply reverting to the default 30-second connect watchdog would risk false timeouts on legitimate slow starts.

## What would a good result look like?

PawWork should fail faster on the first stalled connection while still giving legitimate slow reasoning-model starts enough room to complete.

A likely direction to evaluate after #914:

- Use a shorter first-attempt connect watchdog for reasoning models, such as 45s or 60s.
- If the first attempt fails before provider progress and #914 proves no output or tool activity happened, retry automatically once.
- Let the automatic retry keep the longer 120s ceiling, so a legitimate slow first-progress run is not killed repeatedly.
- Keep conservative behavior for any attempt that reaches final text, tool input, tool call materialization, tool execution, provider-executed capability, external boundary, user cancel, lifecycle close, quota, context overflow, or another non-retryable error.
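
The direction above can be sketched roughly as follows. This is a minimal illustration, not PawWork's actual code: every type, field, and function name here is hypothetical, the 60s first-attempt value is only one of the candidates still to be agreed, and the `retryable` flag stands in for whatever safety boundary #914 ends up defining.

```typescript
// Hypothetical shapes for illustration only; not taken from the PawWork codebase.
type AttemptOutcome =
  | { kind: "success" }
  | {
      kind: "failed";
      reachedProviderProgress: boolean; // any first provider progress observed
      hadOutputOrToolActivity: boolean; // final text, tool input, tool call, tool execution, ...
      retryable: boolean; // false for quota, context overflow, user cancel, lifecycle close, ...
    };

interface WatchdogPolicy {
  firstAttemptMs: number; // candidate under discussion: 45_000 or 60_000
  retryAttemptMs: number; // the longer ceiling introduced by #758
  maxAutoRetries: number; // the proposal retries automatically once
}

const proposedPolicy: WatchdogPolicy = {
  firstAttemptMs: 60_000,
  retryAttemptMs: 120_000,
  maxAutoRetries: 1,
};

// attemptIndex is 0 for the first attempt and 1 for the automatic retry.
function connectWatchdogMs(policy: WatchdogPolicy, attemptIndex: number): number {
  return attemptIndex === 0 ? policy.firstAttemptMs : policy.retryAttemptMs;
}

// Retry only when the attempt failed before provider progress, produced no
// output or tool activity, is otherwise retryable, and the retry budget remains.
function shouldAutoRetry(
  policy: WatchdogPolicy,
  attemptIndex: number,
  outcome: AttemptOutcome,
): boolean {
  if (outcome.kind !== "failed") return false;
  if (attemptIndex >= policy.maxAutoRetries) return false;
  return (
    outcome.retryable &&
    !outcome.reachedProviderProgress &&
    !outcome.hadOutputOrToolActivity
  );
}
```

Keeping the timeout selection separate from the retry decision means the timeout values can change without touching the retry-safety boundary.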

The exact timeout values and user-visible behavior should be discussed before implementation because this affects the trade-off between faster recovery and false-positive timeouts.
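
To make the trade-off concrete for discussion, the sketch below compares candidate first-attempt timeouts. It assumes exactly one automatic retry at the existing 120s ceiling and ignores any backoff delay between attempts; the function and field names are illustrative only.

```typescript
// Existing reasoning-model ceiling from #758, kept for the retry attempt.
const RETRY_CEILING_MS = 120_000;

interface StallBudget {
  recoveryStartsAfterMs: number; // how long the user waits before the retry begins
  worstCaseTotalMs: number; // both attempts stall until their watchdog fires
}

function stallBudget(firstAttemptMs: number, retryMs: number = RETRY_CEILING_MS): StallBudget {
  return {
    recoveryStartsAfterMs: firstAttemptMs,
    worstCaseTotalMs: firstAttemptMs + retryMs,
  };
}

// 120_000 is today's ceiling; 45_000 and 60_000 are the candidates named above.
for (const firstAttemptMs of [45_000, 60_000, 120_000]) {
  console.log(firstAttemptMs, stallBudget(firstAttemptMs));
}
```

A shorter first attempt shortens both numbers, but it also raises the chance that a legitimate slow start is cut off and pays for a second connection attempt.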

## What would count as done?

- The issue has an agreed timeout strategy documented in a comment before implementation.
- The implementation does not broaden retry safety beyond #914's early no-output/no-tool recovery boundary.
- Reasoning-model first attempts no longer always wait 120s before safe retry can begin.
- Legitimate slow first-progress reasoning runs still have a protected path, likely by keeping a longer timeout on the retry attempt.
- Targeted tests cover the selected strategy, including first attempt timeout, retry attempt timeout, and non-retryable/conservative paths.
- Verification uses the captured session-export shape from the May 26 failures and existing #758/#914 context.

## What should stay out of scope?

- Do not change the broader provider retry taxonomy.
- Do not retry after final assistant text or tool activity has started.
- Do not change quota, context-overflow, user-cancel, or lifecycle-close behavior.
- Do not redesign the full incident recovery system.
- Do not change UI copy unless the agreed strategy requires it.

## Which audience does this matter to most?

Both

## Extra context

Related work:

- #758 widened the reasoning-model first-progress watchdog to 120s because `gpt-5.5` could legitimately exceed the previous 30s ceiling.
- #914 allows safe auto retry before first provider progress when no output or tool activity happened.
- Two May 26 production exports showed five failures clustered within about nine minutes, all before first provider progress, with no output and no tool activity:
  - `pawwork-session-proud-comet-2026-05-26-02-33-44.json`
  - `pawwork-session-sunny-meadow-2026-05-26-02-31-46.json`

Observed failure shapes:

- `watchdog_timeout` / `connect`, after `120000ms` without provider progress.
- `provider_transport_disconnect` before first provider progress, with `UND_ERR_SOCKET` / `other side closed`.
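
For test fixtures, the two shapes above could be modeled along these lines. Only the string values (`watchdog_timeout`, `connect`, `provider_transport_disconnect`, `UND_ERR_SOCKET`, `other side closed`) come from the exports; the field names and overall structure are assumptions and may not match the real session-export schema.

```typescript
// Hypothetical fixture types; field names are guesses, not the export schema.
type EarlyFailure =
  | { type: "watchdog_timeout"; phase: string; elapsedMs: number }
  | {
      type: "provider_transport_disconnect";
      code: string; // e.g. "UND_ERR_SOCKET"
      message: string; // e.g. "other side closed"
      beforeFirstProgress: boolean;
    };

// True when the failure happened before first provider progress, which is
// the only window where the #914 safe-retry path could apply.
function isBeforeFirstProgress(failure: EarlyFailure): boolean {
  switch (failure.type) {
    case "watchdog_timeout":
      return failure.phase === "connect";
    case "provider_transport_disconnect":
      return failure.beforeFirstProgress;
  }
}

const observedShapes: EarlyFailure[] = [
  { type: "watchdog_timeout", phase: "connect", elapsedMs: 120_000 },
  {
    type: "provider_transport_disconnect",
    code: "UND_ERR_SOCKET",
    message: "other side closed",
    beforeFirstProgress: true,
  },
];
```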

This should be treated as a follow-up design/implementation slice after #914 rather than being folded into #914.

