[Feature] Recover faster from stalled reasoning-model connections before safe retry #918

@Astro-Han

What task are you trying to do?

I want PawWork to recover faster when GPT/reasoning model connections stall before the provider produces any content.

Recent production session exports showed repeated failures where OpenAI gpt-5.5 never reached first provider progress. PawWork waited the reasoning-model connect watchdog ceiling (120000ms) each time before deciding the connection was interrupted. Once #914 lands, these early no-output/no-tool failures can be retried safely, but the first attempt can still leave the user waiting for roughly two minutes before recovery begins.

Which area would this change affect?

Model harness, prompts, tools, or session mechanics

What do you do today?

Today, when a reasoning model connection stalls before first provider progress, PawWork waits up to 120 seconds before it can retry or show recovery. This is especially painful for high-frequency GPT usage because one provider/network wobble can freeze the visible workflow for two minutes even though the attempt produced no output and ran no tools.

The 120-second ceiling was introduced by #758 for a real reason: some reasoning-model runs can take more than 30 seconds before first observable provider progress. That means simply reverting to the default 30-second connect watchdog would risk false timeouts on legitimate slow starts.

What would a good result look like?

PawWork should fail faster on the first stalled connection while still giving legitimate slow reasoning-model starts enough room to complete.

A likely direction to evaluate after #914:

  • Use a shorter first-attempt connect watchdog for reasoning models, such as 45s or 60s.
  • If the first attempt fails before provider progress and #914 (fix: allow safe retry before provider progress) proves that no output or tool activity happened, retry automatically once.
  • Let the automatic retry keep the longer 120s ceiling, so a legitimate slow first-progress run is not killed repeatedly.
  • Keep conservative behavior for any attempt that reaches final text, tool input, tool call materialization, tool execution, provider-executed capability, external boundary, user cancel, lifecycle close, quota, context overflow, or another non-retryable error.
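
The direction above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PawWork code: every name here (`connectWatchdogMs`, `shouldAutoRetry`, `AttemptState`, and the constants) is hypothetical, and the 60s first-attempt value is only one of the candidates mentioned above, still to be agreed.

```typescript
// Sketch only: all names are hypothetical and not taken from the PawWork codebase.
const DEFAULT_CONNECT_MS = 30_000;       // default connect watchdog mentioned in this issue
const FIRST_ATTEMPT_CONNECT_MS = 60_000; // candidate value (45s or 60s), not yet agreed
const RETRY_CONNECT_MS = 120_000;        // reasoning-model ceiling introduced by #758
const MAX_AUTO_RETRIES = 1;              // "retry automatically once"

interface AttemptState {
  attempt: number;                  // 1 = first attempt, 2 = the automatic retry
  reachedProviderProgress: boolean; // any first provider progress was observed
  producedOutput: boolean;          // final text or other output was produced
  ranTools: boolean;                // tool input, tool call, or tool execution happened
  nonRetryableError: boolean;       // quota, context overflow, user cancel, lifecycle close, ...
}

// Pick the connect watchdog for a given attempt: short first, long on retry.
function connectWatchdogMs(isReasoningModel: boolean, attempt: number): number {
  if (!isReasoningModel) return DEFAULT_CONNECT_MS;
  return attempt === 1 ? FIRST_ATTEMPT_CONNECT_MS : RETRY_CONNECT_MS;
}

// Allow an automatic retry only inside the early no-output/no-tool boundary.
function shouldAutoRetry(s: AttemptState): boolean {
  if (s.attempt > MAX_AUTO_RETRIES) return false; // the single retry is already used
  if (s.nonRetryableError) return false;          // stay conservative
  return !s.reachedProviderProgress && !s.producedOutput && !s.ranTools;
}
```

With these assumed values, a stalled first attempt would surface after 60s instead of 120s, while two consecutive stalls would take up to 180s in total before the final failure. That longer worst case is part of the trade-off to discuss.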

The exact timeout values and user-visible behavior should be discussed before implementation because this affects the trade-off between faster recovery and false-positive timeouts.

What would count as done?

  • The issue has an agreed timeout strategy documented in a comment before implementation.
  • The implementation does not broaden retry safety beyond the early no-output/no-tool recovery boundary established by #914 (fix: allow safe retry before provider progress).
  • Reasoning-model first attempts no longer always wait 120s before safe retry can begin.
  • Legitimate slow first-progress reasoning runs still have a protected path, likely by keeping a longer timeout on the retry attempt.
  • Targeted tests cover the selected strategy, including first attempt timeout, retry attempt timeout, and non-retryable/conservative paths.
  • Verification uses the captured session-export shape from the May 26 failures and the existing context from #758 (fix: widen LLM connect timeout for reasoning models) and #914 (fix: allow safe retry before provider progress).

What should stay out of scope?

  • Do not change the broader provider retry taxonomy.
  • Do not retry after final assistant text or tool activity has started.
  • Do not change quota, context-overflow, user-cancel, or lifecycle-close behavior.
  • Do not redesign the full incident recovery system.
  • Do not change UI copy unless the agreed strategy requires it.

Which audience does this matter to most?

Both

Extra context

Related work:

  • #758 (fix: widen LLM connect timeout for reasoning models) widened the reasoning-model first-progress watchdog to 120s because gpt-5.5 could legitimately exceed the previous 30s ceiling.
  • #914 (fix: allow safe retry before provider progress) allows safe auto retry before first provider progress when no output or tool activity happened.
  • Two May 26 production exports showed five failures clustered within about nine minutes, all before first provider progress, with no output and no tool activity:
    • pawwork-session-proud-comet-2026-05-26-02-33-44.json
    • pawwork-session-sunny-meadow-2026-05-26-02-31-46.json

Observed failure shapes:

  • watchdog_timeout / connect, after 120000ms without provider progress.
  • provider_transport_disconnect before first provider progress, with UND_ERR_SOCKET / other side closed.
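
For discussion, the two observed shapes could be recognized with a small predicate like the one below. This is a sketch under assumptions: the `FailureRecord` fields are hypothetical and do not reflect the real session-export schema. Only the `watchdog_timeout`, `connect`, and `provider_transport_disconnect` strings come from the observations above.

```typescript
// Hypothetical record shape; field names are assumptions, not the real export schema.
interface FailureRecord {
  kind: string;                  // e.g. "watchdog_timeout" or "provider_transport_disconnect"
  phase?: string;                // e.g. "connect" for watchdog timeouts
  providerProgressSeen: boolean; // true once any provider progress was observed
  outputProduced: boolean;
  toolActivity: boolean;
}

// True only for the early-stall shapes listed above, with no output and no tool activity.
function isEarlyStall(f: FailureRecord): boolean {
  const earlyKind =
    (f.kind === "watchdog_timeout" && f.phase === "connect") ||
    f.kind === "provider_transport_disconnect";
  return earlyKind && !f.providerProgressSeen && !f.outputProduced && !f.toolActivity;
}
```

Any failure that does not match stays on the existing conservative path.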

This should be treated as a follow-up design/implementation slice after #914 rather than being folded into #914.

Metadata

Assignees

No one assigned

    Labels

    P2 (Medium priority)
    enhancement (New feature or request)
    harness (Model harness, prompts, tool descriptions, and session mechanics)
