Skip to content

LLM stream timeout before first provider progress is not auto-retried #857

@Astro-Han

Description

@Astro-Han

What happened?

A session can fail after a tool result because the follow-up LLM stream stays in the pre-provider-progress phase until the local connect watchdog fires:

LLM stream connection timed out after 120000ms without provider progress

In the captured session, the first assistant step successfully received provider progress, read STATUS.md, and completed a read-only tool call. The second assistant step started immediately afterward, emitted only the AI SDK start event, then produced no reasoning-start, text-start, tool-input-start, tool-call, or other provider progress events before the 120s watchdog aborted it.

The user-visible problem is that PawWork stops the task even though no visible output or tool side effect happened in the failing attempt. This case is safe to retry automatically once, but today it surfaces as a terminal error.

Area: Model harness, prompts, tools, or session mechanics
Impact: Breaks an important workflow

Steps to reproduce

  1. Open PawWork on macOS.
  2. Use openai/gpt-5.5 in a project session.
  3. Ask the assistant to update a local todo after inspecting a file.
  4. Let the first assistant step read the file successfully.
  5. Observe the next assistant step fail after 120 seconds with LLM stream connection timed out after 120000ms without provider progress if the provider/transport never emits first progress.

Captured session evidence:

  • Session export: docs/debug-session-log/pawwork-session-shiny-harbor-2026-05-23-06-46-15-LLM stream connection timed out after 120000ms without provider progress.json
  • Runtime: 0.0.0-prod-202605230200, prod, darwin, timezone Asia/Shanghai
  • Model: openai/gpt-5.5
  • Root session: ses_1ac6c805affeV8HWYe6Jac1YdT
  • Failed message: msg_e539395a5001aYrXQ96KNgQI6l

Key diagnostics from the failed trace:

  • stream_events.start = 1
  • stream_events.reasoning_start = 0
  • stream_events.text_start = 0
  • stream_events.tool_input_start = 0
  • stream_events.tool_call = 0
  • stored_parts.* = 0
  • tokens.* = 0
  • watchdog.provider_progressed = false
  • watchdog.phase_at_end = "before_first_provider_progress"
  • watchdog.fired = true
  • watchdog.fired_phase = "connect"
  • stream.error.boundary = "watchdog"
  • run_observability.retry_safety.recommendation = "candidate_safe_auto_retry"

What did you expect to happen?

For a pre-first-provider-progress timeout with no visible output and no tool execution, PawWork should recover without making the user manually restart the task.

Minimum expected behavior:

  • Automatically retry the failed assistant step once when diagnostics prove there was no visible output, tool call, tool execution, or unsafe side effect.
  • If the retry also fails, show a clear message that the model connection timed out before first response and that no tools were executed.
  • Keep the failed attempt in diagnostics so RCA can still distinguish first-attempt timeout from retry success/failure.

Diagnostics correctness expected behavior:

  • The run incident facts should not contradict the LLM trace. In this capture, llm_trace.stream.watchdog.fired is true, but run_incidents[0].facts.watchdog_fired is false.

PawWork version

2026.5.23 / runtime 0.0.0-prod-202605230200

OS version

macOS Darwin 25.5.0

Can you reproduce it again?

Only once so far in this specific session, with similar same-window timeout events observed nearby in app logs for other sessions.

Diagnostics

Relevant local code paths:

  • packages/opencode/src/session/llm.ts: connect watchdog and provider progress detection.
  • packages/opencode/src/provider/transform.ts: reasoning-capable models use a 120s connect timeout override.
  • packages/opencode/src/session/run-observability/recorder.ts: retry safety and transport classification.
  • packages/opencode/src/session/run-incident/derive.ts: incident facts currently miss the watchdog-fired evidence.

Proposed smallest fix:

  • Auto-retry once for before_first_provider_progress failures when the failed attempt has no visible output, no tool input/call, no tool execution, and no unsafe side effect.
  • Add or propagate a watchdog_fired incident evidence event so exported incident facts match the LLM trace.

Out of scope for the first fix:

  • Proving whether OpenAI, Cloudflare, local network, or SDK internals caused the missing first provider progress. That requires request-level provider correlation or lower-level transport logs.
  • Broad run-state architecture changes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium prioritybugSomething isn't workingharnessModel harness, prompts, tool descriptions, and session mechanicsupstreamTracked upstream or vendor behavior

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions