[Bug] 30s LLM connect timeout aborts OpenAI reasoning streams (post-#729 residual) #755

@Astro-Han

Description

What happened?

PawWork's 30-second "first provider progress" watchdog (CONNECT_STREAM_TIMEOUT_MS in packages/opencode/src/session/llm.ts:31) aborts in-flight reasoning-model streams when the model spends more than 30 seconds on internal reasoning before emitting any event whitelisted by isProviderProgressEvent (L546-561). The whitelist counts only text-* / reasoning-* / tool-input-* / tool-call / tool-result / tool-error as progress, so connection establishment, the synthetic start envelope, and start-step do not reset the timer. With OpenAI gpt-5.5 (reasoning-capable, default reasoningEffort: "medium" from provider/transform.ts:1185), 14 active tools, and a long session, first-chunk latency reproducibly exceeds 30s and the watchdog kills an otherwise-healthy stream.
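
For readers without the source open, here is a minimal sketch of the whitelist behaviour described above. It is not the code in llm.ts: the StreamEvent shape, countsAsProgress, and watchdogWouldFire are hypothetical names, and only the event-type strings come from this report.

```typescript
// Hypothetical event shape; the real stream part types in llm.ts may differ.
type StreamEvent = { type: string }

// Event types that, per this report, count as provider progress.
const PROGRESS_PREFIXES = ["text-", "reasoning-", "tool-input-"]
const PROGRESS_TYPES = new Set(["tool-call", "tool-result", "tool-error"])

// Stand-in for isProviderProgressEvent (assumed logic, not copied from source).
function countsAsProgress(event: StreamEvent): boolean {
  return (
    PROGRESS_TYPES.has(event.type) ||
    PROGRESS_PREFIXES.some((prefix) => event.type.startsWith(prefix))
  )
}

// Models the watchdog outcome for a recorded timeline: it fires when no
// whitelisted event arrives within timeoutMs of the stream opening.
function watchdogWouldFire(
  events: Array<StreamEvent & { atMs: number }>,
  timeoutMs: number,
): boolean {
  const first = events.find((event) => countsAsProgress(event))
  return first === undefined || first.atMs > timeoutMs
}
```

Under this model, a timeline containing only start and start-step fires the watchdog regardless of how healthy the connection is, which matches the failing trace in the Diagnostics section.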

The error reaches the user as UnknownError: LLM stream connection timed out after 30000ms without provider progress. SessionRetry.policy.retryable() (packages/opencode/src/session/retry.ts:55-105) does not classify this local error as retryable, so there is no automatic retry. This is distinct from #728 / PR #729: that PR fixed the timer starting before the HTTP request was actually sent, and the build in this report already includes that fix. The residual issue is the 30s ceiling itself, which #729's body explicitly deferred.
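
One way the local timeout could be recognised for retry purposes is sketched below. This is an assumption, not the current retry.ts logic: isLocalConnectTimeout is a hypothetical helper, and matching on the message text is used only because the message is the one identifier quoted in this report. A typed error class would likely be more robust.

```typescript
// Pattern built from the error text quoted in this report; the timeout value
// is left variable in case connectTimeoutMs becomes configurable.
const CONNECT_TIMEOUT_PATTERN =
  /^LLM stream connection timed out after \d+ms without provider progress$/

// Hypothetical helper: true only for the locally raised watchdog error.
function isLocalConnectTimeout(error: unknown): boolean {
  return error instanceof Error && CONNECT_TIMEOUT_PATTERN.test(error.message)
}
```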

Which area seems affected?

Model harness, prompts, tools, or session mechanics

How much does this affect you?

Breaks an important workflow

Steps to reproduce

  1. Open a long-running build-agent session with OpenAI gpt-5.5 (or a comparable reasoning-capable model at reasoningEffort: "medium" or higher).
  2. Let the model run several tool-call rounds so the session accumulates meaningful context (this report: 269 messages / 1063 parts).
  3. Issue a follow-up turn whose first model action requires non-trivial reasoning before any text or tool-input chunk.
  4. Occasionally observe the assistant message fail with the 30000ms timeout error before any provider chunk is received.

What did you expect to happen?

The assistant message completes, or, if the stream must be aborted, the retry policy attempts it again automatically and only surfaces a hard error after repeated failures, rather than failing on the first occurrence with no provider chunk ever received.
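
The expected behaviour can be sketched as a small decision function. Everything here is illustrative: decide, the limit of 3 attempts, and the 1000ms exponential backoff base are assumptions, not values taken from SessionRetry.policy.

```typescript
type Decision = { action: "retry"; delayMs: number } | { action: "fail" }

// failedAttempts is the number of attempts that have failed so far (1-based).
// Retries with exponential backoff until maxAttempts failures have occurred,
// then surfaces a hard error. Non-retryable errors fail immediately.
function decide(
  failedAttempts: number,
  retryable: boolean,
  maxAttempts = 3, // hypothetical limit
  baseDelayMs = 1000, // hypothetical backoff base
): Decision {
  if (!retryable || failedAttempts >= maxAttempts) return { action: "fail" }
  return { action: "retry", delayMs: baseDelayMs * 2 ** (failedAttempts - 1) }
}
```

With these assumed defaults, the first watchdog timeout would be retried after 1000ms and the second after 2000ms, and only a third consecutive failure would reach the user.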

PawWork version

0.0.0-prod-202605181651

OS version

macOS 26 (Darwin 25.4.0)

Can you reproduce it again?

Sometimes

Diagnostics

  • Session: ses_1c1b6ccdbffes5qfwa7ovaOcLH. Failing assistant message: msg_e3e9723a10015WNXnu81BTQeXD.
  • Trace counters from the session export: dur_ms: 30204, stream_events.start: 1, all other counters (start_step, text_*, reasoning_*, tool_input_*, tool_call, tool_result, tool_error, error, finish_step, finish) 0, tokens.input/output/reasoning: 0, flags.stream_error: true, flags.empty_completion: false. Provider emitted no error event; PawWork's watchdog aborted the stream.
  • The preceding trace msg_e3e96c8030015laTiPT5gmzjpD finished cleanly with finish_reason: tool-calls 17 seconds earlier, so this is not a stale connection. A user retry 16 seconds after the failure (msg_e3e97d832001lbotXyQzAOF05y) succeeded in 16.6s with 26 text deltas, confirming the model and account were healthy.
  • Investigation chain confirming this is the #729 residual: grep for the error literal points to session/llm.ts:466. git log -- packages/opencode/src/session/llm.ts shows 610241905 fix: defer LLM stream connect timeout to after HTTP request is sent (#729) as the most recent change to that file. PR #729's body explicitly defers two follow-ups: (1) SessionRetry.policy not treating connect timeouts as retryable, and (2) connectTimeoutMs not being configurable end-to-end. The build identifier in this report (0.0.0-prod-202605181651, built 2026-05-18 16:51) postdates the PR #729 merge (2026-05-18 08:06 UTC), so the timer-start fix is present and the residual 30s ceiling is what fired here.
  • Full session export (pawwork-session-neon-orchid-2026-05-19-04-58-27-...json, ~5.1MB) available locally on request.
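
The trace signature above (only a start event, a stream error, and a duration at or just past the timeout) could be checked mechanically when triaging other session exports. The sketch below assumes a simplified TraceCounters shape with hypothetical field names; the real export schema may differ.

```typescript
// Simplified, assumed view of the per-message trace counters.
interface TraceCounters {
  durMs: number
  streamEvents: Record<string, number>
  streamError: boolean
}

// True when the trace matches the pattern in this report: the stream errored,
// nothing other than `start` was ever counted, and the duration reached the
// watchdog timeout.
function looksLikeWatchdogAbort(trace: TraceCounters, timeoutMs = 30000): boolean {
  const nonStartEvents = Object.entries(trace.streamEvents)
    .filter(([name]) => name !== "start")
    .reduce((sum, [, count]) => sum + count, 0)
  return trace.streamError && nonStartEvents === 0 && trace.durMs >= timeoutMs
}
```

Applied to the numbers in this report, the failing message (dur_ms 30204, start 1, all other counters 0) matches, while the successful retry (16.6s, 26 text deltas) does not.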

Metadata

Assignees

No one assigned

    Labels

    P1 (High priority), bug (Something isn't working), harness (Model harness, prompts, tool descriptions, and session mechanics)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
