fix: scope reasoning safe retry timeouts by attempt#922
Conversation
|
Warning Review limit reached
More reviews will be available in 8 minutes and 22 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThis PR implements faster recovery from stalled reasoning-model connections by introducing per-attempt connection timeout tuning. Reasoning model first attempts use a shorter 60-second timeout to fail fast, while retry attempts use the longer 120-second timeout to protect legitimate slow starts. It broadens safe recovery conditions and updates observability to record per-attempt timeout values, with corresponding i18n messaging updates. ChangesReasoning model connect timeout recovery
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates the connection timeout logic for reasoning-capable models, introducing a fast-fail timeout of 60 seconds on the first attempt and a longer 120-second timeout on the safe retry. It also updates the copy for safe retries and failures to be less technical across English, Simplified Chinese, and Traditional Chinese. The reviewer feedback suggests extending the contract tests to also assert the Traditional Chinese (zht) translation changes, ensuring they do not regress.
Perf delta summaryComparator: pass
|
…afe-retry-timeouts # Conflicts: # packages/opencode/src/session/run-incident/policy.ts
## Summary - Added a `retry-decision` metadata layer for model execution retry decisions. - Kept technical retryability, safety-gate output, stream attempts, safe-recovery replay budget, timeout policy, and presentation as separate facts. - Lightly wired `session.processor` to consume the new decision metadata without moving the existing retry loop or changing #922 behavior. - Addressed review feedback by using `modelStreamAttempt` instead of provider-budget wording and by representing blocked-boundary reasoning attempts as `reasoning_global_protected`. ## Verification - `bun test test/session/retry-decision.test.ts` -> 6 passed. - `bun test test/session/retry.test.ts` -> 32 passed. - `bun test test/session/processor-effect.test.ts` -> 33 passed. - `bun run typecheck` in `packages/opencode` -> passed. - PR CI passed: typecheck, lint, unit-app, unit-ui, unit-opencode, unit-desktop, codeql, dependency-review, desktop-smoke, e2e-artifacts, commit/title/triage checks. ## Related Part of #925.
## Summary - Extract safe-recovery replay safety into RunIncident.evaluateReplaySafety so the gate owns only replay permission, recovery mode, and blocked reason. - Move side-effect boundary snapshot construction into RunObservability and keep processor retry scheduling unchanged. - Keep retry attempt metadata and presentation mapping in retry-decision, preserving #922 safe-retry behavior. ## Verification - bun test test/session/run-incident-safety-gate.test.ts test/session/retry-decision.test.ts: 10 passed - bun test test/session/run-observability-boundary.test.ts test/session/run-incident-safety-gate.test.ts test/session/retry-decision.test.ts: 11 passed - bun test test/session/run-observability.test.ts: 63 passed - bun test test/session/processor-effect.test.ts: 33 passed - bun run typecheck: ok - PR CI: green, including unit-opencode and smoke-macos-arm64 ## Review Follow-up - Addressed Gemini import/test coverage comments. - Addressed P3 layer-boundary feedback by keeping presentation and attempt kind outside the safety gate. Closes part of #925.
Summary:\n- Add a dedicated safe-recovery retry policy with separate one-replay budget metadata and lightweight retry presentation.\n- Route session.processor safe-recovery status/backoff scheduling through that policy instead of local retry status and sleep mechanics.\n- Preserve #922 reasoning timeout behavior, lifecycle-close checks, safe_retry_failed notice behavior, and visible-output replay guards.\n\nReview follow-up:\n- Removed the remaining processor-local sleep so the retry policy step owns the backoff delay.\n- Added policy-level coverage for the safe-recovery budget exhaustion branch and tightened it to assert schedule completion rather than generic failure.\n- Verified Cause.done exists in the current Effect runtime; the review concern was a false positive on API existence, but it exposed a useful test-strengthening gap.\n\nCI recovery note:\n- The initial empty retrigger commit did not create pull_request workflow runs. Two temporary marker commits were used to trigger the missing required checks; they leave no final PR diff and are intentionally collapsed by this squash merge.\n\nVerification:\n- TDD RED: safeRecoveryPolicy initially missing; focused test failed for that reason.\n- bun test test/session/retry.test.ts test/session/retry-decision.test.ts test/session/run-incident-safety-gate.test.ts: 44 passed.\n- bun test test/session/processor-effect.test.ts -t "reasoning connect watchdog is attempt-scoped for before-progress safe retry": passed.\n- bun test test/session/processor-effect.test.ts -t "reasoning-only retry writes a notice after the one safe retry is exhausted": passed.\n- bun test test/session/processor-effect.test.ts -t "connect timeout auto retry stops if lifecycle closes during backoff": passed.\n- bun test test/session/processor-effect.test.ts -t "retryable stream error after visible output does not replay the assistant message": passed before the final review follow-up.\n- bun run typecheck: passed.\n- git diff --check: passed.\n- PR CI required checks: passed after rerun.\n\nRelated:\n- Closes part of #925.
Summary
Scope reasoning-model safe-retry connect timeouts by processor attempt instead of changing the global reasoning stream timeout.
This PR also updates the safe-recovery retry/failure copy to the agreed short English and Chinese wording.
Why
Reasoning-model streams currently keep the global
120sconnect watchdog everywhere. That protects legitimate slow starts, but it also makes users wait two minutes before PawWork can safely recover from a no-output/no-tool stalled connection.#914 made the before-progress no-output/no-tool boundary safe to retry. This follow-up applies the #918 two-attempt strategy: fail fast on the first stalled reasoning-model main response, then protect the one automatic safe retry with the previous 120s ceiling.
Related Issue
Closes #918
Human Review Status
Pending
Review Focus
Please review the processor-level timeout merge order and the safe-recovery presentation predicate. The important invariant is that only the session processor main-response recovery flow gets
60s -> 120s; title generation and other non-recoveryllm.stream()callers still use the global reasoning timeout.Risk Notes
The intentional behavior change is that retryable no-output/no-tool failures that exhaust the one automatic recovery retry now render the
safe_retry_failednotice instead of a generic assistant error.Platform checklist left unticked: no platform, packaging, updater, signing, shell, or native path behavior changed.
How To Verify
Screenshots or Recordings
Safe-retry snap target was checked with
bun run snap safe-retry.Checklist
bug,enhancement,task,documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.app,ui,platform,harness,ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.P0,P1,P2,P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.Pending,Approved by @<reviewer>, orNot required: <reason>(default isPending; "not required" is restricted to bot-authored low-risk PRs).dev, and my PR title and commit messages use Conventional Commits in English.Summary by CodeRabbit
Bug Fixes
Tests