fix: gate interrupted stream retries on recovery facts#863
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (10)
📝 WalkthroughWalkthroughImplements a one-time safe-recovery automatic retry for interrupted LLM streams using side-effect boundary snapshots. Adds snapshot types and re-exports, records snapshots in RunObservability, refactors processor to a manual retry loop with fixed backoff and lifecycle gating, updates incident derivation/policy, expands export sanitization, and adds tests. ChangesSafe Recovery Auto-Retry
Sequence Diagram(s)sequenceDiagram
participant Processor
participant RunObservability as Recorder
participant LLM
participant Exporter
Processor->>Recorder: recordSideEffectBoundarySnapshot(attemptID, snapshot)
Processor->>LLM: startStream(attempt)
LLM->>Processor: stream events / progress / error
Processor->>Recorder: recordAttemptFailureAndDeriveRecovery(evidence)
Recorder->>Processor: recoveryRecommendation (auto_retry_once / do_not_retry / ask_user_before_retry)
alt auto_retry_once and lifecycle allows
Processor->>Processor: sleep BACKOFF_MS
Processor->>LLM: startStream(retry attempt)
else non-auto-retry
Processor->>Processor: halt(recordFailure=false, interruptionMessage)
Processor->>Exporter: include incident / recovered_incidents in export
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request implements a safe recovery and auto-retry mechanism for LLM streams within the SessionProcessor. It introduces watchdog phase detection to identify connection timeouts and side-effect boundary snapshots to ensure retries only occur when no external side effects have been triggered. The observability system is updated to record these snapshots and track multiple attempts per run. Review feedback focuses on refining the observability recorder to handle terminal outcomes correctly across retries, internationalizing new user-facing interruption messages, and improving type safety when merging error metadata.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/opencode/src/session/processor.ts`:
- Around line 1289-1319: The interruption message is computed once into decision
via ctx.runTrace.recordAttemptFailureAndDeriveRecovery before a possible
backoff, so if retryStillAllowed("after_backoff") becomes false you may surface
a stale decision; update the code to recompute the recovery decision and
interruption message immediately before calling yield* halt(...) when a
previously-planned auto-retry was aborted (i.e. after the backoff sleep and
retryStillAllowed check fails). Specifically, call
ctx.runTrace.recordAttemptFailureAndDeriveRecovery again (using the same
attemptID, error, watchdog/evidence details) and pass
recoveryInterruptionMessage(newDecision) to halt so the user-facing interruption
reason reflects the latest lifecycle state; keep existing guards around
automaticStreamRetriesUsed and SAFE_RECOVERY_AUTO_RETRY_BACKOFF_MS.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9c66ec86-5843-4ff8-942f-8ac5e56268cd
📒 Files selected for processing (9)
packages/opencode/src/session/processor.tspackages/opencode/src/session/run-incident/policy.tspackages/opencode/src/session/run-incident/types.tspackages/opencode/src/session/run-observability/index.tspackages/opencode/src/session/run-observability/recorder.tspackages/opencode/src/session/run-observability/types.tspackages/opencode/test/session/export.test.tspackages/opencode/test/session/processor-effect.test.tspackages/opencode/test/session/run-observability.test.ts
Summary
Gates interrupted stream retries on RunIncident recovery facts instead of legacy error-only retry classification.
This PR adds the first safe recovery slice for interrupted LLM streams:
watchdog_timeoutincidents and preserves attempt history across retry success/failure.Why
#857 showed a GPT-5.5 stream watchdog timeout before first provider progress. The failed attempt had no user-visible output and no tool side effects, but PawWork surfaced a terminal timeout instead of safely retrying.
The root issue was that execution retry was still driven by
SessionRetry.policy, which only looked at error shape. Diagnostics already had enough run facts to decide whether retry was safe, but those facts were not connected to stream retry scheduling. That also left a risk in the other direction: a retryable stream error after visible output could be replayed by the old retry path.Related Issue
Closes #857.
Addresses the first execution slice of #804. This does not close #804 because Continue/Resume UI/API are still deferred.
Human Review Status
Pending
Review Focus
Please focus on the stream retry safety boundary:
recordAttemptFailureAndDeriveRecovery;Risk Notes
Behavior risk: stream retry scheduling now lives in the processor loop instead of
Effect.retry(SessionRetry.policy(...)). The intent is narrower stream retry behavior: at most one fact-gated retry for safe interrupted streams. This intentionally also limits generic retryable 429/5xx stream failures to one safe-recovery retry in PR1; restoring multi-attempt Retry-After/backoff provider transient retry should be handled as a separate policy follow-up.Skipped conditional manual UI/copy check: no renderer component changed. Server-side interrupted copy is asserted in processor-effect tests, and full recovery card UI remains deferred to #804.
Skipped conditional platform check: no platform, packaging, updater, signing, path, shell, or permission surface changed.
Skipped conditional docs/release/dependency note: no docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes are part of this PR.
How To Verify
Screenshots or Recordings
Not applicable. No renderer UI component changed.
Checklist
bug,enhancement,task,documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.app,ui,platform,harness,ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.P0,P1,P2,P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.Pending,Approved by @<reviewer>, orNot required: <reason>(default isPending; "not required" is restricted to bot-authored low-risk PRs).dev, and my PR title and commit messages use Conventional Commits in English.Summary by CodeRabbit
New Features
Bug Fixes
Improvements
Tests