Skip to content

fix: gate interrupted stream retries on recovery facts#863

Merged
Astro-Han merged 6 commits into
devfrom
codex/i857-safe-recovery
May 23, 2026
Merged

fix: gate interrupted stream retries on recovery facts#863
Astro-Han merged 6 commits into
devfrom
codex/i857-safe-recovery

Conversation

@Astro-Han

@Astro-Han Astro-Han commented May 23, 2026

Copy link
Copy Markdown
Owner

Summary

Gates interrupted stream retries on RunIncident recovery facts instead of legacy error-only retry classification.

This PR adds the first safe recovery slice for interrupted LLM streams:

  • Auto-retries once only when the failed stream attempt for the current provider request proves there was no visible output, no materialized tool call, no tool execution, and complete side-effect facts.
  • Keeps earlier assistant-message history as diagnostics, but does not let completed prior steps block a clean current provider request retry.
  • Records watchdog timeouts as first-class watchdog_timeout incidents and preserves attempt history across retry success/failure.
  • Adds sanitized side-effect boundary snapshots so unknown/unclassified request tool boundaries fail closed.
  • Prevents retryable stream errors after visible output from silently replaying the assistant message.
  • Stops scheduled auto-retry if a local lifecycle close appears during retry backoff, and keeps lifecycle interruption copy aligned with diagnostics.
  • Treats user cancellation during retry backoff as an abort instead of a recovered success.
  • Computes side-effect boundary snapshots from the same active tools exposed to the provider.
  • Replaces raw provider/SDK interruption messages with structured interrupted copy for non-auto recovery cases.

Why

#857 showed a GPT-5.5 stream watchdog timeout before first provider progress. The failed attempt had no user-visible output and no tool side effects, but PawWork surfaced a terminal timeout instead of safely retrying.

The root issue was that execution retry was still driven by SessionRetry.policy, which only looked at error shape. Diagnostics already had enough run facts to decide whether retry was safe, but those facts were not connected to stream retry scheduling. That also left a risk in the other direction: a retryable stream error after visible output could be replayed by the old retry path.

Related Issue

Closes #857.
Addresses the first execution slice of #804. This does not close #804 because Continue/Resume UI/API are still deferred.

Human Review Status

Pending

Review Focus

Please focus on the stream retry safety boundary:

  • every issued stream retry should go through recordAttemptFailureAndDeriveRecovery;
  • visible output, partial tool input, materialized tool calls, tool execution, lifecycle close, and incomplete side-effect facts should block silent auto-retry;
  • successful auto-retry should preserve attempt diagnostics without leaving a terminal assistant error;
  • side-effect boundary snapshots should remain sanitized and low-cardinality.

Risk Notes

Behavior risk: stream retry scheduling now lives in the processor loop instead of Effect.retry(SessionRetry.policy(...)). The intent is narrower stream retry behavior: at most one fact-gated retry for safe interrupted streams. This intentionally also limits generic retryable 429/5xx stream failures to one safe-recovery retry in PR1; restoring multi-attempt Retry-After/backoff provider transient retry should be handled as a separate policy follow-up.

Skipped conditional manual UI/copy check: no renderer component changed. Server-side interrupted copy is asserted in processor-effect tests, and full recovery card UI remains deferred to #804.

Skipped conditional platform check: no platform, packaging, updater, signing, path, shell, or permission surface changed.

Skipped conditional docs/release/dependency note: no docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes are part of this PR.

How To Verify

Focused tests: bun test test/session/run-observability.test.ts test/session/processor-effect.test.ts test/session/export.test.ts --timeout 40000 -> 120 pass, 0 fail
Typecheck: bun run typecheck -> pass
Diff check: git diff --check -> pass

Screenshots or Recordings

Not applicable. No renderer UI component changed.

Checklist

How to use this checklist:

  • Tick a box by replacing [ ] with [x]. Do not edit, add, or remove items.
  • The bot-applied label items can only be honestly ticked AFTER the PR is opened and the labeler / priority-triage bots have run — return to the PR description and tick them then.
  • Most items are required. The few that are conditional are explicitly marked (conditional); for those, leave unticked if they truly do not apply and explain why in Risk Notes. All other items must be ticked before requesting human review.
  • Type label — this PR carries exactly one of bug, enhancement, task, documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.
  • Routing labels — this PR carries at least one of app, ui, platform, harness, ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.
  • Priority label — this PR carries exactly one of P0, P1, P2, P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.
  • Human Review Status above is set to Pending, Approved by @<reviewer>, or Not required: <reason> (default is Pending; "not required" is restricted to bot-authored low-risk PRs).
  • I linked the related issue, or stated in Summary why there is no issue.
  • I described the review focus and any meaningful risks.
  • I replaced the example block in How To Verify with the real verification steps and the key result for each.
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope.
  • (conditional) I manually checked visible UI or copy changes when needed, with screenshots or recordings. Leave unticked only if no visible UI or copy changed.
  • (conditional) I considered macOS and Windows impact for platform, packaging, updater, signing, paths, shell, or permissions changes. Leave unticked only if no platform/packaging surface was touched.
  • (conditional) I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant. Leave unticked only if none of those surfaces was touched.
  • I reviewed the final diff for unrelated changes and suspicious dependency changes.
  • I am targeting dev, and my PR title and commit messages use Conventional Commits in English.

Summary by CodeRabbit

  • New Features

    • Single automatic safe-recovery retry for transient stream/connect failures (with fixed backoff)
    • Recovered-incident tracking surfaced in diagnostics/exports
  • Bug Fixes

    • Prevents unwanted retries when user cancels or local lifecycle closes
    • Avoids replaying assistant output after retryable stream termination
  • Improvements

    • Expanded observability with side-effect boundary snapshots and richer recovery reasons
  • Tests

    • Added live and unit tests covering retry, recovery and observability behaviors

Review Change Stack

@Astro-Han Astro-Han added bug Something isn't working P1 High priority harness Model harness, prompts, tool descriptions, and session mechanics labels May 23, 2026
@coderabbitai

coderabbitai Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: def7883b-c973-4314-a8ab-c59c477bba80

📥 Commits

Reviewing files that changed from the base of the PR and between 4fc12f4 and 39761dd.

📒 Files selected for processing (10)
  • packages/opencode/src/session/export.ts
  • packages/opencode/src/session/llm.ts
  • packages/opencode/src/session/processor.ts
  • packages/opencode/src/session/run-incident/derive.ts
  • packages/opencode/src/session/run-incident/policy.ts
  • packages/opencode/src/session/run-observability/recorder.ts
  • packages/opencode/src/session/run-observability/types.ts
  • packages/opencode/test/session/export.test.ts
  • packages/opencode/test/session/processor-effect.test.ts
  • packages/opencode/test/session/run-observability.test.ts

📝 Walkthrough

Walkthrough

Implements a one-time safe-recovery automatic retry for interrupted LLM streams using side-effect boundary snapshots. Adds snapshot types and re-exports, records snapshots in RunObservability, refactors processor to a manual retry loop with fixed backoff and lifecycle gating, updates incident derivation/policy, expands export sanitization, and adds tests.

Changes

Safe Recovery Auto-Retry

Layer / File(s) Summary
Side-effect snapshot types
packages/opencode/src/session/run-observability/types.ts, packages/opencode/src/session/run-observability/index.ts, packages/opencode/src/session/run-incident/types.ts
Adds SideEffectBoundarySnapshot, re-exports it from RunObservability, and attaches optional side_effect_boundary_snapshot to incident evidence and summaries.
Export incident flattening & sanitization
packages/opencode/src/session/export.ts, packages/opencode/test/session/export.test.ts
Collects incidents from run_observability summaries (including recovered_incidents), sanitizes recovered_incidents, adjusts schema-version logic, and adds a test verifying sanitized snapshots exclude raw/sensitive provider content.
LLM tool resolution API
packages/opencode/src/session/llm.ts
Exports resolveTools from the module to allow external reuse of tool-resolution/permission logic.
Processor helpers & halt extension
packages/opencode/src/session/processor.ts
Adds fixed BACKOFF_MS, watchdog-phase classifier, side-effect snapshot derivation, interruption-message mapper, and extends halt() to accept { recordFailure?: boolean; interruptionMessage?: string }.
Manual processor retry loop
packages/opencode/src/session/processor.ts
Refactors process() into a manual retry loop with runAttempt() that records side-effect snapshots, drains the LLM stream, derives recovery on failure, and allows at most one automatic safe-recovery retry guarded by lifecycle-close checks and recovery signals.
Facts derivation & recovery policy
packages/opencode/src/session/run-incident/derive.ts, packages/opencode/src/session/run-incident/policy.ts
Compute attempt-scoped side_effect_facts_complete in factsFromEvidence; use terminalFacts for terminal gating; add early do_not_retry branches for user_cancel_seen and lifecycle_close_seen.
Recorder: snapshot recording & incident derivation
packages/opencode/src/session/run-observability/recorder.ts, packages/opencode/src/session/run-observability/types.ts
Recorder accepts/emits side_effect_boundary_snapshot, updates proof/sideEffectFactsComplete, centralizes deriveCurrentIncident, records transport vs watchdog evidence, tracks recovered attempts/incidents, and includes snapshot + recovered_incidents in final summary; adds watchdog_timeout classification handling.
Processor & observability tests
packages/opencode/test/session/processor-effect.test.ts, packages/opencode/test/session/run-observability.test.ts
Adds live and unit tests covering one-time auto-retry success/failure paths, lifecycle-close gating during backoff, backoff interruption, safe retry with disabled unknown tools, non-replay of visible output, watchdog-timeout derivation, and side-effect incomplete denial.

Sequence Diagram(s)

sequenceDiagram
  participant Processor
  participant RunObservability as Recorder
  participant LLM
  participant Exporter
  Processor->>Recorder: recordSideEffectBoundarySnapshot(attemptID, snapshot)
  Processor->>LLM: startStream(attempt)
  LLM->>Processor: stream events / progress / error
  Processor->>Recorder: recordAttemptFailureAndDeriveRecovery(evidence)
  Recorder->>Processor: recoveryRecommendation (auto_retry_once / do_not_retry / ask_user_before_retry)
  alt auto_retry_once and lifecycle allows
    Processor->>Processor: sleep BACKOFF_MS
    Processor->>LLM: startStream(retry attempt)
  else non-auto-retry
    Processor->>Processor: halt(recordFailure=false, interruptionMessage)
    Processor->>Exporter: include incident / recovered_incidents in export
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Astro-Han/pawwork#812: Modifies run-incident/run-observability pipeline; related to incident evidence and recovery wiring.
  • Astro-Han/pawwork#825: Overlaps processor interruption/scope-closed handling and lifecycle metadata.
  • Astro-Han/pawwork#558: Related connect-timeout behavior in session/llm.ts that this PR builds on for retry semantics.

Poem

🐰 A stream went quiet before the show,

No tools, no traces, nothing to sow.
We snapshot the boundary, wait a beat—then try once more,
If life closes the door, we stop, not roar.
A soft retry, a tidy trace, hop on—no chore.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title 'fix: gate interrupted stream retries on recovery facts' is concise, clear, and accurately summarizes the main change: moving stream retry logic from legacy error classification to RunIncident recovery facts.
Description check ✅ Passed PR description follows the template with all required sections: Summary, Why, Related Issue, Human Review Status, Review Focus, Risk Notes, How To Verify, and completed checklist items.
Linked Issues check ✅ Passed PR fully addresses #857 objectives (auto-retry once for pre-first-provider-progress failures with no visible output/tool effects; watchdog-fired evidence; preserved diagnostics) and implements the first execution slice of #804 (safe recovery policy based on run facts: visible output, tool materialization/execution, side-effect completeness, lifecycle close as blockers).
Out of Scope Changes check ✅ Passed All changes are in-scope: stream retry scheduling refactored to use recovery facts; watchdog incidents recorded; side-effect boundary snapshots added; lifecycle close and user cancel handling implemented; all directly address #857/#804 objectives without introducing unrelated refactors or dependencies.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/i857-safe-recovery

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested priority: P2 (includes non-doc, non-test paths outside the low-risk bucket).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a safe recovery and auto-retry mechanism for LLM streams within the SessionProcessor. It introduces watchdog phase detection to identify connection timeouts and side-effect boundary snapshots to ensure retries only occur when no external side effects have been triggered. The observability system is updated to record these snapshots and track multiple attempts per run. Review feedback focuses on refining the observability recorder to handle terminal outcomes correctly across retries, internationalizing new user-facing interruption messages, and improving type safety when merging error metadata.

Comment thread packages/opencode/src/session/run-observability/recorder.ts
Comment thread packages/opencode/src/session/processor.ts
Comment thread packages/opencode/src/session/processor.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/opencode/src/session/processor.ts`:
- Around line 1289-1319: The interruption message is computed once into decision
via ctx.runTrace.recordAttemptFailureAndDeriveRecovery before a possible
backoff, so if retryStillAllowed("after_backoff") becomes false you may surface
a stale decision; update the code to recompute the recovery decision and
interruption message immediately before calling yield* halt(...) when a
previously-planned auto-retry was aborted (i.e. after the backoff sleep and
retryStillAllowed check fails). Specifically, call
ctx.runTrace.recordAttemptFailureAndDeriveRecovery again (using the same
attemptID, error, watchdog/evidence details) and pass
recoveryInterruptionMessage(newDecision) to halt so the user-facing interruption
reason reflects the latest lifecycle state; keep existing guards around
automaticStreamRetriesUsed and SAFE_RECOVERY_AUTO_RETRY_BACKOFF_MS.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9c66ec86-5843-4ff8-942f-8ac5e56268cd

📥 Commits

Reviewing files that changed from the base of the PR and between 694a390 and 4fc12f4.

📒 Files selected for processing (9)
  • packages/opencode/src/session/processor.ts
  • packages/opencode/src/session/run-incident/policy.ts
  • packages/opencode/src/session/run-incident/types.ts
  • packages/opencode/src/session/run-observability/index.ts
  • packages/opencode/src/session/run-observability/recorder.ts
  • packages/opencode/src/session/run-observability/types.ts
  • packages/opencode/test/session/export.test.ts
  • packages/opencode/test/session/processor-effect.test.ts
  • packages/opencode/test/session/run-observability.test.ts

Comment thread packages/opencode/src/session/processor.ts
@Astro-Han Astro-Han merged commit 9f21e74 into dev May 23, 2026
27 of 28 checks passed
@Astro-Han Astro-Han deleted the codex/i857-safe-recovery branch May 23, 2026 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working harness Model harness, prompts, tool descriptions, and session mechanics P1 High priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LLM stream timeout before first provider progress is not auto-retried [Feature] Add safe recovery for interrupted streaming runs

1 participant