fix: gate interrupted stream retries on recovery facts by Astro-Han · Pull Request #863 · Astro-Han/pawwork

Astro-Han · 2026-05-23T09:10:39Z

Summary

Gates interrupted stream retries on RunIncident recovery facts instead of legacy error-only retry classification.

This PR adds the first safe recovery slice for interrupted LLM streams:

Auto-retries once only when the failed stream attempt for the current provider request proves there was no visible output, no materialized tool call, no tool execution, and complete side-effect facts.
Keeps earlier assistant-message history as diagnostics, but does not let completed prior steps block a clean current provider request retry.
Records watchdog timeouts as first-class watchdog_timeout incidents and preserves attempt history across retry success/failure.
Adds sanitized side-effect boundary snapshots so unknown/unclassified request tool boundaries fail closed.
Prevents retryable stream errors after visible output from silently replaying the assistant message.
Stops scheduled auto-retry if a local lifecycle close appears during retry backoff, and keeps lifecycle interruption copy aligned with diagnostics.
Treats user cancellation during retry backoff as an abort instead of a recovered success.
Computes side-effect boundary snapshots from the same active tools exposed to the provider.
Replaces raw provider/SDK interruption messages with structured interrupted copy for non-auto recovery cases.

Why

#857 showed a GPT-5.5 stream watchdog timeout before first provider progress. The failed attempt had no user-visible output and no tool side effects, but PawWork surfaced a terminal timeout instead of safely retrying.

The root issue was that execution retry was still driven by SessionRetry.policy, which only looked at error shape. Diagnostics already had enough run facts to decide whether retry was safe, but those facts were not connected to stream retry scheduling. That also left a risk in the other direction: a retryable stream error after visible output could be replayed by the old retry path.

Related Issue

Closes #857.
Addresses the first execution slice of #804. This does not close #804 because Continue/Resume UI/API are still deferred.

Human Review Status

Pending

Review Focus

Please focus on the stream retry safety boundary:

every issued stream retry should go through recordAttemptFailureAndDeriveRecovery;
visible output, partial tool input, materialized tool calls, tool execution, lifecycle close, and incomplete side-effect facts should block silent auto-retry;
successful auto-retry should preserve attempt diagnostics without leaving a terminal assistant error;
side-effect boundary snapshots should remain sanitized and low-cardinality.

Risk Notes

Behavior risk: stream retry scheduling now lives in the processor loop instead of Effect.retry(SessionRetry.policy(...)). The intent is narrower stream retry behavior: at most one fact-gated retry for safe interrupted streams. This intentionally also limits generic retryable 429/5xx stream failures to one safe-recovery retry in PR1; restoring multi-attempt Retry-After/backoff provider transient retry should be handled as a separate policy follow-up.

Skipped conditional manual UI/copy check: no renderer component changed. Server-side interrupted copy is asserted in processor-effect tests, and full recovery card UI remains deferred to #804.

Skipped conditional platform check: no platform, packaging, updater, signing, path, shell, or permission surface changed.

Skipped conditional docs/release/dependency note: no docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes are part of this PR.

How To Verify

Focused tests: bun test test/session/run-observability.test.ts test/session/processor-effect.test.ts test/session/export.test.ts --timeout 40000 -> 120 pass, 0 fail
Typecheck: bun run typecheck -> pass
Diff check: git diff --check -> pass

Screenshots or Recordings

Not applicable. No renderer UI component changed.

Checklist

How to use this checklist:

Tick a box by replacing [ ] with [x]. Do not edit, add, or remove items.

The bot-applied label items can only be honestly ticked AFTER the PR is opened and the labeler / priority-triage bots have run — return to the PR description and tick them then.

Most items are required. The few that are conditional are explicitly marked (conditional); for those, leave unticked if they truly do not apply and explain why in Risk Notes. All other items must be ticked before requesting human review.

Summary by CodeRabbit

New Features
- Single automatic safe-recovery retry for transient stream/connect failures (with fixed backoff)
- Recovered-incident tracking surfaced in diagnostics/exports
Bug Fixes
- Prevents unwanted retries when user cancels or local lifecycle closes
- Avoids replaying assistant output after retryable stream termination
Improvements
- Expanded observability with side-effect boundary snapshots and richer recovery reasons
Tests
- Added live and unit tests covering retry, recovery and observability behaviors

coderabbitai · 2026-05-23T09:10:52Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: def7883b-c973-4314-a8ab-c59c477bba80

📥 Commits

Reviewing files that changed from the base of the PR and between 4fc12f4 and 39761dd.

📒 Files selected for processing (10)

packages/opencode/src/session/export.ts
packages/opencode/src/session/llm.ts
packages/opencode/src/session/processor.ts
packages/opencode/src/session/run-incident/derive.ts
packages/opencode/src/session/run-incident/policy.ts
packages/opencode/src/session/run-observability/recorder.ts
packages/opencode/src/session/run-observability/types.ts
packages/opencode/test/session/export.test.ts
packages/opencode/test/session/processor-effect.test.ts
packages/opencode/test/session/run-observability.test.ts

📝 Walkthrough

Walkthrough

Implements a one-time safe-recovery automatic retry for interrupted LLM streams using side-effect boundary snapshots. Adds snapshot types and re-exports, records snapshots in RunObservability, refactors processor to a manual retry loop with fixed backoff and lifecycle gating, updates incident derivation/policy, expands export sanitization, and adds tests.

Changes

Safe Recovery Auto-Retry

Layer / File(s)	Summary
Side-effect snapshot types `packages/opencode/src/session/run-observability/types.ts`, `packages/opencode/src/session/run-observability/index.ts`, `packages/opencode/src/session/run-incident/types.ts`	Adds `SideEffectBoundarySnapshot`, re-exports it from `RunObservability`, and attaches optional `side_effect_boundary_snapshot` to incident evidence and summaries.
Export incident flattening & sanitization `packages/opencode/src/session/export.ts`, `packages/opencode/test/session/export.test.ts`	Collects incidents from run_observability summaries (including `recovered_incidents`), sanitizes recovered_incidents, adjusts schema-version logic, and adds a test verifying sanitized snapshots exclude raw/sensitive provider content.
LLM tool resolution API `packages/opencode/src/session/llm.ts`	Exports `resolveTools` from the module to allow external reuse of tool-resolution/permission logic.
Processor helpers & halt extension `packages/opencode/src/session/processor.ts`	Adds fixed BACKOFF_MS, watchdog-phase classifier, side-effect snapshot derivation, interruption-message mapper, and extends `halt()` to accept `{ recordFailure?: boolean; interruptionMessage?: string }`.
Manual processor retry loop `packages/opencode/src/session/processor.ts`	Refactors `process()` into a manual retry loop with `runAttempt()` that records side-effect snapshots, drains the LLM stream, derives recovery on failure, and allows at most one automatic safe-recovery retry guarded by lifecycle-close checks and recovery signals.
Facts derivation & recovery policy `packages/opencode/src/session/run-incident/derive.ts`, `packages/opencode/src/session/run-incident/policy.ts`	Compute attempt-scoped `side_effect_facts_complete` in `factsFromEvidence`; use `terminalFacts` for terminal gating; add early `do_not_retry` branches for `user_cancel_seen` and `lifecycle_close_seen`.
Recorder: snapshot recording & incident derivation `packages/opencode/src/session/run-observability/recorder.ts`, `packages/opencode/src/session/run-observability/types.ts`	Recorder accepts/emits `side_effect_boundary_snapshot`, updates proof/sideEffectFactsComplete, centralizes `deriveCurrentIncident`, records transport vs watchdog evidence, tracks recovered attempts/incidents, and includes snapshot + `recovered_incidents` in final summary; adds `watchdog_timeout` classification handling.
Processor & observability tests `packages/opencode/test/session/processor-effect.test.ts`, `packages/opencode/test/session/run-observability.test.ts`	Adds live and unit tests covering one-time auto-retry success/failure paths, lifecycle-close gating during backoff, backoff interruption, safe retry with disabled unknown tools, non-replay of visible output, watchdog-timeout derivation, and side-effect incomplete denial.

Sequence Diagram(s)

sequenceDiagram
  participant Processor
  participant RunObservability as Recorder
  participant LLM
  participant Exporter
  Processor->>Recorder: recordSideEffectBoundarySnapshot(attemptID, snapshot)
  Processor->>LLM: startStream(attempt)
  LLM->>Processor: stream events / progress / error
  Processor->>Recorder: recordAttemptFailureAndDeriveRecovery(evidence)
  Recorder->>Processor: recoveryRecommendation (auto_retry_once / do_not_retry / ask_user_before_retry)
  alt auto_retry_once and lifecycle allows
    Processor->>Processor: sleep BACKOFF_MS
    Processor->>LLM: startStream(retry attempt)
  else non-auto-retry
    Processor->>Processor: halt(recordFailure=false, interruptionMessage)
    Processor->>Exporter: include incident / recovered_incidents in export
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Astro-Han/pawwork#812: Modifies run-incident/run-observability pipeline; related to incident evidence and recovery wiring.
Astro-Han/pawwork#825: Overlaps processor interruption/scope-closed handling and lifecycle metadata.
Astro-Han/pawwork#558: Related connect-timeout behavior in session/llm.ts that this PR builds on for retry semantics.

Poem

🐰 A stream went quiet before the show,

No tools, no traces, nothing to sow.
We snapshot the boundary, wait a beat—then try once more,
If life closes the door, we stop, not roar.
A soft retry, a tidy trace, hop on—no chore.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	Title 'fix: gate interrupted stream retries on recovery facts' is concise, clear, and accurately summarizes the main change: moving stream retry logic from legacy error classification to RunIncident recovery facts.
Description check	✅ Passed	PR description follows the template with all required sections: Summary, Why, Related Issue, Human Review Status, Review Focus, Risk Notes, How To Verify, and completed checklist items.
Linked Issues check	✅ Passed	PR fully addresses `#857` objectives (auto-retry once for pre-first-provider-progress failures with no visible output/tool effects; watchdog-fired evidence; preserved diagnostics) and implements the first execution slice of `#804` (safe recovery policy based on run facts: visible output, tool materialization/execution, side-effect completeness, lifecycle close as blockers).
Out of Scope Changes check	✅ Passed	All changes are in-scope: stream retry scheduling refactored to use recovery facts; watchdog incidents recorded; side-effect boundary snapshots added; lifecycle close and user cancel handling implemented; all directly address `#857/`#804 objectives without introducing unrelated refactors or dependencies.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/i857-safe-recovery

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions

Suggested priority: P2 (includes non-doc, non-test paths outside the low-risk bucket).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

gemini-code-assist

Code Review

This pull request implements a safe recovery and auto-retry mechanism for LLM streams within the SessionProcessor. It introduces watchdog phase detection to identify connection timeouts and side-effect boundary snapshots to ensure retries only occur when no external side effects have been triggered. The observability system is updated to record these snapshots and track multiple attempts per run. Review feedback focuses on refining the observability recorder to handle terminal outcomes correctly across retries, internationalizing new user-facing interruption messages, and improving type safety when merging error metadata.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/opencode/src/session/processor.ts`:
- Around line 1289-1319: The interruption message is computed once into decision
via ctx.runTrace.recordAttemptFailureAndDeriveRecovery before a possible
backoff, so if retryStillAllowed("after_backoff") becomes false you may surface
a stale decision; update the code to recompute the recovery decision and
interruption message immediately before calling yield* halt(...) when a
previously-planned auto-retry was aborted (i.e. after the backoff sleep and
retryStillAllowed check fails). Specifically, call
ctx.runTrace.recordAttemptFailureAndDeriveRecovery again (using the same
attemptID, error, watchdog/evidence details) and pass
recoveryInterruptionMessage(newDecision) to halt so the user-facing interruption
reason reflects the latest lifecycle state; keep existing guards around
automaticStreamRetriesUsed and SAFE_RECOVERY_AUTO_RETRY_BACKOFF_MS.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9c66ec86-5843-4ff8-942f-8ac5e56268cd

📥 Commits

Reviewing files that changed from the base of the PR and between 694a390 and 4fc12f4.

📒 Files selected for processing (9)

packages/opencode/src/session/processor.ts
packages/opencode/src/session/run-incident/policy.ts
packages/opencode/src/session/run-incident/types.ts
packages/opencode/src/session/run-observability/index.ts
packages/opencode/src/session/run-observability/recorder.ts
packages/opencode/src/session/run-observability/types.ts
packages/opencode/test/session/export.test.ts
packages/opencode/test/session/processor-effect.test.ts
packages/opencode/test/session/run-observability.test.ts

Astro-Han added 2 commits May 23, 2026 16:26

fix: gate stream retry on recovery facts

932a7f3

fix: harden safe recovery interruptions

4fc12f4

Astro-Han added bug Something isn't working P1 High priority harness Model harness, prompts, tool descriptions, and session mechanics labels May 23, 2026

github-actions Bot reviewed May 23, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 23, 2026

View reviewed changes

Comment thread packages/opencode/src/session/run-observability/recorder.ts

Comment thread packages/opencode/src/session/processor.ts

Comment thread packages/opencode/src/session/processor.ts Outdated

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

Comment thread packages/opencode/src/session/processor.ts

Astro-Han added 4 commits May 23, 2026 17:51

fix: scope safe retry to terminal attempt

1b50c47

fix: harden retry backoff recovery

3859e2e

test: lock safe retry attempt limit

bde7795

fix: guard interruption metadata shape

39761dd

Astro-Han merged commit 9f21e74 into dev May 23, 2026
27 of 28 checks passed

Astro-Han deleted the codex/i857-safe-recovery branch May 23, 2026 11:25

This was referenced May 26, 2026

fix: allow safe retry before provider progress #914

Merged

refactor(session): extract safe recovery gate #929

Merged

This was referenced May 31, 2026

enhancement: improve stream disconnect retry with exponential backoff and more attempts #1005

Closed

[Feature] improve stream disconnect retry with exponential backoff and more attempts #1006

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: gate interrupted stream retries on recovery facts#863

fix: gate interrupted stream retries on recovery facts#863
Astro-Han merged 6 commits into
devfrom
codex/i857-safe-recovery

Astro-Han commented May 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 23, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot left a comment

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Astro-Han commented May 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Related Issue

Human Review Status

Review Focus

Risk Notes

How To Verify

Screenshots or Recordings

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Astro-Han commented May 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 23, 2026 •

edited

Loading