fix: allow safe retry before provider progress by Astro-Han · Pull Request #914 · Astro-Han/pawwork

Astro-Han · 2026-05-26T02:35:07Z

Summary

Allow one automatic retry for retryable provider transport failures that happen before first provider progress when the terminal attempt produced no output and no tool activity.
Keep the path fail-closed when the boundary snapshot is missing, provider/external boundaries are present, lifecycle/user cancel is observed, or telemetry shows any output/tool activity.
Add regression coverage for the Issue #912 export shape ("[Bug] Safe retry is blocked by unknown exposed tool boundaries before provider progress"), plus negative cases for missing evidence and contradictory tool/output state.

Why

Issue #912 showed a production session where the provider returned Service Unavailable before any provider progress, assistant output, tool input, tool call, or tool execution. PawWork still asked the user before retrying because unknown exposed tool metadata made the static side-effect boundary proof incomplete. Retry safety should be based on what the terminal attempt actually reached; unknown metadata for uncalled tools should not block this early safe-retry case.
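The fail-closed rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in `policy.ts`: only the predicate name and the two snapshot flag names appear in this PR's discussion, and every other field name (`retryableTransport` as an input field, `beforeFirstProviderProgress`, `cancelObserved`, `outputStarted`, `toolActivity`, `snapshot`) is a hypothetical stand-in.

```typescript
// Hypothetical evidence shape. Only canAutoRetryBeforeFirstProviderProgress,
// provider_executed_capability_present and external_boundary_present are names
// taken from this PR; the remaining fields are illustrative assumptions.
type BoundarySnapshot = {
  provider_executed_capability_present: boolean;
  external_boundary_present: boolean;
};

type AttemptEvidence = {
  retryableTransport: boolean; // e.g. a Service Unavailable response
  beforeFirstProviderProgress: boolean;
  cancelObserved: boolean; // lifecycle close or user cancel
  outputStarted: boolean; // any assistant or reasoning output
  toolActivity: boolean; // tool input, tool call, or tool execution
  snapshot?: BoundarySnapshot; // undefined when no snapshot was recorded
};

function canAutoRetryBeforeFirstProviderProgress(e: AttemptEvidence): boolean {
  if (!e.retryableTransport || !e.beforeFirstProviderProgress) return false;
  if (e.cancelObserved) return false;
  if (e.outputStarted || e.toolActivity) return false;
  // Fail closed: a missing snapshot is never treated as proof of safety.
  if (e.snapshot === undefined) return false;
  if (e.snapshot.provider_executed_capability_present) return false;
  if (e.snapshot.external_boundary_present) return false;
  return true;
}

// Roughly the Issue #912 situation: the failure arrives before any progress,
// output, or tool activity, and the recorded snapshot shows no boundaries.
const issue912: AttemptEvidence = {
  retryableTransport: true,
  beforeFirstProviderProgress: true,
  cancelObserved: false,
  outputStarted: false,
  toolActivity: false,
  snapshot: {
    provider_executed_capability_present: false,
    external_boundary_present: false,
  },
};

const allowed = canAutoRetryBeforeFirstProviderProgress(issue912);
console.log(allowed);
```

Note that unknown metadata for exposed but uncalled tools does not appear as an input at all in this sketch; the decision rests only on what the terminal attempt actually reached.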

Related Issue

Closes #912

Human Review Status

Pending — waiting for a human reviewer to approve.

Review Focus

The new canAutoRetryBeforeFirstProviderProgress predicate in packages/opencode/src/session/run-incident/policy.ts, especially the fail-closed guards for missing snapshots, global lifecycle/user cancel, provider/external boundaries, and contradictory output/tool facts.
The negative matrix in packages/opencode/test/session/run-observability.test.ts to ensure the fast path only applies to truly early no-output/no-tool failures.

Risk Notes

The behavior change is intentionally narrow, but it affects retry policy for model transport failures. The main risk is over-retrying if evidence is incomplete; the fast path requires a recorded boundary snapshot and explicit absence of provider/external boundaries to mitigate that.

Skipped conditional checklist items:

Visible UI/copy check: not applicable; no visible UI or copy changed.
Platform impact check: not applicable; no platform, packaging, updater, signing, paths, shell, or permissions surface changed.
Docs/release/dependencies/etc.: not applicable; no docs, release notes, dependencies, permissions, credentials, generated content, or local file behavior changed.

How To Verify

RED check: `bun test test/session/run-observability.test.ts` failed on the new Issue #912 regression with `side_effect_facts_incomplete` before the fix.
Focused observability tests: `bun test test/session/run-observability.test.ts` (62 passed).
Processor integration tests: `bun test test/session/processor-effect.test.ts` (29 passed).
Typecheck: `bun run typecheck` in `packages/opencode` (passed).
Diff check: `git diff --check` (no whitespace errors).

Screenshots or Recordings

Not applicable; no visible UI changes.

Checklist

How to use this checklist:

Tick a box by replacing [ ] with [x]. Do not edit, add, or remove items.

The bot-applied label items can only be honestly ticked AFTER the PR is opened and the labeler / priority-triage bots have run — return to the PR description and tick them then.

Most items are required. The few that are conditional are explicitly marked (conditional); for those, leave unticked if they truly do not apply and explain why in Risk Notes. All other items must be ticked before requesting human review.

Summary by CodeRabbit

Bug Fixes
- Improved error recovery logic with enhanced automatic retry capabilities during incident handling.
- Refined safety checks to ensure retries are attempted only when appropriate conditions are met.
Tests
- Expanded test coverage for error recovery scenarios and retry behavior validation.

coderabbitai · 2026-05-26T02:35:19Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9704b870-58b2-40db-8111-b7457a42227c

📥 Commits

Reviewing files that changed from the base of the PR and between 1d53a14 and 5e43a7d.

📒 Files selected for processing (2)

packages/opencode/src/session/run-incident/policy.ts
packages/opencode/test/session/run-observability.test.ts

📝 Walkthrough

This PR fixes a production bug where safe retries were blocked by unknown exposed-tool boundaries even when no tools were called. It adds an early eligibility path in the incident retry policy that allows automatic retry for retryable transport failures occurring before first provider progress when no output or tool activity has been observed, regardless of incomplete boundary metadata.

Changes

Before-progress auto-retry eligibility

| Layer / File(s) | Summary |
| --- | --- |
| Before-progress auto-retry policy and helpers `packages/opencode/src/session/run-incident/policy.ts` | `recoveryFor` early decision tree adds `canAutoRetryBeforeFirstProviderProgress` eligibility check for retryable transport/timeout causes occurring before first provider progress with no tool/output activity and boundary snapshot permission. New internal helpers determine cause subcategory, detect absence of all activity flags, and gate on boundary proof/external/provider-executed capability. `user_cancel` and `local_lifecycle_close` returns moved earlier in decision flow. |
| Before-progress auto-retry test fixtures and validation `packages/opencode/test/session/run-observability.test.ts` | Adds shared test helpers (`beforeProgressCause`, `beforeProgressFacts`, `recoveryForBeforeProgress`) for before-progress retry scenarios. Test cases verify auto-retry is allowed when unknown tools have no activity, denied when boundary snapshot is missing or any boundary evidence appears, and denied for non-retryable transport errors. Replaces earlier "unknown boundary" test with assertion that unknown tools do not block retry when no activity occurred. |
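The ordering described in the table (cancel and lifecycle-close returns first, then the before-progress eligibility check, then the pre-existing completeness gate) might look something like the sketch below. It is a simplified assumption about control flow only: the `"none"` and `"ask_user"` recommendation values, the `provider_transport` cause, and the input field names `earlyRetryAllowed` and `sideEffectFactsComplete` are hypothetical, and the real `recoveryFor` has more branches than shown.

```typescript
// Hypothetical, reduced model of the decision order. Names taken from this PR:
// recoveryFor, user_cancel, local_lifecycle_close, auto_retry_once,
// side_effect_facts_incomplete. Everything else is illustrative.
type Cause = "user_cancel" | "local_lifecycle_close" | "provider_transport";
type Recommendation = "none" | "auto_retry_once" | "ask_user";
type Decision = { recommendation: Recommendation; reason?: string };

type RecoveryInput = {
  cause: Cause;
  earlyRetryAllowed: boolean; // result of the before-progress predicate
  sideEffectFactsComplete: boolean; // static boundary proof is complete
};

function recoveryFor(input: RecoveryInput): Decision {
  // 1. Cancellation and local lifecycle close are decided first, so no
  //    retry path below can override them.
  if (input.cause === "user_cancel" || input.cause === "local_lifecycle_close") {
    return { recommendation: "none", reason: input.cause };
  }
  // 2. Early fast path: safe even when the static proof is incomplete.
  if (input.earlyRetryAllowed) {
    return { recommendation: "auto_retry_once" };
  }
  // 3. Pre-existing gate (simplified): incomplete facts mean asking the user.
  if (!input.sideEffectFactsComplete) {
    return { recommendation: "ask_user", reason: "side_effect_facts_incomplete" };
  }
  return { recommendation: "auto_retry_once" };
}

const early = recoveryFor({
  cause: "provider_transport",
  earlyRetryAllowed: true,
  sideEffectFactsComplete: false,
});
console.log(early.recommendation);
```

Placing the cancel checks ahead of the fast path is what keeps a user cancel from being turned into a retry, even if the attempt otherwise looks like an early transport failure.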

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Astro-Han/pawwork#812: Introduces the initial recoveryFor policy and basic provider-disconnect auto-retry path in the same file that this PR extends with before-progress eligibility logic.
Astro-Han/pawwork#863: Updates recoveryFor decision tree to gate auto-retry based on side-effect/boundary facts completeness and adjusts early termination-cause handling, overlapping at the same retry eligibility logic that this PR refactors.

Suggested labels

P1

Poem

🐰 A transport hiccup mid-session flow,
Where tools untouched, yet boundaries below,
Now asks: did we start? before gates close,
Safe retry before the first output rose. 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: allowing safe automatic retry before the provider produces progress. |
| Description check | ✅ Passed | The PR description is comprehensive, covering summary, rationale, related issue, review focus, risks, verification steps, and all required checklist items are addressed. |
| Linked Issues check | ✅ Passed | The code changes directly address Issue `#912` by implementing the minimal fast-path solution: allowing auto-retry for retryable transport failures before first provider progress when no output/tool activity occurred and boundary snapshot permits it. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to implementing the early retry fast-path for Issue `#912`, with no unrelated refactors, dependencies, or file changes beyond the stated scope. |

github-actions

Suggested priority: P2 (includes non-doc, non-test paths outside the low-risk bucket).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

gemini-code-assist

Code Review

This pull request introduces logic to allow automatic retries before the first provider progress occurs, provided there has been no visible output or tool activity. It adds the helper function canAutoRetryBeforeFirstProviderProgress along with several supporting checks in policy.ts, and updates the recovery policy evaluation order. Additionally, comprehensive unit tests have been added in run-observability.test.ts to validate these new retry conditions and ensure they fail closed when output, tool activity, or external boundaries are detected. There are no review comments, so I have no further feedback to provide.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

packages/opencode/test/session/run-observability.test.ts (1)
1067-1105: ⚡ Quick win

Boundary fail-closed coverage is currently confounded by incomplete-facts fallback.

In the case setup at line 1068, the provider/external-boundary scenarios inherit `side_effect_facts_complete: false`, so these assertions can pass without proving that boundary evidence itself blocks auto-retry. Add variants with `side_effect_facts_complete: true` to lock this behavior in directly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/opencode/test/session/run-observability.test.ts` around lines 1067 -
1105, The provider/external-boundary cases in the test rely on
beforeProgressFacts() defaulting to side_effect_facts_complete: false, so the
assertions don't prove boundary evidence itself blocks auto-retry; update the
test by adding variants for the two boundary cases that set
side_effect_facts_complete: true inside the overrides (i.e., add entries where
the override includes side_effect_facts_complete: true alongside
side_effect_boundary_snapshot with provider_executed_capability_present or
external_boundary_present and their proof_reason) and then call
recoveryForBeforeProgress(overrides) and assert the decision does not match {
recommendation: "auto_retry_once" } to lock the behavior when facts are
complete.
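The suggested variants could be sketched as below. The helper bodies are hypothetical stand-ins written only so the example runs on its own; the real `beforeProgressFacts` and `recoveryForBeforeProgress` live in the test file and are not shown in this thread. The `proof_reason` strings and the `"ask_user"` value are placeholders, not values confirmed by the PR.

```typescript
// Stand-in types and helpers (assumptions). Field names follow the review
// comment; the helper logic and proof_reason values are illustrative only.
type Snapshot = {
  provider_executed_capability_present: boolean;
  external_boundary_present: boolean;
  proof_reason: string;
};
type Facts = {
  side_effect_facts_complete: boolean;
  side_effect_boundary_snapshot?: Snapshot;
};
type Decision = { recommendation: "auto_retry_once" | "ask_user" };

function beforeProgressFacts(overrides: Partial<Facts> = {}): Facts {
  return {
    side_effect_facts_complete: false, // the default the review points out
    side_effect_boundary_snapshot: {
      provider_executed_capability_present: false,
      external_boundary_present: false,
      proof_reason: "placeholder_default",
    },
    ...overrides,
  };
}

function recoveryForBeforeProgress(overrides: Partial<Facts> = {}): Decision {
  const snap = beforeProgressFacts(overrides).side_effect_boundary_snapshot;
  const blocked =
    snap === undefined ||
    snap.provider_executed_capability_present ||
    snap.external_boundary_present;
  return { recommendation: blocked ? "ask_user" : "auto_retry_once" };
}

// Boundary cases with facts marked complete, so a denial can only come from
// the boundary evidence and not from the incomplete-facts fallback.
const boundaryVariants: Partial<Facts>[] = [
  {
    side_effect_facts_complete: true,
    side_effect_boundary_snapshot: {
      provider_executed_capability_present: true,
      external_boundary_present: false,
      proof_reason: "placeholder_provider",
    },
  },
  {
    side_effect_facts_complete: true,
    side_effect_boundary_snapshot: {
      provider_executed_capability_present: false,
      external_boundary_present: true,
      proof_reason: "placeholder_external",
    },
  },
];

const recommendations = boundaryVariants.map(
  (overrides) => recoveryForBeforeProgress(overrides).recommendation,
);
console.log(recommendations);
```

In the actual test file these checks would presumably be written with the `bun:test` matchers, for example `expect(decision).not.toMatchObject({ recommendation: "auto_retry_once" })`, rather than by collecting values into an array.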

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/opencode/src/session/run-incident/policy.ts`:
- Around line 27-47: The before-progress causes were excluded only from the
reasoning-only branch but not from the generic transport auto-retry branch,
allowing auto_retry_once to be returned incorrectly; update the transport
auto-retry condition (the branch that currently checks retryableTransport &&
noToolActivity && !isBeforeFirstProviderProgressCause(input.cause) &&
terminalFacts.reasoning_output_started — or if that exact check is in a
different branch, add the missing guard) to explicitly check for and exclude
before-first-provider-progress causes using
isBeforeFirstProviderProgressCause(input.cause) (or reuse
canAutoRetryBeforeFirstProviderProgress) so that any branch that returns
auto_retry_once also requires the before-progress gate to have passed.
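One way to read the inline comment is that before-progress causes should reach `auto_retry_once` only through their dedicated gate, never through the generic transport branch. The sketch below illustrates that routing under stated assumptions: `isBeforeFirstProviderProgressCause`, `retryableTransport`, and `noToolActivity` are names from the comment, while the `Cause` shape, the `beforeProgressGatePassed` field, and the `shouldAutoRetryOnce` wrapper are hypothetical, and the generic branch is reduced to two conditions.

```typescript
// Hypothetical cause shape; the real cause type is not shown in this thread.
type Cause = { beforeFirstProviderProgress: boolean };

function isBeforeFirstProviderProgressCause(cause: Cause): boolean {
  return cause.beforeFirstProviderProgress;
}

type RetryInput = {
  cause: Cause;
  retryableTransport: boolean;
  noToolActivity: boolean;
  beforeProgressGatePassed: boolean; // e.g. canAutoRetryBeforeFirstProviderProgress(...)
};

function shouldAutoRetryOnce(input: RetryInput): boolean {
  // Before-progress causes are decided only by their dedicated gate.
  if (isBeforeFirstProviderProgressCause(input.cause)) {
    return input.beforeProgressGatePassed;
  }
  // Generic transport branch (simplified) for every other cause.
  return input.retryableTransport && input.noToolActivity;
}

// A before-progress failure whose gate did NOT pass must not slip through
// the generic branch, even though it is retryable and had no tool activity.
const leaked = shouldAutoRetryOnce({
  cause: { beforeFirstProviderProgress: true },
  retryableTransport: true,
  noToolActivity: true,
  beforeProgressGatePassed: false,
});
console.log(leaked);
```

Routing on the cause first makes the exclusion structural, so a later branch cannot accidentally re-admit a case the gate rejected.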

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b027bff3-d429-4c58-8165-d75f5e3b6404

📥 Commits

Reviewing files that changed from the base of the PR and between ecb8cd2 and 1d53a14.

📒 Files selected for processing (2)

packages/opencode/src/session/run-incident/policy.ts
packages/opencode/test/session/run-observability.test.ts

fix: allow safe retry before provider progress

1d53a14

Astro-Han added the bug (Something isn't working), P2 (Medium priority), and harness (Model harness, prompts, tool descriptions, and session mechanics) labels May 26, 2026

github-actions Bot reviewed May 26, 2026

gemini-code-assist Bot reviewed May 26, 2026

coderabbitai Bot reviewed May 26, 2026

Comment thread packages/opencode/src/session/run-incident/policy.ts

Astro-Han mentioned this pull request May 26, 2026

[Feature] Recover faster from stalled reasoning-model connections before safe retry #918

Closed

Astro-Han added 3 commits May 26, 2026 10:52

test: cover before-progress boundary retry denial

adfd11d

fix: reject unknown before-progress boundary proof

b2d9920

fix: require boundary snapshot before retry fallback

5e43a7d

Astro-Han merged commit f6a0e38 into dev May 26, 2026
26 checks passed

Astro-Han deleted the pawwork/issue-912-safe-retry-fastpath branch May 26, 2026 04:47

This was referenced May 26, 2026

fix: scope reasoning safe retry timeouts by attempt #922

Merged

[Task] Consolidate model execution retry pipeline #925

Closed

[Feature] Recover model runs after tool activity on transport disconnect #927

Open

coderabbitai Bot mentioned this pull request May 26, 2026

refactor(session): extract safe recovery gate #929

Merged

13 tasks

This was referenced May 31, 2026

enhancement: improve stream disconnect retry with exponential backoff and more attempts #1005

Closed

[Feature] improve stream disconnect retry with exponential backoff and more attempts #1006

Closed

coderabbitai Bot mentioned this pull request May 31, 2026

feat(session): retry safe recovery up to 3 times with exponential backoff #1008

Merged

Astro-Han merged 4 commits into dev from pawwork/issue-912-safe-retry-fastpath
