Skip to content

fix: allow safe retry before provider progress#914

Merged
Astro-Han merged 4 commits into
devfrom
pawwork/issue-912-safe-retry-fastpath
May 26, 2026
Merged

fix: allow safe retry before provider progress#914
Astro-Han merged 4 commits into
devfrom
pawwork/issue-912-safe-retry-fastpath

Conversation

@Astro-Han

@Astro-Han Astro-Han commented May 26, 2026

Copy link
Copy Markdown
Owner

Summary

  • Allow one automatic retry for retryable provider transport failures that happen before first provider progress when the terminal attempt produced no output and no tool activity.
  • Keep the path fail-closed when the boundary snapshot is missing, provider/external boundaries are present, lifecycle/user cancel is observed, or telemetry shows any output/tool activity.
  • Add regression coverage for the Issue [Bug] Safe retry is blocked by unknown exposed tool boundaries before provider progress #912 export shape plus negative cases for missing evidence and contradictory tool/output state.

Why

Issue #912 showed a production session where the provider returned Service Unavailable before any provider progress, assistant output, tool input, tool call, or tool execution. PawWork still asked the user before retrying because unknown exposed tool metadata made the static side-effect boundary proof incomplete. Retry safety should be based on what the terminal attempt actually reached; unknown metadata for uncalled tools should not block this early safe-retry case.

Related Issue

Closes #912

Human Review Status

  • Pending — waiting for a human reviewer to approve.

Review Focus

  • The new canAutoRetryBeforeFirstProviderProgress predicate in packages/opencode/src/session/run-incident/policy.ts, especially the fail-closed guards for missing snapshots, global lifecycle/user cancel, provider/external boundaries, and contradictory output/tool facts.
  • The negative matrix in packages/opencode/test/session/run-observability.test.ts to ensure the fast path only applies to truly early no-output/no-tool failures.

Risk Notes

The behavior change is intentionally narrow, but it affects retry policy for model transport failures. The main risk is over-retrying if evidence is incomplete; the fast path requires a recorded boundary snapshot and explicit absence of provider/external boundaries to mitigate that.

Skipped conditional checklist items:

  • Visible UI/copy check: not applicable; no visible UI or copy changed.
  • Platform impact check: not applicable; no platform, packaging, updater, signing, paths, shell, or permissions surface changed.
  • Docs/release/dependencies/etc.: not applicable; no docs, release notes, dependencies, permissions, credentials, generated content, or local file behavior changed.

How To Verify

RED check: bun test test/session/run-observability.test.ts failed on the new Issue #912 regression with side_effect_facts_incomplete before the fix.
Focused observability tests: bun test test/session/run-observability.test.ts — 62 passed.
Processor integration tests: bun test test/session/processor-effect.test.ts — 29 passed.
Typecheck: bun run typecheck in packages/opencode — passed.
Diff check: git diff --check — no whitespace errors.

Screenshots or Recordings

Not applicable; no visible UI changes.

Checklist

How to use this checklist:

  • Tick a box by replacing [ ] with [x]. Do not edit, add, or remove items.
  • The bot-applied label items can only be honestly ticked AFTER the PR is opened and the labeler / priority-triage bots have run — return to the PR description and tick them then.
  • Most items are required. The few that are conditional are explicitly marked (conditional); for those, leave unticked if they truly do not apply and explain why in Risk Notes. All other items must be ticked before requesting human review.
  • Type label — this PR carries exactly one of bug, enhancement, task, documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.
  • Routing labels — this PR carries at least one of app, ui, platform, harness, ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.
  • Priority label — this PR carries exactly one of P0, P1, P2, P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.
  • Human Review Status above is set to Pending, Approved by @<reviewer>, or Not required: <reason> (default is Pending; "not required" is restricted to bot-authored low-risk PRs).
  • I linked the related issue, or stated in Summary why there is no issue.
  • I described the review focus and any meaningful risks.
  • I replaced the example block in How To Verify with the real verification steps and the key result for each.
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope.
  • (conditional) I manually checked visible UI or copy changes when needed, with screenshots or recordings. Leave unticked only if no visible UI or copy changed.
  • (conditional) I considered macOS and Windows impact for platform, packaging, updater, signing, paths, shell, or permissions changes. Leave unticked only if no platform/packaging surface was touched.
  • (conditional) I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant. Leave unticked only if none of those surfaces was touched.
  • I reviewed the final diff for unrelated changes and suspicious dependency changes.
  • I am targeting dev, and my PR title and commit messages use Conventional Commits in English.

Summary by CodeRabbit

  • Bug Fixes

    • Improved error recovery logic with enhanced automatic retry capabilities during incident handling.
    • Refined safety checks to ensure retries are attempted only when appropriate conditions are met.
  • Tests

    • Expanded test coverage for error recovery scenarios and retry behavior validation.

Review Change Stack

@Astro-Han Astro-Han added bug Something isn't working P2 Medium priority harness Model harness, prompts, tool descriptions, and session mechanics labels May 26, 2026
@coderabbitai

coderabbitai Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

Warning

Review limit reached

@Astro-Han, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 53 minutes and 2 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9704b870-58b2-40db-8111-b7457a42227c

📥 Commits

Reviewing files that changed from the base of the PR and between 1d53a14 and 5e43a7d.

📒 Files selected for processing (2)
  • packages/opencode/src/session/run-incident/policy.ts
  • packages/opencode/test/session/run-observability.test.ts
📝 Walkthrough

Walkthrough

This PR fixes a production bug where safe retries were blocked by unknown exposed-tool boundaries even when no tools were called. It adds an early eligibility path in the incident retry policy that allows automatic retry for retryable transport failures occurring before first provider progress when no output or tool activity has been observed, regardless of incomplete boundary metadata.

Changes

Before-progress auto-retry eligibility

Layer / File(s) Summary
Before-progress auto-retry policy and helpers
packages/opencode/src/session/run-incident/policy.ts
recoveryFor early decision tree adds canAutoRetryBeforeFirstProviderProgress eligibility check for retryable transport/timeout causes occurring before first provider progress with no tool/output activity and boundary snapshot permission. New internal helpers determine cause subcategory, detect absence of all activity flags, and gate on boundary proof/external/provider-executed capability. user_cancel and local_lifecycle_close returns moved earlier in decision flow.
Before-progress auto-retry test fixtures and validation
packages/opencode/test/session/run-observability.test.ts
Adds shared test helpers (beforeProgressCause, beforeProgressFacts, recoveryForBeforeProgress) for before-progress retry scenarios. Test cases verify auto-retry is allowed when unknown tools have no activity, denied when boundary snapshot is missing or any boundary evidence appears, and denied for non-retryable transport errors. Replaces earlier "unknown boundary" test with assertion that unknown tools do not block retry when no activity occurred.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Astro-Han/pawwork#812: Introduces the initial recoveryFor policy and basic provider-disconnect auto-retry path in the same file that this PR extends with before-progress eligibility logic.
  • Astro-Han/pawwork#863: Updates recoveryFor decision tree to gate auto-retry based on side-effect/boundary facts completeness and adjusts early termination-cause handling, overlapping at the same retry eligibility logic that this PR refactors.

Suggested labels

P1

Poem

🐰 A transport hiccup mid-session flow,
Where tools untouched, yet boundaries below,
Now asks: did we start? before gates close,
Safe retry before the first output rose. 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: allowing safe automatic retry before the provider produces progress.
Description check ✅ Passed The PR description is comprehensive, covering summary, rationale, related issue, review focus, risks, verification steps, and all required checklist items are addressed.
Linked Issues check ✅ Passed The code changes directly address Issue #912 by implementing the minimal fast-path solution: allowing auto-retry for retryable transport failures before first provider progress when no output/tool activity occurred and boundary snapshot permits it.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing the early retry fast-path for Issue #912, with no unrelated refactors, dependencies, or file changes beyond the stated scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pawwork/issue-912-safe-retry-fastpath

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested priority: P2 (includes non-doc, non-test paths outside the low-risk bucket).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces logic to allow automatic retries before the first provider progress occurs, provided there has been no visible output or tool activity. It adds the helper function canAutoRetryBeforeFirstProviderProgress along with several supporting checks in policy.ts, and updates the recovery policy evaluation order. Additionally, comprehensive unit tests have been added in run-observability.test.ts to validate these new retry conditions and ensure they fail closed when output, tool activity, or external boundaries are detected. There are no review comments, so I have no further feedback to provide.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/opencode/test/session/run-observability.test.ts (1)

1067-1105: ⚡ Quick win

Boundary fail-closed coverage is currently confounded by incomplete-facts fallback.

In Line 1068 case setup, provider/external-boundary scenarios inherit side_effect_facts_complete: false, so these assertions can pass without proving boundary evidence itself blocks auto-retry. Add variants with side_effect_facts_complete: true to lock this behavior directly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/opencode/test/session/run-observability.test.ts` around lines 1067 -
1105, The provider/external-boundary cases in the test rely on
beforeProgressFacts() defaulting to side_effect_facts_complete: false, so the
assertions don't prove boundary evidence itself blocks auto-retry; update the
test by adding variants for the two boundary cases that set
side_effect_facts_complete: true inside the overrides (i.e., add entries where
the override includes side_effect_facts_complete: true alongside
side_effect_boundary_snapshot with provider_executed_capability_present or
external_boundary_present and their proof_reason) and then call
recoveryForBeforeProgress(overrides) and assert the decision does not match {
recommendation: "auto_retry_once" } to lock the behavior when facts are
complete.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/opencode/src/session/run-incident/policy.ts`:
- Around line 27-47: The before-progress causes were excluded only from the
reasoning-only branch but not from the generic transport auto-retry branch,
allowing auto_retry_once to be returned incorrectly; update the transport
auto-retry condition (the branch that currently checks retryableTransport &&
noToolActivity && !isBeforeFirstProviderProgressCause(input.cause) &&
terminalFacts.reasoning_output_started — or if that exact check is in a
different branch, add the missing guard) to explicitly check for and exclude
before-first-provider-progress causes using
isBeforeFirstProviderProgressCause(input.cause) (or reuse
canAutoRetryBeforeFirstProviderProgress) so that any branch that returns
auto_retry_once also requires the before-progress gate to have passed.

---

Nitpick comments:
In `@packages/opencode/test/session/run-observability.test.ts`:
- Around line 1067-1105: The provider/external-boundary cases in the test rely
on beforeProgressFacts() defaulting to side_effect_facts_complete: false, so the
assertions don't prove boundary evidence itself blocks auto-retry; update the
test by adding variants for the two boundary cases that set
side_effect_facts_complete: true inside the overrides (i.e., add entries where
the override includes side_effect_facts_complete: true alongside
side_effect_boundary_snapshot with provider_executed_capability_present or
external_boundary_present and their proof_reason) and then call
recoveryForBeforeProgress(overrides) and assert the decision does not match {
recommendation: "auto_retry_once" } to lock the behavior when facts are
complete.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b027bff3-d429-4c58-8165-d75f5e3b6404

📥 Commits

Reviewing files that changed from the base of the PR and between ecb8cd2 and 1d53a14.

📒 Files selected for processing (2)
  • packages/opencode/src/session/run-incident/policy.ts
  • packages/opencode/test/session/run-observability.test.ts

Comment thread packages/opencode/src/session/run-incident/policy.ts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working harness Model harness, prompts, tool descriptions, and session mechanics P2 Medium priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Safe retry is blocked by unknown exposed tool boundaries before provider progress

1 participant