
refactor(session): route safe recovery through retry policy#931

Merged
Astro-Han merged 4 commits into dev from codex/i925-route-safe-recovery-retry-policy on May 26, 2026


@Astro-Han Astro-Han commented May 26, 2026

Owner

Summary

Routes safe-recovery replay scheduling through the session retry policy path. This is PR 3 of #925.

This adds a dedicated safe-recovery retry policy with its own one-replay budget metadata and lightweight retry presentation, then has session.processor use that policy instead of building the retry status and backoff timing locally.
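
A minimal sketch of what a one-replay budget policy could look like, written in plain TypeScript rather than with the Effect Schedule API this PR uses. The names `SafeRecoveryStep`, `Exhausted`, and `makeSafeRecoveryPolicy`, and the field layout, are hypothetical and not taken from retry.ts. The one-replay budget comes from the description above; the delay value is an arbitrary placeholder.

```typescript
// Hypothetical shapes for illustration; the real SessionRetry types may differ.
type RetryPresentation = "safe-recovery" | "provider"

interface SafeRecoveryStep {
  kind: "retry"
  attempt: number // 1-based replay number
  maxAttempts: number // replay budget carried as metadata
  delayMs: number // wait before replaying
  presentation: RetryPresentation
  reason: string
}

interface Exhausted {
  kind: "exhausted"
  attempts: number
}

// Returns a stateful step function: each call either schedules one more
// replay or reports that the budget is spent.
function makeSafeRecoveryPolicy(opts: { maxAttempts: number; delayMs: number }) {
  let used = 0
  return (reason: string): SafeRecoveryStep | Exhausted => {
    if (used >= opts.maxAttempts) return { kind: "exhausted", attempts: used }
    used += 1
    return {
      kind: "retry",
      attempt: used,
      maxAttempts: opts.maxAttempts,
      delayMs: opts.delayMs,
      presentation: "safe-recovery",
      reason,
    }
  }
}

const step = makeSafeRecoveryPolicy({ maxAttempts: 1, delayMs: 500 })
const first = step("stream closed before progress")
const second = step("stream closed before progress")
console.log(first.kind, second.kind) // prints "retry exhausted"
```

Keeping the counter inside the policy, rather than in the caller, is one way to keep the replay budget independent of any ordinary provider retry counter.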

Why

#925 is consolidating model execution retry behavior so provider retry mechanics and PawWork safe-recovery checks share one retry pipeline. PR #928 added unified retry decision metadata, and PR #929 extracted the safety gate. This PR moves the remaining safe-recovery scheduling mechanics out of the processor-local branch while preserving the current #922 behavior.

Related Issue

Closes part of #925.

Human Review Status

Pending

Review Focus

Please check that SessionRetry.safeRecoveryPolicy keeps safe-recovery replay metadata separate from ordinary provider retry attempts, and that session.processor still preserves the existing lifecycle-close checks, reasoning retry timeout behavior, and safe-retry failure notice behavior.

Risk Notes

Safe-recovery replay behavior is on the model execution path, so the main risk is accidentally broadening automatic replay. This PR keeps the existing safety gate and processor-level constraints in place, and only moves retry status/backoff scheduling to the retry policy layer.

Skipped conditional checklist items:

  • Visible UI or copy check: not applicable because this PR does not change visible UI or copy.
  • macOS/Windows platform check: not applicable because this PR does not touch platform, packaging, updater, signing, paths, shell, or permissions behavior.
  • Docs/release/dependency/local-file check: not applicable because this PR does not change docs, release notes, dependencies, permissions, credentials, generated content, or local files.

How To Verify

TDD RED: bun test test/session/retry.test.ts -t "safe recovery policy emits lightweight retry presentation"
Result: failed first because SessionRetry.safeRecoveryPolicy did not exist.

Retry policy tests: bun test test/session/retry.test.ts test/session/retry-decision.test.ts test/session/run-incident-safety-gate.test.ts
Result: 43 passed, 0 failed.

Processor safe retry: bun test test/session/processor-effect.test.ts -t "reasoning connect watchdog is attempt-scoped for before-progress safe retry"
Result: 1 passed, 0 failed.

Processor safe retry exhaustion: bun test test/session/processor-effect.test.ts -t "reasoning-only retry writes a notice after the one safe retry is exhausted"
Result: 1 passed, 0 failed.

Processor visible output guard: bun test test/session/processor-effect.test.ts -t "retryable stream error after visible output does not replay the assistant message"
Result: 1 passed, 0 failed.

Typecheck: bun run typecheck
Result: passed.

Diff check: git diff --check
Result: passed.

Screenshots or Recordings

Not applicable. This PR does not change visible UI.

Checklist

How to use this checklist:

  • Tick a box by replacing [ ] with [x]. Do not edit, add, or remove items.
  • The bot-applied label items can only be honestly ticked AFTER the PR is opened and the labeler / priority-triage bots have run — return to the PR description and tick them then.
  • Most items are required. The few that are conditional are explicitly marked (conditional); for those, leave unticked if they truly do not apply and explain why in Risk Notes. All other items must be ticked before requesting human review.
  • Type label — this PR carries exactly one of bug, enhancement, task, documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.
  • Routing labels — this PR carries at least one of app, ui, platform, harness, ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.
  • Priority label — this PR carries exactly one of P0, P1, P2, P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.
  • Human Review Status above is set to Pending, Approved by @<reviewer>, or Not required: <reason> (default is Pending; "not required" is restricted to bot-authored low-risk PRs).
  • I linked the related issue, or stated in Summary why there is no issue.
  • I described the review focus and any meaningful risks.
  • I replaced the example block in How To Verify with the real verification steps and the key result for each.
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope.
  • (conditional) I manually checked visible UI or copy changes when needed, with screenshots or recordings. Leave unticked only if no visible UI or copy changed.
  • (conditional) I considered macOS and Windows impact for platform, packaging, updater, signing, paths, shell, or permissions changes. Leave unticked only if no platform/packaging surface was touched.
  • (conditional) I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant. Leave unticked only if none of those surfaces was touched.
  • I reviewed the final diff for unrelated changes and suspicious dependency changes.
  • I am targeting dev, and my PR title and commit messages use Conventional Commits in English.

Summary by CodeRabbit

  • New Features

    • Added configurable safe recovery retry policy for improved session stability and resilience
  • Refactor

    • Enhanced session recovery retry mechanism with dynamic scheduling for better reliability


@Astro-Han added the labels enhancement (New feature or request), P2 (Medium priority), and harness (Model harness, prompts, tool descriptions, and session mechanics) on May 26, 2026

coderabbitai Bot commented May 26, 2026

Contributor
Walkthrough

This PR refactors the session processor's safe-recovery auto-replay retry mechanism from fixed backoff constants to effect-based scheduling. A new safeRecoveryPolicy function in SessionRetry provides configurable replay delay and attempt limits, which the processor uses to schedule retries with proper metadata tracking and failure handling.

Changes

Safe Recovery Policy and Scheduling Refactoring

| Layer / File(s) | Summary |
| --- | --- |
| Safe recovery policy contract and configuration (packages/opencode/src/session/retry.ts) | Introduces SAFE_RECOVERY_REPLAY_DELAY and SAFE_RECOVERY_MAX_ATTEMPTS configuration constants and the safeRecoveryPolicy function that builds a step-based schedule terminating after max attempts, recording safe-recovery presentation state with each step. |
| Safe recovery policy test coverage (packages/opencode/test/session/retry.test.ts) | Adds two tests validating the safe recovery policy: one asserts the emitted retry entry includes safe-recovery presentation and reason metadata, the other verifies the policy terminates after the replay budget is exhausted and records the expected state. |
| Processor safe recovery scheduling integration (packages/opencode/src/session/processor.ts) | Imports Schedule, creates safeRecoveryStep from the new policy, and replaces fixed-sleep safe-recovery retry backoff with scheduled step execution; on scheduling failure, writes safe-retry-failed notice and terminates the retry loop. |
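
The processor-side flow summarized for processor.ts can be sketched roughly as follows. This is a simplified model in plain TypeScript with promises, not the Effect-based code in the PR. `runWithSafeRecovery`, `Deps`, `nextStep`, `sleep`, and `writeNotice` are hypothetical names. Only the safe-retry-failed notice name is taken from the summary, and the 250 ms delay is a placeholder.

```typescript
type Step = { kind: "retry"; delayMs: number } | { kind: "exhausted" }

interface Deps {
  attempt: () => Promise<"ok" | "recoverable-failure">
  nextStep: () => Step // stand-in for the policy step
  sleep: (ms: number) => Promise<void>
  writeNotice: (notice: string) => void
}

async function runWithSafeRecovery(deps: Deps): Promise<"ok" | "failed"> {
  while (true) {
    const result = await deps.attempt()
    if (result === "ok") return "ok"

    const step = deps.nextStep()
    if (step.kind === "exhausted") {
      // Scheduling failed: surface the notice and leave the retry loop.
      deps.writeNotice("safe-retry-failed")
      return "failed"
    }
    // The wait comes from the policy, not from a constant in the processor.
    await deps.sleep(step.delayMs)
  }
}

// Usage: a one-replay budget and an attempt that never succeeds.
async function demo(): Promise<{ outcome: string; notices: string[]; sleeps: number[] }> {
  let budget = 1
  const notices: string[] = []
  const sleeps: number[] = []
  const outcome = await runWithSafeRecovery({
    attempt: async () => "recoverable-failure",
    nextStep: () => (budget-- > 0 ? { kind: "retry", delayMs: 250 } : { kind: "exhausted" }),
    sleep: async (ms) => {
      sleeps.push(ms)
    },
    writeNotice: (n) => {
      notices.push(n)
    },
  })
  return { outcome, notices, sleeps }
}
```

In this model the loop sleeps once, attempts one replay, and then writes the notice when the policy refuses a second step.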

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

  • Astro-Han/pawwork#922: Both PRs modify the session processor's safe-recovery/safe-retry retry flow; this PR refactors the safe-recovery wait to use SessionRetry.safeRecoveryPolicy/Schedule while that PR changes the safe-recovery retry condition and failure gating logic.
  • Astro-Han/pawwork#929: Both PRs affect safe-recovery replay execution; this PR changes how the processor schedules the safe-recovery wait via the new policy, while that PR modifies the upstream replay/safety decision logic in buildModelRetryDecision that determines when the safe-recovery path is taken.

Poem

🐰 Schedules bloom where fixed delays once stood,
Safe recovery now flows as it should,
Effect and policy dance hand in hand,
Retrying with grace across the codebase land!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: refactoring safe recovery to use the retry policy path rather than local processor logic. |
| Description check | ✅ Passed | The description is well-structured with all required sections complete: Summary, Why, Related Issue, Human Review Status, Review Focus, Risk Notes, How To Verify, Screenshots/Recordings, and a fully-completed Checklist with conditional items properly marked. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the safe recovery retry logic in the session processor to use a structured safeRecoveryPolicy schedule defined in retry.ts, replacing the previous inline status updates and backoff delays. It also adds corresponding unit tests. However, a critical runtime issue was identified in retry.ts where Cause.done is called; this method does not exist in the imported Cause namespace from the effect library, which will lead to a TypeError when the maximum retry attempts are exceeded.
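
To illustrate why a missing namespace member would surface only once the retry budget is exhausted: in JavaScript, calling a property whose value is undefined throws a TypeError at the call site, so a branch that normal runs never reach can hide the problem. The sketch below uses a hypothetical `FakeCause` object, not the real `Cause` module from effect, and makes no claim about which members that module exports.

```typescript
// Hypothetical namespace-like object; `done` is deliberately absent.
const FakeCause: Record<string, (() => string) | undefined> = {
  fail: () => "fail",
}

function nextStep(attempt: number, maxAttempts: number): string {
  if (attempt < maxAttempts) return "retry"
  // Only reached on exhaustion: FakeCause.done is undefined here,
  // so calling it throws a TypeError.
  return (FakeCause.done as () => string)()
}

function probe(attempt: number): string {
  try {
    return nextStep(attempt, 1)
  } catch (e) {
    return e instanceof TypeError ? "TypeError" : "other"
  }
}

console.log(probe(0)) // prints "retry"
console.log(probe(1)) // prints "TypeError"
```

A test that drives the policy past its budget is the kind of check that would exercise such a branch.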

Comment thread packages/opencode/src/session/retry.ts
@Astro-Han Astro-Han force-pushed the codex/i925-route-safe-recovery-retry-policy branch 2 times, most recently from 125197a to 4b68c3b Compare May 26, 2026 11:10
@Astro-Han Astro-Han force-pushed the codex/i925-route-safe-recovery-retry-policy branch from 4b68c3b to d625a16 Compare May 26, 2026 11:12
@Astro-Han

Owner Author

Temporarily closing and reopening to retrigger missing required GitHub Actions checks; no code change.

@Astro-Han Astro-Han closed this May 26, 2026
@Astro-Han Astro-Han reopened this May 26, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor


🧹 Nitpick comments (1)
packages/opencode/test/session/retry.test.ts (1)

131-207: ⚡ Quick win

Use testEffect(...)/it.effect(...) for these new Effect-based tests.

These new cases run Effect workflows directly via Effect.runPromise(...); please migrate them to the repo’s Effect test harness pattern (const it = testEffect(...) + it.effect(...)) to stay consistent with test runtime semantics.

As per coding guidelines: “Use testEffect(...) from test/lib/effect.ts for tests that exercise Effect services or Effect-based workflows. Use it.effect(...) when the test should run with TestClock and TestConsole.”
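
To illustrate why a test clock matters for delay-based retry tests, here is a dependency-free sketch of a virtual clock. It is not the repo's testEffect helper or Effect's TestClock, whose exact signatures are not shown in this thread. `VirtualClock` and its methods are hypothetical, and the 30-second delay is a placeholder.

```typescript
// Minimal virtual clock: timers fire only when the test advances time.
class VirtualClock {
  private now = 0
  private timers: { at: number; run: () => void }[] = []

  schedule(delayMs: number, run: () => void): void {
    this.timers.push({ at: this.now + delayMs, run })
  }

  advance(ms: number): void {
    this.now += ms
    const due = this.timers.filter((t) => t.at <= this.now)
    this.timers = this.timers.filter((t) => t.at > this.now)
    due.sort((a, b) => a.at - b.at).forEach((t) => t.run())
  }
}

// A replay scheduled 30 seconds out is observed without waiting in real time.
const clock = new VirtualClock()
const events: string[] = []
clock.schedule(30_000, () => events.push("replay"))

clock.advance(29_999)
const before = events.length // 0: the delay has not elapsed yet
clock.advance(1)
const after = events.length // 1: the replay fired exactly at the deadline
console.log(before, after) // prints "0 1"
```

With real timers the same assertion would either wait out the full delay or depend on wall-clock timing, which is the kind of nondeterminism a test clock avoids.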

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/opencode/test/session/retry.test.ts` around lines 131 - 207, Replace
the plain Jest tests that call Effect.runPromise with the repository's Effect
test harness: wrap the file with the testEffect helper (e.g., const it =
testEffect(...)) and convert the two tests to it.effect(...) so they run under
the Effect runtime/TestClock semantics; update the two test blocks ("safe
recovery policy emits..." and "safe recovery policy stops...") to call
SessionRetry.safeRecoveryPolicy and other Effect-based code inside the it.effect
callback instead of calling Effect.runPromise directly, preserving the same
assertions and references to SessionID, SessionStatus.Service.use,
Schedule.toStepWithMetadata, and Exit/Pull checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: fbd15e1f-c92f-4df5-a7d7-44f338640fde

📥 Commits

Reviewing files that changed from the base of the PR and between bafb19b and b26d661.

📒 Files selected for processing (3)
  • packages/opencode/src/session/processor.ts
  • packages/opencode/src/session/retry.ts
  • packages/opencode/test/session/retry.test.ts

@github-actions github-actions Bot left a comment


Suggested priority: P2 (includes non-doc, non-test paths outside the low-risk bucket).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

