[Task] Consolidate model execution retry pipeline #925

## Goal

Consolidate PawWork's model execution retry path so ordinary provider retry behavior and PawWork's safe-recovery checks run through one session retry pipeline.

The desired long-term shape is:

- Keep `/global/event` SSE reconnect separate because it only restores event delivery and must not re-run model requests.
- Reuse the existing session retry engine for retry execution mechanics: attempt count, retry-after/backoff, status updates, retry events, and terminal retry classification.
- Keep PawWork's safe-recovery logic as a safety gate inside that pipeline, not as a second retry loop in `session.processor`.
- Preserve the short-term #922 behavior during the migration: the first reasoning-model attempt can fail fast, the single automatic safe retry keeps the longer protected timeout, and there is no automatic third safe-recovery attempt.
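
As a rough illustration of the #922 behavior in the last bullet, the timeout choice can be thought of as a function of the attempt number. This is only a sketch: `reasoningAttemptTimeoutMs` and the constants are hypothetical names, and the 60s / 120s values are taken from the `60s -> 120s` shorthand used elsewhere in this issue rather than from the current code.

```typescript
// Hypothetical constants mirroring the `60s -> 120s` shorthand in this issue.
const FIRST_ATTEMPT_TIMEOUT_MS = 60_000
const SAFE_RETRY_TIMEOUT_MS = 120_000
// First attempt plus exactly one automatic safe retry.
const MAX_AUTOMATIC_ATTEMPTS = 2

// Returns the timeout for a 1-based attempt number, or undefined when
// no further automatic safe-recovery attempt should be made.
function reasoningAttemptTimeoutMs(attempt: number): number | undefined {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError(`attempt must be a positive integer, got ${attempt}`)
  }
  if (attempt > MAX_AUTOMATIC_ATTEMPTS) return undefined
  return attempt === 1 ? FIRST_ATTEMPT_TIMEOUT_MS : SAFE_RETRY_TIMEOUT_MS
}
```

Returning `undefined` for a third attempt keeps "no automatic third attempt" as an explicit result that tests can assert on, instead of an implicit loop bound.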

## Scope

In scope:

- Audit the current split between `SessionRetry.policy`, `session.processor` safe-recovery retry handling, run-incident recovery decisions, run observability, and UI retry presentation.
- Define a single model execution retry pipeline with clear layer ownership:
  - Retry Engine: technical retryability, retry-after/backoff, attempts, retry status/event emission.
  - Safety Gate: run-observability based decision about whether re-running the model turn is safe.
  - UI / Observability: user-facing retry state, final notices, and diagnostic evidence about both engine and gate decisions.
- Migrate in small PRs instead of one broad rewrite.
- Keep #922's `60s -> 120s` reasoning safe-retry behavior stable during the migration.
- Keep conservative behavior for visible output, tool input, materialized tool calls, tool execution, external boundaries, provider-executed capabilities, user cancel, lifecycle close, quota, context overflow, and other non-retryable failures.

Out of scope:

- Replacing `/global/event` SSE reconnect with the model retry pipeline.
- Implementing full turn resume after partial output or tool execution.
- Redesigning atomic file writes or tool idempotency in this issue.
- Broad provider retry taxonomy changes beyond what is needed to plug the safety gate into the retry pipeline.
- Promising that every interrupted run can be recovered automatically.

## Proposed design

Treat safe recovery as a gate, not an executor.

Flow:

```text
model stream fails
  -> classify whether the failure is technically retryable
  -> derive run-incident safety from observability
  -> if both allow retry, the retry engine schedules and emits retry state
  -> processor re-enters the next attempt through the same pipeline
```
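
The flow above could be sketched as one pure decision function that combines the two verdicts. All type and function names here (`TechnicalVerdict`, `SafetyVerdict`, `decideRetry`) are hypothetical and are not the existing `SessionRetry.policy` API; the point is only that the gate contributes a verdict and never schedules anything itself.

```typescript
// Hypothetical result of the retry engine's technical classification.
type TechnicalVerdict = { retryable: boolean }

// Hypothetical result of the safety gate, derived from run observability.
type SafetyVerdict = { safe: true } | { safe: false; reason: string }

type RetryDecision =
  | { action: "retry"; nextAttempt: number }
  | { action: "stop"; blockedBy: "technical" | "safety" | "max_attempts"; reason?: string }

// Combines both verdicts; only the engine side decides about attempts.
function decideRetry(
  technical: TechnicalVerdict,
  safety: SafetyVerdict,
  attempt: number,
  maxAttempts: number,
): RetryDecision {
  if (!technical.retryable) return { action: "stop", blockedBy: "technical" }
  if (!safety.safe) return { action: "stop", blockedBy: "safety", reason: safety.reason }
  if (attempt >= maxAttempts) return { action: "stop", blockedBy: "max_attempts" }
  return { action: "retry", nextAttempt: attempt + 1 }
}
```

For example, `decideRetry({ retryable: true }, { safe: false, reason: "tool executed" }, 1, 2)` stops with `blockedBy: "safety"`, even though the failure was technically retryable.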

Layer ownership:

- `SessionRetry.policy` or its successor owns retry execution mechanics: attempt numbering, retry-after / backoff delay, max attempts, retry status payload shape, and retry event emission.
- `RunIncident.recoveryFor`, or a small extracted safety module, owns product safety: whether this failed attempt can be automatically replayed, requires user confirmation, should offer continue/resume, or should stop.
- `session.processor` should orchestrate the stream attempts, but should not maintain a separate ad hoc retry engine with its own counter, sleep, presentation predicate, and terminal behavior.
- UI copy and `safe_retry_failed` presentation should key off stable retry decision metadata instead of processor-local predicate names.

This keeps the important PawWork safety check while avoiding three competing retry mechanisms.
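
To make the engine's share of this ownership concrete, here is a minimal sketch of the delay mechanics it would keep: honor a provider retry-after hint when present, otherwise fall back to capped exponential backoff. The function name `retryDelayMs` and the default values are assumptions for illustration and do not describe what `SessionRetry.policy` does today.

```typescript
// Hypothetical backoff settings; the real policy may use different values.
interface BackoffOptions {
  baseMs: number
  factor: number
  maxMs: number
}

const DEFAULT_BACKOFF: BackoffOptions = { baseMs: 1_000, factor: 2, maxMs: 30_000 }

// Computes the delay before the next attempt. `attempt` is the 1-based
// number of the attempt that just failed.
function retryDelayMs(
  attempt: number,
  retryAfterMs: number | undefined,
  opts: BackoffOptions = DEFAULT_BACKOFF,
): number {
  // A provider-supplied retry-after hint takes priority over backoff.
  if (retryAfterMs !== undefined && retryAfterMs >= 0) return retryAfterMs
  const exponential = opts.baseMs * Math.pow(opts.factor, attempt - 1)
  return Math.min(exponential, opts.maxMs)
}
```

Nothing in this function knows about tool calls or visible output; that separation is what lets the safety gate stay a pure check.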

## Suggested migration slices

1. Extract the safety gate behind a small pure API.

   Move the safe-recovery boundary checks out of `session.processor` into a module owned by run-incident / observability. This PR should be mostly mechanical and should not change behavior.

2. Add retry-decision metadata that can carry both engine and gate results.

   The processor should be able to distinguish technical retryability from safety permission without duplicating predicates such as `reasoningOnlySafeRetry` versus `beforeProgressSafeRetry`.

3. Route safe-recovery scheduling through the retry engine.

   Replace processor-local counter/sleep/status mechanics with the shared retry policy path while preserving current behavior: one automatic safe retry, existing retry status semantics, lifecycle-close interruption handling, and #922's timeout policy.

4. Clean up UI / observability naming and tests.

   Ensure retry state, notices, exports, and tests make it clear whether a retry was blocked by technical classification, blocked by safety, attempted by the engine, or completed/failed after retry.
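
A minimal sketch of what slices 1, 2, and 4 might produce: a pure safety gate over an observation snapshot, plus decision metadata that separates the four outcomes listed above. Every name here (`AttemptObservation`, `safetyGate`, `RetryDecisionMeta`, `outcomeOf`) is hypothetical, and the observation fields are a subset of the boundaries named in this issue. Quota and context overflow are assumed to belong to technical classification and are left out.

```typescript
// Hypothetical snapshot of what run observability saw in the failed attempt.
interface AttemptObservation {
  visibleOutput: boolean
  toolInput: boolean
  toolCallMaterialized: boolean
  toolExecuted: boolean
  externalBoundary: boolean
  providerExecuted: boolean
  userCancelled: boolean
  lifecycleClosed: boolean
}

type GateResult =
  | { allowed: true }
  | { allowed: false; blockedBy: keyof AttemptObservation }

// Pure gate: any observed boundary blocks automatic replay.
function safetyGate(obs: AttemptObservation): GateResult {
  const keys = Object.keys(obs) as (keyof AttemptObservation)[]
  const hit = keys.find((key) => obs[key])
  return hit ? { allowed: false, blockedBy: hit } : { allowed: true }
}

// Metadata carrying both engine and gate results for UI and exports.
interface RetryDecisionMeta {
  technicallyRetryable: boolean
  gate: GateResult
  // Undefined while the retry is scheduled or still in flight.
  retryResult?: "succeeded" | "failed"
}

type RetryOutcome =
  | "blocked_technical"
  | "blocked_safety"
  | "attempted"
  | "completed_after_retry"
  | "failed_after_retry"

function outcomeOf(meta: RetryDecisionMeta): RetryOutcome {
  if (!meta.technicallyRetryable) return "blocked_technical"
  if (!meta.gate.allowed) return "blocked_safety"
  if (meta.retryResult === "succeeded") return "completed_after_retry"
  if (meta.retryResult === "failed") return "failed_after_retry"
  return "attempted"
}
```

With metadata shaped like this, UI copy and tests could key off `outcomeOf(meta)` rather than processor-local predicate names.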

## Risks

- A migration that only moves code could accidentally broaden automatic retry beyond the current safe boundary.
- A migration that only centralizes retry mechanics could lose PawWork-specific safety proof, especially around tool calls and external/provider-executed boundaries.
- Retry attempt counting can become misleading if ordinary provider retries and safe-recovery attempts are conflated without clear metadata.
- The #922 reasoning-model timeout behavior can regress if timeout selection is moved before the safety decision is available.
- SSE reconnect may be confused with model retry unless naming and tests keep the boundary explicit.

## Acceptance criteria

- There is one model execution retry pipeline for ordinary retryable provider failures and safe-recovery retries.
- Safe-recovery checks still block automatic retry after visible output, text output, reasoning output where applicable, tool input, materialized tool calls, tool execution, unsafe side effects, external boundaries, provider-executed capabilities, user cancel, lifecycle close, quota, context overflow, and other non-retryable failures.
- #922 behavior remains intact: on a reasoning model, the first safe-recovery-eligible attempt uses the shorter timeout, the automatic safe retry keeps the longer timeout, and no automatic third safe-recovery attempt is made.
- `/global/event` reconnect remains independent and does not trigger model re-execution.
- Tests cover both the allowed safe-recovery path and conservative blocked paths.
- Observability can answer: whether the failure was technically retryable, whether the safety gate allowed it, whether a retry was attempted, which attempt timeout was used, and how the retry ended.

## Relevant files or context

Likely files:

- `packages/opencode/src/session/retry.ts`
- `packages/opencode/src/session/processor.ts`
- `packages/opencode/src/session/run-incident/policy.ts`
- `packages/opencode/src/session/run-observability/recorder.ts`
- `packages/opencode/src/session/run-observability/types.ts`
- `packages/ui/src/components/session-retry.tsx`
- `packages/ui/src/components/message-part/parts/notice.tsx`
- `packages/app/src/context/global-sdk.tsx`

Related work:

- #914 allowed safe retry before first provider progress when no output or tool activity happened.
- #918 / #922 apply the short-term reasoning-model `60s -> 120s` timeout strategy.
- Upstream OpenCode has ordinary `SessionRetry.policy` provider retry behavior and separate SSE reconnect behavior.

DeepSeek v4-pro review agreed with the direction: keep SSE reconnect separate, treat retry execution as the engine, and keep PawWork safe recovery as a safety gate inside the model execution retry pipeline.

## Verification

- Add or update unit tests around `SessionRetry.policy` / the new retry pipeline for provider retry, safe-recovery retry, and blocked safety decisions.
- Keep or add processor-level tests for `60s -> 120s`, no automatic third safe-recovery attempt, safe-retry notice behavior, lifecycle close, user cancel, quota, context overflow, visible output, tool input, materialized tool calls, and tool execution.
- Verify `/global/event` reconnect tests still pass and do not imply model retry.
- Run targeted opencode/session tests and UI retry contract tests.
- For visible retry copy changes, run the existing safe-retry snap target or equivalent visual check.

## Execution mode

Design changes require approval before implementation. Implementation plans for already-approved design slices do not need a separate approval gate; agents may post the plan and proceed with code and PR work when the slice stays within the approved design.
