Goal
Consolidate PawWork's model execution retry path so that ordinary provider retry behavior and PawWork's safe-recovery checks run through one session retry pipeline.
The desired long-term shape is:
Keep /global/event SSE reconnect separate because it only restores event delivery and must not re-run model requests.
Reuse the existing session retry engine for retry execution mechanics: attempt count, retry-after/backoff, status updates, retry events, and terminal retry classification.
Keep PawWork's safe-recovery logic as a safety gate inside that pipeline, not as a second retry loop in session.processor.
Preserve the behavior of the short-term fix in #922 ("fix: scope reasoning safe retry timeouts by attempt") while the migration happens: a reasoning model's first attempt can fail fast, the one automatic safe retry keeps the longer protected timeout, and there is no automatic third safe-recovery attempt.
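The per-attempt timeout rule in the last item can be sketched as a small pure function. This is only an illustration: the function and constant names are hypothetical, and the 60s and 120s values are taken from the "60s -> 120s" shorthand used under Verification rather than from the actual #922 code.

```typescript
// Hypothetical names. The values assume the "60s -> 120s" progression
// mentioned in this issue; the real #922 implementation may differ.
const FIRST_ATTEMPT_TIMEOUT_MS = 60_000
const SAFE_RETRY_TIMEOUT_MS = 120_000
// First attempt plus exactly one automatic safe retry.
const MAX_SAFE_RECOVERY_ATTEMPTS = 2

// Returns the timeout for a 1-based attempt number, or undefined when
// no further automatic safe-recovery attempt is allowed.
function safeRecoveryTimeoutMs(attempt: number): number | undefined {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError(`attempt must be a positive integer, got ${attempt}`)
  }
  if (attempt > MAX_SAFE_RECOVERY_ATTEMPTS) return undefined
  return attempt === 1 ? FIRST_ATTEMPT_TIMEOUT_MS : SAFE_RETRY_TIMEOUT_MS
}
```

Returning undefined for a third attempt keeps "no automatic third attempt" an explicit result that callers must handle, instead of an implicit loop bound.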
Scope
In scope:
Audit the current split between SessionRetry.policy, session.processor safe-recovery retry handling, run-incident recovery decisions, run observability, and UI retry presentation.
Define a single model execution retry pipeline with clear layer ownership.
Keep the 60s -> 120s reasoning safe-retry behavior stable during the migration.
Keep conservative behavior for visible output, tool input, materialized tool calls, tool execution, external boundaries, provider-executed capabilities, user cancel, lifecycle close, quota, context overflow, and other non-retryable failures.
Out of scope:
Replacing /global/event SSE reconnect with the model retry pipeline.
Implementing full turn resume after partial output or tool execution.
Redesigning atomic file writes or tool idempotency in this issue.
Broad provider retry taxonomy changes beyond what is needed to plug the safety gate into the retry pipeline.
Promising that every interrupted run can be recovered automatically.
Proposed design
Treat safe recovery as a gate, not an executor.
Flow:
model stream fails
-> classify whether the failure is technically retryable
-> derive run-incident safety from observability
-> if both allow retry, the retry engine schedules and emits retry state
-> processor re-enters the next attempt through the same pipeline
Layer ownership:
SessionRetry.policy or its successor owns retry execution mechanics: attempt numbering, retry-after / backoff delay, max attempts, retry status payload shape, and retry event emission.
RunIncident.recoveryFor / a small extracted safety module owns product safety: whether this failed attempt can be automatically replayed, requires user confirmation, should offer continue/resume, or should stop.
session.processor should orchestrate the stream attempts, but should not maintain a separate ad hoc retry engine with its own counter, sleep, presentation predicate, and terminal behavior.
UI copy and safe_retry_failed presentation should key off stable retry decision metadata instead of processor-local predicate names.
This keeps the important PawWork safety check while avoiding three competing retry mechanisms.
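One way to picture "gate, not executor" is a single pure decision step that the processor calls after each failed attempt. The sketch below is an assumption-laden illustration, not the existing SessionRetry.policy API: every type and function name is hypothetical, and the exponential backoff is a placeholder for whatever delay rule the engine actually uses.

```typescript
// Hypothetical types. The four safety verdicts mirror the outcomes listed
// under layer ownership: replay, confirm, continue/resume, or stop.
type SafetyVerdict = "auto_retry" | "needs_confirmation" | "offer_resume" | "stop"

interface FailedAttempt {
  attempt: number // 1-based attempt that just failed
  technicallyRetryable: boolean // engine-side classification
  retryAfterMs?: number // provider hint, if any
  safety: SafetyVerdict // gate-side verdict derived from observability
}

interface EnginePolicy {
  maxAttempts: number
  baseDelayMs: number
}

type NextStep =
  | { action: "retry"; nextAttempt: number; delayMs: number }
  | {
      action: "stop"
      reason: "not_retryable" | "blocked_by_safety" | "attempts_exhausted"
      safety: SafetyVerdict
    }

// The gate can only veto. Scheduling (attempt numbering and delay) stays
// with the engine, so there is a single counter and a single sleep.
function nextStep(failure: FailedAttempt, policy: EnginePolicy): NextStep {
  const { attempt, safety } = failure
  if (!failure.technicallyRetryable) return { action: "stop", reason: "not_retryable", safety }
  if (safety !== "auto_retry") return { action: "stop", reason: "blocked_by_safety", safety }
  if (attempt >= policy.maxAttempts) return { action: "stop", reason: "attempts_exhausted", safety }
  // Assumed backoff rule: exponential unless the provider sent retry-after.
  const backoffMs = policy.baseDelayMs * 2 ** (attempt - 1)
  return { action: "retry", nextAttempt: attempt + 1, delayMs: failure.retryAfterMs ?? backoffMs }
}
```

With this shape the processor only orchestrates: it runs an attempt, builds a FailedAttempt from the error and the run observability, and either sleeps and re-enters or surfaces the stop reason.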
Suggested migration slices
Extract the safety gate behind a small pure API.
Move the safe-recovery boundary checks out of session.processor into a module owned by run-incident / observability. This PR should be mostly mechanical and should not change behavior.
Add retry-decision metadata that can carry both engine and gate results.
The processor should be able to distinguish technical retryability from safety permission without duplicating predicates such as reasoningOnlySafeRetry versus beforeProgressSafeRetry.
Route safe-recovery scheduling through the retry engine.
Replace processor-local counter/sleep/status mechanics with the shared retry policy path while preserving current behavior: one automatic safe retry, existing retry status semantics, lifecycle-close interruption handling, and the timeout policy from #922 ("fix: scope reasoning safe retry timeouts by attempt").
Clean up UI / observability naming and tests.
Ensure retry state, notices, exports, and tests make it clear whether a retry was blocked by technical classification, blocked by safety, attempted by the engine, or completed/failed after retry.
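The retry-decision metadata from the second slice, and the four states the last slice wants to keep distinguishable, might look roughly like this. The field and label names are hypothetical. The point is only that engine and gate results travel as separate fields and that the UI derives its label from them instead of from processor-local predicates.

```typescript
// Hypothetical metadata shape: engine and gate verdicts are kept apart.
interface RetryDecision {
  attempt: number // 1-based attempt the decision refers to
  engine: { retryable: boolean; reason: string }
  gate: { allowed: boolean; reason: string }
  // Set only once a scheduled retry has finished.
  retryOutcome?: "completed" | "failed"
}

type RetryLabel =
  | "blocked_by_technical_classification"
  | "blocked_by_safety"
  | "attempted_by_engine"
  | "completed_after_retry"
  | "failed_after_retry"

// Derives a stable presentation label from the metadata alone.
function labelDecision(decision: RetryDecision): RetryLabel {
  if (!decision.engine.retryable) return "blocked_by_technical_classification"
  if (!decision.gate.allowed) return "blocked_by_safety"
  if (decision.retryOutcome === undefined) return "attempted_by_engine"
  return decision.retryOutcome === "completed" ? "completed_after_retry" : "failed_after_retry"
}
```

Checking the engine verdict before the gate verdict means a failure that is both non-retryable and unsafe is reported as a technical block. The opposite order would be just as defensible, as long as it is fixed and tested.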
Risks
A migration that only moves code could accidentally broaden automatic retry beyond the current safe boundary.
A migration that only centralizes retry mechanics could lose PawWork-specific safety proof, especially around tool calls and external/provider-executed boundaries.
Retry attempt counting can become misleading if ordinary provider retries and safe-recovery attempts are conflated without clear metadata.
SSE reconnect may be confused with model retry unless naming and tests keep the boundary explicit.
Acceptance criteria
There is one model execution retry pipeline for ordinary retryable provider failures and safe-recovery retries.
Safe-recovery checks still block automatic retry after visible output, text output, reasoning output where applicable, tool input, materialized tool calls, tool execution, unsafe side effects, external boundaries, provider-executed capabilities, user cancel, lifecycle close, quota, context overflow, and other non-retryable failures.
The behavior from #922 ("fix: scope reasoning safe retry timeouts by attempt") remains intact: a reasoning model's first safe-recovery-eligible attempt uses the shorter timeout, the automatic safe retry keeps the longer timeout, and no automatic third safe-recovery attempt is made.
/global/event reconnect remains independent and does not trigger model re-execution.
Tests cover both the allowed safe-recovery path and conservative blocked paths.
Observability can answer: whether the failure was technically retryable, whether the safety gate allowed it, whether a retry was attempted, which attempt timeout was used, and how the retry ended.
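The blocking conditions in the criteria above lend themselves to a pure gate over what observability recorded for the failed attempt. The sketch below is illustrative only: the boundary names are hypothetical stand-ins, not the real run-observability fields. Quota and context overflow are left out on the assumption that they belong to the technical classification, and "reasoning output where applicable" is left out because its condition is not specified here.

```typescript
// Hypothetical boundary names, listed in the order they are reported.
const BOUNDARIES = [
  "visibleOutput",
  "toolInput",
  "materializedToolCall",
  "toolExecution",
  "externalBoundary",
  "providerExecutedCapability",
  "userCancel",
  "lifecycleClose",
] as const

type Boundary = (typeof BOUNDARIES)[number]

// True means the failed attempt crossed that boundary before failing.
type AttemptObservation = Record<Boundary, boolean>

// Every boundary crossed by the attempt blocks automatic replay.
function safetyBlockers(seen: AttemptObservation): Boundary[] {
  return BOUNDARIES.filter((boundary) => seen[boundary])
}

// Automatic replay is allowed only when nothing was crossed.
function canAutoRetry(seen: AttemptObservation): boolean {
  return safetyBlockers(seen).length === 0
}
```

Returning the full blocker list rather than a bare boolean lets the tests for blocked paths assert on the specific reason, and gives observability something concrete to record.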
Related work:
Upstream OpenCode has ordinary SessionRetry.policy provider retry behavior and separate SSE reconnect behavior.
DeepSeek v4-pro review agreed with the direction: keep SSE reconnect separate, treat retry execution as the engine, and keep PawWork safe recovery as a safety gate inside the model execution retry pipeline.
Verification
Add or update unit tests around SessionRetry.policy / the new retry pipeline for provider retry, safe-recovery retry, and blocked safety decisions.
Keep or add processor-level tests for 60s -> 120s, no automatic third safe-recovery attempt, safe-retry notice behavior, lifecycle close, user cancel, quota, context overflow, visible output, tool input, materialized tool calls, and tool execution.
Verify /global/event reconnect tests still pass and do not imply model retry.
Run targeted opencode/session tests and UI retry contract tests.
For visible retry copy changes, run the existing safe-retry snap target or equivalent visual check.
Execution mode
Design changes require approval before implementation. Implementation plans for already-approved design slices do not need a separate approval gate; agents may post the plan and proceed with code and PR work when the slice stays within the approved design.
Relevant files or context
Likely files:
packages/opencode/src/session/retry.ts
packages/opencode/src/session/processor.ts
packages/opencode/src/session/run-incident/policy.ts
packages/opencode/src/session/run-observability/recorder.ts
packages/opencode/src/session/run-observability/types.ts
packages/ui/src/components/session-retry.tsx
packages/ui/src/components/message-part/parts/notice.tsx
packages/app/src/context/global-sdk.tsx