Revision 2 (2026-06-10) — this body is the only authoritative version. The original body proposed interactive recovery prompts (continue / retry / stop buttons). That design was dropped after a full shaping pass and two adversarial design reviews; the settled direction is silent-first recovery with zero new UI. The design-lock comment below carries the full rationale, code anchors, and the implementation brief.
What task are you trying to do?
When PawWork's connection to a model provider drops during or after tool activity, the session should recover silently and keep going — without ever repeating a side effect, and without throwing away work that already happened. Network jitter is normal for long-running agent sessions; a disconnect must feel like nothing happened, not like a crash.
Which area would this change affect?
Model harness, prompts, tools, or session mechanics
What do you do today?
#939 covers the simpler cases (no output yet, reasoning-only, text-started). For tool-phase disconnects the engine is at its worst today:
A transport disconnect interrupts in-flight tool executions: packages/opencode/src/session/llm.ts:269 passes the same abort signal to tool execute as to streamText, and processor.ts cleanup waits TOOL_CLEANUP_TIMEOUT_MS = 1s, then marks unsettled tool parts as interrupted errors. The side effect may already have happened, the result is thrown away, and a replay would run the tool again.
The recovery decision table in run-incident/policy.ts (merged in #939, "[Bug] Reasoning-only UND_ERR_SOCKET disconnects do not auto recover") already derives ask_user_before_retry / offer_continue recommendations for every tool phase, but the consumption layer degrades all of them to a halt with a static English string stuffed into the assistant error (processor.ts recoveryInterruptionMessage). The session stops dead; the user cannot tell whether resending will repeat a side effect.
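The coupling described above can be reproduced in a few lines. This is a minimal sketch, not the real llm.ts code: startTool is a hypothetical stand-in for a tool's execute function, and the only assumption is that the tool and the provider stream receive the same AbortSignal.

```typescript
// Hypothetical stand-in for a tool's execute(); not the real PawWork code.
function startTool(signal: AbortSignal): { status: () => string } {
  let state = "running";
  // The tool gives up as soon as its signal aborts, even if its side
  // effect has already happened by then.
  signal.addEventListener("abort", () => {
    if (state === "running") state = "interrupted";
  });
  return { status: () => state };
}

// One controller shared by the provider stream and the tool.
const shared = new AbortController();
const tool = startTool(shared.signal);

// A transport disconnect aborts the provider stream...
shared.abort();

// ...and the in-flight tool is interrupted as collateral damage.
console.log(tool.status()); // "interrupted"
```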
What would a good result look like?
One silent recovery pipeline, no interactive recovery UI:
A disconnect never kills in-flight tools. Provider stream abort and local tool abort become separate signals: user cancel, lifecycle close, and session delete still abort tools; a transport disconnect does not. In-flight tools run to completion under their own timeouts and their results persist (this requires a completion path that does not depend on stream events).
Normalize the broken half-turn into legal history. Keep completed tool parts; mark genuinely dead tools (process killed, drain timeout) as tool-level errors carrying an explicit, model-visible caution: "this operation may have started or completed; do not repeat it without verifying / asking the user". Never set the assistant-message-level error in recoverable paths — toModelMessages() would drop the whole turn.
One recovery request, with backoff, under the shared safe-recovery budget (raised 3 → 5, renamed to a generic recovery-attempt budget): if the broken turn produced nothing visible and no tool ran, silently replay it; otherwise silently continue from the normalized history. No tool is ever re-executed by the engine.
Uncertainty never halts the run. A dead tool surfaces to the model as a normal tool error with the caution text; the model keeps going: it can verify state itself or ask the user conversationally. The only remaining stop is budget exhaustion (network genuinely down), which lands on the existing safe_retry_failed notice from #939.
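The signal split and the replay-or-continue choice described in the list above could look roughly like the following. This is a sketch under stated assumptions, not the planned implementation: createTurnSignals, decideRecovery, AbortCause, BrokenTurn, and RECOVERY_ATTEMPT_BUDGET are hypothetical names invented for illustration; only the rules they encode (which causes abort tools, the budget of 5, replay only when nothing visible was produced and no tool ran) come from this issue.

```typescript
// Hypothetical names throughout; only the rules come from the issue text.
type AbortCause =
  | "transport_disconnect"
  | "user_cancel"
  | "lifecycle_close"
  | "session_delete";

interface TurnSignals {
  stream: AbortSignal; // handed to the provider stream
  tools: AbortSignal; // handed to tool execute
  abort(cause: AbortCause): void;
}

function createTurnSignals(): TurnSignals {
  const stream = new AbortController();
  const tools = new AbortController();
  return {
    stream: stream.signal,
    tools: tools.signal,
    abort(cause) {
      // Every cause stops the provider stream.
      stream.abort();
      // Only deliberate stops reach the tools; a disconnect does not.
      if (cause !== "transport_disconnect") tools.abort();
    },
  };
}

type RecoveryAction = "replay" | "continue" | "halt";

interface BrokenTurn {
  producedVisibleOutput: boolean;
  anyToolRan: boolean;
  attemptsUsed: number; // shared by replay and continue
}

const RECOVERY_ATTEMPT_BUDGET = 5;

function decideRecovery(turn: BrokenTurn): RecoveryAction {
  // Budget exhaustion is the only remaining stop.
  if (turn.attemptsUsed >= RECOVERY_ATTEMPT_BUDGET) return "halt";
  // An empty broken turn is safe to replay from scratch.
  if (!turn.producedVisibleOutput && !turn.anyToolRan) return "replay";
  // Otherwise continue from the normalized history; no tool is re-executed.
  return "continue";
}

const disconnected = createTurnSignals();
disconnected.abort("transport_disconnect");
console.log(disconnected.stream.aborted, disconnected.tools.aborted); // true false

const cancelled = createTurnSignals();
cancelled.abort("user_cancel");
console.log(cancelled.stream.aborted, cancelled.tools.aborted); // true true

console.log(decideRecovery({ producedVisibleOutput: false, anyToolRan: false, attemptsUsed: 0 })); // "replay"
console.log(decideRecovery({ producedVisibleOutput: true, anyToolRan: true, attemptsUsed: 2 })); // "continue"
console.log(decideRecovery({ producedVisibleOutput: true, anyToolRan: true, attemptsUsed: 5 })); // "halt"
```

Keeping the cause as an explicit argument makes the asymmetry visible at the call site: the disconnect handler cannot abort tools by accident, because the tool controller is only reachable through the three deliberate causes.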
What would count as done?
A transport disconnect while a tool is executing does not interrupt the tool; its result persists and the run continues from that result after reconnect.
A disconnect on an empty broken turn (no visible output, no tool execution) recovers by silent replay; the user sees only the existing retry status bar.
A disconnect that leaves a genuinely dead tool marks that tool part as an error with the model-visible do-not-repeat caution, and the run continues; the engine never re-executes a tool call that already started.
Recoverable paths never set the assistant-message error; the normalized half-turn survives toModelMessages() into the next request.
The recovery budget is 5 attempts per turn, shared by replay and continue, with the existing backoff; exhaustion lands on the existing safe_retry_failed notice and clean idle state.
Integration tests (TestLLMServer, simulated disconnect) cover: mid-text, mid-tool-input, tool call materialized but not executed, tool executing and drained to completion, tool executing and genuinely dead (caution reaches model context; no re-execution), budget exhausted.
The interrupted-tool error card shows localized plain-language copy (what is uncertain, what to check); one snap visual check.
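The criteria above on dead tools and on the assistant-message error can be pictured as a pure normalization step over the broken turn. The sketch below is illustrative only: the Part and AssistantTurn shapes, normalizeHalfTurn, and CAUTION are hypothetical and do not mirror the real message schema. It assumes the function runs after the drain, so any tool part still marked running is genuinely dead.

```typescript
// Hypothetical part shapes; the real session schema may differ.
type ToolPart =
  | { kind: "tool"; callID: string; state: "completed"; output: string }
  | { kind: "tool"; callID: string; state: "running" }
  | { kind: "tool"; callID: string; state: "error"; error: string };
type TextPart = { kind: "text"; text: string };
type Part = ToolPart | TextPart;

interface AssistantTurn {
  parts: Part[];
  error?: string; // assistant-message-level error
}

// Caution wording taken from the issue text.
const CAUTION =
  "This operation may have started or completed; do not repeat it without verifying or asking the user.";

// Assumed to run after the drain: a part still "running" is genuinely dead.
function normalizeHalfTurn(turn: AssistantTurn): AssistantTurn {
  const parts = turn.parts.map((part): Part => {
    if (part.kind === "tool" && part.state === "running") {
      // Tool-level error that the model can see; the call is never re-run.
      return { kind: "tool", callID: part.callID, state: "error", error: CAUTION };
    }
    // Text and settled tool parts are kept untouched.
    return part;
  });
  // Deliberately no `error` on the turn, so the half-turn stays in history.
  return { parts };
}

const broken: AssistantTurn = {
  parts: [
    { kind: "text", text: "Updating the config now." },
    { kind: "tool", callID: "call_1", state: "completed", output: "ok" },
    { kind: "tool", callID: "call_2", state: "running" },
  ],
};

const normalized = normalizeHalfTurn(broken);
console.log(normalized.error); // undefined
console.log(normalized.parts.length); // 3
```

Because the function returns a new turn and never touches the completed part, the same input can be normalized again after a second disconnect without changing the result.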
What should stay out of scope?
Interactive recovery cards, continue/retry/stop buttons, a choice-reply API, sidebar pending markers — cut by design, not deferred.
Automatic file-snapshot rollback (unreliable and riskier than disclosure; snapshots keep recording, existing undo stays the manual path).
A separate evaluator model.
Crash/restart recovery (crash_or_restart_incomplete is a different category).
Which audience does this matter to most?
Both
Extra context
Design history and decision evidence live in the design-lock comment below (shaping pass + two Codex adversarial reviews, 2026-06-10). Key prior art: #925 (retry pipeline + safety gate), #939 (failure classifier, run observability, recovery budget). Competitive grounding: Claude Code retries up to 10× and never lets network failure interrupt local tool execution (tools run between requests), yet still suffers orphaned-tool_use state corruption (claude-code#26729); Codex CLI blind-retries 5× then abandons the task (codex#19121, codex#18723). PawWork's differentiator: zero blind re-runs and zero unnecessary stops.