[Feature] Recover model runs after tool activity on transport disconnect #927

> **Revision 2 (2026-06-10) — this body is the only authoritative version.** The original body proposed interactive recovery prompts (continue / retry / stop buttons). That design was dropped after a full shaping pass and two adversarial design reviews; the settled direction is silent-first recovery with zero new UI. The design-lock comment below carries the full rationale, code anchors, and the implementation brief.

### What task are you trying to do?

When PawWork's connection to a model provider drops during or after tool activity, the session should recover silently and keep going — without ever repeating a side effect, and without throwing away work that already happened. Network jitter is normal for long-running agent sessions; a disconnect must feel like nothing happened, not like a crash.

### Which area would this change affect?

Model harness, prompts, tools, or session mechanics

### What do you do today?

#939 covers the simpler cases (no output yet, reasoning-only, text-started). For tool-phase disconnects the engine is at its worst today:

- A transport disconnect interrupts in-flight tool executions (`packages/opencode/src/session/llm.ts:269` passes the same abort signal to tool `execute` as to `streamText`; `processor.ts` cleanup waits `TOOL_CLEANUP_TIMEOUT_MS = 1s` then marks unsettled tool parts as interrupted errors). The side effect may already have happened, the result is thrown away, and a replay would run the tool again.
- The recovery decision table in `run-incident/policy.ts` (merged in #939) already derives `ask_user_before_retry` / `offer_continue` recommendations for every tool phase, but the consumption layer degrades all of them to a halt with a static English string stuffed into the assistant error (`processor.ts` `recoveryInterruptionMessage`). The session stops dead; the user cannot tell whether resending will repeat a side effect.

### What would a good result look like?

One silent recovery pipeline, no interactive recovery UI:

1. **A disconnect never kills in-flight tools.** Provider stream abort and local tool abort become separate signals: user cancel, lifecycle close, and session delete still abort tools; a transport disconnect does not. In-flight tools run to completion under their own timeouts and their results persist (this requires a completion path that does not depend on stream events).
2. **Normalize the broken half-turn into legal history.** Keep completed tool parts; mark genuinely dead tools (process killed, drain timeout) as tool-level errors carrying an explicit, model-visible caution: "this operation may have started or completed; do not repeat it without verifying / asking the user". Never set the assistant-message-level error in recoverable paths — `toModelMessages()` would drop the whole turn.
3. **One recovery request, with backoff, under the shared safe-recovery budget** (raised 3 → 5, renamed to a generic recovery-attempt budget): if the broken turn produced nothing visible and no tool ran, silently replay it; otherwise silently continue from the normalized history. No tool is ever re-executed by the engine.
4. **Uncertainty never halts the run.** A dead tool surfaces to the model as a normal tool error with the caution text; the model keeps going — it can verify state itself or ask the user conversationally. The only remaining stop is budget exhaustion (network genuinely down), which lands on the existing #939 `safe_retry_failed` notice.
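The signal split in step 1 could be sketched roughly as follows. This is a minimal illustration, not the code in `llm.ts`: `createRunAbortScopes`, `AbortReason`, and the scope names are hypothetical, and it only assumes the standard `AbortController` API.

```typescript
// Hypothetical names throughout; this is a sketch, not PawWork's actual API.
type AbortReason =
  | "user_cancel"
  | "lifecycle_close"
  | "session_delete"
  | "transport_disconnect"

interface RunAbortScopes {
  /** Passed to the provider stream request. */
  streamSignal: AbortSignal
  /** Passed to tool execute calls. */
  toolSignal: AbortSignal
  abort(reason: AbortReason): void
}

function createRunAbortScopes(): RunAbortScopes {
  const stream = new AbortController()
  const tools = new AbortController()
  return {
    streamSignal: stream.signal,
    toolSignal: tools.signal,
    abort(reason: AbortReason): void {
      // Every reason tears down the provider stream.
      stream.abort(reason)
      // Only deliberate stops reach in-flight tools; a transport
      // disconnect leaves them running under their own timeouts.
      if (reason !== "transport_disconnect") tools.abort(reason)
    },
  }
}
```

With this shape, the disconnect handler would call `abort("transport_disconnect")` and in-flight tools would never observe it, while user cancel, lifecycle close, and session delete would still stop both the stream and the tools.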

### What would count as done?

- A transport disconnect while a tool is executing does not interrupt the tool; its result persists and the run continues from that result after reconnect.
- A disconnect on an empty broken turn (no visible output, no tool execution) recovers by silent replay; the user sees only the existing retry status bar.
- A disconnect that leaves a genuinely dead tool marks that tool part as an error with the model-visible do-not-repeat caution, and the run continues; the engine never re-executes a tool call that already started.
- Recoverable paths never set the assistant-message error; the normalized half-turn survives `toModelMessages()` into the next request.
- The recovery budget is 5 attempts per turn, shared by replay and continue, with the existing backoff; exhaustion lands on the existing `safe_retry_failed` notice and clean idle state.
- Integration tests (TestLLMServer, simulated disconnect) cover: mid-text, mid-tool-input, tool call materialized but not executed, tool executing and drained to completion, tool executing and genuinely dead (caution reaches model context; no re-execution), budget exhausted.
- The interrupted-tool error card shows localized plain-language copy (what is uncertain, what to check); one snap visual check.
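Several of the criteria above can be expressed as small pure functions that the integration tests could assert against. The sketch below is illustrative only: `ToolPart`, `normalizeHalfTurn`, `chooseRecovery`, and `RECOVERY_ATTEMPT_BUDGET` are hypothetical names, the part shape is an assumption rather than PawWork's real schema, and the backoff schedule is left out because this issue does not specify it.

```typescript
// All names and shapes here are hypothetical, for illustration only.
type ToolPart =
  | { kind: "tool"; callId: string; state: "completed"; output: string }
  | { kind: "tool"; callId: string; state: "running" }
  | { kind: "tool"; callId: string; state: "error"; error: string }

const DO_NOT_REPEAT_CAUTION =
  "This operation may have started or completed; " +
  "do not repeat it without verifying or asking the user."

// Keep completed parts as they are. A part still marked running after the
// drain has given up is treated as dead and becomes a tool-level error that
// carries the model-visible caution. No assistant-message-level error is set.
function normalizeHalfTurn(parts: ToolPart[]): ToolPart[] {
  return parts.map((part): ToolPart =>
    part.state === "running"
      ? { kind: "tool", callId: part.callId, state: "error", error: DO_NOT_REPEAT_CAUTION }
      : part,
  )
}

type RecoveryAction = "replay" | "continue" | "stop"

interface BrokenTurn {
  hasVisibleOutput: boolean
  /** True if any tool call started executing during the broken turn. */
  toolRan: boolean
}

// Budget of 5 attempts per turn, shared by replay and continue.
const RECOVERY_ATTEMPT_BUDGET = 5

function chooseRecovery(turn: BrokenTurn, attemptsUsed: number): RecoveryAction {
  // Budget exhausted: land on the existing safe_retry_failed notice.
  if (attemptsUsed >= RECOVERY_ATTEMPT_BUDGET) return "stop"
  // Nothing visible and no tool ran: replaying cannot repeat a side effect.
  if (!turn.hasVisibleOutput && !turn.toolRan) return "replay"
  // Otherwise continue from the normalized history; never re-run a tool.
  return "continue"
}
```

Neither function re-executes anything: the only way a started tool call reaches the next request is as a completed result or as an error carrying the caution.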

### What should stay out of scope?

- Interactive recovery cards, continue/retry/stop buttons, a choice-reply API, sidebar pending markers — cut by design, not deferred.
- Automatic file-snapshot rollback (unreliable and riskier than disclosure; snapshots keep recording, existing undo stays the manual path).
- A separate evaluator model.
- Crash/restart recovery (`crash_or_restart_incomplete` is a different category).
- Provider-native resume.
- The existing #939 recovery paths (no-output retry, reasoning-only retry, text continuation) — unchanged.

### Which audience does this matter to most?

Both

### Extra context

Design history and decision evidence live in the design-lock comment below (shaping pass + two Codex adversarial reviews, 2026-06-10). Key prior art: #925 (retry pipeline + safety gate), #939 (failure classifier, run observability, recovery budget). Competitive grounding: Claude Code retries up to 10× and never lets network failure interrupt local tool execution (tools run between requests), yet still suffers orphaned-tool_use state corruption ([claude-code#26729](https://github.com/anthropics/claude-code/issues/26729)); Codex CLI blind-retries 5× then abandons the task ([codex#19121](https://github.com/openai/codex/issues/19121), [codex#18723](https://github.com/openai/codex/issues/18723)). PawWork's differentiator: zero blind re-runs and zero unnecessary stops.

