[Feature] Recover model runs after tool activity on transport disconnect #927

@Astro-Han

Revision 2 (2026-06-10) — this body is the only authoritative version. The original body proposed interactive recovery prompts (continue / retry / stop buttons). That design was dropped after a full shaping pass and two adversarial design reviews; the settled direction is silent-first recovery with zero new UI. The design-lock comment below carries the full rationale, code anchors, and the implementation brief.

What task are you trying to do?

When PawWork's connection to a model provider drops during or after tool activity, the session should recover silently and keep going — without ever repeating a side effect, and without throwing away work that already happened. Network jitter is normal for long-running agent sessions; a disconnect must feel like nothing happened, not like a crash.

Which area would this change affect?

Model harness, prompts, tools, or session mechanics

What do you do today?

#939 covers the simpler cases (no output yet, reasoning-only, text-started). For tool-phase disconnects the engine is at its worst today:

  • A transport disconnect interrupts in-flight tool executions: packages/opencode/src/session/llm.ts:269 passes the same abort signal to tool execute as to streamText, and processor.ts cleanup waits TOOL_CLEANUP_TIMEOUT_MS = 1s before marking unsettled tool parts as interrupted errors. The side effect may already have happened, the result is thrown away, and a replay would run the tool again.
  • The recovery decision table in run-incident/policy.ts (merged in #939) already derives ask_user_before_retry / offer_continue recommendations for every tool phase, but the consumption layer degrades all of them to a halt with a static English string stuffed into the assistant error (processor.ts recoveryInterruptionMessage). The session stops dead; the user cannot tell whether resending will repeat a side effect.

What would a good result look like?

One silent recovery pipeline, no interactive recovery UI:

  1. A disconnect never kills in-flight tools. Provider stream abort and local tool abort become separate signals: user cancel, lifecycle close, and session delete still abort tools; a transport disconnect does not. In-flight tools run to completion under their own timeouts and their results persist (this requires a completion path that does not depend on stream events).
  2. Normalize the broken half-turn into legal history. Keep completed tool parts; mark genuinely dead tools (process killed, drain timeout) as tool-level errors carrying an explicit, model-visible caution: "this operation may have started or completed; do not repeat it without verifying / asking the user". Never set the assistant-message-level error in recoverable paths — toModelMessages() would drop the whole turn.
  3. One recovery request, with backoff, under the shared safe-recovery budget (raised 3 → 5, renamed to a generic recovery-attempt budget): if the broken turn produced nothing visible and no tool ran, silently replay it; otherwise silently continue from the normalized history. No tool is ever re-executed by the engine.
  4. Uncertainty never halts the run. A dead tool surfaces to the model as a normal tool error with the caution text; the model keeps going and can verify state itself or ask the user conversationally. The only remaining stop is budget exhaustion (network genuinely down), which lands on the existing #939 safe_retry_failed notice.
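
A minimal sketch of the signal split in step 1, assuming two independent AbortControllers per run. Every name here (AbortReason, RunAbortScopes, createRunAbortScopes) is hypothetical and only illustrates the routing rule; it is not the existing llm.ts or processor.ts code.

```typescript
// Hypothetical reason set; the real engine may classify stops differently.
type AbortReason =
  | "user_cancel"
  | "lifecycle_close"
  | "session_delete"
  | "transport_disconnect";

interface RunAbortScopes {
  /** Would be handed to the provider stream request. */
  streamSignal: AbortSignal;
  /** Would be handed to tool execute calls. */
  toolSignal: AbortSignal;
  abort(reason: AbortReason): void;
}

function createRunAbortScopes(): RunAbortScopes {
  const stream = new AbortController();
  const tools = new AbortController();
  return {
    streamSignal: stream.signal,
    toolSignal: tools.signal,
    abort(reason: AbortReason): void {
      // Every reason tears down the provider stream.
      stream.abort(reason);
      // Only deliberate stops reach in-flight tools;
      // a transport disconnect leaves them running.
      if (reason !== "transport_disconnect") {
        tools.abort(reason);
      }
    },
  };
}
```

With this shape, a tool that only listens to toolSignal keeps running through a disconnect, which is why its result needs a completion path that does not depend on stream events.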

What would count as done?

  • A transport disconnect while a tool is executing does not interrupt the tool; its result persists and the run continues from that result after reconnect.
  • A disconnect on an empty broken turn (no visible output, no tool execution) recovers by silent replay; the user sees only the existing retry status bar.
  • A disconnect that leaves a genuinely dead tool marks that tool part as an error with the model-visible do-not-repeat caution, and the run continues; the engine never re-executes a tool call that already started.
  • Recoverable paths never set the assistant-message error; the normalized half-turn survives toModelMessages() into the next request.
  • The recovery budget is 5 attempts per turn, shared by replay and continue, with the existing backoff; exhaustion lands on the existing safe_retry_failed notice and clean idle state.
  • Integration tests (TestLLMServer, simulated disconnect) cover: mid-text, mid-tool-input, tool call materialized but not executed, tool executing and drained to completion, tool executing and genuinely dead (caution reaches model context; no re-execution), budget exhausted.
  • The interrupted-tool error card shows localized plain-language copy (what is uncertain, what to check); one snap visual check.
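
The replay / continue / halt criteria above can be sketched as a pure function, which may also be a convenient unit-test seam next to the integration tests. This is a sketch under stated assumptions: the part shapes, normalizeHalfTurn, decideRecovery, and the constant names are hypothetical; only the budget of 5, the safe_retry_failed notice, and the caution wording come from this issue. Backoff is left out.

```typescript
// Hypothetical part shapes; the real session parts will differ.
type ToolPart = {
  type: "tool";
  callID: string;
  state: "completed" | "dead"; // "dead" = process killed or drain timeout
  output?: string;
  error?: string;
};
type TextPart = { type: "text"; text: string };
type Part = ToolPart | TextPart;

type Decision =
  | { action: "replay" }
  | { action: "continue"; parts: Part[] }
  | { action: "halt"; notice: "safe_retry_failed" };

const RECOVERY_ATTEMPT_BUDGET = 5;
const DO_NOT_REPEAT_CAUTION =
  "This operation may have started or completed; do not repeat it without verifying or asking the user.";

// Keep completed tool parts as-is; attach the caution as a tool-level error
// to dead tools. No assistant-message-level error is ever set here.
function normalizeHalfTurn(parts: Part[]): Part[] {
  return parts.map((part): Part =>
    part.type === "tool" && part.state === "dead"
      ? { ...part, error: DO_NOT_REPEAT_CAUTION }
      : part,
  );
}

function decideRecovery(parts: Part[], attemptsUsed: number): Decision {
  // Budget is shared by replay and continue.
  if (attemptsUsed >= RECOVERY_ATTEMPT_BUDGET) {
    return { action: "halt", notice: "safe_retry_failed" };
  }
  const hasVisibleOutput = parts.some(
    (p) => p.type === "text" && p.text.length > 0,
  );
  const toolRan = parts.some((p) => p.type === "tool");
  // Empty broken turn: nothing visible and no tool ran, so replay is safe.
  if (!hasVisibleOutput && !toolRan) {
    return { action: "replay" };
  }
  // Otherwise continue from normalized history; no tool is re-executed.
  return { action: "continue", parts: normalizeHalfTurn(parts) };
}
```

Note that neither branch re-issues a tool call: replay is only reachable when no tool part exists, and continue hands the model the recorded results and cautions instead.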

What should stay out of scope?

  • Interactive recovery cards, continue/retry/stop buttons, a choice-reply API, sidebar pending markers — cut by design, not deferred.
  • Automatic file-snapshot rollback (unreliable and riskier than disclosure; snapshots keep recording, existing undo stays the manual path).
  • A separate evaluator model.
  • Crash/restart recovery (crash_or_restart_incomplete is a different category).
  • Provider-native resume.
  • The existing #939 recovery paths (no-output retry, reasoning-only retry, text continuation) stay unchanged.

Which audience does this matter to most?

Both

Extra context

Design history and decision evidence live in the design-lock comment below (shaping pass + two Codex adversarial reviews, 2026-06-10). Key prior art: #925 (retry pipeline + safety gate), #939 (failure classifier, run observability, recovery budget). Competitive grounding: Claude Code retries up to 10× and never lets network failure interrupt local tool execution (tools run between requests), yet still suffers orphaned-tool_use state corruption (claude-code#26729); Codex CLI blind-retries 5× then abandons the task (codex#19121, codex#18723). PawWork's differentiator: zero blind re-runs and zero unnecessary stops.

Labels

P1 (High priority), app (Application behavior and product flows), enhancement (New feature or request), harness (Model harness, prompts, tool descriptions, and session mechanics)