[FEATURE] Streaming Resilience: Detect network loss, save in-flight state, and auto-resume on reconnect

## Problem

When using Claude Code on an unstable network (WiFi drops, power outages, VPN reconnects, mobile hotspot switching, laptop sleep/wake), a mid-task disconnection leads to a cascade of problems:

1. **The CLI hangs indefinitely** — no SSE events arrive, but no timeout triggers. Escape/Ctrl+C often don't work. The only recovery is killing the process.
2. **Conversation state gets corrupted** — orphaned `tool_use` blocks without corresponding `tool_result` blocks break the message history, causing API 400 errors on retry.
3. **In-flight work is lost** — partial streaming responses, pending tool calls, and task context (todo list state, which files were being edited) disappear.
4. **`--resume` doesn't know what was in-progress** — it restores conversation history but Claude has no awareness that it was interrupted mid-task. Users must manually prompt "you were cut off, continue from here" and paste partial output. Claude often hallucinates that it already finished.

This isn't an edge case. Anyone working from a café, on mobile hotspot, in a region with unreliable power, or behind a corporate VPN hits this regularly.

## Current Workaround

1. Notice the CLI is frozen (sometimes only after minutes of waiting)
2. Kill all Claude processes (`pkill -f claude`)
3. Restart with `claude --resume` or `claude --continue`
4. Manually re-explain what was happening: "Your last response was cut off due to a connection loss. You were editing src/auth.ts and had 3 more files to update. Continue from where you left off."
5. Hope Claude doesn't hallucinate that it already completed the work

## Proposed Solution

A three-layer approach to network resilience:

### Layer 1: Stream Watchdog (Don't hang forever)

- Monitor the SSE stream for event gaps. If no event (including `ping`) is received within a configurable timeout (default: 30s), treat the connection as dead.
- Gracefully abort the hung request instead of freezing the CLI.
- Surface a clear message: "Connection lost. Your session has been saved."
- This alone would fix the most painful symptom — the indefinite hang.

### Layer 2: Recovery Snapshot (Save in-flight state on disconnect)

When a disconnect is detected, persist a lightweight recovery snapshot alongside the session JSONL:

```json
{
  "session_id": "abc123",
  "disconnect_phase": "streaming | tool_execution | between_turns",
  "active_tool_calls": [
    {
      "tool_use_id": "tu_xyz",
      "tool": "Edit",
      "file": "src/auth.ts",
      "status": "pending_result"
    }
  ],
  "partial_response_text": "Let me update the authentication...",
  "last_committed_message_index": 42,
  "last_file_checkpoint": "cp_789",
  "pending_todos": [...]
}
```

On resume, this snapshot tells Claude exactly what was happening and what remains incomplete — eliminating the need for users to manually re-explain context.

### Layer 3: Auto-Resume on Reconnect

- Monitor network reachability (OS-level events + periodic lightweight health checks).
- When connectivity is restored, verify stability (2+ consecutive successful pings to avoid flapping).
- Repair conversation state: inject synthetic `tool_result` with `"error": "connection_lost"` for any orphaned `tool_use` blocks.
- Apply resume strategy based on what was interrupted:
  - **Read-only tools** (Glob, Grep, Read): safe to auto-retry
  - **File mutations** (Edit, Write): check file against checkpoint before retrying
  - **Bash commands**: prompt user before re-running (side effects unknown)
  - **Streaming text**: re-send the last user message with added context about the interruption
- Use exponential backoff (1s → 2s → 4s → ... → 30s max) for retry attempts.

### Configuration

All of this should be opt-in/configurable:

```jsonc
{
  "network_resilience": {
    "enabled": true,
    "stream_timeout_ms": 30000,
    "auto_resume": true,
    "auto_retry_readonly_tools": true,
    "auto_retry_mutations": false
  }
}
```

## Evidence of Demand

This proposal consolidates a pattern seen across 100+ issues in this repo. A few representative ones:

**Hanging / freezing on disconnect:**
- #15151 — CLI hangs indefinitely on network interruption
- #1121 — Network disconnection causes CC to remain permanently offline
- #19060 — CLI freezes with "No messages returned," never recovers
- #18289 — Hangs during processing, requires manual intervention

**Connection errors / ECONNRESET:**
- #5674 — Persistent ECONNRESET errors on macOS
- #23744 — Fails to reconnect after network change, requires restart
- #24614 — ECONNRESET kills session, needs auto-retry/reconnect
- #4297 — API Error (Connection error)

**Resume / session recovery gaps:**
- #6254 — Auto resume (feature request)
- #16607 — Allow agent to pick up where it was when interrupted
- #24304 — Conversation history missing on resume
- #21280 — Conversation history lost on reconnect

**Retry and network awareness:**
- #23115 — Add configurable retry behavior for API errors
- #25429 — Hook to trigger when network issue
- #9995 — Network reachability monitoring for offline handling
- #22597 — Auto-continue session after user timeout

**Conversation corruption after interruption:**
- #6836 — tool_use/tool_result block mismatch (150+ reports)
- #21041 — Orphaned tool_use blocks cause conversation corruption

## Prior Art

- The SSE specification itself supports reconnection via `Last-Event-ID` header — clients can resume from where they left off if the server supports it.
- The Anthropic TypeScript SDK already has an open discussion about streaming idle timeout (anthropics/anthropic-sdk-typescript#867).
- VS Code's language server protocol handles connection drops with automatic reconnection.

## Incremental Path

This doesn't need to ship as one monolithic change:

1. **Stream watchdog + graceful timeout** — immediate relief for the hanging problem
2. **Recovery snapshots** — enables informed manual resume
3. **Auto-resume engine** — the full seamless experience

Even just Layer 1 would dramatically improve the experience for users on unreliable networks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] Streaming Resilience: Detect network loss, save in-flight state, and auto-resume on reconnect #26729

Problem

Current Workaround

Proposed Solution

Layer 1: Stream Watchdog (Don't hang forever)

Layer 2: Recovery Snapshot (Save in-flight state on disconnect)

Layer 3: Auto-Resume on Reconnect

Configuration

Evidence of Demand

Prior Art

Incremental Path

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[FEATURE] Streaming Resilience: Detect network loss, save in-flight state, and auto-resume on reconnect #26729

Description

Problem

Current Workaround

Proposed Solution

Layer 1: Stream Watchdog (Don't hang forever)

Layer 2: Recovery Snapshot (Save in-flight state on disconnect)

Layer 3: Auto-Resume on Reconnect

Configuration

Evidence of Demand

Prior Art

Incremental Path

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions