Skip to content

[FEATURE] Streaming Resilience: Detect network loss, save in-flight state, and auto-resume on reconnect #26729

@SaravananJaichandar

Description

@SaravananJaichandar

Problem

When using Claude Code on an unstable network (WiFi drops, power outages, VPN reconnects, mobile hotspot switching, laptop sleep/wake), a mid-task disconnection leads to a cascade of problems:

  1. The CLI hangs indefinitely — no SSE events arrive, but no timeout triggers. Escape/Ctrl+C often don't work. The only recovery is killing the process.
  2. Conversation state gets corrupted — orphaned tool_use blocks without corresponding tool_result blocks break the message history, causing API 400 errors on retry.
  3. In-flight work is lost — partial streaming responses, pending tool calls, and task context (todo list state, which files were being edited) disappear.
  4. --resume doesn't know what was in-progress — it restores conversation history but Claude has no awareness that it was interrupted mid-task. Users must manually prompt "you were cut off, continue from here" and paste partial output. Claude often hallucinates that it already finished.

This isn't an edge case. Anyone working from a café, on mobile hotspot, in a region with unreliable power, or behind a corporate VPN hits this regularly.

Current Workaround

  1. Notice the CLI is frozen (sometimes only after minutes of waiting)
  2. Kill all Claude processes (pkill -f claude)
  3. Restart with claude --resume or claude --continue
  4. Manually re-explain what was happening: "Your last response was cut off due to a connection loss. You were editing src/auth.ts and had 3 more files to update. Continue from where you left off."
  5. Hope Claude doesn't hallucinate that it already completed the work

Proposed Solution

A three-layer approach to network resilience:

Layer 1: Stream Watchdog (Don't hang forever)

  • Monitor the SSE stream for event gaps. If no event (including ping) is received within a configurable timeout (default: 30s), treat the connection as dead.
  • Gracefully abort the hung request instead of freezing the CLI.
  • Surface a clear message: "Connection lost. Your session has been saved."
  • This alone would fix the most painful symptom — the indefinite hang.

Layer 2: Recovery Snapshot (Save in-flight state on disconnect)

When a disconnect is detected, persist a lightweight recovery snapshot alongside the session JSONL:

{
  "session_id": "abc123",
  "disconnect_phase": "streaming | tool_execution | between_turns",
  "active_tool_calls": [
    {
      "tool_use_id": "tu_xyz",
      "tool": "Edit",
      "file": "src/auth.ts",
      "status": "pending_result"
    }
  ],
  "partial_response_text": "Let me update the authentication...",
  "last_committed_message_index": 42,
  "last_file_checkpoint": "cp_789",
  "pending_todos": [...]
}

On resume, this snapshot tells Claude exactly what was happening and what remains incomplete — eliminating the need for users to manually re-explain context.

Layer 3: Auto-Resume on Reconnect

  • Monitor network reachability (OS-level events + periodic lightweight health checks).
  • When connectivity is restored, verify stability (2+ consecutive successful pings to avoid flapping).
  • Repair conversation state: inject synthetic tool_result with "error": "connection_lost" for any orphaned tool_use blocks.
  • Apply resume strategy based on what was interrupted:
    • Read-only tools (Glob, Grep, Read): safe to auto-retry
    • File mutations (Edit, Write): check file against checkpoint before retrying
    • Bash commands: prompt user before re-running (side effects unknown)
    • Streaming text: re-send the last user message with added context about the interruption
  • Use exponential backoff (1s → 2s → 4s → ... → 30s max) for retry attempts.

Configuration

All of this should be opt-in/configurable:

{
  "network_resilience": {
    "enabled": true,
    "stream_timeout_ms": 30000,
    "auto_resume": true,
    "auto_retry_readonly_tools": true,
    "auto_retry_mutations": false
  }
}

Evidence of Demand

This proposal consolidates a pattern seen across 100+ issues in this repo. A few representative ones:

Hanging / freezing on disconnect:

Connection errors / ECONNRESET:

Resume / session recovery gaps:

Retry and network awareness:

Conversation corruption after interruption:

Prior Art

Incremental Path

This doesn't need to ship as one monolithic change:

  1. Stream watchdog + graceful timeout — immediate relief for the hanging problem
  2. Recovery snapshots — enables informed manual resume
  3. Auto-resume engine — the full seamless experience

Even just Layer 1 would dramatically improve the experience for users on unreliable networks.

Metadata

Metadata

Assignees

No one assigned

    Labels

    staleIssue is inactive

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions