Problem
When using Claude Code on an unstable network (WiFi drops, power outages, VPN reconnects, mobile hotspot switching, laptop sleep/wake), a mid-task disconnection leads to a cascade of problems:
- The CLI hangs indefinitely — no SSE events arrive, but no timeout triggers. Escape/Ctrl+C often don't work. The only recovery is killing the process.
- Conversation state gets corrupted — orphaned
tool_use blocks without corresponding tool_result blocks break the message history, causing API 400 errors on retry.
- In-flight work is lost — partial streaming responses, pending tool calls, and task context (todo list state, which files were being edited) disappear.
--resume doesn't know what was in-progress — it restores conversation history but Claude has no awareness that it was interrupted mid-task. Users must manually prompt "you were cut off, continue from here" and paste partial output. Claude often hallucinates that it already finished.
This isn't an edge case. Anyone working from a café, on mobile hotspot, in a region with unreliable power, or behind a corporate VPN hits this regularly.
Current Workaround
- Notice the CLI is frozen (sometimes only after minutes of waiting)
- Kill all Claude processes (
pkill -f claude)
- Restart with
claude --resume or claude --continue
- Manually re-explain what was happening: "Your last response was cut off due to a connection loss. You were editing src/auth.ts and had 3 more files to update. Continue from where you left off."
- Hope Claude doesn't hallucinate that it already completed the work
Proposed Solution
A three-layer approach to network resilience:
Layer 1: Stream Watchdog (Don't hang forever)
- Monitor the SSE stream for event gaps. If no event (including
ping) is received within a configurable timeout (default: 30s), treat the connection as dead.
- Gracefully abort the hung request instead of freezing the CLI.
- Surface a clear message: "Connection lost. Your session has been saved."
- This alone would fix the most painful symptom — the indefinite hang.
Layer 2: Recovery Snapshot (Save in-flight state on disconnect)
When a disconnect is detected, persist a lightweight recovery snapshot alongside the session JSONL:
{
"session_id": "abc123",
"disconnect_phase": "streaming | tool_execution | between_turns",
"active_tool_calls": [
{
"tool_use_id": "tu_xyz",
"tool": "Edit",
"file": "src/auth.ts",
"status": "pending_result"
}
],
"partial_response_text": "Let me update the authentication...",
"last_committed_message_index": 42,
"last_file_checkpoint": "cp_789",
"pending_todos": [...]
}
On resume, this snapshot tells Claude exactly what was happening and what remains incomplete — eliminating the need for users to manually re-explain context.
Layer 3: Auto-Resume on Reconnect
- Monitor network reachability (OS-level events + periodic lightweight health checks).
- When connectivity is restored, verify stability (2+ consecutive successful pings to avoid flapping).
- Repair conversation state: inject synthetic
tool_result with "error": "connection_lost" for any orphaned tool_use blocks.
- Apply resume strategy based on what was interrupted:
- Read-only tools (Glob, Grep, Read): safe to auto-retry
- File mutations (Edit, Write): check file against checkpoint before retrying
- Bash commands: prompt user before re-running (side effects unknown)
- Streaming text: re-send the last user message with added context about the interruption
- Use exponential backoff (1s → 2s → 4s → ... → 30s max) for retry attempts.
Configuration
All of this should be opt-in/configurable:
Evidence of Demand
This proposal consolidates a pattern seen across 100+ issues in this repo. A few representative ones:
Hanging / freezing on disconnect:
Connection errors / ECONNRESET:
Resume / session recovery gaps:
Retry and network awareness:
Conversation corruption after interruption:
Prior Art
Incremental Path
This doesn't need to ship as one monolithic change:
- Stream watchdog + graceful timeout — immediate relief for the hanging problem
- Recovery snapshots — enables informed manual resume
- Auto-resume engine — the full seamless experience
Even just Layer 1 would dramatically improve the experience for users on unreliable networks.
Problem
When using Claude Code on an unstable network (WiFi drops, power outages, VPN reconnects, mobile hotspot switching, laptop sleep/wake), a mid-task disconnection leads to a cascade of problems:
tool_useblocks without correspondingtool_resultblocks break the message history, causing API 400 errors on retry.--resumedoesn't know what was in-progress — it restores conversation history but Claude has no awareness that it was interrupted mid-task. Users must manually prompt "you were cut off, continue from here" and paste partial output. Claude often hallucinates that it already finished.This isn't an edge case. Anyone working from a café, on mobile hotspot, in a region with unreliable power, or behind a corporate VPN hits this regularly.
Current Workaround
pkill -f claude)claude --resumeorclaude --continueProposed Solution
A three-layer approach to network resilience:
Layer 1: Stream Watchdog (Don't hang forever)
ping) is received within a configurable timeout (default: 30s), treat the connection as dead.Layer 2: Recovery Snapshot (Save in-flight state on disconnect)
When a disconnect is detected, persist a lightweight recovery snapshot alongside the session JSONL:
{ "session_id": "abc123", "disconnect_phase": "streaming | tool_execution | between_turns", "active_tool_calls": [ { "tool_use_id": "tu_xyz", "tool": "Edit", "file": "src/auth.ts", "status": "pending_result" } ], "partial_response_text": "Let me update the authentication...", "last_committed_message_index": 42, "last_file_checkpoint": "cp_789", "pending_todos": [...] }On resume, this snapshot tells Claude exactly what was happening and what remains incomplete — eliminating the need for users to manually re-explain context.
Layer 3: Auto-Resume on Reconnect
tool_resultwith"error": "connection_lost"for any orphanedtool_useblocks.Configuration
All of this should be opt-in/configurable:
{ "network_resilience": { "enabled": true, "stream_timeout_ms": 30000, "auto_resume": true, "auto_retry_readonly_tools": true, "auto_retry_mutations": false } }Evidence of Demand
This proposal consolidates a pattern seen across 100+ issues in this repo. A few representative ones:
Hanging / freezing on disconnect:
Connection errors / ECONNRESET:
Resume / session recovery gaps:
Retry and network awareness:
Conversation corruption after interruption:
Prior Art
Last-Event-IDheader — clients can resume from where they left off if the server supports it.Incremental Path
This doesn't need to ship as one monolithic change:
Even just Layer 1 would dramatically improve the experience for users on unreliable networks.