fix(session): unify transport error classification for stream disconnect recovery#941
Conversation
…ect recovery
Extract a shared StreamFailureClassifier that centralizes transport error
recognition (ECONNRESET, UND_ERR_SOCKET, ECONNREFUSED, ETIMEDOUT) in one
module. Both run observability and the processor retry path now classify
the same error through the same list, eliminating the contradiction where
a UND_ERR_SOCKET disconnect was marked safe-to-retry by the safety gate
but not-retryable by the technical gate.
- Replace the inline ECONNRESET case in fromError() with classifier
delegation so all transport errors produce retryable APIErrors
- Support cause-chain traversal for undici's TypeError("terminated")
wrapping pattern
- Update recovery interruption copy to short, actionable user messages
Closes #939
|
Warning Review limit reached
More reviews will be available in 11 minutes and 44 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (7)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a stream failure classifier to identify and handle retryable transport disconnects (such as ECONNRESET, ECONNREFUSED, ETIMEDOUT, and UND_ERR_SOCKET) during stream processing. It also refactors recovery interruption messages to be more concise and user-friendly, and adds comprehensive unit tests. The feedback suggests using the logical OR operator (||) instead of the nullish coalescing operator (??) to ensure empty error messages fall back to the default string, and wrapping the recursive cause-chain property accesses in a try-catch block to safely handle arbitrary objects.
…aversal Address review feedback: - Use || instead of ?? in fromError transport branch so empty error messages fall back to the default string - Wrap cause-chain property access in try-catch to guard against throwing getters or revoked proxies in error objects
Strengthen the existing TCP-reset integration test to explicitly verify
that the failed reasoning part ("one") is absent after safe recovery,
not just that it wasn't concatenated with the retry result.
Bun's test harness cannot produce Node undici's UND_ERR_SOCKET client
error shape; the processor retry/replay chain is covered by TCP reset,
while UND_ERR_SOCKET classification is verified by stream-failure-classifier
and fromError unit tests.
|
Re: P2 — processor-level UND_ERR_SOCKET regression test Not adding a standalone processor-level Strengthened the existing TCP-reset integration test ( |
Bump PawWork desktop release version to 2026.5.28. Changes since v2026.5.27: - feat(ui): fold reasoning into trow block (#948) - feat: align outbound HTTP headers with canonical OpenCode desktop (#940) - feat(app): collapse notification settings to single tri-state control (#938) - fix(ui): track List header surface via --list-surface (#954) - fix(ui): render tooltip shortcut hints as plain sans glyphs (#955) - fix(watcher): isolate macOS workspace roots - fix(session): terminalize stale question blockers - fix(session): unify transport error classification for stream disconnect recovery (#941) - test: add route inventory guardrails - ci: repair electron desktop build + install
Summary
StreamFailureClassifiermodule that centralizes transport error recognition (ECONNRESET, UND_ERR_SOCKET, ECONNREFUSED, ETIMEDOUT) in one placefromError()with classifier delegation so all transport errors produce retryable APIErrors through the same pathTypeError("terminated")wrapping patternWhy
A prod session hit a
UND_ERR_SOCKETdisconnect during reasoning-only generation. The safety gate correctly classified it ascandidate_safe_auto_retry, but the technical retryability gate rejected it becausefromError()only recognizedECONNRESET, not undici'sTypeError("terminated")withcause.code === "UND_ERR_SOCKET". The two classification paths disagreed on the same transport error, so the session stopped instead of auto-recovering.The root cause was duplicate classification logic: run observability and the processor retry path each classified transport errors independently. This PR eliminates that duplication with a shared classifier.
Related Issue
Closes #939
Human Review Status
Pending
Review Focus
classifyStreamFailurefunction: does the cause-chain traversal (max depth 4) cover the real-world undici error wrapping patterns? Could it misclassify a non-transport error?fromError()change: the old code used a hardcoded"Connection reset by server"message; the new code passes through the original error message. Is this safe for all consumers?recoveryInterruptionMessage: tool-related cases (partial_tool_input,tool_call_materialized,tool_execution_started,unsafe_side_effect_started,side_effect_facts_incomplete) are now consolidated into fewer messages. Verify the wording is appropriate for each case.Risk Notes
fromError()change means ECONNRESET errors now carry the original error message instead of"Connection reset by server". This is more accurate but changes the message text that appears in diagnostics exports. No user-visible UI shows this raw message directly.How To Verify
Screenshots or Recordings
No visible UI changes. Copy changes are in error messages shown after failed recovery — verified through integration test assertions.
Checklist
bug,enhancement,task,documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.app,ui,platform,harness,ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.P0,P1,P2,P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.Pending,Approved by @<reviewer>, orNot required: <reason>(default isPending; "not required" is restricted to bot-authored low-risk PRs).dev, and my PR title and commit messages use Conventional Commits in English.