fix(provider): replay an SSE stream cut before any token #3161

Merged
esengine merged 1 commit into main-v2 from fix/3148-stream-reconnect
Jun 5, 2026

@esengine esengine commented Jun 5, 2026

Owner

Problem

With a local proxy in front of the API (v2rayN / sing-box core), a streaming turn dies mid-flight:

deepseek-flash: read stream: read tcp 127.0.0.1:7442->127.0.0.1:10808:
wsarecv: An existing connection was forcibly closed by the remote host.

The dominant trigger is the long time-to-first-token of a reasoning model (prefill + thinking), during which no bytes flow. The proxy treats the long-lived SSE connection as idle and forcibly closes it (RST) before any token reaches us. SendWithRetry covers only the connect + header phase, so a drop after the body started streaming was surfaced as a hard error and failed the whole turn.

Fix

The OpenAI provider (which DeepSeek uses) now drives the stream through streamWithReconnect:

  • If the connection resets before any model output has been forwarded, the request is replayed from scratch. That window is exactly the idle-prefill case the proxy kills, and a replay is idempotent — under prompt caching the resent prefix is a cache hit, so recovery is cheap.
  • Once a token (reasoning / text / tool-call) has streamed, a reset is surfaced as an error instead — replaying would duplicate already-visible output.
  • Bounded at maxStreamReconnects (3); each replay still gets SendWithRetry's header-phase backoff.
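
The replay-or-surface rule in the bullets above can be sketched as a small loop. This is a minimal illustration, not the code in this PR: `attemptFunc`, `flaky`, and `isReset` are hypothetical helpers, and the sketch assumes the bound of 3 means three replays after the initial attempt. It also leaves out SendWithRetry's header-phase backoff.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// Assumption: the cap counts replays, not total attempts.
const maxStreamReconnects = 3

// attemptFunc is a hypothetical stand-in for "send the request and read the
// stream once". emitted reports whether any model output was forwarded.
type attemptFunc func() (emitted bool, err error)

// streamWithReconnect replays the attempt only when the connection reset
// before any output was forwarded, up to maxStreamReconnects replays.
// It returns the number of replays performed and the final error, if any.
func streamWithReconnect(attempt attemptFunc, isConnReset func(error) bool) (int, error) {
	for replays := 0; ; replays++ {
		emitted, err := attempt()
		if err == nil {
			return replays, nil
		}
		// Surface the error if output is already visible, if the failure is
		// not a connection-level drop, or if the replay budget is spent.
		if emitted || !isConnReset(err) || replays == maxStreamReconnects {
			return replays, err
		}
	}
}

// isReset is a simplified classifier used only by this demo.
func isReset(err error) bool { return errors.Is(err, io.ErrUnexpectedEOF) }

// flaky returns an attempt that fails `failures` times with a truncated-stream
// error, reporting emitFirst as its emitted flag, and then succeeds.
func flaky(failures int, emitFirst bool) attemptFunc {
	calls := 0
	return func() (bool, error) {
		calls++
		if calls <= failures {
			return emitFirst, fmt.Errorf("read stream: %w", io.ErrUnexpectedEOF)
		}
		return true, nil
	}
}

func main() {
	replays, err := streamWithReconnect(flaky(2, false), isReset)
	fmt.Println("reset before output:", replays, err)

	replays, err = streamWithReconnect(flaky(1, true), isReset)
	fmt.Println("reset after output:", replays, err)
}
```

Keeping the classifier as a parameter keeps the loop independent of how a reset is detected, which is the role provider.IsConnReset plays in the description.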

provider.IsConnReset gates the decision strictly on connection-level errors: a peer reset, a truncated EOF, or a closed socket, detected via net.Error, ECONNRESET, and io.ErrUnexpectedEOF. Decode errors and 4xx responses still fail fast, so nothing is silently masked.
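
For illustration, a classifier with this behavior could look roughly like the sketch below. `isConnReset` and `samples` are hypothetical names, and the exact set of checks in provider.IsConnReset may differ. The context check comes first because `context.DeadlineExceeded` itself satisfies `net.Error` and would otherwise be treated as a connection drop.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"syscall"
)

// isConnReset is a hypothetical stand-in for provider.IsConnReset. It reports
// whether err looks like a connection-level drop rather than a protocol error.
func isConnReset(err error) bool {
	if err == nil {
		return false
	}
	// A cancel or deadline from the caller is never a reason to replay.
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return false
	}
	// Peer reset, truncated body, or a socket closed underneath us.
	if errors.Is(err, syscall.ECONNRESET) ||
		errors.Is(err, io.ErrUnexpectedEOF) ||
		errors.Is(err, net.ErrClosed) {
		return true
	}
	// Any other error raised by the network layer, such as *net.OpError.
	var netErr net.Error
	return errors.As(err, &netErr)
}

// samples lists representative errors and the expected classification.
var samples = []struct {
	name string
	err  error
	want bool
}{
	{"peer reset", &net.OpError{Op: "read", Net: "tcp", Err: syscall.ECONNRESET}, true},
	{"truncated body", fmt.Errorf("read stream: %w", io.ErrUnexpectedEOF), true},
	{"closed socket", net.ErrClosed, true},
	{"context cancel", context.Canceled, false},
	{"context deadline", context.DeadlineExceeded, false},
	{"decode error", errors.New("invalid character in JSON"), false},
	{"nil", nil, false},
}

func main() {
	for _, s := range samples {
		fmt.Printf("%-16s reset=%v\n", s.name, isConnReset(s.err))
	}
}
```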

readStream was refactored to return (emitted bool, err error) rather than emitting the terminal error itself, so the reconnect wrapper owns the replay-or-surface decision and channel lifecycle.
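
The shape of that contract can be sketched as follows. This is a simplified illustration, not the refactored function: the callback-based `emit`, the `collect` helper, and the rule that every `data:` payload counts as output are assumptions made for the example (the real readStream forwards chunks and presumably marks output only for reasoning, text, or tool-call tokens). The sketch also assumes an OpenAI-style stream that ends with `data: [DONE]`, so a body that ends earlier is reported as truncated.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

// readStream scans SSE "data:" lines from body and forwards each payload via
// emit. It does not report the terminal error itself; it returns it together
// with whether anything was forwarded, so the caller can replay or surface.
func readStream(body io.Reader, emit func(payload string)) (emitted bool, err error) {
	r := bufio.NewReader(body)
	for {
		line, readErr := r.ReadString('\n')
		if readErr != nil {
			// The body ended or broke before the [DONE] sentinel. A partial
			// line is dropped rather than forwarded.
			if errors.Is(readErr, io.EOF) {
				return emitted, io.ErrUnexpectedEOF
			}
			return emitted, readErr
		}
		payload, ok := strings.CutPrefix(strings.TrimSpace(line), "data:")
		if !ok {
			continue // blank separator or a non-data field
		}
		payload = strings.TrimSpace(payload)
		if payload == "[DONE]" {
			return emitted, nil
		}
		emit(payload)
		emitted = true
	}
}

// collect is a demo helper that runs readStream over an in-memory stream.
func collect(input string) (payloads []string, emitted bool, err error) {
	emitted, err = readStream(strings.NewReader(input), func(p string) {
		payloads = append(payloads, p)
	})
	return payloads, emitted, err
}

func main() {
	got, emitted, err := collect("data: a\n\ndata: b\n\ndata: [DONE]\n\n")
	fmt.Println(got, emitted, err)

	got, emitted, err = collect("") // connection cut before any token
	fmt.Println(got, emitted, err)
}
```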

Tests

  • TestIsConnReset — only connection-level drops count; ctx-cancel and protocol errors don't.
  • TestStreamReconnectsOnEarlyConnReset — a mock force-RSTs the first SSE connection (zero tokens), then serves a full stream; the caller sees one clean stream and the server receives 2 requests.
  • TestStreamDoesNotReplayAfterOutput — a reset after a token surfaces a ChunkError with no replay (the server receives 1 request).

End-to-end verification

Built the real reasonix CLI and ran reasonix run against a mock DeepSeek endpoint that force-RSTs the first stream connection mid-flight (before any token), then serves the answer on retry. Full stack (boot → controller → provider):

$ reasonix run "Reply with the recovery token"
E2E_OK_RECOVERED
  · 10 tok · in 7 (0 cached / 7 new) · out 3
exit 0

The answer is only served on the second (reconnected) attempt — the turn recovered transparently with no error, exit 0.

Scope: this lands on the OpenAI provider (the one DeepSeek uses, and the one in the report). The Anthropic provider has the same readStream shape and can take the same treatment in a follow-up.

Closes #3148

A local proxy (v2rayN/sing-box) idle-closes the long-lived streaming
connection during a reasoner's first-token gap, surfacing as
"read stream: ... wsarecv: An existing connection was forcibly closed".
The turn failed even though nothing had been emitted yet.

The OpenAI provider now replays the request when the connection resets
before any model output has been forwarded — safe and idempotent, and a
cache hit on the resent prefix. Once a token has streamed, the error is
surfaced instead, since a replay would duplicate visible output. Capped
at a few reconnects. IsConnReset gates strictly on connection-level
errors so decode/4xx failures still fail fast.

Closes #3148
@esengine esengine requested a review from SivanCola as a code owner June 5, 2026 01:44
@github-actions github-actions Bot added the v2 Go rewrite (1.x) — main-v2 branch, active development label Jun 5, 2026
@esengine esengine merged commit d5df0e5 into main-v2 Jun 5, 2026
9 checks passed
@esengine esengine deleted the fix/3148-stream-reconnect branch June 5, 2026 01:50

Labels

v2 Go rewrite (1.x) — main-v2 branch, active development

Development

Successfully merging this pull request may close these issues.

[Feature]: Reasonix uses a long-lived connection (SSE streaming requests), and sing-box may have trouble handling this kind of connection.
