
fix: treat EOF as retriable error on write stream #917

Merged: mattisonchao merged 2 commits into main from fix/retry-eof-on-write-stream on Feb 26, 2026

@mattisonchao (Member)

Summary

  • When a bidirectional write stream is closed by the server before delivering the gRPC status (e.g. the WriteStream handler returns CodeNodeIsNotLeader before reading any messages), the client's stream.Recv() can return io.EOF instead of the proper gRPC status error
  • io.EOF was not classified as retriable in isRetriable(), so the batch retry logic treated it as a permanent failure
  • This caused flaky TestLeaderHintWithClient failures where the DizzyShardManager forces the client to connect to a non-leader node — the server closes the stream immediately, the client gets EOF, and gives up instead of retrying with the leader hint

Root cause

In the WriteStream server handler (public_rpc_server.go:184-186), when the node is not the leader, getLeader() returns an error and the handler returns immediately, before ever calling stream.Recv(). This causes gRPC to close the server-side stream, and the client's Recv() then races between receiving the proper status error and io.EOF.

The isRetriable() function checks status.Code(err), but io.EOF is not a gRPC status error, so status.Code(io.EOF) returns codes.Unknown, which falls through to the non-retriable default case.

Fix

Add an explicit errors.Is(err, io.EOF) check before the gRPC status code switch, treating EOF as a retriable transient condition.

Test plan

  • go vet ./oxia/internal/batch/... passes
  • CI passes

🤖 Generated with Claude Code

When a bidirectional write stream is closed by the server before
delivering the gRPC status (e.g. the server returns an error from
WriteStream before reading any messages), the client's stream.Recv()
can return io.EOF instead of the proper gRPC status error. This is a
transient condition that should be retried rather than treated as a
permanent failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The test was asserting that io.EOF propagates to callers, but EOF is
now correctly retried. Replace with codes.Internal which is a proper
non-retriable gRPC error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao force-pushed the fix/retry-eof-on-write-stream branch from f75b996 to 12498e9 Compare February 26, 2026 16:35
@mattisonchao mattisonchao merged commit 52a4596 into main Feb 26, 2026
9 checks passed
@mattisonchao mattisonchao deleted the fix/retry-eof-on-write-stream branch February 26, 2026 16:46