fix: support retry for RangeScan operations #938
mattisonchao left a comment
Self-review: PR #938 (RangeScan retry)
Code correctness
Retry logic — Same `backoff.RetryNotify` + `FindLeaderHint` pattern as the List PR and the read/write batch paths. `retryCtx` (scoped to `requestTimeout`) is correctly used for both the backoff timing and the RPC calls.
Duplicate prevention — Same `dataSent` flag approach as the List PR. Retries are attempted only before any data has flowed to the channel.
Behavior change: close(ch) is now always called
The original `rangeScanFromShard` had a subtle issue:

```go
client, err := c.executor.ExecuteRangeScan(ctx, request, nil)
if err != nil {
	ch <- GetResult{Err: err}
	return // ← channel NOT closed
}
defer close(ch) // ← only set up on success
```

If `ExecuteRangeScan` failed (e.g. connection refused), the channel was never closed. In the single-shard path, the consumer would receive the error but never see the channel close — any `for range ch` would block forever after the error.
The new code always calls `close(ch)` at the end of `rangeScanFromShard`:

```go
if err != nil {
	ch <- GetResult{Err: err}
}
close(ch)
```

This is a bug fix — the consumer now always sees a proper channel close. In the multi-shard path, the per-shard channels are also always closed, which lets `aggregateAndSortRangeScanAcrossShards` know the shard is done.
Observations
- `requestTimeout` may be tight for large RangeScan operations — same observation as the List PR. Consistent with other retry paths.
- The test pre-allocates the `results` slice (`make([]oxia.GetResult, 0, 2)`) to satisfy the `prealloc` lint.
- Both PRs export `IsRetriable` independently — merge-order safe.
Verdict
LGTM. The `close(ch)` change is an improvement over the original behavior.
RangeScan operations previously had no retry logic and passed `nil` as the leader hint. When the client connected to a non-leader node, the request would fail immediately instead of retrying with the correct leader. This adds retry with exponential backoff and leader hint extraction to `rangeScanFromShard`, following the same pattern used by read and write batches. To avoid duplicate records, retries are only attempted before any data has been sent to the channel. Also exports `IsRetriable` from the batch package so it can be reused by the RangeScan retry logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: mattisonchao <mattisonchao@gmail.com>
The assignment dispatcher on CI may take longer to propagate shard assignments. Increase the `Eventually` timeout to avoid flaky failures.

Signed-off-by: Mattison <mattison@streamnative.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: mattisonchao <mattisonchao@gmail.com>
Summary
- Adds retry with exponential backoff and leader hint extraction to `rangeScanFromShard()`, following the same pattern used by read and write batches.
- Exports `IsRetriable` from the batch package so it can be reused by the RangeScan retry logic.

Test plan
- `TestLeaderHintRangeScanWithoutClient` — verifies the leader hint is returned for a RangeScan RPC on a non-leader node.
- `TestLeaderHintRangeScanWithClient` — verifies end-to-end RangeScan with `DizzyShardManager` (forces the client to hit the wrong node first).