Skip to content

fix: support retry for RangeScan operations#938

Merged
mattisonchao merged 3 commits intomainfrom
feat/range-scan-leader-hint-retry
Mar 9, 2026
Merged

fix: support retry for RangeScan operations#938
mattisonchao merged 3 commits intomainfrom
feat/range-scan-leader-hint-retry

Conversation

@mattisonchao
Copy link
Copy Markdown
Member

@mattisonchao mattisonchao commented Mar 8, 2026

Summary

  • Add retry with exponential backoff and leader hint extraction to rangeScanFromShard(), following the same pattern used by read and write batches
  • To avoid duplicate records, retries are only attempted before any data has been sent to the channel
  • Export IsRetriable from the batch package so it can be reused by RangeScan retry logic

Test plan

  • Added TestLeaderHintRangeScanWithoutClient — verifies leader hint is returned for RangeScan RPC on non-leader node
  • Added TestLeaderHintRangeScanWithClient — verifies end-to-end RangeScan with DizzyShardManager (forces client to hit wrong node first)
  • All existing tests pass

@mattisonchao mattisonchao changed the title fix: support leader hint retry for RangeScan operations fix: support retry for RangeScan operations Mar 8, 2026
@mattisonchao mattisonchao self-assigned this Mar 8, 2026
@mattisonchao mattisonchao force-pushed the feat/range-scan-leader-hint-retry branch 2 times, most recently from 11eb255 to dd52df0 Compare March 8, 2026 18:09
Copy link
Copy Markdown
Member Author

@mattisonchao mattisonchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review: PR #938 (RangeScan retry)

Code correctness

Retry logic — Same backoff.RetryNotify + FindLeaderHint pattern as List PR and read/write batches. retryCtx (with requestTimeout) is correctly used for both backoff timing and RPC calls.

Duplicate prevention — Same dataSent flag approach as List PR. Retries only before data flows to the channel.

Behavior change: close(ch) is now always called

The original rangeScanFromShard had a subtle issue:

client, err := c.executor.ExecuteRangeScan(ctx, request, nil)
if err != nil {
    ch <- GetResult{Err: err}
    return  // ← channel NOT closed
}
defer close(ch)  // ← only set up on success

If ExecuteRangeScan failed (e.g. connection refused), the channel was never closed. In the single-shard path, the consumer would receive the error but never see channel close — any for range ch would block forever after the error.

The new code always calls close(ch) at the end of rangeScanFromShard:

if err != nil {
    ch <- GetResult{Err: err}
}
close(ch)

This is a bug fix — the consumer now always sees a proper channel close. In the multi-shard path, the per-shard channels are also always closed, which lets aggregateAndSortRangeScanAcrossShards know the shard is done.

Observations

  1. requestTimeout may be tight for large RangeScan operations — same observation as List PR. Consistent with other retry paths.

  2. Test pre-allocates results slice (make([]oxia.GetResult, 0, 2)) to satisfy prealloc lint.

  3. Both PRs export IsRetriable independently — merge-order safe.

Verdict

LGTM. The close(ch) change is an improvement over the original behavior.

@mattisonchao mattisonchao force-pushed the feat/range-scan-leader-hint-retry branch 2 times, most recently from 7c207dd to 7ea4160 Compare March 8, 2026 18:34
@mattisonchao mattisonchao changed the base branch from main to feat/list-leader-hint-retry March 8, 2026 18:34
@mattisonchao mattisonchao changed the base branch from feat/list-leader-hint-retry to main March 8, 2026 18:47
@mattisonchao mattisonchao force-pushed the feat/range-scan-leader-hint-retry branch from 5f00681 to e4092ae Compare March 8, 2026 18:50
mattisonchao and others added 3 commits March 9, 2026 08:43
RangeScan operations previously had no retry logic and passed nil as
the leader hint. When the client connected to a non-leader node, the
request would fail immediately instead of retrying with the correct
leader.

This adds retry with exponential backoff and leader hint extraction
to rangeScanFromShard, following the same pattern used by read and
write batches. To avoid duplicate records, retries are only attempted
before any data has been sent to the channel.

Also exports IsRetriable from the batch package so it can be reused
by the RangeScan retry logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: mattisonchao <mattisonchao@gmail.com>
The assignment dispatcher on CI may take longer to propagate shard
assignments. Increase the Eventually timeout to avoid flaky failures.

Signed-off-by: Mattison <mattison@streamnative.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: mattisonchao <mattisonchao@gmail.com>
Signed-off-by: Mattison <mattison@streamnative.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: mattisonchao <mattisonchao@gmail.com>
@mattisonchao mattisonchao force-pushed the feat/range-scan-leader-hint-retry branch from e4092ae to 3d1f732 Compare March 9, 2026 00:43
@mattisonchao mattisonchao merged commit 2d97a69 into main Mar 9, 2026
9 checks passed
@mattisonchao mattisonchao deleted the feat/range-scan-leader-hint-retry branch March 9, 2026 00:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant