
fix: support retry for List operations#937

Merged
mattisonchao merged 2 commits into main from feat/list-leader-hint-retry
Mar 9, 2026

Conversation

@mattisonchao
Member

@mattisonchao mattisonchao commented Mar 8, 2026

Summary

  • Add retry with exponential backoff and leader hint extraction to listFromShard(), following the same pattern used by read and write batches
  • To avoid duplicate keys, retries are only attempted before any data has been sent to the channel
  • Export IsRetriable from the batch package so it can be reused by List and RangeScan retry logic

Test plan

  • Added TestLeaderHintListWithoutClient — verifies leader hint is returned for List RPC on non-leader node
  • Added TestLeaderHintListWithClient — verifies end-to-end List with DizzyShardManager (forces client to hit wrong node first)
  • All existing leader hint tests pass
  • All batch and client tests pass

@mattisonchao mattisonchao changed the title from "fix: support leader hint retry for List operations" to "fix: support retry for List operations" Mar 8, 2026
@mattisonchao mattisonchao self-assigned this Mar 8, 2026
@mattisonchao mattisonchao force-pushed the feat/list-leader-hint-retry branch from e77e220 to 1d988d0 on March 8, 2026 17:05
Member Author

@mattisonchao mattisonchao left a comment


Self-review: found a bug and a couple of observations.

Bug: ctx vs retryCtx mismatch

In listFromShard, we create retryCtx with a timeout but pass the original ctx (no timeout guarantee) to doList:

retryCtx, cancel := context.WithTimeout(ctx, c.options.requestTimeout)
backOff := time2.NewBackOff(retryCtx)

err := backoff.RetryNotify(func() error {
    return c.doList(ctx, request, hint, ch)  // ← uses ctx, not retryCtx
}, backOff, ...)

The read/write batch pattern uses the same timeout context for both the backoff and the RPC:

ctx, cancel := context.WithTimeout(context.Background(), b.requestTimeout)
backOff := time2.NewBackOff(ctx)
err = backoff.RetryNotify(func() error {
    response, err = b.doRequest(ctx, request, hint)  // ← same ctx
    return err
}, backOff, ...)

If the caller passes context.Background(), individual RPCs in doList have no deadline and could hang forever. We should pass retryCtx to doList instead of ctx.

Observation: error path when dataSent && IsRetriable

When data has already been sent and we hit a retriable error, we return backoff.Permanent(err). This is correct — the consumer sees partial data then an error result, and we avoid duplicates. Just want to call out this is intentional.

Will push a fix for the context bug.

@mattisonchao mattisonchao force-pushed the feat/list-leader-hint-retry branch from 1d988d0 to 1973d62 on March 8, 2026 18:08
Member Author

@mattisonchao mattisonchao left a comment


Self-review: PR #937 (List retry)

Code correctness

Retry logic — Follows the same backoff.RetryNotify + FindLeaderHint pattern as readBatch.doRequestWithRetries and writeBatch.doRequestWithRetries. The retryCtx (with requestTimeout) is correctly used for both the backoff and the RPC calls.

Duplicate prevention — The dataSent flag ensures we only retry before any data has been written to the channel. Once data flows, a retriable error becomes permanent. This is the right trade-off: we handle the common case (not-leader on initial connect) and avoid the complex case (mid-stream retry with deduplication).

Channel close semantics — listFromShard never closes the channel, same as the original. The caller (List()) handles closing via close(ch) in the single-shard case and via WaitGroup + close(ch) in the multi-shard case. No behavior change here.

Error wrapping — backoff.Permanent(err) wraps the error, but backoff.RetryNotify unwraps PermanentError before returning. The caller sees the original error.
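The wrap/unwrap round trip can be sketched with stdlib-only stand-ins. permanentError and retry here are hypothetical local types, not the real backoff package, but the real backoff.PermanentError behaves analogously via Unwrap:

```go
package main

import (
	"errors"
	"fmt"
)

// permanentError is a local stand-in for backoff's PermanentError wrapper.
type permanentError struct{ err error }

func (p *permanentError) Error() string { return p.err.Error() }
func (p *permanentError) Unwrap() error { return p.err }

func permanent(err error) error { return &permanentError{err: err} }

// errConflict stands in for a non-retriable application error.
var errConflict = errors.New("key conflict")

// retry runs op once here; on a permanent error it unwraps before
// returning, so the caller sees the original error rather than the wrapper.
func retry(op func() error) error {
	err := op()
	var p *permanentError
	if errors.As(err, &p) {
		return p.Unwrap()
	}
	return err
}

func main() {
	err := retry(func() error { return permanent(errConflict) })
	fmt.Println(err == errConflict) // the wrapper is stripped before return
}
```

This is why callers of listFromShard can compare against sentinel errors directly instead of unwrapping themselves.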

Observations

  1. requestTimeout may be tight for large List operations returning millions of keys. However, this is consistent with read/write batch behavior and can be tuned via WithRequestTimeout.

  2. Both PRs export IsRetriable independently — whichever merges second will have a no-op change on rpc_errors.go. Git merge will handle this cleanly.

  3. Test setup is duplicated from existing TestLeaderHintWithoutClient/TestLeaderHintWithClient. Could extract a helper, but keeping it consistent with the existing test style.

Verdict

LGTM after the retryCtx fix (already pushed).

List operations previously had no retry logic and passed nil as the
leader hint. When the client connected to a non-leader node, the
request would fail immediately instead of retrying with the correct
leader.

This adds retry with exponential backoff and leader hint extraction
to listFromShard, following the same pattern used by read and write
batches. To avoid duplicate keys, retries are only attempted before
any data has been sent to the channel.

Also exports IsRetriable from the batch package so it can be reused
by the List and future RangeScan retry logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: mattisonchao <mattisonchao@gmail.com>
@mattisonchao mattisonchao force-pushed the feat/list-leader-hint-retry branch from 1973d62 to 7a99e77 on March 8, 2026 18:24
The assignment dispatcher on CI may take longer to propagate shard
assignments. Increase the Eventually timeout to avoid flaky failures.

Signed-off-by: Mattison <mattison@streamnative.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: mattisonchao <mattisonchao@gmail.com>
@mattisonchao mattisonchao merged commit 28f8484 into main Mar 9, 2026
11 of 12 checks passed
@mattisonchao mattisonchao deleted the feat/list-leader-hint-retry branch March 9, 2026 00:42