fix: support retry for List operations#937
Conversation
e77e220 to
1d988d0
Compare
mattisonchao
left a comment
There was a problem hiding this comment.
Self-review: found a bug and a couple of observations.
Bug: ctx vs retryCtx mismatch
In listFromShard, we create retryCtx with a timeout but pass the original ctx (no timeout guarantee) to doList:
retryCtx, cancel := context.WithTimeout(ctx, c.options.requestTimeout)
backOff := time2.NewBackOff(retryCtx)
err := backoff.RetryNotify(func() error {
return c.doList(ctx, request, hint, ch) // ← uses ctx, not retryCtx
}, backOff, ...)The read/write batch pattern uses the same timeout context for both the backoff and the RPC:
ctx, cancel := context.WithTimeout(context.Background(), b.requestTimeout)
backOff := time2.NewBackOff(ctx)
err = backoff.RetryNotify(func() error {
response, err = b.doRequest(ctx, request, hint) // ← same ctx
}, backOff, ...)If the caller passes context.Background(), individual RPCs in doList have no deadline and could hang forever. We should pass retryCtx to doList instead of ctx.
Observation: error path when dataSent && IsRetriable
When data has already been sent and we hit a retriable error, we return backoff.Permanent(err). This is correct — the consumer sees partial data then an error result, and we avoid duplicates. Just want to call out this is intentional.
Will push a fix for the context bug.
1d988d0 to
1973d62
Compare
mattisonchao
left a comment
There was a problem hiding this comment.
Self-review: PR #937 (List retry)
Code correctness
Retry logic — Follows the same backoff.RetryNotify + FindLeaderHint pattern as readBatch.doRequestWithRetries and writeBatch.doRequestWithRetries. The retryCtx (with requestTimeout) is correctly used for both the backoff and the RPC calls.
Duplicate prevention — The dataSent flag ensures we only retry before any data has been written to the channel. Once data flows, a retriable error becomes permanent. This is the right trade-off: we handle the common case (not-leader on initial connect) and avoid the complex case (mid-stream retry with deduplication).
Channel close semantics — listFromShard never closes the channel, same as the original. The caller (List()) handles closing via close(ch) in the single-shard case and via WaitGroup + close(ch) in the multi-shard case. No behavior change here.
Error wrapping — backoff.Permanent(err) wraps the error, but backoff.RetryNotify unwraps PermanentError before returning. The caller sees the original error.
Observations
-
requestTimeoutmay be tight for large List operations returning millions of keys. However, this is consistent with read/write batch behavior and can be tuned viaWithRequestTimeout. -
Both PRs export
IsRetriableindependently — whichever merges second will have a no-op change onrpc_errors.go. Git merge will handle this cleanly. -
Test setup is duplicated from existing
TestLeaderHintWithoutClient/TestLeaderHintWithClient. Could extract a helper, but keeping it consistent with the existing test style.
Verdict
LGTM after the retryCtx fix (already pushed).
List operations previously had no retry logic and passed nil as the leader hint. When the client connected to a non-leader node, the request would fail immediately instead of retrying with the correct leader. This adds retry with exponential backoff and leader hint extraction to listFromShard, following the same pattern used by read and write batches. To avoid duplicate keys, retries are only attempted before any data has been sent to the channel. Also exports IsRetriable from the batch package so it can be reused by the List and future RangeScan retry logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: mattisonchao <mattisonchao@gmail.com>
1973d62 to
7a99e77
Compare
The assignment dispatcher on CI may take longer to propagate shard assignments. Increase the Eventually timeout to avoid flaky failures. Signed-off-by: Mattison <mattison@streamnative.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: mattisonchao <mattisonchao@gmail.com>
Summary
listFromShard(), following the same pattern used by read and write batchesIsRetriablefrom the batch package so it can be reused by List and RangeScan retry logicTest plan
TestLeaderHintListWithoutClient— verifies leader hint is returned for List RPC on non-leader nodeTestLeaderHintListWithClient— verifies end-to-end List with DizzyShardManager (forces client to hit wrong node first)