
fix: support retry for List operations#937

Merged
mattisonchao merged 2 commits into main from feat/list-leader-hint-retry
Mar 9, 2026

Conversation

@mattisonchao
Member

@mattisonchao mattisonchao commented Mar 8, 2026

Summary

  • Add retry with exponential backoff and leader hint extraction to listFromShard(), following the same pattern used by read and write batches
  • To avoid duplicate keys, retries are only attempted before any data has been sent to the channel
  • Export IsRetriable from the batch package so it can be reused by List and RangeScan retry logic

Test plan

  • Added TestLeaderHintListWithoutClient — verifies leader hint is returned for List RPC on non-leader node
  • Added TestLeaderHintListWithClient — verifies end-to-end List with DizzyShardManager (forces client to hit wrong node first)
  • All existing leader hint tests pass
  • All batch and client tests pass

@mattisonchao mattisonchao changed the title from "fix: support leader hint retry for List operations" to "fix: support retry for List operations" Mar 8, 2026
@mattisonchao mattisonchao self-assigned this Mar 8, 2026
@mattisonchao mattisonchao force-pushed the feat/list-leader-hint-retry branch from e77e220 to 1d988d0 on March 8, 2026 17:05
Member Author

@mattisonchao mattisonchao left a comment


Self-review: found a bug and a couple of observations.

Bug: ctx vs retryCtx mismatch

In listFromShard, we create retryCtx with a timeout but pass the original ctx (no timeout guarantee) to doList:

retryCtx, cancel := context.WithTimeout(ctx, c.options.requestTimeout)
backOff := time2.NewBackOff(retryCtx)

err := backoff.RetryNotify(func() error {
    return c.doList(ctx, request, hint, ch)  // ← uses ctx, not retryCtx
}, backOff, ...)

The read/write batch pattern uses the same timeout context for both the backoff and the RPC:

ctx, cancel := context.WithTimeout(context.Background(), b.requestTimeout)
backOff := time2.NewBackOff(ctx)
err = backoff.RetryNotify(func() error {
    response, err = b.doRequest(ctx, request, hint)  // ← same ctx
    return err
}, backOff, ...)

If the caller passes context.Background(), individual RPCs in doList have no deadline and could hang forever. We should pass retryCtx to doList instead of ctx.

Observation: error path when dataSent && IsRetriable

When data has already been sent and we hit a retriable error, we return backoff.Permanent(err). This is correct — the consumer sees partial data then an error result, and we avoid duplicates. Just want to call out this is intentional.

Will push a fix for the context bug.

@mattisonchao mattisonchao force-pushed the feat/list-leader-hint-retry branch from 1d988d0 to 1973d62 on March 8, 2026 18:08
Member Author

@mattisonchao mattisonchao left a comment


Self-review: PR #937 (List retry)

Code correctness

Retry logic — Follows the same backoff.RetryNotify + FindLeaderHint pattern as readBatch.doRequestWithRetries and writeBatch.doRequestWithRetries. The retryCtx (with requestTimeout) is correctly used for both the backoff and the RPC calls.

Duplicate prevention — The dataSent flag ensures we only retry before any data has been written to the channel. Once data flows, a retriable error becomes permanent. This is the right trade-off: we handle the common case (not-leader on initial connect) and avoid the complex case (mid-stream retry with deduplication).

Channel close semantics — listFromShard never closes the channel, same as the original. The caller (List()) handles closing via close(ch) in the single-shard case and via WaitGroup + close(ch) in the multi-shard case. No behavior change here.

Error wrapping — backoff.Permanent(err) wraps the error, but backoff.RetryNotify unwraps PermanentError before returning. The caller sees the original error.
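The wrap/unwrap round trip can be sketched with stdlib-only stand-ins. permanentError and retry here are hypothetical local types, not the real backoff package, but the real backoff.PermanentError behaves analogously via Unwrap:

```go
package main

import (
	"errors"
	"fmt"
)

// permanentError is a local stand-in for backoff's PermanentError wrapper.
type permanentError struct{ err error }

func (p *permanentError) Error() string { return p.err.Error() }
func (p *permanentError) Unwrap() error { return p.err }

func permanent(err error) error { return &permanentError{err: err} }

// errConflict stands in for a non-retriable application error.
var errConflict = errors.New("key conflict")

// retry runs op once here; on a permanent error it unwraps before
// returning, so the caller sees the original error rather than the wrapper.
func retry(op func() error) error {
	err := op()
	var p *permanentError
	if errors.As(err, &p) {
		return p.Unwrap()
	}
	return err
}

func main() {
	err := retry(func() error { return permanent(errConflict) })
	fmt.Println(err == errConflict) // the wrapper is stripped before return
}
```

This is why callers of listFromShard can compare against sentinel errors directly instead of unwrapping themselves.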

Observations

  1. requestTimeout may be tight for large List operations returning millions of keys. However, this is consistent with read/write batch behavior and can be tuned via WithRequestTimeout.

  2. Both PRs export IsRetriable independently — whichever merges second will have a no-op change on rpc_errors.go. Git merge will handle this cleanly.

  3. Test setup is duplicated from existing TestLeaderHintWithoutClient/TestLeaderHintWithClient. Could extract a helper, but keeping it consistent with the existing test style.

Verdict

LGTM after the retryCtx fix (already pushed).

List operations previously had no retry logic and passed nil as the
leader hint. When the client connected to a non-leader node, the
request would fail immediately instead of retrying with the correct
leader.

This adds retry with exponential backoff and leader hint extraction
to listFromShard, following the same pattern used by read and write
batches. To avoid duplicate keys, retries are only attempted before
any data has been sent to the channel.

Also exports IsRetriable from the batch package so it can be reused
by the List and future RangeScan retry logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: mattisonchao <mattisonchao@gmail.com>
@mattisonchao mattisonchao force-pushed the feat/list-leader-hint-retry branch from 1973d62 to 7a99e77 on March 8, 2026 18:24
The assignment dispatcher on CI may take longer to propagate shard
assignments. Increase the Eventually timeout to avoid flaky failures.

Signed-off-by: Mattison <mattison@streamnative.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: mattisonchao <mattisonchao@gmail.com>
@mattisonchao mattisonchao merged commit 28f8484 into main Mar 9, 2026
11 of 12 checks passed
@mattisonchao mattisonchao deleted the feat/list-leader-hint-retry branch March 9, 2026 00:42