feat(storage): implement RetryChunkDeadline for grpc writes#11476
Conversation
tritone
left a comment
There was a problem hiding this comment.
Overall looks good, a few minor things
| policy RetryPolicy | ||
| shouldRetry func(err error) bool | ||
| maxAttempts *int | ||
| maxDuration time.Duration |
There was a problem hiding this comment.
So it looks like this isn't passed in by anything except writer currently. Maybe add a comment to note that?
| policy RetryPolicy | ||
| shouldRetry func(err error) bool | ||
| maxAttempts *int | ||
| maxDuration time.Duration |
There was a problem hiding this comment.
Also let's call it maxRetryDuration since it doesn't actually limit the total duration of calls.
There was a problem hiding this comment.
Done.
It stutters a bit (retryConfig.maxRetryDuration) but that was my initial instinct as well. It's clearer anyway since retryConfig can have any name.
| return (errors.As(err, &httpErr) && httpErr.Code == httpStatusCode) || | ||
| (errors.As(err, &grpcErr) && grpcErr.GRPCStatus().Code() == grpcStatusCode) | ||
| (errors.As(err, &grpcErr) && grpcErr.GRPCStatus().Code() == grpcStatusCode) || | ||
| status.Code(err) == grpcStatusCode |
There was a problem hiding this comment.
Why does this need 3 cases?
There was a problem hiding this comment.
Testbench errors are Status errors but not apierror.APIErrors, so the added test fails when checking the status code without this change
There was a problem hiding this comment.
Hmm. That seems like an issue with wrapping at some layer. But in any case L6974 should work for all grpc errors (both real service and testbench)
| if lastErr == nil { | ||
| return true, errors.New("storage: request not sent, choose a larger value for the retry deadline") | ||
| } | ||
| return true, fmt.Errorf("storage: retry timed out after %v attempts; last error: %w", attempts, lastErr) |
There was a problem hiding this comment.
This err should include the value of the timeout as well.
Branched from #11475
Adds a timer to the upload buffer. This is equivalent to the http implementation.
Changed from implementing directly in the upload code to implementing in
invoke.go. This makes it extensible (if we ever want to add a similar deadline to other calls), but also more prone to affecting other ops (just to be clear, I don't think it will given testing/how it is written). Reason for changing it is because we need to be able to stop the retries while still wrapping the error. If we do it a level above, if we return a "deadline done, here is last error" wrapping the error, it will continue to retry if the last error is retriable.