feat(spanner): retry spanner transactions and mutations when RST_STREAM error#6699
feat(spanner): retry spanner transactions and mutations when RST_STREAM error#6699
Conversation
rahul2393
commented
Sep 20, 2022
- Clients need to retry the Spanner transactions in case of RST_STREAM error returned from the backend
…AM internal errors is returned from backend.
| // the error was Session not found. | ||
| func runWithRetryOnAbortedOrSessionNotFound(ctx context.Context, f func(context.Context) error) error { | ||
| retryer := onCodes(DefaultRetryBackoff, codes.Aborted) | ||
| retryer := onCodes(DefaultRetryBackoff, codes.Aborted, codes.Internal) |
There was a problem hiding this comment.
This will make it retry on any Internal error, also those that are not retryable. This could cause an infinite (or at least very long-lasting) loop if someone tries to execute a transaction that happens to hit a feature that returns an internal error.
| // creations/retrivals. | ||
| return ts, err | ||
| // Make a retryer for Aborted and certain Internal errors. | ||
| retryer := onCodes(DefaultRetryBackoff, codes.Aborted, codes.Internal) |
There was a problem hiding this comment.
This also could cause an infinite loop.
| // returned by Cloud Spanner, or if none is returned, the calculated delay with | ||
| // a minimum of 10ms and maximum of 32s. There is no delay before the retry if | ||
| // the error was Session not found. | ||
| func runWithRetryOnAbortedOrSessionNotFound(ctx context.Context, f func(context.Context) error) error { |
There was a problem hiding this comment.
Could we have a couple of tests for the changes in this PR that verify that transactions are retried if the RST_STREAM error is returned, but not for any other random internal error?
There was a problem hiding this comment.
@olavloite Test "TestClient_ApplyAtLeastOnce_NonRetryableInternalErrors" added,
Please take a look at the code again.
This could cause an infinite => This is not true. The following code only says that:
if it is an internal error and its error message is not in the list, then return false.
If it is an internal error and it error message in the list, then let it pass to the next check r.Retryer.Retry(err).
It early returns false for non-retryable internal errors before the real retry check.
If the user has specified to not retry internal errors, then r.Retryer.Retry(err) will return false.
There was a problem hiding this comment.
Ah, I missed that, thanks! The specific error messages were already there, we were just not using them anymore (and I guess you are now going to dig up that I was the one that actually removed it ;-) ).