tq: add exponential backoff for retries #4097
Conversation
The branch was force-pushed from 84737a5 to 61396e7.
bk2204 left a comment
Hey, thanks for the patch, and welcome to Git LFS!
This seems like a reasonable change to make and I'm fine with the default you've provided. I don't think ten seconds is exceptionally burdensome if the alternative is failure, especially considering we run operations in parallel. I had a few questions and comments, but overall I'm happy with the direction this is going.
```go
			errMsg = ": " + err.Error()
		}
		tracerx.Printf("tq: enqueue retry #%d after %.2fs for %q (size: %d)%s", count, delay, t.Oid, t.Size, errMsg)
		next = append(next, t)
```
I take it this is written inline and not as an independent method on the queue so that we can close over next.
Yes, but I can also make it an independent method on the queue that takes a *batch if you prefer that style.
It's fine as it is. I just wanted to be clear on what's going on.
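For context, here is a compact, self-contained sketch of the pattern being discussed: the retry-enqueue logic is written as a closure precisely so that it can append to the batch-local next slice. The objectTuple fields, the closure's signature, and the use of fmt.Printf in place of tracerx.Printf are illustrative assumptions, not the actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// objectTuple stands in for the queue's per-object record; fields are assumed.
type objectTuple struct {
	Oid  string
	Size int64
}

func main() {
	var next []*objectTuple // objects to include in the next batch request

	// Written inline rather than as an independent method on the queue so
	// that it can close over next, as clarified in the exchange above.
	enqueueRetry := func(t *objectTuple, err error, count int, delay time.Duration) {
		var errMsg string
		if err != nil {
			errMsg = ": " + err.Error()
		}
		fmt.Printf("tq: enqueue retry #%d after %.2fs for %q (size: %d)%s\n",
			count, delay.Seconds(), t.Oid, t.Size, errMsg)
		next = append(next, t)
	}

	enqueueRetry(&objectTuple{Oid: "deadbeef", Size: 42}, nil, 1, 250*time.Millisecond)
	fmt.Println(len(next)) // 1
}
```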
The branch was force-pushed from f856b97 to ebf274f.
bk2204 left a comment
Thanks for the fast turnaround. This looks great except for one CI failure on Go 1.12 I've mentioned below. Once that's fixed, I think this will be ready to merge.
The branch was force-pushed from b28d399 to 1aefcb1.
Thanks for the reviews. I fixed my new test and rebased on

I think this just happens to be a test that randomly failed, since it looks like it worked on the other systems. I've tried to kick off a rebuild, but it turns out the Actions workflow doesn't like that. If you can just
The branch was force-pushed from 1aefcb1 to 76f2997.
This adds new assertion methods used in future tests.
Previously, retries happened as fast as possible unless the server provided the Retry-After header. This is effective for certain types of errors, but not when the LFS server is experiencing a temporary but not instantaneous failure. Delaying between retries lets the server recover and the LFS operation complete. Delays start at a fixed 250ms for the first retry and double with each successive retry up to a configurable maximum delay, 10s by default. The maximum retry delay is configurable using lfs.transfer.maxretrydelay. Delays can be disabled by setting the max delay to 0.
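As a rough illustration of the schedule described above (not the actual patch code; the function name and shape here are assumptions), the delay before each retry could be computed like this:

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the delay before the given retry attempt (1-based):
// 250ms for the first retry, doubling each time, capped at maxDelay, with a
// maxDelay of 0 disabling delays entirely. Illustrative sketch only.
func backoffDelay(attempt int, maxDelay time.Duration) time.Duration {
	if maxDelay == 0 {
		return 0
	}
	delay := 250 * time.Millisecond
	for i := 1; i < attempt; i++ {
		delay *= 2
	}
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}

func main() {
	for attempt := 1; attempt <= 8; attempt++ {
		fmt.Printf("retry #%d: wait %v\n", attempt, backoffDelay(attempt, 10*time.Second))
	}
	// Prints 250ms, 500ms, 1s, 2s, 4s, 8s, 10s, 10s.
}
```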
The branch was force-pushed from 76f2997 to cf02216.
bk2204 left a comment
Thanks again for the patch.
A prior commit in this PR resolves a bug where a 429 response to an upload or download request causes a Go panic in the client if the response lacks a Retry-After header. The same condition, when it occurs in the response to a batch API request, does not trigger a Go panic; instead, we simply fail without retrying the batch API request at all. This stands in contrast to how we now handle 429 responses for object uploads and downloads when no Retry-After header is provided, because in that case we perform multiple retries, following the exponential backoff logic introduced in PR git-lfs#4097. This difference stems in part from the fact that the download() function of the basicDownloadAdapter structure and the DoTransfer() function of the basicUploadAdapter structure both handle 429 responses by first calling the NewRetriableLaterError() function of the "errors" package to try to parse any Retry-After header, and if that returns nil, then calling the NewRetriableError() function, so they always return some form of retriable error after a 429 status code is received.

We therefore modify the handleResponse() method of the Client structure in the "lfshttp" package to likewise always return a retriable error of some kind after a 429 response. If a Retry-After header is found and can be parsed, a retriableLaterError (from the "errors" package) is returned; otherwise, a generic retriableError is returned.

This change is not sufficient on its own, however. When the batch API returns 429 responses without a Retry-After header, the transfer queue now retries its requests following the exponential backoff logic, as we expect. If one of those retries eventually succeeds, though, the batch is still processed as if it had encountered an unrecoverable failure, and the Git LFS client ultimately returns a non-zero exit code. This occurs because the enqueueAndCollectRetriesFor() method of the TransferQueue structure in the "tq" package sets the flag which causes it to return an error for the batch both when an object in the batch cannot be retried (because it has reached its retry limit) and when an object in the batch can be retried but no specific retry wait time was provided by a retriableLaterError. The latter case is what is now triggered when the batch API returns a 429 status code and no Retry-After header.

In commit a3ecbcc of PR git-lfs#4573 this code was updated to improve how batch API 429 responses with Retry-After headers are handled, building on the original code introduced in PR git-lfs#3449 and some fixes in PR git-lfs#3930. That commit added the flag, named hasNonScheduledErrors, which is set if any object in a batch that experiences an error either cannot be retried or can be retried but has no specific wait time as provided by a Retry-After header. If the flag is set, then the error encountered during the processing of the batch is returned by the enqueueAndCollectRetriesFor() method, and although it is wrapped by the NewRetriableError() function, because an error is returned rather than nil it is collected into the errors channel of the queue by the collectBatches() caller method, which ultimately causes the client to report the error and return a non-zero exit code. By contrast, the handleTransferResult() method of the TransferQueue structure treats retriable errors from individual object uploads and downloads in the same way whether or not the error carries a specified wait time.
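A minimal, self-contained analog of the handleResponse() behaviour described above may help; the real code lives in the lfshttp and errors packages and uses its own error types, so the names below are illustrative only, and the real Retry-After parser also accepts an HTTP-date rather than just a number of seconds.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// retryDecision stands in for the retriableLaterError / retriableError pair
// described above; it is illustrative only.
type retryDecision struct {
	Retriable bool
	WaitFor   time.Duration // zero when no usable Retry-After header was sent
}

// classify429 mirrors the described change: a 429 always yields something
// retriable, with a concrete delay only when Retry-After can be parsed.
func classify429(header http.Header) retryDecision {
	if secs, err := strconv.Atoi(header.Get("Retry-After")); err == nil && secs >= 0 {
		return retryDecision{Retriable: true, WaitFor: time.Duration(secs) * time.Second}
	}
	return retryDecision{Retriable: true} // fall back to exponential backoff
}

func main() {
	h := http.Header{}
	d := classify429(h)
	fmt.Println(d.Retriable, d.WaitFor) // true 0s: retry with exponential backoff

	h.Set("Retry-After", "3")
	d = classify429(h)
	fmt.Println(d.Retriable, d.WaitFor) // true 3s: retry after the server-specified delay
}
```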
To bring our handling of batch API requests into alignment with this approach, we can simply avoid setting the flag variable when a batch encounters an error and an object can be retried but has no specified wait time. We also rename the flag variable to hasNonRetriableObjects, which better reflects its meaning, since it signals that at least one object in the batch cannot be retried. As well, we update some related comments to clarify the current actions and intent of this section of code in the enqueueAndCollectRetriesFor() method.

We then add a test to the t/t-batch-retries-ratelimit.sh test suite like the ones we added to the t/t-batch-storage-retries-ratelimit.sh script in a previous commit in this PR. The test relies on a new sentinel value in the test repository name, which we now recognize in our lfstest-gitserver test server and which causes the test server to return a 429 response to batch API requests without a Retry-After header. This test fails unless both of the changes made in this commit are applied, confirming that we now handle 429 batch API responses that lack Retry-After headers.
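A toy model of the flag change described above, under the assumption that only the hasNonRetriableObjects name matches the real code; the surrounding types are invented for illustration:

```go
package main

import "fmt"

// retryCandidate is an invented stand-in for a failed object in a batch.
type retryCandidate struct {
	oid         string
	canRetry    bool // false once the object has exhausted its retry limit
	hasWaitTime bool // true when a Retry-After delay accompanied the failure
}

// collect mimics the updated behaviour: hasNonRetriableObjects is set only
// for objects that can no longer be retried; retriable objects are
// re-enqueued whether or not they carry a server-specified wait time.
func collect(batch []retryCandidate) (next []retryCandidate, hasNonRetriableObjects bool) {
	for _, t := range batch {
		if !t.canRetry {
			hasNonRetriableObjects = true
			continue
		}
		next = append(next, t)
	}
	return next, hasNonRetriableObjects
}

func main() {
	next, failed := collect([]retryCandidate{
		{oid: "aaa", canRetry: true, hasWaitTime: false}, // retried despite no Retry-After
		{oid: "bbb", canRetry: false},                    // exhausted its retries
	})
	fmt.Println(len(next), failed) // 1 true
}
```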
In a subsequent commit in this PR we will resolve a problem whereby we do not honour the delay specified in a Retry-After header when it is received along with a 429 HTTP status code in response to a request to upload or download an individual Git LFS object. As part of this change, we will update several of the relevant tests in our t/t-batch-storage-retries-ratelimit.sh test script to confirm that we only retry our requests to the server after the interval specified by the Retry-After header has elapsed.

Prior to making these changes, we first update two other tests in the t/t-batch-storage-retries-ratelimit.sh test script to check that multiple retry attempts are made when no Retry-After header is sent with a 429 response to an object upload or download request. This behaviour from the Git LFS client is correct, as we expect it to use an exponential backoff retry strategy under these conditions, a feature that was implemented in PR git-lfs#4097. The checks we add to the "batch storage upload causes retries (missing header)" and "batch storage download causes retries (missing header)" tests verify that we see multiple "retrying object" and "enqueue retry" trace log messages during their "git lfs push" and "git lfs pull" commands, respectively. Because only a single Git LFS object is transferred to the server by either of these tests, it is sufficient to check that the "enqueue retry" messages appear with both "#1" and "#2" retry counts to confirm that they all represent retries of that single object. We cannot yet add checks like these to the other tests in the t/t-batch-storage-retries-ratelimit.sh test script, though, until we resolve the problem described above, namely that at present we fail to honour the delay specified in a Retry-After header received in response to an individual Git LFS object transfer request.

We can, however, add checks of the number of retry attempts to some of the tests in the t/t-batch-retries-ratelimit.sh test script, as these tests validate the retry attempts made when a 429 response is received to a Git LFS Batch API request rather than an individual object transfer request. Because the Git LFS client honours the delay specified in a Retry-After header sent with a 429 response to a Batch API request, we can verify that, in this case, only a single retry is performed for each object once the specified interval has elapsed, so we add checks of this behaviour to our "batch upload causes retries" and "batch upload with multiple files causes retries" tests. For the latter test, we verify that although we see two "enqueue retry" trace log messages, they both have the retry count "#1" and so must apply to different Git LFS objects, as we expect.

We can also confirm that multiple retries for a single object are performed when a Batch API request receives a 429 response without a Retry-After header, so we add checks of this behaviour to our "batch upload causes retries (missing header)" test. In this instance we expect the client to use the same exponential backoff retry strategy implemented in PR git-lfs#4097, and we use the same type of checks as in the "(missing header)" tests in the t/t-batch-storage-retries-ratelimit.sh test script.
As reported in git-lfs#6007, at present the Git LFS client does not honour the delay specified in a Retry-After header when it is received along with a 429 HTTP status code in response to a request to upload or download an individual Git LFS object. Since PR git-lfs#4097 we have employed an exponential backoff retry strategy after receiving a 429 response without a Retry-After header. That PR also modified the logic used to handle retries of object transfer requests which receive a 429 response with a Retry-After header and a defined delay interval or fixed retry time, and in making these modifications it may have inadvertently caused the problem described above.

Starting with PR git-lfs#3449, when a 429 response with a Retry-After header is received in reply to an object transfer request, the handleTransferResult() method of the TransferQueue structure in our "tq" package sets the ReadyTime element of the objectTuple structure representing the object to reflect the interval or fixed time value specified in the Retry-After header. The handleTransferResult() method then sends this objectTuple to the "retries" channel passed to the method as an argument. The enqueueAndCollectRetriesFor() method then reads this channel and appends each objectTuple structure to the set of objects that will be enumerated in the next request to the Git LFS Batch API. Prior to PR git-lfs#4097 this was done with a simple call to the built-in append() Go function inside the loop over the "retries" channel at the end of the enqueueAndCollectRetriesFor() method.

In PR git-lfs#4097 this method was refactored so that it defines an anonymous function to append an objectTuple to the set of objects assigned to the next batch request, assigns this anonymous function to the enqueueRetry variable, and then calls the function via the variable whenever a retry request should be made for a given object, including within the loop over the "retries" channel at the end of the method. In this specific use case the function is called with a nil value for its readyTime argument, presumably because the objectTuple's ReadyTime element may already contain the time value determined from a Retry-After header. However, unlike the earlier implementation from PR git-lfs#3449, where the objectTuple was simply appended to the set of objects for the next batch request, the anonymous function assigned to the enqueueRetry variable always resets the objectTuple's ReadyTime value, either to the value of the readyTime function argument, if that is not nil, or to a time value calculated by the ReadyTime() method of the retryCounter structure element of the TransferQueue structure.

This latter condition, where the retryCounter structure's ReadyTime() method is used to set the objectTuple's ReadyTime element, is what implements the exponential backoff retry strategy added in PR git-lfs#4097. But this behaviour is now used even when the objectTuple has a valid ReadyTime value already set by the handleTransferResult() method, which causes the Git LFS client to fail to honour the delays specified in Retry-After headers sent with 429 responses to individual object transfer requests. The same problem does not occur when a Retry-After header is sent with a 429 response to a Batch API request, because in that case the enqueueRetry variable's anonymous function is called with a non-nil readyTime argument by the enqueueAndCollectRetriesFor() method when it handles the error returned by the Batch() function.
The regression only affects responses to individual object transfers because the enqueueAndCollectRetriesFor() method now always calls the enqueueRetry variable's anonymous function with a nil value for its readyTime argument when the method is looping over the "retries" channel. To resolve this problem, we cannot simply adjust the anonymous function to respect any pre-existing value in the objectTuple's ReadyTime element, as this may be in the past if more than one retry request needs to be made for an object. That is, the value in the ReadyTime element may have been set by a previous call to the anonymous function, and so we need to reset it in every such call.

Instead, we introduce a new retryLaterTime element to the objectTuple structure, and set it in the handleTransferResult() method rather than the ReadyTime element. Next, in the enqueueRetry variable's anonymous function, we check whether this value is non-zero, and if so, we use it to populate the ReadyTime element of the objectTuple. We then set the retryLaterTime element back to the zero-equivalent time value so it will not be reused in subsequent calls for any future retry requests for the same object.

With these changes in place, we can then update the three tests in our t/t-batch-storage-retries-ratelimit.sh test script where we now expect to see a single pair of "retrying object" and "enqueue retry" trace log messages generated by the tests' "git lfs push", "git lfs pull", and "git lfs clone" commands. These checks follow the same approach we introduced to other tests in this test script and the t/t-batch-retries-ratelimit.sh test script in several prior commits in this PR. We were not previously able to add such checks to the three tests modified in this commit because they would have revealed that multiple retry requests were being made for the single Git LFS object in each test, since the Git LFS client was ignoring the delay specified in the Retry-After headers sent by our lfstest-gitserver test utility program. With the changes in this commit, we can now add checks to these tests to confirm that only one retry request is made in each test.
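A small, self-contained model of the retryLaterTime handling described above may be useful; the real objectTuple and enqueueRetry closure live in the tq package and differ in their details, so this is a sketch of the priority order only.

```go
package main

import (
	"fmt"
	"time"
)

// objectTuple is a simplified stand-in for the tq package's structure.
type objectTuple struct {
	Oid            string
	ReadyTime      time.Time
	retryLaterTime time.Time // set from a Retry-After header on a 429 response
}

// enqueueRetry sketches the described priority order: an explicit readyTime
// argument wins, then any pending retryLaterTime (which is consumed so it is
// not reused on later retries), and otherwise the exponential-backoff delay.
func enqueueRetry(t *objectTuple, readyTime *time.Time, backoff time.Duration, next *[]*objectTuple) {
	switch {
	case readyTime != nil:
		t.ReadyTime = *readyTime
	case !t.retryLaterTime.IsZero():
		t.ReadyTime = t.retryLaterTime
		t.retryLaterTime = time.Time{} // reset so future retries use backoff
	default:
		t.ReadyTime = time.Now().Add(backoff)
	}
	*next = append(*next, t)
}

func main() {
	var next []*objectTuple
	t := &objectTuple{Oid: "abc", retryLaterTime: time.Now().Add(3 * time.Second)}
	enqueueRetry(t, nil, 250*time.Millisecond, &next)
	// The Retry-After delay is honoured instead of the 250ms backoff.
	fmt.Println(time.Until(t.ReadyTime) > time.Second) // true
}
```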
In PR git-lfs#3449 we added support for the detection of 429 HTTP status code responses to either Git LFS Batch API requests or individual object transfer requests, with an implementation which expected a Retry-After header with a specified retry time or delay period. Commit 2197f76 of that PR introduced two trace log output statements which attempted to report the delay period calculated from the Retry-After header, one each in the enqueueAndCollectRetriesFor() and handleTransferResult() methods of the TransferQueue structure in our "tq" package. Both of these statements attempted to format a value of the Duration type (from the "time" package of the Go standard library) into a number of seconds, expressed as a string, for their trace log messages. However, the Seconds() method of the Duration type returns a float64, so the trace log messages would contain format error markers along with the floating-point value, for instance, "%!s(float64=8.999544366) seconds".

The format string for one of these trace log output statements was later updated in commit cf02216 of PR git-lfs#4097, when an exponential backoff technique was introduced to better respect servers which sent responses with 429 status codes but no Retry-After headers. Specifically, the trace log message in the enqueueAndCollectRetriesFor() method was revised to replace the "%s" formatting verb for the float64 value returned by the Duration type's Seconds() method with a "%.2f" verb, which correctly converts the float64 value into a string.

We now correct the related trace log message in the handleTransferResult() method to use the same "%.2f" formatting verb for the value returned by the Duration type's Seconds() method. We also adjust the text of the message slightly to align more closely with the corresponding message in the enqueueAndCollectRetriesFor() method.
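The following runnable snippet reproduces the broken and corrected formatting; the duration value is arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	d := 8999544366 * time.Nanosecond

	// Using %s with the float64 returned by Seconds() yields a format error marker:
	fmt.Printf("%s seconds\n", d.Seconds()) // %!s(float64=8.999544366) seconds

	// Using %.2f, as enqueueAndCollectRetriesFor() already does, is correct:
	fmt.Printf("%.2f seconds\n", d.Seconds()) // 9.00 seconds
}
```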
Background
Our GitHub Enterprise server occasionally returns 500 errors for certain objects for a few seconds at a time (we've seen up to 10) and then recovers. We're tracking this separately with GitHub support, but the problem is made worse by LFS's retry behavior. Because retries happen as fast as possible, even a higher-than-default number of retries is not enough to get past these temporary errors.
Proposed Changes
This PR adds exponential backoff for retries with a configurable maximum delay and enables it by default. The delay starts at 250ms and doubles with each retry up to a maximum of 10 seconds. The new lfs.transfer.maxretrydelay property controls the maximum value and can also disable delays entirely if set to 0. Ten seconds is an arbitrary maximum, but it seemed like a good compromise between waiting long enough for a server to recover and waiting so long that LFS appears to stall. As before, if a server provides the Retry-After header, that value is used instead and the retry is not counted against the limit.

With the default configuration (8 retries, 10s max delay), a failed operation will now take 35.75 seconds before failing. I'm not sure whether this is too long, so it may make sense to adjust either the number of retries, the base delay, or the maximum delay.
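For reference, with a 250ms base delay and a 10s cap, the eight successive delays are 0.25s, 0.5s, 1s, 2s, 4s, 8s, 10s, and 10s (the last two capped), which sum to the 35.75 seconds quoted above.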
Other Considerations