
Conversation

@chrisd8088
Member

This PR resolves the problem reported in #6007 regarding the Git LFS client's failure to honour the delay interval or fixed retry time specified in Retry-After headers, if the headers are sent with 429 HTTP response status codes in reply to individual Git LFS object transfer requests.

In addition, we revise a number of related tests so they now check the number of retry attempts made after 429 status code responses to both Batch API requests and individual object transfer requests, and then add the same type of checks to the appropriate tests to verify that the bug fix in this PR is effective.

As well, we fix several tests of our 429 status code response handling during git lfs clone operations. At present, these tests only receive responses from our test server with 429 status codes and Retry-After headers when pushing objects to establish the test conditions, and not when performing the subsequent clone operations the tests are intended to validate.

This PR will be most easily reviewed on a commit-by-commit basis.

Resolves #6007.
/cc @max-wittig as reporter.

@chrisd8088 chrisd8088 changed the title Honour Retry-After headers Honour 429 object transfer response Retry-After headers Mar 14, 2025
@max-wittig

@chrisd8088 I don't know the codebase too well to be of help here, but if I should test something let me know!

@chrisd8088
Member Author

I don't know the codebase too well to be of help here, but if I should test something let me know!

Thanks, @max-wittig! It may be a while before I'm able to move this out of draft mode and get a review, but in the meantime, if you want to download the appropriate CI job artifacts for your platform (maybe the macOS ones?) and give the git-lfs binary a try in your environment, that would be much appreciated! The CI job summary page has an "Artifacts" section down at the bottom-right, and you can pick the appropriate Zip archive from there.

@chrisd8088 chrisd8088 added the bug label Mar 21, 2025
@chrisd8088 chrisd8088 marked this pull request as ready for review March 22, 2025 00:08
@chrisd8088 chrisd8088 requested a review from a team as a code owner March 22, 2025 00:08
Prior to updating the tests in our t/t-batch-retries-ratelimit.sh
and t/t-batch-storage-retries-ratelimit.sh test scripts, which we expect
to do in subsequent commits in this PR, we first correct the text of
the error messages output by several tests if the "git lfs clone"
command were to fail unexpectedly.

We also adjust the indentation of the lines which output these messages,
along with a few other lines in the same source files where we have
tab characters in place of space characters.
In a subsequent commit in this PR we will resolve a problem whereby
we do not honour the delay specified in a Retry-After header when
it is received along with a 429 HTTP status code in response to a
request to upload or download an individual Git LFS object.

As part of this change, we will update several of the relevant tests
in our t/t-batch-storage-retries-ratelimit.sh test script to confirm
that we only retry our requests to the server after the interval
specified by the Retry-After header has elapsed.

Prior to making these changes, we first update two other tests
in the t/t-batch-storage-retries-ratelimit.sh test script to check
that multiple retry attempts are made when no Retry-After header is
sent with a 429 response to an object upload or download request.
This behaviour from the Git LFS client is correct, as we expect it
to use an exponential backoff retry strategy under these conditions,
a feature that was implemented in PR git-lfs#4097.
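
As a rough illustration of such an exponential backoff schedule, the
following minimal sketch doubles the delay with each retry; the function
name, base delay, and output are illustrative and not the actual tq
package code.

    package main

    import (
        "fmt"
        "time"
    )

    // backoffReadyTime sketches an exponential backoff schedule: the delay
    // before the next attempt doubles with each retry already made.
    func backoffReadyTime(retryCount int, baseDelay time.Duration) time.Time {
        delay := baseDelay * time.Duration(1<<uint(retryCount)) // 1x, 2x, 4x, ...
        return time.Now().Add(delay)
    }

    func main() {
        for retry := 0; retry < 4; retry++ {
            wait := time.Until(backoffReadyTime(retry, 250*time.Millisecond))
            fmt.Printf("retry #%d: wait roughly %v\n", retry, wait.Round(10*time.Millisecond))
        }
    }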

The checks we add to the "batch storage upload causes retries (missing
header)" and "batch storage download causes retries (missing header)"
tests verify that we see multiple "retrying object" and "enqueue retry"
trace log messages during their "git lfs push" and "git lfs pull"
commands, respectively.  Because only a single Git LFS object is
transferred to the server by either of these tests, it is sufficient
for us to check that the "enqueue retry" messages appear with both
"#1" and "git-lfs#2" retry counts to confirm that all these represent retries
for that single object.

We cannot yet add checks like these to the other tests in the
t/t-batch-storage-retries-ratelimit.sh test script, though, until we
resolve the problem described above, namely that at present we fail to
honour the delay specified in a Retry-After header received in response
to an individual Git LFS object transfer request.

We can, however, add checks of the number of retry attempts to some
of the tests in the t/t-batch-retries-ratelimit.sh test script, as
these tests validate the retry attempts made when a 429 response is
received to a Git LFS Batch API request, not an individual object
transfer request.

Because the Git LFS client honours the delay specified in a Retry-After
header sent with a 429 response to a Batch API request, we can verify
that in this case only a single retry is performed for each object
(once the specified interval has elapsed), so we add checks of this behaviour
to our "batch upload causes retries" and "batch upload with multiple
files causes retries" tests.  For the latter test, we verify that
although we see two "enqueue retry" trace log messages, they both have
the retry count "#1" and so must apply to different Git LFS objects,
as we expect.

We can also confirm that multiple retries for a single object are
performed when a Batch API request receives a 429 response without a
Retry-After header, so we add checks of this behaviour to our "batch
upload causes retries (missing header)" test.  In this instance we
expect the same exponential backoff retry strategy implemented in
PR git-lfs#4097 to be used by the client, and use the same type of checks
we used in the "(missing header)" tests in the
t/t-batch-storage-retries-ratelimit.sh test script.
In commit 4fe0a8d of PR git-lfs#3449 we
updated our lfstest-gitserver test utility program to simulate rate
limits on specific resources, allowing it to be used to validate the
changes made in that PR to respect 429 HTTP status code responses
sent by servers to the Git LFS client.

In the lfstest-gitserver utility, the checkRateLimit() function
is called when specific resources are requested by the client, and
it determines whether the test server should allow the request to
proceed as normal or whether a 429 response should be returned.

When each resource is first requested, a timer is enabled by the
checkRateLimit() function and the function returns values to the
caller indicating that a 429 response with a Retry-After header should
be sent to the client.  Once the timer has expired, the function
permits a fixed number of subsequent requests for the same resource
to proceed without delay, after which the timer is re-enabled and
future requests will again receive 429 responses.
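
As a rough model of this behaviour, the sketch below keeps a "blocked
until" time and a count of remaining unthrottled requests per resource;
the names and the refillTokenCount value here are illustrative rather than
the actual lfstest-gitserver code, and the standard "time" package is
assumed to be imported.

    // rateLimitState models the per-resource state described above.
    type rateLimitState struct {
        blockedUntil time.Time // requests before this time receive 429 responses
        remaining    int       // unthrottled requests left before the timer re-arms
    }

    const refillTokenCount = 5 // illustrative value only

    // check reports whether a request may proceed and, if not, how long the
    // client should wait.  The first request arms the timer and is rejected;
    // once the timer expires, refillTokenCount requests pass freely, after
    // which the timer is re-armed and requests are again rejected.
    func (s *rateLimitState) check(now time.Time, window time.Duration) (bool, time.Duration) {
        if now.Before(s.blockedUntil) {
            return false, s.blockedUntil.Sub(now) // still rate-limited
        }
        if s.remaining > 0 {
            s.remaining-- // within the post-expiry allowance
            return true, 0
        }
        s.blockedUntil = now.Add(window) // re-arm the timer
        s.remaining = refillTokenCount
        return false, window
    }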

To track the timer and request count values for each resource,
the checkRateLimit() function constructs a key from three of its
four arguments.  However, the function's first argument, named "api",
is at present left unused.

This oversight stems from the design of the similar
incrementRetriesFor() function, which also constructs per-resource keys to
help track request counts.  These counts are used when validating the
retry behaviour of the Git LFS client after it receives repeated
5xx HTTP status codes while uploading or downloading objects.

In both functions, the "api" argument plays no part in the construction
of the per-resource key strings.  The incrementRetriesFor() function
does, however, use the argument to construct a value it checks
against the entry, if any, in the oidHandlers map for a given
object ID.  This map determines whether a specific object's content
(and therefore Git LFS SHA-256 object ID) indicates that the test
utility server should behave differently for that object.  In the
case of the incrementRetriesFor() function, the map entry determines
whether a retry count should be kept and reported for repeated requests
for the same object.

Arguably, we could simply drop the "api" argument from the set
of values passed to the checkRateLimit() function.  However, since
all our callers of both this function and the incrementRetriesFor()
function take care to pass an appropriate value for this argument,
either "batch" or "storage", we instead just incorporate the "api"
argument into the per-resource keys constructed by both functions,
and refactor this key generation into a small getResourceKey()
helper function.
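
A minimal sketch of such a helper; the argument names and separator are
assumptions drawn from the description above rather than the actual
lfstest-gitserver signature, and the standard "strings" package is
assumed to be imported.

    // getResourceKey joins the API name with the other identifying values so
    // that "batch" and "storage" requests for the same repository and object
    // no longer share a rate-limit or retry-count entry.
    func getResourceKey(api, direction, repo, oid string) string {
        return strings.Join([]string{api, direction, repo, oid}, ":")
    }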

This change should allow for the checkRateLimit() and
incrementRetriesFor() functions to be used more broadly in the
future, without concern that we may accidentally replicate the
same resource keys across tests of different APIs.
The "batch clone causes retries" test was added to our
t/t-batch-retries-ratelimit.sh test script in commit
540d93a of PR git-lfs#3449, with the
intent that it should validate the Git LFS client's handling of 429
HTTP status code responses to individual object transfer requests
during clone operations.

Later, in commit 32f571d of PR git-lfs#4573,
this test was revised to instead validate how the client handles
429 responses to Git LFS Batch API requests during clone operations.
In the same commit, the "batch clone with multiple files causes retries"
test was also added, which performs the same checks but when more than
one Git LFS object is being transferred.

Like the other tests in this test script, these tests depend on our
lfstest-gitserver utility program to track the requests made for
certain resources and generate 429 responses under specific conditions.

In particular, when a resource is first requested, a timer is enabled
by the checkRateLimit() function and the function returns values to
the caller indicating that a 429 response with a Retry-After header
should be sent to the client.  Once the timer has expired, the function
permits a fixed number of subsequent requests for the same resource
to proceed without delay, after which the timer is re-enabled and
future requests will again receive 429 responses.

In the case of the "batch clone causes retries" and "batch clone with
multiple files causes retries" tests, the tests clone a bare repository,
create Git LFS objects in it, and push that change to the test server.
The tests then clone the repository again, and expect that during this
second clone the server will apply its rate-limiting behaviour.  The
intent of the tests is that the Git LFS client should be forced to retry
its Batch API requests during this clone operation until it no longer
receives 429 responses from the test server.

However, in practice, the push operation that uploads the Git LFS objects
to the server is the one which is rate-limited and delayed by the test
server, because the checkRateLimit() function starts its per-resource
timer the first time it is invoked for a given resource.  (The resource
is defined by just the API name "batch" and the test repository name,
as the other elements of the resource key are left blank in the calls
to the function for these repositories.)  The second clone operation
then proceeds without delay and without receiving a 429 response.

To resolve this problem with the tests, we add a /limits/ endpoint
to our lfstest-gitserver program which accepts query arguments
defining a given resource key, and a value to set as that key's
remaining number of non-rate-limited requests.  If the value provided
is the string "max", the default maximum value as defined by the
refillTokenCount constant is used.
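
For illustration only, such an endpoint handler might look roughly like
the sketch below, which reuses the illustrative rateLimitState and
getResourceKey names from the earlier sketches; the query parameter names
and the handler shape are hypothetical, not the actual lfstest-gitserver
implementation, and the "net/http" and "strconv" packages are assumed to
be imported.

    var rateLimits = map[string]*rateLimitState{} // illustrative per-resource store

    func limitsHandler(w http.ResponseWriter, r *http.Request) {
        q := r.URL.Query()
        key := getResourceKey(q.Get("api"), q.Get("direction"), q.Get("repo"), q.Get("oid"))

        count := refillTokenCount // "max" selects the default maximum
        if v := q.Get("value"); v != "max" {
            n, err := strconv.Atoi(v)
            if err != nil {
                http.Error(w, "invalid value", http.StatusBadRequest)
                return
            }
            count = n
        }

        st, ok := rateLimits[key]
        if !ok {
            st = &rateLimitState{}
            rateLimits[key] = st
        }
        st.remaining = count // reset the remaining non-rate-limited requests
        w.WriteHeader(http.StatusOK)
    }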

Next, we define a set_server_rate_limit() test helper function in
our t/testhelpers.sh shell library, which makes a GET request to our
new /limits/ endpoint and passes its input arguments as query
parameters.

We can then invoke this helper function twice in our "batch clone
causes retries" and "batch clone with multiple files causes retries"
tests, the first time before pushing the commit with the new Git LFS
objects, and the second time afterwards.  For the first call we provide
a "max" value for the number of available non-rate-limited requests,
ensuring the push proceeds without delay, and for the second call we
reset the number of available non-rate-limited requests to zero, so
that the clone operation will now receive a 429 response to its Git
LFS Batch API request and be forced to wait and retry that request.

Finally, this allows us to check the number of retries performed by
the client during the second clone operation.  To do this we utilize
the same approach we introduced in a prior commit in this PR for
several other tests in the same test script, where we check the
number of times the "enqueue retry" trace log message is output.
For the test with multiple Git LFS objects, we confirm that we see
this message exactly once for each object, and only with the "#1"
retry count value.

These checks require that we set the GIT_TRACE environment variable
for the "git lfs clone" commands and capture the commands' output
into clone.log files.  Because we use shell pipelines and the tee(1)
command for this purpose, we also have to adjust how we check the
exit code from the "git lfs clone" commands since they are the first
command in the pipelines.
As reported in git-lfs#6007, at present the Git LFS client does not honour
the delay specified in a Retry-After header when it is received along
with a 429 HTTP status code in response to a request to upload or
download an individual Git LFS object.

Since PR git-lfs#4097 we have employed an exponential backoff retry strategy
after receiving a 429 response without a Retry-After header.  That
PR also modified the logic used to handle retries of object transfer
requests after they receive a 429 response with a Retry-After header
and a defined delay interval or fixed retry time, and in making these
modifications it may have inadvertently caused the problem described
above.

Starting with PR git-lfs#3449, when a 429 response with a Retry-After header
is received in reply to an object transfer request, the
handleTransferResult() method of the TransferQueue structure in our
"tq" package sets the ReadyTime element of the objectTuple structure
representing the object to reflect the interval or fixed time value
specified in the Retry-After header.  The handleTransferResult() method
then sends this objectTuple to the "retries" channel passed to the
method as an argument.

The enqueueAndCollectRetriesFor() method then reads this channel
and appends each objectTuple structure to the set of objects that
will be enumerated in the next request to the Git LFS Batch API.
Prior to PR git-lfs#4097 this was done with a simple call to the built-in
append() Go function inside the loop over the "retries" channel
at the end of the enqueueAndCollectRetriesFor() method.

In PR git-lfs#4097 this method was refactored so that it defines an
anonymous function to append an objectTuple to the set of objects
assigned to the next batch request, assigns this anonymous
function to the enqueueRetry variable, and then calls the function
via the variable whenever a retry request should be made for a
given object, including within the loop over the "retries" channel
at the end of the method.

In this specific use case the function is called with a nil value
for its readyTime argument, presumably because the objectTuple's
ReadyTime element may already contain the time value determined
from a Retry-After header.

However, unlike the earlier implementation from PR git-lfs#3449, where
the objectTuple was simply appended to the set of objects for the
next batch request, the anonymous function assigned to the enqueueRetry
variable always resets the objectTuple's ReadyTime value, either to
the value of the readyTime function argument, if that is not nil,
or to a time value calculated by the ReadyTime() method of the
retryCounter structure element of the TransferQueue structure.

This latter condition, where the retryCounter structure's ReadyTime()
method is used to set the objectTuple's ReadyTime element, is what
implements the exponential backoff retry strategy added in PR git-lfs#4097.

But this behaviour is now utilized even when the objectTuple has
a valid ReadyTime value already set by the handleTransferResult()
method, and so this causes the Git LFS client to fail to honour
the delays specified in Retry-After headers sent with 429 responses
to individual object transfer requests.
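
Paraphrased, the pre-fix closure behaves roughly as follows; identifiers
such as next and retryWait are placeholders for the actual slice and the
retryCounter-based calculation in the tq package, and this is a sketch of
the described behaviour rather than a verbatim excerpt.

    // Pre-fix sketch: ReadyTime is always overwritten, discarding any
    // Retry-After time that handleTransferResult() already recorded on the
    // tuple for a 429 object-transfer response.
    enqueueRetry := func(t *objectTuple, readyTime *time.Time) {
        if readyTime != nil {
            // Batch API path: the Retry-After time is passed in explicitly.
            t.ReadyTime = *readyTime
        } else {
            // Object-transfer path: exponential backoff is applied even when
            // t.ReadyTime was already set from a Retry-After header.
            t.ReadyTime = retryWait(t)
        }
        next = append(next, t)
    }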

The same problem does not occur when a Retry-After header is sent
with a 429 response to a Batch API request because in that case
the enqueueRetry variable's anonymous function is called with a
non-nil readyTime argument by the enqueueAndCollectRetriesFor()
method when it handles the error returned by the Batch() function.

The regression only affects responses to individual object transfers
because the enqueueAndCollectRetriesFor() method now always calls
the enqueueRetry variable's anonymous function with a nil value
for its readyTime argument when the method is looping over the
"retries" channel.

To resolve this problem, we cannot simply adjust the anonymous
function to respect any pre-existing value in the objectTuple's
ReadyTime element, as this may be in the past if more than one
retry request needs to be made for an object.  That is, the value
in the ReadyTime element may have been set by a previous call
to the anonymous function, and so we need to reset it in every
such call.

Instead, we introduce a new retryLaterTime element to the
objectTuple structure, and set it in the handleTransferResult()
method rather than the ReadyTime element.  Next, in the enqueueRetry
variable's anonymous function, we check if this value is non-zero,
and if so, we use it to populate the ReadyTime element of the
objectTuple.  We then set the retryLaterTime element back to the
zero-equivalent time value so it will not be reused in subsequent
calls for any future retry requests for the same object.
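
Sketched against the description above and the hunk quoted in the review
below, the closure now prefers any pending Retry-After time before falling
back to the backoff schedule; retryWait and next remain placeholder names.

    enqueueRetry := func(t *objectTuple, readyTime *time.Time) {
        if readyTime == nil {
            if !t.retryLaterTime.IsZero() {
                // Honour the Retry-After time recorded by handleTransferResult(),
                // then clear it so any later retries for this object fall back
                // to the exponential backoff schedule.
                t.ReadyTime = t.retryLaterTime
                t.retryLaterTime = time.Time{}
            } else {
                t.ReadyTime = retryWait(t)
            }
        } else {
            t.ReadyTime = *readyTime
        }
        next = append(next, t)
    }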

With these changes in place, we can then update the three tests in
our t/t-batch-storage-retries-ratelimit.sh test script where we now
expect to see a single pair of "retrying object" and "enqueue retry"
trace log messages generated by the tests' "git lfs push",
"git lfs pull", and "git lfs clone" commands.

These checks follow the same approach we introduced to other tests
in this test script and the t/t-batch-retries-ratelimit.sh test
script in several prior commits in this PR.  We were not previously
able to add such checks to the three tests modified in this commit
because they would have revealed that multiple retry requests were
being made for the single Git LFS object in each test, since the
Git LFS client was ignoring the delay specified in the Retry-After
headers sent by our lfstest-gitserver test utility program.  With
the changes in this commit, we can now add checks to these tests
to confirm that only one retry request is made in each test.
In PR git-lfs#3449 we added support for the detection of 429 HTTP status code
responses to either Git LFS Batch API requests or individual object
transfer requests, with an implementation which expected a Retry-After
header with a specified retry time or delay period.  Commit
2197f76 of that PR introduced two
trace log output statements which attempted to report the delay
period calculated from the Retry-After header, one each in the
enqueueAndCollectRetriesFor() and handleTransferResult() methods of
the TransferQueue structure in our "tq" package.

Both of these statements attempted to format a value of the Duration
type (from the "time" package of the Go standard library) into a number
of seconds, expressed as a string, for their trace log messages.

However, the type returned by the Seconds() method of the Duration type
is a float64 type, so the trace log messages would contain format
error markers along with the floating-point value, for instance,
"%!s(float64=8.999544366) seconds".

The format string for one of these trace log output statements was
later updated in commit cf02216
of PR git-lfs#4097, when an exponential backoff technique was introduced to
better respect servers which sent responses with 429 status codes
but no Retry-After headers.  Specifically, the trace log message
in the enqueueAndCollectRetriesFor() method was revised to replace
the "%s" formatting verb for the float64 return type from the
Duration type's Seconds() method with a "%.2f" verb, which correctly
converts the float64 value into a string.

We now correct the related trace log message in the
handleTransferResult() method to use the same "%.2f" formatting verb
for the float64 value returned by the Duration type's Seconds() method.
We also
adjust the text of the message slightly to align more closely with the
corresponding message in the enqueueAndCollectRetriesFor() method.
@chrisd8088 chrisd8088 force-pushed the honour-object-retry-later branch from 1fee9ea to e9fb2d8 Compare March 22, 2025 15:43

if readyTime == nil {
	if !t.retryLaterTime.IsZero() {
		t.ReadyTime = t.retryLaterTime
Member

What happens if retryLaterTime is far in the future? Would git-lfs seem to hang? Should we restrict this to something sane like a maximum wait time of 10min or something?

Member Author

That's a good question. We already respect the Retry-After header when handling Batch API responses, so the scenario you describe is possible now, even without this PR's changes.

If we want to guard against excessive Retry-After settings, I think we should probably make that enhancement in another PR, since this one is intended to address the bug in #6007 in particular. We probably also want to allow for some kind of configuration of whatever cutoff we establish, so users can choose to allow longer waits than our default maximum of, say, 10 minutes, or even to just always honour any Retry-After setting if they want. And of course we'd want to expand our test suite to validate this new functionality with both Batch API and individual object transfer requests.

As for an actual implementation of such a maximum on Retry-After values, we could insert it at the end of the TransferQueue.canRetryObjectLater() method, so it checks and limits the time value it otherwise simply returns, or, alternatively, check and limit the time value calculated when we parse the Retry-After itself (noting that there are two valid return paths in the NewRetriableLaterError() function, so we'd need to refactor it so both passed through the new code that implemented the limit).
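
(For illustration, such a cap might be as simple as the following sketch;
the helper name, its placement, and the ten-minute default are hypothetical
and not part of this PR.)

    // clampRetryAfter limits how far in the future a Retry-After-derived
    // retry time may be, so an excessive server value cannot make the client
    // appear to hang.
    func clampRetryAfter(retryAt time.Time, maxWait time.Duration) time.Time {
        if limit := time.Now().Add(maxWait); retryAt.After(limit) {
            return limit
        }
        return retryAt
    }

    // e.g. readyTime = clampRetryAfter(readyTime, 10*time.Minute)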

All of which is to say, I think I'll go ahead and merge this PR, but make a note that we might want to revisit this question in a subsequent PR.

@chrisd8088 chrisd8088 merged commit d5db3c8 into git-lfs:main Apr 10, 2025
10 checks passed
@chrisd8088 chrisd8088 deleted the honour-object-retry-later branch April 10, 2025 19:22