Retry Requests with HTTP Status 429#4048
Conversation
| // MARK: - Private | ||
|
|
||
| private extension HTTPClient { | ||
| internal extension HTTPClient { |
There was a problem hiding this comment.
Modified the access modified here so I could do some testing
There was a problem hiding this comment.
Would using the @Testable import RevenueCat thing have made this visible for testing? I can't exactly remember the correc usecase for it but figured I'd ask 😇
There was a problem hiding this comment.
I wish! Unfortunately @testable only grants you access to internal stuff, not private
There was a problem hiding this comment.
honestly any time we're making private stuff internal for testing, it's a sign that it should likely be extracted into a separate internal entity instead. Not only for easier testing, but also to reduce the amount of stuff a single object (in this case, HTTPClient) is doing
| } | ||
|
|
||
| /// The number of times that we have retried the request | ||
| var retryCount: UInt = 0 |
There was a problem hiding this comment.
I'm using a UInt instead of an Int here so there won't be any negative values 😎
| } | ||
|
|
||
| internal extension HTTPClient.ExponentialRetryOptions { | ||
| static let `default` = HTTPClient.ExponentialRetryOptions( |
There was a problem hiding this comment.
I'm totally open to changing these default values to something else if we want - this is just what came to the top of my head first!
There was a problem hiding this comment.
I wonder if it would make sense to have different defaults per endpoint type... As in, maybe we don't need to retry some non-critical requests like diagnostics, product entitlement mapping,...
|
|
||
| // MARK: - Request Retry Logic | ||
| extension HTTPClient { | ||
| internal func retryRequestIfNeeded( |
There was a problem hiding this comment.
This is the only function I wasn't able to directly test - let me know if y'all have any ideas! 😄
There was a problem hiding this comment.
just double-checking that this method has been tested directly now and the comment is outdated, right?
There was a problem hiding this comment.
Yeah, I added some unit tests to test it directly, and it's also covered under the integration tests now!
|
For end-to-end testing, I'd suggest using some proxy server like this one that you can control and set stuff like "give me a 429 for the next request only" |
| case isLoadShedder = "X-RevenueCat-Fortress" | ||
| case requestID = "X-Request-ID" | ||
| case amazonTraceID = "X-Amzn-Trace-ID" | ||
| case retryAfter = "Retry-After" |
There was a problem hiding this comment.
todo: confirm header name with backend devs
| // retryCount == self.retryOptions.maxNumberOfRetries | ||
| guard request.retryCount > self.retryOptions.maxNumberOfRetries else { return } | ||
| guard let httpURLResponse = httpURLResponse else { return } | ||
| guard shouldRetryRequest(withStatusCode: httpURLResponse.httpStatusCode) else { return } |
There was a problem hiding this comment.
I'd probably make this the first check since conceptually it seems more important than the others
There was a problem hiding this comment.
we're always returning on these, maybe we could just chain them with && and ,?
There was a problem hiding this comment.
then again we may wanna log, especially for the case where we've reached max retries on 429
There was a problem hiding this comment.
Logging here when we've reached the max retry count is a great idea - I'll add one in and combine the other two guard statements.
| retryCount: UInt | ||
| ) -> TimeInterval { | ||
| // If this is our first retry and there's no eTag header, ensure there's no delay | ||
| if httpURLResponse.allHeaderFields[ResponseHeader.eTag] == nil && retryCount == 1 { |
There was a problem hiding this comment.
if we just make the min backoff time zero we can skip this bit
There was a problem hiding this comment.
I'd imagine that most of the time, if you hit a 429 and retry immediately, by the time you hit the backend again it'll already be solved
There was a problem hiding this comment.
Nice, removed!
| internal func calculateDefaultExponentialBackoffTimeInterval(withRetryCount retryCount: UInt) -> TimeInterval { | ||
| // Never wait for an original request | ||
| guard retryCount > 0 else { return 0 } | ||
|
|
||
| let baseBackoff = self.retryOptions.baseBackoff | ||
| if baseBackoff < 1 { | ||
| // Since base is less than 1, we need to calculate the backoff by multiplying the baseBackoff by | ||
| // 2^(retryCount-1). This will result in: | ||
| // | ||
| // retryCount 1: backoff = baseBackoff * 2^0 = 0.25 | ||
| // retryCount 2: backoff = baseBackoff * 2^1 = 0.5 | ||
| // retryCount 3: backoff = baseBackoff * 2^2 = 1.0 | ||
| return min(baseBackoff * pow(2.0, Double(retryCount - 1)), self.retryOptions.maxBackoff) | ||
| } else { | ||
| // If baseBackoff is greater than or equal to 1, just use baseBackoff^retryCount: | ||
| // retryCount 1: backoff = baseBackoff^1 | ||
| // retryCount 2: backoff = baseBackoff^2 | ||
| // ... | ||
| return min(pow(baseBackoff, Double(retryCount)), self.retryOptions.maxBackoff) | ||
| } |
There was a problem hiding this comment.
we only have 3 possible values in that we only retry up to three times, we might as well just explicitly set the values
| ).to(equal(1)) | ||
| } | ||
|
|
||
| } |
There was a problem hiding this comment.
You can also use OHTTPStubs to fake responses from the backend and test.
The files named somethingBackendTests have plenty of examples you can mostly reuse, especially BackendPostReceiptDataTests
There was a problem hiding this comment.
feels like we should get at least one backend post receipt test in with the exact case we want to cover (first request is a 429, second goes through).
We can also use our BackendIntegrationTests suite to actually force a 429, remove the forced 429, post something to the backend, ensure it went through.
OfflineStoreKitIntegrationTests has some examples of doing something similar
|
great work on this! |
| let retriableStatusCodes: Set<HTTPStatusCode> | ||
| let baseBackoff: TimeInterval | ||
| let maxBackoff: TimeInterval | ||
| let maxNumberOfRetries: UInt |
There was a problem hiding this comment.
I was wondering, do we need both a maxBackoff and maxNumberOfRetries? One or the other could be computed of the other + baseBackoff right? Edit: Sorry I was reviewing this. I see it's different. So I think it's ok to leave both!
There was a problem hiding this comment.
In my updates to some of the other PR comments, I've hardcoded the backoff intervals. After that change was done, most of this struct was redundant so it's been removed!
|
@aboedo I've updated the PR with updates from your comments! Specifically, I've made the following changes:
|
| case .timeInterval(let timeInterval): | ||
| return max(timeInterval, 0) |
There was a problem hiding this comment.
I can't think of any instances where the timeInterval would be <0 in the scope of this PR, but this is just defensive programming in case someone ever tries to pass in a negative TimeInterval as a delay in the future
| TimeInterval(0.25), | ||
| TimeInterval(0.5) |
There was a problem hiding this comment.
can't say that I have evidence to back this up, but I'd maybe have the backoff be a bit more aggressive, this does in the best case make three requests in the same second
There was a problem hiding this comment.
I'd be closer to 0.5 and 1 sec, or even 0.75 and 3
There was a problem hiding this comment.
That works for me! I'll make the adjustments to use 0sec, 0.75sec, and 3sec. 👍
| if !requestRetryScheduled { | ||
| request.completionHandler?(response) | ||
| } |
There was a problem hiding this comment.
are we sending a retry attempt number header or something so we can debug these in the future?
There was a problem hiding this comment.
also maybe we can add this to SDK diagnostics (cc @tonidero )
There was a problem hiding this comment.
Not yet, but this is an excellent thing to add. I'll look into adding the retry attempt number to the headers!
There was a problem hiding this comment.
Right, we are already tracking http requests response codes:
.We could add the retry count there. I'm a bit reticent to add another tag to the metric in Grafana (we are being mindful of increasing the cardinality of the data), but it might be useful especially if/when we move to using snowflake or similar for this data.
If we want to do this, we just need to append the new property as a valid property for the http_request_performed diagnostic event in the backend and we can start sending it from the SDK. Lmk if you want me to tackle this!
|
|
||
| // MARK: - Request Retry Logic | ||
| extension HTTPClient { | ||
| internal func retryRequestIfNeeded( |
There was a problem hiding this comment.
just double-checking that this method has been tested directly now and the comment is outdated, right?
| /// - Parameter retryCount: The count of the retry attempt, starting from 1. | ||
| /// - Returns: The backoff time interval for the given retry count. If the retry count exceeds | ||
| /// the predefined list of backoff intervals, it returns 0. | ||
| internal func defaultExponentialBackoffTimeInterval(withRetryCount retryCount: UInt) -> TimeInterval { |
There was a problem hiding this comment.
a lot of stuff here is made internal for testing.
If we really need to test these directly, we should move to a different dedicated entity
There was a problem hiding this comment.
[can be addressed as a follow-up]
There was a problem hiding this comment.
also this is now simple enough that maybe it doesn't merit its own method - it just looks up a value in an array and has a default if it goes out of bounds.
it could even be simplified further like
guard retryCount >= 0 && retryCount < retryBackoffIntervals.count else { return retryBackoffIntervals.last ?? 0 }
return self.retryBackoffIntervals[backoffIntervalIndex]There was a problem hiding this comment.
removed the function!
| case .none: return 0 | ||
| case .`default`: return 0 | ||
| case .long: return Self.maxJitter | ||
| case .timeInterval(let timeInterval): |
There was a problem hiding this comment.
if we don't need the jitter, we might as well make another method in operation dispatcher that just takes an exact value as TimeInterval rather as an overload to the one we're using instead of passing in a min and max value that are the same
There was a problem hiding this comment.
I think the argument can be made either way. If we were to make that change, we'd then have two similar dispatch functions that happen after a period of time:
func dispatchOnWorkerThread(delay: Delay)func dispatchOnWorkerThread(after timeInterval: TimeInterval)
When I read these function signatures, it isn't clear which one I should choose because the function signatures signify the same thing.
If we were to go through with this change, I'd recommend renaming Delay to something like JitterableDelay to indicate the difference that one of these APIs includes a jitter and the other doesn't. I'm down to make this change - any objections to that renaming?
There was a problem hiding this comment.
Update: I've gone ahead and renamed Delay to JitterableDelay and created the new func dispatchOnWorkerThread(after timeInterval: TimeInterval) function!
| self.logger.verifyMessageWasLogged("Queued request GET /v1/subscribers/identify for retry in 0.0 seconds.") | ||
| self.logger.verifyMessageWasLogged("Queued request GET /v1/subscribers/identify for retry in 0.25 seconds.") | ||
| self.logger.verifyMessageWasLogged("Queued request GET /v1/subscribers/identify for retry in 0.5 seconds.") | ||
| self.logger.verifyMessageWasLogged("Request GET /v1/subscribers/identify failed all 3 retries.") |
There was a problem hiding this comment.
these are quite dangerous in that the minute we change the logs we mess up tests. That'd be fine if we were trying to specifically test that we're logging, but here we're testing "Performs all retries if always gets retryable status code", so we should instead make sure to test for that.
We test for request count to hit 4, maybe we could also mock the operation dispatcher as a separate test and ensure that it's getting the right delays in?
There was a problem hiding this comment.
Love this! I've moved the log assertions to their own tests and created separate test to verify that the correct delay times are being sent to the OperationDispatcher.
| ).to(equal(1)) | ||
| } | ||
|
|
||
| } |
There was a problem hiding this comment.
feels like we should get at least one backend post receipt test in with the exact case we want to cover (first request is a 429, second goes through).
We can also use our BackendIntegrationTests suite to actually force a 429, remove the forced 429, post something to the backend, ensure it went through.
OfflineStoreKitIntegrationTests has some examples of doing something similar
| return "Request was handled by load shedder: \(path.relativePath)" | ||
|
|
||
| case let .api_request_queued_for_retry(httpMethod, path, backoffInterval): | ||
| return "Queued request \(httpMethod) \(path) for retry in \(backoffInterval) seconds." |
There was a problem hiding this comment.
Would it make sense to log the retry number in this message as well?
There was a problem hiding this comment.
Love this, will add it in!
|
@MarkVillacampa pointed out that the forcedError stuff in the integration tests can probably be accomplished much more easily and safely with OhHTTPStubs, so I'm going to look at doing that! :D |
tonidero
left a comment
There was a problem hiding this comment.
LGTM! But might be good to get some other checks as well
|
@MarkVillacampa I've made the following changes from your review:
|
| ] | ||
| } | ||
|
|
||
| func testVerifyPurchaseDoesntGrantEntitlementsAfter429RetriesExhausted() async throws { |
There was a problem hiding this comment.
Is there a way we can test that exactly 3 retries were made here?
There was a problem hiding this comment.
Discussed this with Mark in Slack - we added in verifications for the number of times that the request was failed before eventually succeeding with the backend
| func dispatchOnWorkerThread(delay: JitterableDelay = .none, block: @escaping @Sendable () -> Void) { | ||
| if delay.hasDelay { | ||
| self.workerQueue.asyncAfter(deadline: .now() + delay.random(), execute: block) | ||
| } else { | ||
| self.workerQueue.async(execute: block) | ||
| } | ||
| } | ||
|
|
||
| func dispatchOnWorkerThread(delay: Delay = .none, block: @escaping @Sendable () async -> Void) { | ||
| func dispatchOnWorkerThread(after timeInterval: TimeInterval, block: @escaping @Sendable () -> Void) { |
There was a problem hiding this comment.
This might be a small Josh nit but delay and after seem like very similar argument names and I would probably be confused by trying to figure out the difference between these 😅
Not a thing that has to change in the PR but just a thought 🤷♂️
There was a problem hiding this comment.
I totally agree with you! The only difference is that delay: JitterableDelay doesn't let you specify the exact delay and includes a random jitter, while after timeInterval: TimeInterval allows you to specify the exact waiting time without a random jitter
There was a problem hiding this comment.
maybe we can rename the param to jitterableDelay:
It might be slightly confusing on first glance but at least you'd hesitate long enough to actually think of which one to use
| // MARK: - Private | ||
|
|
||
| private extension HTTPClient { | ||
| internal extension HTTPClient { |
There was a problem hiding this comment.
Would using the @Testable import RevenueCat thing have made this visible for testing? I can't exactly remember the correc usecase for it but figured I'd ask 😇
| import OHHTTPStubs | ||
| import OHHTTPStubsSwift |
There was a problem hiding this comment.
Nit: I know very little about what our standard is for organizing import but seems like these could be grouped without a new line inbetween 🤷♂️
| func dispatchOnWorkerThread(delay: JitterableDelay = .none, block: @escaping @Sendable () -> Void) { | ||
| if delay.hasDelay { | ||
| self.workerQueue.asyncAfter(deadline: .now() + delay.random(), execute: block) | ||
| } else { | ||
| self.workerQueue.async(execute: block) | ||
| } | ||
| } | ||
|
|
||
| func dispatchOnWorkerThread(delay: Delay = .none, block: @escaping @Sendable () async -> Void) { | ||
| func dispatchOnWorkerThread(after timeInterval: TimeInterval, block: @escaping @Sendable () -> Void) { |
There was a problem hiding this comment.
maybe we can rename the param to jitterableDelay:
It might be slightly confusing on first glance but at least you'd hesitate long enough to actually think of which one to use
| // MARK: - Private | ||
|
|
||
| private extension HTTPClient { | ||
| internal extension HTTPClient { |
There was a problem hiding this comment.
honestly any time we're making private stuff internal for testing, it's a sign that it should likely be extracted into a separate internal entity instead. Not only for easier testing, but also to reduce the amount of stuff a single object (in this case, HTTPClient) is doing
| } | ||
|
|
||
| // MARK: - Request Retry Logic | ||
| extension HTTPClient { |
There was a problem hiding this comment.
This extension has some logic that might be better suited to its own file, such that the only stuff that's remaining in HTTPClient is retryRequestIfNeeded
There was a problem hiding this comment.
Doesn't need to be addressed in this PR, but would be a good follow-up
Motivation
In the past, the SDK has ignored HTTP Status 429s. This PR introduces a retry ability with an exponential backoff policy to the SDK's HTTPClient so that 429s can be retried.
Description
Configuration
The implementation allows consumers of the HTTP client to specify which HTTP status codes to retry for. The exponential backoff policy will retry requests up to three times, with the following backoff times:
Server-Defined Backoff Durations
If the server returns the
Retry-Afterheader with an integer value, the SDK will use that integer as the number of milliseconds to wait before retrying again instead of computing a backoff duration.New
X-Retry-AfterHeader For Outbound Network RequestsThis PR introduces a new
X-Retry-Afterheader for all outbound network requests, whose value is an integer. For the initial request (with no retries), the value is 0. For the first retry, the value is 1. For the second retry, the value is 2, and so on.Etag Management
To keep our etag behavior consistent, the backoff duration for a first retry will always be 0 seconds.
Logging
A debug message in the format
"Queued request \(httpMethod) \(path) for retry number \(retryNumber) in \(backoffInterval) seconds."is logged when a request is queued for retry. If a request fails all retries, a message with the format"Request \(httpMethod) \(path) failed all \(retryCount) retries."will be logged.Testing
A good number of unit tests have been written, for both the individual parts of the HTTP client introduced in this PR and for testing the entire request flow through the HTTPClient.