Retry Requests with HTTP Status 429 by fire-at-will · Pull Request #4048 · RevenueCat/purchases-ios

fire-at-will · 2024-07-12T15:14:02Z

Motivation

In the past, the SDK has ignored HTTP Status 429s. This PR introduces a retry ability with an exponential backoff policy to the SDK's HTTPClient so that 429s can be retried.

Description

Configuration

The implementation allows consumers of the HTTP client to specify which HTTP status codes to retry for. The exponential backoff policy will retry requests up to three times, with the following backoff times:

First retry: 0sec
Second retry: 0.75sec
Third retry: 3sec
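
A minimal sketch of how a fixed schedule like this could be looked up. The names `retryBackoffIntervals` and `backoffInterval(forRetryCount:)` are hypothetical and chosen for illustration; retry counts are assumed to start at 1, and out-of-range counts fall back to 0 here.

```swift
import Foundation

// Hypothetical names for illustration; the values are the ones listed above.
let retryBackoffIntervals: [TimeInterval] = [0, 0.75, 3]

/// Returns the wait before the given retry (1-based), or 0 when the count is out of range.
func backoffInterval(forRetryCount retryCount: UInt) -> TimeInterval {
    guard retryCount >= 1, retryCount <= UInt(retryBackoffIntervals.count) else { return 0 }
    return retryBackoffIntervals[Int(retryCount) - 1]
}
```

With only three retries, an explicit table is easier to read and to test than a computed exponential formula.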

Server-Defined Backoff Durations

If the server returns the Retry-After header with an integer value, the SDK will use that integer as the number of milliseconds to wait before retrying again instead of computing a backoff duration.
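
A rough sketch of parsing such a header value, not the SDK's actual code. The helper name `serverBackoff(fromRetryAfter:unitsPerSecond:)` is hypothetical. The description above states milliseconds, while a later commit in this PR is titled "treat Retry-After header value as seconds", so the unit is left as a parameter here rather than assumed.

```swift
import Foundation

/// Hypothetical helper: converts a `Retry-After` header value into a wait in seconds.
/// Pass `unitsPerSecond: 1000` if the integer is milliseconds, or `1` if it is seconds.
/// Returns nil when the header is missing, not an integer, or negative.
func serverBackoff(fromRetryAfter value: String?, unitsPerSecond: Double = 1000) -> TimeInterval? {
    guard let value = value,
          let amount = Int(value.trimmingCharacters(in: .whitespaces)),
          amount >= 0 else { return nil }
    return TimeInterval(amount) / unitsPerSecond
}
```

When this returns nil, the caller would fall back to the computed backoff duration.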

New `X-Retry-After` Header For Outbound Network Requests

This PR introduces a new X-Retry-After header for all outbound network requests, whose value is an integer. For the initial request (with no retries), the value is 0. For the first retry, the value is 1. For the second retry, the value is 2, and so on.
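
As an illustration only, the counter could be attached to a request's headers along these lines. The function `headers(_:retryCount:)` is a hypothetical helper, and a plain dictionary stands in for the SDK's request type.

```swift
import Foundation

/// Hypothetical helper: returns a copy of `base` with the retry counter added.
/// 0 means the initial request, 1 the first retry, and so on.
func headers(_ base: [String: String], retryCount: UInt) -> [String: String] {
    var result = base
    result["X-Retry-After"] = String(retryCount)
    return result
}
```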

Etag Management

To keep our etag behavior consistent, the backoff duration for a first retry will always be 0 seconds.

Logging

A debug message in the format "Queued request \(httpMethod) \(path) for retry number \(retryNumber) in \(backoffInterval) seconds." is logged when a request is queued for retry. If a request fails all retries, a message with the format "Request \(httpMethod) \(path) failed all \(retryCount) retries." will be logged.

Testing

A good number of unit tests have been written, for both the individual parts of the HTTP client introduced in this PR and for testing the entire request flow through the HTTPClient.

fire-at-will · 2024-07-12T19:24:11Z

 // MARK: - Private

-private extension HTTPClient {
+internal extension HTTPClient {


Modified the access modifier here so I could do some testing

Would using the @testable import RevenueCat thing have made this visible for testing? I can't exactly remember the correct use case for it but figured I'd ask 😇

I wish! Unfortunately @testable only grants you access to internal stuff, not private

honestly any time we're making private stuff internal for testing, it's a sign that it should likely be extracted into a separate internal entity instead. Not only for easier testing, but also to reduce the amount of stuff a single object (in this case, HTTPClient) is doing

fire-at-will · 2024-07-12T19:24:40Z

+        }
+
+        /// The number of times that we have retried the request
+        var retryCount: UInt = 0


I'm using a UInt instead of an Int here so there won't be any negative values 😎

fire-at-will · 2024-07-12T19:25:22Z

+}
+
+internal extension HTTPClient.ExponentialRetryOptions {
+    static let `default` = HTTPClient.ExponentialRetryOptions(


I'm totally open to changing these default values to something else if we want - this is just what came to the top of my head first!

I wonder if it would make sense to have different defaults per endpoint type... As in, maybe we don't need to retry some non-critical requests like diagnostics, product entitlement mapping,...

fire-at-will · 2024-07-12T19:35:13Z


+// MARK: - Request Retry Logic
+extension HTTPClient {
+    internal func retryRequestIfNeeded(


This is the only function I wasn't able to directly test - let me know if y'all have any ideas! 😄

just double-checking that this method has been tested directly now and the comment is outdated, right?

Yeah, I added some unit tests to test it directly, and it's also covered under the integration tests now!

aboedo · 2024-07-12T20:01:39Z

For end-to-end testing, I'd suggest using some proxy server like this one that you can control and set stuff like "give me a 429 for the next request only"

aboedo · 2024-07-12T20:10:10Z

        case isLoadShedder = "X-RevenueCat-Fortress"
        case requestID = "X-Request-ID"
        case amazonTraceID = "X-Amzn-Trace-ID"
+        case retryAfter = "Retry-After"


todo: confirm header name with backend devs

aboedo · 2024-07-12T20:15:28Z

+        // retryCount == self.retryOptions.maxNumberOfRetries
+        guard request.retryCount > self.retryOptions.maxNumberOfRetries else { return }
+        guard let httpURLResponse = httpURLResponse else { return }
+        guard shouldRetryRequest(withStatusCode: httpURLResponse.httpStatusCode) else { return }


I'd probably make this the first check since conceptually it seems more important than the others

we're always returning on these, maybe we could just chain them with && and ,?

then again we may wanna log, especially for the case where we've reached max retries on 429

Logging here when we've reached the max retry count is a great idea - I'll add one in and combine the other two guard statements.
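
One way the combined checks and the extra log could look, as a sketch under assumptions: `RetryDecision` and `decideRetry` are hypothetical stand-ins rather than the SDK's types, and the log text is simplified. The status-code checks are merged into one guard, while the retry limit stays separate so that the exhausted case can be logged.

```swift
import Foundation

// Hypothetical stand-in for the outcome of the retry checks.
struct RetryDecision: Equatable {
    let shouldRetry: Bool
    let logMessage: String?
}

func decideRetry(
    retryCount: UInt,
    maxNumberOfRetries: UInt,
    statusCode: Int?,
    retriableStatusCodes: Set<Int> = [429]
) -> RetryDecision {
    // Combined check: a response must exist and carry a retriable status code.
    guard let statusCode = statusCode, retriableStatusCodes.contains(statusCode) else {
        return RetryDecision(shouldRetry: false, logMessage: nil)
    }
    // Kept separate so that running out of retries can be logged.
    guard retryCount < maxNumberOfRetries else {
        return RetryDecision(shouldRetry: false,
                             logMessage: "Request failed all \(maxNumberOfRetries) retries.")
    }
    return RetryDecision(shouldRetry: true, logMessage: nil)
}
```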

aboedo · 2024-07-12T20:23:02Z

+        retryCount: UInt
+    ) -> TimeInterval {
+        // If this is our first retry and there's no eTag header, ensure there's no delay
+        if httpURLResponse.allHeaderFields[ResponseHeader.eTag] == nil && retryCount == 1 {


if we just make the min backoff time zero we can skip this bit

I'd imagine that most of the time, if you hit a 429 and retry immediately, by the time you hit the backend again it'll already be solved

Nice, removed!

aboedo · 2024-07-12T20:24:18Z

+    internal func calculateDefaultExponentialBackoffTimeInterval(withRetryCount retryCount: UInt) -> TimeInterval {
+        // Never wait for an original request
+        guard retryCount > 0 else { return 0 }
+
+        let baseBackoff = self.retryOptions.baseBackoff
+        if baseBackoff < 1 {
+            // Since base is less than 1, we need to calculate the backoff by multiplying the baseBackoff by
+            // 2^(retryCount-1). This will result in:
+            //
+            // retryCount 1: backoff = baseBackoff * 2^0 = 0.25
+            // retryCount 2: backoff = baseBackoff * 2^1 = 0.5
+            // retryCount 3: backoff = baseBackoff * 2^2 = 1.0
+            return min(baseBackoff * pow(2.0, Double(retryCount - 1)), self.retryOptions.maxBackoff)
+        } else {
+            // If baseBackoff is greater than or equal to 1, just use baseBackoff^retryCount:
+            // retryCount 1: backoff = baseBackoff^1
+            // retryCount 2: backoff = baseBackoff^2
+            // ...
+            return min(pow(baseBackoff, Double(retryCount)), self.retryOptions.maxBackoff)
+        }


we only have 3 possible values in that we only retry up to three times, we might as well just explicitly set the values

aboedo · 2024-07-12T20:28:59Z

+        ).to(equal(1))
+    }
+
+}


You can also use OHTTPStubs to fake responses from the backend and test.

The files named somethingBackendTests have plenty of examples you can mostly reuse, especially BackendPostReceiptDataTests

feels like we should get at least one backend post receipt test in with the exact case we want to cover (first request is a 429, second goes through).

We can also use our BackendIntegrationTests suite to actually force a 429, remove the forced 429, post something to the backend, ensure it went through.

OfflineStoreKitIntegrationTests has some examples of doing something similar

aboedo · 2024-07-12T20:29:10Z

great work on this!

tonidero · 2024-07-15T07:48:06Z

+        let retriableStatusCodes: Set<HTTPStatusCode>
+        let baseBackoff: TimeInterval
+        let maxBackoff: TimeInterval
+        let maxNumberOfRetries: UInt


I was wondering, do we need both a maxBackoff and maxNumberOfRetries? One could be computed from the other + baseBackoff, right? Edit: Sorry, I was still reviewing this. I see it's different. So I think it's ok to leave both!

In my updates to some of the other PR comments, I've hardcoded the backoff intervals. After that change was done, most of this struct was redundant so it's been removed!

fire-at-will · 2024-07-15T19:25:10Z

@aboedo I've updated the PR with updates from your comments! Specifically, I've made the following changes:

Removed the jitter from the timeInterval delay
Fixed the logic that determines if the request has exceeded the maximum number of retries.
Refactored some guard statements that could be combined
Added a log message to log when a request has used up all of its retries and still failed
Hard coded the retry backoff intervals. As a part of this change, the ExponentialRetryOptions became redundant and was removed. Now, the HTTPClient's constructor only takes in a set of HTTP status codes to retry for, which defaults to only 429s
Fixed a bug where a HTTPClient.Request's callback was being executed before the request succeeded or all retries had been exhausted. Now, a HTTPClient.Request's callback is only executed after the request succeeds or uses up all of its retries
Wrote a few tests using OhHTTPStubs to test the entire retry flow in the HTTPClient. Thanks for pointing out that library to me! 😄
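
The callback fix in the list above can be modeled with a small synchronous sketch. Everything here is hypothetical: `perform`, `send`, and `completion` are illustrative names, `send` stands in for the network call and returns an HTTP status code, and backoff waits are left out to keep the example short.

```swift
import Foundation

/// Hypothetical, synchronous model of the retry flow.
func perform(
    maxRetries: UInt = 3,
    retriableStatusCodes: Set<Int> = [429],
    send: (_ retryCount: UInt) -> Int,
    completion: (_ finalStatusCode: Int) -> Void
) {
    var retryCount: UInt = 0
    while true {
        let status = send(retryCount)
        let canRetry = retriableStatusCodes.contains(status) && retryCount < maxRetries
        if !canRetry {
            // Called exactly once: on success, on a non-retriable status,
            // or after the last retry has failed.
            completion(status)
            return
        }
        retryCount += 1
    }
}
```

If `send` always returns 429, this makes four calls in total (the original request plus three retries) and invokes `completion` once with 429.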

aboedo · 2024-07-16T15:39:29Z

+        case .timeInterval(let timeInterval):
+            return max(timeInterval, 0)


when would this be 0?

I can't think of any instances where the timeInterval would be <0 in the scope of this PR, but this is just defensive programming in case someone ever tries to pass in a negative TimeInterval as a delay in the future

aboedo · 2024-07-16T15:40:27Z

+        TimeInterval(0.25),
+        TimeInterval(0.5)


can't say that I have evidence to back this up, but I'd maybe have the backoff be a bit more aggressive, this does in the best case make three requests in the same second

I'd be closer to 0.5 and 1 sec, or even 0.75 and 3

That works for me! I'll make the adjustments to use 0sec, 0.75sec, and 3sec. 👍

aboedo · 2024-07-16T15:41:27Z

+            if !requestRetryScheduled {
+                request.completionHandler?(response)
+            }


are we sending a retry attempt number header or something so we can debug these in the future?

also maybe we can add this to SDK diagnostics (cc @tonidero )

Not yet, but this is an excellent thing to add. I'll look into adding the retry attempt number to the headers!

Right, we are already tracking http requests response codes:

purchases-ios/Sources/Diagnostics/DiagnosticsTracker.swift

Line 26 in 373abdd

func trackHttpRequestPerformed(endpointName: String,

.

We could add the retry count there. I'm a bit reticent to add another tag to the metric in Grafana (we are being mindful of increasing the cardinality of the data), but it might be useful especially if/when we move to using snowflake or similar for this data.
If we want to do this, we just need to append the new property as a valid property for the http_request_performed diagnostic event in the backend and we can start sending it from the SDK. Lmk if you want me to tackle this!

aboedo · 2024-07-16T15:42:14Z


+// MARK: - Request Retry Logic
+extension HTTPClient {
+    internal func retryRequestIfNeeded(


just double-checking that this method has been tested directly now and the comment is outdated, right?

aboedo · 2024-07-16T15:46:15Z

+    /// - Parameter retryCount: The count of the retry attempt, starting from 1.
+    /// - Returns: The backoff time interval for the given retry count. If the retry count exceeds
+    ///   the predefined list of backoff intervals, it returns 0.
+    internal func defaultExponentialBackoffTimeInterval(withRetryCount retryCount: UInt) -> TimeInterval {


a lot of stuff here is made internal for testing.
If we really need to test these directly, we should move to a different dedicated entity

[can be addressed as a follow-up]

also this is now simple enough that maybe it doesn't merit its own method - it just looks up a value in an array and has a default if it goes out of bounds.

it could even be simplified further like

guard retryCount >= 0 && retryCount < retryBackoffIntervals.count else { return retryBackoffIntervals.last ?? 0 }
return self.retryBackoffIntervals[backoffIntervalIndex]

removed the function!

aboedo · 2024-07-16T16:02:05Z

        case .none: return 0
        case .`default`: return 0
        case .long: return Self.maxJitter
+        case .timeInterval(let timeInterval):


if we don't need the jitter, we might as well add another method to the operation dispatcher, as an overload of the one we're using, that just takes an exact value as a TimeInterval, instead of passing in min and max values that are the same

I think the argument can be made either way. If we were to make that change, we'd then have two similar dispatch functions that happen after a period of time:

func dispatchOnWorkerThread(delay: Delay)

func dispatchOnWorkerThread(after timeInterval: TimeInterval)

When I read these function signatures, it isn't clear which one I should choose because the function signatures signify the same thing.

If we were to go through with this change, I'd recommend renaming Delay to something like JitterableDelay to indicate the difference that one of these APIs includes a jitter and the other doesn't. I'm down to make this change - any objections to that renaming?

Update: I've gone ahead and renamed Delay to JitterableDelay and created the new func dispatchOnWorkerThread(after timeInterval: TimeInterval) function!
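
A sketch of how the two entry points could sit side by side. The jitter ranges and the `Dispatcher` class are assumptions made for illustration; the SDK's real values and type names are not shown in this thread.

```swift
import Foundation
import Dispatch

// Hypothetical jitter bounds, chosen only for this example.
enum JitterableDelay {
    case none
    case `default`
    case long

    var range: ClosedRange<TimeInterval> {
        switch self {
        case .none: return 0...0
        case .default: return 0...5
        case .long: return 5...10
        }
    }

    var hasDelay: Bool { self.range.upperBound > 0 }

    /// Picks a random wait inside the delay's range.
    func random() -> TimeInterval { TimeInterval.random(in: self.range) }
}

final class Dispatcher {
    private let workerQueue = DispatchQueue(label: "worker")

    // Waits a randomized amount of time chosen from the delay's range.
    func dispatchOnWorkerThread(delay: JitterableDelay = .none, block: @escaping @Sendable () -> Void) {
        if delay.hasDelay {
            self.workerQueue.asyncAfter(deadline: .now() + delay.random(), execute: block)
        } else {
            self.workerQueue.async(execute: block)
        }
    }

    // Waits exactly `timeInterval` seconds (clamped to 0), with no jitter.
    func dispatchOnWorkerThread(after timeInterval: TimeInterval, block: @escaping @Sendable () -> Void) {
        self.workerQueue.asyncAfter(deadline: .now() + max(timeInterval, 0), execute: block)
    }
}
```

Retries would use the `after:` variant, since the backoff schedule needs exact waits rather than randomized ones.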

aboedo · 2024-07-16T16:07:40Z

+        self.logger.verifyMessageWasLogged("Queued request GET /v1/subscribers/identify for retry in 0.0 seconds.")
+        self.logger.verifyMessageWasLogged("Queued request GET /v1/subscribers/identify for retry in 0.25 seconds.")
+        self.logger.verifyMessageWasLogged("Queued request GET /v1/subscribers/identify for retry in 0.5 seconds.")
+        self.logger.verifyMessageWasLogged("Request GET /v1/subscribers/identify failed all 3 retries.")


these are quite dangerous in that the minute we change the logs we mess up tests. That'd be fine if we were trying to specifically test that we're logging, but here we're testing "Performs all retries if always gets retryable status code", so we should instead make sure to test for that.

We test for request count to hit 4, maybe we could also mock the operation dispatcher as a separate test and ensure that it's getting the right delays in?

Love this! I've moved the log assertions to their own tests and created a separate test to verify that the correct delay times are being sent to the OperationDispatcher.
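
The idea of asserting on delays instead of log strings can be sketched with a recording test double. The protocol `DelayedDispatching`, the mock `RecordingDispatcher`, and the driver `sendWithRetries` are all hypothetical; the SDK's own mocks may look different.

```swift
import Foundation

// Hypothetical protocol and mock, for illustration only.
protocol DelayedDispatching {
    func dispatch(after timeInterval: TimeInterval, block: @escaping () -> Void)
}

final class RecordingDispatcher: DelayedDispatching {
    private(set) var recordedDelays: [TimeInterval] = []

    func dispatch(after timeInterval: TimeInterval, block: @escaping () -> Void) {
        recordedDelays.append(timeInterval)  // record the delay instead of waiting
        block()                              // run immediately so the test stays fast
    }
}

/// Hypothetical retry driver: retries while `send` returns 429, up to the schedule's length.
func sendWithRetries(dispatcher: DelayedDispatching,
                     backoffs: [TimeInterval] = [0, 0.75, 3],
                     attempt: Int = 0,
                     send: @escaping () -> Int) {
    let status = send()
    guard status == 429, attempt < backoffs.count else { return }
    dispatcher.dispatch(after: backoffs[attempt]) {
        sendWithRetries(dispatcher: dispatcher, backoffs: backoffs, attempt: attempt + 1, send: send)
    }
}
```

A test can then check the request count and `recordedDelays` directly, so changing a log message no longer breaks it.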

aboedo · 2024-07-16T16:10:19Z

+        ).to(equal(1))
+    }
+
+}


feels like we should get at least one backend post receipt test in with the exact case we want to cover (first request is a 429, second goes through).

We can also use our BackendIntegrationTests suite to actually force a 429, remove the forced 429, post something to the backend, ensure it went through.

OfflineStoreKitIntegrationTests has some examples of doing something similar

MarkVillacampa · 2024-07-17T14:26:32Z

            return "Request was handled by load shedder: \(path.relativePath)"

+        case let .api_request_queued_for_retry(httpMethod, path, backoffInterval):
+            return "Queued request \(httpMethod) \(path) for retry in \(backoffInterval) seconds."


Would it make sense to log the retry number in this message as well?

Love this, will add it in!

fire-at-will · 2024-07-17T14:38:57Z

@MarkVillacampa pointed out that the forcedError stuff in the integration tests can probably be accomplished much more easily and safely with OhHTTPStubs, so I'm going to look at doing that! :D

tonidero

LGTM! But might be good to get some other checks as well

fire-at-will · 2024-07-17T15:24:39Z

@MarkVillacampa I've made the following changes from your review:

Replaced the forcedServerErrors mechanism with OhHTTPStubs in the integration tests and removed the associated code from HTTPClient (YAY!!) 🙌
Added the retry count number to the log message where we state that we've queued the request for a retry

MarkVillacampa · 2024-07-17T15:36:10Z

+        ]
+    }
+
+    func testVerifyPurchaseDoesntGrantEntitlementsAfter429RetriesExhausted() async throws {


Is there a way we can test that exactly 3 retries were made here?

Discussed this with Mark in Slack - we added verifications for the number of times that the request failed before eventually succeeding with the backend

joshdholtz

A few small nits/thoughts but otherwise looks good 👍

joshdholtz · 2024-07-18T15:43:20Z

+    func dispatchOnWorkerThread(delay: JitterableDelay = .none, block: @escaping @Sendable () -> Void) {
        if delay.hasDelay {
            self.workerQueue.asyncAfter(deadline: .now() + delay.random(), execute: block)
        } else {
            self.workerQueue.async(execute: block)
        }
    }

-    func dispatchOnWorkerThread(delay: Delay = .none, block: @escaping @Sendable () async -> Void) {
+    func dispatchOnWorkerThread(after timeInterval: TimeInterval, block: @escaping @Sendable () -> Void) {


This might be a small Josh nit but delay and after seem like very similar argument names and I would probably be confused by trying to figure out the difference between these 😅

Not a thing that has to change in the PR but just a thought 🤷‍♂️

I totally agree with you! The only difference is that delay: JitterableDelay doesn't let you specify the exact delay and includes a random jitter, while after timeInterval: TimeInterval allows you to specify the exact waiting time without a random jitter

maybe we can rename the param to jitterableDelay:

It might be slightly confusing on first glance but at least you'd hesitate long enough to actually think of which one to use

joshdholtz · 2024-07-18T15:48:11Z

 // MARK: - Private

-private extension HTTPClient {
+internal extension HTTPClient {


Would using the @testable import RevenueCat thing have made this visible for testing? I can't exactly remember the correct use case for it but figured I'd ask 😇

joshdholtz · 2024-07-18T15:53:59Z

+import OHHTTPStubs
+import OHHTTPStubsSwift


Nit: I know very little about what our standard is for organizing imports, but it seems like these could be grouped without a new line in between 🤷‍♂️

aboedo · 2024-07-19T16:33:59Z

+    func dispatchOnWorkerThread(delay: JitterableDelay = .none, block: @escaping @Sendable () -> Void) {
        if delay.hasDelay {
            self.workerQueue.asyncAfter(deadline: .now() + delay.random(), execute: block)
        } else {
            self.workerQueue.async(execute: block)
        }
    }

-    func dispatchOnWorkerThread(delay: Delay = .none, block: @escaping @Sendable () async -> Void) {
+    func dispatchOnWorkerThread(after timeInterval: TimeInterval, block: @escaping @Sendable () -> Void) {


maybe we can rename the param to jitterableDelay:

It might be slightly confusing on first glance but at least you'd hesitate long enough to actually think of which one to use

aboedo · 2024-07-19T17:09:32Z

 // MARK: - Private

-private extension HTTPClient {
+internal extension HTTPClient {


honestly any time we're making private stuff internal for testing, it's a sign that it should likely be extracted into a separate internal entity instead. Not only for easier testing, but also to reduce the amount of stuff a single object (in this case, HTTPClient) is doing

aboedo · 2024-07-19T17:13:55Z

+}
+
+// MARK: - Request Retry Logic
+extension HTTPClient {


This extension has some logic that might be better suited to its own file, such that the only stuff that's remaining in HTTPClient is retryRequestIfNeeded

Doesn't need to be addressed in this PR, but would be a good follow-up

fire-at-will added the pr:fix A bug fix label Jul 12, 2024

fire-at-will requested a review from a team July 12, 2024 19:35

fire-at-will marked this pull request as ready for review July 12, 2024 19:35

fire-at-will self-assigned this Jul 12, 2024

aboedo reviewed Jul 12, 2024

tonidero reviewed Jul 15, 2024

fire-at-will requested a review from aboedo July 15, 2024 19:25

MarkVillacampa reviewed Jul 15, 2024

MarkVillacampa reviewed Jul 15, 2024

aboedo reviewed Jul 16, 2024

fire-at-will added 12 commits July 16, 2024 16:57

keep track of retry counts

7d3d883

immediately retry supported status codes

21dfb9e

execute retried request

bdc6c03

retry logic sans etags

e5db1b9

dont delay retries for first retry and no eTag header

5fe37d3

delay conformance to Equatable

6428262

testing

dfa171f

linting

6a74a22

stop retrying after a max count

177ba49

make baseBackoff & maxBackoff configurable

9af7146

add logging

3117b86

remove jitter from timeInterval delay

85c809d

add more unit tests

f31bd21

MarkVillacampa reviewed Jul 17, 2024

tonidero approved these changes Jul 17, 2024

fire-at-will added 3 commits July 17, 2024 10:15

replace integration test forcedServerErrors with OhHTTPStubs

5a3992e

remove pathRequestCounts

5914732

add retry count to log message

21399f5

fire-at-will requested a review from MarkVillacampa July 17, 2024 15:24

MarkVillacampa reviewed Jul 17, 2024

fire-at-will added 4 commits July 17, 2024 11:25

validate stubbed request count in integration tests

7426811

fix test

619a361

linting

3634049

linting

3c2fb0e

joshdholtz approved these changes Jul 18, 2024

fire-at-will added 2 commits July 19, 2024 11:34

tweaks from code review

2aefc4b

sort imports for linter

2dc4477

aboedo approved these changes Jul 19, 2024

fire-at-will added 7 commits July 19, 2024 12:52

rename param to jitterableDelay

972f504

Add Is-Retryable Header

3b20186

rename to jitterableDelay

da10908

linting

95848e4

linting

f886206

server provided backoff interval validation

ff28f23

treat Retry-After header value as seconds

03fc420

fire-at-will merged commit a405537 into main Jul 23, 2024

fire-at-will deleted the retry-429s branch July 23, 2024 17:41

fire-at-will mentioned this pull request Jul 23, 2024

Release/5.2.1 #4102

Merged

vegaro mentioned this pull request Jul 29, 2024

[Customer Center] Fix BackendGetCustomerCenterConfigTests #4124

Merged

tonidero mentioned this pull request Mar 20, 2025

[Diagnostics] add is_retry to http_request_performed event RevenueCat/purchases-android#2276

Merged
