Retry handler design spec. update #18

irvinesunday · 2019-06-12T22:41:35Z

This PR updates the design spec. for the retry handler middleware proposing to change the count-based request retries to time-based request retries.

From task: microsoftgraph/msgraph-sdk-dotnet/issues/441

MIchaelMainer · 2019-06-12T23:02:42Z

middleware/RetryHandler.md

+- The client code can specify a custom value for the maximum delay.
+- Retries will be based on the cumulative retry delay compared against the maximum delay specified by the client code.
+- Upon expiry of the maximum delay, when the cumulative retry delay is greater than the specified maximum delay, the `Task` should be cancelled and an `exception` thrown with a relevant message.
+- In the case whereby the client code specifies a maximum delay that is less than the received `retry-after` header value from the server, the cumulative retry delay will run for the duration of the `retry-after` value.


If the client code specifies a maximum delay less than the retry-after, then we should respect the client's choice and fail despite the service indicating a retry-after.

@MIchaelMainer by failing, do you propose we fail fast and do not even attempt to delay the retry request as specified by the retry-after or exponential back-off values? Or do we enter the Delay() Task and retry at least once then fail? (see my comment on Peter's review).

Good question. I need to be more specific.

If the retry-after exceeds the max delay, we should fail fast. The service is telling us to wait longer than the max delay value submitted by the developer's code. There is no need to wait until we reach the max delay time, because we don't expect the server to be able to handle the request.

If the exponential back-off value exceeds the max delay value, I think we could wait to retry once more at the max delay time, and then fail if that retry fails.

No retry should occur after reaching the max delay time. The customer-specified value should always be conservatively respected.
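The rule proposed above can be sketched as a single decision function. This is a minimal, language-neutral sketch in Python, not the SDK's .NET implementation; `next_retry_delay` and its parameters are hypothetical names, and all times are assumed to be in seconds.

```python
from typing import Optional


def next_retry_delay(retry_after: Optional[float], backoff: float,
                     elapsed: float, max_delay: float) -> Optional[float]:
    """Return the seconds to wait before the next retry, or None to stop.

    Hypothetical sketch of the proposal above: fail fast on an oversized
    retry-after, cap exponential back-off at the max delay time.
    """
    remaining = max_delay - elapsed
    if remaining <= 0:
        # No retry should occur after reaching the max delay time.
        return None
    if retry_after is not None:
        # The service asks us to wait longer than the client allows: fail fast.
        if retry_after > remaining:
            return None
        return retry_after
    # Exponential back-off: wait at most until the max delay time, retry once more.
    return min(backoff, remaining)
```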

peombwa · 2019-06-12T23:43:25Z

middleware/RetryHandler.md

 - Retries MUST respect the `retry-after` response header if provided.
 - Where no `retry-after` header is provided by the server, an exponential backoff with random offset heuristic should be used to determine the retry delay.
- Retries should be limited to a maximum value.
+- Retries should be limited to a maximum delay value. The default value for this is set at 180 seconds.
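The delay rule in the two context lines above can be sketched as follows, assuming a doubling back-off plus a uniformly distributed random offset. `retry_delay`, `base_delay` (3 seconds here) and `max_jitter` are hypothetical; the spec does not fix these values.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[float],
                base_delay: float = 3.0, max_jitter: float = 1.0,
                rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (1-based). Hypothetical sketch."""
    if retry_after is not None:
        # Retries MUST respect the retry-after header when the server provides it.
        return retry_after
    if rng is None:
        rng = random.Random()
    # Exponential back-off: base_delay, 2 * base_delay, 4 * base_delay, ...
    backoff = base_delay * 2 ** (attempt - 1)
    # The random offset spreads out clients that were throttled at the same moment.
    return backoff + rng.uniform(0.0, max_jitter)
```

Passing a seeded `random.Random` makes the offset reproducible in tests.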


We should also have a minimum max delay value to avoid customers specifying max delay values such as 0 or 0.5, which would essentially result in no retries. This minimum allowed value can be the current DEFAULT_MAX_RETRY value of 3.

cc/ @MIchaelMainer @darrelmiller I'm not sure whether we want this behavior since we already have a ShouldRetry delegate that customers should use to disable retries.

@peombwa I believe your concern is similar to the case whereby a customer specifies a MaxRetryTime (max delay) value less than the retry-after value: in both cases the Delay task will still wait for the server-specified retry-after or exponential back-off delay at least once. See the implementation:

`cumulativeDelay = 0.0; while (cumulativeDelay < RetryOption.MaxRetryTime) { Task Delay(response, retryCount, RetryOption.Delay, cancellationToken) }`

The cumulativeDelay value will be incremented by the retry-after or exponential back-off values in the call to Task Delay() and then fail in the subsequent retry attempt.
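To see why a `MaxRetryTime` smaller than the first delay still produces one wait, the loop above can be replayed with the delays it would receive. This sketch assumes, as described, that each `Delay()` call waits the full retry-after or back-off value and then adds it to `cumulativeDelay`; `delays_waited` is a hypothetical helper.

```python
from typing import Iterable, List


def delays_waited(max_retry_time: float, proposed_delays: Iterable[float]) -> List[float]:
    """Replay `while (cumulativeDelay < MaxRetryTime) { Delay(...) }`."""
    cumulative_delay = 0.0
    waited: List[float] = []
    for delay in proposed_delays:
        # The loop condition is only checked before each wait, never during it.
        if cumulative_delay >= max_retry_time:
            break
        waited.append(delay)          # Delay() waits the full value ...
        cumulative_delay += delay     # ... and only then increments the total.
    return waited
```

For example, `delays_waited(0.5, [3.0, 6.0])` waits once for 3 seconds, and `delays_waited(10, [3.0, 6.0, 12.0, 24.0])` waits 21 seconds in total, overshooting the 10-second limit.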

I don't think we should have a minimum delay value. We should always respect a customer-specified MaxRetryTime. If the customer doesn't want to wait more than 500ms, then we should always respect that, even if that means not retrying, or retrying once more at their specified MaxRetryTime (for exponential back-off only; see my comment above for retry-after). Short-circuiting retry before MaxRetryTime should only occur in ShouldRetry.

I agree with @MIchaelMainer . I don't think we should add the complexity of a minimum MaxRetryTime. I also wonder if RetriesTimeLimit is a clearer name for this property.

I think this statement should change from

Retries should be limited to a maximum delay value. The default value for this is set at 180 seconds.

to

Retries MAY be limited to a maximum delay value. The default value for this is set at 180 seconds.

We should leave our current behavior as the default behavior and make this behavior a new option.

If we implement a default then this isn't an option unless we have a new overload that says add a timelimit and use the default value. I'd rather just make the time limit only have an effect if one is explicitly provided. And we shouldn't use the word "delay" to describe the TimeLimit, it is confusing.

irvinesunday · 2019-06-13T10:18:58Z

middleware/RetryHandler.md

- Retries will be based on the cumulative retry delay against the maximum retry time specified by the client code.
- Upon expiry of the maximum retry time, the Task should be cancelled and an exception thrown.
+- Retries should be limited to a maximum delay value. The default value for this is set at 180 seconds.
+- The client code can specify a custom value for the maximum delay.


@darrelmiller, @peombwa, @MIchaelMainer Open question: Are we going to require the max delay time to be explicitly set by customers or do we leave it optional?

Leave it as optional.

darrelmiller · 2019-06-18T12:26:01Z

middleware/RetryHandler.md

+- Retries should be limited to a maximum delay value. The default value for this is set at 180 seconds.
+- The client code can specify a custom value for the maximum delay.
+- Retries will be based on the cumulative retry delays compared against the maximum delay specified by the client code, cumulatively added across all retries within the context of a single call. This is applicable to both 429 and 503/504.
+- Upon expiry of the maximum delay, when the cumulative retry delay is greater than the specified maximum delay, the `Task` should be cancelled and an `exception` thrown with a relevant message.


I don't think we should be throwing an exception because we exceeded the RetriesTimeLimit. We should simply not retry again. We could try to cancel a currently running task, but I'm not sure this is wise. This could cause the server to do a bunch of work that is then wasted.

The goal of this property should be to set a limit, that if exceeded prevents additional retries. We should also account for exceeding the limit while waiting to retry, which is covered in the next point.

We are not trying to create a hard limit that guarantees that the call will only take X number of seconds. I think that would be challenging to achieve reliably.
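A sketch of the semantics described here: the limit only prevents further retries, and the last response is returned rather than an exception being thrown. It assumes injected `send`, `sleep` and `backoff` callables, a `(status, retry_after)` tuple as the response, a count-based `max_retry` guard, and that back-off delays are checked against the limit the same way as `retry-after` values; all of these names are hypothetical.

```python
from typing import Callable, Optional, Tuple

# Hypothetical response shape: (HTTP status code, retry-after seconds or None).
Response = Tuple[int, Optional[float]]

RETRYABLE_STATUS = {429, 503, 504}


def send_with_retries(send: Callable[[], Response],
                      sleep: Callable[[float], None],
                      backoff: Callable[[int], float],
                      retries_time_limit: float,
                      max_retry: int = 3) -> Response:
    """Retry throttled calls, but never raise because the time limit was hit."""
    cumulative_delay = 0.0  # summed across all retries of this single call
    response = send()
    attempt = 0
    while response[0] in RETRYABLE_STATUS and attempt < max_retry:
        attempt += 1
        retry_after = response[1]
        delay = retry_after if retry_after is not None else backoff(attempt)
        if cumulative_delay + delay > retries_time_limit:
            # Waiting would exceed the limit: stop retrying and return the last
            # response instead of throwing or cancelling a request in flight.
            return response
        sleep(delay)
        cumulative_delay += delay
        response = send()
    return response
```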

irvinesunday · 2019-06-27T14:36:53Z

@darrelmiller, @peombwa, @MIchaelMainer I've edited this doc. to address both the time-based and count-based retry scenarios. From Darrel's point of view, the time-based retry option wasn't supposed to replace the count-based retry that was already in place, but to be added as a new feature. Are we all in agreement with this?

MIchaelMainer · 2019-06-21T22:29:46Z

middleware/RetryHandler.md

 - Retries MUST respect the `retry-after` response header if provided.
 - Where no `retry-after` header is provided by the server, an exponential backoff with random offset heuristic should be used to determine the retry delay.
- Retries should be limited to a maximum value.
+- Retries should be limited to a maximum delay value. The default value for this is set at 180 seconds.


I think this statement should change from

Retries should be limited to a maximum delay value. The default value for this is set at 180 seconds.

to

Retries MAY be limited to a maximum delay value. The default value for this is set at 180 seconds.

We should leave our current behavior as the default behavior and make this behavior a new option.

MIchaelMainer · 2019-07-01T18:11:16Z

We are in agreement.

darrelmiller · 2019-07-05T02:20:13Z

middleware/RetryHandler.md

- Upon expiry of the maximum delay, when the cumulative retry delay is greater than the specified maximum delay, the `Task` should be cancelled and an `exception` thrown with a relevant message.
- In the case where the client receives a `retry-after` value that is greater than the remaining `RetriesTimeLimit` the client should return the failed response immediately.
- Only requests with payloads that are buffered/rewindable are supported. Payloads with forward-only streams will have their responses returned without any retry attempt.
+- Customers can specify a custom value for the `RetriesTimeLimit` greater than 0 to introduce time-based evaluated request retries alongside the default count-based request retry. The default is set at 0 second.


There is no need to say anything about the default. There is effectively no default.
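The evaluation criteria in this revision of the spec can be condensed into one predicate. This is a sketch under assumptions: `should_retry` and its parameters are hypothetical, the time-based check applies only when a positive `retries_time_limit` is supplied (so count-based retry remains the default and there is effectively no default time limit), and `next_delay` stands for the `retry-after` value, or the back-off delay when no header is sent.

```python
from typing import Optional

RETRYABLE_STATUS = (429, 503, 504)


def should_retry(status: int, retry_count: int, max_retry: int,
                 body_rewindable: bool, next_delay: float,
                 cumulative_delay: float,
                 retries_time_limit: Optional[float] = None) -> bool:
    """Hypothetical condensation of the retry evaluation criteria."""
    if status not in RETRYABLE_STATUS:
        return False
    if not body_rewindable:
        # Forward-only streams cannot be replayed: return the response as-is.
        return False
    if retry_count >= max_retry:
        # The count-based limit still applies.
        return False
    if retries_time_limit is not None and retries_time_limit > 0:
        # Time-based evaluation only when the customer opted in.
        remaining = retries_time_limit - cumulative_delay
        if next_delay > remaining:
            # Return the failed response immediately.
            return False
    return True
```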

middleware/RetryHandler.md

Co-Authored-By: Darrel <darrmi@microsoft.com>

Irvine Sunday added 2 commits June 12, 2019 15:19

Retry Handler design spec update

a7bf911

Further updates and addition to the Retry Handler spec.

93bc4d2

irvinesunday requested review from MIchaelMainer, darrelmiller and peombwa June 12, 2019 22:41

MIchaelMainer reviewed Jun 12, 2019

peombwa reviewed Jun 12, 2019

irvinesunday commented Jun 13, 2019

Updating context

a894e82

darrelmiller reviewed Jun 18, 2019

middleware/RetryHandler.md

darrelmiller reviewed Jun 18, 2019

middleware/RetryHandler.md

darrelmiller reviewed Jun 18, 2019

irvinesunday and others added 2 commits June 20, 2019 11:54

Update middleware/RetryHandler.md

07d1fb7

Content update Co-Authored-By: Darrel <darrmi@microsoft.com>

Update middleware/RetryHandler.md

0d48b30

Content update Co-Authored-By: Darrel <darrmi@microsoft.com>

irvinesunday mentioned this pull request Jun 21, 2019

Updating Retry Handler to use time limit instead of count limit microsoftgraph/msgraph-sdk-dotnet-core#14

Merged

Updating doc to handle both time-based and count-based retry scenarios

e7f9801

MIchaelMainer approved these changes Jul 1, 2019

Irvine Sunday added 2 commits July 2, 2019 15:36

Content update

5e729b8

Content update

ccc3dc0

darrelmiller reviewed Jul 5, 2019

middleware/RetryHandler.md

irvinesunday and others added 3 commits July 8, 2019 10:57

Updating the time-based retry request evaluation criteria

0131530

Co-Authored-By: Darrel <darrmi@microsoft.com>

There is no default value for the retriesTimeLimit

ddd06b1

Bumped up the copyright license year

16772b2

darrelmiller approved these changes Jul 10, 2019

darrelmiller merged commit 4c5c83b into master Jul 10, 2019

darrelmiller deleted the is/retry-handler-update branch July 10, 2019 13:51

irvinesunday commented Jun 12, 2019

MIchaelMainer Jun 17, 2019 • edited

darrelmiller Jun 18, 2019 • edited

irvinesunday commented Jun 27, 2019

MIchaelMainer commented Jul 1, 2019

5 participants
