Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Retry on GOAWAY#39127

Merged
krwq merged 7 commits intodotnet:masterfrom
krwq:retry-on-goaway
Jul 13, 2019
Merged

Retry on GOAWAY#39127
krwq merged 7 commits intodotnet:masterfrom
krwq:retry-on-goaway

Conversation

@krwq
Copy link
Member

@krwq krwq commented Jul 2, 2019

Fixes: https://github.com/dotnet/corefx/issues/39013

There is still some behavior which needs to be figured out so until then marking this as draft:
https://github.com/dotnet/corefx/issues/39013#issuecomment-507820160

More tests will be added in:
https://github.com/dotnet/corefx/issues/34638

Regardless of above there seem to be some flakiness in the test now which I'm investigating

await newConneciton.SendResponseDataAsync(retriedStreamId, new byte[0], endStream: true).ConfigureAwait(false);

await newConneciton.WriteFrameAsync(goAwayFrame).ConfigureAwait(false);
var goAwayFrameOnNewConnection = new GoAwayFrame(retriedStreamId, (int)ProtocolErrors.ENHANCE_YOUR_CALM, new byte[0], 0);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw. I have previously sent GOAWAY with 0 but that happened after we have already sent complete response - currently client assumes that server has processed the stream completely but server claimed with GOAWAY that it hasn't - it feels like this is something spec should clarify further as I'm not sure if the client should somehow retroactively thrown in that case.

if (NetEventSource.IsEnabled) Trace(frameHeader.StreamId, $"{nameof(lastValidStream)}={lastValidStream}, {nameof(errorCode)}={errorCode}");

AbortStreams(lastValidStream, new Http2ProtocolException(errorCode));
StartTerminatingConnection(lastValidStream, new Http2ProtocolException(errorCode));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's worthwhile passing the Http2ProtocolException through here. The requests are going to be retried, and thus the user should never see this exception. I suppose it doesn't cause any harm, but it is different than how we handle REFUSED_STREAM today.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@geoffkizer the reason I added that was so that user have a way of distinguishing between NO_ERROR vs other state in case they implement their own retry logic. Also I've noticed the initially written test expected to see that being propagated so it's possible the user might want to have more diagnostic info as well. (also even though retry logic is there after certain amount of tries/time we still bubble up the exception which looks more useful with extra info)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the user is seeing this exception, then there's something wrong going on. We should always retry when this exception is thrown. Do you know why this is getting surfaced to the user?

This may be related to the issue below where you are sending a GOAWAY with last valid stream = 0. I.e., the server is lying and telling us that requests which have already been processed actually have not been, and since we've gotten past the point in request processing where we can retry (i.e. response headers have been received) then we're actually ending up aborting this request. I guess that's fine, but at least we should add a comment regarding this...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are passing this exception in the GOAWAY case, can we also pass it in the REFUSED_STREAM case? Seems like we could have a similar issue with the server incorrectly sending REFUSED_STREAM after it's processed the stream.

Copy link
Member Author

@krwq krwq Jul 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noticed this being bubbled when timeout happens which I actually think it makes sense although I do not think the code was designed to do that and likely is a coincidence (i.e. client tried doing request, retried few times but server kept on aborting with some error and eventually you see one of the server sent bubbled up).

I believe this was causing the stack to be printed by the async state machine timer but test eventually passed (test shouldn't pass but exception seems correct)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's strange. Do you know which test this was happening during? This was a remote server test, I assume? It seems weird that we'd be getting REFUSED_STREAM at all during remote server tests... would be good to understand why.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll pass it also in the REFUSED_STREAM case. Unfortunately wasn't able to repro bubbling up condition since I got it last time so possibly this was due to some bug I got when I was modifying the code.

@krwq
Copy link
Member Author

krwq commented Jul 3, 2019

I'm still flushing out some flakiness in these tests, will update once I feel like this test won't cause random CI failures

@krwq krwq force-pushed the retry-on-goaway branch from 5e60748 to a85a97f Compare July 11, 2019 00:04
@krwq
Copy link
Member Author

krwq commented Jul 11, 2019

not ready for review yet

@krwq krwq force-pushed the retry-on-goaway branch from a85a97f to 4058abc Compare July 11, 2019 00:34
@krwq krwq marked this pull request as ready for review July 11, 2019 00:37
@krwq krwq force-pushed the retry-on-goaway branch from 4058abc to a4394c1 Compare July 11, 2019 16:56
/// // test code
/// }
/// </summary>
internal class Watchdog : ICriticalNotifyCompletion

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool, thanks for adding!

@krwq krwq force-pushed the retry-on-goaway branch from 9d61ef6 to 9320014 Compare July 12, 2019 21:20
@krwq
Copy link
Member Author

krwq commented Jul 12, 2019

only last commit has changes for the latest feedback. had to rebase because of another merge conflict

@krwq
Copy link
Member Author

krwq commented Jul 13, 2019

Ok, per offline conversation with @geoffkizer I've made the retry case more deterministic without having to wait. So far I haven't noticed any non-determinisim or flakiness (running for 30min) will leave it running over the weekend

@geoffkizer
Copy link

Generally LGTM, a couple minor comments

@krwq krwq merged commit 1841042 into dotnet:master Jul 13, 2019
@krwq
Copy link
Member Author

krwq commented Jul 15, 2019

Just FYI, I've left this running in a loop over a weekend and haven't seen any flakiness

@karelz karelz added this to the 3.0 milestone Jul 16, 2019
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Retry on GOAWAY

* send correct GOAWAY on retried connection, update comment

* Harden against flakiness

* Fix: Connection refused error is different on Debian

* apply feedback

* Make partial body/received headers case deterministic, pass error code on REFUSED_STREAM

* apply latest feedback


Commit migrated from dotnet/corefx@1841042
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HTTP2: Retry on GOAWAY

4 participants