Add general request retry mechanism#2194

Merged
jtopjian merged 5 commits into gophercloud:master from andrewbaxter:general-retries on Aug 2, 2021

Conversation

@andrewbaxter
Contributor

For #2183

Links to the line numbers/files in the OpenStack source code that support the
code in this PR:

None - this is purely client-side.

@andrewbaxter
Contributor Author

This is passing tests (at least the ones I tried locally) but I haven't played around with it much yet. I wanted initial feedback if you thought this needed to be redesigned.

Notes on the implementation:

  • I left RetryBackoffFunc and the reauth stuff because I didn't want to make this backwards incompatible. I think it should be possible to make a default GeneralRetryFunc that handles both reauth and the 429 stuff if so desired.
  • There's no MaxRetry count field (and the Backoff one isn't used) since this can be handled by the GeneralRetryFunc itself.
  • I added two simple tests, but I can add more if you think it would be useful. I didn't want to go crazy in case you wanted changes to the implementation.
  • I didn't include any premade GeneralRetryFunc - I think one that retries GETs and maybe PATCHs would be reasonable, but I haven't had time to dig into this yet and I'm sure some tuning would be required to make something that works well out of the box.

@andrewbaxter
Contributor Author

I've tested this myself with tc to drop packets, but that's pretty limited and I can't exactly cause our OpenStack provider to go down ;) ... for what it's worth, I was able to confirm that operation under normal circumstances didn't break.

@theopenlab-ci

theopenlab-ci bot commented Jul 19, 2021

Build failed.

@coveralls

coveralls commented Jul 19, 2021

Coverage Status

Coverage decreased (-0.04%) to 79.826% when pulling c46343f on andrewbaxter:general-retries into 2509091 on gophercloud:master.

@jtopjian
Contributor

@andrewbaxter Thanks for working on this. This is a really good start.

The first area that stuck out was the RetryBackoff implementation. To improve naming, I think it's best to rename the existing RetryFunc to RetryBackoffFunc. This change will also need to be made in the unit tests. It introduces a slight backwards incompatibility, but I think it's worth it in the long run.

After that, we can rename GeneralRetryFunc to simply RetryFunc.

Looking at the implementation, instead of creating a doRequest that wraps the renamed doInnerRequest, isn't it possible to keep a single doRequest and implement the retry mechanism like so:

```go
// ...
// existing code
    case http.StatusServiceUnavailable:
      err = ErrDefault503{respErr}
      if error503er, ok := errType.(Err503er); ok {
        err = error503er.Error503(respErr)
      }
    }

    if err == nil {
      err = respErr
    }

    // add retry mechanism here
    if err != nil && client.RetryFunc != nil {
      var e error
      state.retries = state.retries + 1
      e = client.RetryFunc(client.Context, method, url, options, err, state.retries)
      if e != nil {
        return resp, e
      }

      return client.doRequest(method, url, options, state)
    }

    // resume existing code
    return resp, err
  }
// ...
```

I tested this out and the unit tests pass, but maybe I am missing something.

Let me know your thoughts on the above. Once we discuss that, we can talk about extra tests, default functions, etc.

I'm not concerned about the failing acceptance test yet. OpenLab can be quirky sometimes and exhibit transient behaviour. However, if they continue to fail, it might be this RetryFunc somehow causing request issues.

@andrewbaxter
Contributor Author

andrewbaxter commented Jul 20, 2021

Thanks! I'll get on those changes.

For the single doRequest - it wasn't clear to me whether Go performs tail-call optimization here, whereas a loop would keep the stack short. To make sure I understand: you mean recurse into doRequest rather than use a loop? If that's the preference, I'm happy to make that change.

Okay yeah, I didn't look at the acceptance tests quite yet (wasn't sure where to find the failing output exactly) 😅 I hope they do pass but I'll dig deeper after resolving the above.

Thanks for the review!

@jtopjian
Contributor

To make sure I understand, you mean recurse doRequest rather than have a loop?

Correct.

We're already doing reauth retries and backoff retries in the single doRequest, so we might as well keep it consistent (unless we discover an issue with general retrying). As for optimization: since we're already doing other types of retries the same way, avoiding recursion here would mean a major refactor.

Okay yeah, I didn't look at the acceptance tests quite yet (wasn't sure where to find the failing output exactly)

If you click on the link for the failed test, you can access job-output.txt.gz and sift through 1.4 MB of text 🙂 Searching for --- FAIL will take you to the failed tests. From what I can tell, one or two failed requests were the cause of the failures.

@jtopjian
Contributor

recheck

@theopenlab-ci

theopenlab-ci bot commented Jul 20, 2021

Build succeeded.

@andrewbaxter
Contributor Author

I made the above changes, but there are at least 4 separate error return statements in doRequest which could result from temporary issues (upstream, network connectivity, etc.). I ended up copying the RetryFunc call snippet into each of them, but I'm not sure if there's a better way to handle this.

@jtopjian
Contributor

@andrewbaxter Thanks for continuing to work on this.

I think the first two additions of calling RetryFunc are all that's needed (Lines 428 and 573).

On Line 593, the error would be an IO-based error, where the content is being discarded in order to trigger an EOF so that the connection can safely close. I don't think this is something that should be retried if it fails since it's not part of the HTTP/network request (other than ensuring a connection is closed, and if we retry, we might end up accumulating open connections).

And on Line 606, the error would be from unmarshalling a JSON response. At this point, the HTTP connection was successful, data was successfully retrieved, the HTTP connection was closed, but there was some kind of problem unmarshalling the data. IMO, this error should be handled within the application.

Thoughts?

@andrewbaxter
Contributor Author

Okay, that makes sense for 593. Retrying would be a strange response to an internal connection-management issue there, and it looks like the response is returned anyway (just along with the error).

For 606, I've seen servers randomly write partial responses (with matching content lengths), and that's a situation where I'd want to retry - similar to other temporary upstream issues.

@theopenlab-ci

theopenlab-ci bot commented Jul 21, 2021

Build succeeded.

@theopenlab-ci

theopenlab-ci bot commented Jul 21, 2021

Build succeeded.

@jtopjian
Contributor

For 606, I've seen servers partially write responses (with matching content lengths) randomly, and that's a situation where I'd want to retry - similarly to other temporary upstream issues.

That's fair and sounds good to me.

From what I can tell, this is looking good.

@kayrus @ozerovandrei do either of you have any thoughts on this one?

Contributor

@jtopjian jtopjian left a comment

LGTM - thank you!

@jtopjian
Contributor

jtopjian commented Aug 2, 2021

@andrewbaxter Thank you for working on this!

@jtopjian jtopjian merged commit 96bec7c into gophercloud:master Aug 2, 2021
@andrewbaxter
Contributor Author

Awesome, thanks for the reviews and merge!

@andrewbaxter andrewbaxter deleted the general-retries branch August 5, 2021 09:24
@kayrus
Contributor

kayrus commented Aug 20, 2021

This change must be reflected in gophercloud/utils as well:

```
vendor/github.com/gophercloud/utils/client/client.go:368:9: cannot use func literal (type func(context.Context, *gophercloud.ErrUnexpectedResponseCode, error, uint) error) as type gophercloud.RetryFunc in return argument
```

@jtopjian jtopjian mentioned this pull request Aug 23, 2021
