Skip to content

upstream: Implement retry concurrency budgets#9069

Merged
mattklein123 merged 77 commits intoenvoyproxy:masterfrom
tonya11en:percentage_retries
Jan 7, 2020
Merged

upstream: Implement retry concurrency budgets#9069
mattklein123 merged 77 commits intoenvoyproxy:masterfrom
tonya11en:percentage_retries

Conversation

@tonya11en
Copy link
Copy Markdown
Member

@tonya11en tonya11en commented Nov 19, 2019

This patch implements RetryBudget, a new configuration option in the retry policy that limits allowed outstanding retries. The budget is specified as a percentage of the current outstanding requests/connections. Each retry budget must specify a minimum concurrency value to the budgets, so that one can retry when there are low concurrency values. This configuration is optional and will not change existing behavior unless configured. If configured, the retry budgets will override any max_retries circuit breaker configuration.

For example, a budget of 20% with a minimum retry concurrency of 3 will allow 5 active retries while there are 25 active requests. If there are 2 active requests, there are still 3 active retries allowed because of the minimum retry concurrency.

This approach to limiting retries and mitigating retry storms is useful, because it allows one to think in terms of worst-case increases in traffic due to retries. Retry circuit breakers are fixed values and can potentially limit retries when the mesh can handle the increased volume. This is especially beneficial for adaptive concurrency for high RPS services.

Fixes: #8727
Risk Level: Low
Testing: Unit tests
Docs Changes: Stats and proto.

tonya11en and others added 7 commits November 12, 2019 22:25
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #9069 was opened by tonya11en.

see: more, trace.

Signed-off-by: Tony Allen <tallen@lyft.com>
@tonya11en
Copy link
Copy Markdown
Member Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #9069 (comment) was created by @tonya11en.

see: more, trace.

@mattklein123 mattklein123 self-assigned this Nov 20, 2019
@tonya11en
Copy link
Copy Markdown
Member Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #9069 (comment) was created by @tonya11en.

see: more, trace.

Tony Allen added 3 commits November 20, 2019 16:03
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, just a couple comments


// If a retry budget was configured, we cannot exceed the configured percentage of total
// outstanding requests/connections.
const uint64_t current_active = cluster_.resourceManager(priority_).connections().count() +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit iffy to me: today kinda this works before for H/1.1 and H/2 make use of either max_connections OR max_requests so this works, but there are requested features that might introduce max_connections to H/2 as well (#7403). Envoy also doesn't reclaim H/1.1 connections until it needs to, so I think using max_connections to get a sense for the current concurrency isn't perfect.

I think ideally we'd be counting # of requests for HTTP/1.1 and just do max_requests + max_pending, but I don't think we can just add that enforcement safely at this point (since users might have bumped max_con but not max_req). Maybe just note this limitation in the docs for now?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted in the docs. Let me know if the latest patch communicates this well enough.

Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tony@allen.gg>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #9069 was synchronize by tonya11en.

see: more, trace.

Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
@tonya11en
Copy link
Copy Markdown
Member Author

/wait-any

Signed-off-by: Tony Allen <tony@allen.gg>
@mattklein123
Copy link
Copy Markdown
Member

Apologies but can you merge master and format one more time? It should fix all the CI/random format issues.

/wait

Signed-off-by: Tony Allen <tony@allen.gg>
@mattklein123
Copy link
Copy Markdown
Member

Sorry needs another master merge.

/wait

@tonya11en
Copy link
Copy Markdown
Member Author

tonya11en commented Jan 3, 2020 via email

Signed-off-by: Tony Allen <tony@allen.gg>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo merge issue. Great work!

/wait

Signed-off-by: Tony Allen <tony@allen.gg>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

// 2019/11/15 9040 35029 35500 build: update protobuf to 3.10.1
// 2019/11/15 9040 35061 35500 upstream: track whether cluster is local
// 2019/12/10 8779 35053 35000 use var-length coding for name lengths
// 2019/12/20 8779 35053 35000 use var-length coding for name lengths
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a merge issue. To avoid going back and forth on this again can you fix this in your next change?

@repokitteh-read-only repokitteh-read-only bot removed the api label Jan 7, 2020
@mattklein123 mattklein123 merged commit 3ed917f into envoyproxy:master Jan 7, 2020
zuercher added a commit to zuercher/envoy that referenced this pull request Jan 8, 2020
Since envoyproxy#9069, macOS builds have been failing because they use slightly
more memory for stats than the new limit. Bump limit to next even
multiple of 1000.

Risk Level: low, test change only
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Stephan Zuercher <zuercher@gmail.com>
@tonya11en tonya11en deleted the percentage_retries branch February 19, 2020 00:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Percentage based retry circuit breaker

6 participants