Skip to content

upstream: load distribution in Total Panic mode#9343

Merged
snowp merged 23 commits intoenvoyproxy:masterfrom
cpakulski:issue/4685
Jan 24, 2020
Merged

upstream: load distribution in Total Panic mode#9343
snowp merged 23 commits intoenvoyproxy:masterfrom
cpakulski:issue/4685

Conversation

@cpakulski
Copy link
Copy Markdown
Contributor

@cpakulski cpakulski commented Dec 13, 2019

Description: Changed how load is calculated when all priority levels are in panic mode. Each priority level receives percentage of the traffic related to the number of hosts in that priority regardless of the health status of the hosts. This smooths out how traffic is shifted when hosts become unhealthy. See #4685 for design proposal and discussion.

Risk Level: Medium

Testing: Unit tests.

Docs Changes: Yes - updated section about panic threshold.

Release Notes: N/A
Fixes #4685

… panic mode.

Health status is ignored and each priority level takes load based on number of
hosts in given priority level in relation to the number of all hosts in all
priority levels.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Left a few comments to get you started.

@cpakulski
Copy link
Copy Markdown
Contributor Author

@snowp Thank you for code review. I am working on addressing your comments. There is one corner case which popped up during unit testing. This is when there is no heathy hosts across all priority levels. After my latest changes the algorithm just keeps running in total panic mode and distributes load to all priorities. However, there are some assumptions in the existing code that if there are no healthy hosts, chooseHost method should return nullptr. Interestingly, not all LBs have such assumptions. Ring LB has and its unit tests fail. Also aggregate cluster unit tests also assume that chooseHost returns nullptr. I think that TotalPanic mode should apply only when all levels are in panic mode and there is at least one healthy host in the cluster. I will change my code to work that way and will push code again.

@snowp
Copy link
Copy Markdown
Contributor

snowp commented Dec 20, 2019

That seems like an odd behavior, I would have expected panic mode to trigger even when there are no healthy hosts. Otherwise you might be saved by panic mode, only for your traffic to start failing once all hosts go unhealthy. @mattklein123 do you know the what the original intended behavior was around panic routing /w no healthy upstreams?

@mattklein123
Copy link
Copy Markdown
Member

That seems like an odd behavior, I would have expected panic mode to trigger even when there are no healthy hosts. Otherwise you might be saved by panic mode, only for your traffic to start failing once all hosts go unhealthy. @mattklein123 do you know the what the original intended behavior was around panic routing /w no healthy upstreams?

Definitely a bug. Please fix!

@cpakulski
Copy link
Copy Markdown
Contributor Author

Thanks for clarification! Will adjust the code accordingly.

@stale
Copy link
Copy Markdown

stale bot commented Dec 27, 2019

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Dec 27, 2019
@cpakulski
Copy link
Copy Markdown
Contributor Author

wip: ping

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jan 2, 2020
Refactored unit tests to reflect load calculation in TotalPanic.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

🔨 rebuilding ci/circleci: api (failed build)

🐱

Caused by: a #9343 (comment) was created by @cpakulski.

see: more, trace.

std::accumulate to calculate total number of hosts.

Small fixes after code review.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Copy Markdown
Contributor Author

@snowp I addressed all your code review comments accept #9343 (comment). I have to check the code to understand how a backend node ends up in excludedHost list and if we should route the traffic there while in panic mode.

…xcluded hosts.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Copy Markdown
Contributor Author

@snowp I addressed the last issue #9343 (comment). excludedHosts array is filled in https://github.com/envoyproxy/envoy/blob/v1.12.2/source/common/upstream/upstream_impl.cc#L855-L872 and traffic is router to excludedHosts in TotalPanic state. The code is ready for another pass.

Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this looks pretty good, just a couple of comments

Added unit tests to make sure that excluded_hosts are not
taken into account when calculating load in TotalPanic mode.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

🐴 hold your horses - no failures detected, yet.

🐱

Caused by: a #9343 (comment) was created by @cpakulski.

see: more, trace.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating, this is getting close to ready I think

when calculating load in TotalPanic mode.
Added unit tests for the case above.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

🤷‍♀️ nothing to rebuild.

🐱

Caused by: a #9343 (comment) was created by @cpakulski.

see: more, trace.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I think this looks good modulo the one comment. @alyssawilk do you also want to take a look?

Can you bump the risk level of this change? Given how it touches core load balancing code I'd say its at least a medium risky change.

@cpakulski
Copy link
Copy Markdown
Contributor Author

cpakulski commented Jan 17, 2020

@snowp. The issue in that one comment was fixed in fe83128. Without that fix my unit tests were failing.

Bumped risk level to Medium.

Thanks for reviewing my changes.

Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a few nits from me

// This is corner case when panic is disabled and there is no hosts available.
// But the load must be distributed somewhere.
// This is just a placeholder, because no traffic will be directed to
// priority 0 anyways, because it is empty.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this comment doesn't seem quite right. P[0] may have hosts (it's not necessarily empty). In theory it would just fail to accept queries right?

// This is corner case when panic is disabled and there is no hosts available.
// But the load must be distributed somewhere.
// This is just a placeholder, because no traffic will be directed to
// priority 0 anyways, because it is empty.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think there's a comment above recalculatePerPriorityState which I think may need updating.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

🤷‍♀️ nothing to rebuild.

🐱

Caused by: a #9343 (comment) was created by @cpakulski.

see: more, trace.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
alyssawilk
alyssawilk previously approved these changes Jan 23, 2020
Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snowp any final comments from you?

Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM just seems to be an issue with the change log

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@snowp snowp merged commit 757e0fb into envoyproxy:master Jan 24, 2020
@cpakulski cpakulski deleted the issue/4685 branch January 24, 2020 01:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

upstream: redesign per priority load calculation when all priority levels are in panic mode

4 participants