upstream: load distribution in Total Panic mode#9343
snowp merged 23 commits into envoyproxy:master from
Conversation
… panic mode. Health status is ignored and each priority level takes load in proportion to its number of hosts relative to the total number of hosts across all priority levels. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
docs/root/intro/arch_overview/upstream/load_balancing/panic_threshold.rst
@snowp Thank you for the code review. I am working on addressing your comments. One corner case popped up during unit testing: when there are no healthy hosts across all priority levels. After my latest changes the algorithm just keeps running in total panic mode and distributes load to all priorities. However, some of the existing code assumes that if there are no healthy hosts, the chooseHost method should return nullptr. Interestingly, not all LBs make that assumption. The ring hash LB does, and its unit tests fail. The aggregate cluster unit tests also assume that chooseHost returns nullptr. I think that total panic mode should apply only when all levels are in panic mode and there is at least one healthy host in the cluster. I will change my code to work that way and will push it again.
That seems like an odd behavior; I would have expected panic mode to trigger even when there are no healthy hosts. Otherwise you might be saved by panic mode, only for your traffic to start failing once all hosts go unhealthy. @mattklein123 do you know what the originally intended behavior was around panic routing with no healthy upstreams?
Definitely a bug. Please fix!
Thanks for the clarification! Will adjust the code accordingly.
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
wip: ping |
Refactored unit tests to reflect load calculation in TotalPanic. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
/retest
🔨 rebuilding |
Use std::accumulate to calculate the total number of hosts. Small fixes after code review. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
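The commit above mentions switching to `std::accumulate` for the host total. A minimal, self-contained illustration of that pattern (the `totalHosts` helper and the nested-vector representation of per-priority host lists are hypothetical, standing in for Envoy's actual host set types):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative only: sum the sizes of the per-priority host lists with
// std::accumulate instead of an explicit loop.
size_t totalHosts(const std::vector<std::vector<int>>& per_priority_hosts) {
  return std::accumulate(
      per_priority_hosts.begin(), per_priority_hosts.end(), size_t{0},
      [](size_t sum, const std::vector<int>& hosts) { return sum + hosts.size(); });
}
```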
@snowp I addressed all your code review comments except #9343 (comment). I have to check the code to understand how a backend node ends up in
…xcluded hosts. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@snowp I addressed the last issue #9343 (comment).
snowp
left a comment
Thanks, this looks pretty good, just a couple of comments.
Added unit tests to make sure that excluded_hosts are not taken into account when calculating load in TotalPanic mode. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
/retest
🐴 hold your horses - no failures detected, yet.
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
snowp
left a comment
Thanks for iterating, this is getting close to ready, I think.
when calculating load in TotalPanic mode. Added unit tests for the case above. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
/retest
🤷‍♀️ nothing to rebuild.
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
snowp
left a comment
Nice, I think this looks good modulo the one comment. @alyssawilk do you also want to take a look?
Can you bump the risk level of this change? Given how it touches core load balancing code, I'd say it's at least a medium-risk change.
alyssawilk
left a comment
Looks good! Just a few nits from me
// This is corner case when panic is disabled and there is no hosts available.
// But the load must be distributed somewhere.
// This is just a placeholder, because no traffic will be directed to
// priority 0 anyways, because it is empty.
Yeah, this comment doesn't seem quite right. P[0] may have hosts (it's not necessarily empty). In theory it would just fail to accept queries, right?
Also, I think there's a comment above recalculatePerPriorityState which may need updating.
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
/retest
🤷‍♀️ nothing to rebuild.
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
alyssawilk
left a comment
@snowp any final comments from you?
snowp
left a comment
This LGTM; there just seems to be an issue with the changelog.
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Description: Changed how load is calculated when all priority levels are in panic mode. Each priority level receives a percentage of the traffic proportional to the number of hosts in that priority level, regardless of the health status of those hosts. This smooths out how traffic is shifted when hosts become unhealthy. See #4685 for the design proposal and discussion.
Risk Level: Medium
Testing: Unit tests.
Docs Changes: Yes - updated section about panic threshold.
Release Notes: N/A
Fixes #4685