load balancer: fix?: panic mode is disabled only when healthy_panic_threshold is 0% by mnktsts2 · Pull Request #7478 · envoyproxy/envoy

mnktsts2 · 2019-07-06T03:26:06Z


Description:

Currently, in load_balancer_impl.cc: recalculatePerPriorityPanic(), a load balancer enters panic mode whenever normalized_total_availability is 0%, even if common_lb_config.healthy_panic_threshold is 0%.
I assume that a user who intentionally sets healthy_panic_threshold = 0 expects error responses to be returned immediately when the load balancer has no available host.
(In fact, the current load_balancer_impl.cc: isGlobalPanic() already decides not to enter panic mode whenever healthy_panic_threshold is 0%.)
So I suggest that panic mode be disabled when, and only when, healthy_panic_threshold is 0%.
I want this change so that lower-priority or optional back-end services can be degraded automatically.
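
The proposed behavior can be illustrated with a small standalone model. This is a simplified sketch, not Envoy's actual code: `PriorityState` and `applyZeroAvailability` are hypothetical names, and only the branch taken when every host is unavailable is modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for the per-priority state kept by the
// load balancer. This is not Envoy's real data structure.
struct PriorityState {
  std::vector<uint32_t> load; // percentage of traffic assigned to each priority
  std::vector<bool> panic;    // panic flag for each priority
};

// Models only the case where normalized total availability is 0%.
// All load is assigned to P=0, and P=0 enters panic mode only if the
// configured panic threshold is above 0%.
inline void applyZeroAvailability(uint64_t panic_threshold, PriorityState& state) {
  assert(!state.load.empty() && state.load.size() == state.panic.size());
  for (std::size_t i = 0; i < state.load.size(); ++i) {
    state.load[i] = 0;
    state.panic[i] = false;
  }
  state.load[0] = 100;
  state.panic[0] = panic_threshold > 0;
}
```

With a threshold of 10 the model sets the panic flag for P=0, and with a threshold of 0 it leaves the flag cleared, so no host would be selected while every host is unhealthy.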

Risk Level:

Low
(Setting healthy_panic_threshold == 0 appears to have been a special case originally. It won't happen unless a user intends to disable panic mode, because the default value of healthy_panic_threshold is 50%.)

Testing:

Done: unit and integration tests with ./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.release'
Done: manual tests with the attached config/script files
- config-enable_panic.yaml and config-disable_panic.yaml are Envoy config files that route to httpbin.org.
- config-enable_panic.yaml sets healthy_panic_threshold: 10 so that panic mode can be entered.
- config-disable_panic.yaml sets healthy_panic_threshold: 0 to disable panic mode.
- For each test case, run Envoy, send a request to localhost:10000/status/504, and check the response code and the Envoy stat cluster.backend.lb_healthy_panic.
- I verified that panic mode is disabled only when healthy_panic_threshold is 0%.

| Test case | healthy_panic_threshold | panic mode | response code | lb_healthy_panic |
|---|---|---|---|---|
| current envoy with config-enable_panic.yaml | 10 | enter | 504 | > 0 |
| current envoy with config-disable_panic.yaml | 0 | enter | 504 | > 0 |
| modified envoy with config-enable_panic.yaml | 10 | enter | 504 | > 0 |
| modified envoy with config-disable_panic.yaml | 0 | not enter | 503 | = 0 |

Docs Changes:

N/A
(There is no description for the special case of healthy_panic_threshold == 0 in the panic_threshold docs)

Release Notes:

N/A

[Optional Fixes #Issue]

N/A
(But there is a related ticket on github/istio)

[Optional Deprecated:]

N/A


rene-m-hernandez · 2019-07-08T21:30:54Z

FWIW, it looks like prior to 1.9, setting panic threshold to 0 disabled it, which is the behavior we were expecting. That no longer seems to be the case.

mnktsts2 · 2019-07-09T00:29:25Z

@rene-m-hernandez As you say, setting the panic threshold to 0 disabled panic mode in v1.7.1, but not in v1.9.1.

snowp

Thanks for the fix, this makes sense to me. Could you add a test that verifies that we're not entering panic in this case?

snowp · 2019-07-08T16:35:35Z

source/common/upstream/load_balancer_impl.cc

      calculateNormalizedTotalAvailability(per_priority_health_, per_priority_degraded_);

-  if (normalized_total_availability == 0) {
+  uint64_t global_panic_threshold = std::min<uint64_t>(


i'd probably call this just panic_threshold since it's not related to global panic.

also a small nit: this can be const
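
Taken together, the two suggestions amount to something like the sketch below. The line in the diff above is truncated, so the arguments shown here are an assumption: `configured_percent` is a hypothetical stand-in for wherever the configured value comes from, and clamping to 100 is inferred only from the use of `std::min<uint64_t>`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: clamps a configured percentage to at most 100.
// The explicit template argument lets the int literal 100 and the uint64_t
// value be compared without a type mismatch.
inline uint64_t clampPanicThreshold(uint64_t configured_percent) {
  // Named panic_threshold (not global_panic_threshold) and declared const,
  // as requested in the review.
  const uint64_t panic_threshold = std::min<uint64_t>(100, configured_percent);
  assert(panic_threshold <= 100);
  return panic_threshold;
}
```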

Thanks for your review. I'll push the following changes:

- add a test that verifies that we're not entering panic when healthy_panic_threshold == 0
- use panic_threshold as the variable name
- use const


snowp

Nice, just a comment style nit

/wait

snowp · 2019-07-10T21:56:03Z

source/common/upstream/load_balancer_impl.cc

 //   if # of healthy hosts in priority set is low.
 // - normalized total health is 0%. All hosts are down. Redirect 100% of traffic to P=0 and enable
 // panic mode.
+//   However, disable panic mode only when healthy panic threshold is 0%


maybe change this comment so that on the previous bullet you say "- normalized total health is 0% and panic threshold is > 0" and phrase this line similarly?

Yes, that's better.
Then how about the following changes?

Modify lines 120-121:

// - normalized total health is 0%. All hosts are down. Redirect 100% of traffic to P=0.
//   And if panic threshold > 0% then enable panic mode for P=0, otherwise disable.

Delete line 122:

// However, disable panic mode only when healthy panic threshold is 0%

Sounds good to me!


snowp

LGTM, thanks!

alyssawilk

Yeah, I think this is definitely more consistent behavior - thanks for the fix!

Can you update one or both of the docs to make it clear 0 disables?
docs/root/intro/arch_overview/upstream/load_balancing/panic_threshold.rst
api/envoy/api/v2/cds.proto


mnktsts2 · 2019-07-15T18:07:24Z

> Can you update one or both of the docs to make it clear 0 disables?
> docs/root/intro/arch_overview/upstream/load_balancing/panic_threshold.rst
> api/envoy/api/v2/cds.proto

Sure. I updated both of the docs you indicated.
But I'm not a fluent English speaker, so I'm hoping that you or someone else can help me with the wording.

mnktsts2 · 2019-07-16T03:56:41Z

/retest

repokitteh-read-only · 2019-07-16T03:56:45Z

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #7478 (comment) was created by @mnktsts2.

see: more, trace.

alyssawilk

This is fantastic! I've added some minor rewording suggestions, but otherwise LGTM :-)

alyssawilk · 2019-07-16T13:21:17Z

docs/root/intro/arch_overview/upstream/load_balancing/panic_threshold.rst

 | 5%          |  65%        |  7%      | YES          |   93%    | NO           |  98%        |
 +-------------+-------------+----------+--------------+----------+--------------+-------------+

+Setting the panic threshold to 0%, panic mode can be disabled.


-> Panic mode can be disabled by setting the panic threshold to 0%.

alyssawilk · 2019-07-16T13:25:31Z

docs/root/intro/arch_overview/upstream/load_balancing/panic_threshold.rst


+Setting the panic threshold to 0%, panic mode can be disabled.
+
+If all hosts becomes unhealthy, normalized total health is 0%, all of traffic redirect to P=0.


reworded a bit:

If all hosts become unhealthy, normalized total health is 0%, and if the panic threshold is above 0%, all traffic will be redirected to P=0. However, if the panic threshold is 0% for any priority, that priority will never enter panic mode. In this case, if all hosts are unhealthy, Envoy will fail to select a host and will instead immediately return error responses with "503 - no healthy upstream".
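
The effect of this wording on a single request can be sketched as follows. This is an illustrative model under stated assumptions, not Envoy's implementation: `Host`, `chooseHost`, `responseFor`, and `demo` are hypothetical names, and the model simply picks the first eligible host.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, minimal view of an upstream host.
struct Host {
  std::string address;
  bool healthy;
};

// In panic mode every host is eligible, healthy or not. With panic mode
// disabled, only healthy hosts are eligible, so selection can fail.
inline std::optional<Host> chooseHost(const std::vector<Host>& hosts, bool in_panic) {
  for (const Host& host : hosts) {
    if (in_panic || host.healthy) {
      return host;
    }
  }
  return std::nullopt;
}

// Hypothetical mapping from the selection result to what the client sees.
inline std::string responseFor(const std::optional<Host>& host) {
  return host ? "forward to " + host->address : "503 - no healthy upstream";
}

// Usage: with every host unhealthy, panic mode decides the outcome.
inline void demo() {
  const std::vector<Host> hosts{Host{"10.0.0.1:80", false}};
  assert(responseFor(chooseHost(hosts, true)) == "forward to 10.0.0.1:80");
  assert(responseFor(chooseHost(hosts, false)) == "503 - no healthy upstream");
}
```

Returning `std::optional` makes the "no host selected" case explicit, which is the case the reworded paragraph describes for a 0% threshold.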

alyssawilk · 2019-07-16T13:25:50Z

docs/root/intro/arch_overview/upstream/load_balancing/panic_threshold.rst

+Consequently, for example in HTTP traffic, Envoy will immediately return error responses 
+with "503 - no healthy upstream".
+
+-----------+-------------+-------------+----------+--------------+----------+--------------+----------------------------+


I think we can skip the charts here, but leave them in if you think they help.


mnktsts2 · 2019-07-16T15:56:20Z

Thanks for your help!
I've corrected the sentences and removed the chart, as you suggested.

mnktsts2 · 2019-07-16T23:52:53Z

/retest

repokitteh-read-only · 2019-07-16T23:52:57Z

🔨 rebuilding ci/circleci: Build Error (failed build)

🐱

Caused by: a #7478 (comment) was created by @mnktsts2.

see: more, trace.

mnktsts2 · 2019-07-17T01:01:55Z

Does this need a retest after CircleCI's general billing issue is resolved?

mnktsts2 · 2019-07-17T06:23:06Z

/retest

repokitteh-read-only · 2019-07-17T06:23:09Z

🔨 rebuilding ci/circleci: Build Error (failed build)

🐱

Caused by: a #7478 (comment) was created by @mnktsts2.

see: more, trace.

alyssawilk · 2019-07-17T13:30:40Z

Sorry, our CircleCI was down because of unrelated billing issues.
It should be working now :-)

alyssawilk · 2019-07-17T13:31:51Z

should....
/retest

repokitteh-read-only · 2019-07-17T13:31:54Z

🔨 rebuilding ci/circleci: Build Error (failed build)

🐱

Caused by: a #7478 (comment) was created by @alyssawilk.

see: more, trace.

alyssawilk · 2019-07-17T13:41:58Z

Huh, unsure why retest / rebuild doesn't work but over on 7603 pushing a new commit (master merge) worked fine, so worst case you can try that and I'll LGTM again


mnktsts2 · 2019-07-17T14:23:06Z

umm...

pitiwari · 2019-07-19T23:31:04Z

@mnktsts2 @alyssawilk We are using branch 1.9.1 and want to set healthy_panic_threshold=0 via runtime or config. Without this fix, should we expect panic mode to be disabled, or do we need this commit? And will it work in both cases, setting it via runtime and via config?

mnktsts2 · 2019-07-20T01:25:53Z

@pitiwari I think that panic mode cannot be disabled without changing the condition for entering panic mode, as this fix does. If the same condition change is applied to v1.9.1, it should work in both cases (runtime and config), though I have not confirmed that.

In v1.9.1:

envoy/source/common/upstream/load_balancer_impl.cc

Lines 145 to 150 in ea248e2

    
    if (normalized_total_health == 0) {
      // Everything is terrible. All load should be to P=0. Turn on panic mode.
      ASSERT(per_priority_load_[0] == 100);
      per_priority_panic_[0] = true;
      return;
    }

…hreshold is 0% (envoyproxy#7478)

Currently, in load_balancer_impl.cc: recalculatePerPriorityPanic(), even if common_lb_config.healthy_panic_threshold is 0%, a load balancer enters panic mode whenever normalized_total_availability is 0%.
I guess, a user who intentionally set to healthy_panic_threshold = 0 expects to immediately return error responses if there is no available host checked by a load balancer.
(In fact, current load_balancer_impl.cc: isGlobalPanic() decide not to enter panic mode whenever healthy_panic_threshold is 0%.)
So I suggest that panic mode is disabled only when healthy_panic_threshold is 0%.
I want this change for automatic degenerating lower priority or optional back-end services.

Risk Level: Low (It seems that setting healthy_panic_threshold == 0 is a special case originally. It won't happen unless a user intend to disable panic mode, because default value of healthy_panic_threshold is 50%.)
Testing: unit and Integration tests with ./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.release'
Docs Changes: inline

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

mnktsts2 added 2 commits July 6, 2019 03:28

[Fix?] Should panic mode is disabled only when global_panic_threshold…

75fd3c7

… is 0% ? Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

add a comment for a case of healthy panic threshold == 0%

7c876a4

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

mnktsts2 force-pushed the panic_threshold_is_0 branch from f45908d to 7c876a4 Compare July 6, 2019 03:29

update expected mock call count corresponding to the change

4381cbf

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

mnktsts2 changed the title ~~[Fix?] Should panic mode is disabled only when healthy_panic_threshold is 0% ?~~ load_balancer: fix: panic mode is disabled only when healthy_panic_threshold is 0% ? Jul 6, 2019

mnktsts2 changed the title ~~load_balancer: fix: panic mode is disabled only when healthy_panic_threshold is 0% ?~~ load balancer: fix?: panic mode is disabled only when healthy_panic_threshold is 0% Jul 6, 2019

Merge branch 'master' of https://github.com/envoyproxy/envoy into pan…

6b11bdd

…ic_threshold_is_0 Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

mattklein123 assigned snowp Jul 8, 2019

snowp suggested changes Jul 9, 2019


mnktsts2 added 3 commits July 10, 2019 14:02

refactor a variable name and a modifier for envoyproxy#7478

6cd38ed

* use panic_threshold as variable name * use const Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

add a test that verifies that we're not entering panic in healthy_pan…

16ac445

…ic_threshold == 0 (envoyproxy#7478) Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

delete spaces in empty lines for check format

52ec032

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

snowp suggested changes Jul 10, 2019


repokitteh-read-only bot added the waiting label Jul 10, 2019

change comments for making conditions clear (envoyproxy#7478)

d67c5e8

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

repokitteh-read-only bot removed the waiting label Jul 11, 2019

snowp previously approved these changes Jul 14, 2019


snowp assigned alyssawilk Jul 15, 2019

alyssawilk reviewed Jul 15, 2019


alyssawilk added the waiting:any label Jul 15, 2019

update docs about disabling panic mode (envoyproxy#7478)

bd8da68

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

mnktsts2 dismissed snowp’s stale review via bd8da68 July 15, 2019 17:52

repokitteh-read-only bot removed the waiting:any label Jul 15, 2019

alyssawilk reviewed Jul 16, 2019


alyssawilk added the waiting label Jul 16, 2019

mnktsts2 added 2 commits July 16, 2019 15:35

reword sentences about disabling panic mode (envoyproxy#7478)

0d03076

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

delete over-enthusiastic spaces.

6c14454

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

repokitteh-read-only bot removed the waiting label Jul 16, 2019

alyssawilk approved these changes Jul 17, 2019


Merge remote-tracking branch 'origin/master' into panic_threshold_is_0

a094aee

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>

mnktsts2 mentioned this pull request Jul 17, 2019

WIP: ci trying for #7478 #7616

Closed

Merge branch 'master' into panic_threshold_is_0

34964df

alyssawilk merged commit ad9926f into envoyproxy:master Jul 17, 2019

mnktsts2 mentioned this pull request Jul 18, 2019

Disable Envoy's panic mode as default istio/istio#15609

Merged


