Skip to content

load balancer: fix?: panic mode is disabled only when healthy_panic_threshold is 0%#7478

Merged
alyssawilk merged 13 commits intoenvoyproxy:masterfrom
mnktsts2:panic_threshold_is_0
Jul 17, 2019
Merged

load balancer: fix?: panic mode is disabled only when healthy_panic_threshold is 0%#7478
alyssawilk merged 13 commits intoenvoyproxy:masterfrom
mnktsts2:panic_threshold_is_0

Conversation

@mnktsts2
Copy link
Copy Markdown
Contributor

@mnktsts2 mnktsts2 commented Jul 6, 2019

For an explanation of how to fill out the fields, please see the relevant section
in PULL_REQUESTS.md

Description:

  • Currently, in load_balancer_impl.cc: recalculatePerPriorityPanic(), even if common_lb_config.healthy_panic_threshold is 0%, a load balancer enters panic mode whenever normalized_total_availability is 0%.
  • I guess, a user who intentionally set to healthy_panic_threshold = 0 expects to immediately return error responses if there is no available host checked by a load balancer.
  • (In fact, current load_balancer_impl.cc: isGlobalPanic() decide not to enter panic mode whenever healthy_panic_threshold is 0%.)
  • So I suggest that panic mode is disabled only when healthy_panic_threshold is 0%.
  • I want this change for automatic degenerating lower priority or optional back-end services.

Risk Level:

  • Low
  • (It seems that setting healthy_panic_threshold == 0 is a special case originally. It won't happen unless a user intend to disable panic mode, because default value of healthy_panic_threshold is 50%.)

Testing:

  • Done: unit and Integration tests with ./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.release'
  • Done: manual tests with attached config/script files
    • config-enable_panic.yaml and config-disable_panic.yaml are envoy's config files to route httpbin.org.
    • config-enable_panic.yaml has healthy_panic_threshold: 10 for entering panic mode.
    • config-disable_panic.yaml has healthy_panic_threshold: 0 for disabling panic mode.
    • For each test case, run envoy, request to localhost:10000/status/504, check its response code and envoy stats of cluster.backend.lb_healthy_panic
    • I verified that "panic mode is disabled only when healthy_panic_threshold is 0%"
Test case healthy_panic_threshold panic mode response code lb_healthy_panic
current envoy with config-enable_panic.yaml 10 enter 504 > 0
current envoy with config-disable_panic.yaml 0 enter 504 > 0
modified envoy with config-enable_panic.yaml 10 enter 504 > 0
modified envoy with config-disable_panic.yaml 0 not enter 503 = 0

Docs Changes:

Release Notes:

  • N/A

[Optional Fixes #Issue]

[Optional Deprecated:]

  • N/A

mnktsts2 added 2 commits July 6, 2019 03:28
… is 0% ?

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
@mnktsts2 mnktsts2 force-pushed the panic_threshold_is_0 branch from f45908d to 7c876a4 Compare July 6, 2019 03:29
Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
@mnktsts2 mnktsts2 changed the title [Fix?] Should panic mode is disabled only when healthy_panic_threshold is 0% ? load_balancer: fix: panic mode is disabled only when healthy_panic_threshold is 0% ? Jul 6, 2019
@mnktsts2 mnktsts2 changed the title load_balancer: fix: panic mode is disabled only when healthy_panic_threshold is 0% ? load balancer: fix?: panic mode is disabled only when healthy_panic_threshold is 0% Jul 6, 2019
…ic_threshold_is_0

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
@rene-m-hernandez
Copy link
Copy Markdown

FWIW, it looks like prior to 1.9, setting panic threshold to 0 disabled it, which is the behavior we were expecting. That no longer seems to be the case.

@mnktsts2
Copy link
Copy Markdown
Contributor Author

mnktsts2 commented Jul 9, 2019

@rene-m-hernandez As you say, setting panic threshold to 0 disabled panic mode in v1.7.1, and not in v1.9.1

Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, this makes sense to me. Could you add a test that verifies that we're not entering panic in this case?

calculateNormalizedTotalAvailability(per_priority_health_, per_priority_degraded_);

if (normalized_total_availability == 0) {
uint64_t global_panic_threshold = std::min<uint64_t>(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd probably call this just panic_threshold since it's not related to global panic.

also a small nit: this can be const

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review. I'll push the following change.

  • add a test that verifies that we're not entering panic in healthy_panic_threshold == 0
  • use panic_threshold as variable name
  • use const

mnktsts2 added 3 commits July 10, 2019 14:02
* use panic_threshold as variable name
* use const

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
…ic_threshold == 0 (envoyproxy#7478)

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, just a comment style nit

/wait

// if # of healthy hosts in priority set is low.
// - normalized total health is 0%. All hosts are down. Redirect 100% of traffic to P=0 and enable
// panic mode.
// However, disable panic mode only when healthy panic threshold is 0%
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe change this comment so that on the previous bullet you say "- normalized total health is 0% and panic threshold is > 0" and phrase this line similarly?

Copy link
Copy Markdown
Contributor Author

@mnktsts2 mnktsts2 Jul 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's better.
Then, how about the following changes ??

  • Modify line 120-121:
// - normalized total health is 0%. All hosts are down. Redirect 100% of traffic to P=0.
//   And if panic threshold > 0% then enable panic mode for P=0, otherwise disable.
  • Delete line 122:
// However, disable panic mode only when healthy panic threshold is 0%

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me!

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
snowp
snowp previously approved these changes Jul 14, 2019
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this is definitely more consistent behavior - thanks for the fix!

Can you update one or both of the docs to make it clear 0 disables?
docs/root/intro/arch_overview/upstream/load_balancing/panic_threshold.rst
api/envoy/api/v2/cds.proto

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
@mnktsts2
Copy link
Copy Markdown
Contributor Author

Can you update one or both of the docs to make it clear 0 disables?
docs/root/intro/arch_overview/upstream/load_balancing/panic_threshold.rst
api/envoy/api/v2/cds.proto

Sure. I updated both of the docs you indicated.
But, I'm not a fluent English speaker, so I'm hoping that you or someone could help me.

@mnktsts2
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #7478 (comment) was created by @mnktsts2.

see: more, trace.

Copy link
Copy Markdown
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic! I've added some minor rewording suggestions, but otherwise LGTM :-)

| 5% | 65% | 7% | YES | 93% | NO | 98% |
+-------------+-------------+----------+--------------+----------+--------------+-------------+

Setting the panic threshold to 0%, panic mode can be disabled.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> Panic mode can be disabled by setting the panic threshold to 0%.


Setting the panic threshold to 0%, panic mode can be disabled.

If all hosts becomes unhealthy, normalized total health is 0%, all of traffic redirect to P=0.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reworded a bit:

If all hosts become unhealthy normalized total health is 0%, and if the panic threshold is above 0% all traffic will be redirected to P=0.  However, if the panic threshold is 0% for any priority, that priority will never enter panic mode.  In this case if all hosts are unhealthy, Envoy will fail to select a host and will instead immediately return error responses with "503 - no healthy upstream".

Consequently, for example in HTTP traffic, Envoy will immediately return error responses
with "503 - no healthy upstream".

+-----------+-------------+-------------+----------+--------------+----------+--------------+----------------------------+
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can not do the charts here, but leave them in if you think they help.

mnktsts2 added 2 commits July 16, 2019 15:35
Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
@mnktsts2
Copy link
Copy Markdown
Contributor Author

Thanks for your help !!
I've corrected the sentences and removed the chart , as you presented.

@mnktsts2
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

🔨 rebuilding ci/circleci: Build Error (failed build)

🐱

Caused by: a #7478 (comment) was created by @mnktsts2.

see: more, trace.

@mnktsts2
Copy link
Copy Markdown
Contributor Author

It's need to retest after resolving CircleCI's general billing issue ... ?

@mnktsts2
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

🔨 rebuilding ci/circleci: Build Error (failed build)

🐱

Caused by: a #7478 (comment) was created by @mnktsts2.

see: more, trace.

@alyssawilk
Copy link
Copy Markdown
Contributor

Sorry, our CircleCi was down because of unrelated billing issues.
It should be working now :-)

@alyssawilk
Copy link
Copy Markdown
Contributor

should....
/retest

@repokitteh-read-only
Copy link
Copy Markdown

🔨 rebuilding ci/circleci: Build Error (failed build)

🐱

Caused by: a #7478 (comment) was created by @alyssawilk.

see: more, trace.

@alyssawilk
Copy link
Copy Markdown
Contributor

Huh, unsure why retest / rebuild doesn't work but over on 7603 pushing a new commit (master merge) worked fine, so worst case you can try that and I'll LGTM again

@mnktsts2
Copy link
Copy Markdown
Contributor Author

umm...

@alyssawilk alyssawilk merged commit ad9926f into envoyproxy:master Jul 17, 2019
@pitiwari
Copy link
Copy Markdown
Contributor

@mnktsts2 @alyssawilk we are using branch 1.9.1 and want to set healthy_panic_threshold=0 via runtime or config. Without this fix do we expect the panic mode to be disabled or we need this commit ?. And will it work for both the cases, setting via runtime and config

@mnktsts2
Copy link
Copy Markdown
Contributor Author

mnktsts2 commented Jul 20, 2019

@pitiwari I think that panic mode can not be disabled without changing the condition to enter panic mode, like this fix. And if the condition change is applied to v1.9.1 it seems to work in the both cases (runtime and config) ... ?

In v1.9.1:

if (normalized_total_health == 0) {
// Everything is terrible. All load should be to P=0. Turn on panic mode.
ASSERT(per_priority_load_[0] == 100);
per_priority_panic_[0] = true;
return;
}

TAOXUY pushed a commit to TAOXUY/envoy that referenced this pull request Jul 22, 2019
…hreshold is 0% (envoyproxy#7478)

Currently, in load_balancer_impl.cc: recalculatePerPriorityPanic(), even if common_lb_config.healthy_panic_threshold is 0%, a load balancer enters panic mode whenever normalized_total_availability is 0%.
I guess, a user who intentionally set to healthy_panic_threshold = 0 expects to immediately return error responses if there is no available host checked by a load balancer.
(In fact, current load_balancer_impl.cc: isGlobalPanic() decide not to enter panic mode whenever healthy_panic_threshold is 0%.)
So I suggest that panic mode is disabled only when healthy_panic_threshold is 0%.
I want this change for automatic degenerating lower priority or optional back-end services.

Risk Level: Low
(It seems that setting healthy_panic_threshold == 0 is a special case originally. It won't happen unless a user intend to disable panic mode, because default value of healthy_panic_threshold is 50%.)
Testing: unit and Integration tests with ./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.release'
Docs Changes: inline

Signed-off-by: mnktsts2 <mnktsts2@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants