Skip to content

Add support for healthy panic threshold#496

Merged
rshriram merged 1 commit intoistio:release-1.1from
knrc:api-486
Nov 8, 2018
Merged

Add support for healthy panic threshold#496
rshriram merged 1 commit intoistio:release-1.1from
knrc:api-486

Conversation

@knrc
Copy link
Copy Markdown

@knrc knrc commented May 18, 2018

No description provided.

@googlebot googlebot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label May 18, 2018
@istio-testing
Copy link
Copy Markdown
Collaborator

Hi @knrc. Thanks for your PR.

I'm waiting for a istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knrc
Copy link
Copy Markdown
Author

knrc commented May 18, 2018

This PR is related to issue #486

@knrc
Copy link
Copy Markdown
Author

knrc commented May 28, 2018

/assign @sebastienvas

@sdake
Copy link
Copy Markdown
Member

sdake commented May 28, 2018

@sebastienvas noticed there is no OWNERS file - perhaps you can help fix that?

Cheers
-steve

@knrc
Copy link
Copy Markdown
Author

knrc commented May 30, 2018

@costinm Given your comments in the environment meeting should this be included for 1.0?

@knrc
Copy link
Copy Markdown
Author

knrc commented Jun 4, 2018

/retest

@istio-testing
Copy link
Copy Markdown
Collaborator

@knrc: you can't request testing unless you are a istio member.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@christian-posta
Copy link
Copy Markdown

christian-posta commented Jun 4, 2018 via email

// known to be healthy, if the percentage of healthy hosts drops below this panic threshold
// the proxy will disregard the health status and balance across all hosts. The default
// panic threshold is 50%.
int32 healthy_panic_threshold = 5;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you also implementing this? if not, you need to add a not-implemented-hide tag so that this does not show up in the docs.

@knrc
Copy link
Copy Markdown
Author

knrc commented Jun 6, 2018

@rshriram Yes, I already have an implementation for this. I asked @costinm how to proceed and he advised me to submit the API changes before the implementation.

Copy link
Copy Markdown
Member

@rshriram rshriram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized. This is not part of outlier detection at all.

@knrc
Copy link
Copy Markdown
Author

knrc commented Jun 6, 2018

Following up here to document a slack conversation.

The healthy panic threshold covers both active and passive health checks, in the outlier docs it contains the following

The host is ejected for some number of milliseconds. Ejection means that the host is marked
unhealthy and will not be used during load balancing unless the load balancer is in a panic scenario.

@rshriram there may be a better place for this in the API, do you have any suggestions? The code I have currently creates the following: https://pastebin.com/89g589RK

Should this be moved under LoadBalancerSettings?

incfly pushed a commit to incfly/api that referenced this pull request Jun 13, 2018
* Convert Pilot ip mesh attributes to bytes

* Update comment

* Update after review comment
@knrc
Copy link
Copy Markdown
Author

knrc commented Oct 5, 2018

The istio project PR is istio/istio#9176, it will need updating after api is merged

@knrc
Copy link
Copy Markdown
Author

knrc commented Oct 9, 2018

@rshriram Can you please take another look at this and the istio/istio#9176 PR? If you let me know your feedback I can make the appropriate changes.

uint64 minimum_ring_size = 4;
};

message CommonLbConfig {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please get an LGTM from networking folks. My comments here are structural.

Please be consistent with naming (i.e. CommonLbConfig => CommonLBConfig) and add comments to the message itself.

I'd also recommend putting this into the individual LB protos, if it makes sense. Otherwise, when editing, people need to add a "common" stanza for being able to set these things. It is much more intuitive to put these settings together when editing.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I can take a look at both of those changes

SimpleLB simple = 1;
ConsistentHashLB consistent_hash = 2;
}
CommonLbConfig common = 3;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding the parameter directly, without the 'common' ? If it is common, it doesn't need an extra indent level and struct. I know in envoy the config is deep and full of 'common' - but it makes it harder to author.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, @ozevren has made the same suggestion.

// known to be healthy, if the percentage of healthy hosts drops below this panic threshold
// the proxy will disregard the health status and balance across all hosts. The default
// panic threshold is 50%.
int32 healthy_panic_threshold = 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very good with names - but not sure this is the most intuitive. Maybe include_unhealthy_threshold ?
"Panic" is a bit confusing. I assume this matches envoy name ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the name used in envoy. I can change this to your name.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed up a new version but forgot to modify the name, will do this shortly and push up again

@costinm
Copy link
Copy Markdown
Contributor

costinm commented Oct 12, 2018

Do we want to add a section on health checking ? I think there are few other related things we may want to configure. It's good to include the setting in the yaml examples, to get an idea how it 'feels'

@rshriram
Copy link
Copy Markdown
Member

Okay. Sorry for the bike shedding. I see your point about the threshold applying to outlier detection. Please put it back in outlier detection and call it panic_threshold with comments saying outlier detection will be disabled if healthy hosts fall below this percentage. Default is 0%

@costinm
Copy link
Copy Markdown
Contributor

costinm commented Oct 15, 2018

I guess I'm a bit confused. From the comments, it doesn't seem related to outlier - but the behavior
when endpoints are not healthy and general load balancing.

Can you clarify in the docs if it is outlier-related or not, and how the 'health' is determined ? Does it require envoy actively checking health (and the associated settings), or outlier ejection, or both ?

I don't think we have settings or code for the active health checking. At some point we'll add them - would they belong to 'outlier' section ?

Like Shriram, I'm sorry for bike shedding, but APIs are very hard to change after they are released.

@knrc
Copy link
Copy Markdown
Author

knrc commented Oct 22, 2018

I guess I'm a bit confused. From the comments, it doesn't seem related to outlier - but the behavior
when endpoints are not healthy and general load balancing.

Can you clarify in the docs if it is outlier-related or not, and how the 'health' is determined ? Does it require envoy actively checking health (and the associated settings), or outlier ejection, or both ?

I don't think we have settings or code for the active health checking. At some point we'll add them - would they belong to 'outlier' section ?

Like Shriram, I'm sorry for bike shedding, but APIs are very hard to change after they are released.

Sorry, for some reason I missed these comments. The sentences I referred to on Jun 7 are still in the envoy outlier detection docs.

I'll dig through the envoy code later today and document the flow here, in the meantime ignore the update I pushed.

BTW don't worry about bike shedding, I would rather we took the time and got this right.

// known to be healthy, if the percentage of healthy hosts drops below this panic threshold
// the proxy will disregard the health status and balance across all hosts. The default
// panic threshold is 50%.
int32 healthy_panic_threshold = 3;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move it to outlier detection? and call it panic_threshold.. So it would make more sense in context of outlier detection

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably worth discussing with @costinm as he wanted it in the LB section and called something like include_unhealthy_threshold instead of using panic. The current name matches the envoy configuration and it applies to general LB features but impacts outlier detection.

When a host is ejected by the outlier detection it will set the host's health flag to contain Host::HealthFlag::FAILED_OUTLIER_CHECK, this will then cause the load balancer code to remove it from the list of unhealthy hosts until it is either unejected by the outlier detection or the panic occurs.

@rshriram @costinm What do you think? Keep it where it is or move it? Keep the name or change it?

@christian-posta
Copy link
Copy Markdown

christian-posta commented Oct 23, 2018 via email

@rshriram
Copy link
Copy Markdown
Member

Over the years, I have been trying to maintain some consistency in networking APIs. For good or worse, we choose to have clearer explanations and more user friendly fields rather than mirroring everything from Envoy (otherwise there is no point in having a separate Istio API).
I prefer panic_threshold / detection_reset_low_watermark.

While it might make sense in Envoy to have this field in laod balancer (for future purposes), as far as Istio is concerned, it is extremely confusing to end users to have a panic_threshold field in load balancer when it refers to outlier detection (that could be disabled). So please add it to outlier detection section.

@knrc
Copy link
Copy Markdown
Author

knrc commented Oct 24, 2018

@rshriram I've pushed up changes and moved this to outlier detection, let me know what you think. @costinm Can you also take a look?

// Load balancing panic threshold. The load balancing pool will normally include hosts
// known to be healthy, if the percentage of healthy hosts drops below this panic threshold
// the proxy will disregard the health status and balance across all hosts. The default
// panic threshold is 50%.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

threshold_percent is redundant.
Also please fix the documentation. "Load balancing panic threshold" is a vague thing to say in outlier detection. The second sentence has a typo/grammar issue..

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, I was just trying to make the name consistent with max ejection. The documentation is a copy of the envoy docs but I can change that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found a better name for this..

// Outlier detection will be enabled as long as the associated load balancing 
// pool has atleast min_health_percent hosts in healthy mode. When the 
// percentage of healthy hosts in the load balancing pool drops below this 
// threshold, outlier detection will be disabled and the proxy will load balance
// across all hosts in the pool (healthy and unhealthy).
int32 min_health_percent

@knrc
Copy link
Copy Markdown
Author

knrc commented Oct 25, 2018

@rshriram updated

@rshriram
Copy link
Copy Markdown
Member

Can you send this PR to release-1.1 branch please?

@googlebot
Copy link
Copy Markdown
Collaborator

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

@googlebot googlebot added cla: no Set by the Google CLA bot to indicate the author of a PR has not signed the Google CLA. and removed cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. labels Oct 26, 2018
@knrc knrc changed the base branch from master to release-1.1 October 26, 2018 18:28
@knrc
Copy link
Copy Markdown
Author

knrc commented Oct 26, 2018

It looks as if googlebot is confused so I'll try to repush

@istio-testing
Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: sebastienvas

If they are not already assigned, you can assign the PR to them by writing /assign @sebastienvas in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@googlebot
Copy link
Copy Markdown
Collaborator

CLAs look good, thanks!

@googlebot googlebot added cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. and removed cla: no Set by the Google CLA bot to indicate the author of a PR has not signed the Google CLA. labels Oct 26, 2018
@rshriram rshriram merged commit d03e770 into istio:release-1.1 Nov 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants