Skip to content

Configurable HorizontalPodAutoscaler#74525

Merged
k8s-ci-robot merged 5 commits intokubernetes:masterfrom
gliush:configurable-hpa-limits
Dec 11, 2019
Merged

Configurable HorizontalPodAutoscaler#74525
k8s-ci-robot merged 5 commits intokubernetes:masterfrom
gliush:configurable-hpa-limits

Conversation

@gliush
Copy link
Copy Markdown
Contributor

@gliush gliush commented Feb 25, 2019

I introduce an algorithm-agnostic HPA object configuration that will configure each particular HPA scaling behavior.

KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-autoscaling/20190307-configurable-scale-velocity-for-hpa.md

This PR contains only API changes for now. The business logic will be introduced in a separate PR. For now I keep all the changes in my local repo until the current PR is approved:
gliush#2

What type of PR is this?
/kind feature

What this PR does / why we need it:
Different applications may have different business values, different logic and may require different scaling behaviors.
At the moment, there’s only one cluster-level configuration parameter that influence how fast the cluster is scaled down:
--horizontal-pod-autoscaler-downscale-stabilization-window (default to 5 min)

And a couple of hard-coded constants that specify how fast the cluster can scale up:

This PR introduces an algorithm-agnostic HPA object configuration that will configure each particular HPA scaling behavior.
For both directions (scale up and scale down) it will be possible to specify one or several behavior that will control how the HPA controller will scale the resources. Effectively each behavior is a specification of "how many percents/pods could be added/removed per some period of time".
Additionally, each direction might have a StabilizationWindowSeconds parameter set to gather recommendations for some time and pick the safest change.

For more information and motivation read the KEPs (links are above).

Which issue(s) this PR fixes:

Fixes #39090
Fixes #65097
Fixes #69428

It covers partly #56335

Special notes for your reviewer:
The API changes are backward compatible. All current defaults are kept the same.

Does this PR introduce a user-facing change?:

Added the HPA API, that allows scale behavior to be configured through the HPA
`behavior` field. Behaviors are specified separately for scaling up and down. In
each direction a stabilization window can be specified as well as a list of
policies and how to select amongst them. Policies can limit the absolute number
of pods added or removed, or the percentage of pods added or removed.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels Feb 25, 2019
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 25, 2019
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @gliush. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 25, 2019
@gliush
Copy link
Copy Markdown
Contributor Author

gliush commented Feb 25, 2019

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Feb 25, 2019
@k8s-ci-robot k8s-ci-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 25, 2019
@gliush
Copy link
Copy Markdown
Contributor Author

gliush commented Feb 25, 2019

@thockin: Could you have a look? @mwielgus says that you're the most familiar with the matter.
I can't do it by myself, sorry.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 25, 2019
@thockin
Copy link
Copy Markdown
Member

thockin commented Mar 1, 2019

Is there a KEP for this? Generally we want to see the design work independent of the code.

I am happy to take up the API review, but can't really do that until I know the domain experts (e.g. @mwielgus) are happy with the design.

@thockin
Copy link
Copy Markdown
Member

thockin commented Mar 1, 2019

I see there's a doc but that's not a KEP

@thockin
Copy link
Copy Markdown
Member

thockin commented Mar 1, 2019 via email

@mwielgus
Copy link
Copy Markdown
Contributor

mwielgus commented Mar 1, 2019

@thockin - I have worked with @gliush on this for a while. We agreed on the rough idea and the configuration parameters, so we are proceeding with the api.

This functionality has been requested multiple times already so I'm glad that we finally have someone who is working on it :).

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 14, 2019
@josephburnett
Copy link
Copy Markdown
Contributor

/assign @thockin

@josephburnett
Copy link
Copy Markdown
Contributor

josephburnett commented Nov 14, 2019

/hold

@gliush I dug into the failed pull-kubernetes-kubemark-e2e-gce-big test and saw this:

TestMetrics error: [restart counts violation: RestartCount(kube-controller-manager-e2e-74525-ac87c-kubemark-master, kube-controller-manager)=1, want <= 0]

I deployed these changes with kube-up.sh and when I add behavior.scaleDown.stabilizationWindowSeconds: 30 and apply load then controller-manager crash loops. Nothing in the logs so I don't know why, but this must be fixed and requires an e2e test.

@josephburnett
Copy link
Copy Markdown
Contributor

Maybe related: #84990

Copy link
Copy Markdown
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll approve this as-is, but you really need that one extra case in defaulting - please followup.

I see there's a hold for another reason, so this may be in vain, but good luck. :)

/lgtm
/approve

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API spec says that Behavior == nil will also set the defaults. I see the test cases don't cover that, but I think you want to create a value for behavior and then merge the defaults in, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should really be 15 to keep parity with existing defaults, which resync HPAs every 15 seconds.

@gliush
Copy link
Copy Markdown
Contributor Author

gliush commented Nov 26, 2019

I've changed the default period seconds for HPA: 60s -> 15s to correspond to the previous behavior.
All the tests are passed.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gliush, josephburnett, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@josephburnett
Copy link
Copy Markdown
Contributor

@arjunrn thanks for fixing the controller-manager crash! I've done some manual testing in a cluster and it works nicely. Added and removed behaviors which are respected.

Just two requests:

  1. Please squash this PR before I give an LGTM. There are 21 commits and this repo doesn't squash for you!
  2. Please provide a pointer to the E2E test for this. @arjunrn I think you have one already. But I want to be sure we have something that will guard against a bug like the controller-manager crashing.

@josephburnett
Copy link
Copy Markdown
Contributor

/lgtm

I checked and there is no difference between the latest squash and the unsquashed changes I tested. Nice work!

@josephburnett
Copy link
Copy Markdown
Contributor

/hold cancel

@arjunrn
Copy link
Copy Markdown
Contributor

arjunrn commented Dec 11, 2019

/test pull-kubernetes-kubemark-e2e-gce-big

@saschagrunert
Copy link
Copy Markdown
Member

Hey @gliush 👋, Im currently looking at the latest generated releases notes and saw that this note needs to either state the user facing change or set it to NONE. Do you think you can change the PR description for that? The next re-generation of the notes will fix that issue on our side later on. :)

@gliush
Copy link
Copy Markdown
Contributor Author

gliush commented Jan 7, 2020

@saschagrunert : I've updated the release notes, could you check, please? I'm a little bit concerned about the style and the amount of details.

@saschagrunert
Copy link
Copy Markdown
Member

Thank you @gliush 🙏 I assume the API is new and we should state that to the user as well, like:

Added the HPA API, which allows scale ...

So in general it is looking good, but we have include information like mentioned here: https://github.com/saschagrunert/community/blob/master/contributors/guide/release-notes.md#contents-of-a-release-note

@gliush
Copy link
Copy Markdown
Contributor Author

gliush commented Jan 7, 2020

@saschagrunert : thank you! Done!

@saschagrunert
Copy link
Copy Markdown
Member

Wonderful, thank you again! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubectl cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cli Categorizes an issue or PR as relevant to SIG CLI. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

Status: API review completed, 1.17