
WIP: DRA e2e: instructions for setting up cluster with autoscaler support #123078

Closed
pohly wants to merge 1 commit into kubernetes:master from pohly:dra-autoscaler-docs

Conversation

@pohly
Contributor

pohly commented Feb 1, 2024

What type of PR is this?

/kind documentation

What this PR does / why we need it:

These instructions will be useful for developers who want to run Kubernetes with DRA or other experimental features enabled in a cluster that supports autoscaling.

Does this PR introduce a user-facing change?

NONE

/assign @marquiz

You were already using these instructions. Okay to merge?

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Feb 1, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/documentation Categorizes issue or PR as related to documentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 1, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2024
Contributor

@marquiz marquiz left a comment

Thanks @pohly for the PR. A few comments below about things that I needed to change.

Comment thread: test/e2e/dra/README.md (Outdated)
+ scheduler:
+ extraArgs:
+ feature-gates: DynamicResourceAllocation=true,ContextualLogging=true
+ # TODO: enable features in kubelet
Contributor

Should be solved by just adding kubeletExtraArgs to the initConfiguration (and joinConfiguration?) below?
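
Something like this, I believe (a minimal sketch: nodeRegistration.kubeletExtraArgs is standard kubeadm config, but exactly which gates to enable on the kubelet is an assumption):

    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          # Assumed gate, mirroring the scheduler config above.
          feature-gates: DynamicResourceAllocation=true
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          feature-gates: DynamicResourceAllocation=true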

Contributor Author

Ack, removed.

Comment thread: test/e2e/dra/README.md (Outdated)
Comment on lines +153 to +154
! # Currently ignored and/or overwritten?
! # /var/run/kubeadm/kubeadm.yaml on the control plane container doesn't have it.
Contributor

This works (at least for me) if you just patch the quick-start-control-plane KubeadmControlPlaneTemplate. So these comments could be dropped?
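
For reference, the patched template would then contain roughly this (a sketch, assuming the v1beta1 quick-start ClusterClass; only the relevant fields are shown):

    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlaneTemplate
    metadata:
      name: quick-start-control-plane
    spec:
      template:
        spec:
          kubeadmConfigSpec:
            clusterConfiguration:
              scheduler:
                extraArgs:
                  feature-gates: DynamicResourceAllocation=true,ContextualLogging=true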

Contributor Author

Yes, seems to work now. Removed.

Comment thread: test/e2e/dra/README.md
- machinePools:
- - class: default-worker
- name: mp-0
- replicas: 1
Contributor

I think this should be dropped (so that the autoscaler knows that it can control the field)

Suggested change
- replicas: 1

Contributor Author

I am removing the entire "machinePools" section here because that is for a pool with an experimental API. What we want is just the "machineDeployments". This is what is left after patching:

    workers:
      machineDeployments:
      - class: default-worker
        name: md-0
        replicas: 1

Contributor Author

@marquiz: agreed?

I pushed an update that addresses the other points.

Sorry for the delay, I had to put this aside for a while to focus on 1.30. It's still useful to have for 1.31, because that is hopefully where cluster autoscaler support will land.

/cc @towca

Contributor

@pohly sorry, my comment was off by a few lines. I think the replicas field should be removed from the machineDeployments.

Contributor Author

As it stands, the cluster comes up with one worker node. Then the autoscaler can scale up or down. I prefer that over not bringing up any worker node initially, because otherwise a problem with worker nodes only surfaces later.

Your comment was "so that the autoscaler knows that it can control the field" - I don't think that setting the initial value prevents that.

Contributor

IIRC in my testing it did. When the autoscaler sees that it's set, it determines that it's controlled by some other entity and refuses to act on it. It defaults to 1 (if not set), IIRC.

Contributor Author

You are right, the instructions are not enough to actually make the autoscaler do anything. It finds no node groups. But did you really get it to work as you suggested above?

What is missing is something else: the annotations on the MachineDeployment:
https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling#enabling-autoscaling

@sbueringer: any suggestion how to get those annotations added automatically to the MachineDeployment?

Do we need --node-group-auto-discovery=clusterapi:clusterName=capi-quickstart or is it enabled by default?

https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling#configuring-node-group-auto-discovery says "you must configure node group auto discovery" but https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling#enabling-autoscaling says "The autoscaler will monitor any MachineSet, MachineDeployment, or MachinePool containing both of these annotations."

When the cluster comes up after following the instructions in this README.md, it has a generated machine deployment with a variable name. I could come up with a kubectl invocation that adds the annotations, but it would be nicer to include that in the cluster configuration.
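
Something like this would do it as a stopgap (a sketch: --all because the MachineDeployment name is generated, and the min/max values are arbitrary):

    # Run against the management cluster, where the MachineDeployment lives.
    $ kubectl annotate machinedeployment --all \
        cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size="1" \
        cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size="3"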

Also, does the cloud provider in use here (Docker, from clusterctl init --infrastructure docker) support scale from zero automatically? https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling#scale-from-zero-support documents some additional annotations, but it is not clear if they are needed.

Member

@sbueringer Mar 28, 2024

@sbueringer: any suggestion how to get those annotations added automatically to the MachineDeployment?

You can patch:

        - class: default-worker
          name: md-0
          replicas: 1

to be this instead:

        - class: default-worker
          name: md-0
          metadata:
            annotations:
              cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
              cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "3"

The result is:

  • the annotations will be set on the MachineDeployment
  • the MachineDeployment webhook will pick up the value of min-size as the initial value for MD.spec.replicas

If you set both replicas and the autoscaler annotations here, the CAPI controller and the autoscaler would both continuously try to write the MD.spec.replicas field.

Do we need --node-group-auto-discovery=clusterapi:clusterName=capi-quickstart or is it enabled by default?

I'm not sure if you have to set --node-group-auto-discovery. In our e2e test we do it, but we also have other tests running in other namespaces. You can give it a try without it.
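
If you do set it, the invocation would look roughly like this (a sketch; --cloud-provider=clusterapi selects the autoscaler's Cluster API provider):

    $ cluster-autoscaler \
        --cloud-provider=clusterapi \
        --node-group-auto-discovery=clusterapi:clusterName=capi-quickstart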

Also, does the cloud provider in use here (Docker, from clusterctl init --infrastructure docker) support scale from zero automatically?

CAPD falls under:

If your Cluster API provider does not have support for scaling from zero, you may still use this feature through the capacity annotations.

So you'll have to set the annotations. Also, in CAPD you can't really define the size of your "Machine" via the DockerMachine spec the way you can in e.g. AWS. But you can check Node.status.capacity on the created Nodes to see what a Node's capacity is. Not really sure where that comes from, though.
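
Concretely, that would mean extending the metadata patch from above with the capacity annotations (a sketch: the annotation names come from the scale-from-zero docs linked above, the values are placeholders):

        - class: default-worker
          name: md-0
          metadata:
            annotations:
              cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
              cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "3"
              # Placeholder capacity values for scale from zero with CAPD.
              capacity.cluster-autoscaler.kubernetes.io/cpu: "2"
              capacity.cluster-autoscaler.kubernetes.io/memory: "4G"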

Contributor Author

Reminder to self: this still needs to be included in the instructions.

@bart0sh
Contributor

bart0sh commented Feb 2, 2024

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 2, 2024
@ndixita
Contributor

ndixita commented Feb 7, 2024

/assign @SergeyKanzhelev

@ndixita
Contributor

ndixita commented Feb 7, 2024

/assign @bart0sh

@bart0sh
Contributor

bart0sh commented Feb 7, 2024

@pohly please address review comments, thanks

@dims
Member

dims commented Mar 5, 2024

@pohly still has 3 comments that need a response

These instructions will be useful for developers who want to run Kubernetes
with DRA or other experimental features enabled in a cluster that supports
autoscaling.
@pohly force-pushed the dra-autoscaler-docs branch from ec6e2b2 to 7c6c17c on March 25, 2024 15:55
@k8s-ci-robot k8s-ci-robot requested a review from towca March 25, 2024 15:59
@sbueringer
Member

sbueringer commented Mar 28, 2024

Answered on the thread. Otherwise lgtm as far as I can tell

Comment thread: test/e2e/dra/README.md
# The control plane won’t be Ready until we install a CNI in the next step.

$ KUBECONFIG=capi-quickstart.kubeconfig kubectl --kubeconfig=./capi-quickstart.kubeconfig \
apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml
Member

You should use kindnetd, it is much more stable (see kubernetes-sigs/cluster-api@d0c495a) and consumes fewer resources.
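
For example, something like this (a sketch: the exact manifest URL is an assumption, CAPI carries a kindnet manifest in its e2e test data):

    $ KUBECONFIG=./capi-quickstart.kubeconfig kubectl apply -f \
        https://raw.githubusercontent.com/kubernetes-sigs/cluster-api/main/test/e2e/data/cni/kindnet/kindnet.yaml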

Member

+1 Good catch!

Member

what can I say, it is my baby :)

Member

@sbueringer Apr 11, 2024

While we're here: thanks for maintaining kind & kindnet. It's a very nice and stable foundation to build our own testing in CAPI on. Saves us so much time & effort.

Member

kind is Ben's baby,

appreciate the compliments ❤️

@pohly pohly changed the title DRA e2e: instructions for setting up cluster with autoscaler support WIP: DRA e2e: instructions for setting up cluster with autoscaler support Apr 29, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 28, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.


In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/test
  • cncf-cla: yes (the PR's author has signed the CNCF CLA)
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • kind/documentation: Categorizes issue or PR as related to documentation.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.
