
WIP: DRA e2e: instructions for setting up cluster with autoscaler support #123078

Closed
pohly wants to merge 1 commit into kubernetes:master from pohly:dra-autoscaler-docs

Conversation

@pohly
Contributor

pohly commented Feb 1, 2024

What type of PR is this?

/kind documentation

What this PR does / why we need it:

These instructions will be useful for developers who want to run Kubernetes with DRA or other experimental features enabled in a cluster that supports autoscaling.

Does this PR introduce a user-facing change?

NONE

/assign @marquiz

You were already using these instructions. Okay to merge?

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Feb 1, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/documentation Categorizes issue or PR as related to documentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 1, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2024
Contributor

@marquiz marquiz left a comment

Thanks @pohly for the PR. A few comments below about things that I needed to change.

Comment thread: test/e2e/dra/README.md (Outdated)
+ scheduler:
+ extraArgs:
+ feature-gates: DynamicResourceAllocation=true,ContextualLogging=true
+ # TODO: enable features in kubelet
Contributor

Should be solved by just adding kubeletExtraArgs to the initConfiguration (and joinConfiguration?) below?
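
Something like this, I believe (a minimal sketch: nodeRegistration.kubeletExtraArgs is standard kubeadm config, but exactly which gates to enable on the kubelet is an assumption):

    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          # Assumed gate, mirroring the scheduler config above.
          feature-gates: DynamicResourceAllocation=true
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          feature-gates: DynamicResourceAllocation=true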

Contributor Author

Ack, removed.

Comment thread: test/e2e/dra/README.md (Outdated)
Comment on lines +153 to +154
! # Currently ignored and/or overwritten?
! # /var/run/kubeadm/kubeadm.yaml on the control plane container doesn't have it.
Contributor

This works (at least for me) if you just patch the quick-start-control-plane KubeadmControlPlaneTemplate. So these comments could be dropped?
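
For reference, the patched template would then contain roughly this (a sketch, assuming the v1beta1 quick-start ClusterClass; only the relevant fields are shown):

    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlaneTemplate
    metadata:
      name: quick-start-control-plane
    spec:
      template:
        spec:
          kubeadmConfigSpec:
            clusterConfiguration:
              scheduler:
                extraArgs:
                  feature-gates: DynamicResourceAllocation=true,ContextualLogging=true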

Contributor Author

Yes, seems to work now. Removed.

Comment thread: test/e2e/dra/README.md
- machinePools:
- - class: default-worker
- name: mp-0
- replicas: 1
Contributor

I think this should be dropped (so that the autoscaler knows that it can control the field)

Suggested change
- replicas: 1

Contributor Author

I am removing the entire "machinePools" section here because that is for a pool with an experimental API. What we want is just the "machineDeployments". This is what is left after patching:

    workers:
      machineDeployments:
      - class: default-worker
        name: md-0
        replicas: 1

Contributor Author

@marquiz: agreed?

I pushed an update that addresses the other points.

Sorry for the delay, I had to put this aside for a while to focus on 1.30. It's still useful to have for 1.31, because that is hopefully where cluster autoscaler support will land.

/cc @towca

Contributor

@pohly sorry, my comment was off by a few lines. I think the replicas field should be removed from the machineDeployments.

Contributor Author

As it stands, the cluster comes up with one worker node. Then the autoscaler can scale up or down. I prefer that over not bringing up any worker node initially, because otherwise a problem with worker nodes only surfaces later.

Your comment was "so that the autoscaler knows that it can control the field" - I don't think that setting the initial value prevents that.

Contributor

IIRC in my testing it did. When the autoscaler sees that it's set, it determines that it's controlled by some other entity and refuses to act on it. It defaults to 1 (if not set), IIRC.

Contributor Author

You are right, the instructions are not enough to actually make the autoscaler do anything. It finds no node groups. But did you really get it to work as you suggested above?

What is missing is something else: the annotations on the MachineDeployment:
https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling#enabling-autoscaling

@sbueringer: any suggestion how to get those annotations added automatically to the MachineDeployment?

Do we need --node-group-auto-discovery=clusterapi:clusterName=capi-quickstart or is it enabled by default?

https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling#configuring-node-group-auto-discovery says "you must configure node group auto discovery" but https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling#enabling-autoscaling says "The autoscaler will monitor any MachineSet, MachineDeployment, or MachinePool containing both of these annotations."

When the cluster comes up after following the instructions in this README.md, it has a generated machine deployment with a variable name. I could come up with a kubectl invocation that adds the annotations, but it would be nicer to include that in the cluster configuration.
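
Something like this would do it as a stopgap (a sketch: --all because the MachineDeployment name is generated, and the min/max values are arbitrary):

    # Run against the management cluster, where the MachineDeployment lives.
    $ kubectl annotate machinedeployment --all \
        cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size="1" \
        cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size="3"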

Also, does the cloud provider in use here (Docker, from clusterctl init --infrastructure docker) support scale from zero automatically? https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling#scale-from-zero-support documents some additional annotations, but it is not clear if they are needed.

Member

@sbueringer Mar 28, 2024

@sbueringer: any suggestion how to get those annotations added automatically to the MachineDeployment?

You can patch:

        - class: default-worker
          name: md-0
          replicas: 1

to be this instead:

        - class: default-worker
          name: md-0
          metadata:
            annotations:
              cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
              cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "3"

The result is:

  • the annotations will be set on the MachineDeployment
  • the MachineDeployment webhook will pick up the value of min-size as the initial value for MD.spec.replicas

If you set both replicas and the autoscaler annotations here, the CAPI controller and the autoscaler would both continuously try to write the MD.spec.replicas field.

Do we need --node-group-auto-discovery=clusterapi:clusterName=capi-quickstart or is it enabled by default?

I'm not sure if you have to set --node-group-auto-discovery. In our e2e test we do it, but we also have other tests running in other namespaces. You can give it a try without it.
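
If you do set it, the invocation would look roughly like this (a sketch; --cloud-provider=clusterapi selects the autoscaler's Cluster API provider):

    $ cluster-autoscaler \
        --cloud-provider=clusterapi \
        --node-group-auto-discovery=clusterapi:clusterName=capi-quickstart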

Also, does the cloud provider in use here (Docker, from clusterctl init --infrastructure docker) support scale from zero automatically?

CAPD falls under:

If your Cluster API provider does not have support for scaling from zero, you may still use this feature through the capacity annotations.

So you'll have to set the annotations. Also, in CAPD you can't really define the size of your "Machine" via the DockerMachine spec the way you can in e.g. AWS. But you can check Node.status.capacity on the created Nodes to see what a Node's capacity is. Not really sure where that comes from, though.
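
Concretely, that would mean extending the metadata patch from above with the capacity annotations (a sketch: the annotation names come from the scale-from-zero docs linked above, the values are placeholders):

        - class: default-worker
          name: md-0
          metadata:
            annotations:
              cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
              cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "3"
              # Placeholder capacity values for scale from zero with CAPD.
              capacity.cluster-autoscaler.kubernetes.io/cpu: "2"
              capacity.cluster-autoscaler.kubernetes.io/memory: "4G"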

Contributor Author

Reminder to self: this still needs to be included in the instructions.

@bart0sh
Contributor

bart0sh commented Feb 2, 2024

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 2, 2024
@ndixita
Contributor

ndixita commented Feb 7, 2024

/assign @SergeyKanzhelev

@ndixita
Contributor

ndixita commented Feb 7, 2024

/assign @bart0sh

@bart0sh
Contributor

bart0sh commented Feb 7, 2024

@pohly please address review comments, thanks

@dims
Member

dims commented Mar 5, 2024

@pohly still has 3 comments that need a response

These instructions will be useful for developers who want to run Kubernetes
with DRA or other experimental features enabled in a cluster that supports
autoscaling.
@pohly force-pushed the dra-autoscaler-docs branch from ec6e2b2 to 7c6c17c on March 25, 2024 15:55
@k8s-ci-robot k8s-ci-robot requested a review from towca March 25, 2024 15:59
@sbueringer
Member

sbueringer commented Mar 28, 2024

Answered on the thread. Otherwise lgtm as far as I can tell

Comment thread: test/e2e/dra/README.md
# The control plane won’t be Ready until we install a CNI in the next step.

$ KUBECONFIG=capi-quickstart.kubeconfig kubectl --kubeconfig=./capi-quickstart.kubeconfig \
apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml
Member

You should use kindnetd, it is much more stable (see kubernetes-sigs/cluster-api@d0c495a) and consumes fewer resources.
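
For example, something like this (a sketch: the exact manifest URL is an assumption, CAPI carries a kindnet manifest in its e2e test data):

    $ KUBECONFIG=./capi-quickstart.kubeconfig kubectl apply -f \
        https://raw.githubusercontent.com/kubernetes-sigs/cluster-api/main/test/e2e/data/cni/kindnet/kindnet.yaml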

Member

+1 Good catch!

Member

what can I say, it is my baby :)

Member

@sbueringer Apr 11, 2024

While we're here: thanks for maintaining kind & kindnet. It's a very nice and stable foundation to build our own testing in CAPI on. Saves us so much time & effort.

Member

kind is Ben's baby,

appreciate the compliments ❤️

@pohly pohly changed the title DRA e2e: instructions for setting up cluster with autoscaler support WIP: DRA e2e: instructions for setting up cluster with autoscaler support Apr 29, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 28, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.


In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/test
  • cncf-cla: yes (the PR's author has signed the CNCF CLA)
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • kind/documentation: Categorizes issue or PR as related to documentation.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.
