Skip to content

Write about best-practice high availability and scaling of cert-manager components#1330

Merged
jetstack-bot merged 13 commits intocert-manager:masterfrom
wallrj:high-availability-best-practice
Oct 20, 2023
Merged

Write about best-practice high availability and scaling of cert-manager components#1330
jetstack-bot merged 13 commits intocert-manager:masterfrom
wallrj:high-availability-best-practice

Conversation

@wallrj
Copy link
Copy Markdown
Member

@wallrj wallrj commented Oct 18, 2023

Preview: https://deploy-preview-1330--cert-manager-website.netlify.app/docs/installation/best-practice/#high-availability

We've added various Helm chart values which allow users to configure cert-manager for HA and scalability,
but we've never documented any recommendations explaining how these settings should be used in production.

My original / ultimate plan was to change some of the Helm chart defaults to include useful default topology constraints,
as first suggested by @ThatsMrTalbot in a cert-manager dev meeting and discussed further in https://kubernetes.slack.com/archives/CDEQJ0Q8M/p1697561450000799

But first I want to document what we think the best practice settings are.

@jetstack-bot jetstack-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 18, 2023
@netlify
Copy link
Copy Markdown

netlify bot commented Oct 18, 2023

Deploy Preview for cert-manager-website ready!

Name Link
🔨 Latest commit 5121600
🔍 Latest deploy log https://app.netlify.com/sites/cert-manager-website/deploys/65328e6c9127e000085e01b3
😎 Deploy Preview https://deploy-preview-1330--cert-manager-website.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@wallrj wallrj force-pushed the high-availability-best-practice branch from 7137207 to f417916 Compare October 18, 2023 12:16
Signed-off-by: Richard Wall <richard.wall@venafi.com>
@wallrj wallrj force-pushed the high-availability-best-practice branch 3 times, most recently from 785560a to 69ec316 Compare October 18, 2023 13:33
@wallrj wallrj changed the title WIP: Write about best-practice high availability and scaling of cert-manager components Write about best-practice high availability and scaling of cert-manager components Oct 18, 2023
@jetstack-bot jetstack-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 18, 2023
Comment on lines +100 to +106
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app.kubernetes.io/instance: cert-manager
app.kubernetes.io/component: webhook
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt this is actually necessary because the documentation says:

the scheduler automatically tries to spread the Pods in a ReplicaSet across nodes in a single-zone cluster (to reduce the impact of node failures, see kubernetes.io/hostname). With multiple-zone clusters, this spreading behavior also applies to zones (to reduce the impact of zone failures). This is achieved via SelectorSpreadPriority.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm easy here, not having it is less YAML.
Having it means it's explicit.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we can verify that this works by default, we should not tell people to configure this (because we want to keep our docs as simple as possible). Instead, we can just link to the documentation you shared and say that this works correctly by default.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that according to the docs the desired anti-affinity scheduling should happen by default.
I've updated this paragraph.

so as to reduce the load on the Kubernetes API server.

For example, if the cluster contains very many CertificateRequest resources,
you will need to increase the memory limit of the controller Pod.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might say something about the memory optimizations for cainjector e.g.

Fix cainjector's --namespace flag. Users who want to prevent cainjector from reading all Secrets and Certificates in all namespaces (i.e to prevent excessive memory consumption) can now scope it to a single namespace using the --namespace flag. A cainjector that is only used as part of cert-manager installation only needs access to the cert-manager installation namespace. (#5694, @irbekrm)
-- https://deploy-preview-1330--cert-manager-website.netlify.app/docs/releases/release-notes/release-notes-1.11/

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it best practice to limit this to only the installation namespace?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it's not technically "best practice" but I've added it anyway.
See what you think.

@wallrj wallrj requested review from erikgb and jsoref October 18, 2023 13:49
Copy link
Copy Markdown
Member

@hawksight hawksight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well written, just a few minor notes to break up the text a little.

For reference, I just checked for the labels on my GKE nodes:

kubernetes.io/hostname=gke-demo-istio-5d390798-5v7i,kubernetes.io/os=linux,node.kubernetes.io/instance-type=e2-medium,topology.gke.io/zone=europe-west1-b,topology.kubernetes.io/region=europe-west1,topology.kubernetes.io/zone=europe-west1-b

So the labels chosen looks appropriate, although I have not checked any other managed offering.

so as to reduce the load on the Kubernetes API server.

For example, if the cluster contains very many CertificateRequest resources,
you will need to increase the memory limit of the controller Pod.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it best practice to limit this to only the installation namespace?

Comment on lines +100 to +106
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app.kubernetes.io/instance: cert-manager
app.kubernetes.io/component: webhook
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm easy here, not having it is less YAML.
Having it means it's explicit.

wallrj and others added 5 commits October 18, 2023 16:56
Thanks @jsoref for correcting these mistakes

Co-authored-by: Josh Soref <2119212+jsoref@users.noreply.github.com>
Signed-off-by: Richard Wall <wallrj@users.noreply.github.com>
Thanks @hawksight. I agree.

Co-authored-by: Peter Fiddes <hawksight@users.noreply.github.com>
Signed-off-by: Richard Wall <wallrj@users.noreply.github.com>
Signed-off-by: Richard Wall <richard.wall@venafi.com>
Link to Google GKE documentation which talks about webhook disruptions

Signed-off-by: Richard Wall <richard.wall@venafi.com>
Signed-off-by: Richard Wall <richard.wall@venafi.com>
wallrj and others added 5 commits October 20, 2023 14:05
…quate

Signed-off-by: Richard Wall <richard.wall@venafi.com>
Signed-off-by: Richard Wall <richard.wall@venafi.com>
Co-authored-by: Tim Ramlot <42113979+inteon@users.noreply.github.com>
Signed-off-by: Richard Wall <wallrj@users.noreply.github.com>
Co-authored-by: Tim Ramlot <42113979+inteon@users.noreply.github.com>
Signed-off-by: Richard Wall <wallrj@users.noreply.github.com>
Signed-off-by: Richard Wall <richard.wall@venafi.com>
@wallrj wallrj requested a review from inteon October 20, 2023 14:17
Signed-off-by: Richard Wall <richard.wall@venafi.com>
Copy link
Copy Markdown
Member

@inteon inteon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wallrj. This information is very useful for anyone who is serious about installing and configuring cert-manager in production environments.
/approve
/lgtm

@jetstack-bot jetstack-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 20, 2023
@jetstack-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hawksight, inteon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot jetstack-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 20, 2023
Signed-off-by: Richard Wall <richard.wall@venafi.com>
@jetstack-bot jetstack-bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 20, 2023
@inteon
Copy link
Copy Markdown
Member

inteon commented Oct 20, 2023

/lgtm

@jetstack-bot jetstack-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 20, 2023
@jetstack-bot jetstack-bot merged commit f03bbea into cert-manager:master Oct 20, 2023
@wallrj wallrj deleted the high-availability-best-practice branch October 20, 2023 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants