Write about best-practice high availability and scaling of cert-manager components by wallrj · Pull Request #1330 · cert-manager/website

wallrj · 2023-10-18T11:15:33Z

Preview: https://deploy-preview-1330--cert-manager-website.netlify.app/docs/installation/best-practice/#high-availability

We've added various Helm chart values which allow users to configure cert-manager for HA and scalability,
but we've never documented any recommendations explaining how these settings should be used in production.

My original / ultimate plan was to change some of the Helm chart defaults to include useful default topology constraints,
as first suggested by @ThatsMrTalbot in a cert-manager dev meeting and discussed further in https://kubernetes.slack.com/archives/CDEQJ0Q8M/p1697561450000799

But first I want to document what we think the best practice settings are.
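
For context, a minimal sketch of the kind of Helm values this PR is about. It assumes the chart version in use exposes `replicaCount` and `podDisruptionBudget.enabled` for the controller, webhook and cainjector; the replica counts are illustrative placeholders, not settled recommendations.

```yaml
# Hypothetical values.yaml for the cert-manager Helm chart.
# Replica counts are illustrative; adjust them to the cluster.

# Controller: the replicas use leader election, so extra replicas are standbys.
replicaCount: 2
podDisruptionBudget:
  enabled: true

# Webhook: every replica serves requests, so more replicas add capacity.
webhook:
  replicaCount: 3
  podDisruptionBudget:
    enabled: true

# cainjector: also uses leader election.
cainjector:
  replicaCount: 2
  podDisruptionBudget:
    enabled: true
```

A file like this would typically be passed to `helm upgrade --install` with `--values values.yaml`.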

netlify · 2023-10-18T11:18:36Z

✅ Deploy Preview for cert-manager-website ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | `5121600` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/cert-manager-website/deploys/65328e6c9127e000085e01b3 |
| 😎 Deploy Preview | https://deploy-preview-1330--cert-manager-website.netlify.app |

wallrj · 2023-10-18T13:40:08Z

content/docs/installation/best-practice.md

+  - maxSkew: 1
+    topologyKey: kubernetes.io/hostname
+    whenUnsatisfiable: ScheduleAnyway
+    labelSelector:
+      matchLabels:
+        app.kubernetes.io/instance: cert-manager
+        app.kubernetes.io/component: webhook


I doubt this is actually necessary because the documentation says:

> the scheduler automatically tries to spread the Pods in a ReplicaSet across nodes in a single-zone cluster (to reduce the impact of node failures, see kubernetes.io/hostname). With multiple-zone clusters, this spreading behavior also applies to zones (to reduce the impact of zone failures). This is achieved via SelectorSpreadPriority.

I'm easy here, not having it is less YAML.
Having it means it's explicit.

I think if we can verify that this works by default, we should not tell people to configure this (because we want to keep our docs as simple as possible). Instead, we can just link to the documentation you shared and say that this works correctly by default.

I agree that according to the docs the desired anti-affinity scheduling should happen by default.
I've updated this paragraph.

wallrj · 2023-10-18T13:47:04Z

content/docs/installation/best-practice.md

+so as to reduce the load on the Kubernetes API server.
+
+For example, if the cluster contains very many CertificateRequest resources,
+you will need to increase the memory limit of the controller Pod.


I might say something about the memory optimizations for cainjector e.g.

> Fix cainjector's --namespace flag. Users who want to prevent cainjector from reading all Secrets and Certificates in all namespaces (i.e to prevent excessive memory consumption) can now scope it to a single namespace using the --namespace flag. A cainjector that is only used as part of cert-manager installation only needs access to the cert-manager installation namespace. (#5694, @irbekrm)
>
> -- https://deploy-preview-1330--cert-manager-website.netlify.app/docs/releases/release-notes/release-notes-1.11/

Is it best practice to limit this to only the installation namespace?

I suppose it's not technically "best practice" but I've added it anyway.
See what you think.
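
The two memory-related points in this thread (raising the controller's memory limit and scoping cainjector to a single namespace) could be expressed as Helm values roughly as follows. This is a sketch only. It assumes cert-manager is installed in a namespace called `cert-manager`, that cainjector is not needed for resources in other namespaces, and the memory sizes are placeholders rather than measured values.

```yaml
# Hypothetical values.yaml fragment; sizes are placeholders.

# Controller resources: raise the limit if the controller runs out of
# memory in a cluster with very many CertificateRequest resources.
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi

# cainjector: only watch the installation namespace, so that it does not
# cache Secrets and Certificates from every namespace.
cainjector:
  extraArgs:
  - --namespace=cert-manager
```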

hawksight

Very well written, just a few minor notes to break up the text a little.

For reference, I just checked for the labels on my GKE nodes:

kubernetes.io/hostname=gke-demo-istio-5d390798-5v7i,kubernetes.io/os=linux,node.kubernetes.io/instance-type=e2-medium,topology.gke.io/zone=europe-west1-b,topology.kubernetes.io/region=europe-west1,topology.kubernetes.io/zone=europe-west1-b

So the labels chosen look appropriate, although I have not checked any other managed offering.
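
For anyone who does want to be explicit rather than rely on the scheduler defaults, here is a sketch that spreads the webhook Pods across both zones and nodes using the labels listed above. It assumes the chart exposes `webhook.topologySpreadConstraints` and that the Helm release is named `cert-manager`, which is what the `app.kubernetes.io/instance` label in the diff implies.

```yaml
# Hypothetical values.yaml fragment for explicit webhook spreading.
webhook:
  replicaCount: 3
  topologySpreadConstraints:
  # Prefer an even spread across zones.
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/instance: cert-manager
        app.kubernetes.io/component: webhook
  # Prefer an even spread across nodes.
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/instance: cert-manager
        app.kubernetes.io/component: webhook
```

With `ScheduleAnyway` the constraints are only a preference, so Pods are still scheduled when an even spread is not possible.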

hawksight · 2023-10-18T14:25:43Z

content/docs/installation/best-practice.md

+so as to reduce the load on the Kubernetes API server.
+
+For example, if the cluster contains very many CertificateRequest resources,
+you will need to increase the memory limit of the controller Pod.


Is it best practice to limit this to only the installation namespace?

hawksight · 2023-10-18T14:26:17Z

content/docs/installation/best-practice.md

+  - maxSkew: 1
+    topologyKey: kubernetes.io/hostname
+    whenUnsatisfiable: ScheduleAnyway
+    labelSelector:
+      matchLabels:
+        app.kubernetes.io/instance: cert-manager
+        app.kubernetes.io/component: webhook


I'm easy here, not having it is less YAML.
Having it means it's explicit.

inteon

Thanks @wallrj. This information is very useful for anyone who is serious about installing and configuring cert-manager in production environments.
/approve
/lgtm

jetstack-bot · 2023-10-20T14:27:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hawksight, inteon

Needs approval from an approver in each of these files:

~~OWNERS~~ [inteon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

inteon · 2023-10-20T14:31:14Z

/lgtm

wallrj force-pushed the high-availability-best-practice branch from 7137207 to f417916 on October 18, 2023 12:16

Write about high availability and scaling of cert-manager components

69ec316

Signed-off-by: Richard Wall <richard.wall@venafi.com>

wallrj force-pushed the high-availability-best-practice branch 3 times, most recently from 785560a to 69ec316 on October 18, 2023 13:33

wallrj changed the title ~~WIP: Write about best-practice high availability and scaling of cert-manager components~~ Write about best-practice high availability and scaling of cert-manager components Oct 18, 2023

jetstack-bot removed the `do-not-merge/work-in-progress` label (indicates that a PR should not merge because it is a work in progress) Oct 18, 2023

wallrj commented Oct 18, 2023

wallrj requested review from erikgb and jsoref October 18, 2023 13:49

hawksight approved these changes Oct 18, 2023

jsoref reviewed Oct 18, 2023

wallrj and others added 5 commits October 18, 2023 16:56

Apply suggestions from code review

194d0ca

Thanks @jsoref for correcting these mistakes Co-authored-by: Josh Soref <2119212+jsoref@users.noreply.github.com> Signed-off-by: Richard Wall <wallrj@users.noreply.github.com>

Update content/docs/installation/best-practice.md

cfe144d

Thanks @hawksight. I agree. Co-authored-by: Peter Fiddes <hawksight@users.noreply.github.com> Signed-off-by: Richard Wall <wallrj@users.noreply.github.com>

Remove mention of worker processes

f3bacf3

Signed-off-by: Richard Wall <richard.wall@venafi.com>

Add discrete Helm values examples for each recommendation.

a289f95

Link to Google GKE documentation which talks about webhook disruptions Signed-off-by: Richard Wall <richard.wall@venafi.com>

Explain when and how to limit the memory usage of cainjector

5a82132

Signed-off-by: Richard Wall <richard.wall@venafi.com>

inteon reviewed Oct 19, 2023

content/docs/installation/best-practice.md (outdated, resolved)

inteon reviewed Oct 19, 2023

content/docs/installation/best-practice.md (outdated, resolved)

inteon reviewed Oct 19, 2023

content/docs/installation/best-practice.md (outdated, resolved)

wallrj and others added 5 commits October 20, 2023 14:05

Explain that the default topology spread constraints are probably adequate

5973793

Signed-off-by: Richard Wall <richard.wall@venafi.com>

Further reading and warnings around the use of Pod Disruption Budget

0366cad

Signed-off-by: Richard Wall <richard.wall@venafi.com>

Update content/docs/installation/best-practice.md

a0504de

Co-authored-by: Tim Ramlot <42113979+inteon@users.noreply.github.com> Signed-off-by: Richard Wall <wallrj@users.noreply.github.com>

Update content/docs/installation/best-practice.md

303343a

Co-authored-by: Tim Ramlot <42113979+inteon@users.noreply.github.com> Signed-off-by: Richard Wall <wallrj@users.noreply.github.com>

Fix typo

9b283a0

Signed-off-by: Richard Wall <richard.wall@venafi.com>

wallrj requested a review from inteon October 20, 2023 14:17

Make this paragraph make sense

d05d333

Signed-off-by: Richard Wall <richard.wall@venafi.com>

inteon approved these changes Oct 20, 2023

jetstack-bot added the `lgtm` label (indicates that a PR is ready to be merged) Oct 20, 2023

jetstack-bot added the `approved` label (indicates a PR has been approved by an approver from all required OWNERS files) Oct 20, 2023

Link to The Darker Side of Webhooks blog post

5121600

Signed-off-by: Richard Wall <richard.wall@venafi.com>

jetstack-bot removed the `lgtm` label Oct 20, 2023

jetstack-bot added the `lgtm` label Oct 20, 2023

jetstack-bot merged commit f03bbea into cert-manager:master Oct 20, 2023

wallrj deleted the high-availability-best-practice branch October 20, 2023 14:52

wallrj mentioned this pull request Oct 20, 2023

Explain why and how to isolate the cert-manager workloads #1331

Merged

jetstack-bot mentioned this pull request Dec 7, 2023

[release-next] Fast-forward master into release-next #1357

Merged
