GKE: cm-acme-http-solver triggers no.scale.down.node.pod.not.backed.by.controller due to lack of PodDisruptionBudget / safe-to-evict annotation #5267
Description
Describe the bug:
We have some very unhappy ACME solver pods (I believe this is because we enabled a VirtualServer for nginxinc/kubernetes-ingress, but I'm not entirely certain; the precise reason for the unhappy solver pods is beyond the scope of this issue).
Ideally, there should be no warnings like this in our logs, and the GKE node rebalancing system should be able to do what it wants to do (which I believe is scale down a node and potentially recreate it elsewhere):
GKE event:

```json
{
  "jsonPayload": {
    "noDecisionStatus": {
      "noScaleDown": {
        "nodes": [
          {
            "reason": {
              "parameters": [
                "cm-acme-http-solver-zfbxg"
              ],
              "messageId": "no.scale.down.node.pod.not.backed.by.controller"
            }
          }
        ]
      }
    }
  },
  "resource": {
    "type": "k8s_cluster"
  }
}
```

Expected behaviour:
cert-manager should create solver pods that the cluster autoscaler is willing to evict. Per the Cluster Autoscaler FAQ, that means either:
- set the annotation `"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"` on the Pod, or
- back the Pod with a controller (ReplicationController, DaemonSet, Job, StatefulSet, or ReplicaSet).

See the Kubernetes Cluster Autoscaler FAQ and the GKE cluster-autoscaler-visibility documentation for "no.scale.down.node.pod.not.backed.by.controller".
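For illustration, a minimal sketch of the first option, i.e. what a solver pod would look like if cert-manager set the annotation (the pod name is taken from the event above; the container name and image are assumptions based on cert-manager v1.7.2, and cert-manager does not currently set this annotation):

```yaml
# Hypothetical sketch only: shows where the safe-to-evict annotation
# would need to appear on the solver pod for the autoscaler to allow
# scale-down. Container name/image are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: cm-acme-http-solver-zfbxg
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
spec:
  containers:
    - name: acmesolver
      image: quay.io/jetstack/cert-manager-acmesolver:v1.7.2
```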
Steps to reproduce the bug:
- Use a GKE cluster with cert-manager
- Create a Certificate resource using an Ingress object
- Cause the solver to be unable to complete future reissuances (for simplicity, you can point the public DNS entry away from the ingress)
- Visit the GKE cluster view and see: "Pod is blocking scale down because it's not backed by a controller"
Anything else we need to know?:
Creating a PDB at the Helm chart level (or similar) is fairly intractable because PDBs are namespaced objects and solver pods can be created in any namespace (in response to demand in that namespace); see #5267 (comment).
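To illustrate why this is awkward: a PDB like the sketch below would have to be duplicated into every namespace where a solver pod might appear. The label selector is an assumption (solver pods carry an `acme.cert-manager.io/http01-solver` label in recent cert-manager versions); this is not a recommended fix, just a sketch of the shape of the workaround.

```yaml
# Hypothetical per-namespace PDB sketch; selector label is an assumption.
# Because PDBs are namespaced, one of these is needed in every namespace
# where cert-manager may create a solver pod.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cm-acme-http-solver
  namespace: my-app            # must be repeated per namespace
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      acme.cert-manager.io/http01-solver: "true"
```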
Workarounds: #5267 (comment)
Environment details:
- Kubernetes version: control plane migrated to 1.23.5-gke.1503 (previously 1.22.6-gke.300); the node pool is in the process of updating to match the control plane version. The event would have been from the older version.
- Cloud-provider/provisioner: Google
- cert-manager version: 1.7.2
- Install method: helm
/kind bug