
GKE: cm-acme-http-solver triggers no.scale.down.node.pod.not.backed.by.controller due to lack of PodDisruptionBudget / safe-to-evict annotation  #5267

@jsoref


Describe the bug:

We have some very unhappy ACME solver pods that never finish solving. I believe it's because we enabled a virtual-server for nginxinc/kubernetes-ingress, but I'm not entirely certain, and the precise reason for the unhappy solver pods is beyond the scope of this issue.

Ideally, there should not be warnings like this in our logs, and the GKE node rebalancing system should be able to do what it wants to do (which I believe is scale down a node, and potentially recreate it elsewhere):

GKE event
{
  "jsonPayload": {
    "noDecisionStatus": {
      "noScaleDown": {
        "nodes": [
          {
            "reason": {
              "parameters": [
                "cm-acme-http-solver-zfbxg"
              ],
              "messageId": "no.scale.down.node.pod.not.backed.by.controller"
            }
          }
        ]
      }
    }
  },
  "resource": {
    "type": "k8s_cluster"
  }
}
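
For anyone triaging similar log entries, here is a minimal sketch of pulling the blocking pod names out of an event shaped like the one above. The function name `blocking_pods` is made up for illustration, and the sketch assumes that for this messageId each entry in "parameters" is a pod name, as it is in the event shown.

```python
import json

# The event from this issue, trimmed to the fields the sketch reads.
EVENT = json.loads("""
{
  "jsonPayload": {
    "noDecisionStatus": {
      "noScaleDown": {
        "nodes": [
          {
            "reason": {
              "parameters": ["cm-acme-http-solver-zfbxg"],
              "messageId": "no.scale.down.node.pod.not.backed.by.controller"
            }
          }
        ]
      }
    }
  },
  "resource": {"type": "k8s_cluster"}
}
""")


def blocking_pods(event: dict) -> list:
    """Return (messageId, parameter) pairs from a noScaleDown event.

    Hypothetical helper: missing keys are treated as "nothing to report"
    rather than as errors.
    """
    nodes = (
        event.get("jsonPayload", {})
        .get("noDecisionStatus", {})
        .get("noScaleDown", {})
        .get("nodes", [])
    )
    pairs = []
    for node in nodes:
        reason = node.get("reason", {})
        message_id = reason.get("messageId", "")
        for param in reason.get("parameters", []):
            pairs.append((message_id, param))
    return pairs


for message_id, pod in blocking_pods(EVENT):
    print(f"{pod}: {message_id}")
```

Run against the event above, this prints the solver pod name followed by the messageId.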

Expected behaviour:

cert-manager should create pods that the cluster autoscaler is able to evict. Either of the following would achieve that:

  • Set the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" on the Pod.
  • Back the Pod with a controller (ReplicationController, DaemonSet, Job, StatefulSet, or ReplicaSet).

See Kubernetes Cluster Autoscaler FAQ and GKE cluster-autoscaler-visibility "no.scale.down.node.pod.not.backed.by.controller"
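
To make the two options concrete, here is a rough sketch of the check as I understand it from the points above. It is not the autoscaler's actual code, which applies more rules than this; the helper names and the example pods are hypothetical, and pods are modelled as plain dicts in the shape of their manifests.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"

# Controller kinds taken from the list in this issue; the real autoscaler
# may recognise a different set.
CONTROLLER_KINDS = {
    "ReplicationController",
    "DaemonSet",
    "Job",
    "StatefulSet",
    "ReplicaSet",
}


def has_known_controller(pod: dict) -> bool:
    """True if an ownerReference marks one of the listed kinds as controller."""
    refs = pod.get("metadata", {}).get("ownerReferences") or []
    return any(
        ref.get("controller") is True and ref.get("kind") in CONTROLLER_KINDS
        for ref in refs
    )


def is_marked_safe_to_evict(pod: dict) -> bool:
    """True if the pod carries the safe-to-evict annotation set to "true"."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    return annotations.get(SAFE_TO_EVICT) == "true"


def blocks_scale_down(pod: dict) -> bool:
    """Simplified: a pod blocks scale down only if neither option applies."""
    return not (has_known_controller(pod) or is_marked_safe_to_evict(pod))


# Hypothetical pods for illustration only.
bare_pod = {"metadata": {"name": "example-bare-pod"}}
annotated_pod = {
    "metadata": {
        "name": "example-annotated-pod",
        "annotations": {SAFE_TO_EVICT: "true"},
    }
}
replicaset_pod = {
    "metadata": {
        "name": "example-replicaset-pod",
        "ownerReferences": [{"kind": "ReplicaSet", "controller": True}],
    }
}

for pod in (bare_pod, annotated_pod, replicaset_pod):
    print(pod["metadata"]["name"], blocks_scale_down(pod))
```

Under these assumptions only the bare pod reports as blocking; either the annotation or a listed controller is enough to clear it.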

Steps to reproduce the bug:

  1. Use a GKE cluster with cert-manager installed
  2. Create a certificate resource using an Ingress object
  3. Cause the solver to be unable to solve future reissuances (for simplicity, you can move the public dns entry to not point to the ingress)
  4. Visit the GKE cluster view and see:

    Pod is blocking scale down because it’s not backed by a controller

Anything else we need to know?:

Creating a PodDisruptionBudget (PDB) at the Helm chart level or similar is fairly intractable: PDBs are namespaced objects, and solvers can be created in any namespace in response to demand in that namespace. See #5267 (comment).

Workarounds: #5267 (comment)

Environment details:

  • Kubernetes version: the control plane was migrated from 1.22.6-gke.300 to 1.23.5-gke.1503, and the node pool is in the process of updating to match the control plane version. The notice would have come from the older version.
  • Cloud-provider/provisioner: Google
  • cert-manager version: 1.7.2
  • Install method: helm

/kind bug


Labels

  • good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/backlog: Higher priority than priority/awaiting-more-evidence.
