
Kubelet serving CSR never created #130001

@lentzi90

What happened?

Occasionally, the kubelet never obtains its server certificate even though serverTLSBootstrap: true is set.
We run an auto-approver for the server certificates and detect the problem because we wait for the certificate to appear at /var/lib/kubelet/pki/kubelet-server-current.pem. When we hit it, not even the CSR is created.
It started happening after we upgraded to v1.32, so we suspect a change introduced in v1.32.
Restarting the kubelet does not help. It keeps running without the server certificate, even though it is configured to bootstrap one.

In healthy clusters, we see this:

$ kubectl get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                           REQUESTEDDURATION   CONDITION
csr-8dql4   29s   kubernetes.io/kubelet-serving                 system:node:lennart-kubelet-debug   <none>              Pending
csr-r8ztr   17s   kubernetes.io/kubelet-serving                 system:node:lennart-kubelet-debug   <none>              Pending
csr-xtwpf   30s   kubernetes.io/kube-apiserver-client-kubelet   system:node:lennart-kubelet-debug   <none>              Approved,Issued

In faulty clusters, there is just the client CSR.
We were not sure why two server CSRs are created either, but approving the newer one works.

Edit: The two server CSRs appear because kubeadm restarts the kubelet as part of initialization. This is how the kubelet picks up the client certificate. After the restart, the kubelet does not "remember" its old server CSR, so it creates a new one. This is not an issue.

Here is the kubelet config from an affected node: kubelet-config.txt

What did you expect to happen?

When configured, the kubelet should always create the server CSR.

How can we reproduce it (as minimally and precisely as possible)?

Unfortunately we do not have any guaranteed reproduction steps.
In general, this is what is needed to trigger the issue:

  1. The kubelet is configured to rotate certificates (default) and to use a server certificate.
  2. The node status is stable, i.e. there are no changes of any kind that would cause the kubelet to update the status.
  3. The kubelet is restarted while the server certificate does not exist (e.g. because it was removed or was not issued yet).

In this specific case you should be able to see that the CSR is not created:

$ kubectl get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                           REQUESTEDDURATION   CONDITION
csr-2mq8h   22s   kubernetes.io/kube-apiserver-client-kubelet   system:node:lennart-kubelet-debug   <none>              Approved,Issued
$ kubectl get nodes
NAME                    STATUS   ROLES           AGE   VERSION
lennart-kubelet-debug   Ready    control-plane   28s   v1.32.1
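The check above can be automated. The following is a minimal sketch, assuming the default table output of `kubectl get csr` shown in this issue; the function name `hasServingCSR` is hypothetical, and parsing the human-readable table is fragile compared to querying the API directly.

```go
package main

import (
	"fmt"
	"strings"
)

// servingSigner is the signer name shown in the SIGNERNAME column above.
const servingSigner = "kubernetes.io/kubelet-serving"

// hasServingCSR (hypothetical helper) reports whether the table printed by
// `kubectl get csr` contains a kubelet-serving CSR requested by the node.
// It assumes the column order NAME, AGE, SIGNERNAME, REQUESTOR, ...
func hasServingCSR(table, nodeName string) bool {
	requestor := "system:node:" + nodeName
	for _, line := range strings.Split(table, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 4 {
			continue // blank or truncated line
		}
		if fields[2] == servingSigner && fields[3] == requestor {
			return true
		}
	}
	return false
}

func main() {
	// Output copied from the faulty cluster in this issue.
	faulty := `NAME        AGE   SIGNERNAME                                    REQUESTOR                           REQUESTEDDURATION   CONDITION
csr-2mq8h   22s   kubernetes.io/kube-apiserver-client-kubelet   system:node:lennart-kubelet-debug   <none>              Approved,Issued`

	fmt.Println("serving CSR present:", hasServingCSR(faulty, "lennart-kubelet-debug"))
}
```

Run against the healthy cluster output from earlier in this issue, the same helper would return true because of the two `kubernetes.io/kubelet-serving` rows.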

Anything else we need to know?

I believe the culprit is #128640. Not the randomization of the interval, but the change in how the zero value of lastStatusReportTime is handled.
A zero value used to be treated as expired and triggered an update. After the PR, it is treated the opposite way: as not expired, so no update is triggered.
As a result, patchNodeStatus is never called, and consequently neither is setLastObservedNodeAddresses.
The latter initializes the kubelet's internal node address state (not the node status).
The server certificate manager relies on this internal state to create the CSR. As long as it is nil, i.e. as long as the node status has not been patched, the certificate manager cannot progress.
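To make the chain of events concrete, here is a simplified model of the reasoning above. It is a sketch, not the kubelet's actual code: the type `kubeletModel`, its methods, the 5 minute interval, and the example address are all hypothetical, and only the field names echo the identifiers mentioned in this issue.

```go
package main

import (
	"fmt"
	"time"
)

// kubeletModel is a hypothetical, simplified stand-in for the kubelet state
// discussed above. It is not the real kubelet type.
type kubeletModel struct {
	lastStatusReportTime  time.Time // zero value until the first status report
	lastObservedAddresses []string  // nil until the node status has been patched
}

// reportDue models the two ways of handling a zero lastStatusReportTime:
// zeroMeansExpired=true is the older behavior, false the behavior after
// #128640, both as described in this issue.
func (k *kubeletModel) reportDue(now time.Time, interval time.Duration, zeroMeansExpired bool) bool {
	if k.lastStatusReportTime.IsZero() {
		return zeroMeansExpired
	}
	return now.Sub(k.lastStatusReportTime) >= interval
}

// sync patches the status only if something changed or a report is due.
// Patching is the step that records the observed node addresses.
func (k *kubeletModel) sync(now time.Time, changed, due bool, addrs []string) {
	if !changed && !due {
		return
	}
	k.lastStatusReportTime = now
	k.lastObservedAddresses = addrs
}

// canRequestServingCert models the precondition described above:
// without observed addresses, no server CSR can be created.
func (k *kubeletModel) canRequestServingCert() bool {
	return k.lastObservedAddresses != nil
}

// simulateRestart runs one sync on a freshly restarted kubelet whose node
// status is stable (changed=false) and reports whether a CSR is possible.
func simulateRestart(zeroMeansExpired bool) bool {
	k := &kubeletModel{}
	now := time.Now()
	due := k.reportDue(now, 5*time.Minute, zeroMeansExpired) // interval is an assumption
	k.sync(now, false, due, []string{"192.0.2.10"})
	return k.canRequestServingCert()
}

func main() {
	fmt.Println("zero treated as expired:", simulateRestart(true))
	fmt.Println("zero treated as not expired:", simulateRestart(false))
}
```

In this model the first case ends with observed addresses set, so a CSR could be requested, while the second case never patches the status and stays stuck, which matches the symptom reported here.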

Kubernetes version

$ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.32.1

Cloud provider

Happens both with and without an external cloud provider.



Labels

kind/bug: Categorizes issue or PR as related to a bug.
sig/auth: Categorizes an issue or PR as relevant to SIG Auth.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.


    Projects

    Status
    Closed / Done
