What happened?
We occasionally see the kubelet never obtain its server certificate, even though serverTLSBootstrap: true is set.
We have an auto-approver for the server certificates and detect this issue because we wait for the certificate to appear in /var/lib/kubelet/pki/kubelet-server-current.pem. When we hit the issue, the server CSR is not even created.
It started happening after we upgraded to v1.32, so we believe it is caused by a change introduced in v1.32.
Restarting the kubelet does not help. It happily continues without the server certificate, even though it is configured to bootstrap it.
In healthy clusters, we see this:
$ kubectl get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                           REQUESTEDDURATION   CONDITION
csr-8dql4   29s   kubernetes.io/kubelet-serving                 system:node:lennart-kubelet-debug   <none>              Pending
csr-r8ztr   17s   kubernetes.io/kubelet-serving                 system:node:lennart-kubelet-debug   <none>              Pending
csr-xtwpf   30s   kubernetes.io/kube-apiserver-client-kubelet   system:node:lennart-kubelet-debug   <none>              Approved,Issued
In faulty clusters, there is just the client CSR.
We were not sure why two CSRs are created for the server either, but it works when we approve the newer of them.
Edit: The two server CSRs appear because kubeadm restarts the kubelet as part of initialization. This is how the kubelet picks up the client certificate. After the restart, the kubelet does not "remember" its old server CSR, so it creates a new one. This is not an issue.
Here is the kubelet config from an affected node: kubelet-config.txt
What did you expect to happen?
When configured, the kubelet should always create the server CSR.
How can we reproduce it (as minimally and precisely as possible)?
Unfortunately we do not have any guaranteed reproduction steps.
In general, this is what is needed to trigger the issue:
- Kubelet is configured to rotate certificates (default) and to use a server certificate.
- Node status is stable, i.e. there are no changes of any kind that would cause the kubelet to update the status.
- Kubelet is restarted and the server certificate does not exist (e.g. because it was removed or it wasn't issued yet).
In this specific case you should be able to see that the CSR is not created:
$ kubectl get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                           REQUESTEDDURATION   CONDITION
csr-2mq8h   22s   kubernetes.io/kube-apiserver-client-kubelet   system:node:lennart-kubelet-debug   <none>              Approved,Issued
$ kubectl get nodes
NAME                    STATUS   ROLES           AGE   VERSION
lennart-kubelet-debug   Ready    control-plane   28s   v1.32.1
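If it helps anyone trying to confirm they are affected, the check above can be automated. The following is only a rough sketch that parses the plain-text output of `kubectl get csr` as shown in this report; the function name `hasServingCSR` is made up for illustration, and the column order is assumed to match the default output above.

```go
package main

import (
	"fmt"
	"strings"
)

// Signer name used by kubelet serving CSRs, as seen in the output above.
const servingSigner = "kubernetes.io/kubelet-serving"

// hasServingCSR (hypothetical helper) reports whether the plain-text output of
// `kubectl get csr` contains a kubelet-serving CSR requested by the given node.
// Assumed column order: NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION.
func hasServingCSR(output, nodeName string) bool {
	requestor := "system:node:" + nodeName
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 4 {
			continue // blank or truncated line
		}
		if fields[2] == servingSigner && fields[3] == requestor {
			return true
		}
	}
	return false
}

func main() {
	// Sample taken from the faulty cluster in this report.
	faulty := `NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-2mq8h 22s kubernetes.io/kube-apiserver-client-kubelet system:node:lennart-kubelet-debug <none> Approved,Issued`

	if !hasServingCSR(faulty, "lennart-kubelet-debug") {
		fmt.Println("no kubelet-serving CSR found: node may be affected")
	}
}
```

In practice the input would come from running `kubectl get csr` some time after the kubelet has restarted, since the serving CSR is not created instantly on healthy nodes either.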
Anything else we need to know?
I believe the culprit is #128640. Not the randomization of the interval, but the change in how the zero case for the lastStatusReportTime is handled.
Before the PR, a zero lastStatusReportTime was treated as expired and triggered an update. After the PR, it is treated as not expired, so no update is triggered.
This means that patchNodeStatus is never called, and consequently neither is setLastObservedNodeAddresses.
The latter initializes the kubelet's internal node address state (not the node status).
This internal state is what the server certificate manager relies on to create the CSR. As long as it is nil, i.e. as long as the node status hasn't been patched, the certificate manager cannot make progress.
Kubernetes version
$ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.32.1
Cloud provider
Happens both with and without an external cloud provider.