Prevent unmanaged pods in GKE's containerd flavors#18486
Prevent unmanaged pods in GKE's containerd flavors#18486joamaki merged 2 commits intocilium:masterfrom bmcustodio:fix-gke-containerd
Conversation
This comment has been minimized.
This comment has been minimized.
kaworu
left a comment
There was a problem hiding this comment.
Couple of question, but overall LGTM.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
/test-1.23-net-next |
|
/test-1.23-net-next |
|
@nebril can you point to which PR did the backports of this PR for v1.11? Or switch the labels back if this was not backported to 1.11 yet? EDIT: Found it, see the link below. |
|
@bmcustodio Please consider updating the official upgrade guide in future when communicating to users about necessary steps for them to take during upgrade; users are likely to miss notes marked "important" in the release notes unless both we and they are paying close attention. For v1.11.2 I will call it out explicitly in the announcements. |
|
Hmm, actually looking back again at this PR I think that in the announcement I miscommunicated this because I assumed that the advice was specific to GKE users. It looks like all users must take action regarding the taints. Can you confirm? So far this is only part of v1.11.2 so shouldn't have a huge impact but I am concerned about communicating this clearly so we don't end up having to field a large number of questions from confused users about why the taints are different than they were before. |
Yes, this is correct ;) |
The changes that we have been doing to
/etc/defaults/kubeletare reset on node reboots, as is apparently the whole/etcdirectory — which also means that/etc/cni/net.d/05-cilium.confis removed.This would not be a problem if the assumption we made that the node taint we recommend placing on the nodes would come back upon reboots held true, but in practice it doesn't.
Besides this, it seems that containerd will re-instante its CNI configuration file, and it will do so way before Cilium has had the chance to re-run on the node and re-create its CNI configuration, causing pods to be assigned IPs by the default CNI rather than by Cilium in the meantime.
This commit attempts at preventing that from happening by observing that
/home/kubernetes/bin/kubelet(i.e. the actual kubelet binary) is kept between reboots and executed concurrently with containerd by systemd. We leverage on this empirical observation to replace this file kubelet with a wrapper script that, under the required conditions, disables containerd, patches its configuration, removes undesired CNI configuration files, re-enables containerd and becomes the kubelet.Additionally, to prevent situations in which the GKE node is forcibly stopped and re-created from causing unmanaged pods, and building on the observation that the node comes back with the same name and pods are already scheduled there, we change the recommended taint effect from
NoScheduletoNoExecute, to cause any previously scheduled pods to be evicted, preventing them from getting IPs assigned by the default CNI.This should not impact other environments due to the nature ofNoExecute, so we recommend it everywhere.