CNI: keep cni config on shutdown; taint instead, queue deletions #23486

borkmann merged 5 commits into cilium:master
Conversation
35b870d to 881e0bb (force-pushed)
/test

Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed.
Test failures are either known flakes, or unlikely to be caused by this:
So, rebasing and retrying to see if that helps.
881e0bb to fa124d7 (force-pushed)
Looks mostly good to me, a few minor comments to address.
If I may share an opinion: I appreciate the PR description because it made reviewing the code much easier; however, I'd hate to lose all that context because it isn't somewhere in the commit msg. What do you think of putting it somewhere in the commit(s)? The reason is that when bisecting, it's much easier to read all the context related to a change from the commit msg, rather than think "hmm, I wonder if there's more to this" and have to locate the PR behind it.
What I've seen people do (and what I try to do as well) is paste the "important" commit msgs in the PR description as well, so people sort of have a cover letter to read before diving in, as you did in your PR, while also preserving the history in the commit(s) themselves.
On another note, for the release note: I'd suggest stripping the extra newlines as it's not going to format quite as nicely in the end. Something like this would be easier to read:
This is sentence 1 of the release note.
This is sentence 2 of the same release note.
And so on...
fa124d7 to e64a1a1 (force-pushed)
sayboras left a comment:
This PR is not only interesting but also fun to read; love the comment in code 👍.
LGTM ✔️
@christarazi updated the logging based on your suggestions, thanks for the review.
/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed.
Jenkins job failure is because of a missed tail call!? This PR has nothing to do with BPF...
/mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next

👍 created #24514
/test-1.26-net-next
It is sad that we don't have a better-defined process for the Pod lifecycle in k8s, so we don't have to implement these things :(
* New unikorn release with cilium taints removed from `clusteropenstack` helm application as this is now provided via the cilium operator as of 1.14.0 - cilium/cilium#23486
* Updates a bunch of other packages with a new preview application bundle for both clusters and control plane
* Some minor doc corrections

* New unikorn release with a note on deprecating and removing cilium taints from `clusteropenstack` helm application as this is now provided via the cilium operator as of 1.14.0 - cilium/cilium#23486. We can't do this yet as it would break any apps pre cp-app-bundle 1.2.0
* Updates a bunch of other packages with a new preview application bundle for both clusters and control plane
* Some minor doc corrections

* New unikorn release with a note on deprecating and removing cilium taints from `clusteropenstack` helm application as this is now provided via the cilium operator as of 1.14.0 - cilium/cilium#23486. We can't do this yet as it would break any apps pre kubernetes-cluster-1.4.0
* Updates a bunch of other packages with a new preview application bundle for both clusters and control plane
* Some minor doc corrections
Forgive me if there's an obvious answer to this question, but, out of curiosity, why a …
That is a fine idea. I initially did NoExecute as well to catch pods that were scheduled but not yet picked up by the Kubelet, but I agree that it's too drastic an effect.
What are your thoughts on updating the behavior to actually apply a NoExecute? Or could this be parameterized similar to how we are able to set the …
Actually, I got this backwards; I set a NoSchedule taint and avoided a NoExecute taint. The reason is that running pods do not need to be evicted when Cilium goes down, as they are running and should not be disrupted. |
When Cilium goes down, the running pods lose connectivity. Shouldn't they be evicted so that they can be rescheduled onto other nodes where Cilium is running? We encountered this scenario recently and experienced some service outages because the pods were not rescheduled elsewhere.
Generally, that shouldn't happen, with the exception of flows being sent through the userspace L7 proxies. Do you suspect a Cilium bug? |
Oh, this is not what we observed (or perhaps what we observed were failed network calls sent through the userspace L7 proxies). We saw several reported failed connection attempts for the duration that the nodes had the …
This change is made up of several commits, which build on each other. It fixes a wart with how CNI, Containerd, and Dockershim interact, and prevents some possible outage scenarios.
Ultimately, I'd like to make much of this less necessary with an improved CNI STATUS verb, but that will take some time to roll out across the ecosystem.
Asynchronous CNI delete
The CNI plugin now queues DEL for later submission, rather than failing. It tries to connect to the agent socket, and if that fails, it writes a file in `/run/cilium` with the pod's details. When the agent starts up, it reads that queue and processes all pending deletes.

This fixes #22067, where an end-user found themselves with an unrecoverable cluster. Cilium got descheduled, but their node was at its pod limit. They couldn't start any pods, and they couldn't delete any pods without Cilium running. Not good.
Taint the node when Cilium is shut down
On nodes where Cilium is scheduled, if it is not running, taint the node with NoSchedule. This will prevent new pods from being scheduled there. This has two goals:
Preserve CNI configuration
Lastly, we no longer delete the CNI configuration file when stopping the agent. This has several important effects:
Fixes: #22067.
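For illustration, the kind of CNI network configuration list that is now preserved across agent restarts looks roughly like this (a hedged sketch; the exact fields and `cniVersion` vary by Cilium version and installation options):

```json
{
  "cniVersion": "0.3.1",
  "name": "cilium",
  "plugins": [
    {
      "type": "cilium-cni",
      "enable-debug": false
    }
  ]
}
```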