CNI: keep cni config on shutdown; taint instead, queue deletions#23486

Merged
borkmann merged 5 commits into cilium:master from squeed:cni-down-agent
Mar 22, 2023
Conversation

@squeed
Contributor

@squeed squeed commented Jan 31, 2023

This change is made up of several commits, which build on each other. It fixes a wart with how CNI, Containerd, and Dockershim interact, and prevents some possible outage scenarios.

Ultimately, I'd like to make much of this less necessary with an improved CNI STATUS verb, but that will take some time to roll out across the ecosystem.

Asynchronous CNI delete

The CNI plugin now queues DEL for later submission, rather than failing. It tries to connect to the agent socket, and if that fails, it will write a file in /run/cilium with the pod's details. When the agent starts up, it will read that queue and process all pending deletes.

This fixes #22067, where an end-user found themselves with an unrecoverable cluster. Cilium got descheduled but their node was at its pod limit. They couldn't start any pods, and they couldn't delete any pods without Cilium running. Not good.

Taint the node when Cilium is shut down

On nodes where Cilium is scheduled, if it is not running, taint the node with NoSchedule. This will prevent new pods from being scheduled there. This has two goals:

  1. Minimize ignorable errors from pods failing to start because Cilium can't handle a CNI ADD
  2. Minimize pods being started with a non-Cilium network provider
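Concretely, the taint ends up on the Node object roughly like this. The node name is illustrative; the taint key shown is Cilium's well-known agent-not-ready key, which (as discussed later in this thread) is configurable via `agent-not-ready-taint-key`:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-1            # illustrative node name
spec:
  taints:
  - key: node.cilium.io/agent-not-ready
    effect: NoSchedule      # block new pods; do not evict running ones
```

`NoSchedule` (rather than `NoExecute`) means already-running pods are left undisturbed while the agent is down; only new scheduling is blocked.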

Preserve CNI configuration

Lastly, we no longer delete the CNI configuration file when stopping the agent. This has several important effects:

  • the node no longer goes NotReady, so it won't be removed from cloud LB backends just because cilium is being upgraded. This prevents unneeded churn.
  • CNI DEL will still succeed, so pods can always be cleaned up
  • No chance of pods being started with a different CNI provider

The Cilium operator now taints nodes where Cilium is scheduled to run but is not running.
This prevents pods from being scheduled on nodes without Cilium.
The CNI configuration file is no longer removed on agent shutdown. 
This means that pod deletion will always succeed; previously it would fail if Cilium was down for an upgrade.
This should help prevent nodes accidentally entering an unmanageable state.
It also means that nodes are not removed from cloud LoadBalancer backends during Cilium upgrades.
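For context, the file being preserved is the CNI network configuration that the container runtime reads from /etc/cni/net.d. A simplified, illustrative example of what such a conflist looks like (the exact contents Cilium writes may differ):

```json
{
  "cniVersion": "0.3.1",
  "name": "cilium",
  "plugins": [
    { "type": "cilium-cni" }
  ]
}
```

As long as this file is present, the kubelet considers the network ready and the runtime can still invoke `cilium-cni` for DEL, even while the agent is restarting.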

Fixes: #22067.

@squeed squeed requested review from a team as code owners January 31, 2023 12:48
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 31, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jan 31, 2023
@squeed squeed added dont-merge/needs-release-note release-note/major This PR introduces major new functionality to Cilium. and removed kind/community-contribution This was a contribution made by a community member. labels Jan 31, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 31, 2023
@squeed
Contributor Author

squeed commented Jan 31, 2023

/test

Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed:


Test Name

K8sDatapathConfig MonitorAggregation Checks that monitor aggregation restricts notifications

Failure Output

FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.24-kernel-5.4 so I can create one.

@squeed squeed added area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. area/agent Cilium agent related. and removed dont-merge/needs-release-note labels Jan 31, 2023
@squeed
Contributor Author

squeed commented Feb 1, 2023

Test failures are either known flakes, or unlikely to be caused by this:

  • l4lb XDP doesn't use CNI at all
  • TestGetIdentity/Duplicated_identity is a known flake
  • MonitorAggregation didn't get all ICMP events. Again, not something this could do.

So, rebasing and retrying to see if that helps.

Member

@christarazi christarazi left a comment

Looks mostly good to me, a few minor comments to address.

If I may share an opinion: I appreciate the PR description because it made reviewing the code much easier, however I'd hate to lose all that context because it isn't somewhere in the commit msg. What do you think of putting it somewhere in the commit(s)? The reason is because when bisecting, it's much easier to read all the context related to a change from the commit msg, rather than think "hmm, I wonder if there's more to this" and have to locate the PR behind it.

What I've seen people do (and what I try to do as well) is paste the "important" commit msgs in the PR description as well, so people sort of have a cover letter to read before diving in, as you did in your PR, while also preserving the history in the commit(s) themselves.

On another note, for the release note: I'd suggest stripping the extra newlines as it's not going to format quite as nicely in the end. Something like this would be easier to read:

This is sentence 1 of the release note.
This is sentence 2 of the same release note.
And so on...

Member

@sayboras sayboras left a comment

This PR is not only interesting but also fun to read, love the comment in code 👍.

LGTM ✔️

@squeed
Contributor Author

squeed commented Mar 21, 2023

@christarazi updated the logging based on your suggestions, thanks for the review.

@christarazi
Member

christarazi commented Mar 21, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:


Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

@squeed
Contributor Author

squeed commented Mar 22, 2023

The Jenkins job failure is because of a missed tail call!? This PR has nothing to do with BPF...

time="2023-03-21T21:17:36Z" level=debug msg="running local command: kubectl exec -n kube-system cilium-dj8v9 -- cilium metrics list -o json | jq '.[] | select( .name == \"cilium_drop_count_total\" and .labels.reason == \"Missed tail call\" ).value'"
cmd: "kubectl exec -n kube-system cilium-dj8v9 -- cilium metrics list -o json | jq '.[] | select( .name == \"cilium_drop_count_total\" and .labels.reason == \"Missed tail call\" ).value'" exitCode: 0 duration: 168.910676ms stdout:
1

@squeed
Contributor Author

squeed commented Mar 22, 2023

/mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next

👍 created #24514

@squeed
Contributor Author

squeed commented Mar 22, 2023

/test-1.26-net-next

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 22, 2023
@borkmann borkmann merged commit 98b2967 into cilium:master Mar 22, 2023
@squeed squeed changed the title CNI: no longer remove cni config on shutdown; taint instead. CNI: keep cni config on shutdown; taint instead, queue deletions Apr 28, 2023
@aojea
Contributor

aojea commented Aug 3, 2023

it is sad we don't have a better-defined process for the Pod lifecycle in k8s, so we don't have to implement these things :(

drew-viles added a commit to eschercloudai/unikorn that referenced this pull request Sep 5, 2023
* New unikorn release with cilium taints removed from `clusteropenstack` helm application as this is now provided via the cilium operator as of 1.14.0 - cilium/cilium#23486
* Updates a bunch of other packages with a new preview application bundle for both clusters and control plane
* Some minor doc corrections
drew-viles added a commit to eschercloudai/unikorn that referenced this pull request Sep 5, 2023
* New unikorn release with a note on deprecating and removing cilium taints from `clusteropenstack` helm application as this is now provided via the cilium operator as of 1.14.0 - cilium/cilium#23486
  We can't do this yet as it would break any apps pre cp-app-bundle 1.2.0
* Updates a bunch of other packages with a new preview application bundle for both clusters and control plane
* Some minor doc corrections
@robbie-demuth

Forgive me if there's an obvious answer to this question, but, out of curiosity, why a NoSchedule taint as opposed to a NoExecute taint? If Cilium isn't running, pods already scheduled to the node will presumably run into issues. Why not apply a NoExecute taint so that they're evicted and rescheduled elsewhere? A NoExecute taint presumably shouldn't evict Cilium itself, since it's a DaemonSet pod.

@squeed
Contributor Author

squeed commented Feb 15, 2025

That is a fine idea. I initially did NoExecute as well to catch pods that were scheduled but not yet picked up by the Kubelet, but I agree that it's too drastic an effect.

@bmendoza820

That is a fine idea. I initially did NoExecute as well to catch pods that were scheduled but not yet picked up by the Kubelet, but I agree that it's too drastic an effect.

What are your thoughts on updating the behavior to actually apply a NoExecute? Or could this be parameterized similar to how we are able to set the agent-not-ready-taint-key?

@squeed
Contributor Author

squeed commented Feb 19, 2025

Actually, I got this backwards; I set a NoSchedule taint and avoided a NoExecute taint. The reason is that running pods do not need to be evicted when Cilium goes down, as they are running and should not be disrupted.

@bmendoza820

Actually, I got this backwards; I set a NoSchedule taint and avoided a NoExecute taint. The reason is that running pods do not need to be evicted when Cilium goes down, as they are running and should not be disrupted.

When Cilium goes down, the running pods lose connectivity. Shouldn't they be evicted so that they can be rescheduled onto other nodes where Cilium is running? We encountered this scenario recently and experienced some service outages because the pods were not rescheduled elsewhere.

@squeed
Contributor Author

squeed commented Feb 20, 2025

When Cilium goes down, the running pods lose connectivity.

Generally, that shouldn't happen, with the exception of flows being sent through the userspace L7 proxies. Do you suspect a Cilium bug?

@bmendoza820

When Cilium goes down, the running pods lose connectivity.

Generally, that shouldn't happen, with the exception of flows being sent through the userspace L7 proxies. Do you suspect a Cilium bug?

Oh, this is not what we observed (or perhaps what we observed were failed network calls sent through the userspace L7 proxies). We saw several reported failed connection attempts for the duration that the nodes had the NoSchedule taint applied. This lasted for several hours until our autoscaler conducted node consolidation.


Labels

area/agent Cilium agent related. area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. area/daemon Impacts operation of the Cilium daemon. area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/major This PR introduces major new functionality to Cilium.


Development

Successfully merging this pull request may close these issues.

CNI plugin should tolerate a down agent.

9 participants