daemon: add cleanup for stale local ciliumendpoints that aren't being managed. #20350
Conversation
Commit 0740077f1c203ddf5604d04d5c4da5fdd003313c does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
/test
/test
Lots of failures, going to take a look at these.
/test
christarazi left a comment
Nice work! A few comments below. Overall the approach is sound.
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Added support for an indexing informer in k8s/watchers, as well as a custom indexer func that maintains an index of CESs containing local endpoints, keyed by their underlying endpoint names. Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
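The indexing idea above can be sketched as follows. This is a minimal, stdlib-only illustration, not Cilium's actual code: the `ciliumEndpointSlice` types and the function name are hypothetical stand-ins, and the function plays the role a client-go `cache.IndexFunc` would play when registered on the informer.

```go
package main

import "fmt"

// ciliumEndpoint and ciliumEndpointSlice are pared-down, hypothetical
// stand-ins for the CES types; the real objects live in Cilium's k8s
// API packages.
type ciliumEndpoint struct {
	Name string
}

type ciliumEndpointSlice struct {
	Name      string
	Endpoints []ciliumEndpoint
}

// cesByEndpointNameIndexFunc plays the role of a client-go
// cache.IndexFunc: it returns one index key per endpoint contained in
// the CES, so the informer's indexer can answer "which CES holds
// endpoint X" without scanning every slice.
func cesByEndpointNameIndexFunc(obj interface{}) ([]string, error) {
	ces, ok := obj.(*ciliumEndpointSlice)
	if !ok {
		return nil, fmt.Errorf("unexpected object type %T", obj)
	}
	keys := make([]string, 0, len(ces.Endpoints))
	for _, ep := range ces.Endpoints {
		keys = append(keys, ep.Name)
	}
	return keys, nil
}

func main() {
	ces := &ciliumEndpointSlice{
		Name:      "ces-1",
		Endpoints: []ciliumEndpoint{{Name: "default/pod-a"}, {Name: "default/pod-b"}},
	}
	keys, _ := cesByEndpointNameIndexFunc(ces)
	fmt.Println(keys) // [default/pod-a default/pod-b]
}
```

With client-go, a function of this shape would be registered when constructing the shared informer, e.g. via `cache.Indexers{"by-cep-name": cesByEndpointNameIndexFunc}` (index name illustrative).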
It's possible for CiliumEndpoints to become stale while they still reference existing Pods that are no longer managed by Cilium. In this scenario, the operator will not GC these CEPs because they have a valid pod owner reference. This commit adds an init-time cleanup of stale CEPs. Additionally, the CEP/CES K8s watchers will mark such CEPs for deletion, and a controller GC routine will periodically GC the old CEPs. Fixes cilium#17631 Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
/test
@tommyp1ckles For the tophats, this seems to be non-trivial to backport to v1.12 and v1.11 due to changes in the agent structure. Could you attempt the backport yourself? Thanks!
Removed the backport labels, as this has been picked up for automation multiple times already. |
It's possible for CiliumEndpoints to become stale while they still reference existing Pods that are no longer managed by Cilium.
In this scenario, the operator will not GC these CEPs because they have a valid pod owner reference.
This commit adds an init-time cleanup of stale CEPs. Additionally, the CEP/CES K8s watchers will mark such CEPs for deletion, and a controller GC routine will periodically GC the old CEPs.
Fixes #17631
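The mark-then-GC flow described above can be sketched with stdlib Go. This is an illustrative sketch, not the actual Cilium controller: the `staleCEPSet` type, function names, and the idea of a plain ticker loop are assumptions standing in for the agent's controller framework.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// staleCEPSet collects names of CEPs that the cep/ces watchers marked as
// stale; the GC routine drains it periodically. All names here are
// illustrative, not Cilium's real implementation.
type staleCEPSet struct {
	mu    sync.Mutex
	names map[string]struct{}
}

func newStaleCEPSet() *staleCEPSet {
	return &staleCEPSet{names: make(map[string]struct{})}
}

// mark records a CEP for deletion; called from watcher event handlers.
func (s *staleCEPSet) mark(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.names[name] = struct{}{}
}

// drain returns the currently-marked names (sorted, for determinism
// in this sketch) and clears the set.
func (s *staleCEPSet) drain() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]string, 0, len(s.names))
	for n := range s.names {
		out = append(out, n)
	}
	s.names = make(map[string]struct{})
	sort.Strings(out)
	return out
}

// runGC periodically deletes marked CEPs until stop is closed. In the
// real agent, deleteCEP would issue a Kubernetes API delete.
func runGC(s *staleCEPSet, interval time.Duration, deleteCEP func(string), stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			for _, n := range s.drain() {
				deleteCEP(n)
			}
		case <-stop:
			return
		}
	}
}

func main() {
	set := newStaleCEPSet()
	set.mark("default/pod-a")
	set.mark("default/pod-b")
	for _, n := range set.drain() {
		fmt.Println("would delete CEP", n)
	}
}
```

Decoupling "mark" (in the watcher) from "delete" (in the periodic routine) keeps the event handlers fast and batches API deletes, which matches the commit's watcher-plus-GC-controller split.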
Background: we've seen instances where a Pod's CEP has become stale and out of sync with the actual Pod it's meant to be managing, particularly in the following two cases:
A Pod somehow becomes un-managed while retaining its CEP. One way this can happen (and how I reproduced it) is if the /etc/cni... Cilium config files get removed and the Node is restarted, losing its Cilium bpf state. In this case the same Pod might get re-sandboxed with another CNI (e.g. if the containers have to be restarted). At this point you have a Pod with a CEP but no endpoint in the endpoint manager. In this situation the CEP IP and the actual Pod IP are likely to differ, since the Pod has been restarted under a different CNI. The controller will not GC the CEP, as the Pod UID and owner reference have not changed.
A Pod becomes un-managed due to lost state. This can happen if the bpf state for an endpoint gets deleted (such as with a temporary fs following a reboot). When the Cilium pod restarts, the existing endpoint will not be restored, but the Pod is still running with all its Cilium state intact.
In both cases, the agent can determine whether the CEP should be deleted by checking against its managed endpoints: if none exist, we know the Pod is unmanaged. Endpoints that change, such as when a Pod container is killed, will have their CiliumEndpoint eventually resynced via the k8s sync controller.
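The staleness check itself is a simple set membership test against the endpoint manager. A minimal sketch, assuming a hypothetical `managedEndpoints` set in place of the real endpoint-manager API:

```go
package main

import "fmt"

// managedEndpoints stands in for the agent's endpoint manager: the set
// of CEP names backed by a live local endpoint. This is a hypothetical
// helper, not the real endpoint-manager interface.
type managedEndpoints map[string]struct{}

// isStaleCEP implements the check described above: a CEP observed by
// the watchers is stale exactly when no managed local endpoint
// corresponds to it, meaning the Pod is no longer Cilium-managed.
func isStaleCEP(managed managedEndpoints, cepName string) bool {
	_, ok := managed[cepName]
	return !ok
}

func main() {
	managed := managedEndpoints{"default/pod-a": {}}
	fmt.Println(isStaleCEP(managed, "default/pod-a")) // false: still managed
	fmt.Println(isStaleCEP(managed, "default/pod-b")) // true: stale, candidate for GC
}
```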