daemon: add cleanup for stale local ciliumendpoints that aren't being managed. #20350
Conversation
Commit 0740077f1c203ddf5604d04d5c4da5fdd003313c does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
/test
/test
Lots of failures, going to take a look at these.
/test
christarazi left a comment
Nice work! A few comments below. Overall the approach is sound.
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Added support for an indexing informer in k8s/watchers, as well as a custom indexer func that maintains an index of CESs containing local endpoints, keyed by their underlying endpoint names. Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
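The indexing idea above can be sketched as follows. This is a minimal, stdlib-only illustration, not Cilium's actual code: the `ciliumEndpointSlice` types and the function name are hypothetical stand-ins, and the function plays the role a client-go `cache.IndexFunc` would play when registered on the informer.

```go
package main

import "fmt"

// ciliumEndpoint and ciliumEndpointSlice are pared-down, hypothetical
// stand-ins for the CES types; the real objects live in Cilium's k8s
// API packages.
type ciliumEndpoint struct {
	Name string
}

type ciliumEndpointSlice struct {
	Name      string
	Endpoints []ciliumEndpoint
}

// cesByEndpointNameIndexFunc plays the role of a client-go
// cache.IndexFunc: it returns one index key per endpoint contained in
// the CES, so the informer's indexer can answer "which CES holds
// endpoint X" without scanning every slice.
func cesByEndpointNameIndexFunc(obj interface{}) ([]string, error) {
	ces, ok := obj.(*ciliumEndpointSlice)
	if !ok {
		return nil, fmt.Errorf("unexpected object type %T", obj)
	}
	keys := make([]string, 0, len(ces.Endpoints))
	for _, ep := range ces.Endpoints {
		keys = append(keys, ep.Name)
	}
	return keys, nil
}

func main() {
	ces := &ciliumEndpointSlice{
		Name:      "ces-1",
		Endpoints: []ciliumEndpoint{{Name: "default/pod-a"}, {Name: "default/pod-b"}},
	}
	keys, _ := cesByEndpointNameIndexFunc(ces)
	fmt.Println(keys) // [default/pod-a default/pod-b]
}
```

With client-go, a function of this shape would be registered when constructing the shared informer, e.g. via `cache.Indexers{"by-cep-name": cesByEndpointNameIndexFunc}` (index name illustrative).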
It's possible for CiliumEndpoints to become stale while they still reference existing Pods that are no longer managed by Cilium. In this scenario, the operator will not GC these CEPs because they have a valid pod owner reference. This commit adds an init-time cleanup of stale CEPs. Additionally, the CEP/CES K8s watchers will mark such CEPs for deletion, and a controller GC routine will periodically GC the old CEPs. Fixes cilium#17631 Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
/test
@tommyp1ckles For the tophats, this seems to be non-trivial to backport to v1.12 and v1.11 due to changes in the agent structure. Could you attempt the backport yourself? Thanks!
Removed the backport labels, as this has been picked up for automation multiple times already. |
It's possible for CiliumEndpoints to become stale while they still reference existing Pods that are no longer managed by Cilium.
In this scenario, the operator will not GC these CEPs because they have a valid pod owner reference.
This commit adds an init-time cleanup of stale CEPs. Additionally, the CEP/CES K8s watchers will mark such CEPs for deletion, and a controller GC routine will periodically GC the old CEPs.
Fixes #17631
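The mark-then-GC flow described above can be sketched with stdlib Go. This is an illustrative sketch, not the actual Cilium controller: the `staleCEPSet` type, function names, and the idea of a plain ticker loop are assumptions standing in for the agent's controller framework.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// staleCEPSet collects names of CEPs that the cep/ces watchers marked as
// stale; the GC routine drains it periodically. All names here are
// illustrative, not Cilium's real implementation.
type staleCEPSet struct {
	mu    sync.Mutex
	names map[string]struct{}
}

func newStaleCEPSet() *staleCEPSet {
	return &staleCEPSet{names: make(map[string]struct{})}
}

// mark records a CEP for deletion; called from watcher event handlers.
func (s *staleCEPSet) mark(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.names[name] = struct{}{}
}

// drain returns the currently-marked names (sorted, for determinism
// in this sketch) and clears the set.
func (s *staleCEPSet) drain() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]string, 0, len(s.names))
	for n := range s.names {
		out = append(out, n)
	}
	s.names = make(map[string]struct{})
	sort.Strings(out)
	return out
}

// runGC periodically deletes marked CEPs until stop is closed. In the
// real agent, deleteCEP would issue a Kubernetes API delete.
func runGC(s *staleCEPSet, interval time.Duration, deleteCEP func(string), stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			for _, n := range s.drain() {
				deleteCEP(n)
			}
		case <-stop:
			return
		}
	}
}

func main() {
	set := newStaleCEPSet()
	set.mark("default/pod-a")
	set.mark("default/pod-b")
	for _, n := range set.drain() {
		fmt.Println("would delete CEP", n)
	}
}
```

Decoupling "mark" (in the watcher) from "delete" (in the periodic routine) keeps the event handlers fast and batches API deletes, which matches the commit's watcher-plus-GC-controller split.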
Background: we've seen instances where a Pod's CEP has become stale and out of sync with the actual Pod it's meant to be managing, particularly in the following two cases:
A Pod somehow becomes un-managed while retaining its CEP. One way this can happen (and how I reproduced it) is if the /etc/cni... Cilium config files get removed and the Node is restarted, losing its Cilium bpf state. In this case the same Pod might get re-sandboxed with another CNI (e.g. if the containers have to be restarted). At this point you have a Pod with a CEP but no endpoint in the endpoint manager. In this situation the CEP IP and the actual Pod IP are likely to differ, since the Pod has been restarted under a different CNI. The controller will not GC the CEP, as the Pod UID and owner reference have not changed.
A Pod becomes un-managed due to lost state. This can happen if the bpf state for an endpoint gets deleted (such as with a temporary fs following a reboot). When the Cilium pod restarts, the existing endpoint will not be restored, but the Pod is still running with all its Cilium state intact.
In both cases, the agent can determine whether the CEP should be deleted by checking against its managed endpoints: if none exist, we know the Pod is unmanaged. Endpoints that change, such as when a Pod container is killed, will have their CiliumEndpoint eventually resynced via the k8s sync controller.
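The staleness check itself is a simple set membership test against the endpoint manager. A minimal sketch, assuming a hypothetical `managedEndpoints` set in place of the real endpoint-manager API:

```go
package main

import "fmt"

// managedEndpoints stands in for the agent's endpoint manager: the set
// of CEP names backed by a live local endpoint. This is a hypothetical
// helper, not the real endpoint-manager interface.
type managedEndpoints map[string]struct{}

// isStaleCEP implements the check described above: a CEP observed by
// the watchers is stale exactly when no managed local endpoint
// corresponds to it, meaning the Pod is no longer Cilium-managed.
func isStaleCEP(managed managedEndpoints, cepName string) bool {
	_, ok := managed[cepName]
	return !ok
}

func main() {
	managed := managedEndpoints{"default/pod-a": {}}
	fmt.Println(isStaleCEP(managed, "default/pod-a")) // false: still managed
	fmt.Println(isStaleCEP(managed, "default/pod-b")) // true: stale, candidate for GC
}
```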