Optimize scalability of CiliumIdentity operation #11275
…dpoint Rework the CiliumIdentity garbage collector to derive the liveness of a CiliumIdentity from the identity usage recorded in CiliumEndpoint information. As the CiliumEndpoint garbage collector scans endpoints, it maintains a list of identities with heartbeat signals. When the CiliumIdentity garbage collector runs, it consults these heartbeat signals before removing stale identities. To make the garbage collector work with the heartbeat timeout requirements, the default scan intervals are reduced slightly. Signed-off-by: Thomas Graf <thomas@cilium.io>
With the introduction of CiliumEndpoint-based garbage collection of CiliumIdentities, the status field of the CiliumIdentity can remain unpopulated. This massively increases the scalability of identity allocation, as no PATCHing is required after the identity has been allocated. Signed-off-by: Thomas Graf <thomas@cilium.io>
test-me-please
test-me-please

Looks like the CI is subject to some underlying network issues.

test-me-please
    defer i.mutex.Unlock()

    for identity, lifesign := range i.lastLifesign {
        if time.Since(lifesign) > 10*i.timeout {
Why the 10*? Can't this live somewhere else?
10 is a safe assumption for when a life sign is definitely never needed again.
    @@ -70,6 +71,7 @@ func enableCiliumEndpointSyncGC() {
        podsCache[pod.Namespace+"/"+pod.Name] = &pod
It worries me a little bit that we are doing an entire Pod List of the cluster every N minutes. This function should be rewritten once #11195 is merged.
Is your worry that we do this every 10min instead of 30min now? We have always been doing this; this change only adjusts the interval.
We can definitely optimize this. I'm not sure a watcher is always better, though: it pays the cost of churn, whereas right now we pay the snapshot cost of listing all pods, which can be a lot less or a lot more. It will vary.
I wasn't aware of this listing at all; that's why I made the comment. The watcher will be good because we can optimize it to keep only the pod name and namespace in memory (instead of all fields of the pod). It can be optimized as a follow-up.
    // For each CEP we fetched, check if we know about it
    for _, cep := range ceps.Items {
        if cep.Status.Identity != nil {
            identityHeartbeat.MarkAlive(strconv.FormatInt(cep.Status.Identity.ID, 10), timeNow)
Can't this logic be in the CEP watcher? With a reference counter we would be able to know which CEPs are using a particular identity.
Apparently a non-existing one (I thought we had one in the operator). I was thinking of having a CEP watcher in the operator that would keep a reference counter for the identities in use.
The watcher has a downside, though: if we ever miss the notification, we won't ever clean up again unless the identity gets re-used and released again. We are also no longer talking about a heartbeat-based mechanism but about event notification, with the usual pros and cons.
Following the deprecation notice from cilium#11275, remove the field. Signed-off-by: Chris Tarazi <chris@isovalent.com>
This PR changes the garbage-collection mechanism of identities to use CiliumEndpoint information instead of requiring all nodes to maintain status information about who is using a particular identity.
This massively improves the scalability of identity management in CRD mode.
Depends on: #11270