operator: rate limit GC of security identities#12451
Conversation
14c3c33 to
d7cee75
Compare
|
test-me-please |
d7cee75 to
690854b
Compare
|
all CI jobs passed except travis and runtime due to the fact that I forgot to generate the documentation for the new flags |
|
retest-runtime |
690854b to
70d5617
Compare
|
test-me-please |
nebril
left a comment
There was a problem hiding this comment.
Looks good overall, but since we are using golang.org/x/time/rate elsewhere in our code, would it be possible to be consistent here?
There was a problem hiding this comment.
Is there a reason to not use https://godoc.org/golang.org/x/time/rate ?
There was a problem hiding this comment.
Yes, there is a reason. We were initially using x/time/rate, but it turns out that the token bucket algorithm used by x/time/rate is not suitable for our purpose.
An observation we made in reproducing #8360 is that even when there are thousands of identities marked for deletion, cilium-operator already does not seem to produce more than a few dozen deletion events per second (since it has to do a kvstore operation for each identity).
The problem is that it sends a constant stream of updates (at a rate of few dozen deletion events per second) over the period of a few minutes. This in turn causes cilium-agent to constantly regenerate the endpoints (as it is receiving new updates every second), causing it to become unresponsive as it is wasting time regenerating endpoints.
To formulate simply: The main problem are not mainly bursts, the problem is the constant stream of updates.
So a token bucket algorithm as implemented by x/time/rate will not fix the issue. It is designed to flatten bursts in a signal to a fixed output rate:
Our main concern is that we want to give cilium-agent some time to breathe. In fact, Cilium is very able to deal with small bursts of identity changes, as it does coalesce identity updates happening within a predefined interval (--policy-trigger-interval, default 1s).
Therefore, André's custom rate limiter kind of does the opposite of x/time/rate. It takes a somewhat fixed-rate stream of updates and turns it into a stream of controlled small bursts (250 by default) every 1 minutes.
This gives cilium-agent time to recover from endpoint regeneration churn within the one minute idle time.
@aanm: We probably should add a comment in the source code somewhere about so nobody accidentally replaces the rate limiter with x/time/rate.
There was a problem hiding this comment.
@gandro I've had it here, if that's not enough what other places should I put it? https://github.com/cilium/cilium/blob/15adbd708765c2c01029ea2d622b1f2c79241a63/pkg/rate/doc.go#L15-L18
There was a problem hiding this comment.
The comment explains the "what", but I think we are missing the "why" we are using that algorithm for identity GC.
It's not obvious from the identitiy GC code that the traffic pattern produced by rate limiting algorithm matters. We currently only explain is how it is different from golang.org/x/time/rate, but not why it needs to be different.
To be clear, the "what it does" package level doc in rate that you quoted is fine with me. But I think we should add a comment here: https://github.com/cilium/cilium/blob/70d561786228892a2988ccccbe5ed9b827986acf/operator/main.go#L77
There we should have a comment about "why" we need a algorithm that produces small-bursts with pauses in between.
There was a problem hiding this comment.
We can add a comment in a follow up. I think documenting the "why" is very valuable to our future selves and newcomers who are taking a look at this code.
Some keys can be reused in between GC function calls. To avoid them being GCed we should not mark them as stale keys and wait for the next GC call to be executed. Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
To prevent bursts of security identities from being deleted in the KVStore, possibly causing Cilium agent to have a high CPU usage due policy calculation, this commit adds a rate limiter for such KVStore deletes. For example, in case there are 1000 identities to GCed, the operator will delete 250 every minute until all 1000 identities are GCed. Signed-off-by: André Martins <andre@cilium.io>
70d5617 to
2eef9ea
Compare
|
merging as the comments addressed were only descriptions of packages and fields and the CI was previously green |

To prevent bursts of security identities from being deleted in the
KVStore, possibly causing Cilium agent to have a high CPU usage due
policy calculation, this commit adds a rate limiter for such KVStore
deletes. For example, in case there are 1000 identities to GCed, the
operator will delete 250 every minute until all 1000 identities are
GCed.
The 1.7 version is here #12450