Skip to content

operator: rate limit GC of security identities#12451

Merged
aanm merged 4 commits intomasterfrom
pr/do-not-burst-identitiy-gc
Jul 9, 2020
Merged

operator: rate limit GC of security identities#12451
aanm merged 4 commits intomasterfrom
pr/do-not-burst-identitiy-gc

Conversation

@aanm
Copy link
Copy Markdown
Member

@aanm aanm commented Jul 7, 2020

To prevent bursts of security identities from being deleted in the
KVStore, possibly causing Cilium agent to have a high CPU usage due
policy calculation, this commit adds a rate limiter for such KVStore
deletes. For example, in case there are 1000 identities to GCed, the
operator will delete 250 every minute until all 1000 identities are
GCed.

The 1.7 version is here #12450

@aanm aanm added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. backport-pending/1.7 labels Jul 7, 2020
@aanm aanm requested a review from gandro July 7, 2020 23:49
@aanm aanm requested a review from a team as a code owner July 7, 2020 23:49
@aanm aanm requested a review from a team July 7, 2020 23:49
@aanm aanm requested review from a team as code owners July 7, 2020 23:49
@aanm aanm force-pushed the pr/do-not-burst-identitiy-gc branch from 14c3c33 to d7cee75 Compare July 8, 2020 00:05
@aanm
Copy link
Copy Markdown
Member Author

aanm commented Jul 8, 2020

test-me-please

@coveralls
Copy link
Copy Markdown

coveralls commented Jul 8, 2020

Coverage Status

Coverage increased (+0.04%) to 36.976% when pulling 2eef9ea on pr/do-not-burst-identitiy-gc into 9db20dc on master.

@aanm aanm force-pushed the pr/do-not-burst-identitiy-gc branch from d7cee75 to 690854b Compare July 8, 2020 08:00
@aanm
Copy link
Copy Markdown
Member Author

aanm commented Jul 8, 2020

all CI jobs passed except travis and runtime due to the fact that I forgot to generate the documentation for the new flags

@aanm
Copy link
Copy Markdown
Member Author

aanm commented Jul 8, 2020

retest-runtime

Copy link
Copy Markdown
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I have validated this on a external etcd cluster with 45k dangling identities and it does indeed fix #8360 🎉

In terms of code, just one minor API issue with the introduced rate limiting package.

Comment thread pkg/rate/limiter.go Outdated
Comment thread pkg/rate/limiter.go Outdated
@aanm aanm force-pushed the pr/do-not-burst-identitiy-gc branch from 690854b to 70d5617 Compare July 8, 2020 10:03
@aanm
Copy link
Copy Markdown
Member Author

aanm commented Jul 8, 2020

test-me-please

Copy link
Copy Markdown
Member

@nebril nebril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, but since we are using golang.org/x/time/rate elsewhere in our code, would it be possible to be consistent here?

Comment thread pkg/rate/limiter.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to not use https://godoc.org/golang.org/x/time/rate ?

Copy link
Copy Markdown
Member

@gandro gandro Jul 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is a reason. We were initially using x/time/rate, but it turns out that the token bucket algorithm used by x/time/rate is not suitable for our purpose.

An observation we made in reproducing #8360 is that even when there are thousands of identities marked for deletion, cilium-operator already does not seem to produce more than a few dozen deletion events per second (since it has to do a kvstore operation for each identity).

The problem is that it sends a constant stream of updates (at a rate of few dozen deletion events per second) over the period of a few minutes. This in turn causes cilium-agent to constantly regenerate the endpoints (as it is receiving new updates every second), causing it to become unresponsive as it is wasting time regenerating endpoints.

To formulate simply: The main problem are not mainly bursts, the problem is the constant stream of updates.

So a token bucket algorithm as implemented by x/time/rate will not fix the issue. It is designed to flatten bursts in a signal to a fixed output rate:

image

Our main concern is that we want to give cilium-agent some time to breathe. In fact, Cilium is very able to deal with small bursts of identity changes, as it does coalesce identity updates happening within a predefined interval (--policy-trigger-interval, default 1s).

Therefore, André's custom rate limiter kind of does the opposite of x/time/rate. It takes a somewhat fixed-rate stream of updates and turns it into a stream of controlled small bursts (250 by default) every 1 minutes.

This gives cilium-agent time to recover from endpoint regeneration churn within the one minute idle time.

@aanm: We probably should add a comment in the source code somewhere about so nobody accidentally replaces the rate limiter with x/time/rate.

Copy link
Copy Markdown
Member

@christarazi christarazi Jul 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on the comment. Heck, even this explanation by @gandro would be great to have as the package documentation as well. Thanks for the great explanation @gandro, very easy to understand.

Copy link
Copy Markdown
Member Author

@aanm aanm Jul 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gandro I've had it here, if that's not enough what other places should I put it? https://github.com/cilium/cilium/blob/15adbd708765c2c01029ea2d622b1f2c79241a63/pkg/rate/doc.go#L15-L18

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment explains the "what", but I think we are missing the "why" we are using that algorithm for identity GC.

It's not obvious from the identitiy GC code that the traffic pattern produced by rate limiting algorithm matters. We currently only explain is how it is different from golang.org/x/time/rate, but not why it needs to be different.

To be clear, the "what it does" package level doc in rate that you quoted is fine with me. But I think we should add a comment here: https://github.com/cilium/cilium/blob/70d561786228892a2988ccccbe5ed9b827986acf/operator/main.go#L77
There we should have a comment about "why" we need a algorithm that produces small-bursts with pauses in between.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a comment in a follow up. I think documenting the "why" is very valuable to our future selves and newcomers who are taking a look at this code.

aanm added 4 commits July 9, 2020 16:55
Some keys can be reused in between GC function calls. To avoid them
being GCed we should not mark them as stale keys and wait for the next
GC call to be executed.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
To prevent bursts of security identities from being deleted in the
KVStore, possibly causing Cilium agent to have a high CPU usage due
policy calculation, this commit adds a rate limiter for such KVStore
deletes. For example, in case there are 1000 identities to GCed, the
operator will delete 250 every minute until all 1000 identities are
GCed.

Signed-off-by: André Martins <andre@cilium.io>
@aanm aanm force-pushed the pr/do-not-burst-identitiy-gc branch from 70d5617 to 2eef9ea Compare July 9, 2020 15:55
@aanm
Copy link
Copy Markdown
Member Author

aanm commented Jul 9, 2020

merging as the comments addressed were only descriptions of packages and fields and the CI was previously green

@aanm aanm merged commit ea57e36 into master Jul 9, 2020
@aanm aanm deleted the pr/do-not-burst-identitiy-gc branch July 9, 2020 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note/minor This PR changes functionality that users may find relevant to operating Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants