ipsec: Add configurable jitter to key file watcher to prevent thundering herd#44263
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
bf88dbe to
491f5e2
Compare
This comment was marked as resolved.
This comment was marked as resolved.
a91421c to
659ec1e
Compare
|
/test |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
a75f0d5 to
5caebbf
Compare
There was a problem hiding this comment.
Docs changes look good to me, though I wonder why turn the feature off by default. I'll defer those details to @cilium/ipsec owners though.
nit: Please clean up the commit message.
Currently it shows like this:
ipsec: Add jitter to key file watcher to prevent thundering herd . Fixes#: 42721
Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>
fixed documentation indents
Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>
ipsec: Add key rotation jitter to prevent thundering herd. Fixes: https://github.com/cilium/cilium/issues/42721
Signed-off-by: Sharvil Shetty <shavyshetty@users.noreply.github.com>
ipsec: Add jitter to key file watcher to prevent thundering herd . Fixes#: 42721
Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>
ipsec: Add key rotation jitter to prevent thundering herd. Fixes: https://github.com/cilium/cilium/issues/42721
Signed-off-by: Sharvil Shetty <shavyshetty@users.noreply.github.com>
ipsec: Add jitter to key file watcher to prevent thundering herd . Fixes#: 42721
Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>
ipsec: Add key rotation jitter to prevent thundering herd. Fixes: https://github.com/cilium/cilium/issues/42721
Signed-off-by: Sharvil Shetty <shavyshetty@users.noreply.github.com>
It should be more like:
ipsec: Add jitter to key file watcher to prevent thundering herd
We observed problems in our cluster where [...]
In order to fix this, add jitter into each agent to spread the
control plane load over a longer period so that the control plane
is not overwhelmed by many updates at the same time.
Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>
ACK on the commit messages, I thought I cleaned them up, but it looks like not. Will do. About the disabled by default, I left a note in the comment thread:
My thought is that smaller clusters won't encounter this issue, and cluster owners might prefer an immediate rotation instead. We only started seeing issues after reaching a certain size. I am flexible to go either way will wait for others to chime in. |
We do have some logic in Heads up a couple of the smoke tests fails, which will require you to update the PR. The failures list the commands to fix these things up at the end: |
smagnani96
left a comment
There was a problem hiding this comment.
Follow-up of the comment: I discussed this briefly with Paul. We believe we can always enable this behavior using a relatively small jitter delay, such as [0, keyRotationDuration/10].
I'd kindly ask to:
- adjust code accordingly
- remove the agent/helm flag to enable/disable: would not be easy and intuitive to use, let's always add a small jitter.
- adjust Doc accordingly
Sorry, I missed @pchaigno's comment from a couple of days ago. I am not sure I understand this part:
Just curious, how does this effectively half the rotation duration? And if we instead do /10, would that mean we have reduced the rotation duration by a factor of 10? In the absence of jitter, the max rotation duration is respected, so how does adding a delay ( jitter) to keyfilewatcher alter that behavior and affect the validity of the old key? |
|
@shavyshetty The worst case is when In large clusters this might not be enough, for this reason we're suggesting using a maximum jitter relatively small wrt the maximum key rotation time. |
This logic change sounds good to me. I do want to note that there is some natural jitter already, because in kubernetes, an update to a secret / configmap is not instant across all nodes. It takes some time for all kubelet instances to notice the change, and make the update. In my experience, it spreads the changes out over ~90 seconds. While it is a bit confusing to have jitter at two levels (one at kubelet, one introduced in this PR), I think it is still the right call. For large clusters, we likely want a larger jitter window & a larger maxRotation. |
|
Thanks for your reviews @Artyop, @adkafka, @joestringer, @pchaigno, and @smagnani96. I think I have now incorporated all your feedback, based on our collective understanding of the trade-offs for each option we explored. Please let me know if any further changes are needed for some CI tests. |
|
@shavyshetty you need to go fmt to have the go checks to pass |
smagnani96
left a comment
There was a problem hiding this comment.
Two minor nits to please golangci-lint, then LGTM.
PR is way in a better shape than before, thanks for sticking with our feedback.
|
/test |
smagnani96
left a comment
There was a problem hiding this comment.
LGTM, thanks.
@pchaigno if you want double-check last impl.
|
@smagnani96 @pchaigno @joestringer |
smagnani96
left a comment
There was a problem hiding this comment.
@smagnani96 @pchaigno @joestringer
Hey folks, just checking if anything is pending on my end for the PR to be merged?
LGTM, I'll start the tests 👍🏼
|
/test |
pchaigno
left a comment
There was a problem hiding this comment.
The code looks good to me. Just a couple minor things to fix on the commit description.
We observed CPU spikes in large clusters where all agents detect key file changes simultaneously and update CiliumNode resources at once, overwhelming the Kubernetes API server. In order to fix this, add jitter into each agent to spread the control plane load over a longer period so that the control plan is not overwhelmed by many updates at the same time. When enabled, jitter is randomly selected from [0, keyRotationDuration/10]. Fixes: cilium#42721 Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>
|
/test |
|
Thanks for the contribution @shavyshetty 🎉 |
Adds a configurable jitter delay to the IPsec key file watcher to prevent a thundering herd on the Kubernetes API server during key rotation in large clusters.
Fixes: #42721