Skip to content

ipsec: Add configurable jitter to key file watcher to prevent thundering herd#44263

Merged
pchaigno merged 1 commit intocilium:mainfrom
shavyshetty:feat/ipsec-key-rotation-jitter
Mar 3, 2026
Merged

ipsec: Add configurable jitter to key file watcher to prevent thundering herd#44263
pchaigno merged 1 commit intocilium:mainfrom
shavyshetty:feat/ipsec-key-rotation-jitter

Conversation

@shavyshetty
Copy link
Copy Markdown

@shavyshetty shavyshetty commented Feb 9, 2026

Adds a configurable jitter delay to the IPsec key file watcher to prevent a thundering herd on the Kubernetes API server during key rotation in large clusters.

Add jitter delay to the IPsec key file watcher for key rotations, to avoid thundering herd problem on Cilium agents.

Fixes: #42721

@maintainer-s-little-helper

This comment was marked as resolved.

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Feb 9, 2026
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Feb 9, 2026
@shavyshetty shavyshetty force-pushed the feat/ipsec-key-rotation-jitter branch from bf88dbe to 491f5e2 Compare February 9, 2026 03:42
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Feb 9, 2026
@maintainer-s-little-helper

This comment was marked as resolved.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Feb 9, 2026
@shavyshetty shavyshetty force-pushed the feat/ipsec-key-rotation-jitter branch from a91421c to 659ec1e Compare February 9, 2026 04:07
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Feb 9, 2026
@shavyshetty shavyshetty changed the title ipsec: Add jitter to key file watcher to prevent thundering herd ipsec: Add configurable jitter to key file watcher to prevent thundering herd Feb 9, 2026
@shavyshetty shavyshetty marked this pull request as ready for review February 9, 2026 04:34
@shavyshetty shavyshetty requested review from a team as code owners February 9, 2026 04:34
@Artyop
Copy link
Copy Markdown
Contributor

Artyop commented Feb 9, 2026

/test

@pchaigno pchaigno added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Feb 9, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 9, 2026
@pchaigno pchaigno added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. feature/ipsec Relates to Cilium's IPsec feature labels Feb 9, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 9, 2026
@maintainer-s-little-helper

This comment was marked as resolved.

@maintainer-s-little-helper

This comment was marked as resolved.

@shavyshetty shavyshetty force-pushed the feat/ipsec-key-rotation-jitter branch from a75f0d5 to 5caebbf Compare February 13, 2026 20:16
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Feb 13, 2026
@HadrienPatte HadrienPatte removed request for a team February 13, 2026 20:20
Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs changes look good to me, though I wonder why turn the feature off by default. I'll defer those details to @cilium/ipsec owners though.

nit: Please clean up the commit message.

Currently it shows like this:

ipsec: Add jitter to key file watcher to prevent thundering herd . Fixes#: 42721
Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>

fixed documentation indents

Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>

ipsec: Add key rotation jitter to prevent thundering herd. Fixes: https://github.com/cilium/cilium/issues/42721

Signed-off-by: Sharvil Shetty <shavyshetty@users.noreply.github.com>

ipsec: Add jitter to key file watcher to prevent thundering herd . Fixes#: 42721

Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>

ipsec: Add key rotation jitter to prevent thundering herd. Fixes: https://github.com/cilium/cilium/issues/42721

Signed-off-by: Sharvil Shetty <shavyshetty@users.noreply.github.com>

ipsec: Add jitter to key file watcher to prevent thundering herd . Fixes#: 42721

Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>

ipsec: Add key rotation jitter to prevent thundering herd. Fixes: https://github.com/cilium/cilium/issues/42721

Signed-off-by: Sharvil Shetty <shavyshetty@users.noreply.github.com>

It should be more like:

ipsec: Add jitter to key file watcher to prevent thundering herd 

We observed problems in our cluster where [...]

In order to fix this, add jitter into each agent to spread the
control plane load over a longer period so that the control plane
is not overwhelmed by many updates at the same time.

Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>

@shavyshetty
Copy link
Copy Markdown
Author

shavyshetty commented Feb 13, 2026

Docs changes look good to me, though I wonder why turn the feature off by default. I'll defer those details to @cilium/ipsec owners though.

nit: Please clean up the commit message.

Currently it shows like this:

ipsec: Add jitter to key file watcher to prevent thundering herd . Fixes#: 42721
Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>

fixed documentation indents

Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>

ipsec: Add key rotation jitter to prevent thundering herd. Fixes: https://github.com/cilium/cilium/issues/42721

Signed-off-by: Sharvil Shetty <shavyshetty@users.noreply.github.com>

ipsec: Add jitter to key file watcher to prevent thundering herd . Fixes#: 42721

Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>

ipsec: Add key rotation jitter to prevent thundering herd. Fixes: https://github.com/cilium/cilium/issues/42721

Signed-off-by: Sharvil Shetty <shavyshetty@users.noreply.github.com>

ipsec: Add jitter to key file watcher to prevent thundering herd . Fixes#: 42721

Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>

ipsec: Add key rotation jitter to prevent thundering herd. Fixes: https://github.com/cilium/cilium/issues/42721

Signed-off-by: Sharvil Shetty <shavyshetty@users.noreply.github.com>

It should be more like:

ipsec: Add jitter to key file watcher to prevent thundering herd 

We observed problems in our cluster where [...]

In order to fix this, add jitter into each agent to spread the
control plane load over a longer period so that the control plane
is not overwhelmed by many updates at the same time.

Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>

ACK on the commit messages, I thought I cleaned them up, but it looks like not. Will do.

About the disabled by default, I left a note in the comment thread:

Based on the preferences in this thread, I updated the logic such that jitter will be between [0, maxRotation/2] if enabled. I kept the option as an opt-in for clusters rather than enabling jitter by default. Let me know if you all disagree with this.
#44263 (comment)

My thought is that smaller clusters won't encounter this issue, and cluster owners might prefer an immediate rotation instead. We only started seeing issues after reaching a certain size. I am flexible to go either way will wait for others to chime in.

@joestringer
Copy link
Copy Markdown
Member

My thought is that smaller clusters won't encounter this issue, and cluster owners might prefer an immediate rotation instead.

We do have some logic in pkg/backoff that scales time intervals depending on cluster size, might be another option to consider if the raw jitter numbers are too much. Though maybe that is a configuration not for the jitter but more for the maxRotation value(?)

Heads up a couple of the smoke tests fails, which will require you to update the PR. The failures list the commands to fix these things up at the end:
https://github.com/cilium/cilium/actions/runs/22002029026/job/63579007443?pr=44263
https://github.com/cilium/cilium/actions/runs/22002029056/job/63579029836?pr=44263

Copy link
Copy Markdown
Contributor

@smagnani96 smagnani96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up of the comment: I discussed this briefly with Paul. We believe we can always enable this behavior using a relatively small jitter delay, such as [0, keyRotationDuration/10].
I'd kindly ask to:

  1. adjust code accordingly
  2. remove the agent/helm flag to enable/disable: would not be easy and intuitive to use, let's always add a small jitter.
  3. adjust Doc accordingly

@shavyshetty
Copy link
Copy Markdown
Author

shavyshetty commented Feb 17, 2026

Follow-up of the comment: I discussed this briefly with Paul. We believe we can always enable this behavior using a relatively small jitter delay, such as [0, keyRotationDuration/10]. I'd kindly ask to:

  1. adjust code accordingly
  2. remove the agent/helm flag to enable/disable: would not be easy and intuitive to use, let's always add a small jitter.
  3. adjust Doc accordingly

Sorry, I missed @pchaigno's comment from a couple of days ago. I am not sure I understand this part:

If we go for [0, maxRotation/2], then we've just halved the rotation duration for all existing users in the worst case. Not great.

Just curious, how does this effectively half the rotation duration? And if we instead do /10, would that mean we have reduced the rotation duration by a factor of 10?

In the absence of jitter, the max rotation duration is respected, so how does adding a delay ( jitter) to keyfilewatcher alter that behavior and affect the validity of the old key?

@smagnani96
Copy link
Copy Markdown
Contributor

@shavyshetty The worst case is when jitter = maxRotation/2, which is 2min30sec in the default case with maxRotation=5min. This means nodes realize there's a new key only after 2min30sec. As a result, the rotation procedure has only 5min - 2min30sec before we cleanup old state/policies referring to the previous key.

In large clusters this might not be enough, for this reason we're suggesting using a maximum jitter relatively small wrt the maximum key rotation time.

@adkafka
Copy link
Copy Markdown

adkafka commented Feb 17, 2026

@shavyshetty The worst case is when jitter = maxRotation/2, which is 2min30sec in the default case with maxRotation=5min. This means nodes realize there's a new key only after 2min30sec. As a result, the rotation procedure has only 5min - 2min30sec before we cleanup old state/policies referring to the previous key.

In large clusters this might not be enough, for this reason we're suggesting using a maximum jitter relatively small wrt the maximum key rotation time.

This logic change sounds good to me.

I do want to note that there is some natural jitter already, because in kubernetes, an update to a secret / configmap is not instant across all nodes. It takes some time for all kubelet instances to notice the change, and make the update. In my experience, it spreads the changes out over ~90 seconds.

While it is a bit confusing to have jitter at two levels (one at kubelet, one introduced in this PR), I think it is still the right call. For large clusters, we likely want a larger jitter window & a larger maxRotation.

@shavyshetty
Copy link
Copy Markdown
Author

Thanks for your reviews @Artyop, @adkafka, @joestringer, @pchaigno, and @smagnani96. I think I have now incorporated all your feedback, based on our collective understanding of the trade-offs for each option we explored.
I appreciate your willingness to incorporate this change into Cillium.

Please let me know if any further changes are needed for some CI tests.

@Artyop
Copy link
Copy Markdown
Contributor

Artyop commented Feb 18, 2026

@shavyshetty you need to go fmt to have the go checks to pass

Copy link
Copy Markdown
Contributor

@smagnani96 smagnani96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor nits to please golangci-lint, then LGTM.
PR is way in a better shape than before, thanks for sticking with our feedback.

@Artyop
Copy link
Copy Markdown
Contributor

Artyop commented Feb 19, 2026

/test

Copy link
Copy Markdown
Contributor

@smagnani96 smagnani96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.
@pchaigno if you want double-check last impl.

@shavyshetty
Copy link
Copy Markdown
Author

shavyshetty commented Mar 2, 2026

@smagnani96 @pchaigno @joestringer
Hey folks, just checking if anything is pending on my end for the PR to be merged?

Copy link
Copy Markdown
Contributor

@smagnani96 smagnani96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smagnani96 @pchaigno @joestringer
Hey folks, just checking if anything is pending on my end for the PR to be merged?

LGTM, I'll start the tests 👍🏼

@smagnani96
Copy link
Copy Markdown
Contributor

/test

Copy link
Copy Markdown
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me. Just a couple minor things to fix on the commit description.

We observed CPU spikes in large clusters where all agents detect key file
changes simultaneously and update CiliumNode resources at once,
overwhelming the Kubernetes API server.

In order to fix this, add jitter into each agent to spread the control plane
load over a longer period so that the control plan is not overwhelmed
by many updates at the same time. When enabled, jitter is randomly selected
from [0, keyRotationDuration/10].

Fixes: cilium#42721
Signed-off-by: Sharvil Shetty <sharvilshetty.200@gmail.com>
@pchaigno
Copy link
Copy Markdown
Member

pchaigno commented Mar 3, 2026

/test

@joestringer
Copy link
Copy Markdown
Member

Thanks for the contribution @shavyshetty 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature/ipsec Relates to Cilium's IPsec feature kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CFP: Configurable Jitter for IPsec Key Rotation

6 participants