
Fix key rotation race condition by configuring XFRM before advertising new SPI to other nodes#44335

Merged
pchaigno merged 2 commits into cilium:main from daanvinken:fix-race-condition-key-rotation on Mar 9, 2026

Conversation

@daanvinken
Contributor

@daanvinken daanvinken commented Feb 13, 2026

Fix IPSec key rotation race by setting up XFRM states before advertising new key

During IPSec key rotation, the current code advertises the new key via the CiliumNode CRD
before configuring XFRM states. Under CPU contention on the control-plane or worker nodes,
completing AllNodeValidateImplementation() can take an extended amount of time, during which
peers may learn about the new key and start sending traffic encrypted with it. As I understand
it, we currently rely on eventual consistency. During this window of inconsistency, our ingress
XFRM states aren't ready yet, so these packets are dropped, causing
cilium_ipsec_xfrm_error{error="no_state", type="inbound"} errors and service disruption.

This was observed in one of our clusters after upgrading to Cilium 1.17, which doubled the number
of XFRM states due to encrypted overlay support (from ~1,620 to ~3,240 keys on a ~1,000-node
cluster). The increased CPU load during rotation widened the race window significantly.

According to the IPSec documentation:

During transition the new and old keys will be in use. The Cilium agent keeps per
endpoint data on which key is used by each endpoint and will use the correct key
if either side has not yet been updated.

This per-endpoint tracking relies on the BPF datapath selecting min(local_key, peer_key).
However, this assumes XFRM states are configured before peers learn about the new key.
When CiliumNode CRD propagation outpaces XFRM setup, peers select the new key but
the receiving node cannot decrypt it.

The fix swaps the order of operations in keyfileWatcher:

Before: load key → advertise → setup XFRM → update BPF
After:  load key → setup XFRM → advertise → update BPF

By setting up XFRM states before advertising, we ensure ingress is ready before peers
start sending traffic with the new key. The localNode.Update() call triggers an
async observer that updates the CiliumNode CRD, so peers only learn about the new key
after XFRM configuration is complete.

The BPF min-key selection logic ensures outbound traffic remains safe during the
transition, as the BPF encrypt map is updated last.
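The min-key selection rule described above can be sketched as a one-line comparison. This is a hypothetical Go illustration of the rule only; the actual logic lives in the BPF C datapath, and `minKeySPI` is not a real Cilium function.

```go
package main

import "fmt"

// minKeySPI illustrates the selection rule: during a rotation window,
// each side encrypts with the smaller of the two advertised SPIs, so
// traffic stays decryptable until both peers have converged on the new
// key. Hypothetical helper, not Cilium's actual BPF code.
func minKeySPI(localSPI, peerSPI uint8) uint8 {
	if peerSPI < localSPI {
		return peerSPI
	}
	return localSPI
}

func main() {
	// This node has already rotated to SPI 4; the peer still
	// advertises SPI 3, so both sides keep using 3 for now.
	fmt.Println(minKeySPI(4, 3))
}
```

With this rule, outbound traffic only flips to the new SPI once both sides advertise it, which is why updating the BPF encrypt map last keeps egress safe.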

Logging Improvements

This PR also adds logging to improve observability during key rotation:

| Level | Message | Location |
| --- | --- | --- |
| Info | "Loaded IPsec keyfile" | keyfileWatcher |
| Info | "Retrying XFRM state add after deleting conflicting state" | xfrmStateReplace |
| Debug | "Updated BPF encrypt map with new SPI" | setIPSecSPI |

This helps diagnose key rotation timing by showing the key load → XFRM setup → BPF update sequence.

Fixes: #29319

Fix IPSec key rotation race condition where packets were dropped due to XFRM states not being ready when peers started using the new key. Also adds logging for key rotation flow.

@daanvinken daanvinken requested a review from a team as a code owner February 13, 2026 13:24

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Feb 13, 2026
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Feb 13, 2026
@daanvinken daanvinken marked this pull request as draft February 13, 2026 13:25
@daanvinken daanvinken changed the title fix key rotation race by configuring XFRM before advertising [DRAFT: Need to validate my hypothesis] fix key rotation race by configuring XFRM before advertising Feb 13, 2026
@daanvinken
Contributor Author

@smagnani96 Could you share your thoughts on this? I had to dig into the internals to understand it and may be completely off.
It seems we ran into this during a key rotation on a CPU-saturated cluster, where a race condition meant nodes were not yet able to receive traffic with the new key, even though it had already been advertised.

@smagnani96
Contributor

Hey, when you start observing the issue, are the respective XFRM states on the node where the packets are being dropped still using the old SPI, or the new one?
Follow-up question: did you try using a greater value for --ipsec-key-rotation-duration?
I want to rule out the possibility that a node clears up its old XFRM state with the old SPI while still receiving traffic with that SPI from a peer.

@julianwiedmann
Member

From a quick look into the history, it looks like #29319 introduced this.

@pchaigno pchaigno added release-note/bug This PR fixes an issue in a previous release of Cilium. feature/ipsec Relates to Cilium's IPsec feature needs-backport/1.19 This PR / issue needs backporting to the v1.19 branch labels Feb 17, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Feb 17, 2026
@daanvinken daanvinken marked this pull request as ready for review February 19, 2026 21:06
@daanvinken
Contributor Author

daanvinken commented Feb 19, 2026

Thanks folks.

Let me answer first on --ipsec-key-rotation-duration: we already have this set to 10 minutes. From both logs and metrics, we can confirm the errors occur well within this window, specifically within the first ~90 seconds of key rotation, not after the 10-minute cleanup period. This rules out premature old-key cleanup as the cause, since that parameter only controls when old XFRM states are cleaned up, and those states would still be present during this timeframe.

Regarding old vs new SPI: Unfortunately, we cannot trace down the exact sequence of events to definitively confirm which SPI the dropped packets were using. We don't have packet captures from the incident, and cilium_ipsec_xfrm_error{error="no_state", type="inbound"} doesn't indicate the SPI of the rejected packets.

However, the timing strongly suggests this is about new SPI states not being ready, rather than old SPI states being cleaned early:

  1. Errors spike immediately when key rotation starts, correlating with "Loading IPsec keyfile" log entries.
  2. We observe a ~32-second gap between loading the keyfile and the CiliumNode CRD update under CPU load, suggesting AllNodeValidateImplementation() is CPU-starved. The CPU starvation is unrelated to this PR, but relates to a cascading cluster-wide red herring we hit with CFP: Configurable Jitter for IPsec Key Rotation #42721 / ipsec: Add configurable jitter to key file watcher to prevent thundering herd #44263.
  3. The entire rotation window across all nodes takes ~90 seconds, well within our 10-minute cleanup window.

On another note: it may be worth adding some logging for other steps during rotation. E.g. we're missing a log for a successful XFRM state update. The existing "Loading IPsec keyfile" and "Creating or updating CiliumNode resource" are the only success logs, and the latter is async and in a different subsystem. Wdyt?

It's hard, close to impossible, for me to reproduce this error on a high-traffic, large-scale cluster. How do you suggest we proceed? Given @julianwiedmann's mention of the previous PR, it seems this may have been an unintended change / regression. Any suggestions on how to test this better, if we deem that necessary?


@daanvinken daanvinken changed the title [DRAFT: Need to validate my hypothesis] fix key rotation race by configuring XFRM before advertising Fix key rotation race condition by configuring XFRM before advertising new SPI to other nodes Feb 19, 2026
Contributor

@smagnani96 smagnani96 left a comment


PR LGTM.
Left a couple of comments for cleaning history and signing commit, but change itself is good.

Let me answer first on --ipsec-key-rotation-duration: ... We already have this set to 10 minutes. From both logs and metrics, we can confirm the errors occur well within this window, specifically within the first ~90 seconds of key rotation, not after the 10-minute cleanup period. This rules out premature old-key cleanup as the cause, since that parameter only controls when old XFRM states are cleaned up, and those states would still be present during this timeframe.

Yep, if that is the case then we're most likely not ready with the new key in the receiving node.

On another note: it may be worth adding some logging for other steps during rotation. E.g. we're missing a log for a successful XFRM state update. The existing "Loading IPsec keyfile" and "Creating or updating CiliumNode resource" are the only success logs, and the latter is async and in a different subsystem. Wdyt?

Adding Debug logs never hurts, given they add context and help with debugging issues. An Info log (per-state/policy addition/removal) sounds too verbose.

It's hard, close to impossible, for me to reproduce this error on a high-traffic, large-scale cluster. How do you suggest we proceed? Given @julianwiedmann's mention of the previous PR, it seems this may have been an unintended change / regression. Any suggestions on how to test this better, if we deem that necessary?

I'm aligned with Julian. The fix itself makes sense, as it restores the previous behavior. With the fix, the flow in the keyfileWatcher function (for a valid new key) should be the following:

  1. loadIPSecKeysFile: loads the key, updates ipSecCurrentKeySPI, and sets the new key in ipSecKeysGlobal
  2. AllNodeValidateImplementation: this will then call ipsec.UpsertIPsecEndpoint(), which uses getGlobalIPsecKey. The spi returned here and used for new states/policies is the new global key from ipSecKeysGlobal.
  3. localNode.Update: now advertises the new SPI
  4. setIPSecSPI: updates the SPI in the BPF map as well.
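The four steps above can be sketched as an ordered sequence. These are stubbed, hypothetical helpers whose names only mirror the Cilium functions named in the list; the point is purely the ordering (XFRM setup strictly before advertisement, BPF map last).

```go
package main

import (
	"fmt"
	"strings"
)

var steps []string

// Stub helpers standing in for the real Cilium functions; each just
// records that its step ran, in order.
func loadIPSecKeysFile() uint8 {
	steps = append(steps, "load key")
	return 4 // pretend the keyfile carried SPI 4
}
func allNodeValidate(spi uint8) { steps = append(steps, "setup XFRM") } // ingress states first
func advertiseSPI(spi uint8)    { steps = append(steps, "advertise") }  // CiliumNode CRD update
func setIPSecSPI(spi uint8)     { steps = append(steps, "update BPF") } // encrypt map last

// rotate runs the fixed ordering from the list above.
func rotate() string {
	steps = nil
	spi := loadIPSecKeysFile()
	allNodeValidate(spi) // peers cannot yet see the new SPI
	advertiseSPI(spi)    // ingress is ready before anyone learns the key
	setIPSecSPI(spi)     // egress flips only at the very end
	return strings.Join(steps, " -> ")
}

func main() {
	fmt.Println(rotate())
}
```

Running this prints the fixed sequence, which is exactly the "After" ordering from the PR description.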

@daanvinken
Contributor Author

daanvinken commented Feb 20, 2026

Awesome thanks, cleaning up PR.

I added some debug logs, please lmk if you have any additional thoughts

@daanvinken daanvinken force-pushed the fix-race-condition-key-rotation branch from 29e6f8a to ec78a0c Compare February 20, 2026 15:42
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Feb 20, 2026
@daanvinken daanvinken force-pushed the fix-race-condition-key-rotation branch from ec78a0c to f4fa4ed Compare February 24, 2026 17:15
@daanvinken
Contributor Author

@smagnani96 all done!

@smagnani96
Contributor

@smagnani96 all done!

Thanks. There are some CI flakes which are not related to this PR, but before I re-run them I'd kindly ask you to address https://github.com/cilium/cilium/actions/runs/22361850515/job/64876940663?pr=44335#step:5:38

  Error: pkg/datapath/linux/ipsec/ipsec_linux.go:1246:4: arguments should be put on separate lines (sloglint)
      a.log.Info("Loaded IPsec keyfile", logfields.SPI, spi, logfields.Path, keyfilePath)

@daanvinken daanvinken force-pushed the fix-race-condition-key-rotation branch from f4fa4ed to 0213453 Compare March 2, 2026 15:29
@daanvinken
Contributor Author

Missed that one, done.

Contributor

@smagnani96 smagnani96 left a comment


#44114 got merged, so I think this needs a rebase on top of a fresh main to make CI happy (Smoke tests).

Set up XFRM states before updating LocalNode.EncryptionKey to ensure
ingress is ready before peers learn about the new key via CiliumNode CRD.

Fixes packet drops during rotation under CPU contention where
AllNodeValidateImplementation() takes longer than CRD propagation.

Signed-off-by: Daan Vinken <daanvinken@tythus.com>

Add Info log when keyfile is loaded, Info logs for XFRM state
conflict resolution in xfrmStateReplace, and a Debug log when
the BPF encrypt map is updated.

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
@daanvinken daanvinken force-pushed the fix-race-condition-key-rotation branch from 0213453 to 14edd1c Compare March 3, 2026 21:41
@smagnani96
Contributor

/test

@daanvinken
Contributor Author

@smagnani96 what to do with these tests? Getting a lot of emails on failed tests :D
Lmk if there's anything I can do to help out.

@smagnani96
Contributor

@smagnani96 what to do with these tests? Getting a lot of emails on failed tests :D Lmk if there's anything I can do to help out.

I double-checked and I believe these are CI flakes unrelated to this PR.
Re-running the failing tests. Thanks for the ping, I sometimes miss notifications.

@pchaigno pchaigno enabled auto-merge March 9, 2026 10:26
@pchaigno pchaigno added this pull request to the merge queue Mar 9, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 9, 2026
Merged via the queue into cilium:main with commit 4b1f0b2 Mar 9, 2026
78 checks passed
@smagnani96
Contributor

Thanks for the contribution @daanvinken 🎉

@tommyp1ckles tommyp1ckles mentioned this pull request Mar 9, 2026
@tommyp1ckles tommyp1ckles added backport-pending/1.19 The backport for Cilium 1.19.x for this PR is in progress. and removed needs-backport/1.19 This PR / issue needs backporting to the v1.19 branch labels Mar 9, 2026
@julianwiedmann julianwiedmann added affects/v1.16 This issue affects v1.16 branch affects/v1.17 This issue affects v1.17 branch affects/v1.18 This issue affects v1.18 branch labels Mar 10, 2026
@github-actions github-actions bot added backport-done/1.19 The backport for Cilium 1.19.x for this PR is done. and removed backport-pending/1.19 The backport for Cilium 1.19.x for this PR is in progress. labels Mar 12, 2026
