Fix key rotation race condition by configuring XFRM before advertising new SPI to other nodes#44335
Conversation
@smagnani96 Could you share your thoughts on this? I had to dig into the internals to understand this and may be completely off.

Hey, when you start observing the issue, are the respective XFRM states in the node where the packets are being dropped still using the old SPI or the new one?

From a quick look into the history, it looks like #29319 introduced this.
Thanks folks. Regarding old vs new SPI: unfortunately, we cannot trace down the exact sequence of events to definitively confirm which SPI the dropped packets were using — we don't have packet captures from the incident. However, the timing strongly suggests this is about new SPI states not being ready, rather than old SPI states being cleaned up early.

On another note; it may be worth it to have us add some logging for other steps during rotation. E.g. we're missing a success log for updating XFRM state. The existing "Loading IPsec keyfile" and "Creating or updating CiliumNode resource" are the only success logs, where the latter is async and in a different subsystem.

It's hard / close to impossible for me to reproduce this error on a high-traffic large-scale cluster. How do you suggest we proceed? With @julianwiedmann's mention on the previous PR, it seems this may have been an unintended change/regression. Any suggestions to test this better if we deem necessary?
smagnani96
left a comment
PR LGTM.
Left a couple of comments about cleaning up the history and signing commits, but the change itself is good.
> Let me answer first on --ipsec-key-rotation-duration: ... We already have this set to 10 minutes. From both logs and metrics, we can confirm the errors occur well within this window, specifically within the first ~90 seconds of key rotation, not after the 10-minute cleanup period. This rules out premature old-key cleanup as the cause, since that parameter only controls when old XFRM states are cleaned up, and those states would still be present during this timeframe.
Yep, if that is the case then we're most likely not ready with the new key in the receiving node.
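For context, a minimal example of setting the flag discussed above on the agent. The invocation is illustrative (in practice this is usually configured via Helm/ConfigMap rather than by running the binary directly); the 10m value matches what the reporter describes.

```shell
# Keep old XFRM states around for 10 minutes after a rotation so
# in-flight traffic encrypted with the previous key still decrypts.
cilium-agent --enable-ipsec --ipsec-key-rotation-duration=10m
```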
> On another note; it may be worth it to have us add some logging for other steps during rotation. E.g. we're missing a success log for updating XFRM state. The existing "Loading IPsec keyfile" and "Creating or updating CiliumNode resource" are the only success logs, where the latter is async and in a different subsystem. Wdyt?
Adding Debug logs never hurts, given they give more context and help debugging issues. An Info log (per-state/policy addition/removal) sounds too verbose.
> It's hard / close to impossible for me to reproduce this error on a high-traffic large-scale cluster. How do you suggest we proceed? With @julianwiedmann's mention on the previous PR, it seems this may have been an unintended change/regression. Any suggestions to test this better if we deem necessary?
I'm aligned with Julian. The fix itself makes sense, as it restores the previous behavior. With the fix, the flow in the keyfileWatcher function (for a valid new key) should be the following:
1. `loadIPSecKeysFile`: loads the key, updates `ipSecCurrentKeySPI` and sets the new key in `ipSecKeysGlobal`.
2. `AllNodeValidateImplementation`: this will then call `ipsec.UpsertIPsecEndpoint()`, which will use `getGlobalIPsecKey`. The SPI returned here and used for new states/policies is the new global key in `ipSecKeysGlobal`.
3. `localNode.Update`: now advertise the new SPI.
4. `setIPSecSPI`: update the SPI also in the BPF map.
Awesome, thanks, cleaning up the PR. I added some debug logs, please lmk if you have any additional thoughts.
force-pushed 29e6f8a to ec78a0c
force-pushed ec78a0c to f4fa4ed
@smagnani96 all done!
Thanks. There are some CI flakes which are not related to this PR, but before re-running them I'd kindly ask you to address https://github.com/cilium/cilium/actions/runs/22361850515/job/64876940663?pr=44335#step:5:38

Error: pkg/datapath/linux/ipsec/ipsec_linux.go:1246:4: arguments should be put on separate lines (sloglint)
a.log.Info("Loaded IPsec keyfile", logfields.SPI, spi, logfields.Path, keyfilePath)
force-pushed f4fa4ed to 0213453
Missed that one, done.
smagnani96
left a comment
#44114 got merged, so I think this needs a rebase on top of a fresh main to make CI happy (smoke tests).
Set up XFRM states before updating LocalNode.EncryptionKey to ensure ingress is ready before peers learn about the new key via CiliumNode CRD. Fixes packet drops during rotation under CPU contention where AllNodeValidateImplementation() takes longer than CRD propagation. Signed-off-by: Daan Vinken <daanvinken@tythus.com>
Add Info log when keyfile is loaded, Info logs for XFRM state conflict resolution in xfrmStateReplace, and a Debug log when the BPF encrypt map is updated. Signed-off-by: Daan Vinken <daanvinken@tythus.com>
force-pushed 0213453 to 14edd1c
/test

@smagnani96 what to do with these tests? Getting a lot of emails on failed tests :D
I double-checked and I believe these are CI flakes unrelated to this PR. |
Thanks for the contribution @daanvinken 🎉
Fix IPSec key rotation race by setting up XFRM states before advertising new key
During IPSec key rotation, the current code advertises the new key via the CiliumNode CRD before configuring XFRM states. Under CPU contention on the control plane / worker nodes, completing `AllNodeValidateImplementation()` can take an extended amount of time, during which peers may learn about the new key and start sending traffic encrypted with it. In my understanding, we currently rely on eventual consistency. During this window of inconsistency, our ingress XFRM states aren't ready yet, so these packets are dropped, causing `cilium_ipsec_xfrm_error{error="no_state", type="inbound"}` errors and service disruption.
This was observed in one of our clusters after upgrading to Cilium 1.17, which doubled the number of XFRM states due to encrypted-overlay support (from ~1,620 to ~3,240 keys on a ~1,000-node cluster). The increased CPU load during rotation widened the race window significantly.
According to the IPSec documentation:

> This per-endpoint tracking relies on the BPF datapath selecting min(local_key, peer_key).

However, this assumes XFRM states are configured before peers learn about the new key. When CiliumNode CRD propagation outpaces XFRM setup, peers select the new key but the receiving node cannot decrypt it.
The fix swaps the order of operations in `keyfileWatcher`: by setting up XFRM states before advertising, we ensure ingress is ready before peers start sending traffic with the new key. The `localNode.Update()` call triggers an async observer that updates the CiliumNode CRD, so peers only learn about the new key after XFRM configuration is complete.

The BPF min-key selection logic ensures outbound traffic remains safe during the transition, as the BPF encrypt map is updated last.
Logging Improvements
This PR also adds logging to improve observability during key rotation:
- `keyfileWatcher`
- `xfrmStateReplace`
- `setIPSecSPI`

This helps diagnose key rotation timing by showing the sequence: key load → XFRM setup → BPF update.
Fixes: #29319