daemon: Fail agent startup on incompatible datapath mode#42482

Merged
tklauser merged 1 commit into main from pr/HadrienPatte/mismatch-datapath-fail
Oct 31, 2025
Member

@HadrienPatte HadrienPatte commented Oct 29, 2025

Cilium does not currently support migrating a live node with pods between netkit and veth datapath modes (ref). However, there are currently no safety mechanisms to guard against it, so it is possible to accidentally switch the datapath mode on a node and end up in an undefined and unexpected state.

This PR aims to ensure this cannot happen by adding a check to the endpoint restore logic that compares each restored endpoint's link type against the configured datapath mode and makes the agent crash on a mismatch.

The new errors look like this:
image

For context, we accidentally enabled netkit on nodes that had existing veth endpoints, and it took us a while to identify this as the cause: what we observed was a high volume of dropped ARP packets with reason `Unsupported L3 protocol` and protocol `unknown l4`. This happens because netkit runs in L3 mode, so netkit BPF programs are compiled without ARP support.
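
To make the restore-time check described above concrete, here is a minimal sketch in Go. It is not the code from this PR: `restoredEndpoint`, `validateRestoredEndpoints`, and the mode constants are hypothetical names chosen for illustration, and the sketch assumes that a restored endpoint's link type can be compared directly with the configured mode string, considering only the two modes named in this description.

```go
package main

import "fmt"

// Datapath modes, using the names from the PR description.
// (Illustrative constants, not Cilium's own identifiers.)
const (
	modeVeth   = "veth"
	modeNetkit = "netkit"
)

// restoredEndpoint is a hypothetical stand-in for an endpoint read back
// from disk at agent startup; it is not Cilium's actual endpoint type.
type restoredEndpoint struct {
	ID       uint16
	LinkType string // kind of the endpoint's link, e.g. "veth" or "netkit"
}

// validateRestoredEndpoints returns an error for the first restored endpoint
// whose link type does not match the configured datapath mode, and nil when
// every endpoint matches (or when there is nothing to restore).
func validateRestoredEndpoints(configuredMode string, eps []restoredEndpoint) error {
	for _, ep := range eps {
		if ep.LinkType != configuredMode {
			return fmt.Errorf(
				"endpoint %d has link type %q but the configured datapath mode is %q: "+
					"migrating live endpoints between datapath modes is not supported",
				ep.ID, ep.LinkType, configuredMode)
		}
	}
	return nil
}
```

A caller in the startup path would treat a non-nil error as fatal, for example by logging it and exiting, so that a node with existing veth endpoints never comes up with netkit programs attached.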

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 29, 2025
@HadrienPatte
Member Author

/test

@HadrienPatte HadrienPatte marked this pull request as ready for review October 29, 2025 23:21
@HadrienPatte HadrienPatte requested a review from a team as a code owner October 29, 2025 23:21
@HadrienPatte HadrienPatte changed the title daemon: Fail endpoint startup on incompatible datapath mode daemon: Fail agent startup on incompatible datapath mode Oct 29, 2025
@HadrienPatte HadrienPatte force-pushed the pr/HadrienPatte/mismatch-datapath-fail branch from a781e74 to bdc997e Compare October 30, 2025 10:48
@HadrienPatte HadrienPatte requested a review from a team as a code owner October 30, 2025 10:48
@HadrienPatte HadrienPatte requested a review from thorn3r October 30, 2025 10:48
@HadrienPatte
Member Author

/test

@HadrienPatte HadrienPatte force-pushed the pr/HadrienPatte/mismatch-datapath-fail branch 2 times, most recently from 6830c77 to 0903138 Compare October 30, 2025 17:13
Member

@fristonio fristonio left a comment

LGTM, Thanks.

@HadrienPatte HadrienPatte requested a review from tklauser October 30, 2025 17:53
@HadrienPatte
Member Author

/test

@tklauser tklauser enabled auto-merge October 31, 2025 10:02
@tklauser tklauser added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Oct 31, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 31, 2025
@tklauser tklauser added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/daemon Impacts operation of the Cilium daemon. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Oct 31, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 31, 2025
Cilium does not currently support migrating a live node with pods between `netkit` and `veth` datapath modes ([ref](https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode)). However, there are currently no safety mechanisms to guard against it, so it is possible to accidentally switch the datapath mode on a node and end up in an undefined and unexpected state.

This PR aims to ensure this cannot happen by adding a check to the endpoint restore logic that compares each restored endpoint's link type against the configured datapath mode and makes the agent crash on a mismatch.

Signed-off-by: Hadrien Patte <hadrien.patte@datadoghq.com>
@HadrienPatte HadrienPatte force-pushed the pr/HadrienPatte/mismatch-datapath-fail branch from 0903138 to c23a8c0 Compare October 31, 2025 12:53
@HadrienPatte
Member Author

/test

@HadrienPatte HadrienPatte removed the request for review from thorn3r October 31, 2025 14:10
@tklauser tklauser added this pull request to the merge queue Oct 31, 2025
Merged via the queue into main with commit 4791e24 Oct 31, 2025
358 of 363 checks passed
@tklauser tklauser deleted the pr/HadrienPatte/mismatch-datapath-fail branch October 31, 2025 15:46
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Oct 31, 2025
@cilium-release-bot cilium-release-bot bot moved this to Released in cilium v1.19.0 Feb 3, 2026
Labels

area/daemon Impacts operation of the Cilium daemon. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.

Projects

No open projects
Status: Released

3 participants