bpf:nat: Fix DSR support for remote NodePort services#41963
bpf:nat: Fix DSR support for remote NodePort services#41963julianwiedmann merged 4 commits intocilium:mainfrom
Conversation
f0afa83 to
ee141a1
Compare
|
👋 just to confirm - this is about the scenario where SocketLB is disabled, and a pod accesses a remote Nodeport service which selects a DSR backend on the client's node? I could have sworn that we already have an issue for this somewhere 🙂. Ideally we'd solve this by addressing #34777. |
@julianwiedmann For the wildcard lookups, I preferred not to reuse the SocketLB functions, so I factored them into a separate lookup function. |
ee141a1 to
ad6ba19
Compare
|
(adding the |
There was a problem hiding this comment.
Thanks for your contribution, and the detailed PR description.
IIUC, there are two main changes in the PR: (1) Perform LB wildcard svc lookup and xlate for nodeport-vip:port at the source node for DSR. (2) Create CT mappings for the xlate'd traffic so that it can be reverse xlate'd.
A high-level question: Is there a reason why (1) is implemented only for DSR? Why not align it with the socket-LB implementation for all cases?
I’d also suggest extracting the service wildcard lookup code into a helper function that explicitly highlights the functionality (similar to socket-LB). This would make it easier to draw parallels with the socket-LB implementation for feature parity.
|
@julianwiedmann @aditighag |
@aditighag Initially, I aimed to make the wildcard svc lookup generic, similar to socket-LB. However I wasn't fully confident it was safe to broaden the scope. The first iteration enables it only for DSR to keep the blast radius small. I agree that a generic approach is better. I will revise the patch to make the lookup generic. |
acd33b5 to
6061c03
Compare
|
@julianwiedmann @aditighag
Could you please take a look again? |
6061c03 to
53522cb
Compare
|
@pchaigno |
53522cb to
12a9a1c
Compare
|
(rebased branch again to facilitate testing) |
|
@pchaigno |
12a9a1c to
2267935
Compare
Happy to help, this was not an easy one :). I think we should try & backport this to |
@julianwiedmann I've updated the PR description with the release note as requested, and added a diagram to illustrate the issue. Please let me know if it looks good! I'll prepare the backport to v1.19 after soak period. |
Thank you! I think we should also emphasize that this (1) affects in-cluster connectivity (where folks could/should use the ClusterIP), and (2) only happens when SocketLB is disabled. |
@julianwiedmann Thanks for the feedback! Updated the release note to clarify that this only affects in-cluster NodePort access (rather than ClusterIP) with SocketLB disabled. |
| */ | ||
| if (!ct_has_egress_entry4(get_ct_map4(&tmp), &tmp)) { | ||
| svc = lb4_lookup_wildcard_nodeport_service(&key); | ||
| if (svc && !nodeport_uses_dsr4(svc)) |
There was a problem hiding this comment.
From unrelated discussion - we could also check for SVC_FLAG_NODEPORT here, to be absolutely sure of what we're handling. Assuming that the controlplane sets things up as expected.
Maybe something to look into when relaxing the DSR requirement.
#41963 added support for wildcard matching of Nodeport services in bpf_lxc's per-packet LB path. When accessing a DSR Nodeport service on a remote node, the LB now happens at the source. This addressed the scenario where client and backend sit on the same node, and therefore reply traffic would never pass through a native interface hook for RevDNAT. Extend this logic to also cover non-DSR Nodeport services. They are not affected by the mentioned bug (since replies would first go back to the intermediate LB node) - but doing the LB at the source still provides benefits. It allows for precise network policy, and saves resources on the intermediate node. In detail the Network policy implications are: 1. client egress policy needs to allows access to the service's backends. But no longer needs to allow access to remote nodes. 2. backend ingress policy needs to allow access by client pods. But no longer needs to allow WORLD for serving in-cluster clients. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
#41963 added support for wildcard matching of Nodeport services in bpf_lxc's per-packet LB path. When accessing a DSR Nodeport service on a remote node, the LB now happens at the source. This addressed the scenario where client and backend sit on the same node, and therefore reply traffic would never pass through a native interface hook for RevDNAT. Extend this logic to also cover non-DSR Nodeport services. They are not affected by the mentioned bug (since replies would first go back to the intermediate LB node) - but doing the LB at the source still provides benefits. It allows for precise network policy, and saves resources on the intermediate node. In detail the Network policy implications are: 1. client egress policy needs to allows access to the service's backends. But no longer needs to allow access to remote nodes. 2. backend ingress policy needs to allow access by client pods. But no longer needs to allow WORLD for serving in-cluster clients. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
ldelossa
left a comment
There was a problem hiding this comment.
Nice change, while policy changes will be needed to communicate thing, translation at source fits our other models much better 👍.
|
@ldelossa Thanks for the review and the kind words! I'm really glad to hear that this "translation at source" approach aligns well with the broader datapath architecture 👍. |
#41963 added support for wildcard matching of Nodeport services in bpf_lxc's per-packet LB path. When accessing a DSR Nodeport service on a remote node, the LB now happens at the source. This addressed the scenario where client and backend sit on the same node, and therefore reply traffic would never pass through a native interface hook for RevDNAT. Extend this logic to also cover non-DSR Nodeport services. They are not affected by the mentioned bug (since replies would first go back to the intermediate LB node) - but doing the LB at the source still provides benefits. It allows for precise network policy, and saves resources on the intermediate node. In detail the Network policy implications are: 1. client egress policy needs to allows access to the service's backends. But no longer needs to allow access to remote nodes. 2. backend ingress policy needs to allow access by client pods. But no longer needs to allow WORLD for serving in-cluster clients. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
cilium#41963 added support for wildcard matching of Nodeport services in bpf_lxc's per-packet LB path. When accessing a DSR Nodeport service on a remote node, the LB now happens at the source. This addressed the scenario where client and backend sit on the same node, and therefore reply traffic would never pass through a native interface hook for RevDNAT. Extend this logic to also cover non-DSR Nodeport services. They are not affected by the mentioned bug (since replies would first go back to the intermediate LB node) - but doing the LB at the source still provides benefits. It allows for precise network policy, and saves resources on the intermediate node. In detail the Network policy implications are: 1. client egress policy needs to allows access to the service's backends. But no longer needs to allow access to remote nodes. 2. backend ingress policy needs to allow access by client pods. But no longer needs to allow WORLD for serving in-cluster clients. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
This code was initially introduced as part of the hs-ipcache feature (#25745, #25854). It allowed for the XDP path to peek into GENEVE-encapsulated traffic, LB it in-place (using DSR-GENEVE mode), and redirect the packet straight out to the backend. But with HS-ipcache gone, this code path is pretty much unused. It *might* have seen some unintentional usage for E/W NodePort access when SocketLB is disabled - but with #41963, we now also LB at the source in such configurations. Either way, it's perfectly fine to remove this optimization. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
This code was initially introduced as part of the hs-ipcache feature (#25745, #25854). It allowed for the XDP path to peek into GENEVE-encapsulated traffic, LB it in-place (using DSR-GENEVE mode), and redirect the packet straight out to the backend. But with HS-ipcache gone, this code path is pretty much unused. It *might* have seen some unintentional usage for E/W NodePort access when SocketLB is disabled - but with #41963, we now also LB at the source in such configurations. Either way, it's perfectly fine to remove this optimization. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
This code was initially introduced as part of the hs-ipcache feature (#25745, #25854). It allowed for the XDP path to peek into GENEVE-encapsulated traffic, LB it in-place (using DSR-GENEVE mode), and redirect the packet straight out to the backend. But with HS-ipcache gone, this code path is pretty much unused. It *might* have seen some unintentional usage for E/W NodePort access when SocketLB is disabled - but with #41963, we now also LB at the source in such configurations. Either way, it's perfectly fine to remove this optimization. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
This code was initially introduced as part of the hs-ipcache feature (#25745, #25854). It allowed for the XDP path to peek into GENEVE-encapsulated traffic, LB it in-place (using DSR-GENEVE mode), and redirect the packet straight out to the backend. But with HS-ipcache gone, this code path is pretty much unused. It *might* have seen some unintentional usage for E/W NodePort access when SocketLB is disabled - but with #41963, we now also LB at the source in such configurations. Either way, it's perfectly fine to remove this optimization. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Please ensure your pull request adheres to the following guidelines:
description and a
Fixes: #XXXline if the commit addresses a particularGitHub issue.
Fixes: <commit-id>tag, thenplease add the commit author[s] as reviewer[s] to this issue.
Problem Statement
Root Cause
Solution Summary
Key Consideration
Release Note