High-scale IPCache: Nodeport LB support Part 1 #25745
Merged
julianwiedmann merged 6 commits into cilium:main on Jun 1, 2023
ddcd1aa to 44a1742
/test
Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed.
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/279/
If it is a flake and a GitHub issue doesn't already exist to track it, comment [...]. Then please upload the Jenkins artifacts to that issue.
The custom decap code for high-scale ipcache in from-netdev transfers the source's sec_identity from the tunnel header into CB_SRC_LABEL. But from-overlay initializes its src_sec_identity variable with 0, and only loads from CB_SRC_LABEL inside handle_ipv4().

In a config with BPF masquerading, the packet passes through nodeport_lb4() first - which stores the passed-in src_sec_identity (== 0) into CB_SRC_LABEL, checks for revSNAT and then tail-calls back to the start of the IPv4 from-overlay path. Thus we currently lose the sec_identity, e.g. for pod-to-pod connections, when BPF masquerading is enabled.

Align this a bit closer with how bpf_host is working - load from CB_SRC_LABEL at the beginning of the tail-call, and clear it. This way we can feed the src_sec_identity into the call to nodeport_lb4(), where it then gets restored to CB_SRC_LABEL before tail-calling back. If needed, l3_local_delivery() will subsequently fill CB_SRC_LABEL again before redirecting to the local endpoint.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
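The ordering problem described in this commit can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct toy_ctx` and every `toy_*` function are hypothetical stand-ins for the skb control buffer slot and the BPF programs, not Cilium code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the skb->cb slot that Cilium calls CB_SRC_LABEL. */
struct toy_ctx {
	uint32_t cb_src_label;
};

/* Simplified model of nodeport_lb4(): it stores the identity it was handed
 * into the CB slot before tail-calling back to the IPv4 path. */
static void toy_nodeport_lb4(struct toy_ctx *ctx, uint32_t src_sec_identity)
{
	assert(ctx != NULL);
	ctx->cb_src_label = src_sec_identity;
}

/* Old ordering: from-overlay starts with identity 0 and passes that on,
 * so the value placed in the CB slot by the decap code is overwritten. */
static uint32_t toy_from_overlay_old(struct toy_ctx *ctx)
{
	uint32_t src_sec_identity = 0;

	toy_nodeport_lb4(ctx, src_sec_identity);
	/* handle_ipv4() would only load from the CB slot at this point. */
	return ctx->cb_src_label;
}

/* New ordering: load and clear the CB slot at the start of the tail call,
 * then feed the loaded identity into the nodeport LB. */
static uint32_t toy_from_overlay_new(struct toy_ctx *ctx)
{
	uint32_t src_sec_identity = ctx->cb_src_label;

	ctx->cb_src_label = 0;
	toy_nodeport_lb4(ctx, src_sec_identity);
	return ctx->cb_src_label;
}
```

With a context whose slot was set by the decap path, the old ordering yields 0 while the new ordering yields the original identity.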
Determine the tunnel endpoint earlier, and initialize the GENEVE option struct earlier. This is just prep work for a subsequent patch, no functional change.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
When using the nodeport LB in GENEVE-DSR mode with hs-ipcache, don't rely on the ipcache to select the GENEVE tunnel endpoint for the selected backend. Use the InnerDstIP (== backend IP) instead, same as we do for pod-to-pod traffic.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
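A minimal sketch of the endpoint selection this commit describes, assuming a toy lookup table in place of the real ipcache map. All names (`toy_ipcache_entry`, `toy_select_tunnel_endpoint`, the `high_scale_ipcache` flag) are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, flattened stand-in for an ipcache entry. */
struct toy_ipcache_entry {
	uint32_t pod_ip;
	uint32_t tunnel_endpoint;
};

/* Linear lookup; returns 0 when the IP is unknown. */
static uint32_t toy_ipcache_lookup(const struct toy_ipcache_entry *tbl,
				   size_t n, uint32_t ip)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].pod_ip == ip)
			return tbl[i].tunnel_endpoint;
	}
	return 0;
}

/* In high-scale mode the outer destination is simply the backend IP
 * (the InnerDstIP); otherwise the table decides the tunnel endpoint. */
static uint32_t toy_select_tunnel_endpoint(bool high_scale_ipcache,
					   const struct toy_ipcache_entry *tbl,
					   size_t n, uint32_t backend_ip)
{
	assert(tbl != NULL || n == 0);

	if (high_scale_ipcache)
		return backend_ip;

	return toy_ipcache_lookup(tbl, n, backend_ip);
}
```

The point of the sketch is that the high-scale branch never consults the table, so it works even when the backend has no ipcache entry.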
Prefer DROP reasons over a raw CTX_ACT_DROP.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
A backend node in hs-ipcache mode currently strips off the tunnel headers and manually redirects the packet to cilium_geneve. The SrcSecID is transferred via CB_SRC_LABEL.

For GENEVE-DSR we also need to transfer the DSR option, so that the nodeport_lb4() call in from-overlay can process it as usual. But we can't use ctx_set_tunnel_opt() for this, as the metadata_dst will be scrubbed from the skb when redirecting to cilium_geneve's Ingress.

Transfer the DSR info via skb->cb instead, and copy it back to a metadata_dst in from-overlay so that things look identical for the nodeport DSR code.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
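The hand-off across the redirect might be modelled roughly as follows. This is a userspace sketch, not the actual patch: the slot indices, `struct toy_skb`, the `toy_*` functions, and the convention that a zero address means "no DSR info" are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical DSR option payload: original service address and port. */
struct toy_dsr_opt {
	uint32_t addr;
	uint16_t port;
};

/* Hypothetical slot assignment inside the control buffer. */
enum {
	TOY_CB_SRC_LABEL = 0,
	TOY_CB_DSR_ADDR  = 1,
	TOY_CB_DSR_PORT  = 2,
};

/* Toy packet: a five-word control buffer plus tunnel metadata. */
struct toy_skb {
	uint32_t cb[5];
	bool has_tunnel_opt;
	struct toy_dsr_opt tunnel_opt;
};

/* from-netdev side: stash identity and DSR info in the control buffer. */
static void toy_from_netdev_decap(struct toy_skb *skb, uint32_t src_sec_id)
{
	assert(skb != NULL);
	skb->cb[TOY_CB_SRC_LABEL] = src_sec_id;
	if (skb->has_tunnel_opt) {
		skb->cb[TOY_CB_DSR_ADDR] = skb->tunnel_opt.addr;
		skb->cb[TOY_CB_DSR_PORT] = skb->tunnel_opt.port;
	}
}

/* Models the redirect: tunnel metadata is scrubbed, cb[] survives. */
static void toy_redirect_to_ingress(struct toy_skb *skb)
{
	skb->has_tunnel_opt = false;
	memset(&skb->tunnel_opt, 0, sizeof(skb->tunnel_opt));
}

/* from-overlay side: rebuild the tunnel option from cb[] and clear the
 * slots. A zero address is treated as "no DSR info" (an assumption). */
static void toy_from_overlay_restore(struct toy_skb *skb)
{
	if (!skb->cb[TOY_CB_DSR_ADDR])
		return;

	skb->tunnel_opt.addr = skb->cb[TOY_CB_DSR_ADDR];
	skb->tunnel_opt.port = (uint16_t)skb->cb[TOY_CB_DSR_PORT];
	skb->has_tunnel_opt = true;
	skb->cb[TOY_CB_DSR_ADDR] = 0;
	skb->cb[TOY_CB_DSR_PORT] = 0;
}
```

After the restore step, code that reads the tunnel option sees the same data it would have seen had the metadata never been scrubbed.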
When a DSR backend replies back to the client in a hs-ipcache configuration, it potentially uses tunnel encapsulation (based on the configured WorldCIDR). RevDNAT for the reply is then handled in to-overlay.

To match the LB path (where both the inner and outer DstIP were set to the service IP), we should also revDNAT the outer SrcIP. As we're in the to-overlay program, the SrcIP is stored in the tunnel_key.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
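As a rough illustration of rewriting both source addresses on the reply, the sketch below uses a hypothetical `toy_tunnel_key` with `local_ipv4` standing in for the outer SrcIP. The struct layout and the `toy_rev_dnat` helper are assumptions for this example and do not reproduce the real revDNAT code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tunnel key: local_ipv4 models the outer SrcIP,
 * remote_ipv4 the outer DstIP. */
struct toy_tunnel_key {
	uint32_t local_ipv4;
	uint32_t remote_ipv4;
};

/* Toy reply packet as seen in to-overlay. */
struct toy_reply {
	uint32_t inner_saddr;
	struct toy_tunnel_key key;
};

/* Rewrites backend IP -> service IP on the inner header and on the outer
 * source held in the tunnel key. Returns true when a rewrite happened. */
static bool toy_rev_dnat(struct toy_reply *r, uint32_t backend_ip,
			 uint32_t service_ip)
{
	assert(r != NULL);

	if (r->inner_saddr != backend_ip)
		return false;

	r->inner_saddr = service_ip;
	r->key.local_ipv4 = service_ip; /* outer SrcIP mirrors the inner one */
	return true;
}
```

The outer destination is left untouched; only the two source addresses change, mirroring how the request path set both destination addresses to the service IP.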
44a1742 to 34063bc
Rebased on top of #24422.
/test
net-next failed in
/test-1.26-net-next
ldelossa approved these changes on Jun 1, 2023
Changes look good to me. 👍
julianwiedmann added a commit that referenced this pull request on Mar 26, 2026

This code was initially introduced as part of the hs-ipcache feature (#25745, #25854). It allowed for the XDP path to peek into GENEVE-encapsulated traffic, LB it in-place (using DSR-GENEVE mode), and redirect the packet straight out to the backend.

But with HS-ipcache gone, this code path is pretty much unused. It *might* have seen some unintentional usage for E/W NodePort access when SocketLB is disabled - but with #41963, we now also LB at the source in such configurations. Either way, it's perfectly fine to remove this optimization.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request on Mar 31, 2026

This code was initially introduced as part of the hs-ipcache feature (#25745, #25854). It allowed for the XDP path to peek into GENEVE-encapsulated traffic, LB it in-place (using DSR-GENEVE mode), and redirect the packet straight out to the backend.

But with HS-ipcache gone, this code path is pretty much unused. It *might* have seen some unintentional usage for E/W NodePort access when SocketLB is disabled - but with #41963, we now also LB at the source in such configurations. Either way, it's perfectly fine to remove this optimization.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
This PR is in the context of the high-scale ipcache feature described at cilium/design-cfps#7.
It enables the nodeport LB to handle an unencapsulated service request, and forward the request to the backend using GENEVE-DSR. It also adds handling for reply traffic.
In detail:
- [...] from-netdev (either XDP or TC), and nodeport_lb4() selects a backend.
- [...] from-netdev program strips off the encapsulation and redirects it to from-overlay. We manually transfer the DSR info across this redirect. The from-overlay program processes the DSR info and creates a corresponding SNAT entry.
- [...] to-overlay (when they go to a destination inside the Clustermesh), or to-netdev (when the client belongs to one of the configured WorldCIDRs).