High-scale IPCache: Nodeport LB support Part 1 #25745

Merged
julianwiedmann merged 6 commits into cilium:main from julianwiedmann:1.14-hsipcache-part1
Jun 1, 2023

@julianwiedmann
Member

@julianwiedmann julianwiedmann commented May 29, 2023

This PR is in the context of the high-scale ipcache feature described at cilium/design-cfps#7.

It enables the nodeport LB to handle an unencapsulated service request, and forward the request to the backend using GENEVE-DSR. It also adds handling for reply traffic.

In detail:

  • the request enters the LB in from-netdev (either XDP or TC), and nodeport_lb4() selects a backend.
  • the DNATed packet goes down the DSR egress code path, and has GENEVE encapsulation added (+ DSR option as needed). In the context of hs-ipcache, we use the backend's IP address as OuterDstIP (same as if it was a pod-to-pod connection). If the load-balancing is done in XDP, we punt up to TC for adding the tunnel headers.
  • the packet is sent to the backend node
  • at the backend node, the from-netdev program strips off the encapsulation and redirects it to from-overlay. We manually transfer the DSR info across this redirect. The from-overlay program processes the DSR info and creates a corresponding SNAT entry.
  • replies are revDNATed in to-overlay (when they go to a destination inside the Clustermesh), or to-netdev (when the client belongs to one of the configured WorldCIDRs).

Add support for load-balancing unencapsulated requests in a configuration with high-scale ipcache.
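
The last step of the flow above (where the reply gets revDNATed) can be sketched as a plain CIDR match. This is only an illustrative model, not code from the PR: `reply_hook`, `cidr4`, `cidr4_contains()` and `reply_revdnat_hook()` are hypothetical names, and addresses are assumed to be in host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; none of this is Cilium's actual code. */
enum reply_hook { HOOK_TO_OVERLAY, HOOK_TO_NETDEV };

struct cidr4 {
	uint32_t addr; /* network address, host byte order */
	uint8_t len;   /* prefix length, 0..32 */
};

static bool cidr4_contains(const struct cidr4 *c, uint32_t ip)
{
	uint32_t mask;

	assert(c->len <= 32);
	/* A /0 prefix matches everything; shifting a 32-bit value by 32 is undefined. */
	mask = c->len == 0 ? 0 : ~(uint32_t)0 << (32 - c->len);
	return (ip & mask) == (c->addr & mask);
}

/* Clients inside a configured WorldCIDR get their reply revDNATed in
 * to-netdev; every other destination is assumed to be inside the
 * Clustermesh and is handled in to-overlay. */
static enum reply_hook reply_revdnat_hook(const struct cidr4 *world, size_t n,
					  uint32_t client_ip)
{
	for (size_t i = 0; i < n; i++) {
		if (cidr4_contains(&world[i], client_ip))
			return HOOK_TO_NETDEV;
	}
	return HOOK_TO_OVERLAY;
}

/* Usage example with an assumed WorldCIDR of 192.168.0.0/16. */
static void demo_reply_hook(void)
{
	const struct cidr4 world[] = { { 0xC0A80000u, 16 } };

	assert(reply_revdnat_hook(world, 1, 0xC0A80105u) == HOOK_TO_NETDEV);  /* 192.168.1.5 */
	assert(reply_revdnat_hook(world, 1, 0x0A000001u) == HOOK_TO_OVERLAY); /* 10.0.0.1 */
}
```

In the real datapath the decision presumably falls out of which device the reply is routed to, rather than from an explicit helper like this one.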

@julianwiedmann julianwiedmann added the area/datapath, release-note/minor, and feature/high-scale-ipcache labels May 29, 2023
@julianwiedmann julianwiedmann changed the title from "High-scale IPCache: Nodeport support Part 1" to "High-scale IPCache: Nodeport LB support Part 1" May 29, 2023
@julianwiedmann julianwiedmann force-pushed the 1.14-hsipcache-part1 branch from ddcd1aa to 44a1742 Compare May 30, 2023 10:23
@julianwiedmann
Member Author

julianwiedmann commented May 30, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Test Name

K8sAgentPolicyTest Basic Test Traffic redirections to proxy Tests proxy visibility interactions with policy lifecycle operations

Failure Output

FAIL: Failed to start hubble observe

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/279/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

@julianwiedmann julianwiedmann added the kind/feature and feature/lb-only labels May 30, 2023
@julianwiedmann julianwiedmann marked this pull request as ready for review May 30, 2023 10:37
@julianwiedmann julianwiedmann requested a review from a team as a code owner May 30, 2023 10:37
Contributor

@bleggett bleggett left a comment

LGTM

The custom decap code for high-scale ipcache in from-netdev transfers
the source's sec_identity from the tunnel header into CB_SRC_LABEL.

But from-overlay initializes its src_sec_identity variable with 0, and only
loads from CB_SRC_LABEL inside handle_ipv4(). In a config with BPF
masquerading, the packet passes through nodeport_lb4() first - which stores
the passed-in src_sec_identity (== 0) into CB_SRC_LABEL, checks for revSNAT
and then tail-calls back to the start of the IPv4 from-overlay path.
Thus we currently lose the sec_identity for e.g. pod-to-pod connections when
BPF Masquerading is enabled.

Align this a bit closer with how bpf_host is working - load from
CB_SRC_LABEL at the beginning of the tail-call, and clear it. This way we
can feed the src_sec_identity into the call to nodeport_lb4(), where it
then gets restored to CB_SRC_LABEL before tail-calling back.

If needed, l3_local_delivery() will subsequently fill CB_SRC_LABEL again
before redirecting to the local endpoint.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
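
The ordering problem described in this commit message can be modeled in a few lines of plain C. This is a sketch under assumptions, not the actual bpf_overlay code: `fake_ctx`, `fake_nodeport_lb4()`, the two `from_overlay_*()` functions and the slot index used for `CB_SRC_LABEL` are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the skb control buffer; the slot index is an assumption. */
enum { CB_SRC_LABEL = 0, CB_MAX = 5 };

struct fake_ctx {
	uint32_t cb[CB_MAX];
};

/* Stand-in for nodeport_lb4(): it stores the identity it was handed into
 * CB_SRC_LABEL before tail-calling back into the IPv4 path. */
static void fake_nodeport_lb4(struct fake_ctx *ctx, uint32_t src_sec_identity)
{
	ctx->cb[CB_SRC_LABEL] = src_sec_identity;
}

/* Old ordering: the local variable starts at 0, so the identity that
 * from-netdev left in CB_SRC_LABEL is overwritten. */
static uint32_t from_overlay_old(struct fake_ctx *ctx)
{
	uint32_t src_sec_identity = 0;

	fake_nodeport_lb4(ctx, src_sec_identity);
	return ctx->cb[CB_SRC_LABEL]; /* what the later load would see */
}

/* New ordering: load from CB_SRC_LABEL first, clear the slot, and feed the
 * value through the nodeport call so that it gets restored. */
static uint32_t from_overlay_new(struct fake_ctx *ctx)
{
	uint32_t src_sec_identity = ctx->cb[CB_SRC_LABEL];

	ctx->cb[CB_SRC_LABEL] = 0;
	fake_nodeport_lb4(ctx, src_sec_identity);
	return ctx->cb[CB_SRC_LABEL];
}

/* Usage example: identity 1234 is an arbitrary illustrative value. */
static void demo_src_label(void)
{
	struct fake_ctx a = { { 1234 } };
	struct fake_ctx b = { { 1234 } };

	assert(from_overlay_old(&a) == 0);    /* identity lost */
	assert(from_overlay_new(&b) == 1234); /* identity preserved */
}
```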

Determine the tunnel endpoint earlier, and initialize the GENEVE option
struct earlier.

This is just prep work for a subsequent patch, no functional change.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>

When using the nodeport LB in GENEVE-DSR mode with hs-ipcache, don't rely
on the ipcache to select the GENEVE tunnel endpoint for the selected
backend.

Use the InnerDstIP (== backend IP) instead, same as we do for pod-to-pod
traffic.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
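
As a rough illustration of the endpoint selection this commit describes, the following sketch picks the outer destination for the GENEVE header. The names `fake_ipcache_entry` and `dsr_geneve_tunnel_endpoint()` are hypothetical, and returning 0 for "no endpoint known" is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the result of an ipcache lookup. */
struct fake_ipcache_entry {
	uint32_t tunnel_endpoint; /* node IP hosting the backend */
};

/* With hs-ipcache, the backend IP itself becomes the OuterDstIP, same as
 * for pod-to-pod traffic. Otherwise the ipcache entry (if any) decides. */
static uint32_t dsr_geneve_tunnel_endpoint(bool hs_ipcache, uint32_t backend_ip,
					   const struct fake_ipcache_entry *info)
{
	if (hs_ipcache)
		return backend_ip; /* OuterDstIP == InnerDstIP */
	return info ? info->tunnel_endpoint : 0; /* 0: no endpoint known (assumption) */
}

/* Usage example with arbitrary illustrative addresses. */
static void demo_tunnel_endpoint(void)
{
	const struct fake_ipcache_entry e = { 0x0A000002u }; /* 10.0.0.2 */
	const uint32_t backend = 0x0A010005u;                 /* 10.1.0.5 */

	assert(dsr_geneve_tunnel_endpoint(true, backend, &e) == backend);
	assert(dsr_geneve_tunnel_endpoint(false, backend, &e) == 0x0A000002u);
	assert(dsr_geneve_tunnel_endpoint(false, backend, NULL) == 0);
}
```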

Prefer DROP reasons over a raw CTX_ACT_DROP.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>

A backend node in hs-ipcache mode currently strips off the tunnel headers
and manually redirects the packet to cilium_geneve. The SrcSecID is
transferred via CB_SRC_LABEL.

For GENEVE-DSR we also need to transfer the DSR option, so that the
nodeport_lb4() call in from-overlay can process it as usual. But we can't
use ctx_set_tunnel_opt() for this, as the metadata_dst will be scrubbed
from the skb when redirecting to cilium_geneve's Ingress.

Transfer the DSR info via skb->cb instead, and copy it back to a
metadata_dst in from-overlay so that things look identical for the nodeport
DSR code.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
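
The hand-off described in this commit message can be modeled as follows. This is a simplified sketch, not the PR's code: `fake_skb`, `dsr_opt`, the `CB_DSR_*` slot indices and all three helper functions are hypothetical, and the model simply assumes that `cb` survives the redirect while the tunnel metadata does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical DSR option payload: the original service address and port. */
struct dsr_opt {
	uint32_t svc_addr;
	uint16_t svc_port;
};

/* Hypothetical skb model: cb[] plus a flag standing in for the metadata_dst. */
struct fake_skb {
	uint32_t cb[5];
	bool has_tunnel_opt;
	struct dsr_opt tunnel_opt;
};

enum { CB_DSR_ADDR = 1, CB_DSR_PORT = 2 }; /* slot choice is an assumption */

/* from-netdev side: copy the DSR info into cb before redirecting. */
static void stash_dsr_in_cb(struct fake_skb *skb, const struct dsr_opt *opt)
{
	skb->cb[CB_DSR_ADDR] = opt->svc_addr;
	skb->cb[CB_DSR_PORT] = opt->svc_port;
}

/* Models the redirect to cilium_geneve's ingress: tunnel metadata is scrubbed. */
static void fake_redirect_to_ingress(struct fake_skb *skb)
{
	skb->has_tunnel_opt = false;
	memset(&skb->tunnel_opt, 0, sizeof(skb->tunnel_opt));
}

/* from-overlay side: rebuild the tunnel option from cb, then clear the slots. */
static void restore_dsr_from_cb(struct fake_skb *skb)
{
	skb->tunnel_opt.svc_addr = skb->cb[CB_DSR_ADDR];
	skb->tunnel_opt.svc_port = (uint16_t)skb->cb[CB_DSR_PORT];
	skb->has_tunnel_opt = true;
	skb->cb[CB_DSR_ADDR] = 0;
	skb->cb[CB_DSR_PORT] = 0;
}

/* Usage example with an arbitrary service address and port. */
static void demo_dsr_transfer(void)
{
	const struct dsr_opt opt = { 0xC0A80001u, 8080 };
	struct fake_skb skb;

	memset(&skb, 0, sizeof(skb));
	stash_dsr_in_cb(&skb, &opt);
	fake_redirect_to_ingress(&skb);
	assert(!skb.has_tunnel_opt);
	restore_dsr_from_cb(&skb);
	assert(skb.has_tunnel_opt);
	assert(skb.tunnel_opt.svc_addr == 0xC0A80001u);
	assert(skb.tunnel_opt.svc_port == 8080);
}
```

After the restore step, the DSR code sees the same kind of tunnel option it would have seen on a regular overlay ingress, which is the point the commit message makes.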

When a DSR backend replies back to the client in a hs-ipcache
configuration, it potentially uses tunnel encapsulation (based on the
configured WorldCIDR). RevDNAT for the reply is then handled in to-overlay.

To match the LB path (where both the inner and outer DstIP were set
to the service IP), we should also revDNAT the outer SrcIP. As we're in
the to-overlay program, the SrcIP is stored in the tunnel_key.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
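
A minimal model of the reply rewrite this commit describes is shown below. The structs `fake_tunnel_key` and `fake_reply` and the function `revdnat_reply()` are hypothetical; in particular, the field holding the outer source address is an illustrative placeholder and not a claim about the kernel's tunnel key layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tunnel key: outer source and destination of the encapsulation. */
struct fake_tunnel_key {
	uint32_t local_ip;  /* outer SrcIP */
	uint32_t remote_ip; /* outer DstIP */
};

/* Hypothetical view of the inner reply packet. */
struct fake_reply {
	uint32_t inner_src;
	uint32_t inner_dst;
};

/* RevDNAT a reply from backend_ip so that both the inner and the outer
 * source address show the service IP. Returns false if the packet did not
 * come from the backend and was left untouched. */
static bool revdnat_reply(struct fake_reply *pkt, struct fake_tunnel_key *key,
			  uint32_t backend_ip, uint32_t service_ip)
{
	if (pkt->inner_src != backend_ip)
		return false;

	pkt->inner_src = service_ip;
	key->local_ip = service_ip; /* keep the outer SrcIP in sync */
	return true;
}

/* Usage example with arbitrary illustrative addresses. */
static void demo_revdnat_reply(void)
{
	const uint32_t backend = 0x0A010005u, service = 0xC0A80001u, client = 0x0A020007u;
	struct fake_reply pkt = { backend, client };
	struct fake_tunnel_key key = { backend, client };

	assert(revdnat_reply(&pkt, &key, backend, service));
	assert(pkt.inner_src == service);
	assert(key.local_ip == service);
	assert(key.remote_ip == client); /* destination is untouched */
}
```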
@julianwiedmann julianwiedmann force-pushed the 1.14-hsipcache-part1 branch from 44a1742 to 34063bc Compare May 31, 2023 13:20
@julianwiedmann
Member Author

Rebased on top of #24422.

@julianwiedmann
Member Author

/test

@julianwiedmann
Member Author

net-next failed in "K8sDatapathServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Checks in-cluster KPR with L7 policy", but it looks like there's no dump because the Jenkins run timed out afterwards.

@julianwiedmann
Member Author

/test-1.26-net-next

Contributor

@ldelossa ldelossa left a comment

Changes look good to me. 👍

@julianwiedmann julianwiedmann merged commit f541499 into cilium:main Jun 1, 2023
@julianwiedmann julianwiedmann deleted the 1.14-hsipcache-part1 branch June 1, 2023 17:14
julianwiedmann added a commit that referenced this pull request Mar 26, 2026
This code was initially introduced as part of the hs-ipcache feature
(#25745, #25854). It allowed for the XDP path
to peek into GENEVE-encapsulated traffic, LB it in-place (using DSR-GENEVE
mode), and redirect the packet straight out to the backend.

But with HS-ipcache gone, this code path is pretty much unused. It *might*
have seen some unintentional usage for E/W NodePort access when SocketLB
is disabled - but with #41963, we now
also LB at the source in such configurations. Either way, it's perfectly
fine to remove this optimization.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Mar 31, 2026
This code was initially introduced as part of the hs-ipcache feature
(#25745, #25854). It allowed for the XDP path
to peek into GENEVE-encapsulated traffic, LB it in-place (using DSR-GENEVE
mode), and redirect the packet straight out to the backend.

But with HS-ipcache gone, this code path is pretty much unused. It *might*
have seen some unintentional usage for E/W NodePort access when SocketLB
is disabled - but with #41963, we now
also LB at the source in such configurations. Either way, it's perfectly
fine to remove this optimization.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Labels

area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
feature/high-scale-ipcache: Relates to the high-scale ipcache feature.
feature/lb-only: Impacts cilium running in lb-only datapath mode.
kind/feature: This introduces new functionality.
release-note/minor: This PR changes functionality that users may find relevant to operating Cilium.

3 participants