bpf: Remove link scope of cilium_host's IPv4 address #23241
Merged
ldelossa merged 1 commit into cilium:master on Jan 25, 2023
Kube-proxy always masquerades DNATed packets going to NodePort services.
This is to ensure that reply packets always flow through the
intermediate, DNATing node. Consider the following path:
pod@node1 -> nodeport@node2 -> backend@node3
A packet is sent from pod@node1 to a NodePort service with node2's IP
address. Node2 DNATs the packet and forwards it to the backend on node3.
If node2 doesn't also masquerade the packet, the reply packet will be
sent directly to node1, bypassing the reverse DNAT.
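
As an illustration of the path above, here is a minimal Go sketch of why the masquerade matters. The `packet` type and the `forward` and `reply` helpers are hypothetical names made up for this example; they model only source and destination labels, not real kube-proxy or netfilter behavior.

```go
package main

import "fmt"

// packet is a hypothetical, minimal representation: source and destination only.
type packet struct{ src, dst string }

// forward models node2 handling a NodePort packet: it always DNATs to the
// backend and, if masquerade is set, also rewrites the source to nodeIP.
func forward(p packet, backend, nodeIP string, masquerade bool) packet {
	out := packet{src: p.src, dst: backend} // DNAT
	if masquerade {
		out.src = nodeIP // SNAT (masquerade)
	}
	return out
}

// reply models the backend answering: source and destination are swapped.
func reply(p packet) packet { return packet{src: p.dst, dst: p.src} }

func main() {
	req := packet{src: "pod@node1", dst: "nodeport@node2"}

	// Without masquerading, the reply goes straight back to the client
	// and never passes node2's reverse DNAT.
	fmt.Println(reply(forward(req, "backend@node3", "node2", false)).dst) // prints "pod@node1"

	// With masquerading, the reply returns to node2 first.
	fmt.Println(reply(forward(req, "backend@node3", "node2", true)).dst) // prints "node2"
}
```

In the first case the client would receive a reply from backend@node3 although it sent its request to nodeport@node2, so the connection cannot be matched.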
In tunneling mode, however, kube-proxy appears unable to pick the correct
source IP for masquerading. Consider the following packet flow (under
VXLAN + endpoint routes + IPsec [1]):
<- endpoint 656 flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.0.1.172:57110 -> 192.168.56.12:30656 tcp SYN
-> stack flow 0x5c7eb4 , identity 20590->host state new ifindex 0 orig-ip 0.0.0.0: 10.0.1.172:57110 -> 192.168.56.12:30656 tcp SYN
<- host flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex lxc7e0fe2229abe orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN
-> stack flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex cilium_host orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN
<- stack encrypted flow 0x5c7eb4 , identity 20590->unknown state new ifindex cilium_net orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN
-> overlay encrypted flow 0x5c7eb4 , identity 20590->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN
Client pod 10.0.1.172 sends a packet to NodePort 30656 on node 2. That
packet is masqueraded to 10.0.2.15 (line 3), the IP on the default
interface. This choice is incorrect as the packet will then go through
the tunnel and not the underlay. The reply will therefore not be sent
through the tunnel and may even fail if 10.0.2.15 isn't routable from
node 2 (as is the case in our testing setup).
Instead, kube-proxy should pick the IP address of cilium_host, which
belongs to the node's pod CIDR, thus ensuring the reply will be routed
through the tunnel. Why doesn't it?
Checking the kernel's source code [2], we can see that the scope of IP
addresses on the interfaces is taken into account in addition to the
destination IP (and other packet information in case of source routing,
etc.). Specifically, in the case of netfilter's masquerading,
inet_select_addr is called with a scope of RT_SCOPE_UNIVERSE (0).
Therefore, only IP addresses with a scope equal to RT_SCOPE_UNIVERSE
will be picked.
This commit thus removes the link scope on the IPv4 address of
cilium_host, such that the address now has an RT_SCOPE_UNIVERSE scope
(default).
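
The effect of the scope check can be sketched with a simplified Go model. This is not the kernel's code: `ifAddr` and `selectAddr` are hypothetical names, the interface name `default_if` and the cilium_host address 10.0.1.1 are made up for illustration, and the model ignores secondary addresses and subnet matching. Only the scope constants are the kernel's actual values.

```go
package main

import "fmt"

// Route scope values as defined by the kernel (rtnetlink.h).
const (
	scopeUniverse = 0   // RT_SCOPE_UNIVERSE
	scopeLink     = 253 // RT_SCOPE_LINK
)

// ifAddr is a hypothetical stand-in for an IPv4 address on a device.
type ifAddr struct {
	dev   string
	ip    string
	scope int
}

// selectAddr is a simplified model of source address selection: it prefers
// an address on the output device whose scope does not exceed the requested
// scope, and otherwise falls back to an eligible address on another device.
func selectAddr(addrs []ifAddr, outDev string, scope int) string {
	for _, a := range addrs {
		if a.dev == outDev && a.scope <= scope {
			return a.ip
		}
	}
	for _, a := range addrs {
		if a.dev != outDev && a.scope <= scope {
			return a.ip
		}
	}
	return ""
}

func main() {
	// Before this change: cilium_host's address has link scope (assumed IP).
	before := []ifAddr{
		{"default_if", "10.0.2.15", scopeUniverse},
		{"cilium_host", "10.0.1.1", scopeLink},
	}
	// After this change: the same address has the default universe scope.
	after := []ifAddr{
		{"default_if", "10.0.2.15", scopeUniverse},
		{"cilium_host", "10.0.1.1", scopeUniverse},
	}

	fmt.Println(selectAddr(before, "cilium_host", scopeUniverse)) // prints "10.0.2.15"
	fmt.Println(selectAddr(after, "cilium_host", scopeUniverse))  // prints "10.0.1.1"
}
```

With a link-scoped address, the model skips cilium_host and falls back to the default interface's IP, which matches the masqueraded source seen in the trace above.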
This will be tested in the Cilium Datapath workflow via a subsequent
pull request, but we need to fix one other bug before we can do that.
1 - IPsec doesn't matter for the bug here. Endpoint routes, however, do.
    If endpoint routes is disabled, Cilium adds a masquerading rule in
    front of kube-proxy's to always masquerade DNATed pod traffic to
    the cilium_host IP address. See [3] for details.
2 - https://github.com/torvalds/linux/blob/v5.19/net/ipv4/devinet.c#L1324
3 - https://github.com/cilium/cilium/blob/v1.13.0-rc4/pkg/datapath/iptables/iptables.go#L1216-L1242
Co-authored-by: Liu Xu <liuxu623@gmail.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
Previous run failed in https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/514/testReport/Suite-k8s-1/26/K8sDatapathServicesTest_Checks_N_S_loadbalancing_Tests_with_direct_routing_and_DSR/. Seems likely to be a flake given this change shouldn't affect KPR.
dylandreimerink approved these changes on Jan 24, 2023
I had help :-) While debugging this I had in mind #21738, which made the same change before. And Daniel also pointed to the kernel code when I asked about the masquerading logic.