Skip to content

bpf: egw: fix over-eager matching of EGW policy against reply traffic#36929

Merged
julianwiedmann merged 2 commits intocilium:mainfrom
julianwiedmann:1.18-egw-reply-fix
Jan 14, 2025
Merged

bpf: egw: fix over-eager matching of EGW policy against reply traffic#36929
julianwiedmann merged 2 commits intocilium:mainfrom
julianwiedmann:1.18-egw-reply-fix

Conversation

@julianwiedmann
Copy link
Copy Markdown
Member

@julianwiedmann julianwiedmann commented Jan 10, 2025

The code to handle EGW reply traffic lives in the nodeport ingress path, after a packet has been RevSNATed. In contrast to the outbound path, it doesn't consider whether the packet's source is an in-cluster entity.

This becomes problematic when the EGW policy specifies a broad-range destinationCIDR, such as 0.0.0.0/0, and accidentally matches in-cluster service traffic. More precisely:

  • a client pod matches an EGW policy of the form "from $client_pod to 0.0.0.0/0, exit via gateway nodeB"
  • client_pod accesses a nodeport service on gateway nodeB, where the request is DNATed to a remote backend, and SNATed to nodeB's IP.
  • once the backend's reply reaches nodeB, it is revSNATed to client_pod. At this point the packet matches the EGW policy entry, and is thus redirected into the overlay network and forwarded to client_pod.
  • As the RevDNAT step has been skipped, client_pod rejects this packet from an unknown source.

Fix this by applying the RevDNAT logic prior to looking for EGW policy matches. Double-checking whether the packet's origin is an in-cluster entity might still make sense, but no longer strictly necessary.

From a tcpdump on a kind cluster (with EGW, KPR without SocketLB and native routing):

# client accesses nodeport 31936 on remote LB node
IP 10.244.1.236.35556 > 172.18.0.2.31936: Flags [S]
# request is DNATed + SNATed, and forwarded to remote backend
IP 172.18.0.2.35556 > 10.244.3.121.80: Flags [S]
# backend's reply reaches the LB node
IP 10.244.3.121.80 > 172.18.0.2.35556: Flags [S.]
# LB node applies revSNAT, adds VXLAN encapsulation and forwards to client
IP 172.18.0.2.54624 > 172.18.0.4.8472: OTV, flags [I] (0x08), overlay 0, instance 6
IP 10.244.3.121.80 > 10.244.1.236.35556: Flags [S.]
# client's RST to the backend
IP 10.244.1.236.35556 > 10.244.3.121.80: Flags [R]
Fix a bug that prevents a pod from accessing Nodeport services when the pod is also in scope of a broad-range Egress Gateway policy.

@julianwiedmann julianwiedmann added kind/bug This is a bug in the Cilium logic. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. feature/egress-gateway Impacts the egress IP gateway feature. area/loadbalancing Impacts load-balancing and Kubernetes service implementations needs-backport/1.16 needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 10, 2025
@julianwiedmann julianwiedmann added affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch labels Jan 10, 2025
@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

@julianwiedmann julianwiedmann added the dont-merge/preview-only Only for preview or testing, don't merge it. label Jan 10, 2025
@julianwiedmann julianwiedmann requested review from a team and pippolo84 and removed request for a team January 10, 2025 11:25
@julianwiedmann julianwiedmann marked this pull request as ready for review January 10, 2025 11:25
@julianwiedmann julianwiedmann requested a review from a team as a code owner January 10, 2025 11:25
@julianwiedmann
Copy link
Copy Markdown
Member Author

(I'll also want to cook a BPF-level test for this problem)

The code to handle EGW reply traffic lives in the nodeport ingress path,
after a packet has been RevSNATed. In contrast to the outbound path, it
doesn't consider whether the packet's source is an in-cluster entity.

This becomes problematic when the EGW policy specifies a broad-range
`destinationCIDR`, such as 0.0.0.0/0, and accidentally matches in-cluster
service traffic. More precisely:
* a client pod matches an EGW policy of the form
  "from $client_pod to 0.0.0.0/0, exit via gateway nodeB"
* client_pod accesses a nodeport service on gateway nodeB, where the
  request is DNATed to a remote backend, and SNATed to nodeB's IP.
* once the backend's reply reaches nodeB, it is revSNATed to client_pod.
  At this point the packet matches the EGW policy entry, and is thus
  redirected into the overlay network and forwarded to client_pod.
* As the RevDNAT step has been skipped, client_pod rejects this packet from
  an unknown source.

Fix this by applying the RevDNAT logic *prior* to looking for EGW policy
matches. Double-checking whether the packet's origin is an in-cluster
entity might still make sense, but no longer strictly necessary.


From a tcpdump on a kind cluster (with EGW, KPR without SocketLB and
native routing):

```
 # client accesses nodeport 31936 on remote LB node
IP 10.244.1.236.35556 > 172.18.0.2.31936: Flags [S]
 # request is DNATed + SNATed, and forwarded to remote backend
IP 172.18.0.2.35556 > 10.244.3.121.80: Flags [S]
 # backend's reply reaches the LB node
IP 10.244.3.121.80 > 172.18.0.2.35556: Flags [S.]
 # LB node applies revSNAT, adds VXLAN encapsulation and forwards to client
IP 172.18.0.2.54624 > 172.18.0.4.8472: OTV, flags [I] (0x08), overlay 0, instance 6
IP 10.244.3.121.80 > 10.244.1.236.35556: Flags [S.]
 # client's RST to the backend
IP 10.244.1.236.35556 > 10.244.3.121.80: Flags [R]
```

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Reply traffic for an SNATed in-cluster nodeport connection should not be
captured by a catch-all EGW policy. Otherwise the packet reaches the client
without the source being RevDNATed back to the nodeport, and thus
resulting in a RST by the client.

This adds testing for the scenario that was fixed in the preceding commit.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
@julianwiedmann julianwiedmann removed the dont-merge/preview-only Only for preview or testing, don't merge it. label Jan 13, 2025
@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

Copy link
Copy Markdown
Member

@YutaroHayakawa YutaroHayakawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 14, 2025
@julianwiedmann julianwiedmann added this pull request to the merge queue Jan 14, 2025
Merged via the queue into cilium:main with commit a899c4a Jan 14, 2025
@julianwiedmann julianwiedmann deleted the 1.18-egw-reply-fix branch January 14, 2025 08:30
@rastislavs rastislavs mentioned this pull request Jan 21, 2025
45 tasks
@rastislavs rastislavs added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 21, 2025
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. labels Jan 22, 2025
@rastislavs rastislavs mentioned this pull request Jan 22, 2025
19 tasks
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 labels Jan 24, 2025
julianwiedmann added a commit that referenced this pull request Mar 13, 2026
This removes the test introduced by
a899c4a ("bpf: test: cover NAT LB with catch-all EGW policy"), as part
of #36929.

The test targeted the scenario of an in-cluster client which accesses a
NodePort service on a remote node. But with the changes from
#44507, this type of access no longer
exists, as the traffic will always get DNATed at the source node.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Mar 13, 2026
This removes the test introduced by
a899c4a ("bpf: test: cover NAT LB with catch-all EGW policy"), as part
of #36929.

The test targeted the scenario of an in-cluster client which accesses a
NodePort service on a remote node. But with the changes from
#44507, this type of access no longer
exists, as the traffic will always get DNATed at the source node.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
javiercardona-work pushed a commit to javiercardona-work/cilium that referenced this pull request Mar 18, 2026
This removes the test introduced by
a899c4a ("bpf: test: cover NAT LB with catch-all EGW policy"), as part
of cilium#36929.

The test targeted the scenario of an in-cluster client which accesses a
NodePort service on a remote node. But with the changes from
cilium#44507, this type of access no longer
exists, as the traffic will always get DNATed at the source node.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/loadbalancing Impacts load-balancing and Kubernetes service implementations backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. feature/egress-gateway Impacts the egress IP gateway feature. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants