
bpf:nat: Fix DSR support for remote NodePort services#41963

Merged
julianwiedmann merged 4 commits into cilium:main from gyutaeb:fix/dsr-remote-nodeport
Feb 9, 2026

Conversation

@gyutaeb
Member

@gyutaeb gyutaeb commented Oct 1, 2025

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

Problem Statement

Root Cause

image
  • The root cause is that the remote node only performs DNAT. The packet's source address therefore remains the client node's address, leading to a hairpin problem: the originating node cannot perform the necessary REV NAT for the reply packets.
  • The problem occurs through the following path:
    • Request
      • Client Pod -> Remote Node:nodePort
      • Local Node:sport -> Remote Node:nodePort (masquerading)
      • Local Node:sport -> Backend Pod:port (DSR DNAT)
      • The Remote Node sends the above packet over Geneve
      • The Local Node routes the above packet to Backend Pod:port
    • Reply
      • Backend Pod:port -> Local Node:sport
      • Routed to loopback without REV NAT, because the destination address is the Local Node

Solution Summary

image
  • To resolve this, remote service requests are now handled on the source node when DSR is enabled, similar to the behavior of socket-level load balancing.

Key Consideration

  • If we choose to address this by egressing to a remote node, a REV NAT stage corresponding to the native-device masquerading is required. Because REV NAT executes only on the native device's ingress path, we conclude it is better to handle this on the local node, analogous to the BPF socket load-balancer. Performing REV NAT in cil_from_container would unnecessarily increase complexity.

Release Note

Fix in-cluster NodePort connectivity failure in DSR mode when SocketLB
is disabled. When a pod accesses a NodePort service via a remote node's
IP (instead of the ClusterIP) and the selected backend resides on the
same node as the client, the connection fails due to missing reverse NAT
on the reply path.

@gyutaeb gyutaeb requested a review from a team as a code owner October 1, 2025 09:34
@gyutaeb gyutaeb requested a review from aditighag October 1, 2025 09:34
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 1, 2025
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Oct 1, 2025
@gyutaeb gyutaeb force-pushed the fix/dsr-remote-nodeport branch 2 times, most recently from f0afa83 to ee141a1 Compare October 1, 2025 11:01
@julianwiedmann
Member

👋 just to confirm - this is about the scenario where SocketLB is disabled, and a pod accesses a remote Nodeport service which selects a DSR backend on the client's node? I could have sworn that we already have an issue for this somewhere 🙂.

Ideally we'd solve this by addressing #34777.

@julianwiedmann julianwiedmann added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/loadbalancing Impacts load-balancing and Kubernetes service implementations feature/dsr Relates to Cilium's Direct-Server-Return feature for KPR. area/kpr Anything related to our kube-proxy replacement. labels Oct 7, 2025
@gyutaeb
Member Author

gyutaeb commented Oct 13, 2025

👋 just to confirm - this is about the scenario where SocketLB is disabled, and a pod accesses a remote Nodeport service which selects a DSR backend on the client's node? I could have sworn that we already have an issue for this somewhere 🙂.

Ideally we'd solve this by addressing #34777.

@julianwiedmann
👋 Thanks for confirming, Julian. Yes, exactly. This issue arises when a remote NodePort service selects a DSR backend on the client's node. I've reviewed a number of cases. My view aligns with #34777: the load balancing should happen at the source node, so I implemented this PR in that direction.

For the wildcard lookups, I preferred not to reuse the SocketLB functions, so I factored them into a separate lookup function.

@gyutaeb gyutaeb force-pushed the fix/dsr-remote-nodeport branch from ee141a1 to ad6ba19 Compare October 14, 2025 02:16
@julianwiedmann julianwiedmann added the dont-merge/preview-only Only for preview or testing, don't merge it. label Oct 15, 2025
@julianwiedmann
Member

(adding the preview label for now, since we currently don't have tests or IPv6 support)


@aditighag aditighag left a comment


Thanks for your contribution, and the detailed PR description.
IIUC, there are two main changes in the PR: (1) Perform LB wildcard svc lookup and xlate for nodeport-vip:port at the source node for DSR. (2) Create CT mappings for the xlate'd traffic so that it can be reverse xlate'd.

A high-level question: Is there a reason why (1) is implemented only for DSR? Why not align it with the socket-LB implementation for all cases?

I’d also suggest extracting the service wildcard lookup code into a helper function that explicitly highlights the functionality (similar to socket-LB). This would make it easier to draw parallels with the socket-LB implementation for feature parity.

@gyutaeb
Member Author

gyutaeb commented Oct 24, 2025

@julianwiedmann @aditighag
Sorry for the delay. I've been tied up with CiliumCon NA preparations. I will take a look soon.

@gyutaeb
Member Author

gyutaeb commented Oct 24, 2025

A high-level question: Is there a reason why (1) is implemented only for DSR? Why not align it with the socket-LB implementation for all cases?

I’d also suggest extracting the service wildcard lookup code into a helper function that explicitly highlights the functionality (similar to socket-LB). This would make it easier to draw parallels with the socket-LB implementation for feature parity.

@aditighag
Thanks for the review!

Initially, I aimed to make the wildcard svc lookup generic, similar to socket-LB. However, I wasn't fully confident it was safe to broaden the scope, so the first iteration enables it only for DSR to keep the blast radius small. I agree that a generic approach is better. I will revise the patch to make the lookup generic.

@gyutaeb gyutaeb force-pushed the fix/dsr-remote-nodeport branch 2 times, most recently from acd33b5 to 6061c03 Compare November 21, 2025 11:50
@gyutaeb
Member Author

gyutaeb commented Nov 21, 2025

@julianwiedmann @aditighag
Thanks for your patience. I have addressed all your feedback.

  1. Verified proper operation of GC for NAT entries.
  2. Verified the RevNAT map mechanism in SocketLB.
  3. Made the wildcard lookup function generic.
  4. Added IPv6 support.
  5. Rebased onto the latest main.

Could you please take a look again?

@gyutaeb gyutaeb requested a review from aditighag November 21, 2025 12:01
@gyutaeb gyutaeb force-pushed the fix/dsr-remote-nodeport branch from 6061c03 to 53522cb Compare November 25, 2025 01:33
@gyutaeb
Member Author

gyutaeb commented Nov 25, 2025

@pchaigno
Thanks for testing.
It looks like many CI jobs failed during the provisioning and setup stages. I've rebased this branch onto the latest main.
Could you please rerun the tests?

@gyutaeb gyutaeb force-pushed the fix/dsr-remote-nodeport branch from 53522cb to 12a9a1c Compare November 25, 2025 06:29
@gyutaeb
Member Author

gyutaeb commented Nov 25, 2025

(rebased branch again to facilitate testing)

@gyutaeb
Member Author

gyutaeb commented Nov 25, 2025

@pchaigno
I've found the cause of the failures. It looks like one of the C macro definitions changed when I rebased onto the latest main. I'll fix it and update you here.

2025-11-25T07:02:43.084954383Z time=2025-11-25T07:02:43.084671309Z level=error source=/go/src/github.com/cilium/cilium/pkg/datapath/loader/compile.go:219 msg="Failed to compile bpf_alignchecker.o: exit status 1" module=agent.datapath.loader compilerPID=59
2025-11-25T07:02:43.085012253Z time=2025-11-25T07:02:43.084800211Z level=warn source=/go/src/github.com/cilium/cilium/pkg/datapath/loader/compile.go:227 msg="In file included from /var/lib/cilium/bpf/bpf_alignchecker.c:38:" module=agent.datapath.loader
2025-11-25T07:02:43.085029323Z time=2025-11-25T07:02:43.084937022Z level=warn source=/go/src/github.com/cilium/cilium/pkg/datapath/loader/compile.go:227 msg="/var/lib/cilium/bpf/lib/lb.h:1789:21: error: use of undeclared identifier 'NODEPORT_PORT_MIN'" module=agent.datapath.loader
2025-11-25T07:02:43.085128155Z time=2025-11-25T07:02:43.085007873Z level=warn source=/go/src/github.com/cilium/cilium/pkg/datapath/loader/compile.go:227 msg=" 1789 |         if (service_port < NODEPORT_PORT_MIN || service_port > NODEPORT_PORT_MAX)" module=agent.datapath.loader
2025-11-25T07:02:43.085205446Z time=2025-11-25T07:02:43.085098574Z level=warn source=/go/src/github.com/cilium/cilium/pkg/datapath/loader/compile.go:227 msg="      |                            ^" module=agent.datapath.loader
2025-11-25T07:02:43.085288416Z time=2025-11-25T07:02:43.085150855Z level=warn source=/go/src/github.com/cilium/cilium/pkg/datapath/loader/compile.go:227 msg="/var/lib/cilium/bpf/lib/lb.h:1789:57: error: use of undeclared identifier 'NODEPORT_PORT_MAX'" module=agent.datapath.loader
2025-11-25T07:02:43.085334837Z time=2025-11-25T07:02:43.085270466Z level=warn source=/go/src/github.com/cilium/cilium/pkg/datapath/loader/compile.go:227 msg=" 1789 |         if (service_port < NODEPORT_PORT_MIN || service_port > NODEPORT_PORT_MAX)" module=agent.datapath.loader
2025-11-25T07:02:43.085454828Z time=2025-11-25T07:02:43.085339147Z level=warn source=/go/src/github.com/cilium/cilium/pkg/datapath/loader/compile.go:227 msg="      |   

@gyutaeb gyutaeb force-pushed the fix/dsr-remote-nodeport branch from 12a9a1c to 2267935 Compare November 25, 2025 07:56
@julianwiedmann
Member

Thank you, @julianwiedmann. The thorough discussions and feedback were invaluable and really helped improve the quality of this patch. I appreciate all your support throughout the process :)

Happy to help, this was not an easy one :).

I think we should try & backport this to v1.19 as well - but maybe let it soak in main for a few weeks. I'll add the necessary labels. Given that this is a bug, we should also add a release note to the PR description that describes the user-visible impact (what misbehavior would a user observe, and what conditions are required to hit it). I've added a placeholder for now, could you give that a try please? You can look at old PRs with kind/bug for inspiration.

@julianwiedmann julianwiedmann added this pull request to the merge queue Feb 9, 2026
@julianwiedmann julianwiedmann added backport/author The backport will be carried out by the author of the PR. needs-backport/1.19 This PR / issue needs backporting to the v1.19 branch labels Feb 9, 2026
Merged via the queue into cilium:main with commit 81166b9 Feb 9, 2026
78 checks passed
@gyutaeb
Member Author

gyutaeb commented Feb 10, 2026

I think we should try & backport this to v1.19 as well - but maybe let it soak in main for a few weeks. I'll add the necessary labels. Given that this is a bug, we should also add a release-note to the PR description that describes the user-visible impact (what misbehavior would a user observe, and what conditions are required to hit it). I've added a place holder for now, could you give that a try please? You can look at old PRs with kind/bug for inspiration.

@julianwiedmann I've updated the PR description with the release note as requested, and added a diagram to illustrate the issue. Please let me know if it looks good! I'll prepare the backport to v1.19 after soak period.

@julianwiedmann
Member

@julianwiedmann I've updated the PR description with the release note as requested, and added a diagram to illustrate the issue. Please let me know if it looks good! I'll prepare the backport to v1.19 after soak period.

Thank you! I think we should also emphasize that this (1) affects in-cluster connectivity (where folks could/should use the ClusterIP), and (2) only happens when SocketLB is disabled.

@gyutaeb
Member Author

gyutaeb commented Feb 10, 2026

Thank you! I think we should also emphasize that this (1) affects in-cluster connectivity (where folks could/should use the ClusterIP), and (2) only happens when SocketLB is disabled.

@julianwiedmann Thanks for the feedback! Updated the release note to clarify that this only affects in-cluster NodePort access (rather than ClusterIP) with SocketLB disabled.

if (!ct_has_egress_entry4(get_ct_map4(&tmp), &tmp)) {
svc = lb4_lookup_wildcard_nodeport_service(&key);
if (svc && !nodeport_uses_dsr4(svc))
Member


From unrelated discussion - we could also check for SVC_FLAG_NODEPORT here, to be absolutely sure of what we're handling. Assuming that the controlplane sets things up as expected.

Maybe something to look into when relaxing the DSR requirement.

julianwiedmann added a commit that referenced this pull request Feb 24, 2026
#41963 added support for wildcard
matching of Nodeport services in bpf_lxc's per-packet LB path. When
accessing a DSR Nodeport service on a remote node, the LB now happens at
the source. This addressed the scenario where client and backend sit on the
same node, and therefore reply traffic would never pass through a native
interface hook for RevDNAT.

Extend this logic to also cover non-DSR Nodeport services. They are not
affected by the mentioned bug (since replies would first go back to the
intermediate LB node) - but doing the LB at the source still provides
benefits. It allows for precise network policy, and saves resources on the
intermediate node.

In detail, the network policy implications are:
1. client egress policy needs to allow access to the service's backends.
   But no longer needs to allow access to remote nodes.
2. backend ingress policy needs to allow access by client pods. But no
   longer needs to allow WORLD for serving in-cluster clients.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann added a commit that referenced this pull request Feb 25, 2026
Contributor

@ldelossa ldelossa left a comment


Nice change. While policy changes will be needed for things to communicate, translation at source fits our other models much better 👍.

Contributor

@ldelossa ldelossa left a comment


LGTM.

@gyutaeb
Member Author

gyutaeb commented Feb 26, 2026

@ldelossa Thanks for the review and the kind words! I'm really glad to hear that this "translation at source" approach aligns well with the broader datapath architecture 👍.

github-merge-queue bot pushed a commit that referenced this pull request Feb 26, 2026
javiercardona-work pushed a commit to javiercardona-work/cilium that referenced this pull request Mar 18, 2026
julianwiedmann added a commit that referenced this pull request Mar 26, 2026
This code was initially introduced as part of the hs-ipcache feature
(#25745,
 #25854). It allowed for the XDP path
to peek into GENEVE-encapsulated traffic, LB it in-place (using DSR-GENEVE
mode), and redirect the packet straight out to the backend.

But with HS-ipcache gone, this code path is pretty much unused. It *might*
have seen some unintentional usage for E/W NodePort access when SocketLB
is disabled - but with #41963, we now
also LB at the source in such configurations. Either way, it's perfectly
fine to remove this optimization.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
@github-actions github-actions bot added the backport-done/1.19 The backport for Cilium 1.19.x for this PR is done. label Mar 26, 2026
julianwiedmann added a commit that referenced this pull request Mar 26, 2026
julianwiedmann added a commit that referenced this pull request Mar 26, 2026
@julianwiedmann julianwiedmann removed the needs-backport/1.19 This PR / issue needs backporting to the v1.19 branch label Mar 27, 2026
github-merge-queue bot pushed a commit that referenced this pull request Mar 31, 2026