bpf: nodeport: forward L7 svc traffic straight to proxy #36383

Merged
julianwiedmann merged 1 commit into cilium:main from julianwiedmann:1.17-bpf-redirect-2 on Dec 6, 2024

@julianwiedmann
Member

L7 svc requests are currently redirected to cilium_net / cilium_host, where they get marked with MARK_MAGIC_TO_PROXY and then get rerouted to the proxy.

Simplify this by marking the packet in from-netdev and from-overlay, and letting it pass up the stack. This avoids the dependency on skb->cb surviving the transfer from the native program to cilium_net.

The BPF TPROXY path needs further investigation, so only apply this change when BPF TPROXY is disabled.
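
A minimal sketch of the decision described above, written as plain userspace C rather than BPF so it compiles on its own. The struct, enum, and function names are hypothetical, and both the numeric value of `MARK_MAGIC_TO_PROXY` and the way the proxy port is packed into the mark are placeholders for illustration, not taken from this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder value for illustration; the real one lives in the datapath headers. */
#define MARK_MAGIC_TO_PROXY 0x0200u

/* Hypothetical stand-in for the packet context (skb). */
struct pkt {
	uint32_t mark;
};

enum l7_verdict {
	L7_PASS_TO_STACK,    /* let the packet continue up the stack */
	L7_REDIRECT_HAIRPIN, /* redirect to cilium_net / cilium_host, which marks it */
};

/* Decide how from-netdev / from-overlay hands an L7 svc request to the proxy. */
static enum l7_verdict l7_svc_to_proxy(struct pkt *p, uint16_t proxy_port,
				       bool bpf_tproxy_enabled)
{
	assert(p != NULL);

	/* BPF TPROXY keeps the existing redirect path. */
	if (bpf_tproxy_enabled)
		return L7_REDIRECT_HAIRPIN;

	/* Mark in place. Assumed encoding: magic in the low bits, port in the upper 16. */
	p->mark = MARK_MAGIC_TO_PROXY | ((uint32_t)proxy_port << 16);
	return L7_PASS_TO_STACK;
}
```

Because the mark is set before the packet leaves the native program, nothing has to be carried across to cilium_net in skb->cb.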

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Dec 5, 2024
julianwiedmann added labels on Dec 5, 2024:
- area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
- area/proxy: Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers.
- release-note/misc: This PR makes changes that have no direct user impact.
- area/loadbalancing: Impacts load-balancing and Kubernetes service implementations
maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label on Dec 5, 2024
@julianwiedmann
Member Author

/test

julianwiedmann added the dont-merge/preview-only label (Only for preview or testing, don't merge it.) on Dec 5, 2024
julianwiedmann marked this pull request as ready for review on December 5, 2024 10:17
julianwiedmann requested a review from a team as a code owner on December 5, 2024 10:17
@julianwiedmann
Member Author

I'm wondering whether this changes anything from a netfilter-visibility / "asymmetric path" perspective ...

@jschwinger233 (Member) left a comment

So the original "from_netdev/from_overlay -> cilium_net -> cilium_host -> stack -> tproxy hijack" is shortened to "from_netdev/from_overlay -> stack -> tproxy hijack".

Left a comment to understand if it's a missing part or intentional design.

L7 svc requests are currently redirected to cilium_net / cilium_host,
where they get marked with MARK_MAGIC_TO_PROXY and then get rerouted to
the proxy.

Simplify this by marking the packet in from-netdev and from-overlay, and
letting it pass up the stack. This avoids the dependency on skb->cb
surviving the transfer from the native program to cilium_net.

The BPF TPROXY path needs further investigation, so only apply this change
when BPF TPROXY is disabled.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann removed the dont-merge/preview-only label (Only for preview or testing, don't merge it.) on Dec 6, 2024
@julianwiedmann
Member Author

/test

maintainer-s-little-helper bot added the ready-to-merge label (This PR has passed all tests and received consensus from code owners to merge.) on Dec 6, 2024
julianwiedmann added this pull request to the merge queue on Dec 6, 2024
Merged via the queue into cilium:main with commit 1c7f706 on Dec 6, 2024
julianwiedmann deleted the 1.17-bpf-redirect-2 branch on December 6, 2024 10:14
smagnani96 added a commit that referenced this pull request Mar 6, 2026
Fix L7 proxy traffic disruption (DROP_REASON_NOSOCKET) on nodes where
the BPF program is attached to a Linux bridge device and the
br_netfilter module is loaded with net.bridge.bridge-nf-call-iptables=1.

With #36383, we changed the non-BPF-TPROXY L7 LB redirect path to mark
packets with MARK_MAGIC_TO_PROXY and punt them to the stack, relying on
an iptables TPROXY rule in PREROUTING to deliver the packet to the
proxy. This optimization avoids a hairpin redirect through cilium_net.

However, when br_netfilter is active on a bridge device, the kernel's
ip_sabotage_in() function interferes:

1. Packet arrives on bridge port -> br_netfilter evaluates PREROUTING
   (1st time, BPF hasn't run yet -> TPROXY rule doesn't match)
2. ip_sabotage_in() sets NF_HOOK_STATE_SABOTAGED on the packet
3. tc-ingress BPF runs, sets MARK_MAGIC_TO_PROXY, punts to stack
4. IP stack calls PREROUTING again, but ip_sabotage causes it to be
   SKIPPED entirely
5. TPROXY rule never fires -> no listening socket on VIP ->
   DROP_REASON_NOSOCKET

In previous commits, we added a datapath config variable so that each
program is aware of its attached link type. With this, if the device is
a bridge, the L7 LB path falls back to the pre-#36383 hairpin redirect
via cilium_net, bypassing the punt-to-stack path that ip_sabotage would break.

Non-bridge devices continue using the optimized punt-to-stack path.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
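
The five steps in the commit message above can be modelled roughly as follows. This is a userspace simulation, not kernel or BPF code: the struct, the `prerouting_sabotaged` flag, the mark value, the mask, and the port number are all hypothetical stand-ins chosen only to make the ordering problem visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MARK_MAGIC_TO_PROXY 0x0200u /* placeholder value for illustration */

/* Hypothetical model of the per-packet state that matters here. */
struct sim_pkt {
	uint32_t mark;
	bool prerouting_sabotaged; /* set in step 2, turns step 4 into a no-op */
	bool tproxy_matched;       /* true once the TPROXY rule has fired */
};

enum sim_outcome { SIM_TO_PROXY, SIM_DROP_NOSOCKET };

/* Model of the iptables TPROXY rule in PREROUTING: it only matches marked packets. */
static void prerouting(struct sim_pkt *p)
{
	assert(p != NULL);
	if ((p->mark & 0xffffu) == MARK_MAGIC_TO_PROXY)
		p->tproxy_matched = true;
}

static enum sim_outcome deliver_l7(bool bridge_with_br_netfilter)
{
	struct sim_pkt p = { 0 };

	if (bridge_with_br_netfilter) {
		prerouting(&p);                /* step 1: mark not set yet, no match */
		p.prerouting_sabotaged = true; /* step 2 */
	}

	/* Step 3: tc-ingress BPF marks the packet (the port is arbitrary). */
	p.mark = MARK_MAGIC_TO_PROXY | (10000u << 16);

	if (!p.prerouting_sabotaged)
		prerouting(&p);                /* step 4: skipped when sabotaged */

	/* Step 5: without a TPROXY match there is no socket to deliver to. */
	return p.tproxy_matched ? SIM_TO_PROXY : SIM_DROP_NOSOCKET;
}
```

In this model `deliver_l7(false)` reaches the proxy, while `deliver_l7(true)` drops: the only PREROUTING pass happens before the mark exists.
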
smagnani96 added a commit that referenced this pull request Mar 10, 2026
[ upstream commit 0ba47cd ]

[ backporter's notes:
  * Added bits into bpf/tests/pktgen.h without backporting whole unrelated PRs
  * adjusted LB4_SERVICES_MAP_V2, LB4_REVERSE_NAT_MAP, and their respective
    IPv6 versions references, different than in upstream
  * added `ENABLE_SERVICE_PROTOCOL_DIFFERENTIATION` in the new test,
    alongside with some different infra to tail call the ingress program,
    and macros to check redirect interface
]

Fix L7 proxy traffic disruption (DROP_REASON_NOSOCKET) on nodes where
the BPF program is attached to a Linux bridge device and the
br_netfilter module is loaded with net.bridge.bridge-nf-call-iptables=1.

With #36383, we changed the non-BPF-TPROXY L7 LB redirect path to mark
packets with MARK_MAGIC_TO_PROXY and punt them to the stack, relying on
an iptables TPROXY rule in PREROUTING to deliver the packet to the
proxy. This optimization avoids a hairpin redirect through cilium_net.

However, when br_netfilter is active on a bridge device, the kernel's
ip_sabotage_in() function interferes:

1. Packet arrives on bridge port -> br_netfilter evaluates PREROUTING
   (1st time, BPF hasn't run yet -> TPROXY rule doesn't match)
2. ip_sabotage_in() sets NF_HOOK_STATE_SABOTAGED on the packet
3. tc-ingress BPF runs, sets MARK_MAGIC_TO_PROXY, punts to stack
4. IP stack calls PREROUTING again, but ip_sabotage causes it to be
   SKIPPED entirely
5. TPROXY rule never fires -> no listening socket on VIP ->
   DROP_REASON_NOSOCKET

In previous commits, we added a datapath config variable so that each
program is aware of its attached link type. With this, if the device is
a bridge, the L7 LB path falls back to the pre-#36383 hairpin redirect
via cilium_net, bypassing the punt-to-stack path that ip_sabotage would break.

Non-bridge devices continue using the optimized punt-to-stack path.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
smagnani96 added a commit that referenced this pull request Mar 10, 2026
Fix L7 proxy traffic disruption (DROP_REASON_NOSOCKET) on nodes where
the BPF program is attached to a Linux bridge device and the
br_netfilter module is loaded with net.bridge.bridge-nf-call-iptables=1.

With #36383, we changed the non-BPF-TPROXY L7 LB redirect path to mark
packets with MARK_MAGIC_TO_PROXY and punt them to the stack, relying on
an iptables TPROXY rule in PREROUTING to deliver the packet to the
proxy. This optimization avoids a hairpin redirect through cilium_net.

However, when br_netfilter is active on a bridge device, the kernel's
ip_sabotage_in() function interferes:

1. Packet arrives on bridge port -> br_netfilter evaluates PREROUTING
   (1st time, BPF hasn't run yet -> TPROXY rule doesn't match)
2. ip_sabotage_in() sets NF_HOOK_STATE_SABOTAGED on the packet
3. tc-ingress BPF runs, sets MARK_MAGIC_TO_PROXY, punts to stack
4. IP stack calls PREROUTING again, but ip_sabotage causes it to be
   SKIPPED entirely
5. TPROXY rule never fires -> no listening socket on VIP ->
   DROP_REASON_NOSOCKET

We therefore add a new datapath config variable that will be set to true
only when compiling the Netdev datapath program (cil_{from,to}_netdev)
and the network device is a bridge. This allows us to fall back to the
pre-#36383 hairpin redirect via cilium_net for L7 LB traffic, bypassing the
ip_sabotage_in() function and allowing the TPROXY rule to match as expected.

Non-bridge devices continue using the optimized punt-to-stack path.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
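
A rough sketch of how such a config variable could gate the path selection, again as standalone C. The names (`prog_kind`, `datapath_config`, `attached_to_bridge`, `l7_lb_path`) are hypothetical and do not reflect the actual identifiers in the commit; only the conditions come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical program kinds; only the netdev program can see the flag set. */
enum prog_kind { PROG_NETDEV, PROG_OVERLAY, PROG_OTHER };

/* Hypothetical per-program config, filled in when the program is built and loaded. */
struct datapath_config {
	bool attached_to_bridge;
};

static struct datapath_config make_config(enum prog_kind kind, bool link_is_bridge)
{
	struct datapath_config cfg = {
		/* True only for the netdev program on a bridge device. */
		.attached_to_bridge = (kind == PROG_NETDEV) && link_is_bridge,
	};
	return cfg;
}

enum l7_path { L7_PUNT_TO_STACK, L7_HAIRPIN_VIA_CILIUM_NET };

/* Pick the L7 LB redirect path for one packet. */
static enum l7_path l7_lb_path(const struct datapath_config *cfg,
			       bool bpf_tproxy_enabled)
{
	assert(cfg != NULL);

	/* BPF TPROXY never used the punt path; bridges now fall back as well. */
	if (bpf_tproxy_enabled || cfg->attached_to_bridge)
		return L7_HAIRPIN_VIA_CILIUM_NET;

	return L7_PUNT_TO_STACK;
}
```

Keeping the flag false for everything except the netdev program on a bridge means all other attachments stay on the punt-to-stack path.
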
smagnani96 added a commit that referenced this pull request Mar 10, 2026
[ upstream commit 473a9e6 ]

[ backporter's notes: fixed conflicts:
  * adapting to the old loader implementation, adding changes only to
    the Netdev program attached to network devices.
  * added bits into bpf/tests/pktgen.h without backporting whole unrelated PRs
  * adjusted LB4_SERVICES_MAP_V2, LB4_REVERSE_NAT_MAP, and their respective
    IPv6 versions references, different than in upstream
  * added `ENABLE_SERVICE_PROTOCOL_DIFFERENTIATION` in the new test,
    alongside with some different infra to tail call the ingress program,
    and macros to check redirect interface
  * changed check in bpf/tests/l7_lb_hairpin.c to check for HOST_IFINDEX
    instead of cilium_net, as the latter is not used in this backport
]

Fix L7 proxy traffic disruption (DROP_REASON_NOSOCKET) on nodes where
the BPF program is attached to a Linux bridge device and the
br_netfilter module is loaded with net.bridge.bridge-nf-call-iptables=1.

With #36383, we changed the non-BPF-TPROXY L7 LB redirect path to mark
packets with MARK_MAGIC_TO_PROXY and punt them to the stack, relying on
an iptables TPROXY rule in PREROUTING to deliver the packet to the
proxy. This optimization avoids a hairpin redirect through cilium_net.

However, when br_netfilter is active on a bridge device, the kernel's
ip_sabotage_in() function interferes:

1. Packet arrives on bridge port -> br_netfilter evaluates PREROUTING
   (1st time, BPF hasn't run yet -> TPROXY rule doesn't match)
2. ip_sabotage_in() sets NF_HOOK_STATE_SABOTAGED on the packet
3. tc-ingress BPF runs, sets MARK_MAGIC_TO_PROXY, punts to stack
4. IP stack calls PREROUTING again, but ip_sabotage causes it to be
   SKIPPED entirely
5. TPROXY rule never fires -> no listening socket on VIP ->
   DROP_REASON_NOSOCKET

We therefore add a new datapath config variable that will be set to true
only when compiling the Netdev datapath program (cil_{from,to}_netdev)
and the network device is a bridge. This allows us to fall back to the
pre-#36383 hairpin redirect via cilium_net for L7 LB traffic, bypassing the
ip_sabotage_in() function and allowing the TPROXY rule to match as expected.

Non-bridge devices continue using the optimized punt-to-stack path.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Mar 12, 2026
Fix L7 proxy traffic disruption (DROP_REASON_NOSOCKET) on nodes where
the BPF program is attached to a Linux bridge device and the
br_netfilter module is loaded with net.bridge.bridge-nf-call-iptables=1.

With #36383, we changed the non-BPF-TPROXY L7 LB redirect path to mark
packets with MARK_MAGIC_TO_PROXY and punt them to the stack, relying on
an iptables TPROXY rule in PREROUTING to deliver the packet to the
proxy. This optimization avoids a hairpin redirect through cilium_net.

However, when br_netfilter is active on a bridge device, the kernel's
ip_sabotage_in() function interferes:

1. Packet arrives on bridge port -> br_netfilter evaluates PREROUTING
   (1st time, BPF hasn't run yet -> TPROXY rule doesn't match)
2. ip_sabotage_in() sets NF_HOOK_STATE_SABOTAGED on the packet
3. tc-ingress BPF runs, sets MARK_MAGIC_TO_PROXY, punts to stack
4. IP stack calls PREROUTING again, but ip_sabotage causes it to be
   SKIPPED entirely
5. TPROXY rule never fires -> no listening socket on VIP ->
   DROP_REASON_NOSOCKET

We therefore add a new datapath config variable that will be set to true
only when compiling the Netdev datapath program (cil_{from,to}_netdev)
and the network device is a bridge. This allows us to fall back to the
pre-#36383 hairpin redirect via cilium_net for L7 LB traffic, bypassing the
ip_sabotage_in() function and allowing the TPROXY rule to match as expected.

Non-bridge devices continue using the optimized punt-to-stack path.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
smagnani96 added a commit that referenced this pull request Mar 12, 2026
[ upstream commit 52f35a9 ]

[ backporter's notes: fixed conflicts:
  * adapting to the old loader implementation, adding changes only to
    the Netdev program attached to network devices.
  * added bits into bpf/tests/pktgen.h without backporting whole unrelated PRs
  * adjusted LB4_SERVICES_MAP_V2, LB4_REVERSE_NAT_MAP, and their respective
    IPv6 versions references, different than in upstream
  * added `ENABLE_SERVICE_PROTOCOL_DIFFERENTIATION` in the new test,
    alongside with some different infra to tail call the ingress program,
    and macros to check redirect interface
  * changed check in bpf/tests/l7_lb_hairpin.c to check for HOST_IFINDEX
    instead of cilium_net, as the latter is not used in this backport
]

Fix L7 proxy traffic disruption (DROP_REASON_NOSOCKET) on nodes where
the BPF program is attached to a Linux bridge device and the
br_netfilter module is loaded with net.bridge.bridge-nf-call-iptables=1.

With #36383, we changed the non-BPF-TPROXY L7 LB redirect path to mark
packets with MARK_MAGIC_TO_PROXY and punt them to the stack, relying on
an iptables TPROXY rule in PREROUTING to deliver the packet to the
proxy. This optimization avoids a hairpin redirect through cilium_net.

However, when br_netfilter is active on a bridge device, the kernel's
ip_sabotage_in() function interferes:

1. Packet arrives on bridge port -> br_netfilter evaluates PREROUTING
   (1st time, BPF hasn't run yet -> TPROXY rule doesn't match)
2. ip_sabotage_in() sets NF_HOOK_STATE_SABOTAGED on the packet
3. tc-ingress BPF runs, sets MARK_MAGIC_TO_PROXY, punts to stack
4. IP stack calls PREROUTING again, but ip_sabotage causes it to be
   SKIPPED entirely
5. TPROXY rule never fires -> no listening socket on VIP ->
   DROP_REASON_NOSOCKET

We therefore add a new datapath config variable that will be set to true
only when compiling the Netdev datapath program (cil_{from,to}_netdev)
and the network device is a bridge. This allows us to fall back to the
pre-#36383 hairpin redirect via cilium_net for L7 LB traffic, bypassing the
ip_sabotage_in() function and allowing the TPROXY rule to match as expected.

Non-bridge devices continue using the optimized punt-to-stack path.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
smagnani96 added a commit that referenced this pull request Mar 12, 2026
[ upstream commit 52f35a9 ]

[ backporter's notes: fixed conflicts:
  * adapting to the old loader implementation, adding changes only to
    the Netdev program attached to network devices.
  * added bits into bpf/tests/pktgen.h without backporting whole unrelated PRs
  * changed check in bpf/tests/l7_lb_hairpin.c to check for CILIUM_NET_IFINDEX
    instead of CONFIG(cilium_net_ifindex), as the latter is not used in this backport
]

Fix L7 proxy traffic disruption (DROP_REASON_NOSOCKET) on nodes where
the BPF program is attached to a Linux bridge device and the
br_netfilter module is loaded with net.bridge.bridge-nf-call-iptables=1.

With #36383, we changed the non-BPF-TPROXY L7 LB redirect path to mark
packets with MARK_MAGIC_TO_PROXY and punt them to the stack, relying on
an iptables TPROXY rule in PREROUTING to deliver the packet to the
proxy. This optimization avoids a hairpin redirect through cilium_net.

However, when br_netfilter is active on a bridge device, the kernel's
ip_sabotage_in() function interferes:

1. Packet arrives on bridge port -> br_netfilter evaluates PREROUTING
   (1st time, BPF hasn't run yet -> TPROXY rule doesn't match)
2. ip_sabotage_in() sets NF_HOOK_STATE_SABOTAGED on the packet
3. tc-ingress BPF runs, sets MARK_MAGIC_TO_PROXY, punts to stack
4. IP stack calls PREROUTING again, but ip_sabotage causes it to be
   SKIPPED entirely
5. TPROXY rule never fires -> no listening socket on VIP ->
   DROP_REASON_NOSOCKET

We therefore add a new datapath config variable that will be set to true
only when compiling the Netdev datapath program (cil_{from,to}_netdev)
and the network device is a bridge. This allows us to fall back to the
pre-#36383 hairpin redirect via cilium_net for L7 LB traffic, bypassing the
ip_sabotage_in() function and allowing the TPROXY rule to match as expected.

Non-bridge devices continue using the optimized punt-to-stack path.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
smagnani96 added a commit that referenced this pull request Mar 12, 2026
[ upstream commit 52f35a9 ]

[ backporter's notes: fixed conflicts:
  * adapting to the old loader implementation, adding changes only to
    the Netdev program attached to network devices.
  * changed check in bpf/tests/l7_lb_hairpin.c to check for CILIUM_NET_IFINDEX
    instead of CONFIG(cilium_net_ifindex), as the latter is not used in this backport
]

Fix L7 proxy traffic disruption (DROP_REASON_NOSOCKET) on nodes where
the BPF program is attached to a Linux bridge device and the
br_netfilter module is loaded with net.bridge.bridge-nf-call-iptables=1.

With #36383, we changed the non-BPF-TPROXY L7 LB redirect path to mark
packets with MARK_MAGIC_TO_PROXY and punt them to the stack, relying on
an iptables TPROXY rule in PREROUTING to deliver the packet to the
proxy. This optimization avoids a hairpin redirect through cilium_net.

However, when br_netfilter is active on a bridge device, the kernel's
ip_sabotage_in() function interferes:

1. Packet arrives on bridge port -> br_netfilter evaluates PREROUTING
   (1st time, BPF hasn't run yet -> TPROXY rule doesn't match)
2. ip_sabotage_in() sets NF_HOOK_STATE_SABOTAGED on the packet
3. tc-ingress BPF runs, sets MARK_MAGIC_TO_PROXY, punts to stack
4. IP stack calls PREROUTING again, but ip_sabotage causes it to be
   SKIPPED entirely
5. TPROXY rule never fires -> no listening socket on VIP ->
   DROP_REASON_NOSOCKET

We therefore add a new datapath config variable that will be set to true
only when compiling the Netdev datapath program (cil_{from,to}_netdev)
and the network device is a bridge. This allows us to fall back to the
pre-#36383 hairpin redirect via cilium_net for L7 LB traffic, bypassing the
ip_sabotage_in() function and allowing the TPROXY rule to match as expected.

Non-bridge devices continue using the optimized punt-to-stack path.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Mar 13, 2026
[ upstream commit 52f35a9 ]

[ backporter's notes: fixed conflicts:
  * adapting to the old loader implementation, adding changes only to
    the Netdev program attached to network devices.
  * changed check in bpf/tests/l7_lb_hairpin.c to check for CILIUM_NET_IFINDEX
    instead of CONFIG(cilium_net_ifindex), as the latter is not used in this backport
]

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
julianwiedmann pushed a commit that referenced this pull request Mar 13, 2026
[ upstream commit 52f35a9 ]

[ backporter's notes: fixed conflicts:
  * adapting to the old loader implementation, adding changes only to
    the Netdev program attached to network devices.
  * added bits into bpf/tests/pktgen.h without backporting whole unrelated PRs
  * changed check in bpf/tests/l7_lb_hairpin.c to check for CILIUM_NET_IFINDEX
    instead of CONFIG(cilium_net_ifindex), as the latter is not used in this backport
]

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
julianwiedmann pushed a commit that referenced this pull request Mar 13, 2026
[ upstream commit 52f35a9 ]

[ backporter's notes: fixed conflicts:
  * adapting to the old loader implementation, adding changes only to
    the Netdev program attached to network devices.
  * added bits into bpf/tests/pktgen.h without backporting whole unrelated PRs
  * adjusted references to LB4_SERVICES_MAP_V2, LB4_REVERSE_NAT_MAP, and
    their IPv6 counterparts, which differ from upstream
  * added `ENABLE_SERVICE_PROTOCOL_DIFFERENTIATION` in the new test,
    along with some different infra to tail call the ingress program,
    and macros to check the redirect interface
  * changed check in bpf/tests/l7_lb_hairpin.c to check for HOST_IFINDEX
    instead of cilium_net, as the latter is not used in this backport
]

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Mar 13, 2026
[ upstream commit 52f35a9 ]

[ backporter's notes: fixed conflicts:
  * adapting to the old loader implementation, adding changes only to
    the Netdev program attached to network devices.
  * added bits into bpf/tests/pktgen.h without backporting whole unrelated PRs
  * changed check in bpf/tests/l7_lb_hairpin.c to check for CILIUM_NET_IFINDEX
    instead of CONFIG(cilium_net_ifindex), as the latter is not used in this backport
]

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Mar 13, 2026
[ upstream commit 52f35a9 ]

[ backporter's notes: fixed conflicts:
  * adapting to the old loader implementation, adding changes only to
    the Netdev program attached to network devices.
  * added bits into bpf/tests/pktgen.h without backporting whole unrelated PRs
  * adjusted references to LB4_SERVICES_MAP_V2, LB4_REVERSE_NAT_MAP, and
    their IPv6 counterparts, which differ from upstream
  * added `ENABLE_SERVICE_PROTOCOL_DIFFERENTIATION` in the new test,
    along with some different infra to tail call the ingress program,
    and macros to check the redirect interface
  * changed check in bpf/tests/l7_lb_hairpin.c to check for HOST_IFINDEX
    instead of cilium_net, as the latter is not used in this backport
]

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
javiercardona-work pushed a commit to javiercardona-work/cilium that referenced this pull request Mar 18, 2026
Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
Labels

area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/loadbalancing Impacts load-balancing and Kubernetes service implementations area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
