
bpf: avoid redundant mac rewrite in common case #42926

Merged
ldelossa merged 1 commit into main from ldelossa/bpf-lxc-redundant-mac-rewrite
Dec 18, 2025

Conversation

@ldelossa
Contributor

@ldelossa ldelossa commented Nov 22, 2025

In bpf_lxc.c there is a redundant destination MAC rewrite.

```c
pass_to_stack:
#ifdef ENABLE_ROUTING
	ret = ipv6_l3(ctx, ETH_HLEN, NULL, (__u8 *)&router_mac.addr, METRIC_EGRESS);
	if (unlikely(ret != CTX_ACT_OK))
		return ret;
#endif
```

After several cups of coffee and internal banter we have determined that this mac rewrite is unnecessary.

ARP responder

In the common case we employ an ARP responder in the eBPF data path.

In Cilium a pod's default route is over the VETH device, which is a layer 2 device.
During an application's egress packet flow, ARP is performed and the packet is queued until ARP resolves.

Cilium will respond to the ARP request and the egress packet will then be dequeued.

```mermaid
flowchart LR

subgraph pod-ns
	direction LR
	pod-->ip_output-->ip_neigh_for_gw-->q["queued for ARP or neigh lookup returned"]
end
```

In the common case where Cilium's ARP responder is in place, the data-path will respond with the MAC of the host-side VETH.

Notice that the queued ARP request and its response occur within the pod's network namespace.
Therefore, we have a valid MAC address the host network namespace is happy to accept during the VETH forwarding handoff.
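The responder's job can be sketched in plain C (an illustrative userspace model, not Cilium's actual eBPF code; the struct layout and names below are assumptions): it answers a request for the gateway IP with the host-side veth MAC, mirroring the requester's addresses into the target fields.

```c
#include <stdint.h>
#include <string.h>

#define ARPOP_REQUEST 1
#define ARPOP_REPLY   2

/* Simplified ARP payload; field names follow RFC 826 conventions. */
struct arp_msg {
	uint16_t op;        /* ARPOP_REQUEST or ARPOP_REPLY */
	uint8_t  sha[6];    /* sender hardware address */
	uint32_t spa;       /* sender protocol address */
	uint8_t  tha[6];    /* target hardware address */
	uint32_t tpa;       /* target protocol address */
};

/* Answer a request the way Cilium's responder does conceptually:
 * claim the asked-for IP with the host-side veth MAC and mirror the
 * requester's addresses into the target fields. */
static void arp_respond(const struct arp_msg *req,
			const uint8_t host_veth_mac[6],
			struct arp_msg *reply)
{
	reply->op = ARPOP_REPLY;
	memcpy(reply->sha, host_veth_mac, 6); /* "that IP is at this MAC" */
	reply->spa = req->tpa;
	memcpy(reply->tha, req->sha, 6);      /* back to the asking pod */
	reply->tpa = req->spa;
}
```

The reply's sender hardware address is exactly the MAC the pod will install in its neighbor table, which is why the later rewrite to the same MAC adds nothing.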

ARP Passthrough

There are scenarios where ARP passthrough is enabled and Cilium is not performing eBPF proxy ARP.

One scenario is CNI chaining.
However, in this scenario RequireRouting is disabled for all endpoints and the above ipv{4,6}_l3 call which rewrites the destination is not made.
When RequireRouting is set to false, the assumed scenario is that the host network namespace is handling ARP; whether that is via proxy ARP or some other mechanism is not of Cilium's concern.

The other scenario where ARP passthrough is enabled is when Netkit mode is enabled.
However, Netkit is a layer 3 device by default (and in Cilium's use case).
This device is a bit unique as it's configured as a layer 2 device with the IFF_NOARP flag set.
This results in layer 2 headers being appended to egress traffic while no ARP is performed.
A ping between Netkit device peers looks as follows.

```
18:15:42.358787 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 98: 10.1.0.1 > 8.8.8.8: ICMP echo request, id 39351, seq 4, length 64
```

So in the Netkit forwarding path any neighbor creation will have the hard-coded all-zeros MAC address, and no ARP is required to resolve it.
The Netkit driver calls eth_skb_pkt_type to set pkt_type in the forwarding driver, prior to running the eBPF program.
Therefore, the packet is already PACKET_HOST when bpf_lxc runs, and there is no need to rewrite the destination MAC if we are going directly to the stack next: we have been accepted at layer 2 and are heading towards layer 3 with PACKET_HOST.
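The classification eth_skb_pkt_type performs can be modelled in a few lines of userspace C (a sketch of the kernel's logic, not the kernel source): pkt_type defaults to PACKET_HOST and is only changed when the destination MAC does not match the device address. On a netkit peer whose address is all zeros, the zeroed destination MAC therefore stays PACKET_HOST.

```c
#include <stdint.h>
#include <string.h>

/* Values mirror linux/if_packet.h. */
#define PACKET_HOST      0
#define PACKET_BROADCAST 1
#define PACKET_MULTICAST 2
#define PACKET_OTHERHOST 3

/* Userspace model of the kernel's eth_skb_pkt_type() logic:
 * an skb starts out as PACKET_HOST and is only reclassified when
 * the destination MAC differs from the receiving device's address. */
static int eth_pkt_type(const uint8_t dest[6], const uint8_t dev_addr[6])
{
	static const uint8_t bcast[6] = {0xff,0xff,0xff,0xff,0xff,0xff};

	if (memcmp(dest, dev_addr, 6) != 0) {
		if (dest[0] & 1) { /* group bit set: multicast space */
			if (memcmp(dest, bcast, 6) == 0)
				return PACKET_BROADCAST;
			return PACKET_MULTICAST;
		}
		return PACKET_OTHERHOST;
	}
	return PACKET_HOST;
}
```

A zeroed destination against a zeroed device address compares equal, so the netkit packet above is already PACKET_HOST before any eBPF program runs.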

Submission to stack

The redundant destination MAC rewrite in focus is performed only when:

  1. ENABLE_ROUTING is defined
    1. defined by default
    2. undefined for CNI chaining mode and EnableEndpointRoutes=true
  2. The destination is a remote pod
  3. We are immediately heading to the stack next

Consider that by the time we even evaluate this destination MAC rewrite, eth_type_trans has already been called from the receive path of the ingress device.
Therefore, even if we rewrite the destination MAC at this point it has no effect: the pkt_type is already set and we are going directly into layer 3 processing regardless, where the layer 2 header is of no consequence.

Redundant TTL Dec

The ipv{4,6}_l3 functions are also doing a TTL decrement.
However, this is also redundant.
When the packet is submitted to the stack, if the host stack decides it is routable, it will perform its own TTL decrement in the ip_forward path.
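For reference, that decrement is the classic incremental ones'-complement checksum update. Below is a userspace sketch mirroring the kernel's ip_decrease_ttl() (header layout simplified, assuming no IP options), with a full checksum routine only to verify the patch:

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Minimal 20-byte IPv4 header (assumption: no options). */
struct iphdr_min {
	uint8_t  ver_ihl;
	uint8_t  tos;
	uint16_t tot_len;
	uint16_t id;
	uint16_t frag_off;
	uint8_t  ttl;
	uint8_t  protocol;
	uint16_t check;
	uint32_t saddr;
	uint32_t daddr;
};

/* Full ones'-complement checksum over the header, for verification. */
static uint16_t ip_csum(const struct iphdr_min *iph)
{
	const uint16_t *p = (const uint16_t *)iph;
	uint32_t sum = 0;

	for (int i = 0; i < 10; i++)
		sum += p[i];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Mirrors the kernel's ip_decrease_ttl(): patch the checksum
 * incrementally instead of recomputing it, then drop the TTL. */
static void ip_decrease_ttl(struct iphdr_min *iph)
{
	uint32_t check = iph->check;

	check += htons(0x0100); /* TTL is the high byte of its 16-bit word */
	iph->check = (uint16_t)(check + (check >= 0xffff));
	iph->ttl--;
}
```

Because ip_forward applies this itself, an earlier decrement in the eBPF path buys nothing on the way into the stack.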

Summary

From all perspectives the final ipv{4,6}_l3 call in bpf_lxc appears redundant.
If the packet is immediately going to the stack, both the TTL decrement and the destination MAC rewrite are of no consequence to the next layer 3 processing path.

The genesis of this rewrite is from commit:

```
commit 99108598251cf4db382e6e6e0cef9d76a47a93ca
Author: Thomas Graf <thomas@cilium.io>
Date:   Thu Jan 28 00:19:17 2016 +0100

    Use the MAC of the host facing veth end as router MAC

    Signed-off-by: Thomas Graf <thomas@cilium.io>
```

In the above commit the original destination MAC rewrite was put in place.
This was redundant even at the genesis for the reasons mentioned above.

bpf: avoid redundant mac rewrite in bpf_lxc egress path

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 22, 2025
@ldelossa ldelossa force-pushed the ldelossa/bpf-lxc-redundant-mac-rewrite branch 4 times, most recently from 6b82b23 to bf0bbea Compare November 22, 2025 02:52
@ldelossa
Contributor Author

/ci-eks

@ldelossa
Contributor Author

/test

@ldelossa
Contributor Author

/ci-gke

1 similar comment
@ldelossa
Contributor Author

/ci-gke

@ldelossa ldelossa added the release-note/misc This PR makes changes that have no direct user impact. label Nov 25, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 25, 2025
@ldelossa ldelossa marked this pull request as ready for review November 25, 2025 23:28
@ldelossa ldelossa requested a review from a team as a code owner November 25, 2025 23:28
@julianwiedmann
Member

I have way too little context on how all these ARP paths are meant to work, sorry. But it looks like this PR will help to define the expected behavior a bit more, looking forward to it :)

@julianwiedmann julianwiedmann removed their request for review December 3, 2025 08:29
@ldelossa
Contributor Author

ldelossa commented Dec 3, 2025

@julianwiedmann yup, no worries, going to talk about this a bit out of band with @borkmann soon

@ldelossa ldelossa force-pushed the ldelossa/bpf-lxc-redundant-mac-rewrite branch 2 times, most recently from 720f3e5 to 7baba9f Compare December 10, 2025 01:14
@ldelossa
Contributor Author

/ci-eks

@ldelossa
Contributor Author

/ci-gke

@ldelossa
Contributor Author

/test

@ldelossa ldelossa added the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Dec 10, 2025
@ldelossa
Contributor Author

Marking as don't merge until I take a look at: #43226

After several cups of coffee and internal banter we have determined that this
mac rewrite is unnecessary.

The destination mac rewrites occur just prior to being submitted to the
stack. The host network namespace will perform layer 3 routing and
rewrite the mac subsequently regardless of our original rewrite. The
layer 3 path also decrements the TTL, making this redundant as well.

Additionally, in both Veth and Netkit forwarding paths
`eth_type_trans()` and its sibling `eth_skb_pkt_type` are called,
respectively, to set the packet's type PRIOR to our eBPF program running, so
any modification to the destination MAC in the effort to ensure packet reception
into layer 3 does not make sense.

The genesis of this redirect is from commit:
```
commit 9910859
Author: Thomas Graf <thomas@cilium.io>
Date:   Thu Jan 28 00:19:17 2016 +0100

    Use the MAC of the host facing veth end as router MAC
```

In the above commit the original destination MAC rewrite was put in place.
This was redundant even at the genesis for the reasons mentioned above.

Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
@ldelossa ldelossa force-pushed the ldelossa/bpf-lxc-redundant-mac-rewrite branch from 7baba9f to b143cb2 Compare December 16, 2025 16:15
@ldelossa
Contributor Author

rebased taking #43226 into account.

@ldelossa ldelossa removed the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Dec 16, 2025
@ldelossa
Contributor Author

/test

@ldelossa
Contributor Author

/ci-eks

@ldelossa
Contributor Author

/ci-gke

Member

@borkmann borkmann left a comment


this lgtm!

@ldelossa ldelossa added this pull request to the merge queue Dec 18, 2025
Merged via the queue into main with commit fdf020e Dec 18, 2025
456 of 471 checks passed
@ldelossa ldelossa deleted the ldelossa/bpf-lxc-redundant-mac-rewrite branch December 18, 2025 14:13
@julianwiedmann
Copy link
Copy Markdown
Member

amazing archaeology, thank you @ldelossa 🚀

@cilium-release-bot cilium-release-bot bot moved this to Released in cilium v1.19.0 Feb 3, 2026
smagnani96 added a commit that referenced this pull request Mar 16, 2026
[ upstream commit ea17ed8 ]

[ backporter's notes:
  * using macro CILIUM_HOST_IFINDEX instead of CONFIG(cilium_host_ifindex) as it was not backported.
  * fixed conflicts in bpf_lxc.c, as #43226, #42926 and others haven't been backported.
]

In commit d1d8e7a ("datapath: Add support for re-entering LXC egress path after L7 LB"),
we enabled support for re-entering the LXC egress path after a packet is
processed by a L7 LB. This allows us to correctly apply egress policies to
packets sent by a L7 LB to local backends.

Without per-endpoint routes, the pod policies are checked from the tail
call in bpf_host. We reach this code from cilium_host (ctx->ifindex).

With per-endpoint routes, the packet is handled directly in the
local backend pod `cil_to_container` path, completely skipping ingress
policies. Returning CTX_ACT_OK at this point would bypass policies.
To fix this, we need to hairpin the packet back to cil_to_container
(the ctx->mark is already cleared at the beginning of the codepath).
This would allow us to correctly apply ingress policies on the local
backend ingress path, restoring the expected behavior.

Fixes: d1d8e7a ("datapath: Add support for re-entering LXC egress path after L7 LB")

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>