
bpf: avoid redundant mac rewrite in common case #42926

Merged
ldelossa merged 1 commit into main from ldelossa/bpf-lxc-redundant-mac-rewrite
Dec 18, 2025

Conversation

@ldelossa
Contributor

@ldelossa ldelossa commented Nov 22, 2025

In bpf_lxc.c there is a redundant destination MAC rewrite.

```c
pass_to_stack:
#ifdef ENABLE_ROUTING
	ret = ipv6_l3(ctx, ETH_HLEN, NULL, (__u8 *)&router_mac.addr, METRIC_EGRESS);
	if (unlikely(ret != CTX_ACT_OK))
		return ret;
#endif
```

After several cups of coffee and internal banter we have determined that this mac rewrite is unnecessary.

ARP responder

In the common case we employ an ARP responder in the eBPF data path.

In Cilium a pod's default route is over the VETH device, which is a layer 2 device.
During an application's egress packet flow, ARP is performed and the packet is queued until ARP resolves.

Cilium will respond to the ARP request and the egress packet will then be dequeued.

```mermaid
flowchart LR

subgraph pod-ns
	direction LR
	pod-->ip_output-->ip_neigh_for_gw-->q["queued for ARP or neigh lookup returned"]
end
```

In the common case where Cilium's ARP responder is in place, the data-path will respond with the MAC of the host-side VETH.

Notice that the queued ARP request and its response occur within the pod's network namespace.
Therefore, we have a valid MAC address the host network namespace is happy to accept during the VETH forwarding handoff.
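The responder's job can be sketched in plain C (an illustrative userspace model, not Cilium's actual eBPF code; the struct layout and names below are assumptions): it answers a request for the gateway IP with the host-side veth MAC, mirroring the requester's addresses into the target fields.

```c
#include <stdint.h>
#include <string.h>

#define ARPOP_REQUEST 1
#define ARPOP_REPLY   2

/* Simplified ARP payload; field names follow RFC 826 conventions. */
struct arp_msg {
	uint16_t op;        /* ARPOP_REQUEST or ARPOP_REPLY */
	uint8_t  sha[6];    /* sender hardware address */
	uint32_t spa;       /* sender protocol address */
	uint8_t  tha[6];    /* target hardware address */
	uint32_t tpa;       /* target protocol address */
};

/* Answer a request the way Cilium's responder does conceptually:
 * claim the asked-for IP with the host-side veth MAC and mirror the
 * requester's addresses into the target fields. */
static void arp_respond(const struct arp_msg *req,
			const uint8_t host_veth_mac[6],
			struct arp_msg *reply)
{
	reply->op = ARPOP_REPLY;
	memcpy(reply->sha, host_veth_mac, 6); /* "that IP is at this MAC" */
	reply->spa = req->tpa;
	memcpy(reply->tha, req->sha, 6);      /* back to the asking pod */
	reply->tpa = req->spa;
}
```

The reply's sender hardware address is exactly the MAC the pod will install in its neighbor table, which is why the later rewrite to the same MAC adds nothing.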

ARP Passthrough

There are scenarios where ARP passthrough is enabled and Cilium is not performing eBPF proxy ARP.

One scenario is CNI chaining.
However, in this scenario RequireRouting is disabled for all endpoints and the above ipv{4,6}_l3 call which rewrites the destination is not made.
When RequireRouting is set to false, the assumed scenario is that the host network namespace is handling ARP; whether that is via proxy ARP or some other mechanism is not of Cilium's concern.

The other scenario where ARP passthrough is enabled is when Netkit mode is enabled.
However, Netkit is a layer 3 device by default (and in Cilium's use case).
This device is a bit unique as it's configured as a layer 2 device with the IFF_NOARP flag set.
This results in layer 2 headers being appended to egress traffic while no ARP is performed.
A ping between Netkit device peers looks as follows.

```
18:15:42.358787 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 98: 10.1.0.1 > 8.8.8.8: ICMP echo request, id 39351, seq 4, length 64
```

So in the Netkit forwarding path any neighbor creation will have the hard-coded all-zeros MAC address, and no ARP is required to resolve it.
The Netkit driver calls eth_skb_pkt_type to set pkt_type in the forwarding driver, prior to running the eBPF program.
Therefore, the packet is already PACKET_HOST when bpf_lxc runs, and there is no need to rewrite the destination MAC if we are going directly to the stack next: we have been accepted at layer 2 and are heading towards layer 3 with PACKET_HOST.
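The classification eth_skb_pkt_type performs can be modelled in a few lines of userspace C (a sketch of the kernel's logic, not the kernel source): pkt_type defaults to PACKET_HOST and is only changed when the destination MAC does not match the device address. On a netkit peer whose address is all zeros, the zeroed destination MAC therefore stays PACKET_HOST.

```c
#include <stdint.h>
#include <string.h>

/* Values mirror linux/if_packet.h. */
#define PACKET_HOST      0
#define PACKET_BROADCAST 1
#define PACKET_MULTICAST 2
#define PACKET_OTHERHOST 3

/* Userspace model of the kernel's eth_skb_pkt_type() logic:
 * an skb starts out as PACKET_HOST and is only reclassified when
 * the destination MAC differs from the receiving device's address. */
static int eth_pkt_type(const uint8_t dest[6], const uint8_t dev_addr[6])
{
	static const uint8_t bcast[6] = {0xff,0xff,0xff,0xff,0xff,0xff};

	if (memcmp(dest, dev_addr, 6) != 0) {
		if (dest[0] & 1) { /* group bit set: multicast space */
			if (memcmp(dest, bcast, 6) == 0)
				return PACKET_BROADCAST;
			return PACKET_MULTICAST;
		}
		return PACKET_OTHERHOST;
	}
	return PACKET_HOST;
}
```

A zeroed destination against a zeroed device address compares equal, so the netkit packet above is already PACKET_HOST before any eBPF program runs.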

Submission to stack

The redundant destination MAC rewrite in focus is performed only when:

  1. ENABLE_ROUTING is defined
    1. defined by default
    2. undefined for CNI chaining mode and EnableEndpointRoutes=true
  2. The destination is a remote pod
  3. We are immediately heading to the stack next

Consider that by the time we even evaluate this destination MAC rewrite, eth_type_trans has already been called from the receive path of the ingress device.
Therefore, even if we rewrite the destination MAC at this point it has no effect: the pkt_type is already set and we are going directly into layer 3 processing regardless, where the layer 2 header is of no consequence.

Redundant TTL Dec

The ipv{4,6}_l3 functions are also doing a TTL decrement.
However, this is also redundant.
When the packet is submitted to the stack, if the host stack decides it is routable, it will perform its own TTL decrement in the ip_forward path.
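For reference, that decrement is the classic incremental ones'-complement checksum update. Below is a userspace sketch mirroring the kernel's ip_decrease_ttl() (header layout simplified, assuming no IP options), with a full checksum routine only to verify the patch:

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Minimal 20-byte IPv4 header (assumption: no options). */
struct iphdr_min {
	uint8_t  ver_ihl;
	uint8_t  tos;
	uint16_t tot_len;
	uint16_t id;
	uint16_t frag_off;
	uint8_t  ttl;
	uint8_t  protocol;
	uint16_t check;
	uint32_t saddr;
	uint32_t daddr;
};

/* Full ones'-complement checksum over the header, for verification. */
static uint16_t ip_csum(const struct iphdr_min *iph)
{
	const uint16_t *p = (const uint16_t *)iph;
	uint32_t sum = 0;

	for (int i = 0; i < 10; i++)
		sum += p[i];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Mirrors the kernel's ip_decrease_ttl(): patch the checksum
 * incrementally instead of recomputing it, then drop the TTL. */
static void ip_decrease_ttl(struct iphdr_min *iph)
{
	uint32_t check = iph->check;

	check += htons(0x0100); /* TTL is the high byte of its 16-bit word */
	iph->check = (uint16_t)(check + (check >= 0xffff));
	iph->ttl--;
}
```

Because ip_forward applies this itself, an earlier decrement in the eBPF path buys nothing on the way into the stack.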

Summary

From all perspectives the final ipv{4,6}_l3 call in bpf_lxc appears redundant.
If the packet is immediately going to the stack, both the TTL decrement and the destination MAC rewrite are of no consequence to the next layer 3 processing path.

The genesis of this rewrite is from commit:

```
commit 99108598251cf4db382e6e6e0cef9d76a47a93ca
Author: Thomas Graf <thomas@cilium.io>
Date:   Thu Jan 28 00:19:17 2016 +0100

    Use the MAC of the host facing veth end as router MAC

    Signed-off-by: Thomas Graf <thomas@cilium.io>
```

In the above commit the original destination MAC rewrite was put in place.
This was redundant even at the genesis for the reasons mentioned above.

bpf: avoid redundant mac rewrite in bpf_lxc egress path

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 22, 2025
@ldelossa ldelossa force-pushed the ldelossa/bpf-lxc-redundant-mac-rewrite branch 4 times, most recently from 6b82b23 to bf0bbea Compare November 22, 2025 02:52
@ldelossa
Contributor Author

/ci-eks

@ldelossa
Contributor Author

/test

@ldelossa
Contributor Author

/ci-gke

1 similar comment
@ldelossa
Contributor Author

/ci-gke

@ldelossa ldelossa added the release-note/misc This PR makes changes that have no direct user impact. label Nov 25, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 25, 2025
@ldelossa ldelossa marked this pull request as ready for review November 25, 2025 23:28
@ldelossa ldelossa requested a review from a team as a code owner November 25, 2025 23:28
@julianwiedmann
Member

I have way too little context on how all these ARP paths are meant to work, sorry. But it looks like this PR will help to define the expected behavior a bit more, looking forward to it :)

@julianwiedmann julianwiedmann removed their request for review December 3, 2025 08:29
@ldelossa
Contributor Author

ldelossa commented Dec 3, 2025

@julianwiedmann yup, no worries, going to talk about this a bit out of band with @borkmann soon

@ldelossa ldelossa force-pushed the ldelossa/bpf-lxc-redundant-mac-rewrite branch 2 times, most recently from 720f3e5 to 7baba9f Compare December 10, 2025 01:14
@ldelossa
Contributor Author

/ci-eks

@ldelossa
Contributor Author

/ci-gke

@ldelossa
Contributor Author

/test

@ldelossa ldelossa added the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Dec 10, 2025
@ldelossa
Contributor Author

Marking as don't merge until I take a look at: #43226

After several cups of coffee and internal banter we have determined that this
mac rewrite is unnecessary.

The destination mac rewrites occur just prior to being submitted to the
stack. The host network namespace will perform layer 3 routing and
rewrite the mac subsequently regardless of our original rewrite. The
layer 3 path also decrements the TTL, making this redundant as well.

Additionally, in both Veth and Netkit forwarding paths
`eth_type_trans()` and its sibling `eth_skb_pkt_type` are called,
respectively, to set the packet's type PRIOR to our eBPF program running, so
any modification to the destination MAC in the effort to ensure packet reception
into layer 3 does not make sense.

The genesis of this redirect is from commit:
```
commit 9910859
Author: Thomas Graf <thomas@cilium.io>
Date:   Thu Jan 28 00:19:17 2016 +0100

    Use the MAC of the host facing veth end as router MAC
```

In the above commit the original destination MAC rewrite was put in place.
This was redundant even at the genesis for the reasons mentioned above.

Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
@ldelossa ldelossa force-pushed the ldelossa/bpf-lxc-redundant-mac-rewrite branch from 7baba9f to b143cb2 Compare December 16, 2025 16:15
@ldelossa
Contributor Author

rebased taking #43226 into account.

@ldelossa ldelossa removed the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Dec 16, 2025
@ldelossa
Contributor Author

/test

@ldelossa
Contributor Author

/ci-eks

@ldelossa
Contributor Author

/ci-gke

Member

@borkmann borkmann left a comment


this lgtm!

@ldelossa ldelossa added this pull request to the merge queue Dec 18, 2025
Merged via the queue into main with commit fdf020e Dec 18, 2025
456 of 471 checks passed
@ldelossa ldelossa deleted the ldelossa/bpf-lxc-redundant-mac-rewrite branch December 18, 2025 14:13
@julianwiedmann
Copy link
Copy Markdown
Member

amazing archaeology, thank you @ldelossa 🚀

@cilium-release-bot cilium-release-bot bot moved this to Released in cilium v1.19.0 Feb 3, 2026
smagnani96 added a commit that referenced this pull request Mar 16, 2026
[ upstream commit ea17ed8 ]

[ backporter's notes:
  * using macro CILIUM_HOST_IFINDEX instead of CONFIG(cilium_host_ifindex) as it was not backported.
  * fixed conflicts in bpf_lxc.c, as #43226, #42926 and others haven't been backported.
]

In commit d1d8e7a ("datapath: Add support for re-entering LXC egress path after L7 LB"),
we enabled support for re-entering the LXC egress path after a packet is
processed by a L7 LB. This allows us to correctly apply egress policies to
packets sent by a L7 LB to local backends.

Without per-endpoint routes, the pod policies are checked from the tail
call in bpf_host. We reach this code from cilium_host (ctx->ifindex).

With per-endpoint routes, the packet is handled directly in the
local backend pod `cil_to_container` path, completely skipping ingress
policies. Returning CTX_ACT_OK at this point would bypass policies.
To fix this, we need to hairpin the packet back to cil_to_container
(the ctx->mark is already cleared at the beginning of the codepath).
This would allow us to correctly apply ingress policies on the local
backend ingress path, restoring the expected behavior.

Fixes: d1d8e7a ("datapath: Add support for re-entering LXC egress path after L7 LB")

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>