bpf: avoid redundant mac rewrite in common case #42926
Merged
Conversation
Force-pushed 6b82b23 to bf0bbea (compare)
Contributor (Author): /ci-eks
Contributor (Author): /test
Contributor (Author): /ci-gke
(1 similar comment)
Contributor (Author): /ci-gke
Member: I have way too little context on how all these ARP paths are meant to work, sorry. But it looks like this PR will help to define the expected behavior a bit more, looking forward to it :)
Contributor (Author): @julianwiedmann yup, no worries, going to talk about this a bit out of band with @borkmann soon
Force-pushed 720f3e5 to 7baba9f (compare)
Contributor (Author): /ci-eks
Contributor (Author): /ci-gke
Contributor (Author): /test
Contributor (Author): Marking as don't merge until I take a look at: #43226
After several cups of coffee and internal banter we have determined that this mac rewrite is unnecessary. The destination mac rewrites occur just prior to being submitted to the stack. The host network namespace will perform layer 3 routing and rewrite the mac subsequently regardless of our original rewrite. The layer 3 path also decrements the TTL, making this redundant as well.

Additionally, in both Veth and Netkit forwarding paths `eth_type_trans()` and its sibling `eth_skb_pkt_type` are called, respectively, to set the packet's type PRIOR to our eBPF program running, so any modification to the destination MAC in the effort to ensure packet reception into layer 3 does not make sense.

The genesis of this redirect is from commit:

```
commit 9910859
Author: Thomas Graf <thomas@cilium.io>
Date:   Thu Jan 28 00:19:17 2016 +0100

    Use the MAC of the host facing veth end as router MAC
```

In the above commit the original destination MAC rewrite was put in place. This was redundant even at the genesis for the reasons mentioned above.

Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
Force-pushed 7baba9f to b143cb2 (compare)
Contributor (Author): rebased taking #43226 into account.
Contributor (Author): /test
Contributor (Author): /ci-eks
Contributor (Author): /ci-gke
Member: amazing archaeology, thank you @ldelossa 🚀
smagnani96 added a commit that referenced this pull request on Mar 16, 2026:

[ upstream commit ea17ed8 ]

[ backporter's notes:
* using macro CILIUM_HOST_IFINDEX instead of CONFIG(cilium_host_ifindex) as it was not backported.
* fixed conflicts in bpf_lxc.c, as #43226, #42926 and others haven't been backported. ]

In commit d1d8e7a ("datapath: Add support for re-entering LXC egress path after L7 LB"), we enabled support for re-entering the LXC egress path after a packet is processed by a L7 LB. This allows us to correctly apply egress policies to packets sent by a L7 LB to local backends.

Without per-endpoint routes, the pod policies are checked from the tail call in bpf_host. We reach this code from cilium_host (ctx->ifindex). With per-endpoint routes, the packet is handled directly in the local backend pod `cil_to_container` path, completely skipping ingress policies. Returning CTX_ACT_OK at this point would bypass policies.

To fix this, we need to hairpin the packet back to cil_to_container (the ctx->mark is already cleared at the beginning of the codepath). This would allow us to correctly apply ingress policies on the local backend ingress path, restoring the expected behavior.

Fixes: d1d8e7a ("datapath: Add support for re-entering LXC egress path after L7 LB")
Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
In `bpf_lxc.c` there is a redundant destination MAC rewrite. After several cups of coffee and internal banter we have determined that this MAC rewrite is unnecessary.
ARP responder

In the common case we employ an ARP responder in the eBPF datapath. In Cilium a pod's default route is over the veth device, which is a layer 2 device. During an application's egress packet flow, ARP resolution is performed and the packet is queued until the ARP reply arrives. Cilium responds to the ARP request and the egress packet is then dequeued. In this common case, where Cilium's ARP responder is in place, the datapath replies with the MAC of the host-side veth. Notice that the queued ARP and its responses occur within the pod's network namespace. Therefore, we have a valid MAC address that the host network namespace is happy to accept during the veth forwarding handoff.
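As a rough illustration of what such an ARP responder does to the frame, here is a minimal userspace sketch. The struct layout and the `arp_respond` helper are illustrative assumptions, not Cilium's actual code; the point is only the field swap that hands the pod the host-side veth MAC:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified Ethernet/IPv4 ARP payload; illustrative layout only. */
struct arp_eth_ipv4 {
    uint16_t op;      /* 1 = request, 2 = reply */
    uint8_t  sha[6];  /* sender hardware address */
    uint32_t spa;     /* sender protocol address */
    uint8_t  tha[6];  /* target hardware address */
    uint32_t tpa;     /* target protocol address */
};

/* Turn a request into a reply in place, answering with router_mac.
 * This mirrors the transformation an eBPF ARP responder performs
 * before reflecting the frame back into the pod: the pod learns the
 * host-side veth MAC as its next hop. */
static void arp_respond(struct arp_eth_ipv4 *arp, const uint8_t router_mac[6])
{
    uint32_t asked_for = arp->tpa;

    arp->op = 2;                     /* request -> reply */
    memcpy(arp->tha, arp->sha, 6);   /* original sender becomes the target */
    arp->tpa = arp->spa;
    memcpy(arp->sha, router_mac, 6); /* answer with the host-side veth MAC */
    arp->spa = asked_for;            /* we "own" the asked-for address */
}
```

Because the reply carries the veth MAC, the subsequently dequeued egress packets already have a destination MAC the host side accepts, with no further rewrite needed.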
ARP Passthrough

There are scenarios where ARP passthrough is enabled and Cilium is not performing eBPF proxy ARP. One scenario is CNI chaining. However, in this scenario `RequireRouting` is disabled for all endpoints and the above `ipv{4,6}_l3` call which rewrites the destination is not made. When `RequireRouting` is set to false, the assumed scenario is that the host network namespace is handling ARP; whether that be via proxy ARP or some other mechanism is not of Cilium's concern.

The other scenario where ARP passthrough is enabled is when Netkit mode is enabled. However, Netkit is a layer 3 device by default (and in Cilium's use case). This device is a bit unique as it's configured as a layer 2 device with the `IFF_NOARP` flag set. This results in layer 2 headers being appended to egress traffic while no ARP is performed. A ping between Netkit device peers looks as follows.

So in the Netkit forwarding path any neighbor creation will have the hard-coded MAC address of 0 and no ARP will be required to resolve this. The Netkit driver calls `eth_skb_pkt_type` to set `pkt_type` in the forwarding driver, prior to running the eBPF program. Therefore, the packet is already `PACKET_HOST` when `bpf_lxc` runs and we have no need to rewrite the destination MAC if we are going directly to the stack next: we've been accepted at layer 2 and we're heading towards layer 3 with `PACKET_HOST`.

Submission to stack
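The `pkt_type` decision mentioned above can be sketched in userspace. This is an illustrative mirror of the logic in the kernel's `eth_type_trans()`/`eth_skb_pkt_type()` (not the kernel source itself): classification depends only on the destination MAC versus the receiving device's MAC, and it happens before any tc-BPF program sees the packet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Packet types as the kernel names them in <linux/if_packet.h>. */
enum pkt_type { PACKET_HOST, PACKET_BROADCAST, PACKET_MULTICAST, PACKET_OTHERHOST };

/* Illustrative userspace mirror of the kernel's ethernet classification:
 * group-bit frames are broadcast/multicast, a foreign unicast destination
 * is PACKET_OTHERHOST, and a matching destination is PACKET_HOST. */
static enum pkt_type classify(const uint8_t dest[6], const uint8_t dev_mac[6])
{
    static const uint8_t bcast[6] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

    if (dest[0] & 1) /* group bit: multicast, or broadcast if all-ones */
        return memcmp(dest, bcast, 6) == 0 ? PACKET_BROADCAST
                                           : PACKET_MULTICAST;
    if (memcmp(dest, dev_mac, 6) != 0)
        return PACKET_OTHERHOST;
    return PACKET_HOST; /* Netkit's all-zero MACs land here too */
}
```

Note the Netkit case: with both the destination MAC and the device MAC hard-coded to zero, the comparison trivially succeeds and the packet is `PACKET_HOST` before our program ever runs.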
The redundant destination MAC rewrite in focus is performed only when:

- `ENABLE_ROUTING` is defined

Consider that by the time we even evaluate this destination MAC rewrite, `eth_type_trans` has already been called from the receive path of the ingress device. Therefore, even if we rewrite the destination MAC at this point it has no effect; the `pkt_type` is already set and we are going directly into layer 3 processing regardless, where the layer 2 header is of no consequence.

Redundant TTL Dec

The `ipv{4,6}_l3` functions are also doing a TTL decrement. However, this is also redundant. When the packet is submitted to the stack, if the host stack decides it's routable it will perform its own TTL decrement in the `ip_forward` path.

Summary

From all perspectives the final `ipv{4,6}_l3` call in `bpf_lxc` appears redundant. If the packet is immediately going to the stack, both the TTL decrement and the destination MAC rewrite are of no consequence to the next layer 3 processing path.
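For reference, the TTL decrement the host stack performs in `ip_forward` is just an incremental checksum patch. The sketch below shows that arithmetic in userspace, the same as the kernel's `ip_decrease_ttl()`; the helper name and the host-order checksum representation are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* Incremental TTL decrement. The IPv4 header checksum is the ones'-
 * complement of a 16-bit word sum; TTL is the high byte of one of those
 * words, so removing 1 from TTL removes 0x0100 from the sum, which is
 * compensated by adding 0x0100 to the stored checksum (RFC 1141 style).
 * `check` here is the checksum as a host-order value of the big-endian
 * header word. */
static void decrease_ttl(uint8_t *ttl, uint16_t *check)
{
    uint32_t c = *check + 0x0100u;                 /* incremental update */
    *check = (uint16_t)((c & 0xFFFF) + (c >> 16)); /* fold the carry */
    (*ttl)--;
}
```

Since the stack repeats exactly this on any routable packet, doing it earlier in `bpf_lxc` buys nothing.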
The genesis of this redirect is from commit:

```
commit 9910859
Author: Thomas Graf <thomas@cilium.io>
Date:   Thu Jan 28 00:19:17 2016 +0100

    Use the MAC of the host facing veth end as router MAC
```

In the above commit the original destination MAC rewrite was put in place. This was redundant even at the genesis for the reasons mentioned above.