sockets: Terminate sockets with BPF socket iterators #38693
joestringer merged 5 commits into cilium:main
Conversation
Removing myself as a reviewer since Jussi already acked for sig-foundations ✅
Looks like the file descriptor leak check is failing. Will need to investigate this tomorrow.

Edit: This went away with a retrigger, and when testing locally I could not detect any file descriptor leaks while using the feature. I created #40832 to try to debug the leak (if any), but cannot reproduce there. I will try a few more times. Related: #40179
/test |
Tracking what appear to be some CI flakes...
/test |
@joamaki I had to change the code as follows:

```go
// socketDestroyerFactory creates a level of indirection that makes it possible
// to inject mock SocketDestroyers in tests. SocketDestroyer cannot simply be
// Provide()ed above, since the invocation of NewSocketDestroyer must be
// deferred until all maps are created. We want to run it right before
// socketTerminationLoop is invoked.
type socketDestroyerFactory func(socketTerminationParams) (sockets.SocketDestroyer, error)

func makeSocketDestroyer(p socketTerminationParams) (sockets.SocketDestroyer, error) {
	sockRevNat4, sockRevNat6 := p.LBMaps.SockRevNat()
	sd, err := sockets.NewSocketDestroyer(p.Log, sockRevNat4, sockRevNat6)
	if err != nil {
		return nil, err
	}
	return sd, nil
}
```
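To illustrate why the factory indirection matters for testing, here is a minimal, self-contained sketch. The names (`mockDestroyer`, `run`, the string filter) are assumptions for illustration only, not Cilium code; the point is that passing a factory lets tests substitute a mock while production defers construction until dependencies are ready.

```go
package main

import "fmt"

// SocketDestroyer mirrors the interface shape discussed above:
// something that can destroy sockets matching a backend filter.
type SocketDestroyer interface {
	Destroy(filter string) error
}

// destroyerFactory defers construction until its dependencies
// (e.g. the LB maps) are ready.
type destroyerFactory func() (SocketDestroyer, error)

// mockDestroyer is a stand-in a test might inject; it records calls
// instead of touching real BPF maps.
type mockDestroyer struct{ calls []string }

func (m *mockDestroyer) Destroy(filter string) error {
	m.calls = append(m.calls, filter)
	return nil
}

// run accepts a factory rather than a ready-made destroyer, so tests
// can inject mocks and production can defer construction.
func run(factory destroyerFactory, filter string) error {
	sd, err := factory()
	if err != nil {
		return err
	}
	return sd.Destroy(filter)
}

func main() {
	mock := &mockDestroyer{}
	factory := func() (SocketDestroyer, error) { return mock, nil }
	if err := run(factory, "10.0.0.1:53"); err != nil {
		panic(err)
	}
	fmt.Println(mock.calls) // [10.0.0.1:53]
}
```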
Looks like this still needs a review from @cilium/ci-structure. Maybe @pchaigno?
Looks good to me. @ti-mo, I'm assuming these changes all look good on your side as well?
Created a follow-up issue here: #40862
Move cilium_lb(4|6)_reverse_sk map definitions to sock.h so that they can be shared with bpf_sock_term.c, a program added in subsequent commits which can destroy sockets connected to a particular backend.

Fixes: cilium#37907
Signed-off-by: Jordan Rife <jrife@google.com>
Implement bpf_sock_term.c, the program that handles socket destruction, and its unit tests. cil_sock_(udp|tcp)_destroy_v4 and cil_sock_(udp|tcp)_destroy_v6 are meant to be used in conjunction with BPF socket iterators to iterate through all sockets in a network namespace and destroy any matching the configured filter. cilium_sock_term_filter is a variable containing the filter, a value consisting of a backend's address family, address, and port. The programs combine the current socket's cookie and the configured filter into a cilium_lb(4|6)_reverse_sk map key and check for the existence of that key to see if this socket was connected to the now obsolete backend. If so, they destroy it with the bpf_sock_destroy() kfunc.

An earlier prototype of the program simply used DECLARE_CONFIG for each of the filter fields, requiring a new program to be configured and loaded for each new backend filter. This was convenient, since there was no need to coordinate access to the variable in userspace between parallel calls to TerminateUDPConnectionsToBackend(), but it was ultimately too slow (about 100x slower according to a simple benchmark). So, userspace needs to synchronize access to cilium_sock_term_filter and this program using a mutex. Each "iterator instance" needs to follow this sequence:

1) Acquire some lock.
2) Set the value of the filter in cilium_sock_term_filter.
3) Create a new socket iterator.
4) Read all items in the iterator.
5) Release the lock.

It's OK for different calls to TerminateConnectionsToBackend() to intertwine as long as these "critical sections" are sequential between threads.

Fixes: cilium#37907
Signed-off-by: Jordan Rife <jrife@google.com>
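The five-step sequence above can be sketched in Go. Everything below (the filter slot, the "iteration") is a mock standing in for the real BPF map and iterator; only the locking discipline is the point: the filter is a single shared slot, so set-filter and iterate must happen under one lock.

```go
package main

import (
	"fmt"
	"sync"
)

// sockTermFilter models the cilium_sock_term_filter variable:
// one global slot shared by every iterator instance.
type sockTermFilter struct {
	mu     sync.Mutex // serializes the set-filter/iterate sequence
	target string
}

// destroyMatching walks one "critical section": acquire the lock,
// set the filter, create an iterator, drain it, release the lock.
func (f *sockTermFilter) destroyMatching(backend string, sockets []string) int {
	f.mu.Lock()         // 1) acquire some lock
	defer f.mu.Unlock() // 5) released when we return
	f.target = backend  // 2) set the value of the filter
	destroyed := 0
	for _, s := range sockets { // 3+4) create the iterator and drain it
		if s == f.target {
			destroyed++ // stands in for bpf_sock_destroy()
		}
	}
	return destroyed
}

func main() {
	f := &sockTermFilter{}
	sockets := []string{"10.0.0.1:53", "10.0.0.2:53", "10.0.0.1:53"}
	var wg sync.WaitGroup
	results := make([]int, 2)
	// Two parallel terminations intertwine safely because each
	// critical section runs to completion under the lock.
	for i, backend := range []string{"10.0.0.1:53", "10.0.0.2:53"} {
		wg.Add(1)
		go func(i int, b string) {
			defer wg.Done()
			results[i] = f.destroyMatching(b, sockets)
		}(i, backend)
	}
	wg.Wait()
	fmt.Println(results[0] + results[1]) // 3 total matches
}
```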
bpf_sock_term.o was built from the start to be "clang free". Use bpf2go to embed compiled BPF bytecode and generate Go skeletons. Reorder struct definitions inside union u6addr to ensure that the generated Go struct has a [16]uint8 for the Addr field, and name the anonymous structs to get bpf2go to stop complaining.

Add llvm-strip, a dependency of bpf2go, to the builder image, and use the builder image for BPF generation (`make -C bpf generate`) to ensure reproducible results on systems with different LLVM versions.

Bump the github.com/cilium/ebpf module version to pull in [1].

[1]: cilium/ebpf#1829

Fixes: cilium#37907
Signed-off-by: Jordan Rife <jrife@google.com>
Implement an alternative strategy for destroying sockets in the current net namespace called BPFSocketDestroyer, and move the original netlink-based Destroy() function to a new SocketDestroyer called NetlinkSocketDestroyer. At startup, check whether the bpf_sock_destroy kfunc is supported on the current kernel, and if not, fall back to NetlinkSocketDestroyer.

BPFSocketDestroyer#Destroy() sets the value of the socket filter, initializes the socket iterator, then iterates to completion. It takes a lock over this whole sequence to ensure calls to Destroy() are serialized across parallel instances of TerminateConnectionsToBackend().

Create the SocketDestroyer inside the one-shot job inside registerSocketTermination to guarantee that all LB maps have been initialized, and plumb the socket revnat maps through to LoadSockTerm.

Fixes: cilium#37907
Signed-off-by: Jordan Rife <jrife@google.com>
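The probe-then-fall-back selection at startup can be sketched as follows. The probe function, type names, and `Name()` method are illustrative assumptions; the real code checks kfunc support via the kernel, which is stubbed out here.

```go
package main

import (
	"errors"
	"fmt"
)

// SocketDestroyer is reduced to the minimum needed for the sketch.
type SocketDestroyer interface{ Name() string }

type bpfDestroyer struct{}

func (bpfDestroyer) Name() string { return "bpf" }

type netlinkDestroyer struct{}

func (netlinkDestroyer) Name() string { return "netlink" }

// errKfuncMissing stands in for the probe failing on kernels
// without the bpf_sock_destroy kfunc.
var errKfuncMissing = errors.New("bpf_sock_destroy kfunc not supported")

// newSocketDestroyer prefers the BPF-based implementation and falls
// back to netlink when the kernel probe fails, mirroring the startup
// behavior described above. The probe is injected for the sketch.
func newSocketDestroyer(probe func() error) SocketDestroyer {
	if err := probe(); err != nil {
		return netlinkDestroyer{}
	}
	return bpfDestroyer{}
}

func main() {
	onNewKernel := func() error { return nil }
	onOldKernel := func() error { return errKfuncMissing }
	fmt.Println(newSocketDestroyer(onNewKernel).Name()) // bpf
	fmt.Println(newSocketDestroyer(onOldKernel).Name()) // netlink
}
```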
Introduce test coverage and benchmarks for both SocketDestroyer implementations.

Fixes: cilium#37907
Signed-off-by: Jordan Rife <jrife@google.com>
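A benchmark comparing two implementations behind one interface could take roughly the shape below. `fakeDestroy` is a placeholder for calling either destroyer; `testing.Benchmark` is the standard-library entry point that lets a plain program reuse the `go test -bench` harness.

```go
package main

import (
	"fmt"
	"testing"
)

// fakeDestroy stands in for a SocketDestroyer implementation; a real
// benchmark would exercise BPFSocketDestroyer or
// NetlinkSocketDestroyer against a set of test sockets.
func fakeDestroy(n int) int {
	matched := 0
	for i := 0; i < n; i++ {
		if i%2 == 0 {
			matched++
		}
	}
	return matched
}

func main() {
	// testing.Benchmark runs the function with increasing b.N until
	// timing stabilizes, same as `go test -bench` would.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			fakeDestroy(1024)
		}
	})
	fmt.Println(res.N > 0) // true: the harness ran at least once
}
```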
/test |
/ci-ipsec-e2e |
Timo is out for a few weeks, and it looks like @jrife has done his best to address or respond to the earlier feedback. Strictly speaking that means that @cilium/loader hasn't given an approval, so I think it's reasonable to give a little extra time for someone there to see (maybe @rgo3 or @dylandreimerink?), but if there's no further feedback in the next few days then it seems like this should be fine to merge. We can always iterate more in-tree.
Filed #40925 for the merge queue status check failure case; we've seen this before and it's unrelated to this PR.
👋🏼 From commit … Any idea how to fix this?

EDIT: @dylandreimerink found the culprit (🙏🏼), going to open a PR in a minute.
This is a follow-up to #37907...
`bpf_sock_destroy()`, a kfunc introduced in the 6.5 kernel by @aditighag, was implemented as a way to destroy UDP sockets connected to a stale UDP backend. While the original RFC mentions one possible advantage being the elimination of exposure and traversal of every network namespace, the final patch series still requires that socket iteration be done per namespace. However, this approach still reduces dependence on CONFIG_INET_DIAG_DESTROY, which the RFC claims is disabled by default. Perhaps a future kernel improvement to BPF socket iterators could allow for namespace iteration to be pushed to the BPF layer as well.

It seems that this change has been planned for a while, so this PR takes a stab at implementing it. Originally, this was blocked by the availability of `bpf_sock_destroy()` in production kernels and the availability of newer versions of LLVM inside Cilium. However, 6.8+ is now the default kernel version on many Ubuntu releases, and LLVM has long since been upgraded. Not that the LLVM version shipped with Cilium matters much, since this PR bakes `bpf_sock_term.o` into the Cilium agent image (more on this below).

This PR splits BPF, control plane, and test changes into different commits for easier review.
BPF Changes:
`bpf_sock_term.c` implements `cil_sock_udp_destroy`. `cil_sock_udp_destroy` is meant to be used in conjunction with BPF socket iterators to iterate through all sockets in a network namespace and destroy any matching the configured filter. `cilium_sock_term_filter` is a single-element array containing the filter, a value consisting of a backend's address family, address, and port.

An earlier prototype of the program simply used `DECLARE_CONFIG` for each of the filter fields, requiring a new program to be configured and loaded for each new backend filter. This was convenient, since there was no need to coordinate access to the map in userspace between parallel calls to `TerminateUDPConnectionsToBackend()`, but it was ultimately too slow (about 100x slower according to a simple benchmark). So, userspace needs to synchronize access to the map and this program using a mutex. Each "iterator instance" needs to follow this sequence:

1) Acquire some lock.
2) Set the value of the filter in `cilium_sock_term_filter`.
3) Create a new socket iterator.
4) Read all items in the iterator.
5) Release the lock.

It's OK for different calls to `TerminateUDPConnectionsToBackend()` to intertwine as long as these "critical sections" are sequential between threads.

The program combines the current socket's cookie and the configured filter into a `cilium_lb(4|6)_reverse_sk` map key and checks for the existence of that key to see if this socket was connected to the now obsolete backend. If so, it destroys it with the `bpf_sock_destroy()` kfunc.

A couple of points worth calling out:
- `bpf_sock_term.c` uses sparse definitions of `struct bpf_iter__udp` and its dependencies to avoid pulling in `vmlinux.h` or a long chain of dependencies; it's not as straightforward as other definitions, which can be cleanly copied in by syncing some header from the kernel.
- Although the iterator context exposes the socket pointer `sk`, I couldn't pass that pointer to the `bpf_sock_destroy()` kfunc without the verifier complaining about unsafe pointers or some such thing. This could use some further investigation, but at least right now I'm not sure how to get this to work. So, I guess there's some risk of the field accesses breaking on newer kernels if offsets change in `struct bpf_iter__udp`. This probably warrants some discussion.
- `bpf_sock_term.c` is compiled once and built into the Cilium agent image, a big departure from other programs, which are compiled at runtime, but hopefully not an unwelcome one. With the clang-free milestone underway, I figured I would just skip right to the end with this new program and bake it into the image. I'm not sure how well this aligns with the long-term vision, so it would probably be good to get some input from @ti-mo and @dylandreimerink, who are driving the clang-free effort.

Control Plane Changes
Beyond the obvious plumbing that needs to happen to initialize programs and maps, this PR implements an alternative strategy for destroying sockets in the current net namespace called `BPFSocketDestroyer` and moves the original netlink-based `Destroy()` function to a new `SocketDestroyer` called `NetlinkSocketDestroyer`. `backendConnectionHandler` uses `BPFSocketDestroyer` by default but falls back to `NetlinkSocketDestroyer` if that fails, as would be the case if the current kernel does not support the `bpf_sock_destroy()` kfunc.

`BPFSocketDestroyer#Destroy()` lazy-initializes the socket iterator program, sets the value in the socket filter map, initializes the socket iterator, then iterates to completion. It takes a lock over this whole sequence to ensure calls to `Destroy()` are serialized across parallel instances of `TerminateUDPConnectionsToBackend()`.

Test
This PR includes BPF unit tests as well as integration tests for netlink and BPF socket destroyer variants, but here's a small demo of the BPF socket destroyer working with a workload I put together earlier to test connected UDP sockets.
Related Kernel Patches
I made some improvements upstream that eliminate scenarios where some sockets may be skipped or repeated during iteration for TCP and UDP socket iterators. This should make socket iterators as a mechanism for this sort of thing a bit more reliable.
- [RFC PATCH bpf-next 0/3] Avoid skipping sockets with socket iterators
- [PATCH v7 bpf-next 0/7] bpf: udp: Exactly-once socket iteration
- [PATCH v6 bpf-next 00/12] bpf: tcp: Exactly-once socket iteration

Fixes: #37907