bpf: more scalability improvements #11694
Conversation
Please set the appropriate release note label.
Force-pushed from 23ca008 to 999ee8d
... which uses jiffies if possible and otherwise falls back to ktime. See also commit cd7921f ("bpf: implement time source switch and use it in CT") for more information wrt ktime vs jiffies. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
See commit cd7921f ("bpf: implement time source switch and use it in CT") for more information wrt ktime vs jiffies. Given the CT maps have been switched, we should also do it for NAT to allow correlating creation times of entries. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If there is no entry in the affinity map, then it's pointless to fetch bpf_mono_now() and prep the match key. Only go for it when really needed. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The only thing that lb4_update_affinity_by_netns() does for existing entries is update the last_used timestamp, nothing else. So let's not hit the BPF map update spinlock from all CPUs. Instead, use a READ_ONCE()/WRITE_ONCE() scheme and update the timestamp at the time we do the lookup, where we already have the value. Only call the lb4_update_affinity_by_netns() update when backend_from_affinity is false, which indicates that a backend was just newly selected. lb4_update_affinity_by_netns() is addressed separately to ease review. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 999ee8d to d4afb6d
Using the kernel's fib lookup is too expensive and not strictly needed
for the case of having a hairpinned load-balancer as in XDP. Given we
already have a faster NODEPORT_NEIGH{4,6} cache that we populate for
inbound connections, just look up the backend mac from there. Avoiding
the fib lookup in XDP bumps performance by ~1.5Mpps in my environment.
We can only make this assumption for the hairpin case where the LB is
exposed to a single public dev. We can extend nodeport_lb_hairpin()
later on also for the tc case. The assumption on why this is okay is
that backend pods have src IP validation in bpf_lxc via checking
is_valid_lxc_src_ip{,v4}() which prevents spoofing (src IP/MAC)
in general.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Do the same conversion as done in the DSR case also for SNAT NodePort, which similarly helps to improve performance by avoiding the fib lookup in the kernel, with gains comparable to the DSR case. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 94a1b6a to 5fb6e96
tklauser left a comment
LGTM mostly, some small nits and would be nice to have this covered by unit tests as for the other maps size options.
Also, I think the cmdref for cilium-agent probably needs to be re-generated as part of commit 5fb6e96a447e106644b6ed472b18bb47586bf2ca ("bpf: make neigh{4,6} maps size configurable from agent and helm").
Could you maybe also update the unit tests to cover the newly introduced Neigh map size here: cilium/pkg/option/config_test.go (line 265 in af9fd59) and here: cilium/pkg/option/config_test.go (line 472 in af9fd59)?
Agree, adding as a follow-up as discussed. Also, we should fix the dynamic sizing calculation even before this patchset, since it doesn't take the neigh BPF map into account.
Force-pushed from d1b7127 to 531ee33
Given we reuse the neigh BPF tables for clients and backends, it doesn't make sense to push the HW address down only once upon k8s node events like nodeUpdate() and nodeDelete(), where we ARP-probe and update the fib. Since this is an LRU cache, entries can get evicted. Therefore, update the map once we did a successful fib lookup, so we don't need to perform another one next time and have the guarantee that entries get re-cached after eviction. From the control plane, we still need to retire old entries, however, so deletes are pushed from there. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Enforce retiring of neigh cache entries on update or deletion of nodes. This forces the datapath to re-cache the neigh entry from the fib, which is needed in order to not leave stale entries around. Only do this when acceleration is enabled right now, since this is equivalent to what the datapath does as well. We might later extend this to tc as well. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
So far eth_store_{s,d}addr() had to fall back to byte-by-byte copies of
src/dst mac addresses from stack into XDP buffer since the case of
fib_lookup is special in that fib_params.{s,d}mac would cause unaligned
access on the BPF stack when used with our optimized memcpy. For
the neigh table cases we do not have this issue and can therefore reduce
the number of insns needed by using word and half-word wide copies.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Small enough that it shouldn't matter much, but add a note here wrt backend selection. Slightly clean up some of the other comments. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We only ever need to track neigh entries of clients in the SNAT case when the reply comes back to us, or in the case where it's local to the node. For DSR and Hybrid (DSR for TCP) we don't need to track them here, since the node only ever sees the original requests, which improves scalability for new client mappings as tracking can be avoided altogether. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 531ee33 to 0995c19
test-me-please
The BPF neigh{4,6} maps are decoupled from SNAT NodePort mode these
days and are used for i) request IP/MAC mappings for remote backends,
ii) for local backends, and iii) backend-to-MAC mappings. Make their
size configurable in order to allow for more flexibility. Default to
the same as the NAT map size.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 0995c19 to 0e38c41
test-me-please
joestringer left a comment
One issue on storage of the mono higher-resolution counter and one minor initialization suggestion below.
struct bpf_fib_lookup fib_params = {
	.family		= AF_INET6,
	.ifindex	= DIRECT_ROUTING_DEV_IFINDEX,
};
We probably don't need to initialize this until the fib_lookup() fallback case below.
At least the ifindex is also used in the non-fib case. Given the prior code unconditionally had struct bpf_fib_lookup fib_params = {};, I thought I'd reuse it for the neigh map case. I can look into reworking it a bit as a follow-up.
Needs the release-blocker flag, given the jiffies conversion falls under the agent clock source probe flag.
See commit messages.