bpf: more scalability improvements #11694
Conversation
Please set the appropriate release note label.
Force-pushed from 23ca008 to 999ee8d
... which uses jiffies if possible and otherwise falls back to ktime. See also commit cd7921f ("bpf: implement time source switch and use it in CT") for more information wrt ktime vs jiffies. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
See commit cd7921f ("bpf: implement time source switch and use it in CT") for more information wrt ktime vs jiffies. Given the CT maps have been switched, we should also do it for NAT to allow correlating creation times of entries. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If there is no entry in the affinity map, then it's pointless to fetch bpf_mono_now() and prep the match key. Only go for it when really needed. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The only thing that lb4_update_affinity_by_netns() does for existing entries is update the last_used timestamp, nothing else. So let's not hit the BPF map update spinlock from all CPUs. Instead, use a READ_ONCE()/WRITE_ONCE() scheme and update the timestamp at the time we do the lookup, where we already have the value. Only call the lb4_update_affinity_by_netns() update when backend_from_affinity is false, which indicates that a backend was just newly selected. lb4_update_affinity_by_netns() is addressed separately to ease review. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 999ee8d to d4afb6d
Using the kernel's fib lookup is too expensive and not strictly needed
for the case of having a hairpinned load-balancer as in XDP. Given we
already have a faster NODEPORT_NEIGH{4,6} cache that we populate for
inbound connections, just look up the backend mac from there. Avoiding
the fib lookup in XDP bumps performance by ~1.5Mpps in my environment.
We can only make this assumption for the hairpin case where the LB is
exposed to a single public dev. We can extend nodeport_lb_hairpin()
later on also for the tc case. The assumption on why this is okay is
that backend pods have src IP validation in bpf_lxc via checking
is_valid_lxc_src_ip{,v4}() which prevents spoofing (src IP/MAC)
in general.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Do the same conversion as done in the DSR case also for SNAT NodePort, which similarly helps to improve performance by avoiding the fib lookup in the kernel, with gains comparable to the DSR case. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 94a1b6a to 5fb6e96
tklauser left a comment
LGTM mostly, some small nits and would be nice to have this covered by unit tests as for the other maps size options.
Also, I think the cmdref for cilium-agent probably needs to be re-generated as part of commit 5fb6e96a447e106644b6ed472b18bb47586bf2ca ("bpf: make neigh{4,6} maps size configurable from agent and helm").
Could you maybe also update the unit tests to cover the newly introduced Neigh map size here: cilium/pkg/option/config_test.go (line 265 in af9fd59) and here: cilium/pkg/option/config_test.go (line 472 in af9fd59)?
Agree, adding as a follow-up as discussed. Also, we should fix the dynamic sizing calculation even before this patchset, since it doesn't take the neigh BPF map into account.
Force-pushed from d1b7127 to 531ee33
Given we reuse the neigh BPF tables for clients and backends, it doesn't make sense to push the HW address down only once upon k8s node events like nodeUpdate() and nodeDelete(), where we ARP-probe and update the fib. Since this is an LRU cache, entries can get evicted. Therefore, update the map once we did a successful fib lookup, so we don't need to perform another one next time and have the guarantee that entries get re-cached after eviction. From the control plane, we still need to retire old entries, however, so deletes are pushed from there. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Enforce retiring of neigh cache entries on update or deletion of nodes. This forces the datapath to re-cache the neigh entry from the fib, which is needed in order to not leave stale entries around. Only do this when acceleration is enabled right now, since this is equivalent to what the datapath does as well. We might later extend this to tc as well. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
So far eth_store_{s,d}addr() had to fall back to byte-by-byte copies of
src/dst mac addresses from stack into XDP buffer since the case of
fib_lookup is special in that fib_params.{s,d}mac would cause unaligned
access on the BPF stack when used with our optimized memcpy. For
the neigh table cases we do not have this issue and can therefore reduce
the number of insns needed by using word and half-word wide copies.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Small enough that it shouldn't matter much, but add a note here wrt backend selection. Slightly clean up some of the other comments. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We only ever need to track neigh entries of clients in the SNAT case when the reply comes back to us, or in the case where it's local to the node. For DSR and Hybrid (DSR for TCP) we don't need to track them here, since the node only ever sees the original requests, which improves scalability for new client mappings as tracking can be avoided altogether. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 531ee33 to 0995c19
test-me-please
The BPF neigh{4,6} maps are decoupled from SNAT NodePort mode these
days and are used for i) request IP/MAC mappings for remote backends,
ii) for local backends, and iii) backend-to-MAC mappings. Make their
size configurable in order to allow for more flexibility. Default to
the same as the NAT map size.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 0995c19 to 0e38c41
test-me-please
joestringer left a comment
One issue on storage of the mono higher-resolution counter and one minor initialization suggestion below.
struct bpf_fib_lookup fib_params = {
	.family		= AF_INET6,
	.ifindex	= DIRECT_ROUTING_DEV_IFINDEX,
};
We probably don't need to initialize this until the fib_lookup() fallback case below.
At least the ifindex is also used in the non-fib case. Given the prior code unconditionally had struct bpf_fib_lookup fib_params = {};, I thought I'd reuse it for the neigh map case. I can look into reworking it a bit as a follow-up.
Needs the release-blocker flag, given the jiffies conversion falls under the agent clock source probe flag.
See commit messages.