bpf: getpeername hook implementation for socket lb#11617
Conversation
|
Please set the appropriate release note label. |
3 similar comments
|
Please set the appropriate release note label. |
|
Please set the appropriate release note label. |
|
Please set the appropriate release note label. |
pchaigno
left a comment
There was a problem hiding this comment.
Except for Martynas's comment and one missing test case, LGTM!
0d4d4ae to
974f580
Compare
|
Coverage increased (+0.03%) to 36.894% when pulling 1dcfcc8b55c16dc9696b5324e18c8b296783df17 on pr/sock-lb-peername into cc84c1d on master. |
aa8d4aa to
327eb41
Compare
|
Does commit |
Yeah, I'll take care of it for 1.7. I've marked it as needs-backport for 1.7. |
|
test-docs-please (green) |
@borkmann I don't think that's required anymore since the doc tests run in GitHub Action. |
Same reasoning as with a630353 ("bpf: only update nodeport neigh entry if stale or non-existant"). For sock LB users with UDP proto, this map update is in the hot-path and constantly hitting the map update spinlock which is completely unnecessary and very high overhead. Only update if needed. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Make it more clear where we attach what and simplify the entry point naming accordingly. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The vast majority of applications does not bother about getpeername(),
but in a few occasions we've seen breakage when validating the peer's
address since it returns unexpectedly the backend tuple instead of the
service one.
The implementation here is actually nicely straight forward in that we
can fully reuse sock{4,6}_update_revnat() as well as the __sock{4,6}_xlate_rev()
implementation which also handles v4-in-v6. This works because the kernel
populates the struct sockaddr of its known peer in inet{,6}_getname() and
we have recorded that exactly on prior connect hook [0], so it tells us
exactly the address it wants to see. This is relevant for {un,}connected
UDP, for example.
I was originally thinking to use socket local storage for this, but i)
without BTF support it won't work and ii) we can spare an additional map
by just using LB{4,6}_REVERSE_NAT_SK_MAP in an efficient manner (that is,
w/o hitting the update spinlock every time).
Simple example:
# ./cilium/cilium service list
ID Frontend Service Type Backend
1 1.2.3.4:80 ClusterIP 1 => 10.0.0.10:80
Before; curl's verbose output example, no getpeername() reverse xlation:
# curl --verbose 1.2.3.4
* Rebuilt URL to: 1.2.3.4/
* Trying 1.2.3.4...
* TCP_NODELAY set
* Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
> GET / HTTP/1.1
> Host: 1.2.3.4
> User-Agent: curl/7.58.0
> Accept: */*
[...]
After; with getpeername() reverse xlation:
# curl --verbose 1.2.3.4
* Rebuilt URL to: 1.2.3.4/
* Trying 1.2.3.4...
* TCP_NODELAY set
* Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
> GET / HTTP/1.1
> Host: 1.2.3.4
> User-Agent: curl/7.58.0
> Accept: */*
[...]
Attached hooks:
# bpftool cgroup list /var/run/cilium/cgroupv2/
ID AttachType AttachFlags Name
14492 connect4
14487 connect6
14494 post_bind4
14489 post_bind6
14495 sendmsg4
14490 sendmsg6
14496 recvmsg4
14491 recvmsg6
14493 getpeername4
14488 getpeername6
[0] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=1b66d253610c7f8f257103808a9460223a087469
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
While adding the getpeername support, I noticed that for wild-carded lookups we record the wrong IP:port for reverse translation. Service map: # ./cilium/cilium service list ID Frontend Service Type Backend 1 1.2.3.4:31000 NodePort 1 => 8.8.8.8:53 2 0.0.0.0:31000 NodePort 1 => 8.8.8.8:53 Non-wildcarded NodePort query works as expected: # nslookup -port=31000 google.com 1.2.3.4 Server: 1.2.3.4 Address: 1.2.3.4#31000 Non-authoritative answer: Name: google.com Address: 216.58.215.238 Name: google.com Address: 2a00:1450:400a:801::200e Wildcarded NodePort query reverses to wildcard address instead of original one passed from the app: # nslookup -port=31000 google.com 127.0.0.1 ;; reply from unexpected source: 0.0.0.0#31000, expected 127.0.0.1#31000 ;; reply from unexpected source: 0.0.0.0#31000, expected 127.0.0.1#31000 ;; reply from unexpected source: 0.0.0.0#31000, expected 127.0.0.1#31000 ;; connection timed out; no servers could be reached After the fix: # nslookup -port=31000 google.com 127.0.0.1 Server: 127.0.0.1 Address: 127.0.0.1#31000 Non-authoritative answer: Name: google.com Address: 216.58.215.238 Name: google.com Address: 2a00:1450:400a:800::200e Same for getpeername, service map: # ./cilium/cilium service list ID Frontend Service Type Backend 1 1.2.3.4:31000 NodePort 1 => 193.99.144.80:80 2 0.0.0.0:31000 NodePort 1 => 193.99.144.80:80 Before: # curl --verbose 127.0.0.1:31000 * Rebuilt URL to: 127.0.0.1:31000/ * Trying 127.0.0.1... * TCP_NODELAY set * Connected to 127.0.0.1 (0.0.0.0) port 31000 (#0) > GET / HTTP/1.1 > Host: 127.0.0.1:31000 > User-Agent: curl/7.58.0 After: # curl --verbose 127.0.0.1:31000 * Rebuilt URL to: 127.0.0.1:31000/ * Trying 127.0.0.1... * TCP_NODELAY set * Connected to 127.0.0.1 (127.0.0.1) port 31000 (#0) > GET / HTTP/1.1 > Host: 127.0.0.1:31000 > User-Agent: curl/7.58.0 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Update the limitations paragraph to indicate that the getpeername(2) translation has been fixed in Cilium and takes place in newer kernels. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
sessionAffinity and multi-dev support has been merged into Cilium 1.8 as well as getpeername(2) handling for newer kernels. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
1dcfcc8 to
44f204f
Compare
|
test-me-please |
|
retest-4.19 |
|
test-me-please |
|
restest-4.9 |
|
test-me-please |
|
retest-net-next |
|
retest-runtime |
|
retest-net-next |
|
vbox error: |
|
retest-net-next |
2 similar comments
|
retest-net-next |
|
retest-net-next |
Depends on #11649 being merged first.
See commit msgs. Related kernel changes:
Fixes: #9974