Skip to content

bpf: getpeername hook implementation for socket lb#11617

Merged
borkmann merged 10 commits intomasterfrom
pr/sock-lb-peername
May 23, 2020
Merged

bpf: getpeername hook implementation for socket lb#11617
borkmann merged 10 commits intomasterfrom
pr/sock-lb-peername

Conversation

@borkmann
Copy link
Copy Markdown
Member

@borkmann borkmann commented May 20, 2020

@borkmann borkmann added pending-review area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. labels May 20, 2020
@borkmann borkmann requested review from a team and brb May 20, 2020 15:00
@borkmann borkmann requested review from a team as code owners May 20, 2020 15:00
@maintainer-s-little-helper
Copy link
Copy Markdown

Please set the appropriate release note label.

3 similar comments
@maintainer-s-little-helper
Copy link
Copy Markdown

Please set the appropriate release note label.

@maintainer-s-little-helper
Copy link
Copy Markdown

Please set the appropriate release note label.

@maintainer-s-little-helper
Copy link
Copy Markdown

Please set the appropriate release note label.

@borkmann borkmann added release-note/major This PR introduces major new functionality to Cilium. and removed dont-merge/needs-release-note labels May 20, 2020
Comment thread vagrant_box_defaults.rb
Copy link
Copy Markdown
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for Martynas's comment and one missing test case, LGTM!

Comment thread bpf/include/bpf/builtins.h
@borkmann borkmann force-pushed the pr/sock-lb-peername branch 2 times, most recently from 0d4d4ae to 974f580 Compare May 22, 2020 09:29
@coveralls
Copy link
Copy Markdown

coveralls commented May 22, 2020

Coverage Status

Coverage increased (+0.03%) to 36.894% when pulling 1dcfcc8b55c16dc9696b5324e18c8b296783df17 on pr/sock-lb-peername into cc84c1d on master.

@borkmann borkmann force-pushed the pr/sock-lb-peername branch from aa8d4aa to 327eb41 Compare May 22, 2020 13:32
@borkmann borkmann requested a review from a team as a code owner May 22, 2020 13:54
@pchaigno
Copy link
Copy Markdown
Member

Does commit bpf: properly reverse xlate svc:port tuple from wildcarded lookups require a backport?

Copy link
Copy Markdown
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@borkmann
Copy link
Copy Markdown
Member Author

borkmann commented May 22, 2020

Does commit bpf: properly reverse xlate svc:port tuple from wildcarded lookups require a backport?

Yeah, I'll take care of it for 1.7. I've marked it as needs-backport for 1.7.

@borkmann
Copy link
Copy Markdown
Member Author

borkmann commented May 22, 2020

test-docs-please (green)

@pchaigno
Copy link
Copy Markdown
Member

test-docs-please (green)

@borkmann I don't think that's required anymore since the doc tests run in GitHub Action.

borkmann added 6 commits May 22, 2020 23:03
Same reasoning as with a630353 ("bpf: only update nodeport neigh
entry if stale or non-existant"). For sock LB users with UDP proto, this
map update is in the hot-path and constantly hitting the map update
spinlock which is completely unnecessary and very high overhead. Only
update if needed.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Make it more clear where we attach what and simplify the entry point
naming accordingly.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The vast majority of applications does not bother about getpeername(),
but in a few occasions we've seen breakage when validating the peer's
address since it returns unexpectedly the backend tuple instead of the
service one.

The implementation here is actually nicely straight forward in that we
can fully reuse sock{4,6}_update_revnat() as well as the __sock{4,6}_xlate_rev()
implementation which also handles v4-in-v6. This works because the kernel
populates the struct sockaddr of its known peer in inet{,6}_getname() and
we have recorded that exactly on prior connect hook [0], so it tells us
exactly the address it wants to see. This is relevant for {un,}connected
UDP, for example.

I was originally thinking to use socket local storage for this, but i)
without BTF support it won't work and ii) we can spare an additional map
by just using LB{4,6}_REVERSE_NAT_SK_MAP in an efficient manner (that is,
w/o hitting the update spinlock every time).

Simple example:

  # ./cilium/cilium service list
  ID   Frontend     Service Type   Backend
  1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80

Before; curl's verbose output example, no getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
  > GET / HTTP/1.1
  > Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

After; with getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
  > GET / HTTP/1.1
  >  Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

Attached hooks:

  # bpftool cgroup list /var/run/cilium/cgroupv2/
  ID       AttachType      AttachFlags     Name
  14492    connect4
  14487    connect6
  14494    post_bind4
  14489    post_bind6
  14495    sendmsg4
  14490    sendmsg6
  14496    recvmsg4
  14491    recvmsg6
  14493    getpeername4
  14488    getpeername6

  [0] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=1b66d253610c7f8f257103808a9460223a087469

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
While adding the getpeername support, I noticed that for wild-carded lookups
we record the wrong IP:port for reverse translation.

Service map:

  # ./cilium/cilium service list
  ID   Frontend        Service Type   Backend
  1    1.2.3.4:31000   NodePort       1 => 8.8.8.8:53
  2    0.0.0.0:31000   NodePort       1 => 8.8.8.8:53

Non-wildcarded NodePort query works as expected:

  # nslookup -port=31000 google.com 1.2.3.4
  Server:         1.2.3.4
  Address:        1.2.3.4#31000

  Non-authoritative answer:
  Name:   google.com
  Address: 216.58.215.238
  Name:   google.com
  Address: 2a00:1450:400a:801::200e

Wildcarded NodePort query reverses to wildcard address instead of original
one passed from the app:

  # nslookup -port=31000 google.com 127.0.0.1
  ;; reply from unexpected source: 0.0.0.0#31000, expected 127.0.0.1#31000
  ;; reply from unexpected source: 0.0.0.0#31000, expected 127.0.0.1#31000
  ;; reply from unexpected source: 0.0.0.0#31000, expected 127.0.0.1#31000
  ;; connection timed out; no servers could be reached

After the fix:

  # nslookup -port=31000 google.com 127.0.0.1
  Server:		127.0.0.1
  Address:	127.0.0.1#31000

  Non-authoritative answer:
  Name:	google.com
  Address: 216.58.215.238
  Name:	google.com
  Address: 2a00:1450:400a:800::200e

Same for getpeername, service map:

  # ./cilium/cilium service list
  ID   Frontend        Service Type   Backend
  1    1.2.3.4:31000   NodePort       1 => 193.99.144.80:80
  2    0.0.0.0:31000   NodePort       1 => 193.99.144.80:80

Before:

  # curl --verbose 127.0.0.1:31000
  * Rebuilt URL to: 127.0.0.1:31000/
  *   Trying 127.0.0.1...
  * TCP_NODELAY set
  * Connected to 127.0.0.1 (0.0.0.0) port 31000 (#0)
  > GET / HTTP/1.1
  > Host: 127.0.0.1:31000
  > User-Agent: curl/7.58.0

After:

  # curl --verbose 127.0.0.1:31000
  * Rebuilt URL to: 127.0.0.1:31000/
  *   Trying 127.0.0.1...
  * TCP_NODELAY set
  * Connected to 127.0.0.1 (127.0.0.1) port 31000 (#0)
  > GET / HTTP/1.1
  > Host: 127.0.0.1:31000
  > User-Agent: curl/7.58.0

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Update the limitations paragraph to indicate that the getpeername(2) translation
has been fixed in Cilium and takes place in newer kernels.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
sessionAffinity and multi-dev support has been merged into Cilium 1.8 as
well as getpeername(2) handling for newer kernels.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann borkmann force-pushed the pr/sock-lb-peername branch from 1dcfcc8 to 44f204f Compare May 22, 2020 21:03
@borkmann
Copy link
Copy Markdown
Member Author

test-me-please

@borkmann
Copy link
Copy Markdown
Member Author

retest-4.19

@borkmann
Copy link
Copy Markdown
Member Author

test-me-please

@borkmann
Copy link
Copy Markdown
Member Author

restest-4.9

@borkmann
Copy link
Copy Markdown
Member Author

test-me-please

@borkmann
Copy link
Copy Markdown
Member Author

retest-net-next

@borkmann
Copy link
Copy Markdown
Member Author

retest-runtime

@borkmann
Copy link
Copy Markdown
Member Author

retest-net-next

@borkmann
Copy link
Copy Markdown
Member Author

vbox error:

10:59:40  Progress state: VBOX_E_OBJECT_IN_USE
10:59:40  VBoxManage: error: Machine delete failed
10:59:40  VBoxManage: error: Cannot close medium '/root/VirtualBox VMs/ubuntu-18.04.4-amd64-55_1590229760245_39345/ubuntu-18.04.4-amd64-55-disk001.vmdk' because it has 2 child media
10:59:40  VBoxManage: error: Details: code VBOX_E_OBJECT_IN_USE (0x80bb000c), component MediumWrap, interface IMedium
10:59:40  VBoxManage: error: Context: "RTEXITCODE handleUnregisterVM(HandlerArg*)" at line 162 of file VBoxManageMisc.cpp
10:59:40  VBoxManage: error: Machine 'dc5268d5-da68-4fc4-9b13-921a35692493' is not currently running
10:59:40  0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
10:59:40  VBoxManage: error: Machine '3c6abf7b-0259-4188-973c-e78ed91e0dac' is not currently running
10:59:42  0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

@borkmann
Copy link
Copy Markdown
Member Author

retest-net-next

2 similar comments
@borkmann
Copy link
Copy Markdown
Member Author

retest-net-next

@borkmann
Copy link
Copy Markdown
Member Author

retest-net-next

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/major This PR introduces major new functionality to Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cilium, host reachable services: implement getpeername hook

6 participants