Skip to content

hubble: implement peer service, enable it locally#10969

Merged
nebril merged 6 commits intomasterfrom
pr/rolinh/hubble-peer
Apr 23, 2020
Merged

hubble: implement peer service, enable it locally#10969
nebril merged 6 commits intomasterfrom
pr/rolinh/hubble-peer

Conversation

@rolinh
Copy link
Copy Markdown
Member

@rolinh rolinh commented Apr 14, 2020

Please, see individual commits for details.

@rolinh rolinh added kind/feature This introduces new functionality. release-note/major This PR introduces major new functionality to Cilium. area/hubble labels Apr 14, 2020
@rolinh rolinh requested a review from a team April 14, 2020 13:55
@rolinh rolinh requested a review from a team as a code owner April 14, 2020 13:55
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 14, 2020

Coverage Status

Coverage increased (+0.1%) to 44.74% when pulling 0c5a9e9 on pr/rolinh/hubble-peer into 531e61f on master.

Copy link
Copy Markdown
Contributor

@michi-covalent michi-covalent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change looks good, but it wasn't clear to me how i can test this in my cluster. i guess nobody is actually calling rpc Notify(NotifyRequest) yet?

Comment thread pkg/hubble/peer/service.go Outdated
Comment thread pkg/hubble/peer/peer.go
@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 15, 2020

the change looks good, but it wasn't clear to me how i can test this in my cluster.

You can use a gRPC client. For example, with grpcurl, you can invoke the service like this from the root of the git repository:

$ grpcurl -import-path api/v1/peer -proto peer.proto -unix -plaintext /var/run/cilium/hubble.sock peer.Peer.Notify
{
  "name": "runtime1",
  "address": "10.0.2.15",
  "type": "PEER_ADDED"
}

Note that this call "blocks". So one way to test this is to invoke this command then add/remove nodes from your cluster. Of course, the cilium-agent needs the --enable-hubble flag.

i guess nobody is actually calling rpc Notify(NotifyRequest) yet?

Not yet. I have an in-development version of hubble-proxy hubble-relay that is using this service but this will be for a follow-up PR once this one is merged.

Comment thread pkg/hubble/peer/peer.go
Comment thread pkg/hubble/peer/service.go Outdated
Comment thread pkg/hubble/peer/service.go Outdated
// interface so that it can be mocked for the purpose of testing.
type Service struct {
stop chan (struct{})
mgr *manager.Manager
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, consider using an interface instead of implementation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this is required for testing, like mentioned in the struct comment just above. If it's fine with you, I'll add an interface in pkg/node/manager/manager.go.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface could be defined here, not in pkg/node/manager/manager.go, this allows you to not need to import pkg/node/manager in this package.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the interface with this commit faae24149afbb9da49688945123c09a79716f903 already and implemented tests of the service.
I think that importing the interface from pkg/node/manager is not so bad because it's otherwise really hard to understand who implements it if you're not familiar enough with cilium's codebase.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rolinh there are tools built for that purpose [0] so there's no point in keeping the interfaces in the same place as 1 of the implementations. What happens if in the future we will have a different node manager that implements the same interface? Where will the interface live?

[0] https://docs.google.com/document/d/1_Y9xCEMj5S-7rv2ooHpZNH15JgRT5iM742gJkw5LtmQ/edit#heading=h.f6hyappclfor

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind moving the interface to pkg/hubble/peer if you have strong feelings about it; I don't think it matters much.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface could be defined here, not in pkg/node/manager/manager.go, this allows you to not need to import pkg/node/manager in this package.

BTW, by moving the interface to pkg/hubble/peer I could save on importing pkg/node/manager but then I'll have to import pkg/datapath as it's required for the interface definition. I'm not "saving" on import anyway.

@rolinh rolinh force-pushed the pr/rolinh/hubble-peer branch from c5b9ce6 to 2650dd5 Compare April 15, 2020 08:10
Copy link
Copy Markdown
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I do see an issue with blocking the Node Manager which I think must be addressed, but all my other comments are just comments and don't require any action.

Comment thread pkg/hubble/server/serveroption/option.go Outdated
Comment thread pkg/hubble/defaults/defaults.go Outdated
Comment thread pkg/hubble/peer/service.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stream.Send() can block (it does I/O). Which means the unbuffered h.C channel causes the NodeManager to be blocked as well in h.Node{Add,Update,Delete}.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't a context.Context part of the Send parameters?

Copy link
Copy Markdown
Member Author

@rolinh rolinh Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the ServerStream interface doesn't use a context.Context. SendMsg() is used by Send(), as can be seen in the generated code in peer.pb.go:

type Peer_NotifyServer interface {
	Send(*ChangeNotification) error
	grpc.ServerStream
}

type peerNotifyServer struct {
	grpc.ServerStream
}

func (x *peerNotifyServer) Send(m *ChangeNotification) error {
	return x.ServerStream.SendMsg(m)
}

The interface defines a context but it's global to ServerStream.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gandro I have added a test that simulates a blocked send (TestService_NotifyWithBlockedSend) and it failed (which is good as it tests what we want).

To fix the problem and have the test pass, I've added small buffer (of size 10 by default but this is configurable through service options) that, when full, triggers the server to stop steaming to the client.
This eventually pushes the problem back to the client which will have to call Notify() again in such a case. I don't think it's a problem overall because upon calling Notify(), the client receives the list of all peers before receiving continuous updates so maintaining a local cache client-side is easy.

Please, have a look again at the Notify() implementation and tell me if you think this way of handling the problem is OK.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, yes, the code itself looks good! 🎉 We have to get a bit of experience with this, because I honestly have no clue how big the buffer size should be.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to get a bit of experience with this, because I honestly have no clue how big the buffer size should be.

Neither do I. I assumed that since the connection is over a local unix domain socket, a buffer of 10 should be sufficient. I also don't want to make it unnecessarily large because if the caller can't keep up then there's probably a problem elsewhere.

Comment thread pkg/hubble/peer/service.go Outdated
@michi-covalent
Copy link
Copy Markdown
Contributor

@rolinh ok it passed my rigorous testing of scaling nodepool size from 1 to 3 and then scaling it back down to 1:

# ./grpcurl -import-path api/v1/peer -proto peer.proto -unix -plaintext /var/run/cilium/hubble.sock peer.Peer.Notify
{
  "name": "gke-michi-default-pool-87b89101-ljbs",
  "address": "10.138.0.53",
  "type": "PEER_ADDED"
}
{
  "name": "gke-michi-default-pool-87b89101-4r14",
  "address": "10.138.0.55",
  "type": "PEER_ADDED"
}
{
  "name": "gke-michi-default-pool-87b89101-lwrc",
  "address": "10.138.0.56",
  "type": "PEER_ADDED"
}
{
  "name": "gke-michi-default-pool-87b89101-4r14",
  "address": "10.138.0.55",
  "type": "PEER_DELETED"
}
{
  "name": "gke-michi-default-pool-87b89101-lwrc",
  "address": "10.138.0.56",
  "type": "PEER_DELETED"
}

Comment thread pkg/hubble/peer/service.go Outdated
// interface so that it can be mocked for the purpose of testing.
type Service struct {
stop chan (struct{})
mgr *manager.Manager
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface could be defined here, not in pkg/node/manager/manager.go, this allows you to not need to import pkg/node/manager in this package.

Comment thread pkg/hubble/peer/handler.go Outdated
Comment thread pkg/hubble/peer/service.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't a context.Context part of the Send parameters?

@rolinh rolinh force-pushed the pr/rolinh/hubble-peer branch from 2650dd5 to c586a3a Compare April 16, 2020 09:17
@rolinh rolinh requested a review from a team as a code owner April 16, 2020 09:17
@rolinh rolinh force-pushed the pr/rolinh/hubble-peer branch 2 times, most recently from c301be8 to 01611db Compare April 17, 2020 13:03
@rolinh rolinh requested a review from gandro April 17, 2020 13:20
@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 18, 2020

test-me-please

EDIT: test flake

@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 20, 2020

restart-ginkgo

EDIT: hit flake again

Comment thread pkg/hubble/peer/service.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, yes, the code itself looks good! 🎉 We have to get a bit of experience with this, because I honestly have no clue how big the buffer size should be.

@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 21, 2020

restart-ginkgo

EDIT: yet another flake :(
EDIT2: just rebased against latest master

@rolinh rolinh force-pushed the pr/rolinh/hubble-peer branch from 01611db to bfcc725 Compare April 21, 2020 11:56
@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 21, 2020

test-me-please

EDIT: update to adapt to changes from e99049f

@rolinh rolinh force-pushed the pr/rolinh/hubble-peer branch from bfcc725 to c0137af Compare April 21, 2020 12:20
@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 21, 2020

test-me-please

EDIT: rebased against master again to fix conflict introduced by f332fb5

@rolinh rolinh force-pushed the pr/rolinh/hubble-peer branch from c0137af to 85fd93d Compare April 21, 2020 12:49
@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 21, 2020

test-me-please

@nebril
Copy link
Copy Markdown
Member

nebril commented Apr 21, 2020

test-gke

1 similar comment
@nebril
Copy link
Copy Markdown
Member

nebril commented Apr 21, 2020

test-gke

@rolinh rolinh force-pushed the pr/rolinh/hubble-peer branch from 85fd93d to c7cd0ad Compare April 22, 2020 06:46
@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 22, 2020

Rebased against master to fix conflict with fef319a.

@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 22, 2020

test-me-please

@nebril
Copy link
Copy Markdown
Member

nebril commented Apr 22, 2020

test-gke

@rolinh rolinh force-pushed the pr/rolinh/hubble-peer branch from c7cd0ad to dfc21b9 Compare April 22, 2020 12:47
@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 22, 2020

Found a bug in the peer service implementation where the wrong context.Background() was used instead of the existing ctx.

@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 22, 2020

test-me-please

rolinh added 6 commits April 22, 2020 15:39
This commits removes the global variable DefaultOptions which contained
no defaults but was just initializing the listeners map.
The map initialization is done in the constructor instead in order to
avoid problems with shallow copy of default options (a 2nd call to
NewServer would use modified defaults).

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Make sure to only require the grpc interface definition for service
options. The server is a simple grpc server which serves 1 or more
services and it doesn't need to know about the internal of each service
implementation. This change makes this explicit.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This is useful for any package that needs to implement gRPC tests, thus
move it to a more generic location.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This interface will make it easier to implement tests, notably for the
peer service that uses the node manager.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commit adds a new peer module for hubble that implements the peer
gRPC server interface.

This new gRPC service notifies of nodes being added, deleted or
modified. This information is obtained by subscribing to the cilium node
manager via a new datapath.NodeHandler interface implementation
(peer.handler).

This new service will be consumed by hubble proxy in order to connect to
hubble peers. It is intended to only be served over a local unix domain
socket.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commits enables the peer service for the hubble gRPC server that
listens on the local unix domain socket. This will allow node discovery
for the peer proxy.

For all other hubble listen addresses, the peer service is not enabled.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
@rolinh rolinh force-pushed the pr/rolinh/hubble-peer branch from dfc21b9 to 0c5a9e9 Compare April 22, 2020 13:43
@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 22, 2020

rebased against master, fixed test failure uncovered by the context fix.

@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 22, 2020

test-me-please

@nebril
Copy link
Copy Markdown
Member

nebril commented Apr 22, 2020

test-gke

@nebril
Copy link
Copy Markdown
Member

nebril commented Apr 22, 2020

test-gke

Edit: looks like it timed out: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/546/execution/node/112/log/

Copy link
Copy Markdown
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@christarazi
Copy link
Copy Markdown
Member

test-gke

@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 23, 2020

The GKE test was running for 10+ hours and stuck at this step (CI issue):

[Pipeline] Start of Pipeline
[Pipeline] node
Still waiting to schedule task
‘jenkins-node-gke-fixed0’ is offline

@rolinh
Copy link
Copy Markdown
Member Author

rolinh commented Apr 23, 2020

test-gke

@nebril nebril merged commit 2d0ae5b into master Apr 23, 2020
@nebril nebril deleted the pr/rolinh/hubble-peer branch April 23, 2020 12:07
@tgraf tgraf mentioned this pull request May 6, 2020
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/feature This introduces new functionality. release-note/major This PR introduces major new functionality to Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants