hubble/relay: improve peer connections handling (#12556)
Conversation
test-me-please
gandro
left a comment
🎉 This is an awesome change! The resulting code looks much cleaner and the fact that we can now easily mock connectivity issues and faulty client responses is simply amazing.
The implementation itself looks solid. I do have a few (mostly minor) higher-level questions and comments that popped up during the review that I wanted to discuss before approving.
How does this interact with the backoff? I think if we only trigger reconnects every ConnCheckInterval, then the minimal backoff will essentially always be at least ConnCheckInterval, in this case 2 minutes. The first retry which is supposed to happen after 10 seconds according to exponential backoff will actually happen after 2 minutes.
In addition, we will start all re-connection attempts always at the same time for all offline peers whose backoff has expired in the last 2 minutes. There is no jitter.
I'm not sure what the best approach is and we can do it in a follow-up PR. But it seems to me that only reconnecting all peers every ConnCheckInterval will lead to very bursty reconnection behavior.
Ideally, we'd have an interval for checking the connection status, and separate logic scheduling the reconnection (i.e. a timer which expires at the smallest nextConnAttempt).
I think if we only trigger reconnects every ConnCheckInterval

Reconnects are also triggered when a client request uses ReportOffline.
The first retry which is supposed to happen after 10 seconds according to exponential backoff will actually happen after 2 minutes.
Unless there's a client request, yes, this is true. I'm not sure it's worth trying to reconnect more aggressively than this.
In addition, we will start all re-connection attempts always at the same time for all offline peers whose backoff has expired in the last 2 minutes. There is no jitter.
Very good point. I thought about this as well but as this PR was already getting pretty big, I thought this could be addressed in a follow-up PR. Note that even before this PR there's also another burst when Hubble Relay is started and is notified of all the peers in the cluster.
I fully agree with your point here and I'm happy to discuss a solution to implement in a follow-up PR.
Unless there's a client request, yes, this is true. I'm not sure it's worth trying to reconnect more aggressively than this.
If there is a client reconnect, then the backoff is ignored anyway. Currently, the actual back-off sequence looks like this:
2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m40s
5m20s
10m40s
21m20s
42m40s
1h25m20s
2h50m40s
5h41m20s
11h22m40s
12h0m0s
I agree that we do not have to fix this in this PR, but it seems like this is not intentional.
Note to reviewers, since it's not clear from context: m.connect() will not reconnect if there is already a re-connection attempt in progress. So it's fine if the same peer is reported offline multiple times, we will still only do one actual re-connection attempt in m.connect.
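The guard described in this note can be sketched with a per-peer flag protected by a mutex. This is an illustrative reconstruction, not the PR's actual m.connect implementation (the struct, field, and function names are assumptions):

```go
package main

import (
	"fmt"
	"sync"
)

// peer tracks whether a (re-)connection attempt is already running,
// so that reporting the same peer offline multiple times results in
// at most one concurrent connection attempt.
type peer struct {
	mu         sync.Mutex
	connecting bool
}

// tryConnect starts dial in the background and returns true, unless an
// attempt is already in flight, in which case it returns false.
func (p *peer) tryConnect(dial func() error) bool {
	p.mu.Lock()
	if p.connecting {
		p.mu.Unlock()
		return false
	}
	p.connecting = true
	p.mu.Unlock()

	go func() {
		defer func() {
			p.mu.Lock()
			p.connecting = false
			p.mu.Unlock()
		}()
		_ = dial() // errors would feed the backoff logic in real code
	}()
	return true
}

func main() {
	p := &peer{}
	fmt.Println("attempt started:", p.tryConnect(func() error { return nil }))
}
```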
Force-pushed 378c675 to deaa98a
Note: rebased against master to pick up #12572
Force-pushed deaa98a to 24cbc7c
When declaring a struct where the mutex must protect access to one or more fields, place the mutex above the fields that it will protect as a best practice.
Where does this quote come from? In any case, I've reordered the struct members.
Couldn't this stop be a context? [1]
It could be replaced by a context but I try to follow the best practice documented in the context package documentation:
Programs that use Contexts should follow these rules to keep interfaces consistent across packages and enable static analysis tools to check context propagation:
Do not store Contexts inside a struct type; ...
[1] then it could be used as a parent of this context.
I might be wrong but this mutex could be replaced with a lock.RWMutex
Why would we do this? I think it would be detrimental to performance in this case.
I don't think we would benefit from using a RWMutex in this specific case.
Force-pushed eff36f5 to 793c473
Force-pushed 793c473 to e504a29
test-me-please
gandro
left a comment
The new commits look fine to me in terms of approach! Thanks for addressing my nits and adding some additional checks in the unit tests.
I don't understand the point of the defers though. They seem to try to achieve something, but I don't understand what.
Force-pushed e504a29 to 5ee6f8e
Indeed, not really useful :) I removed all the defers.
test-me-please
Force-pushed 5ee6f8e to f4ea3c3
kaworu
left a comment
Great patch! I only have some minor comments and questions.
This is a first step that will allow implementing unit tests for the relay package. A new interface, `peerSyncer`, is defined and implemented by a `syncer` struct. Having the interface will allow creating a mock for testing purposes. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
On peer change notifications or call to
`pool.Manager.ReportOffline(name string)`, the backoff interval should
be ignored to re-attempt a connection. This change typically addresses
the following scenario:
1. Node Foo is offline
2. The connectivity checker attempts to reconnect several times; each
time the backoff is increased.
3. Node Foo becomes available; a peer change notification is received
4. Oops, no connection attempt is made to Foo because of a large
backoff duration
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This is not necessary as the connect function closes pre-existing connections. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This is necessary to avoid import cycles when implementing fakes for the peer interfaces in the `testutils` package. Note that the `types` sub-package is commonly found throughout Cilium's codebase so this change follows the convention. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This will allow mocking gRPC clients which is useful for unit testing. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
These unit tests are not exhaustive but they still provide decent coverage. They also provide a good structure to implement more tests in the future, such as complex peer connectivity issue scenarios or simply regression tests for fixed bugs. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commit removes `Target()` from the peer `ClientBuilder` interface and, instead, adds the target as a parameter to `Client()`. As a consequence, Hubble Relay code is adapted to the change. The peer pool manager now has a new option to provide the address of the peer gRPC service. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
The debug option was only used to set the log level of the logger. As the caller now has control over the logger to be used, this option is no longer necessary. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This change makes tests more reliable and is cleaner at the same time as once the function returns, the pool manager is really shut down. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commit tests the pool manager logger output for certain tests. This change revealed some flakiness in the tests that have been addressed. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
The peers map is expected to be more frequently read than written to, especially with the `List()` method being exposed and used by Hubble Relay for every request it handles. Therefore, a RWLock is more appropriate than a regular lock. Suggested-by: André Martins <andre@cilium.io> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
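The read-mostly pattern this commit message describes can be sketched as follows: `List()` takes the read lock so concurrent requests do not serialize behind each other, while peer change notifications take the write lock. The struct and field names are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// peerMap maps peer names to addresses; reads vastly outnumber writes.
type peerMap struct {
	mu    sync.RWMutex
	peers map[string]string
}

// List returns all known peer names under the read lock, so many
// callers can list concurrently.
func (p *peerMap) List() []string {
	p.mu.RLock()
	defer p.mu.RUnlock()
	out := make([]string, 0, len(p.peers))
	for name := range p.peers {
		out = append(out, name)
	}
	return out
}

// Upsert records or updates a peer under the exclusive write lock.
func (p *peerMap) Upsert(name, addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.peers[name] = addr
}

func main() {
	pm := &peerMap{peers: map[string]string{}}
	pm.Upsert("node-1", "192.0.2.1:4244")
	fmt.Println(len(pm.List()), "peer(s)")
}
```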
The name `pool` clashes with the package `pool`. It is not a problem per se but creates a bit of confusion so rename it to `pm`. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Force-pushed f4ea3c3 to f95e76a
test-me-please
```go
// ClientConn is an interface that defines the functions clients need to
// perform unary and streaming RPCs. It is implemented by *grpc.ClientConn.
type ClientConn interface {
	// GetState returns the connectivity.State of ClientConn.
```
Even in grpc v1.30.0 (the latest at the time of writing), GetState() is marked as experimental.
Since this PR makes use of it in several places, would it make sense to keep track of this somewhere?
I think it's important to pay attention to this when upgrading the go-grpc dep but we will anyway have to face breaking changes when upgrading.
This PR improves peer connections handling and introduces unit tests. In order to implement unit tests, a major refactoring was necessary. This refactoring also has the side effect of improving separation of concerns, as all the peer connection handling logic now lives in its own package (pkg/hubble/relay/pool). For more details, please go through the list of commits.
Rel: #11425