hubble/relay: improve peer connections handling (#12556)
Conversation
test-me-please
gandro
left a comment
🎉 This is an awesome change! The resulting code looks much cleaner and the fact that we can now easily mock connectivity issues and faulty client responses is simply amazing.
The implementation itself looks solid. I do have a few (mostly minor) higher-level questions and comments that popped up during the review that I wanted to discuss before approving.
How does this interact with the backoff? I think if we only trigger reconnects every ConnCheckInterval, then the minimal backoff will essentially always be at least ConnCheckInterval, in this case 2 minutes. The first retry which is supposed to happen after 10 seconds according to exponential backoff will actually happen after 2 minutes.
In addition, we will start all re-connection attempts always at the same time for all offline peers whose backoff has expired in the last 2 minutes. There is no jitter.
I'm not sure what the best approach is and we can do it in a follow-up PR. But it seems to me that only reconnecting all peers every ConnCheckInterval will lead to very bursty reconnection behavior.
Ideally, we'd have an interval for checking the connection status, and separate logic scheduling the reconnection (i.e. a timer which expires at the smallest nextConnAttempt).
I think if we only trigger reconnects every ConnCheckInterval

Reconnects are also triggered when a client request uses ReportOffline.
The first retry which is supposed to happen after 10 seconds according to exponential backoff will actually happen after 2 minutes.
Unless there's a client request, yes, this is true. I'm not sure it's worth trying to reconnect more aggressively than this.
In addition, we will start all re-connection attempts always at the same time for all offline peers whose backoff has expired in the last 2 minutes. There is no jitter.
Very good point. I thought about this as well but as this PR was already getting pretty big, I thought this could be addressed in a follow-up PR. Note that even before this PR there's also another burst when Hubble Relay is started and is notified of all the peers in the cluster.
I fully agree with your point here and I'm happy to discuss a solution to implement in a follow-up PR.
Unless there's a client request, yes, this is true. I'm not sure it's worth trying to reconnect more aggressively than this.
If there is a client reconnect, then the backoff is ignored anyway. Currently, the actual back-off sequence looks like this:
2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m40s
5m20s
10m40s
21m20s
42m40s
1h25m20s
2h50m40s
5h41m20s
11h22m40s
12h0m0s
I agree that we do not have to fix this in this PR, but it seems like this is not intentional.
Note to reviewers, since it's not clear from context: m.connect() will not reconnect if there is already a re-connection attempt in progress. So it's fine if the same peer is reported offline multiple times, we will still only do one actual re-connection attempt in m.connect.
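The guard described in this note can be sketched with a per-peer flag protected by a mutex. This is an illustrative reconstruction, not the PR's actual m.connect implementation (the struct, field, and function names are assumptions):

```go
package main

import (
	"fmt"
	"sync"
)

// peer tracks whether a (re-)connection attempt is already running,
// so that reporting the same peer offline multiple times results in
// at most one concurrent connection attempt.
type peer struct {
	mu         sync.Mutex
	connecting bool
}

// tryConnect starts dial in the background and returns true, unless an
// attempt is already in flight, in which case it returns false.
func (p *peer) tryConnect(dial func() error) bool {
	p.mu.Lock()
	if p.connecting {
		p.mu.Unlock()
		return false
	}
	p.connecting = true
	p.mu.Unlock()

	go func() {
		defer func() {
			p.mu.Lock()
			p.connecting = false
			p.mu.Unlock()
		}()
		_ = dial() // errors would feed the backoff logic in real code
	}()
	return true
}

func main() {
	p := &peer{}
	fmt.Println("attempt started:", p.tryConnect(func() error { return nil }))
}
```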
Force-pushed 378c675 to deaa98a
Note: rebased against master to pick up #12572
Force-pushed deaa98a to 24cbc7c
When declaring a struct where the mutex must protect access to one or more fields, place the mutex above the fields that it will protect as a best practice.
Where does this quote come from? In any case, I've reordered the struct members.
Couldn't this stop be a context? [1]
It could be replaced by a context but I try to follow the best practice documented in the context package documentation:
Programs that use Contexts should follow these rules to keep interfaces consistent across packages and enable static analysis tools to check context propagation:
Do not store Contexts inside a struct type; ...
[1] then it could be used as a parent of this context.
I might be wrong but this mutex could be replaced with a lock.RWMutex
Why would we do this? I think it would be detrimental to performance in this case.
I don't think we would benefit from using a RWMutex in this specific case.
Force-pushed eff36f5 to 793c473
Force-pushed 793c473 to e504a29
test-me-please
gandro
left a comment
The new commits look fine to me in terms of approach! Thanks for addressing my nits and adding some additional checks in the unit tests.
I don't understand the point of the defers though. They seem to try to achieve something, but I don't understand what.
Force-pushed e504a29 to 5ee6f8e
Indeed, not really useful :) I removed all the defers.
test-me-please
Force-pushed 5ee6f8e to f4ea3c3
kaworu
left a comment
Great patch! I only have some minor comments and questions.
This is a first step that will allow implementing unit tests for the relay package. A new interface, `peerSyncer`, is defined and implemented by a `syncer` struct. Having the interface will allow creating a mock for testing purposes. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
On peer change notifications or call to
`pool.Manager.ReportOffline(name string)`, the backoff interval should
be ignored to re-attempt a connection. This change typically addresses
the following scenario:
1. Node Foo is offline
2. The connectivity checker attempts to reconnect several times; each
time the backoff is increased.
3. Node Foo becomes available; a peer change notification is received
4. Oops, no connection attempt is made to Foo because of a large
backoff duration
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This is not necessary as the connect function closes pre-existing connections. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This is necessary to avoid import cycles when implementing fakes for the peer interfaces in the `testutils` package. Note that the `types` sub-package is commonly found throughout Cilium's codebase so this change follows the convention. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This will allow mocking gRPC clients which is useful for unit testing. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
These unit tests are not exhaustive but they still provide decent coverage. They also provide a good structure to implement more tests in the future, such as complex peer connectivity issue scenarios or simply regression tests for fixed bugs. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commit removes `Target()` from the peer `ClientBuilder` interface and, instead, adds the target as a parameter to `Client()`. As a consequence, Hubble Relay code is adapted to the change. The peer pool manager now has a new option to provide the address of the peer gRPC service. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
The debug option was only used to set the log level of the logger. As the caller now has control over the logger to be used, this option is no longer necessary. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This change makes tests more reliable and is cleaner at the same time as once the function returns, the pool manager is really shut down. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commit tests the pool manager logger output for certain tests. This change revealed some flakiness in the tests that have been addressed. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
The peers map is expected to be more frequently read than written to, especially with the `List()` method being exposed and used by Hubble Relay for every request it handles. Therefore, a RWLock is more appropriate than a regular lock. Suggested-by: André Martins <andre@cilium.io> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
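The read-mostly pattern this commit message describes can be sketched as follows: `List()` takes the read lock so concurrent requests do not serialize behind each other, while peer change notifications take the write lock. The struct and field names are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// peerMap maps peer names to addresses; reads vastly outnumber writes.
type peerMap struct {
	mu    sync.RWMutex
	peers map[string]string
}

// List returns all known peer names under the read lock, so many
// callers can list concurrently.
func (p *peerMap) List() []string {
	p.mu.RLock()
	defer p.mu.RUnlock()
	out := make([]string, 0, len(p.peers))
	for name := range p.peers {
		out = append(out, name)
	}
	return out
}

// Upsert records or updates a peer under the exclusive write lock.
func (p *peerMap) Upsert(name, addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.peers[name] = addr
}

func main() {
	pm := &peerMap{peers: map[string]string{}}
	pm.Upsert("node-1", "192.0.2.1:4244")
	fmt.Println(len(pm.List()), "peer(s)")
}
```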
The name `pool` clashes with the package `pool`. It is not a problem per se but creates a bit of confusion so rename it to `pm`. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Force-pushed f4ea3c3 to f95e76a
test-me-please
```go
// ClientConn is an interface that defines the functions clients need to
// perform unary and streaming RPCs. It is implemented by *grpc.ClientConn.
type ClientConn interface {
	// GetState returns the connectivity.State of ClientConn.
```
Even in grpc v1.30.0 (the latest at the time of writing), GetState() is marked as experimental.
Since this PR makes use of it in several places, would it make sense to keep track of this somewhere?
I think it's important to pay attention to this when upgrading the go-grpc dep but we will anyway have to face breaking changes when upgrading.
This PR improves peer connections handling and introduces unit tests. In order to implement unit tests, a major refactoring was necessary. This refactoring also has the side effect of improving separation of concerns, as all the peer connection handling logic now lives in its own package (pkg/hubble/relay/pool). For more details, please go through the list of commits.
Rel: #11425