v1.8 backports 2020-07-23#12627
Merged
christarazi merged 42 commits intov1.8from Jul 23, 2020
Merged
Conversation
[ upstream commit 7930f74 ] This small change should clarify that the identity transfer in overlay mode is an optimization and not a requirement. We have had several users confused by this paragraph asking how security identity was obtained in direct routing mode. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 0591d63 ] Adding the "datapath" to --debug-verbose enables debug messages produced by the datapath. Previously, it could have been enabled for new endpoints by invoking "cilium config Debug{,LB}=Enabled" on each node, and for existing - by "cilium config $EID Debug{,LB}=Enabled". Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 599c3af ] In certain environments, the argument list might become large enough so `make bench` or `make bench-privileged` fail with: make: execvp: /bin/bash: Argument list too long Follow the approach from commit cfa665d ("Explicitly break unit-tests foreach into for loop.") for the `bench` and `bench-privileged` targets as well to fix this. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 46818da ] This is a first step that will allow implementing unit tests for the relay package. A new interface, `peerSyncer`, is defined and implemented by a `syncer` struct. Having the interface will allow creating a mock for testing purposes. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 207da71 ] This provides a better abstraction as the client that calls `List()` is expected to receive a list of peers that are "ready" to connect to and does not have to manage re-connect. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 5d9eec6 ] This commit is an in-depth reimplementation of the peers synchronization/connections handling logic. All logic regarding management of peers and connections to them has been moved to a separate package with a clearly defined interface for the consumer (`pool.PeerManager`). The goal here is threefold: 1. Separation of concerns (by hiding all internal peer handling logic into a dedicated package) 2. Increased testability 3. Improved connections handling The consumer of the `pool` package does not have to care whether a peer is connected or not anymore. If a peer is not connected, it's `Conn` attribute will be `<nil>`. If `Conn` is not `<nil>` but the peer cannot be reached, the consumer can signal it by using the manager's `ReportOffline(name string)` method. The new `pool.Manager` (which implements the `pool.PeerManager` interface) also implements a better mechanism to ensure peers connection are healthy by using a dedicated routine that monitors for offline/reported offline peers and attempts to (re-)connect to them. This improves the situation as with the older implementation, re-connecting to a peer required a new call to the Hubble Relay service (and the peers that were "woken up" were not available for the request). Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit e4e5fcd ] Such interface are useful to create good abstractions with code that require a Peer service client. They can also easily be mocked which is essential to write good unit tests. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c61fb77 ] LocalClientBuilder creates a new Client which is suitable when the gRPC connection to the Peer service is local (typically a Unix Domain Socket). This scenario is the most common one and is directly useful to Hubble Relay. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 5168499 ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit f2a8f31 ] This commits defines an interface for a gRPC client connection builder. When instantiating a new pool.Manager, the caller may specify a grpc client connection builder implementation to be used to create connection to Hubble peers. This allows for fine grained control over the client connection parameters by the caller. In addition, as this is an interface, it can easily be mocked which is useful when writing unit tests. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 5be8aca ] With the rework of peer connections handling, a goroutine periodically checks whether peer connections are healthy or not. This check is necessary as there are scenarios where a connection can be lost without any peer service notification or client request (for instance, a node reboot). This commit improves reconnection attempts by implementing a backoff mechanism so attempts are not too frequent. There is a default backoff mechanism that can be configured by caller when instantiating a new pool.Manager. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 3120dc1 ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 24f65cc ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 35471ca ] On peer change notifications or call to `pool.Manager.ReportOffline(name string)`, the backoff interval should be ignored to re-attempt a connection. This change typically addresses the following scenario: 1. Node Foo is offline 2. The connectivity checker attempts to reconnect several time; each time the backoff is increased. 3. Node Foo becomes available; a peer change notification is received 4. Oops, no connection attempt is made to Foo because of a large backoff duration Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 08e7f3d ] This is not necessary as the connect function closes pre-existing connections. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 2064bdc ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit bee30fa ] This is necessary to avoid import cycles when implementing fakes for the peer interfaces in the `testutils` package. Note that the `types` sub-package is commonly found throughout Cilium's codebase so this change follows the convention. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit df12d9e ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 304bced ] This will allow mocking gRPC clients which is useful for unit testing. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 7b42b5f ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 065ec3d ] These unit tests are not exhaustive although they still provide a decent coverage. However, it provides a good structure to implement more tests in the future such as complex peer connectivity issue scenarios or simply regression tests for bugs that are fixed. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 7b7845a ] This commit removes `Target()` from the peer `ClientBuilder` interface and, instead, adds the target as a parameter to `Client()`. As a consequence, Hubble Relay code is adapted to the change. The peer pool manager now has a new option to provide the address of the peer gRPC service. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 029ca1f ] The debug option was only used to set the log level of the logger. As the caller now has control over the logger to be used, this option is no longer necessary. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 4297a79 ] This change makes tests more reliable and is cleaner at the same time as once the function returns, the pool manager is really shut down. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 4d557d9 ] This commit tests the pool manager logger output for certain tests. This change revealed some flakiness in the tests that have been addressed. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 6404664 ] The peers map is expected to be more frequently read than written to, especially with the `List()` method being exposed and used by Hubble Relay for every request it handles. Therefore, a RWLock is more approriate than a regular lock. Suggested-by: André Martins <andre@cilium.io> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 9fdcca2 ] The name `pool` clashes with the package `pool`. It is not a problem per se but creates a bit of confusion so rename it to `pm`. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 37af0ea ] The `access-log` option was removed in 1.8, so move it from the "Deprecated options" section to the correct "Removed options" section in the upgrade guide. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 794c34b ] Commit 19ad311 incorrectly changed the return value to nil instead of returning an error. Fixes: 19ad311 ("kvstore: Fix Watch() to return when client is closed or context is cancelled") Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit bf8e432 ] The firstSession channel has been used to allow blocking until the initial session has been established. However, because callers have not been handling errors, it was not possible to close the channel if the etcd client was shut down before the session was ever established and thus the channel was leaked. This was even more problematic for etcd operations performed without a context with a timeout as most operations currently block on firstSession or context. Consolidate the errChan and firstSession as they effectively served the same purpose. The only slight change is that the firstSession channel is closed *after* the etcd version has been verified which is the right behavior anyway. Having the error condition returned via the firstSession channel allows to return it to owning controllers as well when renewing sessions for better visibility. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit e959c9f ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 589bdf3 ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 02a1810 ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 0262854 ] The context passed into NewSession() was supposed to enforce a timeout on the NewSession() operation which is triggering a Grant() and KeepAlive() instruction. However, the context passed into NewSession() will also be associated with the resulting lease. As per etcd documentation, this will result in: > If the context is canceled before Close() completes, the session's lease will > be abandoned and left to expire instead of being revoked. Because of this, any session renewal triggering a new session would create a session that is immediately closed again due to the context passed into NewSession being cancelled when the controller run ends successfully. This resulted in any renewed session to have an effective lifetime of 10 milliseconds before requiring renewal again. Fixes: #12619 Fixes: 8524fca ("kvstore: Add session renew backoff") Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit fff58ef ] Follow-up to #12305, where we raised the timeout from 100ms to 500ms. This seemed to suppress most of the flakes reported in #12042, but we saw one again recently: Try restoring the timeout value to its original value of 1 second. Most of the time the RTT time for the exchange is way below 100ms anyway and we won't have a difference on tests duration. In the worst and very unlikely case where all DNS TCP exchanges are super-slow, we only have 5 exchanges in the tests and cannot spend more than a total 5 seconds on them (or one would timeout and the test fail). Fixes: #12042 Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
rolinh
approved these changes
Jul 23, 2020
Member
rolinh
left a comment
There was a problem hiding this comment.
Looks good for my changes as well 👍
Member
Author
|
test-backport-1.8 |
Member
|
test-upstream-k8s |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Once this PR is merged, you can update the PR labels via: