v1.8 backports 2020-07-23 by pchaigno · Pull Request #12627 · cilium/cilium

pchaigno · 2020-07-23T14:18:07Z

docs: Clarify that identity transfer is an optimization #12589 -- docs: Clarify that identity transfer is an optimization (@pchaigno)
daemon: Add "datapath" opt to --debug-verbose flag #12568 -- daemon: Add "datapath" opt to --debug-verbose flag (@brb)
Makefile: process packages in subshells for bench targets #12607 -- Makefile: process packages in subshells for bench targets (@tklauser)
hubble/relay: improve peer connections handling #12556 -- hubble/relay: improve peer connections handling (@rolinh)
docs: move removal of access-log option to correct section #12615 -- docs: move removal of access-log option to correct section (@tklauser)
install/kubernetes: remove deprecated tofqdns-enable-poller option #12618 -- install/kubernetes: remove deprecated tofqdns-enable-poller option (@tklauser)
Fixes for EKS NodePort XDP getting started guide #12623 -- Fixes for EKS NodePort XDP getting started guide (@tklauser)
etcd: Fix several etcd related issues #12605 -- etcd: Fix several etcd related issues (@tgraf)
fqdn/dnsproxy/proxy_test: increase again timeout for DNS TCP exchanges #12606 -- fqdn/dnsproxy/proxy_test: increase again timeout for DNS TCP exchanges (@qmonnet)
k8s: Fix CRD schema version to 1.21.1

Once this PR is merged, you can update the PR labels via:

$ for pr in 12589 12568 12607 12556 12615 12618 12623 12605 12606; do contrib/backporting/set-labels.py $pr done 1.8; done

[ upstream commit b9496e1 ] 7796070 removed bpf_lb.o and any LB state transfers across nodes for DSR. Fixes: 7796070 ("bpf: Remove bpf_lb.c") Reported-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 7930f74 ] This small change should clarify that the identity transfer in overlay mode is an optimization and not a requirement. We have had several users confused by this paragraph asking how security identity was obtained in direct routing mode. Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 0591d63 ] Adding the "datapath" to --debug-verbose enables debug messages produced by the datapath. Previously, it could have been enabled for new endpoints by invoking "cilium config Debug{,LB}=Enabled" on each node, and for existing - by "cilium config $EID Debug{,LB}=Enabled". Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 1173f18 ] Add a note to the upgrade guide how to enable datapath debug messages which were disabled by [1] (released in v1.8.0). [1]: 583e67c Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 599c3af ] In certain environments, the argument list might become large enough so `make bench` or `make bench-privileged` fail with: make: execvp: /bin/bash: Argument list too long Follow the approach from commit cfa665d ("Explicitly break unit-tests foreach into for loop.") for the `bench` and `bench-privileged` targets as well to fix this. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 46818da ] This is a first step that will allow implementing unit tests for the relay package. A new interface, `peerSyncer`, is defined and implemented by a `syncer` struct. Having the interface will allow creating a mock for testing purposes. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 207da71 ] This provides a better abstraction as the client that calls `List()` is expected to receive a list of peers that are "ready" to connect to and does not have to manage re-connect. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 5d9eec6 ] This commit is an in-depth reimplementation of the peers synchronization/connections handling logic. All logic regarding management of peers and connections to them has been moved to a separate package with a clearly defined interface for the consumer (`pool.PeerManager`). The goal here is threefold: 1. Separation of concerns (by hiding all internal peer handling logic into a dedicated package) 2. Increased testability 3. Improved connections handling The consumer of the `pool` package does not have to care whether a peer is connected or not anymore. If a peer is not connected, it's `Conn` attribute will be `<nil>`. If `Conn` is not `<nil>` but the peer cannot be reached, the consumer can signal it by using the manager's `ReportOffline(name string)` method. The new `pool.Manager` (which implements the `pool.PeerManager` interface) also implements a better mechanism to ensure peers connection are healthy by using a dedicated routine that monitors for offline/reported offline peers and attempts to (re-)connect to them. This improves the situation as with the older implementation, re-connecting to a peer required a new call to the Hubble Relay service (and the peers that were "woken up" were not available for the request). Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit e4e5fcd ] Such interface are useful to create good abstractions with code that require a Peer service client. They can also easily be mocked which is essential to write good unit tests. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit c61fb77 ] LocalClientBuilder creates a new Client which is suitable when the gRPC connection to the Peer service is local (typically a Unix Domain Socket). This scenario is the most common one and is directly useful to Hubble Relay. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 5168499 ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit f2a8f31 ] This commits defines an interface for a gRPC client connection builder. When instantiating a new pool.Manager, the caller may specify a grpc client connection builder implementation to be used to create connection to Hubble peers. This allows for fine grained control over the client connection parameters by the caller. In addition, as this is an interface, it can easily be mocked which is useful when writing unit tests. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 5be8aca ] With the rework of peer connections handling, a goroutine periodically checks whether peer connections are healthy or not. This check is necessary as there are scenarios where a connection can be lost without any peer service notification or client request (for instance, a node reboot). This commit improves reconnection attempts by implementing a backoff mechanism so attempts are not too frequent. There is a default backoff mechanism that can be configured by caller when instantiating a new pool.Manager. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 3120dc1 ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 24f65cc ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 35471ca ] On peer change notifications or call to `pool.Manager.ReportOffline(name string)`, the backoff interval should be ignored to re-attempt a connection. This change typically addresses the following scenario: 1. Node Foo is offline 2. The connectivity checker attempts to reconnect several time; each time the backoff is increased. 3. Node Foo becomes available; a peer change notification is received 4. Oops, no connection attempt is made to Foo because of a large backoff duration Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 08e7f3d ] This is not necessary as the connect function closes pre-existing connections. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 2064bdc ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit bee30fa ] This is necessary to avoid import cycles when implementing fakes for the peer interfaces in the `testutils` package. Note that the `types` sub-package is commonly found throughout Cilium's codebase so this change follows the convention. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit df12d9e ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 304bced ] This will allow mocking gRPC clients which is useful for unit testing. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 7b42b5f ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 065ec3d ] These unit tests are not exhaustive although they still provide a decent coverage. However, it provides a good structure to implement more tests in the future such as complex peer connectivity issue scenarios or simply regression tests for bugs that are fixed. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 7b7845a ] This commit removes `Target()` from the peer `ClientBuilder` interface and, instead, adds the target as a parameter to `Client()`. As a consequence, Hubble Relay code is adapted to the change. The peer pool manager now has a new option to provide the address of the peer gRPC service. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 029ca1f ] The debug option was only used to set the log level of the logger. As the caller now has control over the logger to be used, this option is no longer necessary. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 4297a79 ] This change makes tests more reliable and is cleaner at the same time as once the function returns, the pool manager is really shut down. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 4d557d9 ] This commit tests the pool manager logger output for certain tests. This change revealed some flakiness in the tests that have been addressed. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 6404664 ] The peers map is expected to be more frequently read than written to, especially with the `List()` method being exposed and used by Hubble Relay for every request it handles. Therefore, a RWLock is more approriate than a regular lock. Suggested-by: André Martins <andre@cilium.io> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 9fdcca2 ] The name `pool` clashes with the package `pool`. It is not a problem per se but creates a bit of confusion so rename it to `pm`. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 37af0ea ] The `access-log` option was removed in 1.8, so move it from the "Deprecated options" section to the correct "Removed options" section in the upgrade guide. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 794c34b ] Commit 19ad311 incorrectly changed the return value to nil instead of returning an error. Fixes: 19ad311 ("kvstore: Fix Watch() to return when client is closed or context is cancelled") Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit bf8e432 ] The firstSession channel has been used to allow blocking until the initial session has been established. However, because callers have not been handling errors, it was not possible to close the channel if the etcd client was shut down before the session was ever established and thus the channel was leaked. This was even more problematic for etcd operations performed without a context with a timeout as most operations currently block on firstSession or context. Consolidate the errChan and firstSession as they effectively served the same purpose. The only slight change is that the firstSession channel is closed *after* the etcd version has been verified which is the right behavior anyway. Having the error condition returned via the firstSession channel allows to return it to owning controllers as well when renewing sessions for better visibility. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit e959c9f ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 589bdf3 ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 02a1810 ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 0262854 ] The context passed into NewSession() was supposed to enforce a timeout on the NewSession() operation which is triggering a Grant() and KeepAlive() instruction. However, the context passed into NewSession() will also be associated with the resulting lease. As per etcd documentation, this will result in: > If the context is canceled before Close() completes, the session's lease will > be abandoned and left to expire instead of being revoked. Because of this, any session renewal triggering a new session would create a session that is immediately closed again due to the context passed into NewSession being cancelled when the controller run ends successfully. This resulted in any renewed session to have an effective lifetime of 10 milliseconds before requiring renewal again. Fixes: #12619 Fixes: 8524fca ("kvstore: Add session renew backoff") Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit fff58ef ] Follow-up to #12305, where we raised the timeout from 100ms to 500ms. This seemed to suppress most of the flakes reported in #12042, but we saw one again recently: Try restoring the timeout value to its original value of 1 second. Most of the time the RTT time for the exchange is way below 100ms anyway and we won't have a difference on tests duration. In the worst and very unlikely case where all DNS TCP exchanges are super-slow, we only have 5 exchanges in the tests and cannot spend more than a total 5 seconds on them (or one would timeout and the test fail). Fixes: #12042 Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>

qmonnet

#12606 -- fqdn/dnsproxy/proxy_test: increase again timeout for DNS TCP exchanges (@qmonnet)

Looks good 👍

tklauser

👍 for my changes

rolinh

Looks good for my changes as well 👍

brb

LGTM for my changes.

As per the discussions from #12493, we should not increment the v1.8 schema version to 1.22 as this would conflict with early currently v1.9 cycle. Fix it to 1.21.1 instead. Fixes: 336791f ("Fix small CRD issue with toGroups") Signed-off-by: André Martins <andre@cilium.io>

pchaigno · 2020-07-23T15:16:18Z

test-backport-1.8

aanm · 2020-07-23T20:17:01Z

test-upstream-k8s

pchaigno and others added 30 commits July 23, 2020 16:13

docs: Remove mention of state transfer for DSR

05ad687

[ upstream commit b9496e1 ] 7796070 removed bpf_lb.o and any LB state transfers across nodes for DSR. Fixes: 7796070 ("bpf: Remove bpf_lb.c") Reported-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Paul Chaignon <paul@cilium.io>

hubble/relay: use peer.ClientBuilder to build Peer service clients

beb4d75

[ upstream commit 5168499 ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

hubble/relay: make the connection check interval configurable

7e4adfb

[ upstream commit 3120dc1 ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

hubble/relay: report offline peers to the pool manager

987f917

[ upstream commit 24f65cc ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

hubble/relay: do not call disconnect before connection attempt

5ef269b

[ upstream commit 08e7f3d ] This is not necessary as the connect function closes pre-existing connections. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

hubble: move FakePeerNotifyServer to testutils

dd98303

[ upstream commit 2064bdc ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

hubble/testutils: implement fakers for peer/types/{Client,ClientBuilder}

ae574e2

[ upstream commit df12d9e ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

hubble/testutils: add FakeGRPCClientStream

a2d4976

[ upstream commit 304bced ] This will allow mocking gRPC clients which is useful for unit testing. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

hubble/testutils: implement FakePeerNotifyClient

6442018

[ upstream commit 7b42b5f ] Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

tgraf and others added 7 commits July 23, 2020 16:15

etcd: Print info message when connection is established

3c5170b

[ upstream commit e959c9f ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

etcd: Report reason when lock acquisition in status check fails

859903a

[ upstream commit 589bdf3 ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

etcd: Lock e.session when renewing session

a203274

[ upstream commit 02a1810 ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>

pchaigno requested a review from a team as a code owner July 23, 2020 14:18

pchaigno added backport/1.8 kind/backports This PR provides functionality previously merged into master. labels Jul 23, 2020

pchaigno requested review from brb, qmonnet, rolinh, tgraf and tklauser July 23, 2020 14:18

pchaigno added the priority/release-blocker label Jul 23, 2020

qmonnet approved these changes Jul 23, 2020

View reviewed changes

tklauser approved these changes Jul 23, 2020

View reviewed changes

rolinh approved these changes Jul 23, 2020

View reviewed changes

brb approved these changes Jul 23, 2020

View reviewed changes

christarazi self-assigned this Jul 23, 2020

pchaigno removed the request for review from tgraf July 23, 2020 21:09

maintainer-s-little-helper Bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 23, 2020

christarazi merged commit 2c595e3 into v1.8 Jul 23, 2020

christarazi deleted the pr/v1.8-backport-2020-07-23 branch July 23, 2020 21:32

pchaigno added release-blocker/1.8 and removed priority/release-blocker labels Jan 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.8 backports 2020-07-23#12627

v1.8 backports 2020-07-23#12627
christarazi merged 42 commits intov1.8from
pr/v1.8-backport-2020-07-23

pchaigno commented Jul 23, 2020 •

edited by aanm

Loading

Uh oh!

qmonnet left a comment

Uh oh!

tklauser left a comment

Uh oh!

rolinh left a comment

Uh oh!

brb left a comment

Uh oh!

pchaigno commented Jul 23, 2020

Uh oh!

aanm commented Jul 23, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Conversation

pchaigno commented Jul 23, 2020 • edited by aanm Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

qmonnet left a comment

Choose a reason for hiding this comment

Uh oh!

tklauser left a comment

Choose a reason for hiding this comment

Uh oh!

rolinh left a comment

Choose a reason for hiding this comment

Uh oh!

brb left a comment

Choose a reason for hiding this comment

Uh oh!

pchaigno commented Jul 23, 2020

Uh oh!

aanm commented Jul 23, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

pchaigno commented Jul 23, 2020 •

edited by aanm

Loading