Skip to content

test/K8sServices: Fix externalTrafficPolicy=Local with kube-proxy (on GKE)#12709

Merged
gandro merged 3 commits intomasterfrom
pr/gandro/ci-fix-externaltrafficpolicy-local-tests-on-gke
Jul 31, 2020
Merged

test/K8sServices: Fix externalTrafficPolicy=Local with kube-proxy (on GKE)#12709
gandro merged 3 commits intomasterfrom
pr/gandro/ci-fix-externaltrafficpolicy-local-tests-on-gke

Conversation

@gandro
Copy link
Copy Markdown
Member

@gandro gandro commented Jul 29, 2020

This commit fixes the test harness for externalTrafficPolicy=Local in
the case where we are accessing a node IP without a local backend from a
node with host reachable services enabled.

This is a known incompatibility between our kube-proxy replacement and
upstream kube-proxy. The existing test harness assumed that we only need
to handle this case if we are running without kube-proxy. This
assumption however is wrong, as we are running these tests in hybrid
mode, where we are running with both kube-proxy and Cilium's kube-proxy
replacement.

This has not been hit in our existing test suites up until recently,
because we did not have a test setup with both kube-proxy and the
kube-proxy replacement enabled at the same time. As GKE has been
upgraded to Linux 4.19, it is now using the described hybrid setup which has
caused the tests to break.

The test matrix as of writing now looks as follows:

Test Suite kube-proxy kube-proxy replacement
K8s-1.18-kernel-4.9 Yes No (Kernel 4.9.x)
K8s-1.17-Kernel-4.19 No Yes (Kernel 4.19.57)
K8s-1.12-Kernel-netnext No Yes (Kernel 5.8.0-rc1+)
Cilium-PR-Ginkgo-Tests-k8s Yes No (Kernel 4.9.x)
Cilium-PR-K8s-GKE Yes Yes (Kernel 4.19.112+)

Fixes: 67f85e3 ("tests: enable additional externalTrafficPolicy=Local tests")

@gandro gandro added release-note/ci This PR makes changes to the CI. needs-backport/1.8 labels Jul 29, 2020
@gandro gandro requested review from brb and nebril July 29, 2020 15:51
@gandro gandro requested review from a team as code owners July 29, 2020 15:51
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 29, 2020

test-me-please

@coveralls
Copy link
Copy Markdown

coveralls commented Jul 29, 2020

Coverage Status

Coverage decreased (-0.009%) to 36.634% when pulling bee0976 on pr/gandro/ci-fix-externaltrafficpolicy-local-tests-on-gke into ea53587 on master.

@nebril
Copy link
Copy Markdown
Member

nebril commented Jul 29, 2020

gke seems to be failing due to unrelated cause, will restart

@nebril
Copy link
Copy Markdown
Member

nebril commented Jul 29, 2020

test-gke

2 similar comments
@nebril
Copy link
Copy Markdown
Member

nebril commented Jul 30, 2020

test-gke

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 30, 2020

test-gke

Comment thread test/k8sT/Services.go
// kube-proxy free case since we'll hit the wildcard rule in bpf_sock
// In-cluster connectivity from k8s2 to k8s1 IP will still work with
// HostRechableServices (regardless of if we are running with or
// without kube-proxy) since we'll hit the wildcard rule in bpf_sock
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you extend the comment saying why it won't work on kube-proxy?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the following comment:

// This is a known incompatibility with kube-proxy:
// kube-proxy 1.15+ will only load-balance requests from k8s1 to k8s1,
// but not from k8s2 to k8s1. In the k8s2 to k8s1 case, kube-proxy
// would send traffic to k8s1, where it would be subsequently
// dropped, because k8s1 has no service backend.
// However, if HostReachableServices is enabled, then Cilium does
// the service translation already on the client node, bypassing
// kube-proxy completely.

@gandro gandro force-pushed the pr/gandro/ci-fix-externaltrafficpolicy-local-tests-on-gke branch from f9dc804 to 6c7fed3 Compare July 30, 2020 11:06
Copy link
Copy Markdown
Member

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 30, 2020

test-me-please

@maintainer-s-little-helper maintainer-s-little-helper Bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 30, 2020
@nebril
Copy link
Copy Markdown
Member

nebril commented Jul 30, 2020

test-gke

@gandro gandro changed the title test/K8sServices: Fix externalTrafficPolicy=Local with kube-proxy test/K8sServices: Fix externalTrafficPolicy=Local with kube-proxy (on GKE) Jul 30, 2020
@nebril
Copy link
Copy Markdown
Member

nebril commented Jul 30, 2020

test-me-please

`--validate` helm flag introduced in 82cc7c3 caused ci to fail in gke
where we enable node init daemonset. It needs to be cleared before
cilium installation.

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
@gandro gandro removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 30, 2020
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/ci-fix-externaltrafficpolicy-local-tests-on-gke branch from 6c7fed3 to 9575367 Compare July 30, 2020 19:32
@gandro gandro changed the base branch from master to pr/delete-node-init-ci July 30, 2020 19:32
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 30, 2020

Rebased on #12725 - as GKE can not pass otherwise.

Naturally blocked on #12725 getting merged first.

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 30, 2020

test-me-please

Base automatically changed from pr/delete-node-init-ci to master July 30, 2020 20:44
@nebril
Copy link
Copy Markdown
Member

nebril commented Jul 30, 2020

#12725 is merged, please rebase

ninja edit - didn't refresh the pr page

@maintainer-s-little-helper maintainer-s-little-helper Bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 30, 2020
@nebril
Copy link
Copy Markdown
Member

nebril commented Jul 30, 2020

test-gke

@gandro gandro removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 31, 2020
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 31, 2020

Looks like GetCiliumPodOnNode returned an empty string on GKE?

Edit: That function probably doesn't work on GKE anways. Switched to GetCiliumPodOnNodeWithLabel

This commit fixes the test harness for `externalTrafficPolicy=Local` in
the case where we are accessing a node IP without a local backend from a
node with host reachable services enabled.

This is a known incompatibility between our kube-proxy replacement and
upstream kube-proxy. The existing test harness assumed that we only need
to handle this case if we are running without kube-proxy. This
assumption however is wrong, as we are running these tests in hybrid
mode, where we are running with both kube-proxy and Cilium's kube-proxy
replacement.

This has not been hit in our existing test suites up until recently,
because we did not have a test setup with both kube-proxy and the
kube-proxy replacement enabled at the same time. As GKE has been
upgraded to Linux 4.19, it is now using the described setup which has
caused the tests to break.

The test matrix as of writing now looks as follows:

| Test Suite                 | kube-proxy | kube-proxy replacement  |
| -------------------------- | ---------- | ----------------------- |
| K8s-1.18-kernel-4.9        | Yes        | No  (Kernel 4.9.x)      |
| K8s-1.17-Kernel-4.19       | No         | Yes (Kernel 4.19.57)    |
| K8s-1.12-Kernel-netnext    | No         | Yes (Kernel 5.8.0-rc1+) |
| Cilium-PR-Ginkgo-Tests-k8s | Yes        | No  (Kernel 4.9.x)      |
| Cilium-PR-K8s-GKE          | Yes        | Yes (Kernel 4.19.112+)  |

Fixes: 67f85e3 ("tests: enable additional externalTrafficPolicy=Local tests")

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/ci-fix-externaltrafficpolicy-local-tests-on-gke branch from 9575367 to bee0976 Compare July 31, 2020 07:25
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 31, 2020

test-me-please

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 31, 2020

GKE hit Suite-k8s-1.15.K8sChaosTest Connectivity demo application Endpoint can still connect while Cilium is not running which seems unrelated.

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 31, 2020

test-gke

@maintainer-s-little-helper maintainer-s-little-helper Bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 31, 2020
@gandro gandro removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 31, 2020
@maintainer-s-little-helper maintainer-s-little-helper Bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 31, 2020
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 31, 2020

Again hit Suite-k8s-1.15.K8sChaosTest Connectivity demo application Endpoint can still connect while Cilium is not running (on GKE)

@gandro gandro removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 31, 2020
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jul 31, 2020

I'm merging this. It seems like it did fix the externalTrafficPolicy=Local issue on GKE (it passed two times) as well as all other CI. The GKE failure in the Chaos test is unrelated.

@gandro gandro merged commit ac9732a into master Jul 31, 2020
@gandro gandro deleted the pr/gandro/ci-fix-externaltrafficpolicy-local-tests-on-gke branch July 31, 2020 11:03
@maintainer-s-little-helper maintainer-s-little-helper Bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants