v1.7 backports 2020-06-04 #11906

Merged
joestringer merged 4 commits into v1.7 from pr/v1.7-backport-2020-06-04 on Jun 10, 2020

@christarazi
Member

@christarazi christarazi commented Jun 4, 2020

Skipped due to non-trivial conflicts:

Skipped as it depends on #11766:
* #11804 -- fix(datarace): Fix possible nil pointer dereference (@sayboras)
The above PR doesn't need to be backported, see here. Removing.

Skipped as it will be handled by original author:

Once this PR is merged, you can update the PR labels via:

$ for pr in 11858 11879 11863 11762; do contrib/backporting/set-labels.py $pr done 1.7; done

@christarazi christarazi requested a review from a team as a code owner June 4, 2020 23:51
@christarazi christarazi added backport/1.7 kind/backports This PR provides functionality previously merged into master. labels Jun 4, 2020
@christarazi
Member Author

test-backport-1.7

@christarazi christarazi force-pushed the pr/v1.7-backport-2020-06-04 branch from dbcb14a to 9f81d41 Compare June 5, 2020 01:20
@christarazi
Member Author

test-backport-1.7

@christarazi christarazi force-pushed the pr/v1.7-backport-2020-06-04 branch from 9f81d41 to bcc2d65 Compare June 5, 2020 01:36
@christarazi
Member Author

test-backport-1.7

@christarazi christarazi force-pushed the pr/v1.7-backport-2020-06-04 branch from bcc2d65 to c4fe341 Compare June 5, 2020 01:48
@christarazi
Member Author

test-backport-1.7

@christarazi
Member Author

test-backport-1.7

@christarazi
Member Author

christarazi commented Jun 5, 2020

test-missed-k8s

Edit: timed out

@christarazi
Member Author

christarazi commented Jun 5, 2020

restart-ginkgo

Edit: timed out

Member

@nebril nebril left a comment

LGTM for my changes!

@christarazi
Member Author

restart-ginkgo

@christarazi
Member Author

test-missed-k8s

@christarazi
Member Author

restart-ginkgo

@christarazi
Member Author

test-missed-k8s

[ upstream commit 03602e3 ]

Due to a bug in Jenkins, nesting a timeout in a retry block causes the
build to abort. Work around it by using a shell-based timeout.

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi
Member Author

christarazi commented Jun 7, 2020

The ManagedEtcd tests were failing legitimately: because the anti-affinity is set as a Helm global, the anti-affinity changes left the cilium-etcd-operator stuck pending scheduling. Any chart that references Values.global.affinity, such as cilium-etcd-operator's deployment YAML, unintentionally picked up the anti-affinity rules. (Certain tests provision cilium-etcd-operator with global.affinity set, which is why that reference existed in its deployment YAML to begin with.)

With these rules applied, either the Cilium DaemonSet or the cilium-etcd-operator deployment was scheduled on a node, depending on which won the race to be deployed there. The other then got stuck pending, ultimately causing the test to time out. Working on a fix.

master was not failing, thanks to PR #11544.

Update: fix is in commit 1c6bf9f
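
For illustration only, here is a toy model of the scheduling race described in this comment. It is a minimal sketch that assumes a single node and a hypothetical shared selector `k8s-app: cilium`; the `Pod`, `can_schedule` and `schedule` helpers are made up for this example and are not taken from the Cilium charts or the Kubernetes scheduler. It only shows how a required anti-affinity rule carried by both workloads can leave whichever one arrives second pending:

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    labels: dict
    # Label selector this pod refuses to be co-located with (hypothetical shape).
    anti_affinity: dict = field(default_factory=dict)


def matches(selector: dict, labels: dict) -> bool:
    """True if a non-empty selector is fully satisfied by the labels."""
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def can_schedule(incoming: Pod, node_pods: list) -> bool:
    """The toy checks the anti-affinity rule in both directions."""
    for existing in node_pods:
        if matches(incoming.anti_affinity, existing.labels):
            return False
        if matches(existing.anti_affinity, incoming.labels):
            return False
    return True


def schedule(order: list) -> tuple:
    """Place pods on one node in the given order; return (running, pending) names."""
    running, pending = [], []
    for pod in order:
        if can_schedule(pod, running):
            running.append(pod)
        else:
            pending.append(pod)
    return [p.name for p in running], [p.name for p in pending]


# Assumption: every chart that reads Values.global.affinity ends up with this rule.
GLOBAL_RULE = {"k8s-app": "cilium"}
cilium = Pod("cilium", {"k8s-app": "cilium"}, GLOBAL_RULE)
operator = Pod("cilium-etcd-operator", {"name": "cilium-etcd-operator"}, GLOBAL_RULE)

print(schedule([cilium, operator]))   # cilium wins the race
print(schedule([operator, cilium]))   # the operator wins the race
```

In both orderings exactly one pod runs and the other stays pending, which matches the symptom of the test timing out.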

@christarazi christarazi force-pushed the pr/v1.7-backport-2020-06-04 branch from e3de12f to ed1b51b Compare June 7, 2020 22:42
@christarazi
Member Author

test-backport-1.7

@christarazi
Member Author

restart-ginkgo

@christarazi
Member Author

test-focus K8sDatapathConfig.Encapsulation Check connectivity.

@christarazi
Member Author

christarazi commented Jun 9, 2020

test-focus K8sDatapathConfig.Encapsulation.(Check connectivity with sockops|Check connectivity with VXLAN|Check connectivity with Geneve)

Edit: quay hit 502: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated-Focus/248/console

@christarazi
Member Author

christarazi commented Jun 9, 2020

test-focus K8sDatapathConfig.Encapsulation.(Check connectivity with sockops|Check connectivity with VXLAN|Check connectivity with Geneve)

Edit: quay might be down :( ...

@christarazi
Member Author

christarazi commented Jun 9, 2020

test-focus K8sDatapathConfig.Encapsulation.(Check connectivity with sockops|Check connectivity with VXLAN|Check connectivity with Geneve)

Edit: still failing on the above tests: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated-Focus/250/

The same tests have passed locally 35 times in a row...not sure what's going on.

Member

@gandro gandro left a comment

#11863 looks fine. We also need #11952 eventually as a follow-up before we cut a 1.7 release.

@christarazi
Member Author

Temporarily reverting #11863 to see if it causes the encapsulation tests to fail (shot in the dark).

@christarazi
Member Author

christarazi commented Jun 9, 2020

test-focus K8sDatapathConfig.Encapsulation.(Check connectivity with sockops|Check connectivity with VXLAN|Check connectivity with Geneve)

Edit: looks like it has failed again (1.11 netnext), but passes for 1.17
Edit2: on second thought, it may be that the CI trigger phrases for backport PRs run the wrong tests. Trying the whole suite now...

@christarazi christarazi force-pushed the pr/v1.7-backport-2020-06-04 branch from 46460ae to 27d4d4b Compare June 9, 2020 17:48
@christarazi
Member Author

christarazi commented Jun 9, 2020

test-backport-1.7

Edit: trying without net-next label

@christarazi
Member Author

test-backport-1.7

@errordeveloper
Contributor

From the last Cilium-Ginkgo-Tests run:

02:47:15  • Failure [124.782 seconds]
02:47:15  K8sDatapathConfig
02:47:15  /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
02:47:15    Encapsulation
02:47:15    /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
02:47:15      Check connectivity with sockops and VXLAN encapsulation [It]
02:47:15      /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:430
02:47:15  
02:47:15      Did not find expected number of entries in BPF tunnel map
[2020-06-10T01:47:15.531Z]     Expected
[2020-06-10T01:47:15.531Z]         <int>: 5
[2020-06-10T01:47:15.531Z]     to equal
[2020-06-10T01:47:15.531Z]         <int>: 3
02:47:15  
02:47:15      /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/k8sT/DatapathConfiguration.go:213
02:49:00  • Failure [104.960 seconds]
02:49:00  K8sDatapathConfig
02:49:00  /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
02:49:00    Encapsulation
02:49:00    /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
02:49:00      Check connectivity with VXLAN encapsulation [It]
02:49:00      /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:430
02:49:00  
02:49:00      Did not find expected number of entries in BPF tunnel map
[2020-06-10T01:49:00.445Z]     Expected
[2020-06-10T01:49:00.445Z]         <int>: 5
[2020-06-10T01:49:00.445Z]     to equal
[2020-06-10T01:49:00.445Z]         <int>: 3
02:49:00  
02:49:00      /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/k8sT/DatapathConfiguration.go:213
02:50:49  • Failure [108.853 seconds]
02:50:49  K8sDatapathConfig
02:50:49  /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
02:50:49    Encapsulation
02:50:49    /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
02:50:49      Check connectivity with Geneve encapsulation [It]
02:50:49      /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:430
02:50:49  
02:50:49      Did not find expected number of entries in BPF tunnel map
[2020-06-10T01:50:49.965Z]     Expected
[2020-06-10T01:50:49.965Z]         <int>: 5
[2020-06-10T01:50:49.965Z]     to equal
[2020-06-10T01:50:49.965Z]         <int>: 3
02:50:49  
02:50:49      /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/k8sT/DatapathConfiguration.go:213
03:35:26  • Failure [156.322 seconds]
03:35:26  K8sServicesTest
03:35:26  /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
03:35:26    Checks service across nodes
03:35:26    /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
03:35:26      Tests NodePort BPF
03:35:26      /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
03:35:26        Tests with direct routing
03:35:26        /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
03:35:26          Tests GH#10983 [It]
03:35:26          /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:430
03:35:26  
03:35:26          Can not connect to service "http://192.168.36.12:31466" from outside cluster
[2020-06-10T02:35:26.119Z]         Expected command: kubectl exec -n kube-system log-gatherer-gbnrk -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 8 --local-port 64002 http://192.168.36.12:31466 -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'" 
[2020-06-10T02:35:26.119Z]         To succeed, but it failed:
[2020-06-10T02:35:26.119Z]         Exitcode: 28 
[2020-06-10T02:35:26.119Z]         Stdout:
[2020-06-10T02:35:26.119Z]          	 time-> DNS: '0.000018()', Connect: '0.000000',Transfer '0.000000', total '5.001164'
[2020-06-10T02:35:26.119Z]         Stderr:
[2020-06-10T02:35:26.119Z]          	 command terminated with exit code 28
[2020-06-10T02:35:26.119Z]         	 
[2020-06-10T02:35:26.119Z]         
03:35:26  
03:35:26          /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/k8sT/Services.go:749
03:39:53  • Failure [252.718 seconds]
03:39:53  K8sServicesTest
03:39:53  /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
03:39:53    Checks service across nodes
03:39:53    /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
03:39:53      Tests NodePort BPF
03:39:53      /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:395
03:39:53        Tests with direct routing and DSR [It]
03:39:53        /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:430
03:39:53  
03:39:53        NAT entry was not evicted
[2020-06-10T02:39:53.155Z]       Expected
[2020-06-10T02:39:53.155Z]           <string>: 
[2020-06-10T02:39:53.155Z]       not to be empty
03:39:53  
03:39:53        /home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/k8sT/Services.go:807

@errordeveloper
Contributor

Closed by accident... apologies.

@errordeveloper
Contributor

test-backport-1.7

@christarazi
Member Author

Ah, finally was able to reproduce locally. The difference between local and CI was that CI runs 3 K8s nodes, while I was only running 2. Looking closer into a fix now.

seanmwinn and others added 3 commits June 10, 2020 11:36
[ upstream commit 110ecb4 ]

Fixes: #11821

Signed-off-by: Sean Winn <sean@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit f7b0378 ]

This fixes an issue with the `HealthCheckNodePort` server where it
would non-deterministically return a non-zero `localEndpoints` count
on nodes which do not have local endpoints.

Because Cilium internally creates a service object per frontend IP, we
end up with multiple services sharing the same name. In the case where
a `LoadBalancer` service has `externalTrafficPolicy=Local` with no
local backends, Cilium will still create a `ClusterIP` sibling service
which retains the non-local backends. In that case, we must take care
to not incorporate the `ClusterIP` backends into the `localEndpoints`
count intended for external traffic. The final count is dependent on
the order in which services are added to the service manager, which
explains why the occurrence of this bug was non-deterministic.

This commit fixes this issue by checking that the service may only
contain local backends before its count is added to the
`HealthCheckNodePort` server.

Fixes: #11043

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
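
To illustrate the order dependence described in the commit message above, here is a minimal sketch in Python (it is not the Go code from the fix). The `Service` class, its `only_local_backends` flag and the `health_counts` function are hypothetical stand-ins, and the backend addresses are made up; the real service manager is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    kind: str                  # "LoadBalancer" or "ClusterIP"
    only_local_backends: bool  # True when externalTrafficPolicy=Local applies
    backends: list = field(default_factory=list)


def health_counts(services, skip_non_local):
    """Return the per-name endpoint count after upserting services in order."""
    counts = {}
    for svc in services:
        if skip_non_local and not svc.only_local_backends:
            continue  # the fix: ignore siblings that hold non-local backends
        counts[svc.name] = len(svc.backends)  # last writer wins per name
    return counts


# A node without local backends: the LoadBalancer frontend has none,
# while its ClusterIP sibling keeps the two remote ones.
lb = Service("web", "LoadBalancer", True, [])
cip = Service("web", "ClusterIP", False, ["10.0.1.5", "10.0.2.7"])

print(health_counts([lb, cip], skip_non_local=False))  # count depends on order
print(health_counts([cip, lb], skip_non_local=False))
print(health_counts([lb, cip], skip_non_local=True))   # same result either way
print(health_counts([cip, lb], skip_non_local=True))
```

Without the check, the count reported for `web` is whatever the last-added sibling wrote; with it, only the service restricted to local backends contributes.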
[ upstream commit 7d26df1 ]

This commit is an attempt to add retry logic to Helm operations in the
Kubernetes test suite.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
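
As a rough idea of what retry logic around a flaky operation can look like, here is a minimal sketch in Python. It is not the code from the commit: `with_retries` and `FlakyHelm` are hypothetical names, the failure is simulated, and the attempt count and delay are arbitrary assumptions.

```python
import time


def with_retries(operation, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:
            last_error = err
            if attempt < attempts:
                sleep(delay_seconds)  # wait before the next attempt
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


class FlakyHelm:
    """Hypothetical stand-in for a Helm call that fails a few times first."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def install(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated transient failure")
        return "deployed"


helm = FlakyHelm(failures=2)
print(with_retries(helm.install, attempts=3, delay_seconds=0))
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real waiting.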
@christarazi
Member Author

Synced with @nebril offline and we decided to remove his PR #11830. He'll take over backporting that and ensuring that the 1.7 CI hasn't regressed.

@christarazi christarazi force-pushed the pr/v1.7-backport-2020-06-04 branch from 27d4d4b to ec47f62 Compare June 10, 2020 18:41
@christarazi
Member Author

christarazi commented Jun 10, 2020

test-backport-1.7

EDIT(@joestringer): This didn't seem to trigger for some reason, retrying.

@joestringer
Member

test-backport-1.7

@christarazi
Member Author

Looks like the previous regression is gone and the only failures are known flakes (#10442). This is probably good to go; @joestringer, please double-check.

@joestringer
Member

I agree that these are caused by the known flake. Two tests fail specifically with the symptoms and another two tests fail in the BeforeAll, seemingly because Cilium is already in a state that the testsuite believes is not ready so it cannot proceed with those tests.

Merging.

Labels

kind/backports This PR provides functionality previously merged into master.

7 participants