Skip to content

operator: Wait for CRDs before running informers#10899

Merged
joestringer merged 4 commits intomasterfrom
operator-crd-fix
Jun 17, 2020
Merged

operator: Wait for CRDs before running informers#10899
joestringer merged 4 commits intomasterfrom
operator-crd-fix

Conversation

@vadorovsky
Copy link
Copy Markdown
Member

@vadorovsky vadorovsky commented Apr 8, 2020

Running CRD informers was sometimes resulting in errors when
cilium-operator was launched before cilium-agent created CRDs. This
change adds the waitForCRD() function which waits for the given CRD with
the timeout.

Fixes: #10814

Signed-off-by: Michal Rostecki mrostecki@opensuse.org

cilium-operator: Wait for CRDs before running Informers

@maintainer-s-little-helper
Copy link
Copy Markdown

Please set the appropriate release note label.

@vadorovsky vadorovsky added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Apr 8, 2020
@vadorovsky vadorovsky added area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. area/operator Impacts the cilium-operator component labels Apr 8, 2020
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 8, 2020

Coverage Status

Coverage increased (+0.06%) to 37.119% when pulling 799ee91 on operator-crd-fix into 97e3fc4 on master.

@vadorovsky vadorovsky force-pushed the operator-crd-fix branch 2 times, most recently from aea3b30 to 0416fac Compare April 17, 2020 19:21
@vadorovsky vadorovsky marked this pull request as ready for review April 17, 2020 19:23
@vadorovsky vadorovsky requested a review from a team as a code owner April 17, 2020 19:23
@vadorovsky
Copy link
Copy Markdown
Member Author

test-me-please

@aanm aanm requested a review from a team April 20, 2020 17:32
Comment thread operator/crd.go Outdated
Comment thread operator/crd.go Outdated
@vadorovsky
Copy link
Copy Markdown
Member Author

test-me-please

@vadorovsky vadorovsky marked this pull request as draft April 23, 2020 22:54
@vadorovsky vadorovsky force-pushed the operator-crd-fix branch 3 times, most recently from 2482f2a to 85112f5 Compare May 12, 2020 17:52
@vadorovsky vadorovsky marked this pull request as ready for review May 12, 2020 18:45
@vadorovsky vadorovsky requested a review from a team as a code owner May 12, 2020 18:45
@vadorovsky vadorovsky requested a review from a team May 12, 2020 18:45
@vadorovsky
Copy link
Copy Markdown
Member Author

test-me-please

@aanm
Copy link
Copy Markdown
Member

aanm commented May 12, 2020

please wait on #11083 to be merged first.

@christarazi christarazi added dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. and removed conflicts-with-pr labels May 14, 2020
Copy link
Copy Markdown
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good, aside for a few minor comments

Comment thread operator/main.go
Comment thread operator/crd.go Outdated
Comment thread operator/crd.go Outdated
Comment thread operator/crd.go Outdated
@vadorovsky
Copy link
Copy Markdown
Member Author

test-me-please

@vadorovsky
Copy link
Copy Markdown
Member Author

vadorovsky commented Jun 16, 2020

Doesn't look like a flake observed anywhere else:

https://jenkins.cilium.io/job/Cilium-PR-K8s-oldest-net-next/803/consoleFull

16:35:47  • Failure [78.397 seconds]
16:35:47  K8sServicesTest
16:35:47  /home/jenkins/workspace/Cilium-PR-K8s-oldest-net-next/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:476
16:35:47    Checks service across nodes
16:35:47    /home/jenkins/workspace/Cilium-PR-K8s-oldest-net-next/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:476
16:35:47      Tests NodePort BPF
16:35:47      /home/jenkins/workspace/Cilium-PR-K8s-oldest-net-next/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:476
16:35:47        Tests with TC, direct routing and Hybrid [It]
16:35:47        /home/jenkins/workspace/Cilium-PR-K8s-oldest-net-next/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:512
16:35:47  
16:35:47        Request from k8s1 to service tftp://10.98.65.141:10069/hello failed
[2020-06-15T14:35:47.212Z]       Expected command: kubectl exec -n kube-system log-gatherer-ff75s -- /bin/bash -c 'fails=""; id=$RANDOM; for i in $(seq 1 10); do if curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 tftp://10.98.65.141:10069/hello -H "User-Agent: cilium-test-$id/$i"; then echo "Test round $id/$i exit code: $?"; else fails=$fails:$id/$i=$?; fi; done; if [ -n "$fails" ]; then echo "failed: $fails"; fi; cnt="${fails//[^:]}"; if [ ${#cnt} -gt 0 ]; then exit 42; fi' 
[2020-06-15T14:35:47.212Z]       To succeed, but it failed:
[2020-06-15T14:35:47.212Z]       Exitcode: 42 
[2020-06-15T14:35:47.212Z]       Stdout:
[2020-06-15T14:35:47.212Z]        	 
[2020-06-15T14:35:47.212Z]       	 Hostname: testds-lc54h
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Request Information:
[2020-06-15T14:35:47.212Z]       	 	client_address=192.168.36.11
[2020-06-15T14:35:47.212Z]       	 	client_port=60970
[2020-06-15T14:35:47.212Z]       	 	real path=/hello
[2020-06-15T14:35:47.212Z]       	 	request_scheme=tftp
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Test round 8698/1 exit code: 0
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Hostname: testds-sbckv
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Request Information:
[2020-06-15T14:35:47.212Z]       	 	client_address=10.0.0.232
[2020-06-15T14:35:47.212Z]       	 	client_port=60917
[2020-06-15T14:35:47.212Z]       	 	real path=/hello
[2020-06-15T14:35:47.212Z]       	 	request_scheme=tftp
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Test round 8698/2 exit code: 0
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Hostname: testds-lc54h
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Request Information:
[2020-06-15T14:35:47.212Z]       	 	client_address=192.168.36.11
[2020-06-15T14:35:47.212Z]       	 	client_port=40519
[2020-06-15T14:35:47.212Z]       	 	real path=/hello
[2020-06-15T14:35:47.212Z]       	 	request_scheme=tftp
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Test round 8698/3 exit code: 0
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Hostname: testds-lc54h
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Request Information:
[2020-06-15T14:35:47.212Z]       	 	client_address=192.168.36.11
[2020-06-15T14:35:47.212Z]       	 	client_port=44661
[2020-06-15T14:35:47.212Z]       	 	real path=/hello
[2020-06-15T14:35:47.212Z]       	 	request_scheme=tftp
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Test round 8698/4 exit code: 0
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Hostname: testds-lc54h
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Request Information:
[2020-06-15T14:35:47.212Z]       	 	client_address=192.168.36.11
[2020-06-15T14:35:47.212Z]       	 	client_port=44781
[2020-06-15T14:35:47.212Z]       	 	real path=/hello
[2020-06-15T14:35:47.212Z]       	 	request_scheme=tftp
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Test round 8698/5 exit code: 0
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Hostname: testds-lc54h
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Request Information:
[2020-06-15T14:35:47.212Z]       	 	client_address=192.168.36.11
[2020-06-15T14:35:47.212Z]       	 	client_port=36642
[2020-06-15T14:35:47.212Z]       	 	real path=/hello
[2020-06-15T14:35:47.212Z]       	 	request_scheme=tftp
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Test round 8698/6 exit code: 0
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Hostname: testds-sbckv
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Request Information:
[2020-06-15T14:35:47.212Z]       	 	client_address=10.0.0.232
[2020-06-15T14:35:47.212Z]       	 	client_port=45078
[2020-06-15T14:35:47.212Z]       	 	real path=/hello
[2020-06-15T14:35:47.212Z]       	 	request_scheme=tftp
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Test round 8698/7 exit code: 0
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Hostname: testds-sbckv
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Request Information:
[2020-06-15T14:35:47.212Z]       	 	client_address=10.0.0.232
[2020-06-15T14:35:47.212Z]       	 	client_port=57279
[2020-06-15T14:35:47.212Z]       	 	real path=/hello
[2020-06-15T14:35:47.212Z]       	 	request_scheme=tftp
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Test round 8698/8 exit code: 0
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Hostname: testds-sbckv
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Request Information:
[2020-06-15T14:35:47.212Z]       	 	client_address=10.0.0.232
[2020-06-15T14:35:47.212Z]       	 	client_port=43503
[2020-06-15T14:35:47.212Z]       	 	real path=/hello
[2020-06-15T14:35:47.212Z]       	 	request_scheme=tftp
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       	 Test round 8698/9 exit code: 0
[2020-06-15T14:35:47.212Z]       	 failed: :8698/10=28
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       Stderr:
[2020-06-15T14:35:47.212Z]        	 command terminated with exit code 42
[2020-06-15T14:35:47.212Z]       	 
[2020-06-15T14:35:47.212Z]       
16:35:47  

Copy link
Copy Markdown
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@aanm
Copy link
Copy Markdown
Member

aanm commented Jun 17, 2020

hit #8945

@aanm
Copy link
Copy Markdown
Member

aanm commented Jun 17, 2020

test-net-next --focus="K8sServicesTest"

@aanm aanm added release-note/bug This PR fixes an issue in a previous release of Cilium. and removed release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Jun 17, 2020
@aanm
Copy link
Copy Markdown
Member

aanm commented Jun 17, 2020

retest-net-next

@vadorovsky
Copy link
Copy Markdown
Member Author

It's green now, thanks a lot!

Copy link
Copy Markdown
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done!

Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Policy / helm changes LGTM.

Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the core change here looks right; there's definitely some points for follow up noted below.

EDIT: Given that my unit-test concern is mitigated, these are more "nice-to-haves".

Comment thread operator/crd.go Outdated
Comment thread operator/crd_test.go Outdated
Comment thread operator/crd.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup: would be nice to be able to configure this in case it lines up poorly with any other timeouts.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do in a separate PR.

@joestringer
Copy link
Copy Markdown
Member

All tests passed, we requested @mrostecki via Slack to just bump the timeout to 5 minutes. Plan is to merge this imminently to make it for the 1.8.0 train 🚅

Running CRD informers was sometimes resulting in errors when
cilium-operator was launched before cilium-agent created CRDs. This
change adds the waitForCRD() function which waits for the given CRD with
the timeout.

Fixes: #10814

Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
@joestringer
Copy link
Copy Markdown
Member

All basic sanity checks passed, along with prior full CI pass this is good to go. Thanks @mrostecki !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. area/operator Impacts the cilium-operator component release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Errors while starting operator

7 participants